RAG chunk staging

Markdown Chunk Inspector

Paste cleaned Markdown and inspect how it splits before embedding, retrieval, or agent-context storage.

Heading, token, paragraph, and code/table presets

Token estimates, line ranges, and warnings

Markdown, JSON, and JSONL export

Input

Inspect cleaned Markdown before chunking

Markdown only

Start chunks at Markdown headings, then split long sections conservatively.
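As a rough illustration of the heading-first step, the sketch below starts a new chunk at each ATX heading and records line ranges. It is not the inspector's actual implementation; the function name `split_at_headings` and the record fields are hypothetical, and the follow-up splitting of long sections is left out.

```python
import re

# ATX headings only ("# Title" through "###### Title"); Setext headings are ignored.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_at_headings(markdown: str) -> list[dict]:
    """Start a new chunk at each heading and record 1-based line ranges."""
    chunks: list[dict] = []
    current = None
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Track backtick fences so "# comment" lines inside code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match or current is None:
            current = {
                "heading": match.group(2).strip() if match else None,
                "start_line": number,
                "end_line": number,
                "lines": [],
            }
            chunks.append(current)
        current["lines"].append(line)
        current["end_line"] = number
    return chunks
```

Text that appears before the first heading becomes its own chunk with `heading` set to `None`, which is the kind of record a missing-heading-context check would pick up.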

Use Tool A first for public pages or raw HTML, then paste the cleaned Markdown here before embedding or storing retrieval context.

Output

Chunk inspection report

Paste cleaned Markdown, choose a preset, then inspect heading context, token estimates, code/table boundaries, and JSONL-ready chunk records.

Short answer

Markdown Chunk Inspector checks whether cleaned Markdown is ready for RAG.

Use Markdown Chunk Inspector after HTML cleanup and before embedding. It reviews chunk boundaries, heading context, token estimates, overlap previews, table and code safety, and export shape so retrieval records are easier to inspect before they enter a vector database or agent context store.

Workflow fit

A review layer between cleaned Markdown and retrieval storage.

Use this when a page is already clean enough to read, but not yet inspected enough to trust as retrieval context.

Why this follows HTML cleanup

Tool A removes page chrome. The chunk inspector checks whether the cleaned Markdown still splits into retrieval-friendly sections.

What the warnings catch

The first pass flags missing heading context, tiny chunks, oversized chunks, and chunks that contain Markdown tables or fenced code. It also shows overlap previews.
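Checks of this kind can be approximated in a few lines. The sketch below is an assumption about how such a first pass might look, not the tool's own logic: the warning names, the `chunk_warnings` function, and the 50 and 800 token thresholds are all hypothetical, and the table test is a crude "line starts with a pipe" heuristic.

```python
def chunk_warnings(
    text: str,
    heading_path: list[str],
    est_tokens: int,
    min_tokens: int = 50,   # hypothetical "tiny chunk" threshold
    max_tokens: int = 800,  # hypothetical "oversized chunk" threshold
) -> list[str]:
    """Return warning labels for one chunk, in a fixed order."""
    warnings = []
    if not heading_path:
        warnings.append("missing_heading_context")
    if est_tokens < min_tokens:
        warnings.append("tiny_chunk")
    if est_tokens > max_tokens:
        warnings.append("oversized_chunk")
    lines = text.splitlines()
    if any(line.lstrip().startswith("```") for line in lines):
        warnings.append("contains_fenced_code")
    # Rough heuristic: pipe-table rows usually begin with "|".
    if any(line.lstrip().startswith("|") for line in lines):
        warnings.append("contains_table")
    return warnings
```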

What exports are for

Use Markdown for human QA, JSON for app pipelines, and JSONL when each chunk should become one retrieval or agent-context record.
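To make the JSONL case concrete, here is a minimal sketch that writes one JSON object per line using Python's standard `json` module. The record fields (`id`, `heading_path`, `start_line`, `end_line`, `est_tokens`, `text`) are illustrative assumptions and may not match the inspector's real export schema.

```python
import json

def to_jsonl(chunks: list[dict]) -> str:
    """Serialize each chunk record as one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "\n".join(json.dumps(chunk, ensure_ascii=False) for chunk in chunks)

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": "chunk-1", "heading_path": ["Guide", "Setup"], "start_line": 1,
     "end_line": 12, "est_tokens": 140, "text": "## Setup\nInstall the tool."},
    {"id": "chunk-2", "heading_path": ["Guide", "Usage"], "start_line": 13,
     "end_line": 30, "est_tokens": 210, "text": "## Usage\nRun the command."},
]
jsonl = to_jsonl(records)
```

Because every line parses on its own, a downstream loader can stream the file and handle one retrieval record at a time.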

| Warning | What it means | Best next fix |
| --- | --- | --- |
| Missing heading context | A chunk may be technically valid but hard to interpret when retrieved alone. | Preserve the nearest useful H2/H3 path or split from a cleaner section boundary. |
| Oversized chunk | A dense section may exceed the target window before the retrieval layer sees it. | Use paragraph-safe or token-window chunking, then check that the split keeps meaning intact. |
| Code or table boundary risk | Splitting inside a table or fenced code block can damage the retrieved answer. | Use the code/table-safe preset and review the exported JSONL record before ingestion. |
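One way to avoid the boundary risk described above is to split only at blank lines that sit outside fenced code. The sketch below shows that idea under stated assumptions: `safe_split_lines` is a hypothetical helper, it recognizes backtick fences only (not `~~~`), and it relies on pipe tables being contiguous non-blank lines. It is not a description of how the code/table-safe preset works internally.

```python
def safe_split_lines(markdown: str) -> list[int]:
    """Return 1-based numbers of blank lines that lie outside fenced code."""
    safe = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            # Opening or closing fence: toggle state, never split here.
            in_fence = not in_fence
        elif not in_fence and line.strip() == "":
            safe.append(number)
    return safe
```

Blank lines inside a fenced block are skipped, and table rows are never blank, so a splitter that cuts only at the returned line numbers keeps both kinds of block whole.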

FAQ

Scope and limits

Does this embed my chunks?

No. It only inspects and exports chunks so you can review them before using a vector database, RAG framework, or agent context store.

Should I use this before or after HTML to Markdown for AI?

Use HTML to Markdown for AI first when the source is a public page or raw HTML. Then paste the cleaned Markdown here to inspect chunk boundaries.

Are the token counts exact?

No. They are deterministic estimates for planning and comparison. Always leave margin for the tokenizer used by your final model or retrieval stack.
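For a sense of what a deterministic estimate with margin can look like, the sketch below uses a characters-per-token heuristic. The ratio of 4 characters per token, the 20% margin, and both function names are assumptions for illustration; the inspector's own estimator is not documented here and may differ.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Deterministic rough estimate: character count divided by an assumed ratio."""
    return math.ceil(len(text) / chars_per_token)

def fits_with_margin(text: str, limit: int, margin: float = 0.2) -> bool:
    """Check the estimate against the limit after reserving a safety margin."""
    return estimate_tokens(text) <= limit * (1 - margin)
```

The same input always yields the same number, which makes estimates comparable across chunks even though a real tokenizer will count differently.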