Preset comparison
Different chunking presets protect different parts of the source.
The right preset depends on what would make a retrieved answer fail: missing headings, oversized chunks, broken tables, split code, or context-free fragments.
Heading-based chunks
Best for: Docs, guides, and knowledge-base pages with useful section headings.
Example output:
Chunk 1: # Widget API
Includes the title, intro, and Authentication section.
Heading path: Widget API > Authentication
Chunk 2: ## Rate limits
Keeps the rate-limit table attached to its heading.
Heading path: Widget API > Rate limits
Chunk 3: ## Example request
Keeps the code sample and Common errors section reviewable.
Watch for: Weak headings produce weak chunks. If a page has no useful headings, this preset may need manual cleanup first.
Token-window chunks
Best for: Long pages where roughly even chunk size matters more than exact document sections.
Example output:
Chunk 1: target 180 tokens
Widget API intro, Authentication, and part of Rate limits.
Chunk 2: target 180 tokens
Remaining Rate limits, Example request, and Common errors.
Watch for: Token windows can split tables or code if the source is dense. Review boundaries before exporting to JSONL.
Paragraph-safe chunks
Best for: Articles, explainers, and source notes where paragraph meaning matters more than exact token packing.
Example output:
Chunk 1:
Title, intro, and Authentication paragraph.
Chunk 2:
Rate limits heading plus the complete table.
Chunk 3:
Example request code block plus Common errors list.
Watch for: Paragraph-safe chunks can vary in size. Use token estimates to catch sections that grow too large.
Code/table-safe chunks
Best for: Developer docs where broken tables or split fenced code blocks can damage retrieval quality.
Example output:
Chunk 1:
Widget API intro and Authentication.
Chunk 2:
Rate limits heading with the full Markdown table intact.
Chunk 3:
Example request heading with the full fenced bash block intact.
Chunk 4:
Common errors list.
Watch for: This preset may create more chunks, but it protects the source shapes that are most painful to reconstruct later.