Guide

HTML to Markdown for LangChain and LlamaIndex

A RAG pipeline is only as clean as the source material it loads. PromptStage fits before ingestion: clean the page into Markdown first, then load the result into your retrieval workflow.

Clean first, load second
Inspect the source before it becomes chunks and embeddings
Useful for both manual and early automated RAG pipelines

Ingestion path

PromptStage belongs before the document loader, not instead of it.

LangChain and LlamaIndex both give you ways to load and transform documents, but a noisy web page can still carry navigation, boilerplate, repeated links, and layout text into your chunks. Cleaning first makes the next stage easier to reason about.

Suggested flow

public page or copied HTML
-> PromptStage cleanup
-> inspected Markdown
-> document loader
-> splitter or ingestion pipeline
-> embeddings or retrieval store

Why clean first

Chunking noisy HTML usually preserves the wrong things.

If chunking starts from raw browser payloads, chunks can begin with navigation, repeat site-wide boilerplate, or split the useful article body in awkward places. Cleaning first lets you confirm that headings, code, and tables survived before they become retrieval inputs.
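A small illustration of the point above, using only the Python standard library. The page text, the 40-character chunk size, and the `fixed_size_chunks` helper are all made up for the example; real splitters are smarter, but the same effect shows up whenever navigation text leads the input.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical page text as it might look when extracted without cleanup.
RAW_TEXT = (
    "Home Products Docs Pricing Blog Sign in "
    "Widget API Authentication Send the token in the Authorization header. "
    "Privacy Terms Contact"
)

# The same hypothetical page after cleanup to Markdown.
CLEAN_MARKDOWN = (
    "# Widget API\n\n"
    "## Authentication\n\n"
    "Send the token in the Authorization header.\n"
)

raw_chunks = fixed_size_chunks(RAW_TEXT, 40)
clean_chunks = fixed_size_chunks(CLEAN_MARKDOWN, 40)

# Compare what the first retrieval unit would contain in each case.
print(repr(raw_chunks[0]))
print(repr(clean_chunks[0]))
```

In this toy case the first raw chunk is entirely navigation, while the first cleaned chunk opens with the page heading.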

Manual staging workflow

Start with a few clean source documents before you automate further.

The main goal here is not to build the perfect ingestion stack in one move. It is to prove that the source shape is useful before you scale the pipeline.

Step 1

Run the public page or copied HTML through PromptStage.

Step 2

Copy or download the cleaned Markdown.

Step 3

Save it in your project as a source document.

Step 4

Load the Markdown with the loader or reader you already use.

Step 5

Split by headers or sections when the document structure supports it.

Step 6

Keep source URL, page title, and capture date in metadata.
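Steps 4 to 6 can be sketched without committing to a specific loader. The example below is a minimal, standard-library-only illustration; `Chunk` and `split_by_headers` are hypothetical names, not part of LangChain, LlamaIndex, or PromptStage. It splits on `#`-style headings, ignores `#` lines inside fenced code, and copies the same metadata onto every chunk.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Widget API", "Authentication"]
    text: str
    metadata: dict = field(default_factory=dict)


def split_by_headers(markdown: str, metadata: dict) -> list[Chunk]:
    """Split Markdown at headings; each chunk keeps its heading trail and metadata."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # (level, title) of the headings currently open
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text, dict(metadata)))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes any open heading at the same or deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

If you would rather not maintain this yourself, LangChain's `MarkdownHeaderTextSplitter` and LlamaIndex's `MarkdownNodeParser` cover similar ground; check each library's current documentation for exact behavior.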

Example handoff shape

Keep metadata close to the cleaned source.

Source URL, title, and capture date make later QA, refresh work, and debugging much easier. That matters more than using one specific loader API.

Example Markdown document

---
source_url: https://example.com/docs/widget-api
title: Widget API
captured_at: 2026-04-29
prepared_with: PromptStage HTML to Markdown for AI
---

# Widget API

## Authentication

...
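To carry that header into your pipeline, the metadata block has to be separated from the body at load time. A minimal sketch, assuming the header only ever contains flat `key: value` lines as in the example above; `parse_front_matter` is a hypothetical helper, and a real YAML parser is the safer choice once values get more complex.

```python
def parse_front_matter(text: str) -> tuple[dict, str]:
    """Return (metadata, body). Handles flat 'key: value' lines only, not full YAML."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter: everything is body

    metadata: dict = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    return {}, text  # no closing delimiter: treat the whole text as body
```

The returned dictionary can then be attached to whatever document object your loader produces, so each chunk stays traceable to its source page.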

What to test

Trust the pipeline only after you inspect the retrieval outcomes.

A cleaner source should make the chunks easier to understand and the retrieval results easier to believe. If it does not, the source shape still needs work.

Test 1

Do retrieval results quote the main content instead of navigation?

Test 2

Do chunks start under meaningful headings?

Test 3

Did code blocks or tables survive in usable form?

Test 4

Can a human inspect the Markdown and understand the source?

Test 5

Does metadata preserve where the page came from?
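Some of these checks can be partly automated as a first pass before a human reads the output. The helpers below are a rough sketch with hypothetical names and heuristics: they assume each chunk's text keeps its heading line, that navigation shows up as lines containing only a Markdown link, and that the three metadata keys match the example document. They flag candidates for inspection rather than replace it.

```python
import re

FENCE = "`" * 3  # code-fence marker
REQUIRED_METADATA = ("source_url", "title", "captured_at")
LINK_ONLY = re.compile(r"^\s*(?:[-*]\s+)?\[[^\]]*\]\([^)]*\)\s*$")
HEADING = re.compile(r"^#{1,6}\s+\S")


def missing_metadata(metadata: dict) -> list[str]:
    """Test 5: list required keys that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not metadata.get(key)]


def has_unclosed_fence(markdown: str) -> bool:
    """Test 3 (partial): an odd number of fence lines suggests a broken code block."""
    fences = sum(1 for line in markdown.splitlines() if line.lstrip().startswith(FENCE))
    return fences % 2 == 1


def link_only_ratio(text: str) -> float:
    """Test 1 (rough signal): share of non-empty lines that are nothing but a link."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if LINK_ONLY.match(line)) / len(lines)


def starts_under_heading(chunk_text: str) -> bool:
    """Test 2: does the first non-empty line of the chunk look like a heading?"""
    first = next((line for line in chunk_text.splitlines() if line.strip()), "")
    return bool(HEADING.match(first))
```

A high `link_only_ratio` on a retrieved chunk is a hint, not proof, that navigation leaked through. Test 4 stays manual by design.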

Related paths

Use this page as the developer-facing integration branch of the HTML to Markdown guides.

This guide sits between the broader format pages and the actual RAG implementation work. It is where the practical ingestion handoff gets explained.

Should I use raw HTML or Markdown for RAG?

Use raw HTML when DOM structure matters. Use Markdown when you want readable content structure without most browser page noise.

Can PromptStage replace my document loader?

No. PromptStage is a cleanup step before loading. Your RAG stack still needs its loader, splitter, embeddings, and retrieval layer.

Should I build a crawler first?

For a small source set, usually no. Start with a few cleaned pages and inspect the retrieval quality before you automate more aggressively.

Broader retrieval guide

Read "HTML to Markdown for RAG" for the format and chunking framing behind this implementation path.

Automation cousin

Continue into "HTML to Markdown for n8n" when the same source-cleanup logic needs to feed a workflow tool instead of a code-first stack.