PromptStage
AI workflow staging tools
Guide
HTML is not the enemy. Noisy page shells are. Large language models can read HTML, but browser pages often include scripts, layout wrappers, repeated navigation, and interface text that add tokens without adding much task value.
Core idea
In AI workflows, clean HTML means source material that has been reduced to the parts a model can actually use. PromptStage converts that source into Markdown because Markdown is easier for humans to inspect and easier to hand between prompts, agents, retrieval systems, and automation steps.
A clean source keeps the title, headings, paragraphs, lists, links, code, and tables that the model can actually use.
Navigation, footers, sidebars, cookie banners, scripts, and promotional modules usually add tokens without adding much task value.
An LLM-ready intermediate should be something a human can read and verify before it is passed into a prompt, agent, or retrieval layer.
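The split between semantic core and page chrome can be sketched with Python's standard-library `html.parser`. This is an illustration only, not PromptStage's implementation: the `NOISE_TAGS` set, the `VisibleText` class, and the `visible_text` helper are assumed names, and the depth counting assumes reasonably well-formed nesting.

```python
from html.parser import HTMLParser

# Assumed list of elements treated as page chrome; adjust per site.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "noscript"}
# Void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source", "wbr"}


class VisibleText(HTMLParser):
    """Collects text that sits outside the noise elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a noise element, count every nested tag as well.
        if tag in NOISE_TAGS or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text fragments found outside NOISE_TAGS."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Fed a page with a `<nav>`, a `<main>` containing a heading and a paragraph, and a `<footer>`, this keeps only the heading and paragraph text. Real pages often mark chrome with classes or ARIA roles rather than semantic tags, so a tag list like this is a starting point at best.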
Why raw HTML gets expensive
The problem is rarely that the page contains too much meaning. It is that the page wraps the meaning in a lot of browser-oriented implementation detail.
Class names and layout wrappers increase token count.
Repeated navigation and footer blocks dilute the main content.
Scripts and style blocks rarely help prompt tasks.
Browser-only UI text can distract from the source body.
The source gets harder to review before reuse.
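One rough way to see this cost is to compare the size of the raw HTML with the size of the text a reader would actually see. The sketch below uses character counts as a crude stand-in for tokens, which is an assumption: real tokenizers behave differently, so treat the number as a relative signal only. `TextOnly` and `markup_overhead` are hypothetical names.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text content, ignoring script and style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_raw = 0  # greater than 0 while inside <script> or <style>
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_raw += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_raw:
            self.in_raw -= 1

    def handle_data(self, data):
        if not self.in_raw:
            self.text.append(data)


def markup_overhead(html):
    """Return the fraction of characters that are not visible text."""
    if not html:
        return 0.0
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not count as content.
    visible = len(" ".join("".join(parser.text).split()))
    return 1 - visible / len(html)
```

A fragment such as `<div class="wrapper container-lg"><p class="text-base leading-7">Hello world</p></div>` scores well above 0.5, because the wrappers and class names outweigh the two words of content.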
LLM-ready checklist
If you cannot quickly tell whether the cleaned source begins in the right place and preserves the right structures, the intermediate is not ready yet.
Starts with the real page title or main content section.
Preserves useful heading hierarchy where it helps comprehension.
Keeps code blocks and tables readable when they matter.
Keeps important links with meaningful anchor text.
Removes repeated navigation and promotional modules.
Stays compact enough to inspect before passing downstream.
Does not claim to include content hidden behind client-rendered app states or login walls.
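Parts of this checklist can be approximated mechanically before a human looks at the output. The sketch below is a set of heuristics, not a validator: `check_intermediate`, the `CHROME_HINTS` phrases, and the `max_chars` default are all assumptions chosen for illustration, and a passing result does not replace reading the first lines yourself.

```python
import re

# Hypothetical phrases that often signal leftover page chrome.
CHROME_HINTS = ("skip to content", "accept cookies", "sign in", "subscribe")


def check_intermediate(markdown, max_chars=20000):
    """Run a few heuristic checks on a cleaned Markdown string."""
    lines = [ln for ln in markdown.splitlines() if ln.strip()]
    first = lines[0] if lines else ""
    head = "\n".join(lines[:10]).lower()
    fence_count = sum(ln.lstrip().startswith("```") for ln in lines)
    return {
        # Starts with the page title or a main section heading.
        "starts_with_heading": first.startswith("#"),
        # An odd number of fences suggests a code block was cut off.
        "balanced_code_fences": fence_count % 2 == 0,
        # Links (and images) with empty text carry no meaning for a model.
        "no_empty_link_text": re.search(r"\[\s*\]\(", markdown) is None,
        # The first few lines should not read like navigation or banners.
        "no_chrome_in_head": not any(h in head for h in CHROME_HINTS),
        # Small enough to inspect before passing downstream.
        "compact": len(markdown) <= max_chars,
    }
```

Each key maps to a boolean, so a caller can log which checks failed rather than getting a single pass or fail.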
Format choice
Raw HTML is useful when the DOM matters. Plain text is useful when only the words matter. Markdown is the middle format for a lot of AI prep work: it keeps structure without carrying most browser implementation detail.
Workflow
Treat cleanup as a staging step. A better intermediate format usually makes the next model-facing step cheaper, clearer, and easier to debug.
Start with the source page or copied HTML.
Convert it into Markdown.
Inspect the first lines for leftover page chrome.
Check whether headings, code, lists, and tables survived in useful form.
Use the cleaned source in the next model-facing step.
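The steps above can be sketched end to end with a toy converter. This is not how PromptStage converts pages: `MiniMarkdown`, `to_markdown`, and `preview` are hypothetical names, and the converter handles only headings, paragraphs, list items, and links while dropping text that sits outside those elements. It is meant to show the shape of the staging step, under the assumption of simple, well-formed input.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "aside"}  # assumed chrome tags
BLOCKS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}


class MiniMarkdown(HTMLParser):
    """Toy converter: headings, paragraphs, list items, and links only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # finished Markdown blocks
        self.buf = []     # text of the block being built
        self.prefix = ""  # Markdown prefix for the current block
        self.skip = 0     # depth inside skipped elements
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip += 1
        if self.skip:
            return
        if tag in BLOCKS:
            self.buf = []  # drop stray text found between blocks
            if tag == "li":
                self.prefix = "- "
            elif tag == "p":
                self.prefix = ""
            else:
                self.prefix = "#" * int(tag[1]) + " "
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.buf.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
            return
        if self.skip:
            return
        if tag == "a" and self.href:
            self.buf.append(f"]({self.href})")
            self.href = None
        elif tag in BLOCKS:
            text = " ".join("".join(self.buf).split())
            if text:
                line = self.prefix + text
                # Keep consecutive list items in one tight list.
                if self.prefix == "- " and self.out and self.out[-1].startswith("- "):
                    self.out[-1] += "\n" + line
                else:
                    self.out.append(line)
            self.buf, self.prefix = [], ""

    def handle_data(self, data):
        if not self.skip:
            self.buf.append(data)


def to_markdown(html):
    """Step 2: convert the source HTML into Markdown."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.out)


def preview(markdown, n=5):
    """Step 3: return the first lines for a quick check for page chrome."""
    return markdown.splitlines()[:n]
```

Calling `preview(to_markdown(page))` before handing the result downstream makes the inspection step explicit: if the first line is a menu label rather than the page title, the cleanup needs another pass.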
FAQ
The main tool does the cleanup. This page explains what "LLM-ready" actually means and where the cleanup pays off.
LLMs can read raw HTML, but raw browser HTML often carries more noise than the task needs. Cleaning helps when the model should focus on the content rather than the page implementation.
You do not always need to convert HTML to Markdown. Keep HTML when the DOM structure itself matters. Use Markdown when the content is the object of analysis and a cleaner intermediate is more useful.
Cleanup often helps retrieval, especially when the original page includes repeated chrome or large non-content regions. The main win is making chunks easier to inspect before retrieval.
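To make the point about inspectable chunks concrete, here is a minimal sketch that splits cleaned Markdown at headings so each chunk can be read on its own before indexing. It assumes ATX-style `#` headings and triple-backtick fences; `chunk_by_heading` is a hypothetical helper, not part of any retrieval library.

```python
import re


def chunk_by_heading(markdown):
    """Split Markdown into (heading, body) pairs at ATX headings."""
    chunks = []
    heading, lines = None, []
    in_fence = False

    def flush():
        body = "\n".join(lines).strip()
        if heading is not None or body:
            chunks.append((heading, body))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are not headings.
        if not in_fence and re.match(r"#{1,6} ", line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Because every chunk starts at a heading, a reviewer can scan the list of headings and spot leftover chrome (for example a chunk titled after a footer section) before anything reaches the retrieval layer.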
Open "HTML to Markdown for AI" when you want the cleaned Markdown output itself.
Compare "HTML vs Markdown for AI" when you want the broader format-choice framing behind this cleanup path.
Continue into "HTML to Markdown for RAG" or "HTML to Markdown for LangChain and LlamaIndex" for retrieval-specific workflow guidance.