RAG Ingestion Layer

Stop Building Custom Scrapers for RAG

Feed your RAG stack one clean Markdown file per source. SiteToMarkdown handles the crawling, cleaning, and formatting so you can focus on your retrieval pipeline.

Diagram: What is RAG - Retrieval augmented generation flow

RAG combines retrieval from a knowledge base with generation by an LLM.

What is Retrieval-Augmented Generation (RAG)?

RAG is a technique that enhances Large Language Models (LLMs) by retrieving relevant data from your external knowledge base before generating an answer. This allows AI to answer questions about your specific documentation, private data, or real-time information that wasn't in its training set.
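The loop itself is small. The sketch below shows the retrieve-then-generate flow; the knowledge base, the keyword-overlap retriever, and the generate() stub are purely illustrative stand-ins, not part of SiteToMarkdown or any framework.

```python
# Toy RAG loop: retrieve the most relevant snippet, then prompt an LLM with it.
# KNOWLEDGE_BASE and generate() are placeholders for illustration only.

KNOWLEDGE_BASE = [
    "Stripe webhooks notify your server about events like successful payments.",
    "Stripe Checkout is a prebuilt, hosted payment page.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. an API or local model request)."""
    return f"[LLM answer based on prompt: {prompt[:60]}...]"

def rag_answer(query: str) -> str:
    context = retrieve(query, KNOWLEDGE_BASE)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("How do webhooks work in Stripe?"))
```

A production retriever would use vector similarity over embedded chunks instead of word overlap, but the retrieve-then-generate shape is the same.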

The "Custom Scraper" Trap

Building a RAG pipeline is hard enough without having to maintain a custom scraper for every data source.

  • Brittle Selectors: One HTML change breaks your entire ingestion pipeline.
  • JS Rendering: Plain requests or curl calls can't execute JavaScript, so modern docs sites come back as empty shells.
  • Noise: Navbars, footers, and ads pollute your vector store embeddings.

The SiteToMarkdown Solution

We treat documentation sites as a single unit of knowledge.

  • One URL → One File: Give us the root URL, get back a comprehensive Markdown file.
  • LLM-Optimized: Clean Markdown structure preserves hierarchy for better chunking.
  • Universal Format: Works with LangChain, LlamaIndex, Haystack, and more.

Integration Guide

How to load SiteToMarkdown datasets into your favorite RAG frameworks.

LangChain

Use the UnstructuredMarkdownLoader from the community package to load the file into Document objects.

from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./stripe_docs.md")
docs = loader.load()

# Ready for splitting and embedding
print(f"Loaded {len(docs)} documents")

Haystack

Use the MarkdownToDocument converter component in your pipeline.

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["./stripe_docs.md"])
documents = results["documents"]

# Ready for indexing

RAGatouille

Read the file content and pass it to the indexer as a one-document collection. (RAGTrainer's prepare_training_data expects query-passage pairs, not raw documents, so indexing with RAGPretrainedModel is the right entry point here.)

from ragatouille import RAGPretrainedModel

with open("./stripe_docs.md", "r") as f:
    markdown_content = f.read()

# RAGatouille can split long documents into passages at index time
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(collection=[markdown_content], index_name="stripe_docs", split_documents=True)

RagFlow

Upload the Markdown file directly to your Knowledge Base. RagFlow's Deep Document Understanding engine handles the parsing and chunking automatically.

Why Markdown is Best for RAG

Raw HTML is full of noise. PDFs are hard to parse. Markdown is the sweet spot for LLMs.

Structure Preservation

Headings (#, ##) naturally define semantic chunks, making it easier for splitters to keep related context together.
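As a sketch of why this matters (pure Python, no framework, hypothetical sample document), a header-aware splitter can be as simple as starting a new chunk at each heading line:

```python
import re

def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each # heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A heading line (1-6 leading #'s) closes the previous chunk
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Payments\nIntro text.\n## Webhooks\nVerify signatures.\n## Refunds\nUse the API."
for chunk in split_on_headings(doc):
    print(chunk)
    print("---")
```

Each chunk keeps its heading attached, so the context a retriever later surfaces is self-describing. Framework splitters (e.g. header-based Markdown splitters) apply the same idea with more care.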

Token Efficiency

No <div> soup. Markdown is dense and information-rich, saving you money on embedding and storage costs.
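The difference is easy to measure. This toy comparison (a hypothetical snippet; character counts as a rough proxy for tokens) puts the same sentence in typical rendered HTML and in Markdown:

```python
# The same content as typical rendered HTML vs. Markdown.
html = (
    '<div class="docs-content"><div class="section">'
    '<h2 class="heading">Webhooks</h2>'
    '<p class="body-text">Verify every webhook signature.</p>'
    "</div></div>"
)
markdown = "## Webhooks\n\nVerify every webhook signature."

# Character count is a crude stand-in for token count, but the gap is real
print(len(html), len(markdown))
print(f"Markdown uses {100 * len(markdown) / len(html):.0f}% of the characters")
```

Every wrapper tag and class attribute you don't embed is money saved at both embedding and storage time.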

Code Friendly

Code blocks are preserved with language tags, which is critical for technical documentation RAG.
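Because fences carry a language tag, downstream tooling can route code chunks to language-aware splitters or filters. A minimal regex-based extractor (a sketch; the fence string is built indirectly so the example itself stays valid Markdown):

```python
import re

# Build the ``` fence indirectly so this snippet nests safely in Markdown
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"(\w+)\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for every language-tagged fenced block."""
    return [(lang, code.strip()) for lang, code in BLOCK_RE.findall(markdown)]

doc = f"Setup:\n{FENCE}bash\npip install stripe\n{FENCE}\nThen ship it."
print(extract_code_blocks(doc))
```

A real pipeline would use a proper Markdown parser, but even this sketch shows why the language tag survives cleanly into your chunks.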

Ready to streamline your RAG pipeline?