LangChain document loader for pdfmux -- self-healing PDF extraction for RAG pipelines.
Most PDF loaders use a single extraction method and silently fail on complex layouts. pdfmux routes each page through the best extraction pipeline automatically:
- Smart routing -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
- Confidence scoring -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
- Self-healing -- retries with alternative extractors when the primary one returns low-quality output
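The per-chunk confidence score is meant to be consumed downstream. As a minimal sketch of that idea (plain dicts stand in for LangChain `Document` metadata, and the 0.8 threshold is an arbitrary example, not a pdfmux default), a pipeline might drop low-confidence chunks before indexing:

```python
# Sketch: filter extracted chunks by confidence before indexing.
# Plain dicts stand in for Document metadata; 0.8 is an example threshold.

def filter_by_confidence(chunks, threshold=0.8):
    """Keep only chunks whose extraction confidence meets the threshold."""
    return [c for c in chunks if c.get("confidence", 0.0) >= threshold]

chunks = [
    {"source": "report.pdf", "confidence": 0.94},
    {"source": "report.pdf", "confidence": 0.41},  # e.g. a garbled scanned page
]

kept = filter_by_confidence(chunks)
# kept holds only the 0.94 chunk
```

The same score can also be carried into the retriever and used for re-ranking instead of hard filtering, depending on how tolerant the downstream prompt is of noisy text.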
```shell
pip install langchain-pdfmux
```

```python
from langchain_pdfmux import PDFMuxLoader

docs = PDFMuxLoader("report.pdf").load()
```

Each `Document` includes metadata with extraction quality signals:
```python
loader = PDFMuxLoader("report.pdf", quality="high")

for doc in loader.lazy_load():
    print(doc.metadata)
    # {
    #     "source": "report.pdf",
    #     "title": "Q4 Results",
    #     "page_start": 1,
    #     "page_end": 3,
    #     "tokens": 820,
    #     "confidence": 0.94
    # }
```

```python
# Quality presets: "fast", "standard" (default), "high"
loader = PDFMuxLoader("report.pdf", quality="high")

# Load all PDFs in a directory
loader = PDFMuxLoader("./papers/")

# Custom glob pattern
loader = PDFMuxLoader("./papers/", glob="**/*.pdf")

# Streaming with lazy_load
for doc in PDFMuxLoader("large.pdf").lazy_load():
    process(doc)
```

MIT
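Because `lazy_load()` yields documents one at a time, large files can be embedded in fixed-size batches without holding everything in memory. A stdlib-only sketch of the batching helper (the string generator below stands in for a real `PDFMuxLoader("large.pdf").lazy_load()` stream):

```python
from itertools import islice

def batched(docs, size):
    """Yield lists of at most `size` items from any iterable of documents."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch

# Stand-in for a lazy_load() generator of Documents.
stream = (f"chunk-{i}" for i in range(10))

batches = list(batched(stream, size=4))
# 3 batches: sizes 4, 4, 2
```

In a real pipeline each batch would go to an embedding call or a vector-store `add_documents`, so a slow extraction never blocks on the whole PDF finishing first.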