NameetP/langchain-pdfmux

langchain-pdfmux


LangChain document loader for pdfmux -- self-healing PDF extraction for RAG pipelines.

Why pdfmux?

Most PDF loaders apply a single extraction method to every page, so they can fail silently on complex layouts. pdfmux instead routes each page through the extraction pipeline best suited to it:

  • Smart routing -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
  • Confidence scoring -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
  • Self-healing -- retries with alternative extractors when the primary one returns low-quality output
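
As a rough illustration of the "filter" use of confidence scores, the sketch below partitions chunks by a threshold. It does not call pdfmux: `Doc` is a hypothetical stand-in for a LangChain `Document`, the `split_by_confidence` helper and the 0.8 threshold are assumptions, and the 0 to 1 score range is inferred from the sample metadata in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_confidence(docs, threshold=0.8):
    """Partition docs into (kept, flagged) using metadata["confidence"].

    Docs without a confidence value are flagged rather than trusted.
    The 0.8 default is an arbitrary choice for illustration.
    """
    kept, flagged = [], []
    for doc in docs:
        score = doc.metadata.get("confidence")
        if score is not None and score >= threshold:
            kept.append(doc)
        else:
            flagged.append(doc)
    return kept, flagged


# Made-up sample chunks, not real extraction output.
docs = [
    Doc("Revenue grew 12% year over year.", {"confidence": 0.94}),
    Doc("T0tal r3venue ...", {"confidence": 0.41}),
    Doc("Chunk with no score.", {}),
]
kept, flagged = split_by_confidence(docs, threshold=0.8)
print(len(kept), len(flagged))
```

Flagged chunks could be dropped, logged for review, or re-extracted at a higher quality preset, depending on the pipeline.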

Install

pip install langchain-pdfmux

Usage

from langchain_pdfmux import PDFMuxLoader

docs = PDFMuxLoader("report.pdf").load()

Each Document includes metadata with extraction quality signals:

loader = PDFMuxLoader("report.pdf", quality="high")
for doc in loader.lazy_load():
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }
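
One way these fields might be used downstream is to pick the most trustworthy chunks that fit a context budget and to build a page citation for each. This is a minimal sketch over plain metadata dicts shaped like the sample above; `pack_by_confidence`, `cite`, and the 1300-token budget are hypothetical and not part of langchain-pdfmux.

```python
def pack_by_confidence(metadatas, max_tokens):
    """Greedily select chunks, highest confidence first, within a token budget."""
    chosen, used = [], 0
    for meta in sorted(metadatas, key=lambda m: m["confidence"], reverse=True):
        if used + meta["tokens"] <= max_tokens:
            chosen.append(meta)
            used += meta["tokens"]
    return chosen, used


def cite(meta):
    """Format a short source citation from source and page-range metadata."""
    start, end = meta["page_start"], meta["page_end"]
    pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
    return f'{meta["source"]}, {pages}'


# Made-up metadata; only the first entry mirrors the README sample.
chunks = [
    {"source": "report.pdf", "page_start": 1, "page_end": 3, "tokens": 820, "confidence": 0.94},
    {"source": "report.pdf", "page_start": 4, "page_end": 4, "tokens": 600, "confidence": 0.55},
    {"source": "report.pdf", "page_start": 5, "page_end": 6, "tokens": 400, "confidence": 0.88},
]
chosen, used = pack_by_confidence(chunks, max_tokens=1300)
for meta in chosen:
    print(cite(meta), meta["confidence"])
```

The greedy pass is simple rather than optimal: it can skip a large high-confidence chunk and still admit a smaller, lower-confidence one later.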

Options

# Quality presets: "fast", "standard" (default), "high"
loader = PDFMuxLoader("report.pdf", quality="high")

# Load all PDFs in a directory
loader = PDFMuxLoader("./papers/")

# Custom glob pattern
loader = PDFMuxLoader("./papers/", glob="**/*.pdf")

# Stream documents one at a time with lazy_load
# (process is a placeholder for your own handler)
for doc in PDFMuxLoader("large.pdf").lazy_load():
    process(doc)
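
When streaming a large file, it is often convenient to hand documents to an embedder or vector store in fixed-size batches instead of one by one. The sketch below shows that pattern with the standard library only. `fake_lazy_load` is a hypothetical stand-in for `PDFMuxLoader(...).lazy_load()`, and the batch size of 3 is arbitrary.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole stream."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(n):
    """Hypothetical stand-in for a loader's lazy_load() generator."""
    for i in range(n):
        yield {"page_content": f"chunk {i}", "metadata": {"source": "large.pdf"}}


# Seven streamed documents grouped into batches of three.
batch_sizes = [len(batch) for batch in batched(fake_lazy_load(7), size=3)]
print(batch_sizes)  # [3, 3, 1]
```

With a real loader, the same `batched` helper would wrap the `lazy_load()` call directly, so at most one batch of documents is held in memory at a time.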

License

MIT
