langchain-pdfmux

LangChain document loader for pdfmux -- self-healing PDF extraction for RAG pipelines.

Why pdfmux?

Most PDF loaders use a single extraction method and silently fail on complex layouts. pdfmux routes each page through the best extraction pipeline automatically:

Smart routing -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
Confidence scoring -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
Self-healing -- retries with alternative extractors when the primary one returns low-quality output
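
The self-healing idea above can be illustrated with a small fallback loop. This is a sketch of the general technique only, not pdfmux's internal implementation: `extract_with_fallback`, the `Extractor` type, and the two stand-in extractors are hypothetical names invented for this example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical extractor signature: raw page bytes in, (text, confidence) out.
Extractor = Callable[[bytes], Tuple[str, float]]


def extract_with_fallback(
    page: bytes,
    extractors: Iterable[Extractor],
    min_confidence: float = 0.8,
) -> Tuple[str, float]:
    """Try extractors in order; return the first result that clears the
    threshold, or the highest-confidence result seen if none does."""
    best: Tuple[str, float] = ("", 0.0)
    for extract in extractors:
        text, confidence = extract(page)
        if confidence >= min_confidence:
            return text, confidence
        if confidence > best[1]:
            best = (text, confidence)
    return best


# Stand-in extractors with made-up outputs, for illustration only.
def fast_text(page: bytes) -> Tuple[str, float]:
    return ("g@rbl3d", 0.35)


def ocr(page: bytes) -> Tuple[str, float]:
    return ("Quarterly revenue rose 12%", 0.91)


result = extract_with_fallback(b"%PDF-", [fast_text, ocr])
print(result)
```

Returning the best low-confidence result instead of raising keeps the pipeline moving while still exposing the score to downstream filters.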

Install

pip install langchain-pdfmux

Usage

from langchain_pdfmux import PDFMuxLoader

docs = PDFMuxLoader("report.pdf").load()

Each Document includes metadata with extraction quality signals:

loader = PDFMuxLoader("report.pdf", quality="high")
for doc in loader.lazy_load():
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }
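
A minimal sketch of using the `confidence` field to filter and re-rank chunks before indexing. `filter_by_confidence` is a hypothetical helper, not part of langchain-pdfmux, and the `SimpleNamespace` objects are stand-ins for loader output so the snippet runs without any PDF; it relies only on each document exposing a `metadata` dict as shown above.

```python
from types import SimpleNamespace


def _confidence(doc) -> float:
    # Treat a missing score as 0.0 so unscored chunks sort last.
    return doc.metadata.get("confidence", 0.0)


def filter_by_confidence(docs, threshold=0.8):
    """Keep documents at or above the threshold, highest confidence first."""
    kept = [d for d in docs if _confidence(d) >= threshold]
    return sorted(kept, key=_confidence, reverse=True)


# Stand-ins for Document objects; the values are made up for illustration.
docs = [
    SimpleNamespace(page_content="Summary", metadata={"confidence": 0.88}),
    SimpleNamespace(page_content="Blurry scan", metadata={"confidence": 0.41}),
    SimpleNamespace(page_content="Revenue table", metadata={"confidence": 0.94}),
]

trusted = filter_by_confidence(docs, threshold=0.8)
print([d.page_content for d in trusted])
```

The right threshold depends on your corpus; 0.8 here is an arbitrary starting point, not a recommended value.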

Options

# Quality presets: "fast", "standard" (default), "high"
loader = PDFMuxLoader("report.pdf", quality="high")

# Load all PDFs in a directory
loader = PDFMuxLoader("./papers/")

# Custom glob pattern
loader = PDFMuxLoader("./papers/", glob="**/*.pdf")

# Streaming with lazy_load (yields documents one at a time)
for doc in PDFMuxLoader("large.pdf").lazy_load():
    process(doc)  # process() stands in for your own handler
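
When streaming, it is often convenient to hand documents to an embedder or vector store in fixed-size groups without loading the whole file. A minimal sketch, assuming only that `lazy_load()` returns an iterator: `batched` is a hypothetical helper written for this example, and `fake_lazy_load` is a stand-in generator so the snippet runs without a PDF.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items, consuming the iterable lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    # Stand-in for PDFMuxLoader("large.pdf").lazy_load()
    for i in range(5):
        yield f"doc-{i}"


batches = list(batched(fake_lazy_load(), 2))
print(batches)
```

In real use you would iterate over `batched(loader.lazy_load(), size)` and index each batch as it arrives, so only one batch is held in memory at a time.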

License

MIT
