Description
Add a PDFLoader that reads a PDF file and returns one Document per page (or one Document for the whole file, configurable).
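The per-page versus whole-file behaviour can be sketched independently of PDF parsing. The example below is a minimal illustration, not the framework's API: the `Document` dataclass is a stand-in for the real class, and the `build_documents` helper, the `split_pages` flag, the metadata keys, and the blank-line page separator are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for the framework's Document type (real fields may differ)."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_documents(
    page_texts: list[str], source: str, split_pages: bool = True
) -> list[Document]:
    """Turn extracted page texts into one Document per page, or one for the file."""
    if split_pages:
        # Per-page mode: record a 1-based page number alongside the source.
        return [
            Document(text=text, metadata={"source": source, "page": number})
            for number, text in enumerate(page_texts, start=1)
        ]
    # Whole-file mode: join pages with a blank line (the separator is an assumption).
    return [
        Document(
            text="\n\n".join(page_texts),
            metadata={"source": source, "page_count": len(page_texts)},
        )
    ]


pages = ["Intro", "Methods", "Results"]
per_page = build_documents(pages, "paper.pdf")
whole = build_documents(pages, "paper.pdf", split_pages=False)
print(len(per_page), len(whole))  # prints "3 1"
```

Keeping the split logic separate from PDF reading means the configurable behaviour can be tested without a real PDF fixture.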
Motivation
PDF is the most common document format in enterprise and research settings. Without PDF support, RAG Framework cannot process the majority of real-world document corpora.
Files to touch
ragframework/document/loaders.py — add PDFLoader
ragframework/document/__init__.py — export it
tests/test_document/test_loaders.py — add tests
Resources
- pypdf docs
- Existing loaders (TextFileLoader, MarkdownLoader) in ragframework/document/loaders.py as reference
Acceptance criteria
- PDFLoader implemented in ragframework/document/loaders.py
- Subclasses DocumentLoader from ragframework/base.py
- Uses pypdf (already listed in the [pdf] optional extra in pyproject.toml)
- Raises LoaderError (from ragframework/exceptions.py) on failure
- pypdf import guarded with a helpful error message pointing to pip install ragframework[pdf]
- Tests added in tests/test_document/test_loaders.py
- Exported from ragframework/document/__init__.py
- CHANGELOG.md updated under [Unreleased]
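The acceptance criteria ask for a guarded pypdf import and a LoaderError on failure. One way this could look is sketched below. The `LoaderError` class is a local stand-in for the one in ragframework/exceptions.py, the helper names `require_pypdf` and `read_page_texts` are hypothetical, and raising `ImportError` for the missing dependency is an assumption, since the issue does not say which exception type the guard should use. The pypdf calls (`PdfReader`, `reader.pages`, `page.extract_text()`) are part of pypdf's public API.

```python
class LoaderError(Exception):
    """Stand-in for ragframework.exceptions.LoaderError (assumed Exception subclass)."""


def require_pypdf():
    """Import pypdf lazily so the base package still works without the [pdf] extra."""
    try:
        import pypdf
    except ImportError as exc:
        raise ImportError(
            "PDFLoader requires pypdf. Install it with: pip install ragframework[pdf]"
        ) from exc
    return pypdf


def read_page_texts(path: str) -> list[str]:
    """Return the extracted text of each page, wrapping read failures in LoaderError."""
    pypdf = require_pypdf()
    try:
        reader = pypdf.PdfReader(path)
        # extract_text() may yield nothing for image-only pages; normalise to "".
        return [page.extract_text() or "" for page in reader.pages]
    except Exception as exc:  # deliberately broad in this sketch
        raise LoaderError(f"Failed to load PDF {path!r}: {exc}") from exc
```

In the real loader the exception would be imported from ragframework/exceptions.py rather than defined locally, and the broad `except` could be narrowed to the specific I/O and parsing errors the maintainers want to wrap.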