A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.
- Multi-format Support: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
- Intelligent OCR: Layout-aware PDF text extraction with OCR fallback for images
- Semantic Chunking: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
- LLM Summarization: Generate concise document summaries (with fallback)
- Meilisearch Integration: Built-in support for indexing to Meilisearch
- Flexible API: Use components individually or as a unified pipeline
From PyPI:

```bash
pip install docprocessor
```

From GitHub:

```bash
pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git
```

From source (for development):

```bash
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"
```

For OCR functionality, install system packages:

Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr poppler-utils
```

macOS:

```bash
brew install tesseract poppler
```

Basic usage:

```python
from docprocessor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False,  # Requires an LLM client
)

print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")
```

With LLM-based summarization:

```python
from docprocessor import DocumentProcessor

# Your LLM client (must have a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500,
)

result = processor.process(
    file_path="document.pdf",
    summarize=True,
)

print(f"Summary: {result.summary}")
```

Indexing to Meilisearch:

```python
from docprocessor import DocumentProcessor, MeiliSearchIndexer

# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)
# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_",  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks",
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10,
)
```

Custom chunking configuration:

```python
processor = DocumentProcessor(
    chunk_size=1024,     # Larger chunks
    chunk_overlap=100,   # More overlap
    min_chunk_size=200,  # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt",
)
```

Text extraction only:

```python
processor = DocumentProcessor()
extraction = processor.extract_text("document.pdf")
print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")# Index to multiple environments
environments = {
"dev": {
"url": "http://localhost:7700",
"api_key": "dev_key",
"prefix": "dev_"
},
"prod": {
"url": "https://search.production.com",
"api_key": "prod_key",
"prefix": "prod_"
}
}
for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"],
    )
    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")
```

DocumentProcessor: the main class for document processing.
Parameters:
- ocr_enabled (bool): Enable OCR for PDFs/images. Default: True
- chunk_size (int): Target chunk size in tokens. Default: 512
- chunk_overlap (int): Overlap between chunks. Default: 50
- min_chunk_size (int): Minimum chunk size. Default: 100
- summary_target_words (int): Target summary length. Default: 500
- llm_client (Optional[Any]): LLM client for summarization
- llm_temperature (float): LLM temperature. Default: 0.3
Methods:
- process(): Full pipeline (extract, chunk, summarize)
- extract_text(): Extract text from a document
- chunk_text(): Chunk text into segments
- summarize_text(): Generate a summary
- chunks_to_search_documents(): Convert chunks for indexing
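For reference, a constructor sketch that simply spells out the documented parameters with their documented default values (no LLM client is configured here, so summarization stays disabled):

```python
from docprocessor import DocumentProcessor

processor = DocumentProcessor(
    ocr_enabled=True,          # Default: True
    chunk_size=512,            # Default: 512
    chunk_overlap=50,          # Default: 50
    min_chunk_size=100,        # Default: 100
    summary_target_words=500,  # Default: 500
    llm_temperature=0.3,       # Default: 0.3
    # llm_client=...           # pass an object with complete_chat() to enable summarize=True
)
```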
MeiliSearchIndexer: interface for Meilisearch operations.
Parameters:
- url (str): Meilisearch server URL
- api_key (str): Meilisearch API key
- index_prefix (Optional[str]): Prefix for index names
Methods:
- index_chunks(): Index multiple documents
- index_document(): Index a single document
- search(): Search an index
- delete_document(): Delete by ID
- delete_documents_by_filter(): Delete by filter
- create_index(): Create a new index
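For illustration only, a rough sketch of the single-document and maintenance methods. The argument names and order shown here are assumptions (the index_chunks and search calls earlier in this README are the confirmed usage); check the method docstrings for the exact signatures:

```python
from docprocessor import MeiliSearchIndexer

indexer = MeiliSearchIndexer(url="http://localhost:7700", api_key="your_master_key")

# NOTE: argument names/order below are hypothetical illustrations, not confirmed signatures.
indexer.create_index("document_chunks")
indexer.index_document({"chunk_id": "abc-123", "chunk_text": "..."}, index_name="document_chunks")
indexer.delete_document("abc-123", index_name="document_chunks")
indexer.delete_documents_by_filter('filename = "old.pdf"', index_name="document_chunks")
```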
Data class representing a text chunk.
Attributes:
- chunk_id (str): Unique identifier
- file_id (str): Source file identifier
- output_id (str): Output identifier
- project_id (int): Project identifier
- filename (str): Source filename
- chunk_number (int): Chunk sequence number
- total_chunks (int): Total chunks in the document
- chunk_text (str): The chunk text content
- token_count (int): Number of tokens
- pages (List[int]): Page numbers (for PDFs)
- metadata (Dict): Additional metadata
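As an illustration, a short sketch that reads these attributes off the chunks returned by process() (this assumes result.chunks holds instances of this data class rather than plain dicts):

```python
from docprocessor import DocumentProcessor

processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)

for chunk in result.chunks:
    # Attributes documented above
    print(f"{chunk.filename} [{chunk.chunk_number}/{chunk.total_chunks}] "
          f"{chunk.token_count} tokens, pages={chunk.pages}")
    print(chunk.chunk_text[:80])
```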
DocProcessor consists of several independent components:
- ContentExtractor: Extracts text from various file formats
- DocumentChunker: Splits text into semantic segments
- DocumentSummarizer: Generates LLM-based summaries
- MeiliSearchIndexer: Indexes documents to Meilisearch
Each component can be used independently or through the unified DocumentProcessor API.
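For example, a minimal sketch of driving the extractor and chunker on their own; the import path and method names here are assumptions that mirror the DocumentProcessor API rather than confirmed signatures:

```python
# Assumption: ContentExtractor and DocumentChunker are importable from the package
# root and expose extract_text()/chunk_text() mirroring DocumentProcessor's methods.
from docprocessor import ContentExtractor, DocumentChunker

extractor = ContentExtractor()
extraction = extractor.extract_text("document.pdf")

chunker = DocumentChunker()
chunks = chunker.chunk_text(text=extraction["text"], filename="document.pdf")
```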
Python: 3.10+ (tested on 3.10, 3.11, 3.12)
Core Dependencies:
- pdfminer.six - PDF text extraction
- pdf2image - PDF to image conversion
- pytesseract - OCR engine
- scikit-image - Image preprocessing
- Pillow - Image handling
- python-docx - DOCX extraction
- python-pptx - PPTX extraction
- langchain-text-splitters - Semantic chunking
- tiktoken - Token counting
Optional:
- meilisearch - Search engine integration
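
The Meilisearch client is a separate PyPI package, so if you plan to use MeiliSearchIndexer install it alongside the library (a plain pip install is shown; whether the project also exposes a packaging extra is not stated here):

```bash
pip install meilisearch
```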
See the examples/ directory for more usage examples:
- basic_usage.py - Simple document processing
- multi_environment.py - Indexing to multiple environments
- custom_chunking.py - Advanced chunking options
The easiest way to start developing:
- Click the Code button on GitHub
- Select Codespaces → Create codespace on main
- Wait for the environment to build (includes all dependencies)
- Start coding!
The devcontainer automatically installs:
- Python 3.11
- All system dependencies (Tesseract, Poppler)
- Python dependencies in editable mode
- Pre-commit hooks
- VS Code extensions (Black, isort, flake8, etc.)
```bash
# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
# Install Python dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run tests with coverage
pytest --cov=docprocessor
```

We use automated tools to maintain code quality:

```bash
# Format code
black docprocessor tests
# Sort imports
isort docprocessor tests
# Lint
flake8 docprocessor tests
# Type check
mypy docprocessor
# Or run all checks with pre-commit
pre-commit run --all-files
```

Running the tests:

```bash
# Run all tests
pytest
# With coverage report
pytest --cov=docprocessor --cov-report=html
# Run specific test file
pytest tests/test_processor.py -v
# Run tests matching pattern
pytest -k "test_extract" -vWe love contributions! Please see CONTRIBUTING.md for details on:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
- Issue reporting
Quick tips:
- Use the devcontainer for a consistent environment
- Write tests for new features
- Follow PEP 8 and use pre-commit hooks
- Update documentation for API changes
- Add entries to CHANGELOG.md
See CHANGELOG.md for version history and release notes.
MIT License - see LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
If you use docprocessor in your research or project, please cite:
```bibtex
@software{docprocessor2025,
  title  = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year   = {2025},
  url    = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}
```

Made with ❤️ by Knowledge Innovation Centre