
DocProcessor


A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.


Features

  • Multi-format Support: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
  • Intelligent OCR: Layout-aware PDF text extraction, with OCR fallback for scanned pages and images
  • Semantic Chunking: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
  • LLM Summarization: Generate concise document summaries (with fallback)
  • Meilisearch Integration: Built-in support for indexing to Meilisearch
  • Flexible API: Use components individually or as a unified pipeline

Installation

From PyPI (Coming Soon)

pip install docprocessor

From GitHub

pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git

For Development

git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"

System Dependencies

For OCR functionality, install system packages:

Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler
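
To confirm both tools are on your PATH after installing:

tesseract --version
pdftoppm -v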

Quick Start

Basic Usage

from docprocessor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False  # Requires LLM client
)

print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")

With LLM Summarization

from docprocessor import DocumentProcessor

# Your LLM client: any object exposing complete_chat(messages, temperature)
# and returning a dict with a "content" key
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500
)

result = processor.process(
    file_path="document.pdf",
    summarize=True
)

print(f"Summary: {result.summary}")

With Meilisearch Indexing

from docprocessor import DocumentProcessor, MeiliSearchIndexer

# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)

# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_"  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks"
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10
)
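
The shape of the search response is not documented here; a minimal sketch for printing hits, assuming search() returns the raw Meilisearch response dict with matches under "hits":

for hit in results.get("hits", []):
    # "chunk_text" mirrors the indexed field name (an assumption)
    print(hit["chunk_text"][:80])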

Advanced Usage

Custom Chunking Parameters

processor = DocumentProcessor(
    chunk_size=1024,      # Larger chunks
    chunk_overlap=100,    # More overlap
    min_chunk_size=200    # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt"
)

Extract Text Only

processor = DocumentProcessor()

extraction = processor.extract_text("document.pdf")

print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")

Multi-Environment Indexing

# Index to multiple environments
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_"
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_"
    }
}

for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"]
    )

    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")

API Reference

DocumentProcessor

Main class for document processing.

Parameters:

  • ocr_enabled (bool): Enable OCR for PDFs/images. Default: True
  • chunk_size (int): Target chunk size in tokens. Default: 512
  • chunk_overlap (int): Overlap between chunks. Default: 50
  • min_chunk_size (int): Minimum chunk size. Default: 100
  • summary_target_words (int): Target summary length. Default: 500
  • llm_client (Optional[Any]): LLM client for summarization
  • llm_temperature (float): LLM temperature. Default: 0.3

Methods:

  • process(): Full pipeline (extract, chunk, summarize)
  • extract_text(): Extract text from document
  • chunk_text(): Chunk text into segments
  • summarize_text(): Generate summary
  • chunks_to_search_documents(): Convert chunks for indexing
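
The full pipeline can also be run step by step. A minimal sketch, assuming the result shapes shown in Quick Start; summarize_text's exact signature is an assumption, so check the class docstring:

# llm_client as defined in the summarization example above
processor = DocumentProcessor(llm_client=llm_client)

# Extract -- returns a dict with "text", "page_count", "metadata"
extraction = processor.extract_text("document.pdf")

# Chunk the extracted text
chunks = processor.chunk_text(extraction["text"], filename="document.pdf")

# Summarize (assumed signature: plain text in, summary string out)
summary = processor.summarize_text(extraction["text"])

# Convert chunks for indexing
search_docs = processor.chunks_to_search_documents(chunks)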

MeiliSearchIndexer

Interface for Meilisearch operations.

Parameters:

  • url (str): Meilisearch server URL
  • api_key (str): Meilisearch API key
  • index_prefix (Optional[str]): Prefix for index names

Methods:

  • index_chunks(): Index multiple documents
  • index_document(): Index single document
  • search(): Search an index
  • delete_document(): Delete by ID
  • delete_documents_by_filter(): Delete by filter
  • create_index(): Create new index
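
The index-management and deletion methods are sketched below with assumed signatures (index name plus an ID or filter string); these are illustrative only, so check the class docstrings for exact parameters:

indexer = MeiliSearchIndexer(url="http://localhost:7700", api_key="your_master_key")

# Assumed signatures, for illustration only
indexer.create_index("document_chunks")                      # create a new index
indexer.delete_document("document_chunks", "chunk_id_123")   # delete one document by ID
indexer.delete_documents_by_filter(
    "document_chunks",
    'file_id = "abc123"',  # standard Meilisearch filter syntax
)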

DocumentChunk

Data class representing a text chunk.

Attributes:

  • chunk_id (str): Unique identifier
  • file_id (str): Source file identifier
  • output_id (str): Output identifier
  • project_id (int): Project identifier
  • filename (str): Source filename
  • chunk_number (int): Chunk sequence number
  • total_chunks (int): Total chunks in document
  • chunk_text (str): The chunk text content
  • token_count (int): Number of tokens
  • pages (List[int]): Page numbers (for PDFs)
  • metadata (Dict): Additional metadata
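
A short sketch reading these attributes from a processed document, using only the names listed above:

from docprocessor import DocumentProcessor

result = DocumentProcessor().process("document.pdf", chunk=True)

for chunk in result.chunks:
    # e.g. "3/12 (487 tokens, pages [4, 5]): Lorem ipsum..."
    print(f"{chunk.chunk_number}/{chunk.total_chunks} "
          f"({chunk.token_count} tokens, pages {chunk.pages}): "
          f"{chunk.chunk_text[:60]}...")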

Architecture

DocProcessor consists of several independent components:

  1. ContentExtractor: Extracts text from various file formats
  2. DocumentChunker: Splits text into semantic segments
  3. DocumentSummarizer: Generates LLM-based summaries
  4. MeiliSearchIndexer: Indexes documents to Meilisearch

Each component can be used independently or through the unified DocumentProcessor API.

Requirements

Python: 3.10+ (tested on 3.10, 3.11, 3.12)

Core Dependencies:

  • pdfminer.six - PDF text extraction
  • pdf2image - PDF to image conversion
  • pytesseract - OCR engine
  • scikit-image - Image preprocessing
  • Pillow - Image handling
  • python-docx - DOCX extraction
  • python-pptx - PPTX extraction
  • langchain-text-splitters - Semantic chunking
  • tiktoken - Token counting

Optional:

  • meilisearch - Search engine integration

Examples

See the examples/ directory for more usage examples:

  • basic_usage.py - Simple document processing
  • multi_environment.py - Indexing to multiple environments
  • custom_chunking.py - Advanced chunking options

Development

Using GitHub Codespaces (Recommended)

The easiest way to start developing:

  1. Click the Code button on GitHub
  2. Select Codespaces → Create codespace on main
  3. Wait for the environment to build (includes all dependencies)
  4. Start coding!

The devcontainer automatically installs:

  • Python 3.11
  • All system dependencies (Tesseract, Poppler)
  • Python dependencies in editable mode
  • Pre-commit hooks
  • VS Code extensions (Black, isort, flake8, etc.)

Local Development Setup

# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor

# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

# Install Python dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run tests with coverage
pytest --cov=docprocessor

Code Quality

We use automated tools to maintain code quality:

# Format code
black docprocessor tests

# Sort imports
isort docprocessor tests

# Lint
flake8 docprocessor tests

# Type check
mypy docprocessor

# Or run all checks with pre-commit
pre-commit run --all-files

Running Tests

# Run all tests
pytest

# With coverage report
pytest --cov=docprocessor --cov-report=html

# Run specific test file
pytest tests/test_processor.py -v

# Run tests matching pattern
pytest -k "test_extract" -v

Contributing

We love contributions! Please see CONTRIBUTING.md for details on:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process
  • Issue reporting

Quick tips:

  • Use the devcontainer for consistent environment
  • Write tests for new features
  • Follow PEP 8 and use pre-commit hooks
  • Update documentation for API changes
  • Add entries to CHANGELOG.md

Changelog

See CHANGELOG.md for version history and release notes.

License

MIT License - see LICENSE file for details.

Citation

If you use docprocessor in your research or project, please cite:

@software{docprocessor2025,
  title = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year = {2025},
  url = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}

Made with ❤️ by Knowledge Innovation Centre
