DocLing - Enhanced Document Q&A Application

DocLing is a document Q&A application that answers questions about uploaded PDF, DOCX, and image files. It combines document processing, vector database storage, table extraction, mathematical formula rendering, and OCR into a Retrieval Augmented Generation (RAG) system.

Key Features

Document Processing: Upload and process PDF, DOCX, and image files
Table Extraction: Extract tables from PDF documents with accuracy scores
Table Visualization: Interactive visualization of extracted tables with Plotly
Mathematical Formula Support: Parse and render LaTeX-style mathematical formulas
OCR Integration: Extract text from images using Tesseract OCR
Multilingual Support: Interface available in multiple languages
Vector Search: Find information based on meaning using semantic search
Internet Search Integration: Enhance answers with web information (optional)

Installation

For detailed installation instructions, see INSTALL.md.

Quick setup:

# Make the install script executable
chmod +x install.sh

# Run the install script
./install.sh

Checking Dependencies

If you're having issues with missing dependencies, you can use the dependency checker:

# Install colorama for better formatting (optional)
pip install colorama

# Run the dependency checker
python check_dependencies.py

Or use the quick installer script to automatically install missing dependencies:

python quick_install.py

1. Clone and Set Up

# Clone the repository
git clone https://github.com/yourusername/docling.git
cd docling

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python packages
pip install -r requirements.txt

2. System Dependencies

For full functionality, install the following system dependencies:

Table Extraction (Camelot):

macOS: brew install ghostscript
Ubuntu/Debian: sudo apt-get install ghostscript python3-tk
Windows: Download the installer from the Ghostscript website

OCR Support (Pytesseract):

macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: Download the installer from the UB-Mannheim/tesseract repository on GitHub

3. Environment Setup

Create a .env file in the project root with your API keys:

OPENAI_API_KEY=your_openai_api_key
TAVILY_API_KEY=your_tavily_api_key  # Optional, for web search

Security Note: Keep your .env file private and add it to .gitignore.
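
Before starting the app, it can help to confirm that the .env file contains the keys listed above. The following is a minimal sketch using only the standard library. The function names parse_env and missing_keys are hypothetical and are not part of this repository, and the parsing rules (inline comments introduced by " #", optional quotes) are assumptions about how the file is written.

```python
from pathlib import Path

# Keys taken from the example .env above; only the OpenAI key is treated as required.
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "  # Optional, for web search".
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(values)
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Run it from the project root. It reports which required keys are missing and does not print any key values.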

Usage

Start the application:
```
streamlit run chat.py
```
Upload Documents: Use the sidebar to upload PDF, DOCX, or image files
Ask Questions: Type your questions in the chat interface
View Tables: Tables extracted from documents are displayed interactively
Test Formula Rendering: Use the formula input field to test mathematical formula display
Configure Settings: Adjust search settings and interface language in the sidebar

Features in Detail

Table Extraction and Visualization

The application extracts tables from PDF documents using Camelot, providing:

Table structure preservation
Accuracy scores for extraction quality
Interactive table visualization with Plotly
CSV export option for extracted tables
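
The extraction steps listed above can be sketched with Camelot's public API. This is an illustration, not the code in extraction.py: the helper names (extract_tables, keep_accurate, export_csv) and the 80.0 threshold are assumptions. Camelot reports accuracy as a percentage in each table's parsing_report.

```python
def extract_tables(pdf_path, pages="1"):
    """Return (DataFrame, accuracy) pairs for the tables Camelot finds."""
    import camelot  # imported lazily; requires the camelot-py package

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [(table.df, table.parsing_report["accuracy"]) for table in tables]


def keep_accurate(results, min_accuracy=80.0):
    """Keep only the tables whose accuracy score meets the threshold."""
    return [(df, acc) for df, acc in results if acc >= min_accuracy]


def export_csv(results, prefix="table"):
    """Write each DataFrame to <prefix>_<n>.csv and return the file paths."""
    paths = []
    for index, (df, _accuracy) in enumerate(results, start=1):
        path = f"{prefix}_{index}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths
```

A typical call chain would be export_csv(keep_accurate(extract_tables("report.pdf", pages="all"))), where report.pdf is a placeholder file name.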

Mathematical Formula Support

Mathematical formulas are rendered using MathJax:

Support for LaTeX syntax with $$ delimiters
Formula parsing and rendering in chat responses
Formula testing input field for experimentation
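
To show what parsing $$-delimited formulas can look like, here is a small standard-library sketch that splits a chat response into plain-text and math segments. The function name split_formulas is hypothetical, and the regular expression is one possible approach rather than the parser this application uses.

```python
import re

# Matches $$ ... $$, including formulas that span several lines.
FORMULA_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def split_formulas(text):
    """Split text into ("text", ...) and ("math", ...) segments, in order."""
    segments = []
    pos = 0
    for match in FORMULA_RE.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("math", match.group(1).strip()))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

In a Streamlit app, one option is to pass each "math" segment to st.latex and each "text" segment to st.markdown.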

OCR for Images

Image text extraction using Tesseract OCR:

Convert image text to searchable content
Support for various image formats
Integration with the document database
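
The OCR step can be sketched with pytesseract and Pillow. The helper names ocr_image and clean_ocr_text are hypothetical, and the whitespace cleanup is an assumed preprocessing step before indexing, not something this README specifies.

```python
def ocr_image(image_path, lang="eng"):
    """Return the raw text Tesseract recognizes in an image file."""
    # Lazy imports: these need the pytesseract and Pillow packages,
    # plus the Tesseract binary installed on the system.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_ocr_text(raw):
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

For example, clean_ocr_text(ocr_image("scan.png")) would return normalized text ready to be chunked and stored, where scan.png is a placeholder file name.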

Troubleshooting

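Missing Dependencies: If you encounter module import errors, run python check_dependencies.py or reinstall the Python packages with pip install -r requirements.txt. For Camelot or Tesseract errors, check the System Dependencies section.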
OCR Issues: Ensure Tesseract is properly installed for image processing
Table Extraction Problems: Verify Ghostscript installation for Camelot functionality
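
A quick way to check the last two items is to look for the Tesseract and Ghostscript executables on your PATH. The sketch below uses only the standard library. The executable names are assumptions (Ghostscript is commonly gs on macOS and Linux, and gswin64c or gswin32c on Windows), and find_tools is a hypothetical helper, not part of check_dependencies.py.

```python
import shutil

# Candidate executable names per tool; adjust if your install uses others.
TOOLS = {
    "Tesseract": ("tesseract",),
    "Ghostscript": ("gs", "gswin64c", "gswin32c"),
}


def find_tools(tools=TOOLS):
    """Map each tool name to the first executable found on PATH, or None."""
    found = {}
    for name, candidates in tools.items():
        found[name] = next(
            (path for path in map(shutil.which, candidates) if path), None
        )
    return found


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, reinstall it using the commands in the System Dependencies section and open a new terminal so the updated PATH takes effect.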

License

This project is licensed under the MIT License.

Acknowledgments

OpenAI for the GPT API
Streamlit for the web framework
Camelot for table extraction capabilities
Tesseract for OCR functionality
All the open-source libraries that make this project possible
