DRAC: Dynamic Retrieval Across Content

A production-ready Retrieval-Augmented Generation system that processes and queries multiple data formats including images, text documents, and PDFs with mixed content.

Core Capabilities

  • Multi-format document processing (TXT, PDF, PNG, JPG, JPEG, DOCX, XLSX)
  • Dense vector search using OpenAI embeddings
  • Sparse retrieval using BM25
  • Hybrid search combining dense and sparse methods
  • Reranking for improved results
  • OCR for image and PDF text extraction
  • Intelligent text chunking
  • Metadata management
  • Caching for performance
  • LLM traceability
  • Input/output guardrails
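
The list above does not say how the dense and BM25 result lists are merged. One widely used option is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever. The sketch below illustrates that technique and is not DRAC's actual implementation; the function name, the document IDs, and the constant k=60 are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a DRAC setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents found by both retrievers (doc_a and doc_b here) rise above documents found by only one, which is the usual motivation for hybrid search.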

Features

  • Batch document upload
  • Async processing
  • Query expansion
  • Cross-modal retrieval
  • RESTful API with FastAPI
  • Interactive web interface
  • Comprehensive logging
  • Unit tests
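
Batch upload combined with async processing usually means handling several documents concurrently while capping how many run at once. A minimal sketch of that pattern with asyncio follows. process_document is a stand-in for the real ingestion work, and max_concurrent is a hypothetical setting; neither is taken from the DRAC codebase.

```python
import asyncio


async def process_document(name):
    """Stand-in for real ingestion work (OCR, chunking, embedding)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return f"{name}: processed"


async def process_batch(names, max_concurrent=4):
    """Process documents concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(name):
        async with semaphore:
            return await process_document(name)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(n) for n in names))


results = asyncio.run(process_batch(["doc1.pdf", "doc2.txt", "image.png"]))
print(results)
```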

Prerequisites

  • Python 3.11.9
  • Tesseract OCR
  • OpenAI API key
  • Windows 11 / Linux / macOS

Installation

1. Clone Repository

git clone https://github.com/stonedseeker/Drac.git
cd Drac

2. Create Conda Environment

conda create -n Drac python=3.11.9 -y
conda activate Drac

3. Install Dependencies

pip install -r requirements.txt

4. Install Tesseract OCR

Windows:

Download and install from https://github.com/UB-Mannheim/tesseract/wiki

Default path: C:\Program Files\Tesseract-OCR\tesseract.exe

Linux:

sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

5. Configure Environment

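Edit .env and set both values. The TESSERACT_CMD shown is the default Windows path; on Linux or macOS, use the path to the tesseract binary on your system instead: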

OPENAI_API_KEY=your_openai_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
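
How the application reads this file is not documented here (a library such as python-dotenv is a common choice, but that is an assumption). The sketch below parses the same KEY=VALUE format with the standard library only. It mainly shows that the Windows path, with its spaces and backslashes, survives unquoted. parse_env is a hypothetical helper, not part of DRAC.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Raw string so the backslashes in the Windows path are kept literally.
sample = r"""
OPENAI_API_KEY=your_openai_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
"""
config = parse_env(sample)
print(config["TESSERACT_CMD"])
```

If the project uses pytesseract for OCR, this value would typically be assigned to pytesseract.pytesseract.tesseract_cmd; check the source to confirm.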

Usage

Start the API Server

cd Drac
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The server listens on all network interfaces (--host 0.0.0.0) and will be available locally at: http://localhost:8000

API Documentation

Interactive API docs: http://localhost:8000/docs

API Endpoints

Upload Document

# -F already sets the multipart/form-data Content-Type (with its boundary), so no explicit header is needed
curl -X POST "http://localhost:8000/api/upload" \
  -F "file=@document.pdf"

Batch Upload

curl -X POST "http://localhost:8000/api/upload/batch" \
  -F "files=@doc1.pdf" \
  -F "files=@doc2.txt" \
  -F "files=@image.png"
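
The batch request above can also be sent from Python with the requests library. This is a minimal sketch, assuming the endpoint accepts the repeated files field exactly as in the curl command and returns JSON. build_batch_fields and upload_batch are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import mimetypes
from pathlib import Path


def build_batch_fields(paths):
    """Build the repeated "files" form fields that the curl example sends."""
    fields = []
    for path in map(Path, paths):
        # Fall back to a generic type when the extension is unknown.
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        fields.append(("files", (path.name, path.read_bytes(), content_type)))
    return fields


def upload_batch(paths, base_url="http://localhost:8000"):
    """POST the files to /api/upload/batch and return the decoded JSON body."""
    import requests  # third-party; imported lazily so build_batch_fields has no dependency

    response = requests.post(
        f"{base_url}/api/upload/batch",
        files=build_batch_fields(paths),
        timeout=120,
    )
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return response.json()
```

With the server running, upload_batch(["doc1.pdf", "doc2.txt", "image.png"]) mirrors the curl command.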

Query Documents

curl -X POST "http://localhost:8000/api/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "top_k": 10,
    "enable_reranking": true
  }'

Health Check

curl http://localhost:8000/health
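
In scripts and CI jobs it can help to wait for the health endpoint before sending uploads or queries. The sketch below polls it using only the standard library. It assumes a healthy server answers with HTTP 200, which the README does not state. The function names, attempt count, and delay are all illustrative.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe=is_healthy, attempts=10, delay=1.0):
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The probe is passed in as a parameter so the retry logic can be exercised without a running server.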

Sample Queries

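import requests

# file_types is taken here to filter results by source format.
response = requests.post(
    'http://localhost:8000/api/query',
    json={
        'query': 'Find documents about sales data',
        'top_k': 5,
        'enable_reranking': True,
        'file_types': ['pdf', 'xlsx'],
    },
    timeout=30,  # avoid hanging indefinitely if the server is unreachable
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

results = response.json()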

Testing

Run all tests:

pytest tests/ -v

Run specific test file:

pytest tests/test_ingestion.py -v

Run with coverage (requires the pytest-cov plugin):

pytest tests/ --cov=app --cov-report=html

Future Enhancements

  • Support for audio/video files
  • Multi-language support
  • Document summarization
  • Conversation memory
  • Advanced analytics
  • User authentication
  • Cloud deployment
  • GPU acceleration

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request
