A production-ready Retrieval-Augmented Generation system that processes and queries multiple data formats including images, text documents, and PDFs with mixed content.
- Multi-format document processing (TXT, PDF, PNG, JPG, JPEG, DOCX, XLSX)
- Dense vector search using OpenAI embeddings
- Sparse retrieval using BM25
- Hybrid search combining dense and sparse methods
- Reranking for improved results
- OCR for image and PDF text extraction
- Intelligent text chunking
- Metadata management
- Caching for performance
- LLM traceability
- Input/output guardrails
- Batch document upload
- Async processing
- Query expansion
- Cross-modal retrieval
- RESTful API with FastAPI
- Interactive web interface
- Comprehensive logging
- Unit tests
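
The feature list names hybrid search but does not say how the dense and sparse rankings are merged. One common option is reciprocal rank fusion (RRF). The sketch below is illustrative only: the function name, the document IDs, and the constant `k=60` are assumptions, not this project's actual implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) and a sparse (BM25) retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF needs only rank positions, so it avoids having to normalize cosine similarities against BM25 scores. A reranker, if enabled, would typically run on the fused list afterwards.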

Prerequisites:
- Python 3.11.9
- Tesseract OCR
- OpenAI API key
- Windows 11 / Linux / macOS
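
Before installing, it can help to confirm these prerequisites from Python. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical, it treats any Python 3.11.x as acceptable (an assumption), and it reads `OPENAI_API_KEY` and `TESSERACT_CMD` from the process environment, so values that exist only in a `.env` file will not be seen unless they have been exported.

```python
import os
import shutil
import sys


def check_prerequisites(env=None):
    """Return a dict of prerequisite name -> bool (hypothetical helper)."""
    env = os.environ if env is None else env
    # Prefer an explicit TESSERACT_CMD path; otherwise search PATH for the binary.
    tesseract_cmd = env.get("TESSERACT_CMD") or "tesseract"
    return {
        "python_3_11": sys.version_info[:2] == (3, 11),
        "tesseract_found": shutil.which(tesseract_cmd) is not None,
        "openai_key_set": bool(env.get("OPENAI_API_KEY")),
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```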
git clone https://github.com/stonedseeker/Drac.git
cd Drac
conda create -n Drac python=3.11.9 -y
conda activate Drac
pip install -r requirements.txt
Windows:
Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
Default path: C:\Program Files\Tesseract-OCR\tesseract.exe
Linux:
sudo apt-get install tesseract-ocr
macOS:
brew install tesseract
Edit .env:
OPENAI_API_KEY=your_openai_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
cd Drac
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
Server will be available at: http://localhost:8000
Interactive API docs: http://localhost:8000/docs
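
When the server is started from a script or CI job, it is useful to wait until it answers before sending requests. The sketch below polls the `/health` endpoint using only the standard library. It assumes that any HTTP 200 response means the server is ready, since the response body format is not documented here. The names `wait_for_health` and `_http_ok` and the timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request


def _http_ok(url):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_health(url="http://localhost:8000/health",
                    timeout_s=30.0, interval_s=1.0, probe=_http_ok):
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical use is `if not wait_for_health(): raise SystemExit("server did not start")`. The `probe` parameter exists so the polling loop can be tested without a running server.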
curl -X POST "http://localhost:8000/api/upload" \
-H "Content-Type: multipart/form-data" \
-F "file=@document.pdf"
curl -X POST "http://localhost:8000/api/upload/batch" \
-F "files=@doc1.pdf" \
-F "files=@doc2.txt" \
-F "files=@image.png"
curl -X POST "http://localhost:8000/api/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"top_k": 10,
"enable_reranking": true
}'
curl http://localhost:8000/health
import requests
response = requests.post('http://localhost:8000/api/query', json={
'query': 'Find documents about sales data',
'top_k': 5,
'enable_reranking': True,
'file_types': ['pdf', 'xlsx']
})
results = response.json()
Run all tests:
pytest tests/ -v
Run specific test file:
pytest tests/test_ingestion.py -v
Run with coverage:
pytest tests/ --cov=app --cov-report=html

Planned features:
- Support for audio/video files
- Multi-language support
- Document summarization
- Conversation memory
- Advanced analytics
- User authentication
- Cloud deployment
- GPU acceleration

Contributing:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request