DocuQuery AI is a document question-answering system that uses Retrieval-Augmented Generation (RAG) to answer natural-language questions about your documents. Built on LangChain, FAISS, and Hugging Face models, it offers a streamlined Streamlit interface for extracting and querying information from multiple document formats and web sources.
- Multiple Format Support: Process documents in various formats:
  - PDF files
  - Microsoft Word documents (DOCX)
  - Plain text files (TXT)
  - Web URLs
- Smart Text Extraction: Automatic text extraction and content parsing for every supported format
- Chunk-based Processing: Intelligent document chunking for optimal processing
- Vector-based Search: FAISS-powered vector store for efficient similarity search
- Advanced Embeddings: Utilizes HuggingFace's sentence transformers for text embeddings
- Intelligent QA: Leverages Meta's Llama model for generating accurate responses
- Interactive Web Interface: Built with Streamlit for a seamless user experience
- Real-time Processing: Dynamic document processing and question answering
- Session Management: Maintain context across multiple queries
- Error Handling: Robust error handling with informative feedback
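The features above describe a standard RAG retrieval flow: extract text, chunk it, embed the chunks, and index them in FAISS for similarity search. The snippet below is a minimal sketch of how that flow could be wired with LangChain; the model name, chunk sizes, and import paths are illustrative assumptions (they vary by LangChain version) and are not taken from the project's code.

```python
# Minimal RAG retrieval sketch (illustrative only; not the project's actual code).
# Assumes langchain + langchain-community, faiss-cpu, and sentence-transformers are installed.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

raw_text = "...extracted document text..."  # output of the text-extraction step

# 1. Split the document into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(raw_text)

# 2. Embed each chunk with a sentence-transformers model (example model name).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 3. Index the chunks in FAISS and retrieve the most similar ones for a question.
vector_store = FAISS.from_texts(chunks, embeddings)
relevant_chunks = vector_store.similarity_search("What is this document about?", k=3)
for doc in relevant_chunks:
    print(doc.page_content)
```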
- Python 3.8 or higher
- Git
- Hugging Face API key
- Clone the repository:
git clone https://github.com/your-username/docuquery-ai.git
cd docuquery-ai
- Create and activate a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create `secret_api_keys.py` in the project root:
huggingface_api_key = "your-api-key-here"
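How this key is consumed depends on `app.py`; a common pattern (shown below as an assumption, not taken from the project's code) is to import it and expose it as the Hugging Face Hub token before any models are loaded:

```python
# Hypothetical usage of the key; app.py may wire this up differently.
import os
from secret_api_keys import huggingface_api_key

os.environ["HUGGINGFACEHUB_API_TOKEN"] = huggingface_api_key
```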
docuquery-ai/
├── app.py # Main application code
├── requirements.txt
├── README.md
└── secret_api_keys.py # API keys (not tracked in git)
- Start the application:
streamlit run app.py
- Select input type:
  - Upload a document (PDF/DOCX/TXT)
  - Paste a URL
  - Enter text directly
- Process the input and wait for confirmation
- Ask questions about your document in natural language
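Behind these steps, the input selection can be built from standard Streamlit widgets. The sketch below shows one plausible layout; the widget labels and branching are assumptions, not the actual `app.py` code.

```python
# Illustrative Streamlit input selection (assumed structure, not the actual app.py).
import streamlit as st

input_type = st.radio("Select input type", ["Upload file", "URL", "Direct text"])

if input_type == "Upload file":
    uploaded_file = st.file_uploader("Upload a document", type=["pdf", "docx", "txt"])
elif input_type == "URL":
    url = st.text_input("Paste a URL")
else:
    raw_text = st.text_area("Enter text directly")

question = st.text_input("Ask a question about your document")
```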
Key parameters can be adjusted in `src/config.py` (a sketch of such a module follows below):
- `MAX_FILE_SIZE`: Maximum allowed file size (default: 10MB)
- `CHUNK_SIZE`: Text chunk size for processing
- `CHUNK_OVERLAP`: Overlap between consecutive chunks
- `EMBEDDING_MODEL`: HuggingFace model used for embeddings
- `LLM_MODEL`: Language model used for question answering
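For reference, such a configuration module might look like the following; apart from the 10MB file-size limit noted above, the values and model names are illustrative assumptions rather than the project's actual defaults.

```python
# Illustrative src/config.py (values other than MAX_FILE_SIZE are assumed, not actual defaults).
MAX_FILE_SIZE = 10 * 1024 * 1024  # maximum allowed upload size: 10MB
CHUNK_SIZE = 1000                 # characters per text chunk (assumed)
CHUNK_OVERLAP = 100               # overlap between consecutive chunks (assumed)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # embedding model (assumed)
LLM_MODEL = "meta-llama/Llama-2-7b-chat-hf"                 # QA language model (assumed)
```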
- `streamlit`: Web interface
- `langchain`: Document processing and QA chains
- `faiss-cpu`: Vector similarity search
- `PyPDF2`: PDF processing
- `python-docx`: DOCX processing
- `huggingface-hub`: AI model access
- `sentence-transformers`: Text embeddings
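To show how these libraries fit together on the answering side, here is a hedged sketch of a retrieval QA chain; the repo ID, import paths, and parameters are assumptions that depend on the installed LangChain version, not the project's actual wiring.

```python
# Illustrative QA chain (assumed APIs and model; not the project's actual code).
# Requires HUGGINGFACEHUB_API_TOKEN to be set (see the API key step above).
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub
from langchain_community.vectorstores import FAISS

# Build a tiny FAISS index so the example is self-contained.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(["DocuQuery AI indexes document chunks for retrieval."], embeddings)

# Wrap a hosted Llama model (example repo ID) in a retrieval QA chain.
llm = HuggingFaceHub(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    model_kwargs={"temperature": 0.1, "max_new_tokens": 512},
)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())

result = qa_chain.invoke({"query": "What does DocuQuery AI index?"})
print(result["result"])
```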
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- HuggingFace for providing the model infrastructure
- Streamlit for the web framework
- FAISS for vector similarity search
- LangChain for document processing capabilities
For questions and support, please open an issue in the GitHub repository.