EquiChat is a sophisticated financial document analysis system that extracts comprehensive metrics from equity research reports and financial PDFs using OpenAI's advanced language models. Built for scalability and accuracy, it provides both programmatic APIs and an interactive chat interface for querying financial data.
- Comprehensive Data Extraction: Extracts ALL financial metrics from PDFs using GPT-5/GPT-4o
- Multi-Modal Analysis: Processes both tables and text with intelligent parsing
- Natural Language Queries: Chat interface for asking questions like "What is Hindalco's revenue in FY24?"
- Structured Extraction: Converts unstructured PDFs into a queryable facts database
- Batch Processing: Parallel PDF ingestion with smart deduplication
- REST API: FastAPI-based web service for integration
- Storage: DuckDB for fast analytical queries
- Vector Search: Semantic search across document content
EquiChat consists of two main components:
1. Ingestion Pipeline: processes PDFs and stores data in dual formats for different query types.
```
┌──────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│    PDF Files     │────▶│   OpenAI GPT-5   │────▶│  Structured Facts   │
│ (Equity Reports) │     │    Extraction    │     │   (DuckDB Tables)   │
└──────────────────┘     └──────────────────┘     └─────────────────────┘
         │                                                   │
         │                                                   ▼
         │                                        ┌─────────────────────┐
         └───────────────────────────────────────▶│  Vector Embeddings  │
                                                  │    (FAISS Index)    │
                                                  └─────────────────────┘
```
2. Query Pipeline: routes queries to the appropriate data source based on query type.
```
┌──────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│     Chat UI      │────▶│   Smart Router   │────▶│  Structured Facts   │
│ Natural Language │     │   Query Engine   │     │     (SQL Query)     │
└──────────────────┘     └──────────────────┘     └─────────────────────┘
                                  │
                                  ▼
                       ┌─────────────────────┐
                       │    Vector Search    │
                       │ (Semantic Retrieval)│
                       └─────────────────────┘
```
- PDF Processing (`src/equichat/ingest.py`)
  - OpenAI GPT-5 extraction with structured prompts
  - Fallback to traditional table/text parsing
  - Intelligent unit normalization and confidence scoring
- Dual Storage System
  - Structured Facts (`src/equichat/store.py`): DuckDB tables for metrics, ratios, and financial data
  - Vector Index (`src/equichat/vector_store.py`): FAISS embeddings for semantic text search
- Smart Router (`src/equichat/router.py` & `src/equichat/router_llm.py`; see the toy sketch after this list)
  - Query Classification: determines whether a query needs structured data or semantic search
  - Intent Detection: metrics lookup vs. text-based questions
  - Route Selection: SQL queries vs. vector similarity search
- Query Execution
  - Structured Queries (`src/equichat/facts_query.py`): SQL generation for metrics/ratios
  - Semantic Search (`src/equichat/vector_search.py`): FAISS retrieval for contextual questions
- API Server (`scripts/api.py`)
  - FastAPI-based REST endpoints
  - File upload and processing
  - Real-time query handling via router
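For intuition, here is a toy version of the classification step. It is illustrative only: the real router lives in `src/equichat/router.py` and `src/equichat/router_llm.py`, and the keyword lists below are made up for the example.

```python
# Toy query classifier: decide whether a question should hit the
# structured facts store (SQL) or the vector index (semantic search).
SEMANTIC_STARTERS = ("why", "how", "explain", "describe")
METRIC_KEYWORDS = ("revenue", "ebitda", "roe", "market cap", "margin", "pat")

def route(query: str) -> str:
    q = query.lower().strip()
    if q.startswith(SEMANTIC_STARTERS):
        return "semantic"    # contextual question -> FAISS retrieval
    if any(keyword in q for keyword in METRIC_KEYWORDS):
        return "structured"  # metric lookup -> SQL over the facts tables
    return "semantic"        # default to retrieval when unsure

print(route("What is Hindalco's revenue in FY24?"))  # -> structured
print(route("Why did Hindalco's margins improve?"))  # -> semantic
```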
- Python 3.9+
- OpenAI API key
- (Optional) uv for fast dependency management
- Clone the repository

  ```bash
  git clone <repository-url>
  cd answer_chatbot
  ```

- Install dependencies

  ```bash
  # Using uv (recommended)
  uv pip install -r requirements.txt

  # Or using pip
  pip install -r requirements.txt
  ```

- Set environment variables

  ```bash
  export OPENAI_API_KEY="your-openai-api-key"
  export EQUICHAT_DB_PATH="./data/equichat.duckdb"
  export EQUICHAT_OPENAI_MODEL="gpt-5"  # or gpt-4o-mini for cost efficiency
  ```
Traditional document ingestion:

```bash
# Ingest documents with legacy parsing (fast, less comprehensive)
uv run python scripts/ingest_documents.py data --ignore-last-page
```

OpenAI-powered extraction (recommended):

```bash
# Single PDF with caching
python scripts/ingest_batch.py --folder ./data --limit 1

# Process multiple PDFs in parallel
python scripts/ingest_batch.py \
  --folder ./data \
  --limit 10 \
  --workers 3 \
  --cache-dir ./cache/extractions

# High-performance batch ingestion (recommended)
uv run python scripts/ingest_batch.py \
  --folder data \
  --workers 20 \
  --limit 20 \
  --force \
  --cache-dir ./my_cache
```

```bash
# Build FAISS vector index for semantic search
uv run python - <<'PY'
from equichat.vector_search import build_faiss_from_duckdb
print(build_faiss_from_duckdb(out_dir="./data/vec_index"))
PY
```
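To spot-check what a run actually ingested, you can open the DuckDB file directly. A minimal sketch, assuming the default database path and the `facts` table used in the query examples further below:

```python
import duckdb

# Open the database written by the ingestion scripts (read-only is enough here)
con = duckdb.connect("./data/equichat.duckdb", read_only=True)

# Count extracted facts and peek at a few rows
print(con.execute("SELECT count(*) FROM facts").fetchone())
print(con.execute("SELECT * FROM facts LIMIT 5").fetchall())
```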
```bash
# Start FastAPI server
python scripts/api.py

# Or with auto-reload for development
EQUICHAT_RELOAD=true python scripts/api.py
```

Interactive chat interface:
```bash
# Start chat interface
python scripts/chat.py
```

Direct LLM queries:
```bash
# Query using LLM router (recommended for testing)
uv run python scripts/query_llm.py "What are all the companies in bangalore?"
```
```bash
# Upload and process PDF
curl -X POST "http://localhost:8000/ingest" \
  -F "file=@financial_report.pdf" \
  -F "company_hint=Company Name"

# Query extracted data
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Hindalco revenue in FY24?"}'

# List all extracted facts
curl "http://localhost:8000/facts?limit=100"
```
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | required | OpenAI API key for extraction |
| `EQUICHAT_DB_PATH` | `:memory:` | DuckDB database path |
| `EQUICHAT_OPENAI_MODEL` | `gpt-5` | Model for extraction |
| `EQUICHAT_USE_OPENAI_EXTRACTION` | `true` | Enable OpenAI extraction |
| `EQUICHAT_CONFIDENCE_THRESHOLD` | `0.7` | Minimum confidence for results |
| `EQUICHAT_API_HOST` | `0.0.0.0` | API server host |
| `EQUICHAT_API_PORT` | `8000` | API server port |
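For reference, a minimal sketch of how these variables could be loaded in Python. It mirrors the table above but is not the project's actual implementation (that lives in `src/equichat/config.py`):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    openai_api_key: str
    db_path: str
    openai_model: str
    use_openai_extraction: bool
    confidence_threshold: float
    api_host: str
    api_port: int

def load_settings() -> Settings:
    return Settings(
        openai_api_key=os.environ["OPENAI_API_KEY"],  # required, fail fast if missing
        db_path=os.getenv("EQUICHAT_DB_PATH", ":memory:"),
        openai_model=os.getenv("EQUICHAT_OPENAI_MODEL", "gpt-5"),
        use_openai_extraction=os.getenv("EQUICHAT_USE_OPENAI_EXTRACTION", "true").lower() == "true",
        confidence_threshold=float(os.getenv("EQUICHAT_CONFIDENCE_THRESHOLD", "0.7")),
        api_host=os.getenv("EQUICHAT_API_HOST", "0.0.0.0"),
        api_port=int(os.getenv("EQUICHAT_API_PORT", "8000")),
    )
```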
Create a `.env` file in the project root:

```bash
OPENAI_API_KEY=your-key-here
EQUICHAT_DB_PATH=./data/equichat.duckdb
EQUICHAT_OPENAI_MODEL=gpt-5
EQUICHAT_USE_OPENAI_EXTRACTION=true
EQUICHAT_IGNORE_LAST_PAGE=true
```
```
answer_chatbot/
├── src/equichat/              # Core library
│   ├── config.py              # Configuration management
│   ├── store.py               # Database layer (DuckDB)
│   ├── ingest.py              # PDF processing & extraction
│   ├── router.py              # Query routing & NLU
│   ├── tools.py               # Query execution tools
│   ├── schemas.py             # Data models
│   ├── facts_query.py         # SQL query builders
│   ├── router_llm.py          # LLM-based routing
│   └── vector_store.py        # Vector search (optional)
├── scripts/                   # CLI tools & servers
│   ├── ingest_batch.py        # Batch PDF processing
│   ├── api.py                 # FastAPI server
│   ├── chat.py                # Interactive chat interface
│   ├── extract_with_openai.py # Direct OpenAI extraction
│   └── query_llm.py           # LLM query testing
├── data/                      # Data directory
│   ├── equichat.duckdb        # Main database
│   └── *.pdf                  # Source PDFs
├── cache/                     # Extraction cache
│   └── extractions/           # Cached OpenAI results
├── tests/                     # Test suite
└── requirements.txt           # Python dependencies
```
Modify the extraction schema in `src/equichat/ingest.py` to capture specific metrics:

```python
# Add custom metrics to the extraction prompt
CUSTOM_METRICS = {
    "working_capital": ["working capital", "net working capital"],
    "capex": ["capital expenditure", "capex", "capital investments"],
    "free_cash_flow": ["free cash flow", "fcf"],
}
```
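As an illustration of how such synonym lists are typically applied, here is a sketch that maps raw labels found in a report to canonical metric keys. The `normalize_metric` helper is hypothetical and not part of the codebase:

```python
from typing import Optional

# Synonym lists as in the customization example above (illustrative subset)
CUSTOM_METRICS = {
    "working_capital": ["working capital", "net working capital"],
    "capex": ["capital expenditure", "capex", "capital investments"],
    "free_cash_flow": ["free cash flow", "fcf"],
}

def normalize_metric(raw_label: str) -> Optional[str]:
    """Map a raw label from a PDF table or model output to a canonical metric key."""
    label = raw_label.strip().lower()
    for key, synonyms in CUSTOM_METRICS.items():
        if label == key or label in synonyms:
            return key
    return None  # unknown metric; leave for manual review

print(normalize_metric("FCF"))                  # -> free_cash_flow
print(normalize_metric("Capital Expenditure"))  # -> capex
```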
EquiChat intelligently routes queries to the appropriate data source:

Metrics, ratios, financial calculations:

- "What is Hindalco's revenue in FY24?" → `SELECT value FROM facts WHERE company='Hindalco' AND metric_key='revenue'`
- "Top 5 banks by market cap" → `SELECT company, value FROM facts WHERE industry='Banking' ORDER BY value DESC`
- "Companies with ROE > 15%" → `SELECT company FROM facts WHERE metric_key='roe' AND value > 15`
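These routes can be reproduced by hand against the DuckDB file; a minimal sketch, assuming the `facts` table and the columns used in the examples above:

```python
import duckdb

con = duckdb.connect("./data/equichat.duckdb", read_only=True)

# Parameterized version of the first example above
rows = con.execute(
    "SELECT value FROM facts WHERE company = ? AND metric_key = ?",
    ["Hindalco", "revenue"],
).fetchall()
print(rows)
```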
Explanatory, contextual, qualitative questions:

- "Why did Hindalco's margins improve?" → vector search through earnings call transcripts
- "What are the key risks for pharmaceutical companies?" → semantic retrieval from risk sections
- "Explain the company's growth strategy" → text similarity matching in management commentary
- OpenAI API Errors

  ```bash
  # Check API key
  echo $OPENAI_API_KEY

  # Test API access
  curl -H "Authorization: Bearer $OPENAI_API_KEY" \
    https://api.openai.com/v1/models
  ```

- Database Connection Issues

  ```bash
  # Check database path
  ls -la ./data/equichat.duckdb

  # Reset database
  rm ./data/equichat.duckdb
  ```

- Extraction Quality Issues
  - Try different models: `gpt-5` > `gpt-4o` > `gpt-4o-mini`
  - Increase confidence threshold in config
  - Check PDF quality (text-based vs scanned)

- Parallel Processing

  ```bash
  # Increase workers for CPU-bound tasks
  python scripts/ingest_batch.py --workers 8
  ```
- Fork the repository
- Create feature branch: `git checkout -b feature/amazing-feature`
- Add tests: `pytest tests/`
- Submit pull request
```bash
# Install development dependencies
pip install -r requirements.txt

# Run tests
pytest tests/ -v
```
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for powerful language models
- DuckDB for fast analytical database
- FastAPI for modern web framework
- Rich for beautiful CLI interfaces
- OpenAI/Claude/Gemini for patches to the code
