This project builds a retrieval-augmented generation (RAG) pipeline over SEC 10-K/10-Q filings, enabling a local language model to answer questions like:
"What was the company’s revenue in 2019?"
"Summarize the management discussion section from the latest 10-K."
It uses chunked embedding search with FAISS and a local LLM (e.g., TinyLlama) to generate grounded answers.
| Step | Script | Description |
|---|---|---|
| 1️⃣ | `1_sec_json_data_retriever.py` | Download SEC `companyfacts.zip` and `submissions.zip` metadata |
| 2️⃣ | `2_sec_text_data_retriever.py` | Use metadata to fetch full 10-K / 10-Q HTML files |
| 3️⃣ | `3_sec_text_data_processor.py` | Extract, clean, and chunk filings with metadata |
| 4️⃣ | `4_sec_text_embedding_generator.py` | Generate SBERT embeddings for each chunk |
| 5️⃣ | `5_embedding_indexer.py` | Build FAISS index and save metadata mapping |
| 6️⃣ | `6_sec_local_llm_chat.py` | Query using local LLM + retrieved context |
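The chunking in step 3 can be sketched roughly as follows — a minimal sketch assuming a simple overlapping word-window strategy; the actual script's chunk size, overlap, and metadata fields may differ:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word windows.

    Hypothetical parameters: the real script's chunk size,
    overlap, and per-chunk metadata may differ.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "start_word": start,
        })
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks

# Example: a 500-word document yields 3 overlapping chunks
doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```

Overlap between neighboring chunks helps ensure a sentence that straddles a boundary is still fully contained in at least one chunk.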
```bash
pip install -r requirements.txt
```

You’ll need:

- `transformers`
- `sentence-transformers`
- `faiss-cpu`
- `tqdm`
- `beautifulsoup4`
- `huggingface-hub`
Before running script 1_sec_json_data_retriever.py, open the file and update the User-Agent header with your own email address.
This is required by the SEC to access public data.
```python
headers = {
    # TODO: Update the User-Agent with your email!
    'User-Agent': '[email protected]'
}
```

📌 If this is not set, the SEC server will reject your request with a 403 or 400 error.
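As a minimal sketch of a compliant request using only the standard library — the URL and function name here are illustrative, not the script's actual code:

```python
import urllib.request

# The SEC requires a descriptive User-Agent identifying you (name + email).
HEADERS = {
    "User-Agent": "Your Name <your.email@example.com>",  # replace with your own
}

def fetch(url: str) -> bytes:
    """Fetch a URL with the SEC-required User-Agent header set."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Illustrative endpoint only; the real script defines its own URLs.
    data = fetch("https://www.sec.gov/Archives/edgar/daily-index/xbrl/companyfacts.zip")
```

Requests without a meaningful `User-Agent` are rejected, which is why the script asks you to edit the header before running.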
```bash
# Download JSON bulk metadata
python 1_sec_json_data_retriever.py

# Fetch actual HTML filings
python 2_sec_text_data_retriever.py

# Clean and chunk documents
python 3_sec_text_data_processor.py

# Embed each chunk
python 4_sec_text_embedding_generator.py

# Index embeddings with FAISS
python 5_embedding_indexer.py
```

```bash
python 6_sec_local_llm_chat.py "What is the revenue in 2019?"
```

The script will:
- Embed the query
- Retrieve relevant chunks from FAISS
- Send the context to a local model (TinyLlama by default)
- Print the generated answer
💡 You can switch to GPT-4 or Claude with a few lines of modification if needed.
After you run:

```bash
python 6_sec_local_llm_chat.py "What is the revenue in 2019?"
```

you should see output like this:
This shows the retrieved SEC filing chunks and the final answer generated by the local LLM.
- Query → SBERT embedding
- FAISS retrieves top relevant chunks
- Prompt is constructed with context
- LLM generates an answer based on the retrieved evidence
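The retrieval steps above can be illustrated without FAISS using plain NumPy: on normalized vectors, FAISS's inner-product search is equivalent to the cosine ranking below. The random vectors are stand-ins for SBERT embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SBERT chunk embeddings (FAISS would index these).
chunk_texts = [
    "Revenue in 2019 was $10M.",
    "Risk factors include market volatility.",
    "MD&A: management discussion and analysis.",
]
chunk_vecs = rng.normal(size=(3, 384)).astype("float32")
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

# 1. Embed the query (here: a slightly perturbed copy of chunk 0).
query_vec = chunk_vecs[0] + 0.05 * rng.normal(size=384)
query_vec /= np.linalg.norm(query_vec)

# 2. Retrieve top-k chunks by inner product (== cosine, since normalized).
scores = chunk_vecs @ query_vec
top_k = np.argsort(scores)[::-1][:2]

# 3. Construct the prompt with the retrieved context.
context = "\n".join(chunk_texts[i] for i in top_k)
prompt = (
    f"Answer using only this context:\n{context}\n\n"
    "Question: What is the revenue in 2019?"
)
print(top_k[0])  # 0 — the chunk the query was derived from ranks first
```

The prompt is then passed to the local LLM, which is instructed to answer only from the retrieved evidence.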
| Folder | Description |
|---|---|
| `data/xbrl_data/` | JSON company facts from SEC bulk download |
| `data/submissions_data/` | SEC submission history for each CIK |
| `data/filings_html/` | Downloaded raw 10-K/10-Q HTML files |
| `data/cleaned_filings/` | Chunked JSONs with metadata (year, form, section) |
| `data/embeddings_chunked/` | Chunk-level SBERT embeddings |
| `data/faiss_index_chunked/` | FAISS index and lookup metadata |
The default model is:

`TinyLlama/TinyLlama-1.1B-Chat-v1.0`
If you want to use larger or gated models:

- Request access on Hugging Face
- Run `huggingface-cli login`
- Replace the model ID in `6_sec_local_llm_chat.py`
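Swapping models is typically a one-line change — the variable name below is hypothetical, so check the actual assignment in `6_sec_local_llm_chat.py`:

```python
# Hypothetical variable name; see the actual assignment in 6_sec_local_llm_chat.py.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# After `huggingface-cli login`, a gated model could be used instead, e.g.:
# MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
```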
- OpenAI/Claude fallback option
- Streamlit or Gradio chat interface
- Incremental updates from EDGAR
- Financial summarization or comparison engine
- CSV or PDF export of Q&A answers
Created by Lawrence Shieh 📧 [email protected]
PRs and suggestions welcome!
