This project builds a retrieval-augmented generation (RAG) pipeline over SEC 10-K/10-Q filings, enabling a local language model to answer questions like:
"What was the company’s revenue in 2019?"
"Summarize the management discussion section from the latest 10-K."
It uses chunked embedding search with FAISS and a local LLM (e.g., TinyLlama) to generate grounded answers.
| Step | Script | Description |
|---|---|---|
| 1️⃣ | `1_sec_json_data_retriever.py` | Download SEC `companyfacts.zip` and `submissions.zip` metadata |
| 2️⃣ | `2_sec_text_data_retriever.py` | Use metadata to fetch full 10-K / 10-Q HTML files |
| 3️⃣ | `3_sec_text_data_processor.py` | Extract, clean, and chunk filings with metadata |
| 4️⃣ | `4_sec_text_embedding_generator.py` | Generate SBERT embeddings for each chunk |
| 5️⃣ | `5_embedding_indexer.py` | Build FAISS index and save metadata mapping |
| 6️⃣ | `6_sec_local_llm_chat.py` | Query using local LLM + retrieved context |
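The chunking in step 3 can be sketched roughly as follows — a minimal sketch assuming a simple overlapping word-window strategy; the actual script's chunk size, overlap, and metadata fields may differ:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word windows.

    Hypothetical parameters: the real script's chunk size,
    overlap, and per-chunk metadata may differ.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "start_word": start,
        })
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks

# Example: a 500-word document yields 3 overlapping chunks
doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```

Overlap between neighboring chunks helps ensure a sentence that straddles a boundary is still fully contained in at least one chunk.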
```bash
pip install -r requirements.txt
```

You’ll need:

- `transformers`
- `sentence-transformers`
- `faiss-cpu`
- `tqdm`
- `beautifulsoup4`
- `huggingface-hub`
Before running script 1_sec_json_data_retriever.py, open the file and update the User-Agent header with your own email address.
This is required by the SEC to access public data.
```python
headers = {
    # TODO: Update the User-Agent with your email!
    'User-Agent': '[email protected]'
}
```

📌 If this is not set, the SEC server will reject your request with a 403 or 400 error.
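As a minimal sketch of a compliant request using only the standard library — the URL and function name here are illustrative, not the script's actual code:

```python
import urllib.request

# The SEC requires a descriptive User-Agent identifying you (name + email).
HEADERS = {
    "User-Agent": "Your Name <your.email@example.com>",  # replace with your own
}

def fetch(url: str) -> bytes:
    """Fetch a URL with the SEC-required User-Agent header set."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    # Illustrative endpoint only; the real script defines its own URLs.
    data = fetch("https://www.sec.gov/Archives/edgar/daily-index/xbrl/companyfacts.zip")
```

Requests without a meaningful `User-Agent` are rejected, which is why the script asks you to edit the header before running.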
```bash
# Download JSON bulk metadata
python 1_sec_json_data_retriever.py

# Fetch actual HTML filings
python 2_sec_text_data_retriever.py

# Clean and chunk documents
python 3_sec_text_data_processor.py

# Embed each chunk
python 4_sec_text_embedding_generator.py

# Index embeddings with FAISS
python 5_embedding_indexer.py
```

```bash
python 6_sec_local_llm_chat.py "What is the revenue in 2019?"
```

The script will:
- Embed the query
- Retrieve relevant chunks from FAISS
- Send the context to a local model (TinyLlama by default)
- Print the generated answer
💡 You can switch to GPT-4 or Claude with a few lines of modification if needed.
After you run:

```bash
python 6_sec_local_llm_chat.py "What is the revenue in 2019?"
```

you should see output like this:
This shows the retrieved SEC filing chunks and the final answer generated by the local LLM.
- Query → SBERT embedding
- FAISS retrieves top relevant chunks
- Prompt is constructed with context
- LLM generates an answer based on the retrieved evidence
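The retrieval steps above can be illustrated without FAISS using plain NumPy: on normalized vectors, FAISS's inner-product search is equivalent to the cosine ranking below. The random vectors are stand-ins for SBERT embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SBERT chunk embeddings (FAISS would index these).
chunk_texts = [
    "Revenue in 2019 was $10M.",
    "Risk factors include market volatility.",
    "MD&A: management discussion and analysis.",
]
chunk_vecs = rng.normal(size=(3, 384)).astype("float32")
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

# 1. Embed the query (here: a slightly perturbed copy of chunk 0).
query_vec = chunk_vecs[0] + 0.05 * rng.normal(size=384)
query_vec /= np.linalg.norm(query_vec)

# 2. Retrieve top-k chunks by inner product (== cosine, since normalized).
scores = chunk_vecs @ query_vec
top_k = np.argsort(scores)[::-1][:2]

# 3. Construct the prompt with the retrieved context.
context = "\n".join(chunk_texts[i] for i in top_k)
prompt = (
    f"Answer using only this context:\n{context}\n\n"
    "Question: What is the revenue in 2019?"
)
print(top_k[0])  # 0 — the chunk the query was derived from ranks first
```

The prompt is then passed to the local LLM, which is instructed to answer only from the retrieved evidence.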
| Folder | Description |
|---|---|
| `data/xbrl_data/` | JSON company facts from SEC bulk download |
| `data/submissions_data/` | SEC submission history for each CIK |
| `data/filings_html/` | Downloaded raw 10-K/10-Q HTML files |
| `data/cleaned_filings/` | Chunked JSONs with metadata (year, form, section) |
| `data/embeddings_chunked/` | Chunk-level SBERT embeddings |
| `data/faiss_index_chunked/` | FAISS index and lookup metadata |
The default model is:

`TinyLlama/TinyLlama-1.1B-Chat-v1.0`
If you want to use larger or gated models:

- Request access on Hugging Face
- Run `huggingface-cli login`
- Replace the model ID in `6_sec_local_llm_chat.py`
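Swapping models is typically a one-line change — the variable name below is hypothetical, so check the actual assignment in `6_sec_local_llm_chat.py`:

```python
# Hypothetical variable name; see the actual assignment in 6_sec_local_llm_chat.py.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# After `huggingface-cli login`, a gated model could be used instead, e.g.:
# MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
```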
- OpenAI/Claude fallback option
- Streamlit or Gradio chat interface
- Incremental updates from EDGAR
- Financial summarization or comparison engine
- CSV or PDF export of Q&A answers
Created by Lawrence Shieh 📧 [email protected]
PRs and suggestions welcome!
