This repository contains my solution for a FinanceBench assignment focused on building, evaluating, and improving a Retrieval-Augmented Generation (RAG) pipeline for financial question answering.
The project includes:
- a naive generation baseline
- document chunking, embedding, and FAISS indexing
- a RAG pipeline for grounded answering
- evaluation using correctness, faithfulness, and retrieval page-hit metrics
- improvement experiments
- a bonus experiment on multi-scale chunking
The main goal of the assignment was to understand how retrieval quality affects downstream answer quality in a financial QA setting, and to evaluate the pipeline beyond final-answer correctness alone.
In particular, the assignment explored:
- how a naive model compares against retrieval-augmented answering
- how chunking and indexing choices affect retrieval
- how to measure correctness, faithfulness, and retrieval quality separately
- how controlled experiments can improve or fail to improve a RAG system
I first evaluated a naive baseline with no retrieval, where the generation model answered directly from its internal knowledge.
This established a simple point of comparison before introducing retrieval.
I loaded the relevant FinanceBench PDF filings, attached standardized metadata to each page (doc_name, company, doc_period, page_number), split the pages into chunks, embedded them using BAAI/bge-small-en-v1.5, and stored them in a FAISS vector index.
This created the retrieval layer used by the later RAG pipeline.
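The chunking step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `Chunk` class, the `chunk_page` helper, and the `chunk_size` and `overlap` values are all hypothetical, and the embedding and FAISS steps are left out so the example runs with the standard library only. The point it shows is that every chunk carries a copy of its page's metadata, which is what later makes page-level retrieval checks possible.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_page(text, metadata, chunk_size=1000, overlap=100):
    """Split one page into overlapping character windows.

    Each chunk gets its own copy of the page metadata
    (doc_name, company, doc_period, page_number).
    chunk_size and overlap are illustrative defaults, not the project's settings.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(Chunk(piece, dict(metadata)))
        # Stop once this window has reached the end of the page.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical page record, for illustration only.
page_meta = {
    "doc_name": "EXAMPLE_2022_10K",
    "company": "Example Corp",
    "doc_period": 2022,
    "page_number": 12,
}
page_chunks = chunk_page("lorem ipsum " * 200, page_meta)
print(len(page_chunks), page_chunks[0].metadata["page_number"])
```

Copying the metadata dictionary per chunk, rather than sharing one object, avoids accidental cross-chunk mutation if a later step annotates individual chunks.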
I built a basic RAG pipeline that:
- receives a user query
- retrieves relevant chunks from the FAISS vector store
- formats those chunks into context
- sends the question and context to the generation model
- returns a grounded answer
The generation step was instructed to answer only from the provided context and explicitly say when the answer was not supported by the retrieved evidence.
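The steps above can be sketched as a small function with the retriever and the generation model passed in as callables. Everything named here is an assumption made for illustration: `format_context`, `answer_with_rag`, the prompt wording, and the chunk dictionary layout are hypothetical and are not taken from the notebook.

```python
def format_context(chunks):
    """Render retrieved chunks as numbered, source-tagged context blocks."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        header = f"[{i}] {meta['doc_name']} (page {meta['page_number']})"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


# Hypothetical prompt wording; the notebook's actual prompt may differ.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not support an answer, say so explicitly.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def answer_with_rag(question, retrieve, generate, k=5):
    """Retrieve k chunks, build a grounded prompt, and call the generator.

    retrieve(question, k) -> list of {"text": ..., "metadata": {...}}
    generate(prompt) -> str
    """
    chunks = retrieve(question, k)
    prompt = PROMPT_TEMPLATE.format(
        context=format_context(chunks), question=question
    )
    return {"answer": generate(prompt), "chunks": chunks, "prompt": prompt}
```

Returning the retrieved chunks alongside the answer keeps the evidence available for later faithfulness and page-hit checks without a second retrieval call.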
I evaluated the RAG pipeline using three complementary measures:
- Correctness: whether the final answer matches the dataset ground truth.
- Faithfulness: whether the answer is supported by the retrieved context.
- Page-hit@k: whether retrieval surfaced the correct evidence page within the top-k retrieved chunks.
This allowed me to separate retrieval failures from generation failures.
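As a rough sketch of how the page-hit metric and the failure attribution fit together, the snippet below computes page-hit@k from chunk metadata and then labels each non-correct answer. The function names and the label strings are hypothetical, and the sketch assumes the gold evidence page uses the same page-numbering convention as the `page_number` metadata; the correctness and faithfulness judgments are taken as given booleans.

```python
def page_hit_at_k(retrieved_meta, gold_doc, gold_page, k):
    """True if any of the top-k retrieved chunks comes from the gold evidence page.

    retrieved_meta is a ranked list of chunk metadata dicts
    containing "doc_name" and "page_number".
    """
    return any(
        meta["doc_name"] == gold_doc and meta["page_number"] == gold_page
        for meta in retrieved_meta[:k]
    )


def diagnose(correct, faithful, page_hit):
    """Attribute an outcome to retrieval or generation (illustrative labels)."""
    if correct:
        return "ok"
    if not page_hit:
        return "retrieval failure"
    if not faithful:
        return "generation failure: unfaithful to retrieved evidence"
    return "generation failure: evidence retrieved but answer wrong"


# Toy ranking, not project data.
ranked = [
    {"doc_name": "EXAMPLE_2022_10K", "page_number": 40},
    {"doc_name": "EXAMPLE_2022_10K", "page_number": 12},
]
print(page_hit_at_k(ranked, "EXAMPLE_2022_10K", 12, k=1))
print(page_hit_at_k(ranked, "EXAMPLE_2022_10K", 12, k=2))
```

A faithful but incorrect answer with no page hit lands in the retrieval bucket, which is the case that a correctness score alone cannot distinguish from a generation error.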
I ran several controlled experiments, changing one variable at a time from the baseline:
- increasing retrieval depth (k)
- using a stricter generation prompt
- adding a reranker (BAAI/bge-reranker-base)
Each experiment was evaluated again using the same metrics.
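One way to keep such experiments controlled is to derive every variant from a single baseline configuration and change exactly one key. The sketch below illustrates that idea only; the configuration keys, the baseline values, and the `make_variants` helper are hypothetical, and the specific `k` values are placeholders rather than the settings used in the notebook.

```python
# Hypothetical baseline settings, for illustration only.
BASELINE = {"k": 5, "prompt": "baseline", "reranker": None}


def make_variants(baseline, changes):
    """Build one config per experiment, each differing from the baseline in one key.

    changes maps an experiment name to a (key, new_value) pair.
    """
    variants = {"baseline": dict(baseline)}
    for name, (key, value) in changes.items():
        if key not in baseline:
            raise KeyError(f"unknown setting: {key}")
        config = dict(baseline)
        config[key] = value
        variants[name] = config
    return variants


experiments = make_variants(
    BASELINE,
    {
        "higher_k": ("k", 10),
        "strict_prompt": ("prompt", "strict"),
        "reranker": ("reranker", "BAAI/bge-reranker-base"),
    },
)
for name, config in experiments.items():
    changed = [key for key in config if config[key] != BASELINE[key]]
    print(name, changed)
```

Because each variant differs from the baseline in a single setting, any change in the metrics can be attributed to that setting rather than to an interaction between several changes.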
In the bonus section, I tested the hypothesis that no single chunk size is optimal for all queries.
I built multiple FAISS indices with different chunk sizes and compared their page-hit@5 performance across the FinanceBench dataset.
This was used to test whether chunk-size effectiveness is stable or query-dependent.
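The comparison can be sketched as a small aggregation over per-question results. The `summarize_chunk_sizes` helper and the toy rows below are hypothetical and are not results from this project; they only show how an average hit rate per chunk size can coexist with questions that are answered at some chunk sizes but not others.

```python
from collections import defaultdict


def summarize_chunk_sizes(rows):
    """Aggregate (question_id, chunk_size, hit) rows from a page-hit@5 run.

    Returns:
      mean_hit: {chunk_size: fraction of questions with a page hit}
      size_dependent: sorted question ids that hit at some sizes but not all
    """
    hits_by_size = defaultdict(list)
    sizes_hit = defaultdict(set)
    for question_id, chunk_size, hit in rows:
        hits_by_size[chunk_size].append(bool(hit))
        sizes_hit[question_id]  # register the question even if it never hits
        if hit:
            sizes_hit[question_id].add(chunk_size)
    mean_hit = {size: sum(h) / len(h) for size, h in hits_by_size.items()}
    all_sizes = set(hits_by_size)
    size_dependent = sorted(
        q for q, sizes in sizes_hit.items() if sizes and sizes != all_sizes
    )
    return mean_hit, size_dependent


# Toy rows, invented for illustration.
toy_rows = [
    ("q1", 500, True), ("q1", 1000, False),
    ("q2", 500, True), ("q2", 1000, True),
    ("q3", 500, False), ("q3", 1000, False),
]
mean_hit, size_dependent = summarize_chunk_sizes(toy_rows)
print(mean_hit)
print(size_dependent)
```

A non-empty `size_dependent` list is the signal that chunk-size effectiveness is query-dependent, even when one size has the best average.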
The main conclusion from the project was that the pipeline was relatively faithful to the retrieved context, but correctness was mainly limited by retrieval precision rather than generation quality.
In other words:
- the model often stayed grounded in the retrieved context
- but the retrieved context was frequently not precise enough to support the correct final answer
- increasing retrieval k slightly improved broader retrieval coverage, but did not dramatically solve the core retrieval bottleneck
- a stricter generation prompt made the model more conservative, but did not improve retrieval quality
- adding a reranker did not improve results in this setup
- in the bonus experiment, chunk size 1000 performed best on average, but chunk-size effectiveness still varied across questions
- assignment2_rag_financebench.ipynb: Main notebook containing the full solution.
- outputs/assignment2_naive_generation.xlsx: Naive baseline results.
- outputs/assignment2_run_and_compare.xlsx: Side-by-side comparison of naive vs. RAG answers.
- outputs/assignment2_evaluation.xlsx: Evaluation results for correctness, faithfulness, and page-hit metrics.
- outputs/assignment2_improvement_cycles.xlsx: Results of the controlled improvement experiments.
- outputs/bonus_multi_scale_chunking_results.xlsx: Detailed per-question results for the bonus chunk-size experiment.
- outputs/bonus_multi_scale_chunking_summary.xlsx: Summary table for the bonus experiment.
- Python
- Jupyter Notebook
- Pandas
- Hugging Face models
- FAISS
- LangChain components
- OpenAI-compatible API client
- Nebius-hosted LLMs
- Ragas
This assignment showed me that building a useful RAG system is not just about connecting retrieval and generation.
A strong pipeline depends on:
- chunking decisions
- retrieval quality
- careful evaluation
- controlled experimentation
- and a clear understanding of where the real bottleneck is
One of the most important lessons from this project was that a system can be relatively faithful to its retrieved context and still give incorrect answers when retrieval fails to surface the right evidence.
This repository contains the notebook and output artifacts needed to review the work and results.
Large intermediate artifacts such as raw PDF caches, FAISS index folders, and local environment files were intentionally excluded from version control to keep the repository clean and lightweight.