FinanceBench RAG Assignment

Overview

This repository contains my solution for a FinanceBench assignment focused on building, evaluating, and improving a Retrieval-Augmented Generation (RAG) pipeline for financial question answering.

The project includes:

a naive generation baseline
document chunking, embedding, and FAISS indexing
a RAG pipeline for grounded answering
evaluation using correctness, faithfulness, and retrieval page-hit metrics
improvement experiments
a bonus experiment on multi-scale chunking

Project Goals

The main goal of the assignment was to understand how retrieval quality affects downstream answer quality in a financial QA setting, and to evaluate the pipeline beyond final-answer correctness alone.

In particular, the assignment explored:

how a naive model compares against retrieval-augmented answering
how chunking and indexing choices affect retrieval
how to measure correctness, faithfulness, and retrieval quality separately
how controlled experiments can improve or fail to improve a RAG system

Main Components

1. Naive Baseline

I first evaluated a naive baseline with no retrieval, where the generation model answered directly from its internal knowledge.

This established a simple point of comparison before introducing retrieval.

2. Document Indexing

I loaded the relevant FinanceBench PDF filings, attached standardized metadata to each page (doc_name, company, doc_period, page_number), split the pages into chunks, embedded them using BAAI/bge-small-en-v1.5, and stored them in a FAISS vector index.

This created the retrieval layer used by the later RAG pipeline.
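
As a rough illustration of the chunking step, the sketch below splits one page into overlapping character windows and copies the page metadata onto every chunk. It is not the notebook's code: the `Chunk` class, the `chunk_page` helper, and the `chunk_size` and `overlap` defaults are assumptions made for this example, and the notebook may use a library text splitter instead.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_name: str
    company: str
    doc_period: int
    page_number: int


def chunk_page(text, meta, chunk_size=1000, overlap=100):
    """Split one page into overlapping character windows that inherit the page metadata.

    chunk_size and overlap are illustrative defaults, not the notebook's settings.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(Chunk(text=piece, **meta))
        # Stop once the current window already reaches the end of the page.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Keeping `page_number` on every chunk is what makes a page-level retrieval metric possible later, since each retrieved chunk can be traced back to its source page.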

3. RAG Pipeline

I built a basic RAG pipeline that:

receives a user query
retrieves relevant chunks from the FAISS vector store
formats those chunks into context
sends the question and context to the generation model
returns a grounded answer

The generation step was instructed to answer only from the provided context and explicitly say when the answer was not supported by the retrieved evidence.
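
The steps above can be sketched as a small function that takes the retriever and the generator as plain callables. This is a minimal sketch under assumptions: the function names, the prompt wording, the dictionary layout of a chunk, and the default `k=5` are all hypothetical and are not taken from the notebook.

```python
def format_context(chunks):
    """Render retrieved chunks as numbered, source-tagged passages."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        source = f"{chunk['doc_name']}, page {chunk['page_number']}"
        parts.append(f"[{i}] ({source})\n{chunk['text']}")
    return "\n\n".join(parts)


def build_prompt(question, chunks):
    """Combine the grounding instruction, the context, and the question."""
    # The instruction wording is an assumption; it only mirrors the behaviour described above.
    return (
        "Answer the question using only the context below. "
        "If the context does not support an answer, say so explicitly.\n\n"
        f"Context:\n{format_context(chunks)}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_rag(question, retrieve, generate, k=5):
    """Run retrieve -> format -> generate using injected callables."""
    chunks = retrieve(question, k)   # e.g. a wrapper around a vector-store search
    prompt = build_prompt(question, chunks)
    return {"answer": generate(prompt), "chunks": chunks}
```

Returning the retrieved chunks alongside the answer keeps the evidence available for the faithfulness and page-hit checks.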

4. Evaluation

I evaluated the RAG pipeline using three complementary measures:

Correctness
Whether the final answer matches the dataset ground truth.
Faithfulness
Whether the answer is supported by the retrieved context.
Page-hit@k
Whether retrieval surfaced the correct evidence page within the top-k retrieved chunks.

This allowed me to separate retrieval failures from generation failures.
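
For reference, page-hit@k can be computed roughly as follows. This is a sketch, assuming each retrieved chunk is a dictionary carrying `doc_name` and `page_number`, and that the gold evidence pages use the same page-numbering convention as the chunk metadata; the function and field names are hypothetical.

```python
def page_hit_at_k(retrieved, gold_doc, gold_pages, k=5):
    """Return 1 if any of the top-k retrieved chunks comes from a gold evidence page, else 0."""
    gold = set(gold_pages)
    for chunk in retrieved[:k]:
        if chunk["doc_name"] == gold_doc and chunk["page_number"] in gold:
            return 1
    return 0


def mean_page_hit(rows, k=5):
    """Average page-hit@k over rows that hold 'retrieved', 'gold_doc', and 'gold_pages'."""
    if not rows:
        return 0.0
    hits = [page_hit_at_k(r["retrieved"], r["gold_doc"], r["gold_pages"], k) for r in rows]
    return sum(hits) / len(hits)
```

Because the check only looks at document and page identity, it scores retrieval independently of whatever the generation model later writes.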

5. Improvement Cycles

I ran several controlled experiments, changing one variable at a time from the baseline:

increasing retrieval depth (k)
using a stricter generation prompt
adding a reranker (BAAI/bge-reranker-base)

Each experiment was evaluated again using the same metrics.
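
One way to keep such experiments controlled is to derive every variant from a single baseline configuration and change exactly one key. The sketch below illustrates the idea; the setting names and the values `k=5` and `k=10` are placeholders chosen for the example, not the values used in the notebook.

```python
# Hypothetical baseline settings; only the reranker model name comes from this README.
BASELINE = {"k": 5, "prompt": "basic", "reranker": None}

CHANGES = {
    "larger_k": ("k", 10),
    "strict_prompt": ("prompt", "strict"),
    "with_reranker": ("reranker", "BAAI/bge-reranker-base"),
}


def one_at_a_time(baseline, changes):
    """Build named configs that each differ from the baseline in exactly one key."""
    configs = {"baseline": dict(baseline)}
    for name, (key, value) in changes.items():
        if key not in baseline:
            raise KeyError(f"unknown setting: {key}")
        variant = dict(baseline)   # copy, so the baseline itself is never mutated
        variant[key] = value
        configs[name] = variant
    return configs
```

Each resulting config can then be passed through the same evaluation loop, so any metric change is attributable to a single setting.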

6. Bonus: Multi-scale Chunking

In the bonus section, I tested the hypothesis that no single chunk size is optimal for all queries.

I built multiple FAISS indices with different chunk sizes and compared their page-hit@5 performance across the FinanceBench dataset.

This was used to test whether chunk-size effectiveness is stable or query-dependent.
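
A comparison of this kind could be summarized with a helper like the one below. It is a sketch that assumes the per-question results are available as a mapping from question id to a page-hit@5 value (0 or 1) per chunk size; the function name and data layout are hypothetical.

```python
def summarize_chunk_sizes(hits):
    """Summarize page-hit results given as {question_id: {chunk_size: 0 or 1}}."""
    sizes = sorted({size for per_q in hits.values() for size in per_q})
    # Mean hit rate per chunk size; a missing entry is counted as a miss.
    mean_hit = {
        size: sum(per_q.get(size, 0) for per_q in hits.values()) / len(hits)
        for size in sizes
    }
    # A question is size-sensitive when some chunk sizes hit and others miss.
    sensitive = [
        q for q, per_q in hits.items()
        if len({per_q.get(size, 0) for size in sizes}) > 1
    ]
    return {"mean_hit": mean_hit, "size_sensitive": sensitive}
```

The `mean_hit` table answers which size is best on average, while a non-empty `size_sensitive` list would indicate that the best size depends on the query.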

Key Findings

Main finding

The main conclusion from the project was that the pipeline was relatively faithful to the retrieved context, but correctness was mainly limited by retrieval precision rather than generation quality.

In other words:

the model often stayed grounded in the retrieved context
but the retrieved context was frequently not precise enough to support the correct final answer
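
One way to make this kind of diagnosis concrete is to assign each question a coarse label from its three metrics. The sketch below is illustrative only: the label names are hypothetical, and it assumes faithfulness has already been thresholded to a boolean.

```python
from collections import Counter


def triage(correct, faithful, page_hit):
    """Assign one coarse label from the three per-question metrics."""
    if correct:
        return "correct"
    if not page_hit:
        return "retrieval_miss"          # the evidence page never reached the model
    if not faithful:
        return "unfaithful_generation"   # evidence page retrieved, answer not grounded in context
    return "grounded_but_wrong"          # right page, grounded answer, still incorrect


def triage_counts(rows):
    """Tally labels over rows that hold 'correct', 'faithful', and 'page_hit'."""
    return Counter(triage(r["correct"], r["faithful"], r["page_hit"]) for r in rows)
```

A tally dominated by `retrieval_miss` would point to retrieval as the bottleneck, whereas many `unfaithful_generation` cases would point to the generation step.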

Additional findings

increasing retrieval depth (k) slightly broadened retrieval coverage, but did not resolve the core retrieval bottleneck
a stricter generation prompt made the model more conservative, but did not improve retrieval quality
adding a reranker did not improve results in this setup
in the bonus experiment, chunk size 1000 performed best on average, but chunk-size effectiveness still varied across questions

Repository Contents

assignment2_rag_financebench.ipynb
Main notebook containing the full solution.
outputs/assignment2_naive_generation.xlsx
Naive baseline results.
outputs/assignment2_run_and_compare.xlsx
Side-by-side comparison of naive vs. RAG answers.
outputs/assignment2_evaluation.xlsx
Evaluation results for correctness, faithfulness, and page-hit metrics.
outputs/assignment2_improvement_cycles.xlsx
Results of the controlled improvement experiments.
outputs/bonus_multi_scale_chunking_results.xlsx
Detailed per-question results for the bonus chunk-size experiment.
outputs/bonus_multi_scale_chunking_summary.xlsx
Summary table for the bonus experiment.

Tech Stack

Python
Jupyter Notebook
Pandas
Hugging Face models
FAISS
LangChain components
OpenAI-compatible API client
Nebius-hosted LLMs
Ragas

What I Learned

This assignment showed me that building a useful RAG system is not just about connecting retrieval and generation.

A strong pipeline depends on:

chunking decisions
retrieval quality
careful evaluation
controlled experimentation
and a clear understanding of where the real bottleneck is

One of the most important lessons from this project was that a system can be relatively faithful to its retrieved context and still often be incorrect when retrieval fails to surface the right evidence.

Notes

This repository contains the notebook and output artifacts needed to review the work and results.
Large intermediate artifacts such as raw PDF caches, FAISS index folders, and local environment files were intentionally excluded from version control to keep the repository clean and lightweight.
