Skip to content

moshe19909090/financebench-rag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

FinanceBench RAG Assignment

Overview

This repository contains my solution for a FinanceBench assignment focused on building, evaluating, and improving a Retrieval-Augmented Generation (RAG) pipeline for financial question answering.

The project includes:

  • a naive generation baseline
  • document chunking, embedding, and FAISS indexing
  • a RAG pipeline for grounded answering
  • evaluation using correctness, faithfulness, and retrieval page-hit metrics
  • improvement experiments
  • a bonus experiment on multi-scale chunking

Project Goals

The main goal of the assignment was to understand how retrieval quality affects downstream answer quality in a financial QA setting, and to evaluate the pipeline beyond final-answer correctness alone.

In particular, the assignment explored:

  • how a naive model compares against retrieval-augmented answering
  • how chunking and indexing choices affect retrieval
  • how to measure correctness, faithfulness, and retrieval quality separately
  • how controlled experiments can improve or fail to improve a RAG system

Main Components

1. Naive Baseline

I first evaluated a naive baseline with no retrieval, where the generation model answered directly from its internal knowledge.

This established a simple point of comparison before introducing retrieval.


2. Document Indexing

I loaded the relevant FinanceBench PDF filings, attached standardized metadata to each page (doc_name, company, doc_period, page_number), split the pages into chunks, embedded them using BAAI/bge-small-en-v1.5, and stored them in a FAISS vector index.

This created the retrieval layer used by the later RAG pipeline.


3. RAG Pipeline

I built a basic RAG pipeline that:

  1. receives a user query
  2. retrieves relevant chunks from the FAISS vector store
  3. formats those chunks into context
  4. sends the question and context to the generation model
  5. returns a grounded answer

The generation step was instructed to answer only from the provided context and explicitly say when the answer was not supported by the retrieved evidence.


4. Evaluation

I evaluated the RAG pipeline using three complementary measures:

  • Correctness
    Whether the final answer matches the dataset ground truth.

  • Faithfulness
    Whether the answer is supported by the retrieved context.

  • Page-hit@k
    Whether retrieval surfaced the correct evidence page within the top-k retrieved chunks.

This allowed me to separate retrieval failures from generation failures.


5. Improvement Cycles

I ran several controlled experiments, changing one variable at a time from the baseline:

  • increasing retrieval depth (k)
  • using a stricter generation prompt
  • adding a reranker (BAAI/bge-reranker-base)

Each experiment was evaluated again using the same metrics.


6. Bonus: Multi-scale Chunking

In the bonus section, I tested the hypothesis that no single chunk size is optimal for all queries.

I built multiple FAISS indices with different chunk sizes and compared their page-hit@5 performance across the FinanceBench dataset.

This was used to test whether chunk-size effectiveness is stable or query-dependent.


Key Findings

Main finding

The main conclusion from the project was that the pipeline was relatively faithful to the retrieved context, but correctness was mainly limited by retrieval precision rather than generation quality.

In other words:

  • the model often stayed grounded in the retrieved context
  • but the retrieved context was frequently not precise enough to support the correct final answer

Additional findings

  • increasing retrieval k slightly improved broader retrieval coverage, but did not dramatically solve the core retrieval bottleneck
  • a stricter generation prompt made the model more conservative, but did not improve retrieval quality
  • adding a reranker did not improve results in this setup
  • in the bonus experiment, chunk size 1000 performed best on average, but chunk-size effectiveness still varied across questions

Repository Contents

  • assignment2_rag_financebench.ipynb
    Main notebook containing the full solution.

  • outputs/assignment2_naive_generation.xlsx
    Naive baseline results.

  • outputs/assignment2_run_and_compare.xlsx
    Side-by-side comparison of naive vs. RAG answers.

  • outputs/assignment2_evaluation.xlsx
    Evaluation results for correctness, faithfulness, and page-hit metrics.

  • outputs/assignment2_improvement_cycles.xlsx
    Results of the controlled improvement experiments.

  • outputs/bonus_multi_scale_chunking_results.xlsx
    Detailed per-question results for the bonus chunk-size experiment.

  • outputs/bonus_multi_scale_chunking_summary.xlsx
    Summary table for the bonus experiment.


Tech Stack

  • Python
  • Jupyter Notebook
  • Pandas
  • Hugging Face models
  • FAISS
  • LangChain components
  • OpenAI-compatible API client
  • Nebius-hosted LLMs
  • Ragas

What I Learned

This assignment showed me that building a useful RAG system is not just about connecting retrieval and generation.

A strong pipeline depends on:

  • chunking decisions
  • retrieval quality
  • careful evaluation
  • controlled experimentation
  • and a clear understanding of where the real bottleneck is

One of the most important lessons from this project was that a system can be relatively faithful yet still not be very correct, if retrieval fails to surface the right evidence.


Notes

This repository contains the notebook and output artifacts needed to review the work and results.
Large intermediate artifacts such as raw PDF caches, FAISS index folders, and local environment files were intentionally excluded from version control to keep the repository clean and lightweight.

About

RAG pipeline for FinanceBench with retrieval, evaluation, improvement cycles, and chunk-size experiments.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors