Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

This repository contains the implementation for the paper "Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks" (FSE 2026).

The study systematically evaluates RAG design choices across 3 SE tasks (code generation, summarization, repair) on 6 benchmarks, covering:

4 query processing techniques
7 retrieval models (sparse, dense, hybrid)
4 context refinement methods (reranking + compression)
6 LLM generators

Repository Structure

RAG-Empirical-SE/
├── compress_answers.py         # Compress answers/ dirs to .tar.gz
├── restore_answers.py          # Restore answers/ dirs from archives
│
├── dataset/                    # Benchmark datasets (compressed as .xz)
│   ├── code_gen/
│   ├── code_repair/
│   └── code_sum/
│
├── Rag_Class/                  # Core RAG component classes
│   ├── embed.py                # Embedder (multi-provider)
│   ├── generate.py             # Generator (unified LLM interface)
│   ├── prompt.py               # PromptBuilder (task-aware prompting)
│   ├── rerank.py               # Reranker (cross-encoder & generative)
│   ├── retrieve.py             # Retriever (FAISS top-k)
│   └── EmpiricalRules.md       # Empirical findings → pipeline rules
│
└── rag_technique/              # Experiment scripts and results
    ├── config.json             # Central configuration
    ├── manage_answers.py       # Compress/restore/inspect answer dirs
    │
    ├── retriever_bm25.py       # Sparse retrieval
    ├── retriever_embed.py      # Dense retrieval + embedding
    ├── retriever_hybrid.py     # Hybrid retrieval
    ├── rag_generation.py       # Standard RAG generation
    ├── parse_solutions.py      # Extract code from LLM output
    ├── corpus_source.py        # Corpus source analysis
    ├── cost_analysis.py        # Token cost analysis
    ├── standardize_name.py     # Normalize task_id across JSONL files
    │
    ├── framework.ipynb         # Adaptive RAG – APPS code generation
    ├── framework_2.ipynb       # Adaptive RAG – Move Method refactoring
    ├── framework_3.ipynb       # Adaptive RAG – Repo-level code generation
    ├── framework_4.ipynb       # Adaptive RAG – API-guided code generation
    ├── framework_5.ipynb       # Adaptive RAG – Unit test generation
    ├── framework.md            # Framework design documentation
    │
    ├── querytrans/             # Query transformation experiments
    ├── rerank/                 # Reranking experiments
    ├── compression/            # Context compression experiments
    ├── no_embedder/            # Zero-shot and golden-context baselines
    │
    ├── bge_m3/                 # Answer directories (per embedding model)
    ├── bm25/
    ├── gte_multilingual_base/
    ├── sfr_em_code_400m/
    ├── jina_em_v2_base_code/
    ├── multilingual_e5_small/
    ├── granite_em_278_multi/
    ├── rag_hybrid/
    │
    ├── results_apps/
    ├── results_debugbench/
    ├── results_humanevalpack/
    ├── results_lcb-gen/
    ├── results_lcb-repair/
    ├── results_lcb-sum/
    └── results_codexglue/

Setup

conda create -n rag_empirical python=3.12
conda activate rag_empirical
pip install -r requirements.txt

# Make Rag_Class importable (run from the repository root)
export PYTHONPATH="$(pwd):$PYTHONPATH"

See Rag_Class/readme.md for detailed dependency instructions.

Configuration

Edit rag_technique/config.json before running any experiments:

base_dir_prefix: local paths for each embedding model's output directory
llm_providers → api_key: your API key (OpenAI-compatible endpoint)
datasets: paths to benchmark JSONL files
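
A quick sanity check of the edited file can catch missing keys before a long run. The following is a minimal sketch, not part of the repository: it assumes the three keys listed above sit at the top level of config.json and that datasets maps a benchmark name to a file path. The helper name load_config and the nesting are assumptions, so adapt them to the actual file.

```python
import json
from pathlib import Path

# Top-level keys named in this README. The real config.json may contain more,
# and the nesting assumed below is a guess, not the repository's schema.
EXPECTED_KEYS = ("base_dir_prefix", "llm_providers", "datasets")


def load_config(path="rag_technique/config.json"):
    """Load the config file and return (config, list_of_problems)."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]

    # Assumed layout: datasets maps a benchmark name to a JSONL path.
    datasets = config.get("datasets", {})
    if isinstance(datasets, dict):
        for name, dataset_path in datasets.items():
            if isinstance(dataset_path, str) and not Path(dataset_path).exists():
                problems.append(f"dataset file not found: {name} -> {dataset_path}")
    return config, problems
```

An empty problem list only means the assumed keys and paths are present; it does not validate API keys or model names.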

Datasets

Benchmarks are included as compressed .xz archives. Decompress each one before use (-k keeps the original archive):

xz -dk dataset/code_gen/apps.jsonl.xz
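
To decompress every archive in one pass, or on a machine without the xz command, the same step can be done with Python's standard lzma module. This is a sketch under the assumption that all archives live under dataset/ and end in .xz; the function name decompress_all is hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def decompress_all(dataset_root="dataset"):
    """Decompress every *.xz file under dataset_root, keeping the archives."""
    written = []
    for archive in sorted(Path(dataset_root).rglob("*.xz")):
        target = archive.with_suffix("")  # apps.jsonl.xz -> apps.jsonl
        if target.exists():
            continue  # already decompressed; leave it alone
        with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Like xz -dk, this leaves the .xz files in place, so it is safe to rerun.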

Directory	Benchmarks
`code_gen/`	APPS, LiveCodeBench-Gen, HumanEval, MBPP
`code_repair/`	DebugBench, HumanEvalPack, LiveCodeBench-Repair, QuixBugs
`code_sum/`	CodeXGLUE, LiveCodeBench-Sum

Corpus files (Stack Overflow, Python API docs, LeetCode, CodeSearchNet, Code-Contests) are not included due to size. Set their paths in config.json under corpus_config.

Running Experiments

The pipeline runs in five stages. Each stage's output feeds the next. See the individual scripts for configuration options.

1. Embeddings — rag_technique/retriever_embed.py

2. Retrieval — retriever_embed.py (dense), retriever_bm25.py (sparse), retriever_hybrid.py (hybrid)

3. Query Processing and Context Refinement (optional)

Reranking: rerank/rerank_top_k.py
Compression: compression/llmlingua_compression.py, compression/recomp_reranking.py + recomp_compression.py
Query transformation: querytrans/query_transformation.py + querytrans_retrieval.py

4. Generation — rag_generation.py (standard); variant scripts in querytrans/, rerank/, compression/, no_embedder/

5. Parse & Evaluate — parse_solutions.py then evaluation scripts under each results_*/ directory
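
The parsing step in stage 5 exists because LLM responses usually wrap code in prose and Markdown fences. The sketch below illustrates one common way to extract the code. It is not the logic of parse_solutions.py, and the longest-block heuristic and the name extract_code are assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this block readable
# Matches a fenced block with an optional language tag on the opening line.
FENCE_RE = re.compile(FENCE + r"[ \t]*([\w+-]*)[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(llm_output, language="python"):
    """Return the most likely code snippet from a raw LLM response."""
    blocks = FENCE_RE.findall(llm_output)
    if not blocks:
        # No fences: assume the whole response is code.
        return llm_output.strip()
    # Prefer blocks tagged with the requested language, else take any block.
    tagged = [body for tag, body in blocks if tag.lower() == language]
    candidates = tagged or [body for _, body in blocks]
    # Heuristic: the longest candidate is usually the full solution.
    return max(candidates, key=len).strip()
```

A response with several fenced blocks (for example, an explanation snippet followed by the final solution) is where the heuristic matters; evaluation results depend on this choice, so the repository's own parser should be used when reproducing the paper's numbers.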

Managing Answer Directories

The answers/ subdirectories can grow to ~2.4 GB total. Use the provided scripts to manage them:

# Compress all answers/ dirs (deletes originals by default)
python compress_answers.py

# Restore from archives
python restore_answers.py

# Or use the legacy manage_answers.py inside rag_technique/
python rag_technique/manage_answers.py status
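
For readers who want to see what the compress and restore round trip involves, here is a minimal sketch using the standard tarfile module. It is an illustration only, not the code of compress_answers.py or restore_answers.py: the archive name answers.tar.gz, its location next to the directory, and the function names are assumptions. Unlike the repository script, this sketch keeps the originals unless delete_original=True is passed.

```python
import shutil
import tarfile
from pathlib import Path


def compress_answers(root=".", delete_original=False):
    """Archive each answers/ directory under root as answers.tar.gz beside it."""
    archives = []
    for answers_dir in sorted(Path(root).rglob("answers")):
        if not answers_dir.is_dir():
            continue
        archive = answers_dir.with_name("answers.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            # arcname keeps paths relative, so extraction recreates answers/.
            tar.add(answers_dir, arcname="answers")
        if delete_original:
            shutil.rmtree(answers_dir)
        archives.append(archive)
    return archives


def restore_answers(root="."):
    """Extract every answers.tar.gz under root next to the archive."""
    for archive in sorted(Path(root).rglob("answers.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(archive.parent)
```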

Adaptive RAG Framework

A key contribution is an LLM-driven adaptive framework that derives a RAG pipeline configuration for a new task from the study's empirical findings. The framework*.ipynb notebooks demonstrate this across five SE scenarios:

Notebook	Scenario
`framework.ipynb`	Code generation (APPS)
`framework_2.ipynb`	Move Method refactoring
`framework_3.ipynb`	Repository-level code generation
`framework_4.ipynb`	API-guided code generation
`framework_5.ipynb`	Unit test generation

The framework works in two steps: (1) a frontier LLM profiles the task into a structured YAML Task Profile; (2) it then reasons over Rag_Class/EmpiricalRules.md to derive a concrete pipeline configuration. Set REASONING_LLM_KEY in each notebook to any model key defined in config.json.
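
The two-step flow can be sketched as two chained LLM calls. The code below is a hypothetical outline, not the notebooks' implementation: the prompt wording, the function name configure_pipeline, and the llm callable (any function that maps a prompt string to a response string) are all assumptions.

```python
from typing import Callable

# Hypothetical prompt templates; the notebooks' actual prompts may differ.
PROFILE_PROMPT = (
    "Describe the following software engineering task as a YAML Task Profile.\n"
    "Task description:\n{task}\n"
)
CONFIG_PROMPT = (
    "Given the Task Profile and the empirical rules below, propose a RAG "
    "pipeline configuration (query processing, retriever, context refinement, "
    "generator).\n\nTask Profile:\n{profile}\n\nRules:\n{rules}\n"
)


def configure_pipeline(task: str, rules_text: str, llm: Callable[[str], str]) -> dict:
    """Run the two steps: profile the task, then reason over the rules."""
    # Step 1: the LLM turns a free-form description into a Task Profile.
    profile = llm(PROFILE_PROMPT.format(task=task))
    # Step 2: the LLM reads the profile together with the rules text.
    pipeline = llm(CONFIG_PROMPT.format(profile=profile, rules=rules_text))
    return {"task_profile": profile, "pipeline_config": pipeline}
```

In practice, rules_text would be the contents of Rag_Class/EmpiricalRules.md and llm would wrap whichever model REASONING_LLM_KEY selects. Passing the model as a plain callable keeps the sketch independent of any provider SDK.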

Citation

@inproceedings{10.1145/3808190,
  title     = {Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks},
  booktitle = {Proceedings of the ACM International Conference on the Foundations of Software Engineering (FSE)},
  year      = {2026},
  doi       = {10.1145/3808190},
  url       = {https://doi.org/10.1145/3808190}
}
