This repository contains the implementation for the paper "Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks" (FSE 2026).
The study systematically evaluates RAG design choices across 3 SE tasks (code generation, summarization, repair) on 6 benchmarks, covering:
- 4 query processing techniques
- 7 retrieval models (sparse, dense, hybrid)
- 4 context refinement methods (reranking + compression)
- 6 LLM generators
RAG-Empirical-SE/
├── compress_answers.py # Compress answers/ dirs to .tar.gz
├── restore_answers.py # Restore answers/ dirs from archives
│
├── dataset/ # Benchmark datasets (compressed as .xz)
│   ├── code_gen/
│   ├── code_repair/
│   └── code_sum/
│
├── Rag_Class/ # Core RAG component classes
│   ├── embed.py # Embedder (multi-provider)
│   ├── generate.py # Generator (unified LLM interface)
│   ├── prompt.py # PromptBuilder (task-aware prompting)
│   ├── rerank.py # Reranker (cross-encoder & generative)
│   ├── retrieve.py # Retriever (FAISS top-k)
│   └── EmpiricalRules.md # Empirical findings → pipeline rules
│
└── rag_technique/ # Experiment scripts and results
    ├── config.json # Central configuration
    ├── manage_answers.py # Compress/restore/inspect answer dirs
    │
    ├── retriever_bm25.py # Sparse retrieval
    ├── retriever_embed.py # Dense retrieval + embedding
    ├── retriever_hybrid.py # Hybrid retrieval
    ├── rag_generation.py # Standard RAG generation
    ├── parse_solutions.py # Extract code from LLM output
    ├── corpus_source.py # Corpus source analysis
    ├── cost_analysis.py # Token cost analysis
    ├── standardize_name.py # Normalize task_id across JSONL files
    │
    ├── framework.ipynb # Adaptive RAG – APPS code generation
    ├── framework_2.ipynb # Adaptive RAG – Move Method refactoring
    ├── framework_3.ipynb # Adaptive RAG – Repo-level code generation
    ├── framework_4.ipynb # Adaptive RAG – API-guided code generation
    ├── framework_5.ipynb # Adaptive RAG – Unit test generation
    ├── framework.md # Framework design documentation
    │
    ├── querytrans/ # Query transformation experiments
    ├── rerank/ # Reranking experiments
    ├── compression/ # Context compression experiments
    ├── no_embedder/ # Zero-shot and golden-context baselines
    │
    ├── bge_m3/ # Answer directories (per embedding model)
    ├── bm25/
    ├── gte_multilingual_base/
    ├── sfr_em_code_400m/
    ├── jina_em_v2_base_code/
    ├── multilingual_e5_small/
    ├── granite_em_278_multi/
    ├── rag_hybrid/
    │
    ├── results_apps/
    ├── results_debugbench/
    ├── results_humanevalpack/
    ├── results_lcb-gen/
    ├── results_lcb-repair/
    ├── results_lcb-sum/
    └── results_codexglue/
conda create -n rag_empirical python=3.12
conda activate rag_empirical
pip install -r requirements.txt
# Make Rag_Class importable
export PYTHONPATH=$(pwd):$PYTHONPATH

See Rag_Class/readme.md for detailed dependency instructions.
Edit rag_technique/config.json before running any experiments:
- base_dir_prefix: local paths for each embedding model's output directory
- llm_providers→api_key: your API key (OpenAI-compatible endpoint)
- datasets: paths to benchmark JSONL files
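Before launching an experiment, it can help to confirm that the config file actually contains the keys listed above. The following is a minimal sketch, not part of the repository: the helper name `missing_config_keys` is hypothetical, and it checks only for the presence of the three top-level keys named in this README, since the nesting below them is not documented here.

```python
import json
from pathlib import Path

# Top-level keys named in this README; the structure beneath them is not assumed.
REQUIRED_KEYS = ("base_dir_prefix", "llm_providers", "datasets")


def missing_config_keys(config_path):
    """Return the required top-level keys that are absent from the config file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if key not in config]


if __name__ == "__main__":
    path = Path("rag_technique/config.json")
    if path.exists():
        missing = missing_config_keys(path)
        print("missing keys:", missing or "none")
    else:
        print(f"{path} not found; run this from the repository root")
```

Run it from the repository root; an empty result means all three keys are present, not that their values are valid.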
Benchmarks are included as compressed .xz archives. Decompress before use:
xz -dk dataset/code_gen/apps.jsonl.xz

| Directory | Benchmarks |
|---|---|
| code_gen/ | APPS, LiveCodeBench-Gen, HumanEval, MBPP |
| code_repair/ | DebugBench, HumanEvalPack, LiveCodeBench-Repair, QuixBugs |
| code_sum/ | CodeXGLUE, LiveCodeBench-Sum |
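If the `xz` command-line tool is unavailable, the archives can also be decompressed with Python's standard `lzma` module. This is a sketch under the assumption that every archive under dataset/ ends in `.xz` and should be extracted next to itself; the function names are illustrative and do not come from the repository.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(archive_path):
    """Decompress one .xz archive next to itself and keep the archive (like xz -dk)."""
    archive = Path(archive_path)
    target = archive.with_suffix("")  # apps.jsonl.xz -> apps.jsonl
    with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def decompress_all(dataset_dir="dataset"):
    """Decompress every .xz archive under dataset_dir that is not yet extracted."""
    extracted = []
    for archive in sorted(Path(dataset_dir).rglob("*.xz")):
        if not archive.with_suffix("").exists():
            extracted.append(decompress_xz(archive))
    return extracted


if __name__ == "__main__":
    if Path("dataset").is_dir():
        for path in decompress_all():
            print("extracted", path)
```

Files that already exist are skipped rather than overwritten, so the script is safe to re-run.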
Corpus files (Stack Overflow, Python API docs, LeetCode, CodeSearchNet, Code-Contests) are not included due to size. Set their paths in config.json under corpus_config.
The pipeline runs in five stages. Each stage's output feeds the next. See the individual scripts for configuration options.
1. Embeddings — rag_technique/retriever_embed.py
2. Retrieval — retriever_embed.py (dense), retriever_bm25.py (sparse), retriever_hybrid.py (hybrid)
3. Context Refinement (optional)
   - Reranking: rerank/rerank_top_k.py
   - Compression: compression/llmlingua_compression.py, compression/recomp_reranking.py + recomp_compression.py
   - Query transformation: querytrans/query_transformation.py + querytrans_retrieval.py
4. Generation — rag_generation.py (standard); variant scripts in querytrans/, rerank/, compression/, no_embedder/
5. Parse & Evaluate — parse_solutions.py then evaluation scripts under each results_*/ directory
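To illustrate what the parsing step in stage 5 has to do, here is a minimal sketch of extracting code from raw LLM output. It assumes the model wraps its answer in a Markdown code fence and falls back to the whole response otherwise. It is not the logic of parse_solutions.py, which may handle more cases; `extract_code` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # built programmatically so the example can sit inside a fence

# A fenced block with an optional language tag; group 2 captures the body.
CODE_BLOCK = re.compile(
    FENCE + r"[ \t]*([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_code(llm_output):
    """Return the body of the first fenced block, or the stripped text if none."""
    match = CODE_BLOCK.search(llm_output)
    if match:
        return match.group(2).strip("\n")
    return llm_output.strip()


if __name__ == "__main__":
    reply = "Here is the fix:\n" + FENCE + "python\nprint('ok')\n" + FENCE
    print(extract_code(reply))
```

Taking the first fenced block is a simplification; responses that contain several blocks may need a different selection rule, such as the last or the longest block.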
The answers/ subdirectories can grow to ~2.4 GB total. Use the provided scripts to manage them:
# Compress all answers/ dirs (deletes originals by default)
python compress_answers.py
# Restore from archives
python restore_answers.py
# Or use the legacy manage_answers.py inside rag_technique/
python rag_technique/manage_answers.py status

A key contribution is an LLM-driven adaptive framework that configures an optimal RAG pipeline for any new task. The framework*.ipynb notebooks demonstrate this across five SE scenarios:
| Notebook | Scenario |
|---|---|
| framework.ipynb | Code generation (APPS) |
| framework_2.ipynb | Move Method refactoring |
| framework_3.ipynb | Repository-level code generation |
| framework_4.ipynb | API-guided code generation |
| framework_5.ipynb | Unit test generation |
The framework works in two steps: (1) a frontier LLM profiles the task into a structured YAML Task Profile, then (2) reasons over Rag_Class/EmpiricalRules.md to derive a concrete pipeline configuration. Set REASONING_LLM_KEY in each notebook to any model key defined in config.json.
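The two-step flow can be pictured as two chained LLM calls. The sketch below is an illustration only: the prompts, function names, and the `llm` callable are hypothetical, and the notebooks may structure the calls quite differently. The only details taken from this README are the order of the steps and the path to Rag_Class/EmpiricalRules.md.

```python
from pathlib import Path
from typing import Callable

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def profile_task(llm: LLM, task_description: str) -> str:
    """Step 1: ask the model for a YAML Task Profile of the task."""
    prompt = (
        "Profile the following software engineering task as a YAML Task Profile.\n\n"
        f"Task:\n{task_description}"
    )
    return llm(prompt)


def derive_pipeline(llm: LLM, task_profile_yaml: str, rules_markdown: str) -> str:
    """Step 2: ask the model to apply the empirical rules to the profile."""
    prompt = (
        "Using the empirical rules below, choose a RAG pipeline configuration "
        "for the profiled task.\n\n"
        f"Rules:\n{rules_markdown}\n\nTask Profile:\n{task_profile_yaml}"
    )
    return llm(prompt)


def configure(llm: LLM, task_description: str,
              rules_path: str = "Rag_Class/EmpiricalRules.md") -> str:
    """Chain both steps: profile the task, then reason over the rules file."""
    rules = Path(rules_path).read_text(encoding="utf-8")
    profile = profile_task(llm, task_description)
    return derive_pipeline(llm, profile, rules)
```

Passing the model in as a plain callable keeps the sketch independent of any provider SDK; in the notebooks, the model is selected through REASONING_LLM_KEY instead.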
@inproceedings{10.1145/3808190,
title = {Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks},
booktitle = {Proceedings of the ACM International Conference on the Foundations of Software Engineering (FSE)},
year = {2026},
doi = {10.1145/3808190},
url = {https://doi.org/10.1145/3808190}
}