security-pride/RAG-Empirical-SE

Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

This repository contains the implementation for the paper "Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks" (FSE 2026).

The study systematically evaluates RAG design choices across 3 SE tasks (code generation, summarization, repair) on 6 benchmarks, covering:

  • 4 query processing techniques
  • 7 retrieval models (sparse, dense, hybrid)
  • 4 context refinement methods (reranking + compression)
  • 6 LLM generators

Repository Structure

RAG-Empirical-SE/
├── compress_answers.py         # Compress answers/ dirs to .tar.gz
├── restore_answers.py          # Restore answers/ dirs from archives
│
├── dataset/                    # Benchmark datasets (compressed as .xz)
│   ├── code_gen/
│   ├── code_repair/
│   └── code_sum/
│
├── Rag_Class/                  # Core RAG component classes
│   ├── embed.py                # Embedder (multi-provider)
│   ├── generate.py             # Generator (unified LLM interface)
│   ├── prompt.py               # PromptBuilder (task-aware prompting)
│   ├── rerank.py               # Reranker (cross-encoder & generative)
│   ├── retrieve.py             # Retriever (FAISS top-k)
│   └── EmpiricalRules.md       # Empirical findings → pipeline rules
│
└── rag_technique/              # Experiment scripts and results
    ├── config.json             # Central configuration
    ├── manage_answers.py       # Compress/restore/inspect answer dirs
    │
    ├── retriever_bm25.py       # Sparse retrieval
    ├── retriever_embed.py      # Dense retrieval + embedding
    ├── retriever_hybrid.py     # Hybrid retrieval
    ├── rag_generation.py       # Standard RAG generation
    ├── parse_solutions.py      # Extract code from LLM output
    ├── corpus_source.py        # Corpus source analysis
    ├── cost_analysis.py        # Token cost analysis
    ├── standardize_name.py     # Normalize task_id across JSONL files
    │
    ├── framework.ipynb         # Adaptive RAG – APPS code generation
    ├── framework_2.ipynb       # Adaptive RAG – Move Method refactoring
    ├── framework_3.ipynb       # Adaptive RAG – Repo-level code generation
    ├── framework_4.ipynb       # Adaptive RAG – API-guided code generation
    ├── framework_5.ipynb       # Adaptive RAG – Unit test generation
    ├── framework.md            # Framework design documentation
    │
    ├── querytrans/             # Query transformation experiments
    ├── rerank/                 # Reranking experiments
    ├── compression/            # Context compression experiments
    ├── no_embedder/            # Zero-shot and golden-context baselines
    │
    ├── bge_m3/                 # Answer directories (per embedding model)
    ├── bm25/
    ├── gte_multilingual_base/
    ├── sfr_em_code_400m/
    ├── jina_em_v2_base_code/
    ├── multilingual_e5_small/
    ├── granite_em_278_multi/
    ├── rag_hybrid/
    │
    ├── results_apps/
    ├── results_debugbench/
    ├── results_humanevalpack/
    ├── results_lcb-gen/
    ├── results_lcb-repair/
    ├── results_lcb-sum/
    └── results_codexglue/

Setup

conda create -n rag_empirical python=3.12
conda activate rag_empirical
pip install -r requirements.txt

# Make Rag_Class importable
export PYTHONPATH=$(pwd):$PYTHONPATH

See Rag_Class/readme.md for detailed dependency instructions.
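In a notebook or IDE session the shell export above may not be in effect. A minimal sketch of the same idea from inside Python follows; the helper name `make_importable` is hypothetical and not part of the repository.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def make_importable(repo_root=".", package="Rag_Class"):
    """Put repo_root on sys.path and report whether `package` can then be found."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    # Drop cached directory listings so a freshly added path is searched.
    importlib.invalidate_caches()
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Run from the repository root, or pass its path explicitly.
    print("Rag_Class importable:", make_importable("."))
```

This only affects the current interpreter session, so the `export PYTHONPATH` line is still the simpler choice when running the scripts from a terminal.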

Configuration

Edit rag_technique/config.json before running any experiments:

  • base_dir_prefix: local paths for each embedding model's output directory
  • llm_providers → api_key: your API key (OpenAI-compatible endpoint)
  • datasets: paths to benchmark JSONL files
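Before launching a long run, it can help to confirm that the file parses and that the expected sections are present. The sketch below assumes the keys named in this README (`base_dir_prefix`, `llm_providers`, `datasets`, `corpus_config`) sit at the top level of config.json. That layout is an assumption, and `missing_config_keys` is a hypothetical helper rather than something shipped with the repository.

```python
import json
from pathlib import Path

# Keys named in this README; treating them as top-level entries is an assumption.
EXPECTED_KEYS = ("base_dir_prefix", "llm_providers", "datasets", "corpus_config")


def missing_config_keys(config_path):
    """Return the expected keys that are absent from the parsed config file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in EXPECTED_KEYS if key not in config]


if __name__ == "__main__":
    path = Path("rag_technique/config.json")
    if path.exists():
        print("missing keys:", missing_config_keys(path) or "none")
    else:
        print(f"{path} not found; run this from the repository root")
```

If the real file nests these keys differently, adjust `EXPECTED_KEYS` or the lookup accordingly.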

Datasets

Benchmarks are included as compressed .xz archives. Decompress before use:

xz -dk dataset/code_gen/apps.jsonl.xz

Directory      Benchmarks
code_gen/      APPS, LiveCodeBench-Gen, HumanEval, MBPP
code_repair/   DebugBench, HumanEvalPack, LiveCodeBench-Repair, QuixBugs
code_sum/      CodeXGLUE, LiveCodeBench-Sum

Corpus files (Stack Overflow, Python API docs, LeetCode, CodeSearchNet, Code-Contests) are not included due to size. Set their paths in config.json under corpus_config.
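To unpack every benchmark archive at once, or on a system without the `xz` command-line tool, the standard-library `lzma` module can do the same job as `xz -dk`. This is a minimal sketch: the function names are hypothetical, and it assumes each archive is a single file named `<name>.xz`, as in the example command.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(archive, keep=True):
    """Decompress foo.jsonl.xz to foo.jsonl beside it; keep the archive by default."""
    archive = Path(archive)
    target = archive.with_suffix("")  # strips the trailing ".xz"
    with lzma.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep:
        archive.unlink()
    return target


def decompress_all(dataset_root="dataset"):
    """Decompress every .xz archive under dataset_root that is not unpacked yet."""
    root = Path(dataset_root)
    if not root.is_dir():
        return []
    unpacked = []
    for archive in sorted(root.rglob("*.xz")):
        if not archive.with_suffix("").exists():
            unpacked.append(decompress_xz(archive))
    return unpacked


if __name__ == "__main__":
    for path in decompress_all():
        print("decompressed", path)
```

Archives are kept by default, mirroring the `-k` flag, so re-running the script skips files that are already unpacked.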


Running Experiments

The pipeline runs in five stages. Each stage's output feeds the next. See the individual scripts for configuration options.

1. Embeddings: rag_technique/retriever_embed.py

2. Retrieval: retriever_embed.py (dense), retriever_bm25.py (sparse), retriever_hybrid.py (hybrid)

3. Context Refinement (optional)

  • Reranking: rerank/rerank_top_k.py
  • Compression: compression/llmlingua_compression.py, compression/recomp_reranking.py + recomp_compression.py
  • Query transformation: querytrans/query_transformation.py + querytrans_retrieval.py

4. Generation: rag_generation.py (standard); variant scripts in querytrans/, rerank/, compression/, no_embedder/

5. Parse & Evaluate: parse_solutions.py, then the evaluation scripts under each results_*/ directory
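To give a feel for the parsing step in stage 5, the sketch below pulls code out of a model response that wraps it in a fenced block. It is an illustration only and is not taken from parse_solutions.py: the function name, the regular expression, and the choice to keep the last fenced block are all assumptions.

```python
import re

# The fence marker is built programmatically so this example can sit inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(llm_output):
    """Return the last fenced code block in llm_output, or the stripped text if none."""
    blocks = CODE_BLOCK.findall(llm_output)
    return blocks[-1].strip() if blocks else llm_output.strip()


if __name__ == "__main__":
    reply = (
        "Here is the fix:\n"
        + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
        + "\nLet me know if it works."
    )
    print(extract_code(reply))
```

Keeping the last block reflects a common pattern in which a model revises its answer within one reply; a real parser may need benchmark-specific handling.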


Managing Answer Directories

The answers/ subdirectories can grow to ~2.4 GB total. Use the provided scripts to manage them:

# Compress all answers/ dirs (deletes originals by default)
python compress_answers.py

# Restore from archives
python restore_answers.py

# Or use the legacy manage_answers.py inside rag_technique/
python rag_technique/manage_answers.py status

Adaptive RAG Framework

A key contribution is an LLM-driven adaptive framework that configures an optimal RAG pipeline for any new task. The framework*.ipynb notebooks demonstrate this across five SE scenarios:

Notebook            Scenario
framework.ipynb     Code generation (APPS)
framework_2.ipynb   Move Method refactoring
framework_3.ipynb   Repository-level code generation
framework_4.ipynb   API-guided code generation
framework_5.ipynb   Unit test generation

The framework works in two steps: (1) a frontier LLM profiles the task into a structured YAML Task Profile, and (2) the LLM then reasons over Rag_Class/EmpiricalRules.md to derive a concrete pipeline configuration. Set REASONING_LLM_KEY in each notebook to any model key defined in config.json.
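The two-step flow can be pictured as two chained LLM calls. The sketch below is a rough illustration under stated assumptions, not the notebooks' implementation: `configure_pipeline`, the prompt wording, and the `llm` callable (any function mapping a prompt string to a response string) are hypothetical.

```python
from typing import Callable


def configure_pipeline(task_description: str, rules_text: str,
                       llm: Callable[[str], str]) -> dict:
    """Make two LLM calls: profile the task, then derive a config from the rules."""
    # Step 1: turn the free-form task description into a structured profile.
    profile_prompt = (
        "Describe the following software engineering task as a YAML Task Profile.\n\n"
        f"Task:\n{task_description}"
    )
    task_profile = llm(profile_prompt)

    # Step 2: reason over the empirical rules with the profile as context.
    config_prompt = (
        "Using the empirical rules below, choose a RAG pipeline configuration "
        "for the profiled task.\n\n"
        f"Rules:\n{rules_text}\n\nTask Profile:\n{task_profile}"
    )
    pipeline_config = llm(config_prompt)
    return {"task_profile": task_profile, "pipeline_config": pipeline_config}


if __name__ == "__main__":
    # A stand-in model that echoes the prompt length, just to show the call pattern.
    result = configure_pipeline(
        "Generate unit tests for a Java class.",
        "(contents of Rag_Class/EmpiricalRules.md would go here)",
        llm=lambda prompt: f"<response to {len(prompt)} characters>",
    )
    print(sorted(result))
```

Passing the model as a plain callable keeps the sketch independent of any provider SDK; in the notebooks the model is selected through REASONING_LLM_KEY instead.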


Citation

@inproceedings{10.1145/3808190,
  title     = {Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks},
  booktitle = {Proceedings of the ACM International Conference on the Foundations of Software Engineering (FSE)},
  year      = {2026},
  doi       = {10.1145/3808190},
  url       = {https://doi.org/10.1145/3808190}
}
