Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
Existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions: their questions can often be cheated via disconnected reasoning or answered through simple factual recall. This limits the ability of such benchmarks to uncover the limitations of existing RAG systems.
This repository provides code for the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. It includes code to (1) create CRUMQs over two popular RAG datasets and (2) evaluate the resulting benchmark's effectiveness via experiments on leading retrieval-augmented LLMs.
Compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.
- Clone the Repository
- Create Environment & Install Dependencies
cd CRUMQs
chmod +x setup.sh && ./setup.sh
source unans_env/bin/activate
export PYTHONPATH=.
export TOKENIZERS_PARALLELISM=true
- Configure Environment Variables
Create a .env file (or set these in your shell) with the following keys:
- OPENAI_API_KEY: To access GPT models
- GEMINI_API_KEY: To access Gemini models
- COHERE_API_KEY: To access Cohere models
- LITELLM_API_KEY: To access additional models
- LITELLM_IP: LiteLLM server address
We use LiteLLM for inference with open-source models.
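For example, a minimal .env might look like the sketch below; all values are placeholders, and the LITELLM_IP host and port are purely illustrative rather than defaults of this repository:
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
COHERE_API_KEY=...
LITELLM_API_KEY=...
LITELLM_IP=http://0.0.0.0:4000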
- Dataset Preparation
To prepare the relevant files from the TREC NeuCLIR 2024 and TREC RAG 2025 datasets for CRUMQs generation and benchmarking, run the following:
cd CRUMQs
tar -xvf _database.tar
tar -xvf _benchmark_database.tar
- _benchmark_database/: Directory containing the database for benchmarking experiments.
- _database/
  - neuclir_test/: Directory containing golden retrieved articles per test topic for TREC NeuCLIR 2024.
  - trecrag2025/: Directory containing golden retrieved articles per test topic for TREC RAG 2025.
- test_requests_neuclir.jsonl: Test topics for TREC NeuCLIR 2024.
- test_requests_trecrag2025.jsonl: Test topics for TREC RAG 2025.
- src/: Core benchmark generation & evaluation code.
  - crumqs_generation/: Code to generate CRUMQs.
    - prompts/: Prompts used in the CRUMQs pipeline.
    - crawler.py: Code to scrape external news articles.
    - crawl_science.py: Code to scrape external scientific articles.
    - create_dataset.py: Main code to create CRUMQs.
    - dataset_format.py: Code for Pydantic representation of questions and answers.
    - filter_dataset.py: Code to filter & compile the resulting CRUMQs generated over the two document collections.
    - rag.py: Code to help verify CRUMQs unanswerability.
    - run.py: Code to run CRUMQs creation.
    - run*.sh: Sample run scripts for generating CRUMQs over the NeuCLIR & RAG collections.
    - utils*.py: Utility functions for various parts of the pipeline.
  - evaluation/: Code to evaluate RAG systems on CRUMQs and related benchmarks.
    - prompts_evaluation.py: Prompts to score RAG system responses.
    - prompts_generation.py: Prompts to elicit retrieval-augmented LLM responses.
    - run_rag.py: Code to implement and run RAG systems.
    - score_rag.py: Code to score the performance of RAG systems using different generator models.
- requirements.txt: Dependencies for environment creation.
- setup.sh: Script for environment creation.
We provide a simple pipeline for automatic generation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs) that are robust against reasoning shortcuts, target content beyond models' training data cutoff dates, and can be tailored to any document corpus.
In particular, we leverage recent insights in synthetic data generation to ensure coverage of diverse task types and complexity levels, with benchmark difficulty easily controllable via the distribution of in- vs. out-of-scope hops per question.
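As a rough illustration of how the hop distribution controls difficulty (this is not the repository's implementation; the function and parameter names below are hypothetical), each question can be assembled from a mix of hops answerable from the corpus and hops requiring out-of-scope evidence:
import random

def sample_hop_mix(total_hops: int, out_of_scope_ratio: float) -> list[str]:
    # Hypothetical sketch: decide which hops are answerable from the corpus
    # ("in-scope") vs. requiring external evidence ("out-of-scope").
    # A higher out_of_scope_ratio yields harder, less answerable questions.
    n_out = round(total_hops * out_of_scope_ratio)
    hops = ["out-of-scope"] * n_out + ["in-scope"] * (total_hops - n_out)
    random.shuffle(hops)
    return hops

# Example: 3-hop questions at two difficulty levels
print(sample_hop_mix(3, 0.0))   # fully in-scope, i.e. answerable
print(sample_hop_mix(3, 0.67))  # two of the three hops are out-of-scope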
To create CRUMQs over the TREC NeuCLIR 2024 and TREC RAG 2025 datasets, simply run the following:
cd CRUMQs
# Generate CRUMQs
bash ./src/crumqs_generation/run_neuclir.sh
bash ./src/crumqs_generation/run_trecrag.sh
# Compile & Filter CRUMQs
python ./src/crumqs_generation/filter_dataset.py
Details of the data generation parameters are described in run.py and may be altered via command-line arguments as desired (e.g., the number of total queries generated, or the number of documents to consider when generating question-answer pairs).
The resulting CRUMQs will be automatically saved, compiled, and filtered according to quality scores assigned via LLM judgment. Scoring thresholds may be adjusted in filter_dataset.py.
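Conceptually, the filtering step thresholds on the judge-assigned quality scores. A minimal sketch of that idea follows (the field name, threshold value, and input filename are hypothetical; see filter_dataset.py for the actual logic):
import json

QUALITY_THRESHOLD = 4  # hypothetical cutoff on the LLM judge's quality score

def filter_crumqs(in_path: str, out_path: str) -> None:
    # Keep only generated CRUMQs whose judge-assigned quality score clears the threshold.
    with open(in_path) as f:
        crumqs = json.load(f)
    kept = [q for q in crumqs if q.get("quality_score", 0) >= QUALITY_THRESHOLD]
    with open(out_path, "w") as f:
        json.dump(kept, f, indent=2)

# The input filename here is illustrative; the output path matches the evaluation commands below.
filter_crumqs("./generated_data/raw_crumqs.json", "./generated_data/filtered_crumqs.json")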
Use run_rag.py and score_rag.py to configure and benchmark different RAG systems on the final CRUMQs. For example:
cd CRUMQs; source ./unans_env/bin/activate
export PYTHONPATH=.; export TOKENIZERS_PARALLELISM=true
### RAG System: GPT-5 generator, OpenAI embedding, & Vector retriever
# Get predictions
python ./src/evaluation/run_rag.py --database_dir=./_database/_benchmark_database --results_dir=./_predictions --dataset_path=./generated_data/filtered_crumqs.json --evaluation_mode=gateway --generator_llm=gpt-5 --embedding=openai --retrieval=vector
# Score predictions
python ./src/evaluation/score_rag.py --gt_data_path=./generated_data/filtered_crumqs.json --results_path=./_predictions/gateway/filtered_crumqs/gpt-5__emb_openai__ret_vector__rerank_None__rewr_None__prompt_DEFAULT/predictions.json
### RAG System: GPT-4o generator, Cohere embedding, Vector retriever, Cohere reranker, & HyDE rewriter
# Get predictions
python ./src/evaluation/run_rag.py --database_dir=./_database/_benchmark_database --results_dir=./_predictions --dataset_path=./generated_data/filtered_crumqs.json --evaluation_mode=rag --generator_llm=gpt-4o --embedding=cohere --retrieval=vector --reranker=cohere --rewriting=hyde
# Score predictions
python ./src/evaluation/score_rag.py --gt_data_path=./generated_data/filtered_crumqs.json --results_path=./_predictions/rag/filtered_crumqs/gpt-4o__emb_cohere__ret_vector__rerank_cohere__rewr_hyde__prompt_DEFAULT/predictions.json
Evaluation metrics are defined consistently with prior work. Scores for each RAG system are automatically saved to the same directory as the predictions.
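To spot-check results after scoring, the saved predictions can be loaded directly; the scores file is written to the same directory, though its exact filename is not assumed here:
import json
from pathlib import Path

# Predictions path taken from the GPT-5 example above
pred_dir = Path("./_predictions/gateway/filtered_crumqs/gpt-5__emb_openai__ret_vector__rerank_None__rewr_None__prompt_DEFAULT")
with open(pred_dir / "predictions.json") as f:
    predictions = json.load(f)
print(f"Loaded {len(predictions)} predictions from {pred_dir}")
# The per-system scores are saved alongside predictions.json in this directory.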
If you find the content of this project helpful, please cite our paper as follows:
@article{liu2025crumqs,
title={Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries},
author={Gabrielle Kaili-May Liu and Bryan Li and Arman Cohan and William Gantt Walden and Eugene Yang},
journal={arXiv preprint arXiv:2510.11956},
year={2025},
url={https://arxiv.org/abs/2510.11956},
}