This repository is the official implementation of "Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?", which was accepted to NAACL 2025 and can be found here.
This repository contains the code and dataset to compare the accuracy and consistency of LLM question answering and reverse question answering.
Our dataset contains 3,443 question/answer pairs across four categories (Number, Number + Text, Easy Facts, Hard Facts), and can be accessed through HuggingFace here.
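If you prefer to pre-download the dataset locally (the inference code can also load it directly from HuggingFace via dataset_name), something like the following should work; the dataset identifier is a placeholder for the one linked above:

```bash
# Optional: pre-download the dataset with the HuggingFace CLI
# (replace <dataset-id> with the identifier from the link above)
huggingface-cli download <dataset-id> --repo-type dataset --local-dir ./data
```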
Python 3.10.0, pip 23.2.1, and CUDA 12.4 were used when running the code in this repository. A list of requirements can be found in requirements.txt, which can be installed through the following command:
pip install -r requirements.txt
Afterwards, if you would like to use HuggingFace models on GPU, you can install the relevant packages with the following command:
pip install torch==2.4.1+cu124 torchvision==0.19.1+cu124 torchaudio==2.4.1+cu124 --index-url https://download.pytorch.org/whl/cu124
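After installing, an optional sanity check such as the one below confirms that PyTorch can see your GPU:

```bash
# Should print "True" if CUDA is set up correctly
python -c "import torch; print(torch.cuda.is_available())"
```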
The files in this repository are organized as follows:
- /model/: Contains the code to run the three-step prompting pipeline
- /evaluation/: Contains the code to evaluate the accuracy and consistency of abduction and deduction
- /results/: Directory for storing model outputs; also contains code for parsing model outputs
- /sample_scripts/: Scripts to easily run the inference code in /model/
You can run inference on the HuggingFace models with the following command:
bash .../scripts/model.py
You can change the following parameters for each run (a sketch of a full invocation appears after these parameter lists):
- dataset_name: Where the dataset can be accessed (on HuggingFace or locally)
- inference_split: Split of the dataset to run inference on
- model_name: Name of the model on the API. String type
- model_type: Type of the model, i.e., where it is accessed from. All available models are listed in model/enums.py. Currently supports "hf_chat" (HuggingFace chat models), "open_ai" (OpenAI models), "cohere" (Cohere models), and "anthropic" (Anthropic models). String type
- run_name: Identifier for the run. String type
- device_map: Device map for the GPUs ("cpu", "cuda", "auto"). String type
- partition: Partition of the dataset. Can be "full" or in halves (e.g., "first_half"), quarters (e.g., "first_quarter"), or eighths (e.g., "first_eighth")
- experiments: List of strings denoting experiments to run. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, just for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)
API tokens (depending on the model) can be specified:
- hf_token: HuggingFace read token (for downloading gated models and datasets). String type
- open_ai_token: OpenAI token (GPT models). String type
- cohere_token: Cohere token (Command-R models). String type
- anthropic_token: Anthropic token (Claude models). String type
You can also specify the following generation hyperparameters:
- temperature: Model temperature. Float type
- min_tokens: Minimum tokens to generate. Integer type
- max_tokens: Maximum tokens to generate. Integer type
Finally, the following parameters set up the directories for storing model outputs:
- res_dir: Path to the folder where results are stored
- cache_dir: Path to the folder where HuggingFace model and dataset downloads are cached
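To see how these parameters fit together, here is a minimal sketch of an inference run. The exact script path and flag syntax are assumptions (check /sample_scripts/ for the canonical invocations), and all values are placeholders:

```bash
# Hypothetical sketch only -- flag names mirror the parameters documented above,
# but the real invocation may differ; see /sample_scripts/ for working examples.
python model/run_model.py \
  --dataset_name <huggingface-dataset-or-local-path> \
  --inference_split test \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --model_type hf_chat \
  --run_name llama3_rqa_demo \
  --device_map auto \
  --partition full \
  --experiments qg qa \
  --hf_token $HF_TOKEN \
  --temperature 0.7 \
  --min_tokens 5 \
  --max_tokens 256 \
  --res_dir ./results \
  --cache_dir ./cache
```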
After running any question answering or question generation prompt, you can parse the results with the following Python files in /results/:
- parse_answer.py
- parse_question.py
Both scripts use the following parameters (an example call is sketched after this list):
- experiments: List of strings denoting experiments to run. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, just for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)
- run_name: Identifier for the run. String type
- model_name: Name of the model on the API. String type
- res_dir: Path to the folder where results are stored
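For instance, a parsing call might look like the sketch below; the flag names are assumed to mirror the run parameters above rather than taken from the actual argument parser:

```bash
# Hypothetical sketch: parse the QA outputs of a finished run
python results/parse_answer.py \
  --experiments qa \
  --run_name llama3_rqa_demo \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --res_dir ./results
```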
Please note that to run the self-consistency check, you must perform a four-step process (sketched after this list):
1. run_model.py with a QG prompt
2. parse_question.py with the prompt from (1)
3. run_model.py with the QA prompt qa_selfcons
4. parse_answer.py with the QA prompt qa_selfcons
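Put end to end, the self-consistency check might look like the following sketch; script locations and flags are assumptions carried over from the examples above, and the remaining flags are omitted for brevity:

```bash
# Hypothetical end-to-end sketch of the self-consistency check
# (other flags -- dataset, model, tokens, dirs -- omitted for brevity)

# 1) Generate questions
python model/run_model.py        --experiments qg          --run_name selfcons_demo
# 2) Parse the generated questions from step (1)
python results/parse_question.py --experiments qg          --run_name selfcons_demo
# 3) Have the model answer its own questions
python model/run_model.py        --experiments qa_selfcons --run_name selfcons_demo
# 4) Parse the answers from step (3)
python results/parse_answer.py   --experiments qa_selfcons --run_name selfcons_demo
```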
The relevant repositories for computing question difficulty and token count are below:
- Question Difficulty: Prometheus LLM
- Token Count: Infini-Gram

If you would like to have these re-implemented in this repo, please raise an issue and let us know!
If you have questions, please feel free to raise an issue or contact either of the following authors of the repository: