This repository is the official implementation of "Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?", which was accepted to NAACL 2025 and can be found here.
This repository contains the code and dataset to compare the accuracy and consistency of LLM question answering and reverse question answering.
Our dataset contains 3,443 question/answer pairs across four categories (Number, Number + Text, Easy Facts, Hard Facts), and can be accessed through HuggingFace here.
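If you prefer to pre-download the dataset locally (the inference code can also load it directly from HuggingFace via dataset_name), something like the following should work; the dataset identifier is a placeholder for the one linked above:

```bash
# Optional: pre-download the dataset with the HuggingFace CLI
# (replace <dataset-id> with the identifier from the link above)
huggingface-cli download <dataset-id> --repo-type dataset --local-dir ./data
```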
Python 3.10.0, pip 23.2.1, and CUDA 12.4 were used when running the code in this repository. A list of requirements can be found in requirements.txt, which can be installed through the following command:
pip install -r requirements.txt
Afterwards, if you would like to use HuggingFace models on GPU, you can install the relevant packages with the following command:
pip install torch==2.4.1+cu124 torchvision==0.19.1+cu124 torchaudio==2.4.1+cu124 --index-url https://download.pytorch.org/whl/cu124
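After installing, an optional sanity check such as the one below confirms that PyTorch can see your GPU:

```bash
# Should print "True" if CUDA is set up correctly
python -c "import torch; print(torch.cuda.is_available())"
```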
The files in this repository are organized as follows:
- /model/: Contains the code to run the three-step prompting pipeline
- /evaluation/: Contains the code to evaluate the accuracy and consistency of abduction and deduction
- /results/: Directory for storing model outputs; also contains code for parsing model outputs
- /sample_scripts/: Scripts to easily run the inference code in /model/
You can run inference on the HuggingFace models with the following command:
bash .../scripts/model.py
You can change the following parameters for each run (a sketch of a full invocation appears after these parameter lists):
- dataset_name: Where the dataset can be accessed (on HuggingFace or locally)
- inference_split: Split of the dataset to run inference on
- model_name: Name of the model on the API. String type
- model_type: Type of the model, i.e., where it is accessed from. All available models are listed in model/enums.py. Currently supports "hf_chat" (HuggingFace chat models), "open_ai" (OpenAI models), "cohere" (Cohere models), and "anthropic" (Anthropic models). String type
- run_name: Identifier for the run. String type
- device_map: Device map for the GPUs ("cpu", "cuda", "auto"). String type
- partition: Partition of the dataset. Can be "full" or in halves (e.g., "first_half"), quarters (e.g., "first_quarter"), or eighths (e.g., "first_eighth")
- experiments: List of strings denoting experiments to run. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, just for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)
API tokens (depending on the model) can be specified:
- hf_token: HuggingFace read token (for downloading gated models and datasets). String type
- open_ai_token: OpenAI token (GPT models). String type
- cohere_token: Cohere token (Command-R models). String type
- anthropic_token: Anthropic token (Claude models). String type
You can also specify the following generation hyperparameters:
- temperature: Model temperature. Float type
- min_tokens: Minimum tokens to generate. Integer type
- max_tokens: Maximum tokens to generate. Integer type
Finally, the following parameters set up the directories for storing model outputs:
- res_dir: Path to the folder where results are stored
- cache_dir: Path to the folder where HuggingFace model and dataset downloads are cached
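To see how these parameters fit together, here is a minimal sketch of an inference run. The exact script path and flag syntax are assumptions (check /sample_scripts/ for the canonical invocations), and all values are placeholders:

```bash
# Hypothetical sketch only -- flag names mirror the parameters documented above,
# but the real invocation may differ; see /sample_scripts/ for working examples.
python model/run_model.py \
  --dataset_name <huggingface-dataset-or-local-path> \
  --inference_split test \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --model_type hf_chat \
  --run_name llama3_rqa_demo \
  --device_map auto \
  --partition full \
  --experiments qg qa \
  --hf_token $HF_TOKEN \
  --temperature 0.7 \
  --min_tokens 5 \
  --max_tokens 256 \
  --res_dir ./results \
  --cache_dir ./cache
```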
After running any question answering or question generation prompt, you can parse the results with the following Python files in /results/:
- parse_answer.py
- parse_question.py
Both scripts use the following parameters (an example call is sketched after this list):
- experiments: List of strings denoting experiments to run. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, just for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)
- run_name: Identifier for the run. String type
- model_name: Name of the model on the API. String type
- res_dir: Path to the folder where results are stored
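For instance, a parsing call might look like the sketch below; the flag names are assumed to mirror the run parameters above rather than taken from the actual argument parser:

```bash
# Hypothetical sketch: parse the QA outputs of a finished run
python results/parse_answer.py \
  --experiments qa \
  --run_name llama3_rqa_demo \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --res_dir ./results
```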
Please note that to run the self-consistency check, you must perform a four-step process (sketched after this list):
1. run_model.py with a QG prompt
2. parse_question.py with the prompt from (1)
3. run_model.py with the QA prompt qa_selfcons
4. parse_answer.py with the QA prompt qa_selfcons
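Put end to end, the self-consistency check might look like the following sketch; script locations and flags are assumptions carried over from the examples above, and the remaining flags are omitted for brevity:

```bash
# Hypothetical end-to-end sketch of the self-consistency check
# (other flags -- dataset, model, tokens, dirs -- omitted for brevity)

# 1) Generate questions
python model/run_model.py        --experiments qg          --run_name selfcons_demo
# 2) Parse the generated questions from step (1)
python results/parse_question.py --experiments qg          --run_name selfcons_demo
# 3) Have the model answer its own questions
python model/run_model.py        --experiments qa_selfcons --run_name selfcons_demo
# 4) Parse the answers from step (3)
python results/parse_answer.py   --experiments qa_selfcons --run_name selfcons_demo
```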
The relevant repositories for computing question difficulty and token count are below:
- Question Difficulty: Prometheus LLM
- Token Count: Infini-Gram

If you would like to have these re-implemented in this repo, please raise an issue and let us know!
If you have questions, please feel free to raise an issue or contact either of the following authors of the repository: