
Reverse Question Answering

This repository is the official implementation of "Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?", which was accepted to NAACL 2025 and can be found here.

Overview

This repository contains the code and dataset to compare the accuracy and consistency of LLM question answering and reverse question answering.

Dataset

Our dataset contains 3,443 question/answer pairs across four categories (Number, Number + Text, Easy Facts, Hard Facts) and can be accessed through Huggingface here.
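To load the dataset programmatically, you can use the Huggingface datasets library as sketched below; the id DATASET_ID_HERE is a placeholder, so substitute the id from the Huggingface link above.

python -c "from datasets import load_dataset; print(load_dataset('DATASET_ID_HERE'))"  # placeholder dataset id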

Setup

Python 3.10.0, pip 23.2.1, and CUDA 12.4 were used to run the code in this repository. The required packages are listed in requirements.txt and can be installed with the following command:

pip install -r requirements.txt 

Afterwards, if you would like to use HuggingFace models on GPU, you can install the relevant packages with the following command:

pip install torch==2.4.1+cu124 torchvision==0.19.1+cu124 torchaudio==2.4.1+cu124 --index-url https://download.pytorch.org/whl/cu124
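To confirm that the GPU build of PyTorch is picked up correctly, a quick sanity check is:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"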

The files in this repository are organized as follows:

  • /model/: Contains the code to run the three-step prompting pipeline
  • /evaluation/: Contains the code to evaluate the accuracy and consistency of abduction and deduction
  • /results/: Directory for storing model outputs and contains code for parsing model outputs
  • /sample_scripts/: Scripts to easily run the inference code in /model/

Model Inference

You can run inference on the Huggingface models with the following command:

bash .../scripts/model.py

You can change the following parameters for each run; an example invocation combining them is sketched after the parameter lists below:

  • dataset_name: Where the dataset can be accessed, either on Huggingface or locally
  • inference_split: Split of the dataset to run inference on
  • model_name: Name of the model on the API. String type
  • model_type: Type of the model, i.e., where it is accessed from. All available models are listed in model/enums.py. Currently supports "hf_chat" (HuggingFace chat models), "open_ai" (OpenAI models), "cohere" (Cohere models), and "anthropic" (Anthropic models). String type
  • run_name: Identifier for the run. String type
  • device_map: Device map for the GPUs ("cpu", "cuda", "auto"). String type
  • partition: Partition of the dataset. Can be "full" or a fraction, in halves (e.g., "first_half"), quarters (e.g., "first_quarter"), or eighths (e.g., "first_eighth")
  • experiments: List of strings denoting experiments to run. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, only for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)

API tokens (depending on the model) can be specified:

  • hf_token: Huggingface read token (for downloading gated models and datasets). String type
  • open_ai_token: OpenAI token (GPT models). String type
  • cohere_token: Cohere token (Command-R models). String type
  • anthropic_token: Anthropic token (Claude models). String type

You can also specify the following generation hyperparameters:

  • temperature: Model temperature. Float type
  • min_tokens: Minimum tokens to generate. Integer type
  • max_tokens: Maximum tokens to generate. Integer type

Finally, the following parameters set up the directories for storing model outputs:

  • res_dir: Path to the folder where results are stored
  • cache_dir: Path to the folder where Huggingface model and dataset downloads are cached
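Putting these parameters together, a hypothetical invocation might look like the sketch below. The script path, flag names, and values are assumptions (they simply mirror the parameter names above), so check the scripts in /sample_scripts/ for the exact interface.

# Hypothetical flag names mirroring the parameters above — consult /sample_scripts/ for the real interface
# export HF_TOKEN=... and set DATASET_ID to the Huggingface dataset id before running
python model/run_model.py \
  --dataset_name "$DATASET_ID" \
  --inference_split test \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --model_type hf_chat \
  --run_name llama3-rqa-run1 \
  --device_map auto \
  --partition full \
  --experiments qg qa \
  --hf_token "$HF_TOKEN" \
  --temperature 0.7 \
  --min_tokens 5 \
  --max_tokens 512 \
  --res_dir results/ \
  --cache_dir cache/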

Evaluation

After running any question answering or question generation prompt, you can parse the results with the following Python files in /results/:

  • parse_answer.py
  • parse_question.py

Both scripts use the following parameters; an example invocation is sketched after this list:

  • experiments: List of strings denoting experiments to parse. Currently supports "qg" (0-shot QG), "qg_cot" (0-shot QG with CoT), "qg_selfcheck" (LLM checks its own answer), "qg_fewshot" (few-shot QG, only for numerical entities), "qa" (0-shot QA), and "qa_selfcons" (QA on the LLM's own generated question)
  • run_name: Identifier for the run. String type
  • model_name: Name of the model on the API. String type
  • res_dir: Path to the folder where results are stored
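As a hypothetical example (flag names are assumed to mirror the parameters above; verify against the argument parsing in /results/):

# Assumed flags — check results/parse_answer.py for the actual interface
python results/parse_answer.py \
  --experiments qa \
  --run_name llama3-rqa-run1 \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --res_dir results/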

Please note that to run the self-consistency check, you must perform the following four-step process (a command-line sketch follows the list):

  1. run_model.py with a QG prompt
  2. parse_question.py with the prompt from (1)
  3. run_model.py with the QA prompt qa_selfcons
  4. parse_answer.py with the QA prompt qa_selfcons
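A hedged end-to-end sketch of these four steps is shown below; the script paths, flags, and values are assumptions, so adapt them to the actual interfaces in /model/ and /results/.

# Assumed paths and flags — adapt to the real scripts in /model/ and /results/
MODEL=meta-llama/Meta-Llama-3-8B-Instruct
RUN=selfcons-run1
# 1. Generate questions with a QG prompt
python model/run_model.py --experiments qg --model_name "$MODEL" --model_type hf_chat --run_name "$RUN" --res_dir results/
# 2. Parse the generated questions
python results/parse_question.py --experiments qg --model_name "$MODEL" --run_name "$RUN" --res_dir results/
# 3. Answer the model's own questions with the qa_selfcons prompt
python model/run_model.py --experiments qa_selfcons --model_name "$MODEL" --model_type hf_chat --run_name "$RUN" --res_dir results/
# 4. Parse the answers
python results/parse_answer.py --experiments qa_selfcons --model_name "$MODEL" --run_name "$RUN" --res_dir results/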

The relevant repositories for computing question difficulty and token count are below:

  • Question Difficulty: Prometheus LLM
  • Token Count: Infini-Gram

If you would like these re-implemented in this repo, please raise an issue and let us know!

Contact

If you have questions, please feel free to raise an issue or contact the authors of the repository.
