This is the repository for the paper *LLMs with Chain-of-Thought Are Non-Causal Reasoners*. We explore the role of Chain of Thought (CoT) in Large Language Model (LLM) reasoning, specifically:
- We assess the significance of the cause-effect relationship between CoT and answers across various tasks to unveil the Structural Causal Model (SCM) that LLMs emulate.
- We investigate the factors influencing the causal structure of the implied SCM across several distinct tasks.
In this study, we carefully selected datasets and tasks to benchmark arithmetic and logical reasoning performance. Here's an overview of the datasets and tasks used:
- Addition: Generated datasets for 6-digit and 9-digit numbers, with each category comprising 500 samples.
- Multiplication: Created datasets for 2-digit and 3-digit numbers, also with 500 samples per category.
- GSM8K: A collection of grade-school math word problems, from which we randomly selected 500 samples from the test set.
- ProofWriter: Focuses on deductive logical reasoning, with 600 instances chosen from the 5-hop-reasoning development set.
- FOLIO: Another dataset dedicated to deductive logical reasoning, utilizing all 204 instances from the development set.
- LOGIQA: Contains verbal reasoning exam questions. We randomly selected 600 entries from the LogiQA 2.0 test set.
All datasets have undergone preprocessing and are located within the `./data` folder. The corresponding prompts for each experimental setting are stored in the `./prompts` folder.
To generate data for arithmetic problems automatically, execute:

```bash
bash make_data.sh
```
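For reference, the generation logic for the addition task might look like the following minimal Python sketch. The field names (`question`, `answer`) and the output filename are illustrative assumptions; the actual format produced by `make_data.sh` may differ.

```python
# Minimal sketch of addition data generation (illustrative only):
# sample pairs of n-digit numbers and record the question and gold answer.
import json
import random

def make_addition_samples(n_digits, n_samples=500):
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    for _ in range(n_samples):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        yield {"question": f"{a} + {b} = ?", "answer": str(a + b)}

with open("addition_6digit.jsonl", "w") as f:  # hypothetical output path
    for sample in make_addition_samples(6):
        f.write(json.dumps(sample) + "\n")
```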
Begin by installing the necessary packages:

```bash
pip install -r requirements.txt
```
To assess performance on the selected tasks, execute the following commands. This will run the code and save the results:

```bash
export api_key=sk_xxxxxxxxxx  # your OpenAI API key
bash api_run.sh {MODEL} {TASK} {PROMPT}
```
- `MODEL`: Specify the model name; options include ['gpt-3.5-turbo', 'gpt-4', ...].
- `TASK`: Define the task name; choices are [Addition:6, Addition:9, Product:2, Product:3, GSM8K, LOGIQA, FOLIO, ProofWriter].
- `PROMPT`: Choose the prompt setting; options include [direct, cot0shot, cot4shot, ...].
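For example, `bash api_run.sh gpt-3.5-turbo GSM8K cot4shot` evaluates gpt-3.5-turbo on GSM8K with the 4-shot CoT prompt.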
For arithmetic problems, CoT correctness can be automatically verified using:
```bash
bash check.sh
```
For other tasks, correctness is manually checked.
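As a rough illustration of such an automatic check, the sketch below treats the last number in the model output as the predicted answer and compares it to the gold answer. The helpers `extract_final_number` and `is_correct` are hypothetical; the actual logic in `check.sh` may differ.

```python
# Illustrative answer check for arithmetic tasks: take the last
# integer-like token in the model output as the predicted answer.
import re

def extract_final_number(text):
    numbers = re.findall(r"-?\d[\d,]*", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output, gold_answer):
    pred = extract_final_number(model_output)
    return pred is not None and int(pred) == int(gold_answer)
```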
To investigate how the outcomes change under interventions on the random variables, use the command below:

```bash
bash interfere.sh {MODEL} {TASK} {PROMPT} {DO_REASON} {DO_BIAS} {DO_ROLE}
```
- `DO_REASON`: Intervene on the CoT. Use `defaultreason` for the originally generated CoT, `goldreason` for the gold CoT, or `randomreason` for interventions on the number, subject, or logic.
- `DO_BIAS`: Introduce bias into the prompt. Options are `nobias`, `weakbias`, or `strongbias`.
- `DO_ROLE`: Assign a different role in the prompt, such as `defaultrole` (math teacher) or `randomrole` (detective, chef, or judge).
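For example, `bash interfere.sh gpt-3.5-turbo GSM8K cot4shot randomreason nobias defaultrole` measures how the answers of gpt-3.5-turbo on GSM8K change when the generated CoT is replaced with a random reason.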
Before intervening on the CoT for reasoning tasks, first generate the random reasons with ChatGPT:

```bash
bash random_reason.sh {MODEL} {TASK} {PROMPT}
```
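For instance: `bash random_reason.sh gpt-3.5-turbo GSM8K cot4shot`.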
To conduct McNemar’s test, execute:
```bash
bash mcnemar_test.sh {SETTING}
```
- `SETTING`: Specify the setting name. Options include `Direct.vs.CoT` for testing the difference between direct answering and CoT, or one of the following for specific comparisons: [`GoldCoT.vs.Default`, `RandCoT.vs.Default`, `RandRole.vs.Default|DefaultCoT`, `RandRole.vs.Default|GoldCoT`, `RandBias.vs.Default|DefaultCoT`, `RandBias.vs.Default|GoldCoT`].
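For background, McNemar's test operates on a 2×2 table of paired correctness outcomes over the same questions. The Python sketch below shows the idea using `statsmodels`; the counts are illustrative, and `mcnemar_test.sh` may compute the test differently.

```python
# McNemar's test on paired correctness outcomes (illustrative counts):
# rows = correct/incorrect under setting A (e.g. direct answering),
# cols = correct/incorrect under setting B (e.g. CoT prompting).
from statsmodels.stats.contingency_tables import mcnemar

table = [[120, 35],
         [60, 285]]
result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
```

The test only uses the off-diagonal counts (questions on which the two settings disagree), which is why it suits paired comparisons over the same question set.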
To evaluate the performance of open Large Language Models (LLMs), first deploy the model using vLLM (OpenAI-Compatible Server). After deployment, update the `api_base` in both the `api_run.sh` and `interfere.sh` scripts to the address of your deployed model. For detailed instructions on deploying models with vLLM, please refer to the vLLM documentation.
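For example, with a recent vLLM release, a command along the lines of `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf --port 8000` (the model name here is only an example) serves an OpenAI-compatible API at `http://localhost:8000/v1`, which you would then use as the `api_base`.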
The outcomes of our experiments are documented and stored in the `./exp_cot` folder. This repository includes all results discussed in our paper, covering task performance metrics and intervention analysis outcomes.
Our code is based on LogicLLM.
If you find our work beneficial, please cite our paper as follows:
```bibtex
@misc{bao2024llms,
      title={LLMs with Chain-of-Thought Are Non-Causal Reasoners},
      author={Guangsheng Bao and Hongbo Zhang and Linyi Yang and Cunxiang Wang and Yue Zhang},
      year={2024},
      eprint={2402.16048},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```