Our work studies the evaluation and enhancement of LLM meta-cognition observation methods, which aim to capture a model's self-awareness of its errors from internal states during the reasoning process.
Figure 1. In reasoning tasks, error detection (a) focuses on LLMs' cognitive ability to analyze errors in reasoning steps. Self-evaluation (b) uses measures such as entropy as lenses that reflect self-awareness of answer correctness. Our work (c) studies the evaluation and improvement of the current lenses in reflecting LLM meta-cognition. Bold "correct" and "wrong" within boxes denote the ground-truth correctness of the answer or step.
```shell
conda create -n automeco python=3.11
conda activate automeco
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

To avoid instability from remote access, our code loads models from local files. Download the model you want to deploy (e.g., Qwen2.5-7B) into the Model folder and add the name of the model folder to MODEL_POOL in config_pool.py.
Model/

```
Model
└── Qwen2.5-7B                 ### add
    ├── config.json
    ├── model-00001-...
    ├── tokenizer_config.json
    └── ...
```

config_pool.py
```python
MODEL_POOL = [
    "Qwen2.5-7B",  ### add
]
```

You need to add the dataset (in .jsonl format) to be tested into the Data folder (e.g., math500).
Data/

```
Data
└── math500.jsonl  ### add
```

Each sample must contain at least the following keys and values:
```
{
    "id": 1,                                                         ### Unique identifier
    "en": "Find the units digit of $29 \\cdot 79 + 31 \\cdot 81$.",  ### Question described in English
    "answer": "2"                                                    ### Standard answer without the solution process
}
```

Similarly, please add the dataset name to DATASET_POOL in config_pool.py.
config_pool.py

```python
DATASET_POOL = [
    "math",  ### add
]
```

The instructions corresponding to different datasets are stored under DATASET_PROMPTS in prompt_pool.py. We provide instructions for all the datasets used in the paper.
prompt_pool.py

```python
DATASET_PROMPTS = {
    "mgsm": "Solve this math problem step by step. Give the reasoning steps using 'Step n: ' before each step to distinguish between different steps, where n is a positive integer starting from 1, representing the current step number. Then give the final answer on the last line by itself in the format of \"Answer:\". Do not add anything other than the integer answer after \"Answer:\".\n\nQuestion:\n{input_data}\n",  ### add
}
```

Note: Please keep the prompt in Python string-format style (with the {input_data} placeholder) so that input questions can be inserted automatically.
This step collects the LLM's intrinsic rewards produced by self-evaluation methods during the reasoning process and annotates step correctness with a PRM. The two stages, Inference and Rewarding, are included in the script. The outputs will be saved in the OutputInfo folder.
```shell
bash Scripts/automeco.sh
```

You can modify the models and datasets in this script:
```shell
#!/bin/bash
export PROJECT_PATH=""                 # Set your project path
export CUDA_VISIBLE_DEVICES="0,1,2,3"  # Set your CUDA devices

reward_model_list=(Qwen2.5-Math-PRM-7B)
model_list=(Qwen2.5-7B Meta-Llama-3-8B-Instruct Mistral-7B-Instruct-v0.2)
dataset_list=(mgsm math500 minervamath)

# Inference
for model_name in "${model_list[@]}"; do
    for i in "${dataset_list[@]}"; do
        python main.py --model_name "$model_name" \
                       --dataset "$i" \
                       --task_type inference >> "$PROJECT_PATH/log/$model_name/log_${i}_infer.out"
    done
done

# Rewarding
for prm in "${reward_model_list[@]}"; do
    for model_name in "${model_list[@]}"; do
        for i in "${dataset_list[@]}"; do
            python main.py --model_name "$model_name" \
                           --dataset "$i" \
                           --task_type rewarding \
                           --reward_model_name "$prm" >> "$PROJECT_PATH/log/$model_name/log_${i}_reward.out"
        done
    done
done
```

- Optional Parameters:
  - print_model_parameter: print the number of parameters of the model.
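The number this flag reports can also be reproduced by hand; counting a PyTorch module's parameters is a one-liner (a generic sketch, not the repo's implementation):

```python
import torch.nn as nn


def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """Total number of (optionally only trainable) parameters in a model."""
    return sum(p.numel() for p in model.parameters()
               if p.requires_grad or not trainable_only)


# Example on a tiny module; for an LLM, pass the loaded model instead.
tiny = nn.Linear(10, 2)        # 10*2 weights + 2 biases = 22 parameters
print(count_parameters(tiny))  # → 22
```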
- Core Implementation of Intrinsic Rewarding Methods
  - In score.py.

```shell
bash Scripts/bon.sh
```

You can modify the models and datasets in this script:
```shell
#!/bin/bash
export PROJECT_PATH=""                 # Set your project path
export CUDA_VISIBLE_DEVICES="0,1,2,3"  # Set your CUDA devices

reward_model_list=(Qwen2.5-Math-PRM-7B)
model_list=(Qwen2.5-7B Meta-Llama-3-8B-Instruct Mistral-7B-Instruct-v0.2)
dataset_list=(mgsm math500 minervamath)

# Best-of-N Verification
for model_name in "${model_list[@]}"; do
    for i in "${dataset_list[@]}"; do
        python main.py --model_name "$model_name" \
                       --dataset "$i" \
                       --task_type verification \
                       --num_sequence 6 \
                       --adjust True
    done
done
```

- Optional Parameters:
  - adjust: If adjust is True, both the default self-evaluation methods and their MIRA-enhanced versions will be evaluated. Otherwise, MIRA will not be tested.
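At its core, Best-of-N verification reduces to picking the candidate answer with the highest confidence score among the N sampled solutions. A minimal sketch (the function and data layout are illustrative, not the repo's API):

```python
def best_of_n(candidates):
    """Pick the answer whose score is highest among N sampled solutions.

    `candidates` is a list of (answer, score) pairs, where the score may come
    from a self-evaluation lens (e.g., negative entropy) or a PRM.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


# Six sampled solutions with their scores; the highest-scored answer wins.
print(best_of_n([("2", 0.31), ("12", 0.87), ("2", 0.55),
                 ("8", 0.12), ("12", 0.64), ("2", 0.49)]))  # → 12
```

A sharper lens, i.e., scores better aligned with actual correctness, directly improves this selection, which is why Best-of-N accuracy serves as a downstream check on the self-evaluation methods.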
Code for computing the evaluation metrics is integrated in Experiments.ipynb.
We sincerely thank the authors of Chain-of-Embedding for their open-sourced code.
If you find AutoMeco or MIRA useful for your research or applications, please consider citing our paper:
```
@article{ma2025large,
  title={Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens},
  author={Ma, Ziyang and Yuan, Qingyue and Wang, Zhenglin and Zhou, Deyu},
  journal={arXiv preprint arXiv:2506.08410},
  year={2025}
}
```
