A research project investigating whether cross-model critique actually improves outputs through genuine revision, or whether performance gains are primarily attributable to the stronger model re-solving the problem independently.
When a stronger LLM critiques and revises a weaker LLM's output, does the weaker model's draft contribute meaningfully — or is the improvement just the stronger model doing the work from scratch?
The study uses two model pairs across 16 conditions:
- C-series: Gemini Flash Lite (weak generator) → GPT-4o-mini (strong critic)
- D-series: GPT-4o-mini (weak generator) → Gemini Flash (strong critic)
Each series has four conditions, designed to decompose the source of improvement:
| Condition | Setup | Measures |
|---|---|---|
| x1 | Weak model alone | Baseline |
| x2 | Strong model sees question + real draft | Full critique |
| x3 | Strong model sees question only (no draft) | Re-solving effect |
| x4 | Strong model sees question + dummy draft | Framing effect |
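For concreteness, here is a hypothetical sketch of how the four prompt variants might differ. The function name, dummy-draft text, and structure below are illustrative only and do not reflect the actual code in experiments.py:

```python
# Illustrative only: what the model sees under each condition.
# build_prompt and DUMMY_DRAFT are hypothetical names, not the repo's API.
DUMMY_DRAFT = "I'm not sure, but my answer is (A)."  # content-free placeholder draft

def build_prompt(condition: str, question: str, weak_draft: str | None = None) -> str:
    if condition == "x1":   # baseline: the weak model answers on its own
        return question
    if condition == "x3":   # re-solving: the strong model sees the question only
        return f"Question:\n{question}\n\nProvide your best final answer."
    # x2 uses the weak model's real draft; x4 substitutes a dummy draft,
    # so any gain over x3 reflects framing rather than draft content.
    draft = weak_draft if condition == "x2" else DUMMY_DRAFT
    return (
        f"Question:\n{question}\n\n"
        f"A draft answer from another model:\n{draft}\n\n"
        "Critique the draft and give a revised final answer."
    )
```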
Effect decomposition:
Total gain (x2 - x1) = Re-solving (x3 - x1) + Framing (x4 - x3) + Content (x2 - x4)
A near-zero or negative content effect indicates the stronger model is not genuinely using the draft.
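In code, the decomposition is just differences of per-condition accuracies. A minimal sketch, assuming each argument is the fraction of correct answers under the named condition (summary.py reports these quantities; the function below is illustrative, not its actual interface):

```python
def decompose(acc_x1: float, acc_x2: float, acc_x3: float, acc_x4: float) -> dict:
    """Split the total gain of full critique (x2) over the weak baseline (x1)."""
    return {
        "total_gain": acc_x2 - acc_x1,  # full critique vs. weak baseline
        "re_solving": acc_x3 - acc_x1,  # strong model answering from the question alone
        "framing":    acc_x4 - acc_x3,  # effect of seeing *a* draft, even a dummy one
        "content":    acc_x2 - acc_x4,  # effect of the real draft's actual content
    }

# The three components telescope to the total by construction:
# (x3 - x1) + (x4 - x3) + (x2 - x4) = x2 - x1
```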
Role-swap variants (x1r–x4r) swap the generator and critic roles between the two model pairs.
```
.
├── runner.py                             # CLI entry point for running experiments
├── experiments.py                        # Logic for all 16 conditions
├── llm.py                                # Unified API wrapper for GPT and Gemini (with caching)
├── loaders.py                            # Dataset loaders (GPQA, HLE, MBPP, HumanEval+, LiveCodeBench)
├── evaluate.py                           # Per-dataset correctness evaluators
├── summary.py                            # Accuracy tables, McNemar's tests, effect decomposition
├── hypothesis_test.py                    # Pairwise McNemar's test between two trial runs
├── livecodebench_difficulty_analysis.py  # Stratified analysis by problem difficulty
├── cache.pkl                             # Persistent LLM response cache
└── trials/                               # Experiment outputs (one directory per run)
    └── {dataset}_{condition}_{timestamp}/
        ├── summary.json                  # Accuracy and metadata
        └── details.json                  # Per-sample prompts, outputs, and correctness
```
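Both summary.py and hypothesis_test.py report McNemar's test on paired per-sample correctness. As a reference for the statistic (not the scripts' actual implementation), a minimal exact-test sketch using scipy, assuming two aligned lists of booleans from two trial runs:

```python
from scipy.stats import binomtest

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact McNemar's test p-value for paired per-sample correctness."""
    # Only discordant pairs matter: samples where exactly one run is correct.
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    if a_only + b_only == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # Under H0 the discordant pairs split 50/50 between the two directions.
    return binomtest(a_only, a_only + b_only, p=0.5, alternative="two-sided").pvalue
```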
Datasets:
| Dataset | Task Type | Size |
|---|---|---|
| GPQA | Graduate-level MCQ | ~198 |
| HLE | Humanity's Last Exam MCQ | ~2700 |
| LiveCodeBench | Competition coding (post-2024) | ~400+ |
Run an experiment:

```bash
python runner.py --dataset gpqa --condition C2 --top_n 198
```

Analyze results:

```bash
python summary.py --datasets gpqa hle
```

Pairwise statistical test:

```bash
python hypothesis_test.py trials/gpqa_C2_... trials/gpqa_C3_...
```

Setup:

- Install dependencies: `openai`, `google-genai`, `datasets`, `scipy`, `tqdm`, `python-dotenv`
- Create a `.env` file with `OPENAI_API_KEY` and `GEMINI_API_KEY` (see the example below)
- Run experiments with `runner.py`
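A minimal `.env`, with placeholder values:

```
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
```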