A research project investigating whether cross-model critique actually improves outputs through genuine revision, or whether performance gains are primarily attributable to the stronger model re-solving the problem independently.
When a stronger LLM critiques and revises a weaker LLM's output, does the weaker model's draft contribute meaningfully — or is the improvement just the stronger model doing the work from scratch?
The study uses two model pairs across 16 conditions:
- C-series: Gemini Flash Lite (weak generator) → GPT-4o-mini (strong critic)
- D-series: GPT-4o-mini (weak generator) → Gemini Flash (strong critic)
Each series has four conditions, designed to decompose the source of improvement:
| Condition | Setup | Measures |
|---|---|---|
| x1 | Weak model alone | Baseline |
| x2 | Strong model sees question + real draft | Full critique |
| x3 | Strong model sees question only (no draft) | Re-solving effect |
| x4 | Strong model sees question + dummy draft | Framing effect |
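For concreteness, here is a hypothetical sketch of how the four prompt variants might differ. The function name, dummy-draft text, and structure below are illustrative only and do not reflect the actual code in experiments.py:

```python
# Illustrative only: what the model sees under each condition.
# build_prompt and DUMMY_DRAFT are hypothetical names, not the repo's API.
DUMMY_DRAFT = "I'm not sure, but my answer is (A)."  # content-free placeholder draft

def build_prompt(condition: str, question: str, weak_draft: str | None = None) -> str:
    if condition == "x1":   # baseline: the weak model answers on its own
        return question
    if condition == "x3":   # re-solving: the strong model sees the question only
        return f"Question:\n{question}\n\nProvide your best final answer."
    # x2 uses the weak model's real draft; x4 substitutes a dummy draft,
    # so any gain over x3 reflects framing rather than draft content.
    draft = weak_draft if condition == "x2" else DUMMY_DRAFT
    return (
        f"Question:\n{question}\n\n"
        f"A draft answer from another model:\n{draft}\n\n"
        "Critique the draft and give a revised final answer."
    )
```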
Effect decomposition:
Total gain (x2 - x1) = Re-solving (x3 - x1) + Framing (x4 - x3) + Content (x2 - x4)
A near-zero or negative content effect indicates the stronger model is not genuinely using the draft.
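In code, the decomposition is just differences of per-condition accuracies. A minimal sketch, assuming each argument is the fraction of correct answers under the named condition (summary.py reports these quantities; the function below is illustrative, not its actual interface):

```python
def decompose(acc_x1: float, acc_x2: float, acc_x3: float, acc_x4: float) -> dict:
    """Split the total gain of full critique (x2) over the weak baseline (x1)."""
    return {
        "total_gain": acc_x2 - acc_x1,  # full critique vs. weak baseline
        "re_solving": acc_x3 - acc_x1,  # strong model answering from the question alone
        "framing":    acc_x4 - acc_x3,  # effect of seeing *a* draft, even a dummy one
        "content":    acc_x2 - acc_x4,  # effect of the real draft's actual content
    }

# The three components telescope to the total by construction:
# (x3 - x1) + (x4 - x3) + (x2 - x4) = x2 - x1
```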
Role-swap variants (x1r–x4r) swap the generator and critic roles between the two model pairs.
```
.
├── runner.py                             # CLI entry point for running experiments
├── experiments.py                        # Logic for all 16 conditions
├── llm.py                                # Unified API wrapper for GPT and Gemini (with caching)
├── loaders.py                            # Dataset loaders (GPQA, HLE, MBPP, HumanEval+, LiveCodeBench)
├── evaluate.py                           # Per-dataset correctness evaluators
├── summary.py                            # Accuracy tables, McNemar's tests, effect decomposition
├── hypothesis_test.py                    # Pairwise McNemar's test between two trial runs
├── livecodebench_difficulty_analysis.py  # Stratified analysis by problem difficulty
├── cache.pkl                             # Persistent LLM response cache
└── trials/                               # Experiment outputs (one directory per run)
    └── {dataset}_{condition}_{timestamp}/
        ├── summary.json                  # Accuracy and metadata
        └── details.json                  # Per-sample prompts, outputs, and correctness
```
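Both summary.py and hypothesis_test.py report McNemar's test on paired per-sample correctness. As a reference for the statistic (not the scripts' actual implementation), a minimal exact-test sketch using scipy, assuming two aligned lists of booleans from two trial runs:

```python
from scipy.stats import binomtest

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact McNemar's test p-value for paired per-sample correctness."""
    # Only discordant pairs matter: samples where exactly one run is correct.
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    if a_only + b_only == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # Under H0 the discordant pairs split 50/50 between the two directions.
    return binomtest(a_only, a_only + b_only, p=0.5, alternative="two-sided").pvalue
```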
Datasets:
| Dataset | Task Type | Size |
|---|---|---|
| GPQA | Graduate-level MCQ | ~198 |
| HLE | Humanity's Last Exam MCQ | ~2700 |
| LiveCodeBench | Competition coding (post-2024) | ~400+ |
Run an experiment:

```bash
python runner.py --dataset gpqa --condition C2 --top_n 198
```

Analyze results:

```bash
python summary.py --datasets gpqa hle
```

Pairwise statistical test:

```bash
python hypothesis_test.py trials/gpqa_C2_... trials/gpqa_C3_...
```

Setup:

- Install dependencies: `openai`, `google-genai`, `datasets`, `scipy`, `tqdm`, `python-dotenv`
- Create a `.env` file with `OPENAI_API_KEY` and `GEMINI_API_KEY` (see the example below)
- Run experiments with `runner.py`
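A minimal `.env`, with placeholder values:

```
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
```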