
Revision or Re-Solving?

A research project investigating whether cross-model critique actually improves outputs through genuine revision, or whether performance gains are primarily attributable to the stronger model re-solving the problem independently.

Research Question

When a stronger LLM critiques and revises a weaker LLM's output, does the weaker model's draft contribute meaningfully — or is the improvement just the stronger model doing the work from scratch?

Experimental Design

The study uses two model pairs; each pair runs four conditions plus four role-swap variants, for 16 conditions in total:

  • C-series: Gemini Flash Lite (weak generator) → GPT-4o-mini (strong critic)
  • D-series: GPT-4o-mini (weak generator) → Gemini Flash (strong critic)

Each series has four conditions, designed to decompose the source of improvement:

Condition   Setup                                         Measures
x1          Weak model alone                              Baseline
x2          Strong model sees question + real draft       Full critique
x3          Strong model sees question only (no draft)    Re-solving effect
x4          Strong model sees question + dummy draft      Framing effect

Effect decomposition:

Total gain (x2 - x1) = Re-solving (x3 - x1) + Framing (x4 - x3) + Content (x2 - x4)

A near-zero or negative content effect indicates the stronger model is not genuinely using the draft.
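
For concreteness, here is a minimal sketch of the decomposition arithmetic; the accuracy values and the acc dict are illustrative, not taken from summary.py:

# Illustrative decomposition of accuracy gains (hypothetical numbers;
# summary.py computes these from real trial results).
acc = {"x1": 0.42, "x2": 0.61, "x3": 0.57, "x4": 0.58}

re_solving = acc["x3"] - acc["x1"]   # strong model solving from scratch
framing    = acc["x4"] - acc["x3"]   # effect of seeing any draft at all (dummy)
content    = acc["x2"] - acc["x4"]   # effect of the real draft's content
total      = acc["x2"] - acc["x1"]   # equals the sum of the three terms

print(f"total={total:.2f}  re-solving={re_solving:.2f}  "
      f"framing={framing:.2f}  content={content:.2f}")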

Role-swap variants (x1r–x4r) swap the generator and critic roles between the two model pairs.

Project Structure

.
├── runner.py                          # CLI entry point for running experiments
├── experiments.py                     # Logic for all 16 conditions
├── llm.py                             # Unified API wrapper for GPT and Gemini (with caching)
├── loaders.py                         # Dataset loaders (GPQA, HLE, MBPP, HumanEval+, LiveCodeBench)
├── evaluate.py                        # Per-dataset correctness evaluators
├── summary.py                         # Accuracy tables, McNemar's tests, effect decomposition
├── hypothesis_test.py                 # Pairwise McNemar's test between two trial runs
├── livecodebench_difficulty_analysis.py  # Stratified analysis by problem difficulty
├── cache.pkl                          # Persistent LLM response cache
└── trials/                            # Experiment outputs (one directory per run)
    └── {dataset}_{condition}_{timestamp}/
        ├── summary.json               # Accuracy and metadata
        └── details.json               # Per-sample prompts, outputs, and correctness
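
Trial runs can be inspected directly from these files. The sketch below assumes summary.json exposes an "accuracy" field; the actual schema may differ:

# Sketch: list accuracy per trial run. The "accuracy" key is an assumption
# about summary.json's schema; adjust to the real field names.
import json
from pathlib import Path

for run_dir in sorted(Path("trials").iterdir()):
    summary_file = run_dir / "summary.json"
    if summary_file.exists():
        with summary_file.open() as f:
            summary = json.load(f)
        print(run_dir.name, summary.get("accuracy"))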

Datasets

Dataset         Task Type                        Size
GPQA            Graduate-level MCQ               ~198
HLE             Humanity's Last Exam MCQ         ~2700
LiveCodeBench   Competition coding (post-2024)   ~400+

Usage

Run an experiment:

python runner.py --dataset gpqa --condition C2 --top_n 198

Analyze results:

python summary.py --datasets gpqa hle

Pairwise statistical test:

python hypothesis_test.py trials/gpqa_C2_... trials/gpqa_C3_...
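
hypothesis_test.py compares two runs sample by sample. For reference, an exact McNemar's test over paired correctness vectors can be written with scipy's binomial test; this is a sketch, not the script's actual implementation:

# Sketch of an exact McNemar's test on paired per-sample correctness.
# a and b are correctness booleans for the same samples under two conditions.
from scipy.stats import binomtest

def mcnemar_exact(a, b):
    only_a = sum(x and not y for x, y in zip(a, b))  # correct under A only
    only_b = sum(y and not x for x, y in zip(a, b))  # correct under B only
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    return binomtest(only_a, n, p=0.5).pvalue

print(mcnemar_exact([True, True, False, True], [True, False, False, True]))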

Setup

  1. Install dependencies: openai, google-genai, datasets, scipy, tqdm, python-dotenv
  2. Create a .env file with OPENAI_API_KEY and GEMINI_API_KEY
  3. Run experiments with runner.py
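
For reference, loading the keys with python-dotenv looks roughly like this (llm.py presumably does something equivalent):

# Sketch: load API keys from a local .env file, e.g.
#   OPENAI_API_KEY=sk-...
#   GEMINI_API_KEY=...
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
openai_key = os.environ["OPENAI_API_KEY"]
gemini_key = os.environ["GEMINI_API_KEY"]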
