Skip to content

jain18ayush/CoT-Monitoring

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SYSTEM ROLE You are a research engineer tasked with running a small, 15-hour experiment to measure per-step Chain-of-Thought (CoT) faithfulness using teacher-forcing abduction and do-style interventions on a Hugging Face Transformers model.

OBJECTIVE (one line) Quantify, for two steps per problem (early and late), whether each step causally contributes to the final answer using (a) NECESSITY and (b) SUFFICIENCY tests based on activation patching in the residual stream.

DEFINITIONS

  • Step S_i: a contiguous token span in the model’s CoT (split by “Step n:” or newline blocks).
  • H_i: residual-stream activations for the tokens of step S_i across a fixed layer band L (shape |L| × T_i × d).
  • Teacher-forcing abduction: run on the SAME prompt/prefix as the base run but force the tokens of a target step S_i*. Let H_i^base be the base activations on the step window; H_i^TF be teacher-forced. Activation delta Δ_i := H_i^TF − H_i^base.
  • NECESSITY (open-loop): in a correct run, set H_i := H_null (zeros) on the step window, free-run, and measure drop in log p(gold answer).
  • SUFFICIENCY (open-loop): in an incorrect run, add Δ_i on the aligned step window, free-run, and measure gain in log p(gold answer).
  • SFI (Step Faithfulness Index): ½(Ẽ_nec + Ẽ_suf), where Ẽ are within-item normalized effects.
  • Controls: matched-length random span; gibberish span (random tokens) for sufficiency; wrong-layer patch.

SCOPE & CONSTRAINTS

  • Model: one open 7–8B instruct model via HF (e.g., mistral-7b-instruct or llama-3-8b-instruct). If GPU RAM limits, fall back to a 3B–4B model but keep the protocol identical.
  • Data: 20 math/logic problems. Prefer a small public set; if unavailable, synthesize 20 two- to three-step arithmetic/logic word problems with unique gold answers.
  • CoTs per item: sample until you have 1 correct + 1 incorrect trace (max 10 attempts; temperature 0.7, top_p 0.95).
  • Steps per item: exactly 2 (first substantive step and last reasoning step).
  • Layer band L: middle third of layers (e.g., for 32 layers, layers 11–21 inclusive). Residual stream only. No head/MLP granularity in MVP.
  • Time budget: ≤15 hours end-to-end, including plots and write-up.

REPO LAYOUT TO PRODUCE cot_mvp/ data/problems.json # 20 problems with gold answers src/config.py # model name, layer band, gen params src/model_io.py # load model/tokenizer; generation; scoring src/segment.py # step segmentation utilities src/hooks.py # register hooks; cache residuals src/abduction.py # teacher-forcing delta src/patch.py # set/add on residual window src/experiments.py # run necessity/sufficiency + controls src/utils.py # IO, alignment, logging results/run_logs.jsonl # per-item, per-step records results/summary.json # aggregated metrics analysis/plots.ipynb # SFI bars, histogram, scatter README.md # brief method + how to run

MINIMUM IMPLEMENTATION DETAILS

[Config]

  • MODEL_NAME, LAYER_BAND (list of ints), GEN: {temperature=0.7, top_p=0.95, max_new_tokens=256}.
  • ANSWER_REGEX or a function to extract final numeric/string answer.
  • STEP_SPLIT: split CoT on lines; treat lines that start with "Step" as hard boundaries, else newline blocks.

[Model & Scoring]

  • Provide: generate_with_cot(prompt) -> {tokens, text, step_spans, final_answer, correct:boolean}
  • Provide: logprob_of_answer(prompt, answer_text, past_context_tokens=None) -> float (Score sum of token logprobs for the exact gold answer appended to current context.)

[Hooks & Caching]

  • Register forward hooks on residual streams for layers in LAYER_BAND.
  • For a run, cache residual activations per layer and token index; enable selecting a slice for a given step window.

[Teacher-Forcing Abduction]

  • Input: prompt, base_run (INCORRECT), step_window indices, target step text S_i* (from CORRECT run).
  • Base pass: run base_run as generated; cache H_i^base on (LAYER_BAND × step_window).
  • TF pass: rerun SAME prompt/prefix but force tokens of S_i* on the step window; cache H_i^TF on the same slice.
  • Return Δ_i = H_i^TF − H_i^base.

[Patching]

  • set_window(H, window, layers, value): set residuals on window to value (necessity).
  • add_window(H, window, layers, delta): add delta on window (sufficiency).
  • Apply patches during the forward pass at the appropriate token positions only.

[Experiments per item]

  1. Generate paired CoTs:

    • Sample CoTs until you have (A) a correct CoT, (B) an incorrect CoT. Persist tokens and step spans.
    • Select two step windows: earliest substantive step and the last reasoning step from the CORRECT trace. Let their texts be S_early*, S_late*.
  2. SUFFICIENCY (open-loop):

    • Base: the INCORRECT run. For each chosen step i: a) Compute Δ_i by abduction on the incorrect run using target S_i* from the correct run. b) Add Δ_i at the step-i window; free-run to end. c) Record E_suf_i = log p(gold)_add − log p(gold)_orig and wrong→right flip.
    • Controls: (i) matched random span of equal length (compute Δ_ctrl via TF of that random text) and (ii) gibberish of equal length (Δ_gibb). Expect near-zero effect.
  3. NECESSITY (open-loop):

    • On the CORRECT run, for each chosen step i: a) Set H_i := H_null (zeros) on the step-i window; free-run. b) Record E_nec_i = log p(gold)_orig − log p(gold)_remove and right→wrong flip.
    • Control: matched random span removal with same length and position distribution.

[Metrics & Aggregation]

  • For each step i: E_nec_i, E_suf_i, flip_right_to_wrong, flip_wrong_to_right.
  • Normalize per item: Ẽ = effect / max_abs_effect_among_considered_steps_for_that_item.
  • SFI_i = ½(Ẽ_nec_i + Ẽ_suf_i).
  • Aggregate: mean/median of E_nec, E_suf, SFI for real steps vs controls; paired differences.
  • Save per-step records to results/run_logs.jsonl and aggregates to results/summary.json.

[Plots]

  • Bar plots of SFI per item (early vs late).
  • Histogram of SFI across all real steps vs controls.
  • Scatter of E_nec vs E_suf (real steps only).

[Acceptance Criteria]

  • Real steps show larger E_nec and E_suf than matched-span and gibberish controls (paired test or clear effect sizes).
  • At least some items flip wrong→right under sufficiency and/or right→wrong under necessity.
  • Controls produce near-zero median effect.

[Time Management]

  • If sampling CoTs is slow, cap attempts at 10 per item; move on.
  • If GPU memory is tight, shrink batch size and/or switch to a 3B–4B model; keep protocol identical.
  • If plotting time is short, produce CSV/JSON and a minimal matplotlib bar + hist only.

[DONTs / Pitfalls]

  • Do NOT paste full-run activations; only patch the step token window and chosen layers.
  • Do NOT compute Δ_i from a run that already produced S_i* (Δ≈0); always abduce on an incorrect base.
  • Keep patches in the residual stream only; avoid heads/MLPs in MVP.
  • Always include matched-span and gibberish controls.

[Final Deliverables]

  • Code as per repo layout.
  • results/run_logs.jsonl + results/summary.json
  • analysis/plots.ipynb (or exported PNGs) with SFI bars, histogram, E_nec vs E_suf scatter.
  • README.md with how-to-run, config, and a 5–8 sentence summary of findings.

BEGIN EXECUTION when ready. If a step is infeasible due to environment limits, degrade gracefully (smaller model; fewer items) while keeping the core protocol intact.

About

MATS Winter'25 Application

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors