GitHub - jain18ayush/CoT-Monitoring: MATS Winter'25 Application

SYSTEM ROLE You are a research engineer tasked with running a small, 15-hour experiment to measure per-step Chain-of-Thought (CoT) faithfulness using teacher-forcing abduction and do-style interventions on a Hugging Face Transformers model.

OBJECTIVE (one line) Quantify, for two steps per problem (early and late), whether each step causally contributes to the final answer using (a) NECESSITY and (b) SUFFICIENCY tests based on activation patching in the residual stream.

DEFINITIONS

Step S_i: a contiguous token span in the model’s CoT (split by “Step n:” or newline blocks).
H_i: residual-stream activations for the tokens of step S_i across a fixed layer band L (shape |L| × T_i × d).
Teacher-forcing abduction: run on the SAME prompt/prefix as the base run but force the tokens of a target step S_i*. Let H_i^base be the base activations on the step window; H_i^TF be teacher-forced. Activation delta Δ_i := H_i^TF − H_i^base.
NECESSITY (open-loop): in a correct run, set H_i := H_null (zeros) on the step window, free-run, and measure drop in log p(gold answer).
SUFFICIENCY (open-loop): in an incorrect run, add Δ_i on the aligned step window, free-run, and measure gain in log p(gold answer).
SFI (Step Faithfulness Index): ½(Ẽ_nec + Ẽ_suf), where Ẽ are within-item normalized effects.
Controls: matched-length random span; gibberish span (random tokens) for sufficiency; wrong-layer patch.

SCOPE & CONSTRAINTS

Model: one open 7–8B instruct model via HF (e.g., mistral-7b-instruct or llama-3-8b-instruct). If GPU RAM limits, fall back to a 3B–4B model but keep the protocol identical.
Data: 20 math/logic problems. Prefer a small public set; if unavailable, synthesize 20 two- to three-step arithmetic/logic word problems with unique gold answers.
CoTs per item: sample until you have 1 correct + 1 incorrect trace (max 10 attempts; temperature 0.7, top_p 0.95).
Steps per item: exactly 2 (first substantive step and last reasoning step).
Layer band L: middle third of layers (e.g., for 32 layers, layers 11–21 inclusive). Residual stream only. No head/MLP granularity in MVP.
Time budget: ≤15 hours end-to-end, including plots and write-up.

REPO LAYOUT TO PRODUCE cot_mvp/ data/problems.json # 20 problems with gold answers src/config.py # model name, layer band, gen params src/model_io.py # load model/tokenizer; generation; scoring src/segment.py # step segmentation utilities src/hooks.py # register hooks; cache residuals src/abduction.py # teacher-forcing delta src/patch.py # set/add on residual window src/experiments.py # run necessity/sufficiency + controls src/utils.py # IO, alignment, logging results/run_logs.jsonl # per-item, per-step records results/summary.json # aggregated metrics analysis/plots.ipynb # SFI bars, histogram, scatter README.md # brief method + how to run

MINIMUM IMPLEMENTATION DETAILS

[Config]

MODEL_NAME, LAYER_BAND (list of ints), GEN: {temperature=0.7, top_p=0.95, max_new_tokens=256}.
ANSWER_REGEX or a function to extract final numeric/string answer.
STEP_SPLIT: split CoT on lines; treat lines that start with "Step" as hard boundaries, else newline blocks.

[Model & Scoring]

Provide: generate_with_cot(prompt) -> {tokens, text, step_spans, final_answer, correct:boolean}
Provide: logprob_of_answer(prompt, answer_text, past_context_tokens=None) -> float (Score sum of token logprobs for the exact gold answer appended to current context.)

[Hooks & Caching]

Register forward hooks on residual streams for layers in LAYER_BAND.
For a run, cache residual activations per layer and token index; enable selecting a slice for a given step window.

[Teacher-Forcing Abduction]

Input: prompt, base_run (INCORRECT), step_window indices, target step text S_i* (from CORRECT run).
Base pass: run base_run as generated; cache H_i^base on (LAYER_BAND × step_window).
TF pass: rerun SAME prompt/prefix but force tokens of S_i* on the step window; cache H_i^TF on the same slice.
Return Δ_i = H_i^TF − H_i^base.

[Patching]

set_window(H, window, layers, value): set residuals on window to value (necessity).
add_window(H, window, layers, delta): add delta on window (sufficiency).
Apply patches during the forward pass at the appropriate token positions only.

[Experiments per item]

Generate paired CoTs:
- Sample CoTs until you have (A) a correct CoT, (B) an incorrect CoT. Persist tokens and step spans.
- Select two step windows: earliest substantive step and the last reasoning step from the CORRECT trace. Let their texts be S_early*, S_late*.
SUFFICIENCY (open-loop):
- Base: the INCORRECT run. For each chosen step i: a) Compute Δ_i by abduction on the incorrect run using target S_i* from the correct run. b) Add Δ_i at the step-i window; free-run to end. c) Record E_suf_i = log p(gold)_add − log p(gold)_orig and wrong→right flip.
- Controls: (i) matched random span of equal length (compute Δ_ctrl via TF of that random text) and (ii) gibberish of equal length (Δ_gibb). Expect near-zero effect.
NECESSITY (open-loop):
- On the CORRECT run, for each chosen step i: a) Set H_i := H_null (zeros) on the step-i window; free-run. b) Record E_nec_i = log p(gold)_orig − log p(gold)_remove and right→wrong flip.
- Control: matched random span removal with same length and position distribution.

[Metrics & Aggregation]

For each step i: E_nec_i, E_suf_i, flip_right_to_wrong, flip_wrong_to_right.
Normalize per item: Ẽ = effect / max_abs_effect_among_considered_steps_for_that_item.
SFI_i = ½(Ẽ_nec_i + Ẽ_suf_i).
Aggregate: mean/median of E_nec, E_suf, SFI for real steps vs controls; paired differences.
Save per-step records to results/run_logs.jsonl and aggregates to results/summary.json.

[Plots]

Bar plots of SFI per item (early vs late).
Histogram of SFI across all real steps vs controls.
Scatter of E_nec vs E_suf (real steps only).

[Acceptance Criteria]

Real steps show larger E_nec and E_suf than matched-span and gibberish controls (paired test or clear effect sizes).
At least some items flip wrong→right under sufficiency and/or right→wrong under necessity.
Controls produce near-zero median effect.

[Time Management]

If sampling CoTs is slow, cap attempts at 10 per item; move on.
If GPU memory is tight, shrink batch size and/or switch to a 3B–4B model; keep protocol identical.
If plotting time is short, produce CSV/JSON and a minimal matplotlib bar + hist only.

[DONTs / Pitfalls]

Do NOT paste full-run activations; only patch the step token window and chosen layers.
Do NOT compute Δ_i from a run that already produced S_i* (Δ≈0); always abduce on an incorrect base.
Keep patches in the residual stream only; avoid heads/MLPs in MVP.
Always include matched-span and gibberish controls.

[Final Deliverables]

Code as per repo layout.
results/run_logs.jsonl + results/summary.json
analysis/plots.ipynb (or exported PNGs) with SFI bars, histogram, E_nec vs E_suf scatter.
README.md with how-to-run, config, and a 5–8 sentence summary of findings.

BEGIN EXECUTION when ready. If a step is infeasible due to environment limits, degrade gracefully (smaller model; fewer items) while keeping the core protocol intact.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
notebooks		notebooks
plots		plots
scripts		scripts
.DS_Store		.DS_Store
.gitignore		.gitignore
.python-version		.python-version
Causal_CoT_Report (1).md		Causal_CoT_Report (1).md
Dataset.md		Dataset.md
README.md		README.md
ablation.jsonl		ablation.jsonl
ablation_results_v1.jsonl		ablation_results_v1.jsonl
causal.jsonl		causal.jsonl
dataset.ipynb		dataset.ipynb
final_answers.csv		final_answers.csv
final_answers_clean.csv		final_answers_clean.csv
highlight (1).py		highlight (1).py
main.py		main.py
plot_cot_causal.py		plot_cot_causal.py
pyproject.toml		pyproject.toml
runs.jsonl		runs.jsonl
runs_cleaned.jsonl		runs_cleaned.jsonl
runs_official.jsonl		runs_official.jsonl
tasks.todo		tasks.todo
temp.txt		temp.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages