From f8d0598ba1147f2e5a01d5512b8e85e62a8a05b6 Mon Sep 17 00:00:00 2001
From: amay
Date: Sat, 28 Mar 2026 04:13:45 +0100
Subject: [PATCH 01/74] docs: add campaign scaffolding, evidence summary, and V100-compat patch

- 5-run A100 evidence package (baseline, seed42, LowerLR, warmdown, smoke)
- Campaign sessions 01-07 with templates and runbook
- Pre-TTT anchor diff analysis and root port-gap audit
- V100/fp16 compatibility via AMP_DTYPE auto-detection
- Agent coordination files (CLAUDE.md, AGENTS.md, AGENT_SYNC.md)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 AGENTS.md | 25 +
 CLAUDE.md | 32 ++
 docs/campaign/AGENT_SYNC.md | 206 +++++++
 docs/campaign/PEGASUS_H100_RUNBOOK.md | 86 +++
 docs/campaign/PROMPT_TEMPLATE.md | 59 ++
 docs/campaign/README.md | 115 ++++
 .../01_lineage_and_environment_audit.md | 304 ++++++++++
 .../artifacts/02a_pegasus_verification.md | 127 +++++
 .../03a_pre_ttt_anchor_diff_analysis.md | 301 ++++++++++
 .../03b_root_train_gpt_port_gap_audit.md | 539 ++++++++++++++++++
 .../2026-03-28_a100_evidence_summary.md | 143 +++++
 docs/campaign/artifacts/README.md | 12 +
 .../01_lineage_and_environment_audit.md | 44 ++
 .../sessions/02_pegasus_baseline_ladder.md | 92 +++
 .../sessions/03_pre_ttt_anchor_port.md | 45 ++
 .../sessions/04_targeted_delta_sweep.md | 42 ++
 .../sessions/05_ttt_correctness_audit.md | 37 ++
 .../06_attribution_graph_sidecar_probe.md | 77 +++
 .../campaign/sessions/06_rfn_sidecar_probe.md | 41 ++
 docs/campaign/sessions/07_go_no_go_review.md | 41 ++
 .../templates/EXPERIMENT_SUMMARY_TEMPLATE.md | 33 ++
 .../PEGASUS_H100_TORCHRUN_TEMPLATE.sh | 20 +
 .../templates/RUN_MANIFEST_TEMPLATE.md | 45 ++
 docs/codex-memory/BOOTSTRAP.md | 39 ++
 docs/codex-memory/README.md | 58 ++
 docs/codex-memory/decisions.md | 45 ++
 docs/codex-memory/hardware-and-constraints.md | 34 ++
 docs/codex-memory/leaderboard-techniques.md | 46 ++
 docs/codex-memory/next-session.md | 130 +++++
 docs/codex-memory/project-state.md | 150 +++++
 .../rfn-and-attribution-assessment.md | 32 ++
 docs/codex-memory/session-handoff.md | 172 ++++++
 docs/codex-memory/sync-codex-memory.sh | 22 +
 train_gpt.py | 49 +-
 34 files changed, 3232 insertions(+), 11 deletions(-)
 create mode 100644 AGENTS.md
 create mode 100644 CLAUDE.md
 create mode 100644 docs/campaign/AGENT_SYNC.md
 create mode 100644 docs/campaign/PEGASUS_H100_RUNBOOK.md
 create mode 100644 docs/campaign/PROMPT_TEMPLATE.md
 create mode 100644 docs/campaign/README.md
 create mode 100644 docs/campaign/artifacts/01_lineage_and_environment_audit.md
 create mode 100644 docs/campaign/artifacts/02a_pegasus_verification.md
 create mode 100644 docs/campaign/artifacts/03a_pre_ttt_anchor_diff_analysis.md
 create mode 100644 docs/campaign/artifacts/03b_root_train_gpt_port_gap_audit.md
 create mode 100644 docs/campaign/artifacts/2026-03-28_a100_evidence_summary.md
 create mode 100644 docs/campaign/artifacts/README.md
 create mode 100644 docs/campaign/sessions/01_lineage_and_environment_audit.md
 create mode 100644 docs/campaign/sessions/02_pegasus_baseline_ladder.md
 create mode 100644 docs/campaign/sessions/03_pre_ttt_anchor_port.md
 create mode 100644 docs/campaign/sessions/04_targeted_delta_sweep.md
 create mode 100644 docs/campaign/sessions/05_ttt_correctness_audit.md
 create mode 100644 docs/campaign/sessions/06_attribution_graph_sidecar_probe.md
 create mode 100644 docs/campaign/sessions/06_rfn_sidecar_probe.md
 create mode 100644 docs/campaign/sessions/07_go_no_go_review.md
 create mode 100644 docs/campaign/templates/EXPERIMENT_SUMMARY_TEMPLATE.md
 create mode 100755 docs/campaign/templates/PEGASUS_H100_TORCHRUN_TEMPLATE.sh
 create mode 100644 docs/campaign/templates/RUN_MANIFEST_TEMPLATE.md
 create mode 100644 docs/codex-memory/BOOTSTRAP.md
 create mode 100644 docs/codex-memory/README.md
 create mode 100644 docs/codex-memory/decisions.md
 create mode 100644 docs/codex-memory/hardware-and-constraints.md
 create mode 100644 docs/codex-memory/leaderboard-techniques.md
 create mode 100644 docs/codex-memory/next-session.md
 create mode 100644 docs/codex-memory/project-state.md
 create mode 100644 docs/codex-memory/rfn-and-attribution-assessment.md
 create mode 100644 docs/codex-memory/session-handoff.md
 create mode 100755 docs/codex-memory/sync-codex-memory.sh

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..f1f5460490
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,25 @@
+# Shared Agent Entry Point
+
+Start here for both Claude Code and Codex.
+
+## Read First
+
+1. `docs/campaign/AGENT_SYNC.md`
+2. `CLAUDE.md`
+
+## Purpose
+
+`docs/campaign/AGENT_SYNC.md` is the mutable source of truth for:
+
+- current objective
+- current scope
+- latest measured results
+- next commands to run
+
+`CLAUDE.md` contains the standing coordination rules for sessions, updates, and disagreement handling.
+
+## Current Working Mode
+
+- Active goal: A100 development evidence for a larger compute request
+- Next runs: `a100_baseline_600s`, then `a100_lowerlr_600s`
+- Out of scope: H100 parity claim, Session 03 anchor port, arbitrary strategy drift
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000000..e6caa90c14
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,32 @@
+# Coordination Rules
+
+This repo uses one shared handoff protocol for Claude Code and Codex.
+
+## Read Order
+
+1. Read `docs/campaign/AGENT_SYNC.md` first.
+2. If the task touches campaign strategy or prior experiments, also read:
+   - `docs/codex-memory/project-state.md`
+   - `docs/codex-memory/next-session.md`
+   - `docs/codex-memory/decisions.md`
+
+## Session Rules
+
+1. Treat `docs/campaign/AGENT_SYNC.md` as the current source of truth for:
+   - active objective
+   - current scope
+   - next commands to run
+   - latest measured results
+2. Before starting new work, check `docs/campaign/artifacts/` and `records/` to avoid duplicating completed work.
+3. If you change the objective, next step, or interpretation of results, update `docs/campaign/AGENT_SYNC.md`.
+4. If you make a campaign-level decision or disagree with an earlier recommendation, record it in `docs/codex-memory/decisions.md`.
+5. If you finish a meaningful session, update `docs/codex-memory/project-state.md` and `docs/codex-memory/next-session.md`.
+
+## Working Agreement
+
+- Current goal is development evidence for a larger compute request, not a leaderboard-parity claim.
+- Pegasus `A100-80GB` is the active development target.
+- Pegasus `H100` remains the parity target, but is not the current blocking path.
+- RunPod is reserved for final validation only.
+- `git clone` and `git pull` are the default sync path for remote workspaces.
+- Use `rsync` only to push local uncommitted changes quickly.
diff --git a/docs/campaign/AGENT_SYNC.md b/docs/campaign/AGENT_SYNC.md
new file mode 100644
index 0000000000..c5529e609f
--- /dev/null
+++ b/docs/campaign/AGENT_SYNC.md
@@ -0,0 +1,206 @@
+# Agent Sync
+
+Date: 2026-03-28
+
+## Current Objective
+
+Turn the completed Pegasus `A100-80GB` evidence runs into a short, defensible compute-grant evidence note.
+
+This is not yet an H100-parity validation campaign.
+
+## In Scope
+
+- Summarize the completed `A100-80GB` smoke run
+- Summarize the completed `600s` root baseline run
+- Summarize the completed `600s` `LowerLR` comparison run
+- Produce a short grant-ready evidence note with exact measured metrics
+
+## Out Of Scope
+
+- Claiming H100 parity
+- Starting the Session 03 anchor port
+- Treating RFN as the mainline strategy
+- Spending RunPod budget except for final validation later
+- Arbitrary trainer edits unrelated to hardware compatibility or the current baseline/comparison plan
+
+## Current Hardware Stance
+
+- Parity target: Pegasus `H100`
+- Active development target: Pegasus `A100-80GB`
+- Development fallback: other Pegasus GPUs only when A100/H100 are unavailable
+
+## Status Snapshot
+
+- Pegasus operator path: confirmed working
+- A100 smoke run: complete
+- A100 `600s` baseline run: complete
+- A100 `600s` `LowerLR` comparison: complete
+- A100 `600s` baseline seed-42 reproducibility run: complete
+- A100 `600s` warmdown-only variant: complete
+- Current best measured A100 result: root baseline (`val_bpb=1.37140771`)
+- Baseline seed spread is small (`+0.00319322` BPB from seed `1337` to seed `42`)
+- Immediate next deliverable: written summary, not another rerun
+
+## Canonical Workspaces
+
+- Local repo: `/home/amay/Work/parameter-golf`
+- Remote repo: `/netscratch/$USER/parameter-golf`
+
+Use `git clone` and `git pull` as the default remote sync path.
+Use `rsync` only when local uncommitted changes need to be pushed quickly.
+
+## Latest Measured Results
+
+Date: 2026-03-27
+Node: `serv-3333`
+GPU: `NVIDIA A100-SXM4-80GB`
+Run: `a100_baseline_smoke`
+
+Measured outputs:
+
+- Train setup: `1` shard, `200` iterations, `TRAIN_BATCH_TOKENS=65536`
+- `amp_dtype: bf16`
+- Step average at finish: `154.57 ms`
+- Pre-roundtrip eval: `val_loss=3.6186`, `val_bpb=2.1432`
+- Post-roundtrip exact eval: `val_loss=3.67022861`, `val_bpb=2.17371612`
+- Post-roundtrip eval time: `250881 ms`
+- Peak memory: `1548 MiB allocated`, `1566 MiB reserved`
+- Total submission size `int8+zlib`: `7066088` bytes
+
+Interpretation:
+
+- Pegasus execution path is working end to end on available hardware.
+- Artifact size is comfortably below the `16,000,000` byte cap.
+- This run is development evidence only, not H100-equivalent validation.
+
+Date: 2026-03-28
+Node: `serv-3333`
+GPU: `NVIDIA A100-SXM4-80GB`
+Run: `a100_baseline_600s`
+
+Measured outputs:
+
+- Train setup: `10` shards, `600s` wallclock cap
+- `amp_dtype: bf16`
+- Stopped at `907` steps in `600119 ms`
+- Pre-roundtrip eval: `val_loss=2.3117`, `val_bpb=1.3691`
+- Post-roundtrip exact eval: `val_loss=2.31556447`, `val_bpb=1.37140771`
+- Post-roundtrip eval time: `22204 ms`
+- Peak memory: `10253 MiB allocated`, `10578 MiB reserved`
+- Total submission size `int8+zlib`: `12046627` bytes
+
+Date: 2026-03-28
+Node: `serv-3333`
+GPU: `NVIDIA A100-SXM4-80GB`
+Run: `a100_lowerlr_600s`
+
+Measured outputs:
+
+- Train setup: `10` shards, `600s` wallclock cap
+- `amp_dtype: bf16`
+- Stopped at `908` steps in `600185 ms`
+- Pre-roundtrip eval: `val_loss=2.3142`, `val_bpb=1.3706`
+- Post-roundtrip exact eval: `val_loss=2.32600988`, `val_bpb=1.37759407`
+- Post-roundtrip eval time: `22206 ms`
+- Peak memory: `10253 MiB allocated`, `10530 MiB reserved`
+- Total submission size `int8+zlib`: `10723611` bytes
+
+Comparison:
+
+- On this 1xA100 600s setup, `LowerLR` is worse than the root baseline by `+0.00618636` BPB post-roundtrip.
+- `LowerLR` does reduce artifact size by `1323016` bytes, but size was not the limiting factor here.
+- For grant evidence, the useful conclusion is that the operator path is reproducible and controlled comparisons are already discriminating between variants.
+
+Date: 2026-03-28
+Node: `serv-3338`
+GPU: `NVIDIA A100-SXM4-80GB`
+Run: `a100_baseline_seed42_600s`
+
+Measured outputs:
+
+- Train setup: `10` shards, `600s` wallclock cap
+- `amp_dtype: bf16`
+- Stopped at `900` steps in `600537 ms`
+- Pre-roundtrip eval: `val_loss=2.3165`, `val_bpb=1.3720`
+- Post-roundtrip exact eval: `val_loss=2.32095610`, `val_bpb=1.37460093`
+- Post-roundtrip eval time: `22483 ms`
+- Peak memory: `10253 MiB allocated`, `10578 MiB reserved`
+- Total submission size `int8+zlib`: `12018778` bytes
+
+Reproducibility read:
+
+- Baseline seed `42` is worse than baseline seed `1337` by `+0.00319322` BPB post-roundtrip.
+- Step time and memory are effectively unchanged across baseline seeds.
+- This is strong enough to claim the baseline behavior is reproducible on Pegasus `A100-80GB`.
+
+Date: 2026-03-28
+Node: `serv-3338`
+GPU: `NVIDIA A100-SXM4-80GB`
+Run: `a100_warmdown3600_600s`
+
+Measured outputs:
+
+- Train setup: `10` shards, `600s` wallclock cap, `WARMDOWN_ITERS=3600`
+- `amp_dtype: bf16`
+- Stopped at `903` steps in `600661 ms`
+- Pre-roundtrip eval: `val_loss=2.3568`, `val_bpb=1.3958`
+- Post-roundtrip exact eval: `val_loss=2.38171775`, `val_bpb=1.41058741`
+- Post-roundtrip eval time: `22360 ms`
+- Peak memory: `10253 MiB allocated`, `10530 MiB reserved`
+- Total submission size `int8+zlib`: `9951155` bytes
+
+Schedule read:
+
+- Warmdown-only is worse than the root baseline by `+0.03917970` BPB post-roundtrip.
+- Warmdown-only is also worse than `LowerLR` by `+0.03299334` BPB post-roundtrip.
+- It does reduce artifact size by `2095472` bytes versus the root baseline, but size was not the bottleneck.
+- Current evidence says the root schedule should remain the A100 anchor.
+
+## Next Actions
+
+### 1. Extract comparable evidence lines
+
+```bash
+grep -E "amp_dtype:|step:.*val_loss:|stopping_early:|peak memory|Serialized model int8\\+zlib|Total submission size int8\\+zlib|final_int8_zlib_roundtrip" /netscratch/$USER/a100-*.log
+```
+
+### 2. Write grant-ready summary
+
+The summary should include:
+
+- the successful `A100-SXM4-80GB` smoke run
+- the `600s` baseline result
+- the `600s` `LowerLR` comparison result
+- the `600s` baseline seed-42 reproducibility result
+- the `600s` warmdown-only negative result
+- the conclusion that baseline currently beats `LowerLR` on this setup
+- the conclusion that baseline seed sensitivity appears small on this setup
+- the conclusion that extending warmdown alone is harmful on this setup
+- the fact that artifact sizes are already under the challenge cap
+
+### 3. Optional next experiment only after the summary exists
+
+If another controlled A100 run is needed, do not repeat `LowerLR`.
+Pick a different single-change variant.
+
+Candidate next variants after the summary:
+
+- a pure artifact-size tradeoff variant
+- an eval-side control, only if kept clearly separate from training-side comparisons
+
+Warmdown-only has now been tested and should not be repeated unless coupled to another materially different change.
+
+## Evidence Required From Each Run
+
+- GPU model
+- Exact command
+- Train wallclock / `step_avg`
+- Final post-roundtrip `val_bpb`
+- Artifact size
+- Peak memory
+- Any compile or export warnings
+
+## Decision Rule
+
+The two-run A100 baseline/comparison pair now exists.
+Do not broaden scope further until it is summarized into a short grant-ready evidence note.
diff --git a/docs/campaign/PEGASUS_H100_RUNBOOK.md b/docs/campaign/PEGASUS_H100_RUNBOOK.md
new file mode 100644
index 0000000000..7ec2cac75e
--- /dev/null
+++ b/docs/campaign/PEGASUS_H100_RUNBOOK.md
@@ -0,0 +1,86 @@
+# Pegasus H100 Runbook
+
+This is a lightweight operating note for Parameter Golf work on Pegasus.
+
+## Why this partition
+
+From `@docs/Pegasus_Server_documentation.txt`:
+- `H100` is `H100-SXM5`
+- 8 GPUs per node
+- NVSwitch connectivity
+- 80 GB per GPU
+
+This is the right Pegasus partition for challenge-like iteration.
+
+Do not use:
+- `batch` for multi-GPU challenge work
+- `H100-PCI` when the goal is closest parity with the challenge hardware
+
+## Minimum scheduling stance
+
+Prefer:
+- one node
+- 8 GPUs on that node
+- one process per GPU
+
+Important scheduler notes from the Pegasus docs:
+- keep all GPUs on one node for this workload
+- add `--gpu-bind=none` for peer-to-peer visibility in multi-GPU jobs
+- avoid mixed-model allocations from the `batch` partition
+
+## Suggested job metadata to record
+
+For every real run, capture:
+- partition
+- node name
+- GPU type
+- GPU count
+- seed
+- training script path
+- dataset path
+- tokenizer path
+- train batch tokens
+- train seq len
+- iterations
+- wallclock cap
+- step average
+- pre-quant `val_bpb`
+- post-quant `val_bpb`
+- sliding `val_bpb`
+- artifact bytes
+- code bytes
+- total bytes
+
+Use `templates/RUN_MANIFEST_TEMPLATE.md` and `templates/EXPERIMENT_SUMMARY_TEMPLATE.md`.
+
+## Practical command shape
+
+This is not a final cluster script, just the scheduling shape to preserve:
+
+```bash
+srun \
+  --partition=H100 \
+  --nodes=1 \
+  --ntasks=8 \
+  --gpus=8 \
+  --cpus-per-gpu= \
+  --gpu-bind=none \
+  --time=01:00:00 \
+  bash -lc '
+    cd &&
+    torchrun --standalone --nproc_per_node=8