Closed
74 commits
f8d0598
docs: add campaign scaffolding, evidence summary, and V100-compat patch
amrayach Mar 28, 2026
5841286
ops: add optimized Pegasus launcher, diagnostic, and fscratch setup s…
amrayach Mar 28, 2026
6353093
research(protocol): Session 03 pre-TTT anchor — implementation comple…
amrayach Mar 28, 2026
85c41ca
fix: thread rope_train_seq_len through GPT/Block/CausalSelfAttention …
amrayach Mar 28, 2026
535e693
fix: install sentencepiece+zstandard inside NGC container before trai…
amrayach Mar 28, 2026
ea4e197
fix: add --mem=64G to smoke test to prevent Slurm OOM kill
amrayach Mar 28, 2026
df25b24
fix: add --mem=0 and rank-guarded pip install to prevent container OOM
amrayach Mar 28, 2026
563700f
perf: disable math SDP backend — force flash-only attention dispatch
amrayach Mar 28, 2026
884c758
research(results): Session 03 anchor — val_bpb 1.1290 on 8xH100
amrayach Mar 28, 2026
0878f42
docs: Codex Session 03 handoff refinements and Session 04 planning up…
amrayach Mar 28, 2026
82386dd
research(protocol): Session 04 Delta 1 — GPTQ-lite clip search, pre-run
amrayach Mar 28, 2026
33f4671
research(results): Session 04 Delta 1 GPTQ-lite FAILED + research syn…
amrayach Mar 29, 2026
ee26957
research(protocol): Session 04 Delta 2 — LeakyReLU² isolated delta, p…
amrayach Mar 29, 2026
f8f5225
research(results): Session 04 Delta 2 LeakyReLU² NEUTRAL + doc sync
amrayach Mar 29, 2026
44f64bb
docs: close Session 04, open Session 05 — throughput + pre-TTT + TTT …
amrayach Mar 29, 2026
c4f57f7
docs(campaign): add Session 05 planning prompt with full conventions
amrayach Mar 29, 2026
e25e722
docs(campaign): fix Session 05 prompt — soften MCP requirements, clar…
amrayach Mar 29, 2026
efbcbe5
docs(campaign): add Session 05a FA3 implementation prompt
amrayach Mar 29, 2026
ef1d775
bench: FA3 vs SDPA flash microbenchmark
amrayach Mar 29, 2026
61824eb
feat(session05): FA3 port delta + revised 3-phase strategy
amrayach Mar 29, 2026
44810d3
docs(campaign): add Session 05b GPTQ + 05c training bundle prompts, u…
amrayach Mar 29, 2026
e00bc0a
research(protocol): add Full Hessian GPTQ delta on Session 03 anchor
amrayach Mar 29, 2026
d6d7c65
docs: Session 05b GPTQ smoke test — correctness bug found, update all…
amrayach Mar 29, 2026
58f5187
Fix GPTQ export loop and add diagnostics
amrayach Mar 29, 2026
190d375
Add export-only GPTQ replay ablations
amrayach Mar 29, 2026
4f72329
Add debug skip for sliding replay eval
amrayach Mar 29, 2026
92db297
Record GPTQ replay ablation results
amrayach Mar 29, 2026
9cea7e9
Fix GPTQ Hessian collection: normalize, damp, and match PR calibratio…
amrayach Mar 30, 2026
fc9994e
Replace GPTQ path with PR #1019 faithful transplant
amrayach Mar 30, 2026
532bc77
Add AR self-generated calibration path for GPTQ (PR #1019 match)
amrayach Mar 30, 2026
b2e2e55
research(protocol): Session 05c-plus training bundle (XSA-all + VE128…
amrayach Mar 30, 2026
0766d5e
research(protocol): Add 1xGPU smoke test for 05c-plus bundle
amrayach Mar 30, 2026
7bf7e79
research(protocol): Fix smoke test and add EVAL_STRIDE env var
amrayach Mar 30, 2026
e62ce85
research(protocol): Fix smoke failure handling and VE gradient valida…
amrayach Mar 30, 2026
93e2145
research(protocol): Fix stale artifact contamination and VE claim wor…
amrayach Mar 30, 2026
8afe23d
docs: Update stale handoff docs for 05c-plus and three-agent coordina…
amrayach Mar 30, 2026
ce355a5
Add 05e GPTQ probe results and 05f follow-up branch
amrayach Mar 31, 2026
8ffa032
Add Session 05g: XSA-8 throughput recovery on 05c-plus base
amrayach Mar 31, 2026
5d65bae
chore(campaign): organize diagnostics and compression phase state
amrayach Mar 31, 2026
732fdbb
docs(campaign): sync measured state and runbook
amrayach Mar 31, 2026
1a084c7
feat(diagnostics): add int5 tolerance probe for mixed-precision feasi…
amrayach Apr 1, 2026
3d4082c
feat(06a): width-325 mixed int5/int6 + brotli + coprime loader + late…
amrayach Apr 1, 2026
73a1d18
fix(06a): wire real conservative int5 tensor names from probe JSON
amrayach Apr 1, 2026
7ed761e
fix(export_ab): add --sw-stride flag to avoid timeout on batch GPU
amrayach Apr 1, 2026
33aa708
fix(06a): stabilize late QAT gate and record int5 export failure
amrayach Apr 1, 2026
6bfdfd0
feat(06b): add parallel muon banking graft
amrayach Apr 1, 2026
d04a556
fix(06b): restore batched muon ns5
amrayach Apr 1, 2026
4a594e8
feat(07a): add non-banked ksv2/ksv3 pivot
amrayach Apr 1, 2026
1bf926a
feat(07b): add turbomuon engramlite pivot
amrayach Apr 1, 2026
e0c5a3a
Add 07b1 fidelity-fix fork
amrayach Apr 1, 2026
d919041
Add faithful PR1212 07c repro scaffold
amrayach Apr 1, 2026
3fe30eb
feat(07c1): wire TTT sliding eval and add Pegasus jobs
amrayach Apr 2, 2026
1c15cad
chore(jobs): add second-wave 07c1 TTT queue
amrayach Apr 2, 2026
cfae3e1
chore(jobs): add 07c1 width probes
amrayach Apr 2, 2026
f0197cf
fix(07c1): hardcode brotli, remove lzma fallback, add fixed TTT/base …
amrayach Apr 2, 2026
e1f91c4
Sync handoff docs and plan to pr1610-corrector branch
amrayach Apr 14, 2026
7ce8393
Add #1610 train_gpt.py as base for corrector work
amrayach Apr 14, 2026
1d92043
Add RunPod reproduction script for PR #1610 baseline (Gate A/B)
amrayach Apr 14, 2026
a33191f
corrector(#1610): add legality bundle and fix warmup token leak
amrayach Apr 17, 2026
218b623
pipeline(#1610): add runpod_pipeline scripts for corrector Gate A/B
amrayach Apr 17, 2026
775bb2f
docs(runpod-pipeline): remove stale s3 path and record launch state
amrayach Apr 17, 2026
876bb36
fix(#1610): address Ultrareview findings 001/002/003/004/009
amrayach Apr 18, 2026
e8556c6
docs(#1610): pin post-Ultrareview SHA 876bb36 as launch commit
amrayach Apr 18, 2026
1765afc
pipeline(#1610): auto-capture provenance, repo-type=model, exact-SHA …
amrayach Apr 18, 2026
78d1c17
docs(#1610): pin Rev 5 launch SHA 1765afc and record decision overlay
amrayach Apr 18, 2026
e99f18e
fix(1610-corrector): skip pre-quant diagnostic in quantized_eval_only…
amrayach Apr 19, 2026
c6f48c6
fix(stage1): use nullglob + array for shard counting under set -euo p…
amrayach Apr 19, 2026
3ac2653
fix(runpod-pipeline): add preflight guards and better failure surfaces
amrayach Apr 19, 2026
106543d
fix(dockerfile): replace bash CMD with sleep infinity to keep contain…
amrayach Apr 19, 2026
37c0388
docs(runpod): record HBM3, dockerStartCmd array form, MFS quota, cold…
amrayach Apr 19, 2026
71ed28e
docs(memory): record Session 3 negative result and Fallback 1A decision
amrayach Apr 19, 2026
67fe3a2
docs(session4): note stashed reproduction-script WIP
amrayach Apr 19, 2026
2621082
submission(non-record-16mb): #1610 reproduction + corrector negative …
amrayach Apr 19, 2026
811dacd
docs(campaign): record PR4 submission and update post-Session-3 state
amrayach Apr 19, 2026
10 changes: 9 additions & 1 deletion .gitignore
@@ -8,4 +8,12 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/
diagnostics/*/
!diagnostics/README.md
# Agent tooling (Antigravity, Claude Code, Codex)
.agents/skills/
.agents/workflows/
.serena/
*.pyc
SESSION.md
43 changes: 43 additions & 0 deletions AGENTS.md
@@ -0,0 +1,43 @@
# Shared Agent Entry Point

Start here for Claude Code, Codex, and Antigravity.

## Read First

1. `docs/campaign/AGENT_SYNC.md`
2. `CLAUDE.md`

For deep context (campaign strategy, prior experiments, hardware state):

3. `docs/codex-memory/BOOTSTRAP.md` — full bootstrap reading list

## Purpose

`docs/campaign/AGENT_SYNC.md` is the mutable source of truth for:

- current objective
- current scope
- latest measured results
- next commands to run

`CLAUDE.md` contains the standing coordination rules for sessions, updates, and disagreement handling.

## Tool-Specific Config

| Tool | Config | Skills | Workflows |
|------|--------|--------|-----------|
| Claude Code | `~/.claude/settings.json` + `.claude/settings.local.json` | `~/.claude/skills/` | `~/.claude/commands/` |
| Codex | `~/.codex/config.toml` | `~/.codex/skills/` | N/A |
| Antigravity | `~/.gemini/antigravity/mcp_config.json` | `.agents/skills/` (project-level) | `.agents/workflows/` |

## Current Working Mode

- Active goal: reproduce PR `#1610` directly and layer a full-vocab posterior corrector to push below 1.070 BPB
- Execution plan: `docs/campaign/PLAN_PR1610_CORRECTOR.md` (locked Revision 3)
- Source base: `#1610` `train_gpt.py` at SHA `ca191953` (NOT patched D variant)
- Non-record PR `#1598` remains open and frozen; do not edit unless reviewers request changes
- Best measured result: canonical D 5-seed mean TTT BPB `1.08129` (sigma = 0.00059)
- Target: <= 1.070 BPB via #1610 reproduction + posterior corrector
- Budget: $212 RunPod (~35 runs), deadline Apr 30
- Fallback cascade defined in plan if corrector < 0.001 BPB gain
- Out of scope: more Pegasus resubmissions, paid OWC salvage on D stack, SLOT, pre-quant validation TTT, casefold tokenizers, D-variant patching
75 changes: 75 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,75 @@
# Coordination Rules

This repo uses one shared handoff protocol for Claude Code, Codex, and Antigravity.

## Entry Point

Every new session must read in this order:
1. `AGENTS.md` — shared entry point, current working mode
2. `docs/campaign/AGENT_SYNC.md` — mutable source of truth for objectives, results, next steps
3. This file (`CLAUDE.md`) — standing rules and operational constraints

This file (`CLAUDE.md`) contains **stable standing rules only**. Do not duplicate mutable campaign state (current objective, latest metrics, next commands) that is already tracked in `AGENT_SYNC.md`.

## Session Rules

1. Treat `docs/campaign/AGENT_SYNC.md` as the current source of truth for:
   - active objective
   - current scope
   - next commands to run
   - latest measured results
2. Before starting new work, check `docs/campaign/artifacts/` and `records/` to avoid duplicating completed work.
3. If you change the objective, next step, or interpretation of results, update `docs/campaign/AGENT_SYNC.md`.
4. If you make a campaign-level decision or disagree with an earlier recommendation, record it in `docs/codex-memory/decisions.md`.
5. If a run produces a measured result, append one JSON record to `docs/campaign/results_log.jsonl`. Do not rewrite prior lines.
6. If you finish a meaningful session, update `docs/codex-memory/project-state.md` and `docs/codex-memory/next-session.md`.
7. If the task touches campaign strategy or prior experiments, also read:
   - `docs/codex-memory/project-state.md`
   - `docs/codex-memory/next-session.md`
   - `docs/codex-memory/decisions.md`
8. For competition re-implementations, use source priority:
   - `openai/parameter-golf` PR code first
   - local repo code second
   - papers and generic web sources only to resolve ambiguous math or API details
9. If a post-training export path looks wrong, debug it on the same checkpoint before spending more H100 time on retraining.
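
Rule 5 (append one JSON record, never rewrite prior lines) can be sketched as below. This is a minimal illustration, not the repo's actual logging code: `append_result` and the record fields are hypothetical names, and only the log path comes from the rule itself.

```python
import json
from pathlib import Path


def append_result(log_path, record):
    """Append one measured result as a single JSON line (hypothetical helper)."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps(record, sort_keys=True)
    # Append mode only adds to the end of the file, so earlier records stay untouched.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


if __name__ == "__main__":
    # Field names are illustrative assumptions; the value echoes the Session 03 anchor commit.
    append_result(
        "docs/campaign/results_log.jsonl",
        {"session": "03", "hardware": "8xH100", "val_bpb": 1.1290},
    )
```

Opening the file in append mode, rather than reading and rewriting it, is what keeps the log append-only even if a session crashes mid-write.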

## Working Agreement

- Pegasus `8xH100` is the active development target.
- Pegasus `A100-80GB` is fallback or grant-supporting evidence, not the mainline path.
- RunPod is reserved for final validation only.
- `git clone` and `git pull` are the default sync path for remote workspaces.
- Use `rsync` only to push local uncommitted changes quickly.

## Challenge Submission Rules

These are stable public rules from the challenge README and should not be rediscovered every session.

- Official leaderboard entry is **record-gated**, not top-5-open-entry.
- A record submission must beat the current official SOTA by at least `0.005` nats and provide enough logs for `p < 0.01`.
- Train and eval must each run under `10 minutes` on `8xH100`.
- Total artifact size limit is `16,000,000` bytes (decimal) for code plus compressed model.
- If a submission does not beat the current record bar, it is a non-record submission, not an official leaderboard entry.
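
A pre-submission check for two of these rules might look like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, the size limit is treated as inclusive, and the metric is assumed to be lower-is-better. It does not cover the `p < 0.01` evidence or the 10-minute time limits.

```python
import os

ARTIFACT_LIMIT_BYTES = 16_000_000  # decimal bytes: code plus compressed model
RECORD_MARGIN = 0.005              # required improvement over the current record


def artifact_bytes(paths):
    """Sum the on-disk sizes of every file that counts toward the artifact."""
    return sum(os.path.getsize(p) for p in paths)


def fits_artifact_budget(total_bytes, limit=ARTIFACT_LIMIT_BYTES):
    # Assumption: a total exactly equal to the limit is still acceptable.
    return total_bytes <= limit


def clears_record_bar(candidate, current_record, margin=RECORD_MARGIN):
    """True if the candidate beats the record by at least the margin (lower is better)."""
    return current_record - candidate >= margin
```

A run that fails `clears_record_bar` is, by the last rule above, a non-record submission.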

## Pegasus Operational Rules

These are stable constraints learned from operational experience. They apply to all Pegasus jobs.

### Launcher
- **Never use `torchrun --standalone`** on Pegasus multi-GPU. It hangs at rendezvous.
- Use Slurm-native `srun` with manual rank env vars: `LOCAL_RANK=$SLURM_LOCALID`, `RANK=$SLURM_PROCID`, `WORLD_SIZE=$SLURM_NTASKS`.
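
The rank mapping above can also be derived inside the Python entry point instead of the shell. The following is a sketch only: `rank_env_from_slurm` is a hypothetical helper, and rendezvous settings such as the master address and port are outside what this rule states and are not handled here.

```python
import os

# Torch-style variable -> Slurm variable, exactly as listed in the launcher rule.
SLURM_TO_TORCH = {
    "LOCAL_RANK": "SLURM_LOCALID",
    "RANK": "SLURM_PROCID",
    "WORLD_SIZE": "SLURM_NTASKS",
}


def rank_env_from_slurm(environ=None):
    """Build the rank variables from the Slurm task environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    missing = [name for name in SLURM_TO_TORCH.values() if name not in environ]
    if missing:
        # Fail loudly instead of silently defaulting to a single-process run.
        raise RuntimeError(f"not inside an srun task, missing: {', '.join(missing)}")
    return {torch_name: environ[slurm_name] for torch_name, slurm_name in SLURM_TO_TORCH.items()}


if __name__ == "__main__":
    os.environ.update(rank_env_from_slurm())
```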

### Job output
- **Never use `| tail -1`** on Pegasus training or install commands. It hides errors and progress.
- Always set `PYTHONUNBUFFERED=1` or use `python -u` to prevent output buffering.

### Allocation shape
- **Always include `--nodes=1`** for challenge-shaped `8xH100` runs. Without it, Slurm may split across nodes, breaking NVSwitch locality.
- Use `--ntasks=8 --gpus-per-task=1 --gpu-bind=none` (not `--gpus=8`).
- If a job lands on multiple nodes, cancel and relaunch with `--nodes=1`.
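
As a rough sketch, the allocation flags above can be assembled in one place and the single-node condition checked from inside the job. Both helper names are hypothetical. `SLURM_JOB_NUM_NODES` is the variable Slurm sets for the allocated node count, and a missing value is treated here as "not confirmed".

```python
import os

# Allocation flags copied from the rules above.
CHALLENGE_SRUN_FLAGS = [
    "--nodes=1",
    "--ntasks=8",
    "--gpus-per-task=1",
    "--gpu-bind=none",
]


def challenge_srun_args(command):
    """Return the argv for a challenge-shaped 8xH100 launch (hypothetical helper)."""
    return ["srun", *CHALLENGE_SRUN_FLAGS, *command]


def landed_on_single_node(environ=None):
    """True only when Slurm reports exactly one allocated node."""
    environ = os.environ if environ is None else environ
    return environ.get("SLURM_JOB_NUM_NODES") == "1"


if __name__ == "__main__":
    # `python -u` follows the job-output rule; the script name is taken from this repo.
    print(" ".join(challenge_srun_args(["python", "-u", "train_gpt.py"])))
```

If `landed_on_single_node()` returns false inside a running job, the rule above applies: cancel and relaunch with `--nodes=1`.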

### FA3 container path
- Saved FA3 container: `/netscratch/$USER/containers/pytorch_25.02_fa3.sqsh`
- **Do not use `--no-deps`** for the FA3 wheel on stock NGC 25.02. The container's torch 2.7.0 is ABI-incompatible with FA3 (`undefined symbol: aoti_torch_abi_version`).
- Do not run an ad hoc per-job `pip install` of FA3 once the saved container exists.
- See `docs/campaign/PEGASUS_H100_RUNBOOK.md` for full container build and benchmark commands.
34 changes: 34 additions & 0 deletions diagnostics/README.md
@@ -0,0 +1,34 @@
# Diagnostics Index

This directory keeps local copies of analysis outputs pulled from Pegasus so
diagnostic state survives future training runs.

## Current contents

- `2026-03-31_05c_plus/`
  - float and float-vs-int6 reports for the best measured branch
- `2026-03-31_05f/`
  - cross-run comparison reports against `05c-plus`

## Canonical utilities

- `scripts/diagnostics/diagnose_weights.py`
  - single-checkpoint weight statistics
  - float-vs-int6 comparison on the same checkpoint
- `scripts/diagnostics/compress_probe.py`
  - export-path feasibility probe for saved `.int6.ptz` artifacts

## Typical commands

From the repo root:

```bash
python scripts/diagnostics/diagnose_weights.py final_model.pt
python scripts/diagnostics/diagnose_weights.py final_model.pt final_model.int6.ptz
python scripts/diagnostics/compress_probe.py diagnostics/2026-03-31_05c_plus/final_model.int6.ptz
```

## Notes

- The authoritative preserved artifacts live on Pegasus under `/netscratch/$USER/parameter-golf/diagnostics/`.
- This directory is for pulled reports and local interpretation, not for live training outputs.
78 changes: 78 additions & 0 deletions docker/runpod-pr1610/Dockerfile
@@ -0,0 +1,78 @@
ARG BASE_IMAGE=runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2204
FROM ${BASE_IMAGE}

ARG VENV_DIR=/opt/pg-venv
ARG TORCH_VERSION=2.9.1+cu128
ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
ARG TORCH_CUDA_VERSION=12.8
ARG FLASH_ATTN_SPEC=flash-attn>=2.8,<2.9
ARG FLASH_ATTN_WHEEL_URL=
ARG TORCH_CUDA_ARCH_LIST=9.0
ARG MAX_JOBS=2

ENV DEBIAN_FRONTEND=noninteractive \
VENV_DIR=${VENV_DIR} \
PATH=${VENV_DIR}/bin:${PATH} \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1

SHELL ["/bin/bash", "-lc"]

RUN apt-get update && \
apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*

RUN python3 -m venv "${VENV_DIR}"

RUN python -m pip install --upgrade pip setuptools wheel && \
python -m pip install --no-cache-dir --upgrade --force-reinstall \
"torch==${TORCH_VERSION}" \
--extra-index-url "${TORCH_INDEX_URL}" && \
python -m pip install --no-cache-dir --upgrade \
numpy \
tqdm \
huggingface-hub \
setuptools \
typing-extensions==4.15.0 \
datasets \
"fsspec>=2023.1.0,<=2026.2.0" \
tiktoken \
sentencepiece \
zstandard \
brotli \
python-minifier \
psutil

# Prefer a prebuilt wheel when one is provided. Otherwise do the one-time
# source build here so fresh RunPod pods never rebuild flash-attn again.
RUN if [[ -n "${FLASH_ATTN_WHEEL_URL}" ]]; then \
python -m pip install --no-cache-dir --force-reinstall --no-deps "${FLASH_ATTN_WHEEL_URL}"; \
else \
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" MAX_JOBS="${MAX_JOBS}" \
python -m pip install --no-cache-dir --force-reinstall --no-build-isolation --no-deps "${FLASH_ATTN_SPEC}"; \
fi

RUN python - <<PY
import brotli
import sentencepiece
import torch
import triton
from flash_attn_interface import flash_attn_func, flash_attn_varlen_func

expected_torch = "${TORCH_VERSION}"
expected_cuda = "${TORCH_CUDA_VERSION}"
if torch.__version__ != expected_torch:
    raise SystemExit(f"torch version mismatch: expected {expected_torch}, got {torch.__version__}")
if torch.version.cuda != expected_cuda:
    raise SystemExit(f"CUDA runtime mismatch: expected {expected_cuda}, got {torch.version.cuda}")

print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"triton {triton.__version__}")
print("flash_attn_interface: OK")
print("python runtime deps: OK")
PY

WORKDIR /workspace
# Rebuild and repin the published image digest before the next RunPod session.
CMD ["sleep", "infinity"]