diff --git a/.gitignore b/.gitignore
index 3423c416a7..8a0242ec35 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,4 +8,12 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
\ No newline at end of file
+logs/
+diagnostics/*/
+!diagnostics/README.md
+# Agent tooling (Antigravity, Claude Code, Codex)
+.agents/skills/
+.agents/workflows/
+.serena/
+*.pyc
+SESSION.md
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..c8873eafc0
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,43 @@
+# Shared Agent Entry Point
+
+Start here for Claude Code, Codex, and Antigravity.
+
+## Read First
+
+1. `docs/campaign/AGENT_SYNC.md`
+2. `CLAUDE.md`
+
+For deep context (campaign strategy, prior experiments, hardware state):
+
+3. `docs/codex-memory/BOOTSTRAP.md` — full bootstrap reading list
+
+## Purpose
+
+`docs/campaign/AGENT_SYNC.md` is the mutable source of truth for:
+
+- current objective
+- current scope
+- latest measured results
+- next commands to run
+
+`CLAUDE.md` contains the standing coordination rules for sessions, updates, and disagreement handling.
+
+## Tool-Specific Config
+
+| Tool | Config | Skills | Workflows |
+|------|--------|--------|-----------|
+| Claude Code | `~/.claude/settings.json` + `.claude/settings.local.json` | `~/.claude/skills/` | `~/.claude/commands/` |
+| Codex | `~/.codex/config.toml` | `~/.codex/skills/` | N/A |
+| Antigravity | `~/.gemini/antigravity/mcp_config.json` | `.agents/skills/` (project-level) | `.agents/workflows/` |
+
+## Current Working Mode
+
+- Active goal: reproduce PR `#1610` directly and layer a full-vocab posterior corrector to push below 1.070 BPB
+- Execution plan: `docs/campaign/PLAN_PR1610_CORRECTOR.md` (locked Revision 3)
+- Source base: `#1610` `train_gpt.py` at SHA `ca191953` (NOT patched D variant)
+- Non-record PR `#1598` remains open and frozen; do not edit unless reviewers request changes
+- Best measured result: canonical D 5-seed mean TTT BPB `1.08129` (sigma = 0.00059)
+- Target: <= 1.070 BPB via #1610 reproduction + posterior corrector
+- Budget: $212 RunPod (~35 runs), deadline Apr 30
+- Fallback cascade defined in plan if corrector < 0.001 BPB gain
+- Out of scope: more Pegasus resubmissions, paid OWC salvage on D stack, SLOT, pre-quant validation TTT, casefold tokenizers, D-variant patching
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000000..791625e5b6
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,75 @@
+# Coordination Rules
+
+This repo uses one shared handoff protocol for Claude Code, Codex, and Antigravity.
+
+## Entry Point
+
+Every new session must read in this order:
+1. `AGENTS.md` — shared entry point, current working mode
+2. `docs/campaign/AGENT_SYNC.md` — mutable source of truth for objectives, results, next steps
+3. This file (`CLAUDE.md`) — standing rules and operational constraints
+
+This file (`CLAUDE.md`) contains **stable standing rules only**. Do not duplicate mutable campaign state (current objective, latest metrics, next commands) that is already tracked in `AGENT_SYNC.md`.
+
+## Session Rules
+
+1. Treat `docs/campaign/AGENT_SYNC.md` as the current source of truth for:
+   - active objective
+   - current scope
+   - next commands to run
+   - latest measured results
+2. Before starting new work, check `docs/campaign/artifacts/` and `records/` to avoid duplicating completed work.
+3. If you change the objective, next step, or interpretation of results, update `docs/campaign/AGENT_SYNC.md`.
+4. If you make a campaign-level decision or disagree with an earlier recommendation, record it in `docs/codex-memory/decisions.md`.
+5. If a run produces a measured result, append one JSON record to `docs/campaign/results_log.jsonl`. Do not rewrite prior lines.
+6. If you finish a meaningful session, update `docs/codex-memory/project-state.md` and `docs/codex-memory/next-session.md`.
+7. If the task touches campaign strategy or prior experiments, also read:
+   - `docs/codex-memory/project-state.md`
+   - `docs/codex-memory/next-session.md`
+   - `docs/codex-memory/decisions.md`
+8. For competition re-implementations, use source priority:
+   - `openai/parameter-golf` PR code first
+   - local repo code second
+   - papers and generic web sources only to resolve ambiguous math or API details
+9. If a post-training export path looks wrong, debug it on the same checkpoint before spending more H100 time on retraining.
+
+## Working Agreement
+
+- Pegasus `8xH100` is the active development target.
+- Pegasus `A100-80GB` is fallback or grant-supporting evidence, not the mainline path.
+- RunPod is reserved for final validation only.
+- `git clone` and `git pull` are the default sync path for remote workspaces.
+- Use `rsync` only to push local uncommitted changes quickly.
+
+## Challenge Submission Rules
+
+These are stable public rules from the challenge README and should not be rediscovered every session.
+
+- Official leaderboard entry is **record-gated**, not top-5-open-entry.
+- A record submission must beat the current official SOTA by at least `0.005` nats and provide enough logs for `p < 0.01`.
+- Train and eval must each run under `10 minutes` on `8xH100`.
+- Total artifact size is `16,000,000` bytes decimal for code plus compressed model.
+- If a submission does not beat the current record bar, it is a non-record submission, not official leaderboard entry.
+
+## Pegasus Operational Rules
+
+These are stable constraints learned from operational experience. They apply to all Pegasus jobs.
+
+### Launcher
+- **Never use `torchrun --standalone`** on Pegasus multi-GPU. It hangs at rendezvous.
+- Use Slurm-native `srun` with manual rank env vars: `LOCAL_RANK=$SLURM_LOCALID`, `RANK=$SLURM_PROCID`, `WORLD_SIZE=$SLURM_NTASKS`.
+
+### Job output
+- **Never use `| tail -1`** on Pegasus training or install commands. It hides errors and progress.
+- Always set `PYTHONUNBUFFERED=1` or use `python -u` to prevent output buffering.
+
+### Allocation shape
+- **Always include `--nodes=1`** for challenge-shaped `8xH100` runs. Without it, Slurm may split across nodes, breaking NVSwitch locality.
+- Use `--ntasks=8 --gpus-per-task=1 --gpu-bind=none` (not `--gpus=8`).
+- If a job lands on multiple nodes, cancel and relaunch with `--nodes=1`.
+
+### FA3 container path
+- Saved FA3 container: `/netscratch/$USER/containers/pytorch_25.02_fa3.sqsh`
+- **Do not use `--no-deps`** for FA3 wheel on stock NGC 25.02. The container's torch 2.7.0 is ABI-incompatible with FA3 (`undefined symbol: aoti_torch_abi_version`).
+- Do not do ad hoc per-job `pip install` of FA3 once the saved container exists.
+- See `docs/campaign/PEGASUS_H100_RUNBOOK.md` for full container build and benchmark commands.
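Review note on the Launcher rule in `CLAUDE.md`: the mapping from Slurm's per-task variables to `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` can be sketched as a small Python helper. This is a minimal illustration, not code from this PR. The function name `rank_env_from_slurm` and the `SLURM_TO_TORCH` table are hypothetical. Only the three variable pairs come from the rule itself.

```python
import os

# Mapping quoted from the Launcher rule in CLAUDE.md:
# target variable -> Slurm per-task source variable.
SLURM_TO_TORCH = {
    "LOCAL_RANK": "SLURM_LOCALID",
    "RANK": "SLURM_PROCID",
    "WORLD_SIZE": "SLURM_NTASKS",
}


def rank_env_from_slurm(environ=None):
    """Derive the rank variables from Slurm's per-task environment.

    Hypothetical helper for illustration only (not part of the repo).
    Raises KeyError when a Slurm variable is missing, which is what
    happens if the process was not started through srun.
    """
    env = os.environ if environ is None else environ
    derived = {}
    for target, source in SLURM_TO_TORCH.items():
        if source not in env:
            raise KeyError(f"{source} is not set; was this process started by srun?")
        derived[target] = env[source]
    return derived


if __name__ == "__main__":
    # Simulated environment for task 3 of an 8-task, single-node allocation.
    simulated = {"SLURM_LOCALID": "3", "SLURM_PROCID": "3", "SLURM_NTASKS": "8"}
    print(rank_env_from_slurm(simulated))
```

PyTorch's `env://` initialization also reads `MASTER_ADDR` and `MASTER_PORT`. The rule quoted above does not say how those are set, so the sketch leaves them out.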
diff --git a/diagnostics/README.md b/diagnostics/README.md
new file mode 100644
index 0000000000..af9c08c174
--- /dev/null
+++ b/diagnostics/README.md
@@ -0,0 +1,34 @@
+# Diagnostics Index
+
+This directory keeps local copies of analysis outputs pulled from Pegasus so
+diagnostic state survives future training runs.
+
+## Current contents
+
+- `2026-03-31_05c_plus/`
+  - float and float-vs-int6 reports for the best measured branch
+- `2026-03-31_05f/`
+  - cross-run comparison reports against `05c-plus`
+
+## Canonical utilities
+
+- `scripts/diagnostics/diagnose_weights.py`
+  - single-checkpoint weight statistics
+  - float-vs-int6 comparison on the same checkpoint
+- `scripts/diagnostics/compress_probe.py`
+  - export-path feasibility probe for saved `.int6.ptz` artifacts
+
+## Typical commands
+
+From the repo root:
+
+```bash
+python scripts/diagnostics/diagnose_weights.py final_model.pt
+python scripts/diagnostics/diagnose_weights.py final_model.pt final_model.int6.ptz
+python scripts/diagnostics/compress_probe.py diagnostics/2026-03-31_05c_plus/final_model.int6.ptz
+```
+
+## Notes
+
+- The authoritative preserved artifacts live on Pegasus under `/netscratch/$USER/parameter-golf/diagnostics/`.
+- This directory is for pulled reports and local interpretation, not for live training outputs.
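Review note on the diagnostics README: it does not say what `compress_probe.py` measures. One check that would fit an export-path feasibility probe is the size cap quoted in `CLAUDE.md` (`16,000,000` bytes decimal for code plus compressed model). The sketch below is an assumption about what such a check could look like, not a description of the script. The function name `artifact_budget` is hypothetical, and treating the cap as inclusive is also an assumption.

```python
import os
import tempfile

# Cap quoted in CLAUDE.md: 16,000,000 bytes (decimal), code plus compressed model.
ARTIFACT_LIMIT_BYTES = 16_000_000


def artifact_budget(paths, limit=ARTIFACT_LIMIT_BYTES):
    """Sum the on-disk sizes of `paths` and report headroom against `limit`.

    Hypothetical helper for illustration. Assumes the cap is inclusive,
    i.e. a total of exactly `limit` bytes still fits.
    """
    total = sum(os.path.getsize(p) for p in paths)
    return {
        "total_bytes": total,
        "headroom_bytes": limit - total,
        "fits": total <= limit,
    }


if __name__ == "__main__":
    # Stand-in files; real use would pass the code file and the .int6.ptz artifact.
    with tempfile.TemporaryDirectory() as tmp:
        code_path = os.path.join(tmp, "train_gpt.py")
        model_path = os.path.join(tmp, "final_model.int6.ptz")
        with open(code_path, "wb") as f:
            f.write(b"x" * 1_000)
        with open(model_path, "wb") as f:
            f.write(b"\0" * 2_000)
        print(artifact_budget([code_path, model_path]))
```

Returning the headroom rather than only a pass/fail flag makes it easier to see how close an export is to the cap before spending a run on it.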
diff --git a/docker/runpod-pr1610/Dockerfile b/docker/runpod-pr1610/Dockerfile
new file mode 100644
index 0000000000..002c9cb49a
--- /dev/null
+++ b/docker/runpod-pr1610/Dockerfile
@@ -0,0 +1,78 @@
+ARG BASE_IMAGE=runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2204
+FROM ${BASE_IMAGE}
+
+ARG VENV_DIR=/opt/pg-venv
+ARG TORCH_VERSION=2.9.1+cu128
+ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
+ARG TORCH_CUDA_VERSION=12.8
+ARG FLASH_ATTN_SPEC=flash-attn>=2.8,<2.9
+ARG FLASH_ATTN_WHEEL_URL=
+ARG TORCH_CUDA_ARCH_LIST=9.0
+ARG MAX_JOBS=2
+
+ENV DEBIAN_FRONTEND=noninteractive \
+    VENV_DIR=${VENV_DIR} \
+    PATH=${VENV_DIR}/bin:${PATH} \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1
+
+SHELL ["/bin/bash", "-lc"]
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN python3 -m venv "${VENV_DIR}"
+
+RUN python -m pip install --upgrade pip setuptools wheel && \
+    python -m pip install --no-cache-dir --upgrade --force-reinstall \
+        "torch==${TORCH_VERSION}" \
+        --extra-index-url "${TORCH_INDEX_URL}" && \
+    python -m pip install --no-cache-dir --upgrade \
+        numpy \
+        tqdm \
+        huggingface-hub \
+        setuptools \
+        typing-extensions==4.15.0 \
+        datasets \
+        "fsspec>=2023.1.0,<=2026.2.0" \
+        tiktoken \
+        sentencepiece \
+        zstandard \
+        brotli \
+        python-minifier \
+        psutil
+
+# Prefer a prebuilt wheel when one is provided. Otherwise do the one-time
+# source build here so fresh RunPod pods never rebuild flash-attn again.
+RUN if [[ -n "${FLASH_ATTN_WHEEL_URL}" ]]; then \
+        python -m pip install --no-cache-dir --force-reinstall --no-deps "${FLASH_ATTN_WHEEL_URL}"; \
+    else \
+        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" MAX_JOBS="${MAX_JOBS}" \
+        python -m pip install --no-cache-dir --force-reinstall --no-build-isolation --no-deps "${FLASH_ATTN_SPEC}"; \
+    fi
+
+RUN python - <