Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439 #2123

Closed

vaibhavmishra1 wants to merge 2 commits into openai:main from vaibhavmishra1:final_sub

Conversation

@vaibhavmishra1

Gated XSA + CaseOps + LQER g32/top4 + In-Timer N-gram Tilt

The default run is SEED=42 ./run.sh; see the Reproduction section below.

Results

The folder includes three successful 8xH100 logs.

Final result: 1.05933439 mean TTT BPB across three logs.
Final file size: max observed submission size 15,991,624 B (8,376 B under the 16,000,000-byte cap).
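
For reference, BPB is read here as the standard bits-per-byte metric: bpb = nll_nats / (ln 2 × n_raw_bytes), i.e. total cross-entropy in nats divided by ln 2 and by the number of raw validation bytes scored. The byte counts presumably come from the lossless-caps byte sidecar mentioned below; this definition is an assumption, not a quote from the code.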

| Seed | Config | Post-EMA BPB | Quant BPB | TTT BPB | TTT loss | TTT eval time | Train stop | Total submission size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 42 | g32/top4, rank2, prefix1000 | 1.06357301 | 1.07110057 | 1.05872644 | 2.31688588 | 521.1s | 596.047s | 15,984,321 B |
| 0 | g32/top4, rank2, prefix1000 | 1.06531335 | 1.07291783 | 1.06035228 | 2.32044383 | 475.6s | 596.146s | 15,987,537 B |
| 1234 | g64/top1, rank4, prefix1000 | 1.06335328 | 1.07134445 | 1.05892444 | 2.31731918 | 467.9s | 598.091s | 15,991,624 B |
| Mean | all three logs | 1.06407988 | 1.07178762 | 1.05933439 | 2.31821630 | 488.2s | 596.761s | 15,987,827 B |

What Changed

Architecture and training

The architecture follows the PR #2018 lineage:

  • 11-layer, 512-dim, 8-head GQA transformer with 4 KV heads.
  • CaseOps SP8192 lossless-caps tokenizer and byte sidecar validation scoring.
  • Gated XSA on all layers with zero-init per-head gates (sketched just after this list).
  • Looping layers 3-5 after 35% of training.
  • Parallel final lane from layer 8.
  • SmearGate and Sparse Attention Gate enabled.
  • Fused CE training path, FA3 attention, partial RoPE, logit softcap.
  • EMA applied before quantization.
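
For intuition, here is a minimal PyTorch sketch of the zero-init per-head gating idea. The class name and tensor layout are illustrative assumptions, not the actual train_gpt.py implementation:

import torch
import torch.nn as nn

class PerHeadGate(nn.Module):
    # Hypothetical sketch of zero-init per-head output gating: each
    # attention head's output is scaled by tanh(g_h). With g initialized
    # to zero, the gated branch contributes nothing at step 0 and opens
    # smoothly as training proceeds.
    def __init__(self, num_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_heads))  # one gate per head

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq_len, num_heads, head_dim)
        return attn_out * torch.tanh(self.gate).view(1, 1, -1, 1)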

The training loop and optimizer routing are otherwise kept close to the PR #2018 base. The run stops at the 600s wallclock training cap.

Quantization

The final quantization path is the PR #2018 GPTQ/LQER/AWQ-lite path with PR #2060's portable LQER retune:

| Setting | Value |
| --- | --- |
| MATRIX_BITS | 6 |
| EMBED_BITS | 7 |
| LQER_RANK | 2 |
| LQER_ASYM_GROUP | 32 |
| LQER_TOP_K | 4 |
| AWQ_LITE_ENABLED | 1 |
| AWQ_LITE_GROUP_SIZE | 64 |
| COMPRESSOR | pergroup |

The largest included g32/top4 run was 15,987,537 bytes, leaving 12,463 bytes of headroom under the 16,000,000-byte cap.
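
To make the rank/group settings concrete, here is a hedged PyTorch sketch of the LQER idea under the table's values: asymmetric per-group quantization (group 32) followed by a rank-2 SVD of the quantization error. The function is an illustration only; the submission's actual path also runs GPTQ and AWQ-lite, and the top4 selection logic is not reproduced here:

import torch

def lqer_sketch(w: torch.Tensor, bits: int = 6, group: int = 32, rank: int = 2):
    # Hypothetical LQER sketch, not the submission's exact pipeline.
    # Step 1: asymmetric per-group quantization of a 2-D weight matrix.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    lo = wg.amin(dim=-1, keepdim=True)
    hi = wg.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / (2 ** bits - 1)
    q = torch.round((wg - lo) / scale).clamp(0, 2 ** bits - 1)
    w_q = (q * scale + lo).reshape(out_f, in_f)
    # Step 2: approximate the quantization error E = W - Q(W) with a
    # rank-`rank` factor pair from a truncated SVD, stored alongside w_q.
    u, s, vh = torch.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out_f, rank)
    b = vh[:rank, :]             # (rank, in_f)
    # At inference the layer uses w_q + a @ b in place of w.
    return w_q, a, b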

Evaluation

Evaluation keeps the conservative PR #2018 timing recipe:

| Setting | Value |
| --- | --- |
| EVAL_SEQ_LEN | 2560 |
| TTT_EVAL_SEQ_LEN | 2560 |
| PHASED_TTT_NUM_PHASES | 1 |
| PHASED_TTT_PREFIX_DOCS | 1000 |
| TTT_LORA_RANK | 80 |
| TTT_LORA_LR | 0.0001 |
| TTT_CHUNK_SIZE | 48 |
| NGRAM_TILT_ENABLED | 1 |
| NGRAM_HINT_PRECOMPUTE_OUTSIDE | 0 |
| TOKEN_ORDER | 16 |
| TOKEN_THRESHOLD | 0.800 |
| TOKEN_BOOST | 2.625 |
| WITHIN_BOOST | 0 |
| WORD_BOOST | 0 |
The n-gram helper is inlined into train_gpt.py, so the evaluation logic is self-contained and counted with the submitted code. The helper is token-only, prefix-only, and its hint precompute happens inside the measured TTT eval timer.
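
For readers unfamiliar with the mechanism, here is a hedged, single-order sketch of a causal, token-only n-gram tilt. The function names are invented, and the real helper presumably backs off across context lengths up to TOKEN_ORDER=16 rather than using one fixed order:

def tilt_logits(logits, tokens, pos, counts, order=16,
                threshold=0.800, boost=2.625):
    # Hypothetical sketch (not the inlined train_gpt.py helper): `counts`
    # maps a context tuple of up to order-1 preceding tokens to next-token
    # frequencies, built only from the already-scored prefix tokens[:pos].
    ctx = tuple(tokens[max(0, pos - (order - 1)):pos])
    nxt = counts.get(ctx)
    if nxt:
        tok, cnt = max(nxt.items(), key=lambda kv: kv[1])
        if cnt / sum(nxt.values()) >= threshold:
            logits[tok] += boost  # tilt toward the dominant continuation
    return logits

def observe_token(tokens, pos, counts, order=16):
    # Called only AFTER position `pos` has been scored, so the hint for a
    # position never depends on its own target (causal, score-time only).
    ctx = tuple(tokens[max(0, pos - (order - 1)):pos])
    bucket = counts.setdefault(ctx, {})
    bucket[tokens[pos]] = bucket.get(tokens[pos], 0) + 1

Because counts are updated strictly after scoring, every hint depends only on prefix state, which is what the causality claim in the Compliance Notes below refers to.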

Compliance Notes

  • Train time: all logs stop under the 600s training cap.
  • Eval time: all logs finish under the 600s eval cap.
  • Artifact size: all logs are below 16,000,000 bytes.
  • No network during train/eval: runtime uses local dataset shards, local tokenizer, and local Python/C helpers only.
  • Validation data is not accessed during training.
  • N-gram tilt is causal and score-time only; it reads the validation stream left-to-right and emits each hint from prefix state before observing the target.
  • TTT is score-first: each chunk is scored before its loss is used for adapter updates (see the sketch below).
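
To make the score-first ordering concrete, here is a hedged sketch of a chunked TTT eval loop using TTT_CHUNK_SIZE=48. The model and optimizer are placeholders for the LoRA-adapted model and its adapter optimizer, not the submission's actual interfaces:

import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, tokens, chunk_size=48):
    # Hypothetical sketch of score-first test-time training: each chunk
    # is scored with the current adapter weights BEFORE its loss is used
    # to update them, so no token is evaluated by a model that has
    # already trained on it. `tokens` is a 1-D LongTensor.
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens) - 1, chunk_size):
        chunk = tokens[start:start + chunk_size + 1]
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:]
        logits = model(inputs).squeeze(0)           # (len, vocab)
        loss = F.cross_entropy(logits, targets)
        total_nll += loss.item() * targets.numel()  # record the score first
        n_scored += targets.numel()
        optimizer.zero_grad()
        loss.backward()                             # then adapt on this chunk
        optimizer.step()
    return total_nll / n_scored  # per-token NLL in nats; bpb needs byte counts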

Reproduction

Run from this record folder or from the repository root. The run script sets the final default hyperparameters.

# optional: install Python dependencies
pip install -r requirements.txt

The CaseOps dataset can be downloaded with:

huggingface-cli download romeerp/parameter-golf-caseops-v1 \
  --repo-type dataset \
  --local-dir ./data/datasets/fineweb10B_sp8192_caseops/

Then run the final configuration:

SEED=42 ./run.sh

The expected data root is:

./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved

run.sh also supports overriding:

DATA_PATH=/path/to/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=/path/to/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
SEED=42 ./run.sh

Files

  • train_gpt.py: self-contained training, quantization, serialization, TTT eval, and in-timer n-gram tilt logic.
  • run.sh: final record-candidate command wrapper.
  • submission.json: leaderboard metadata.
  • requirements.txt: Python dependencies beyond the base environment.
  • tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model: tokenizer used by the CaseOps dataset.
  • lossless_caps.py and prepare_caseops_data.py: CaseOps transform and dataset-prep helpers.
  • train_seed42.log, train_seed0.log: the final g32/top4 logs backing the claimed record.
  • train_seed1234.log: a closely related seed-1234 run included in the results table; it used the earlier LQER top1/rank4/group64 settings.

Lineage and Credits

This submission is a recombination/tuning entry. Key public sources:

  • PR #2018 by simon-marcus: Gated XSA stack, token-only in-timer n-gram tilt, and the main code lineage.
  • PR #2060 by S0urC10ud: LQER/GPTQ retune ported here where supported.
  • PR #2007 by Elubrazione: Long-context/QK/AsymLogit lineage used by later stacks.
  • PR #1967 by ndokutovich: V21 + n-gram tilt + LeakyReLU base lineage.
  • PR #1948 by TimS-ml: LeakyReLU squared slope sweep.
  • PR #1953 by andrewbaggio1: TTT/QK tuning lineage.
  • PR #1923 by jorge-asenjo: asymmetric logit rescale.
  • PR #1908 by romeerp: AWQ-lite mixed precision.
  • PR #1855 by codemath3000: per-group compression and strong SP8192 stack.
  • PR #1514 by codemath3000: strict token-only n-gram precedent.
  • PR #1145 by AnirudhRahul: n-gram tilt with closed-form probability renormalization.
