# SP1024 + Shared-V(last3) 3-seed non-record submission

This is a stable non-record 16MB submission based on the official SP1024 tokenizer and a compact transformer with structured skip fusion.

## Summary

This submission uses:

- official `fineweb_1024_bpe.model`
- standard FineWeb SP1024 dataset
- structured skip fusion (`BIFPN2_MODE=1`)
- XSA on the last 4 layers
- 2-gram scaffold with fade-out
- shared V across the last 3 layers

This submission is intended as a stable, rule-compliant baseline rather than a leaderboard-top attempt.
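
The headline change is sharing a single V projection across the last 3 layers. As a rough, hedged sketch of why this helps under a size cap: with the shapes from `config.json` (MODEL_DIM=512, NUM_HEADS=8, NUM_KV_HEADS=4), each V projection is 512×256, so reusing one projection for 3 layers drops two of them. This accounting is an estimate, not a measurement of the checkpoint:

```python
# Hypothetical parameter accounting for shared V across the last 3 layers.
# Shapes are from config.json; the savings figure is an estimate, not
# measured from the serialized model.
MODEL_DIM = 512
NUM_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = MODEL_DIM // NUM_HEADS                      # 64
v_proj_params = MODEL_DIM * NUM_KV_HEADS * HEAD_DIM    # 512 * 256 = 131072

shared_layers = 3                                      # CROSS_LAYER_KV_LAST_N_LAYERS
saved = (shared_layers - 1) * v_proj_params
print(saved)  # 262144 params, roughly 256 KiB of artifact budget at int8
```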

## Representative run

Representative seed: **2027**

Representative exact roundtrip BPB: **1.27717259**

Submission size: **15973626 bytes**

## 3-seed results

| seed | last_val_bpb | roundtrip_exact_val_bpb | submission_bytes |
|------|--------------|-------------------------|------------------|
| 1337 | 1.2791 | 1.28079096 | 15972114 |
| 2027 | 1.2755 | 1.27717259 | 15973626 |
| 3407 | 1.2779 | 1.27952108 | 15975453 |

3-seed mean exact roundtrip BPB: **1.27916154**
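
The reported mean can be reproduced directly from the per-seed exact values in the table above:

```python
# Sanity check of the reported 3-seed mean exact roundtrip BPB,
# using the values from the table above.
seed_bpb = {1337: 1.28079096, 2027: 1.27717259, 3407: 1.27952108}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(round(mean_bpb, 8))  # 1.27916154
```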

## Files

- `submission.json`: metadata for this submission
- `train.log`: representative training log
- `train_gpt.py`: training script snapshot used for this submission
- `config.json`: resolved config for the representative run
- `seed_runs.csv`: all 3 seed results
- `requirements.txt`: minimal environment dependencies

## Main configuration

Key settings:

- tokenizer: SP1024
- `BIFPN2_MODE=1`
- `XSA_ENABLED=1`
- `XSA_LAST_N_LAYERS=4`
- `NGRAM_MAX_N=2`
- `NGRAM_FADE_ENABLE=1`
- `CROSS_LAYER_KV_SHARING_ENABLED=1`
- `CROSS_LAYER_KV_SHARE_V=1`
- `CROSS_LAYER_KV_PAIRWISE=0`
- `CROSS_LAYER_KV_PARTIAL_HEAD=0`
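
A minimal pre-launch check of these settings can be sketched as below. The `check_config` helper is hypothetical (not part of this repo); it only verifies that a resolved config dict carries the key flags listed above:

```python
import json

# Expected key settings for this submission, mirroring the list above.
EXPECTED = {
    "BIFPN2_MODE": 1,
    "XSA_ENABLED": 1,
    "XSA_LAST_N_LAYERS": 4,
    "NGRAM_MAX_N": 2,
    "NGRAM_FADE_ENABLE": 1,
    "CROSS_LAYER_KV_SHARING_ENABLED": 1,
    "CROSS_LAYER_KV_SHARE_V": 1,
    "CROSS_LAYER_KV_PAIRWISE": 0,
    "CROSS_LAYER_KV_PARTIAL_HEAD": 0,
}

def check_config(cfg: dict) -> list:
    """Return the names of expected keys that are missing or mismatched."""
    return [k for k, v in EXPECTED.items() if cfg.get(k) != v]

# Usage against the resolved config shipped with this submission:
# cfg = json.load(open("config.json"))
# assert not check_config(cfg), check_config(cfg)
```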

## Notes

- This submission does **not** modify the tokenizer or dataset.
- This is a reproducibility-focused non-record submission under the 16MB artifact limit.
- The representative run uses seed 2027 because it was the best run among the 3 submission seeds.

## Reproduction

Typical command pattern:

```bash
python launchv3.py config_submission_sharev3_3seed.json \
  --train-script mytrain_gpt_v2_1.py \
  --output output/submission_sharev3_3seed \
  --stop-mode steps \
  --max-steps 3000 \
  --only submission_seed2027
```
## config.json
{
"_comment_TRACK": "Stable non-record submission candidate under 16MB, using official SP1024 tokenizer and no tokenizer changes",
"_comment_DATA": "Official SP1024 data/tokenizer",
"DATA_PATH": "./data/datasets/fineweb10B_sp1024",
"TOKENIZER_PATH": "./data/tokenizers/fineweb_1024_bpe.model",
"VOCAB_SIZE": 1024,
"_comment_CORE": "Core model shape",
"NUM_LAYERS": 9,
"MODEL_DIM": 512,
"NUM_HEADS": 8,
"NUM_KV_HEADS": 4,
"MLP_MULT": 2,
"TIE_EMBEDDINGS": 1,
"ROPE_BASE": 10000.0,
"LOGIT_SOFTCAP": 30.0,
"QK_GAIN_INIT": 1.5,
"_comment_TRAIN": "Train schedule",
"GRAD_ACCUM_STEPS": 4,
"TRAIN_BATCH_TOKENS": 524288,
"TRAIN_SEQ_LEN": 1024,
"ITERATIONS": 20000,
"WARMUP_STEPS": 20,
"WARMDOWN_ITERS": 1200,
"STOP_MODE": "steps",
"MAX_TRAIN_STEPS": 3000,
"MAX_WALLCLOCK_SECONDS": 3600.0,
"_comment_OPTIM": "Optimizer",
"MATRIX_LR": 0.04,
"SCALAR_LR": 0.04,
"EMBED_LR": 0.6,
"HEAD_LR": 0.008,
"TIED_EMBED_LR": 0.05,
"TIED_EMBED_INIT_STD": 0.005,
"MUON_MOMENTUM": 0.95,
"MUON_BACKEND_STEPS": 5,
"MUON_MOMENTUM_WARMUP_START": 0.85,
"MUON_MOMENTUM_WARMUP_STEPS": 500,
"BETA1": 0.9,
"BETA2": 0.95,
"ADAM_EPS": 1e-08,
"GRAD_CLIP_NORM": 0.0,
"_comment_SKIP": "Best stable under-size stack",
"FDA_MODE": 0,
"BIFPN_MODE": 0,
"BIFPN2_MODE": 1,
"BIFPN_GROUP_COUNT": 8,
"BIFPN_BAND_WIDTH": 1,
"BIFPN_NORM_EPS": 0.0001,
"BIFPN_INIT_MAIN": 1.0,
"BIFPN_INIT_NEIGHBOR": 0.15,
"BIFPN_INIT_FAR": 0.0,
"_comment_STAB": "Stability toggles",
"SCALEDLM_HEAD": 1,
"SMEAR_MODE": 0,
"SMEAR_WINDOW": 4,
"SMEAR_GATE": 0,
"ROPE_DIMS": -1,
"LEARNABLE_ROPE": 0,
"LN_SCALE": 1,
"LEARNABLE_LN_SCALE": 0,
"AFFINE_NORM": 0,
"_comment_XSA": "Keep XSA on last 4 layers",
"XSA_ENABLED": 1,
"XSA_LAST_N_LAYERS": 4,
"XSA_EPS": 1e-06,
"_comment_VALUE_PATH": "Use plain shared V only; this was the under-16MB stable candidate",
"V_SKIP_ENABLED": 0,
"V_SKIP_LAST_N_LAYERS": 4,
"V_SKIP_MODE": "scalar",
"V_SKIP_GROUP_COUNT": 8,
"CROSS_LAYER_V_ENABLED": 0,
"CROSS_LAYER_V_LAST_N_LAYERS": 4,
"CROSS_LAYER_V_MODE": "residual",
"CROSS_LAYER_V_GROUP_COUNT": 4,
"_comment_MEMORY_PATH": "Share V across later layers, no K sharing",
"CROSS_LAYER_KV_SHARING_ENABLED": 1,
"CROSS_LAYER_KV_LAST_N_LAYERS": 3,
"CROSS_LAYER_KV_SHARE_K": 0,
"CROSS_LAYER_KV_SHARE_V": 1,
"CROSS_LAYER_KV_PAIRWISE": 0,
"CROSS_LAYER_KV_PARTIAL_HEAD": 0,
"CROSS_LAYER_KV_PARTIAL_HEAD_COUNT": 2,
"_comment_PLE": "Disabled for this stable submission",
"PLE_ENABLED": 0,
"PLE_TEMPORAL_CONV": 0,
"PLE_DIM": 32,
"PLE_MODE": "post_attn",
"PLE_TOKEN_SCALE_INIT": 1.0,
"PLE_CTX_SCALE_INIT": 1.0,
"PLE_RESID_SCALE_INIT": 0.01,
"_comment_MTP": "Disabled",
"MTP_NUM_HEADS": 0,
"MTP_LOSS_WEIGHT": 0.2,
"MTPHEAD_MLPMODE": 0,
"_comment_NGRAM": "Keep 2-gram scaffold + fade-out",
"NGRAM_VOCAB_SIZE": 2048,
"NGRAM_DIM": 128,
"NGRAM_MAX_N": 2,
"NGRAM_FADE_ENABLE": 1,
"NGRAM_FADE_START_FRAC": 0.15,
"NGRAM_FADE_END_FRAC": 0.45,
"NGRAM_FADE_MIN_SCALE": 0.0,
"_comment_EMA_QAT": "Keep EMA and conservative late QAT",
"EMA_ENABLED": 1,
"EMA_DECAY": 0.997,
"DYNAMIC_CLIP_PERCENTILES": "100.0,99.9999,99.9995,99.995,99.9",
"LATE_QAT_RATIO": 0.15,
"_comment_EVAL": "Submission run should use non-sliding eval for direct comparability",
"VAL_LOSS_EVERY": 1000,
"VAL_BATCH_SIZE": 524288,
"EVAL_USE_SLIDING_WINDOW": 0,
"EVAL_STRIDE": 1024,
"EVAL_BATCH_SEQS": 16,
"_comment_LOGGING": "Telemetry/logging",
"TELEMETRY_EVERY": 50,
"TRAIN_LOG_EVERY": 200,
"PROFILE_RUN": 0,
"PROFILE_WARMUP_STEPS": 5,
"PROFILE_ACTIVE_STEPS": 10,
"SEED": 2027
}
## requirements.txt
numpy
tqdm
torch
huggingface-hub
kernels
setuptools
typing-extensions==4.15.0
datasets
tiktoken
sentencepiece
## seed_runs.csv
experiment,seed,last_val_bpb,roundtrip_val_bpb,roundtrip_exact_val_bpb,submission_bytes,stopped_step,output_dir
submission_seed1337,1337,1.2791,1.2808,1.28079096,15972114,3000,output/submission_sharev3_3seed/submission_seed1337_20260418_100202
submission_seed2027,2027,1.2755,1.2772,1.27717259,15973626,3000,output/submission_sharev3_3seed/submission_seed2027_20260418_103507
submission_seed3407,3407,1.2779,1.2795,1.27952108,15975453,3000,output/submission_sharev3_3seed/submission_seed3407_20260418_110812
## submission.json
{
"title": "SP1024 + Shared-V(last3) + BIFPN2 + XSA4 + NGram Fade",
"author": "Kaikai Liu",
"github_id": "lkk688",
"track": "non-record-16mb",
"description": "Stable SP1024 non-record submission under the 16MB artifact cap.",
"val_bpb": 1.27717259,
"artifact_bytes": 15973626,
"representative_seed": 2027,
"seeds": [
1337,
2027,
3407
],
"tokenizer": "official fineweb_1024_bpe.model",
"dataset": "official fineweb10B_sp1024"
}
## train.log
output/submission_sharev3_3seed/submission_seed2027_20260418_103507/20260418_103511.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
Architecture: Discrete N-Gram Hash (Max N=2)
Architecture: StructuredGroupSignedBiFPN (groups=8, band=1)
model_params:17390313
world_size:1 grad_accum_steps:4
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:3600.000
seed:2027
Architecture Skip Mode: Symmetric U-Net
Enhancement: Discrete N-Gram Hash (Max N=2)
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
EMA Enabled: decay=0.997
Scheduled Late QAT to start at step 2550 (last 15.0%)
step:0/3000 val_loss:6.9310 val_bpb:4.1049 train_time:5ms step_avg:4.66ms
step:1/3000 train_loss:6.9310 train_time:4328ms step_avg:4328.03ms
step:2/3000 train_loss:6.7809 train_time:7824ms step_avg:3912.03ms
step:3/3000 train_loss:6.3509 train_time:8434ms step_avg:2811.31ms
step:4/3000 train_loss:6.0286 train_time:9048ms step_avg:2262.10ms
step:5/3000 train_loss:5.8585 train_time:9663ms step_avg:1932.65ms
step:6/3000 train_loss:5.7350 train_time:10276ms step_avg:1712.72ms
step:7/3000 train_loss:5.6178 train_time:10890ms step_avg:1555.74ms
step:8/3000 train_loss:5.5590 train_time:11507ms step_avg:1438.42ms
step:9/3000 train_loss:5.4568 train_time:12123ms step_avg:1346.98ms
step:10/3000 train_loss:5.3681 train_time:12735ms step_avg:1273.49ms
step:200/3000 train_loss:2.7164 train_time:130023ms step_avg:650.12ms
step:400/3000 train_loss:2.3737 train_time:253575ms step_avg:633.94ms
step:600/3000 train_loss:2.4822 train_time:377129ms step_avg:628.55ms
step:800/3000 train_loss:2.3391 train_time:500560ms step_avg:625.70ms
step:1000/3000 train_loss:2.3517 train_time:624011ms step_avg:624.01ms
step:1000/3000 val_loss:2.3286 val_bpb:1.3791 train_time:624012ms step_avg:624.01ms
step:1200/3000 train_loss:2.2892 train_time:747509ms step_avg:622.92ms
step:1400/3000 train_loss:2.3483 train_time:870940ms step_avg:622.10ms
step:1600/3000 train_loss:2.2245 train_time:994295ms step_avg:621.43ms
step:1800/3000 train_loss:2.2709 train_time:1117660ms step_avg:620.92ms
step:2000/3000 train_loss:2.2152 train_time:1240992ms step_avg:620.50ms
step:2000/3000 val_loss:2.2320 val_bpb:1.3219 train_time:1240994ms step_avg:620.50ms
step:2200/3000 train_loss:2.1446 train_time:1364383ms step_avg:620.17ms
step:2400/3000 train_loss:2.1720 train_time:1487687ms step_avg:619.87ms
[Step 2550] Activating Late QAT — enabling branchless STE quantization.
step:2600/3000 train_loss:2.2332 train_time:1610963ms step_avg:619.60ms
step:2800/3000 train_loss:2.1837 train_time:1734230ms step_avg:619.37ms
step:3000/3000 train_loss:2.1042 train_time:1857569ms step_avg:619.19ms
step:3000/3000 val_loss:2.1537 val_bpb:1.2755 train_time:1857570ms step_avg:619.19ms
peak memory allocated: 22182 MiB reserved: 24640 MiB
Applying EMA weights for final evaluation...
Serialized model: 67895209 bytes
Code size: 126855 bytes
Total submission size: 68022064 bytes
Serialized model int8+zlib: 15846771 bytes (payload:17577610 raw_torch:17627197 payload_ratio:3.86x)
Total submission size int8+zlib: 15973626 bytes
final_int8_zlib_roundtrip val_loss:2.1565 val_bpb:1.2772 eval_time:18650ms
final_int8_zlib_roundtrip_exact val_loss:2.15645243 val_bpb:1.27717259
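
The log reports both `val_loss` (nats per token) and `val_bpb`. The paired values imply a fixed tokens-per-byte factor on the validation set of roughly 0.4105, via bpb = loss / ln(2) × tokens_per_byte. This factor is back-solved from the log, not read from the repo:

```python
import math

# (val_loss, val_bpb) pairs taken from the log above: step 0, step 3000,
# and the final exact roundtrip eval.
pairs = [(6.9310, 4.1049), (2.1537, 1.2755), (2.15645243, 1.27717259)]

# Back-solve tokens-per-byte: bpb = loss / ln(2) * tokens_per_byte.
ratios = [bpb * math.log(2) / loss for loss, bpb in pairs]
print(ratios)  # all roughly 0.4105 tokens per byte, i.e. ~2.44 bytes/token
```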