Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #688
RoyiRa wants to merge 4 commits into openai:main from
Conversation
Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not correctly renormalize or reweight the LM's token distribution, and which look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion about this under the Issues tab for more details, and please submit more runs in the future!
@valerio-oai |
Thank you @Eppie for the correction; I should have been more careful!
Summary
3-seed mean val_bpb: 1.0745 (std 0.021) | <15.5 MB | 8xH100 SXM, 600s
Results
Key Technique: 5-expert Logistic Context Mixer
GPU-vectorized online context mixing using the Hedge algorithm. Five experts blend predictions in log-probability space during TTT eval:
N-gram tables are built incrementally from already-scored tokens only (legal). Each expert produces an NLL for every token. The mixer maintains learned weights (one per expert), updated online via the Hedge rule:

```
log_w -= eta * loss
```

At each position, the mixed prediction is:

```
mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
```

Training Budget
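A minimal sketch of the scheme above, assuming an add-alpha-smoothed bigram count table as one expert and a uniform fallback as another; the names `BigramExpert` and `HedgeMixer` are illustrative, not taken from the submission:

```python
import math
from collections import defaultdict


class BigramExpert:
    """Incremental bigram expert: counts include only tokens that have
    already been scored, so no eval tokens leak into the tables."""

    def __init__(self, vocab_size, alpha=1.0):
        self.vocab_size = vocab_size
        self.alpha = alpha               # add-alpha smoothing
        self.counts = defaultdict(int)   # (prev, tok) -> count
        self.totals = defaultdict(int)   # prev -> total count

    def nll(self, prev_tok, tok):
        c = self.counts[(prev_tok, tok)] + self.alpha
        z = self.totals[prev_tok] + self.alpha * self.vocab_size
        return -math.log(c / z)

    def observe(self, prev_tok, tok):
        # Called strictly AFTER the token has been scored.
        self.counts[(prev_tok, tok)] += 1
        self.totals[prev_tok] += 1


class HedgeMixer:
    """Online Hedge over K experts, mixed in log-probability space."""

    def __init__(self, n_experts, eta=0.1):
        self.log_w = [0.0] * n_experts  # uniform prior
        self.eta = eta

    def weights(self):
        # Normalized weights w_k = softmax(log_w)_k.
        m = max(self.log_w)
        ws = [math.exp(lw - m) for lw in self.log_w]
        z = sum(ws)
        return [w / z for w in ws]

    def mix(self, nlls):
        # mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
        p = sum(w * math.exp(-nll) for w, nll in zip(self.weights(), nlls))
        return -math.log(p)

    def update(self, nlls):
        # Hedge update: log_w -= eta * loss (each expert's NLL is its loss).
        for k, nll in enumerate(nlls):
            self.log_w[k] -= self.eta * nll


if __name__ == "__main__":
    bigram = BigramExpert(vocab_size=4)
    mixer = HedgeMixer(n_experts=2, eta=0.5)
    uniform_nll = math.log(4)
    tokens = [1, 2, 1, 2, 1, 2, 1, 2]
    for prev, tok in zip(tokens, tokens[1:]):
        nlls = [bigram.nll(prev, tok), uniform_nll]
        mixed = mixer.mix(nlls)    # score first...
        bigram.observe(prev, tok)  # ...then update the tables
        mixer.update(nlls)         # ...and the weights
```

The per-token ordering is the point: compute each expert's NLL, mix, and only then update tables and weights, so nothing ever sees a token before it has been scored.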
GPTQ calibration runs within the 600s training budget (18s reserved).
Reproduction
Credits