PR #1802
openRecord: SP8192 + Polar Express NS + Multi-Phase Global TTT — val_bpb 1.0771 (3-seed mean)
by aamodbhatt
val_bpb
1.0771
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: token embeddings
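GPTQ itself does Hessian-aware, error-compensated rounding; as a hedged illustration only, the sketch below shows the uniform 6-bit grid such a scheme rounds weights onto (the `quantize_uniform`/`dequantize` helpers are hypothetical, not the submission's code).

```python
import numpy as np

def quantize_uniform(w, bits=6):
    # Map weights onto a uniform grid of 2**bits levels (63 steps for 6 bits).
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    q = np.round((w - w.min()) / scale)  # integer codes in [0, levels]
    return q.astype(np.int64), w.min(), scale

def dequantize(q, zero, scale):
    # Reconstruct approximate weights from integer codes.
    return q * scale + zero

w = np.random.default_rng(4).standard_normal((4, 8))
q, zero, scale = quantize_uniform(w, bits=6)
w_hat = dequantize(q, zero, scale)
```

Rounding to the grid bounds the per-weight error by half a step (`scale / 2`); GPTQ's contribution is redistributing that error across not-yet-quantized columns.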
Architecture
depth recurrence
Encoder/decoder layer recurrence with repeated layers during generation/adaptation.
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10]}
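The encoder schedule above applies 8 layer passes using only 6 unique layers (indices 3 and 4 repeat). A minimal sketch of that idea, with a placeholder residual block standing in for the real transformer layer:

```python
import numpy as np

ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]  # from the record's parameters

rng = np.random.default_rng(0)
dim = 16
# 6 unique weight matrices; the schedule decides how often each is applied.
layers = [rng.standard_normal((dim, dim)) * 0.05 for _ in range(6)]

def forward(x, schedule):
    # 8 layer applications per pass, but no extra parameters for the repeats.
    for idx in schedule:
        x = x + np.maximum(x @ layers[idx], 0.0)  # residual + ReLU placeholder
    return x

y = forward(rng.standard_normal((2, dim)), ENCODER_SCHEDULE)
```

Depth recurrence trades extra compute per token for model capacity at zero parameter (and artifact-size) cost, which is why it pairs well with a sub-16 MB budget.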
Partial RoPE
Uses rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":"16/64"}
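A hedged sketch of partial RoPE with the record's 16-of-64 split, using one common pairing convention (half/half rotation); the remaining 48 dims pass through untouched:

```python
import numpy as np

HEAD_DIM, ROT = 64, 16  # rotate only the first 16 of 64 head dims

def partial_rope(x, positions, base=10000.0):
    # x: (seq, HEAD_DIM). Rotate the first ROT dims in pairs; keep the rest.
    half = ROT // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT:]], axis=-1)

x = np.random.default_rng(1).standard_normal((8, HEAD_DIM))
y = partial_rope(x, np.arange(8, dtype=np.float64))
```

Because rotation is norm-preserving, only the relative phase of the first 16 dims carries position; the unrotated dims remain position-independent channels.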
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
U-Net skip connections
Skip connections gated in a U-Net-like pattern.
parameters: null
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
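With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal sketch (toy shapes, no masking):

```python
import numpy as np

H, KV, D, seq = 8, 4, 16, 5
rng = np.random.default_rng(2)
q = rng.standard_normal((H, seq, D))
k = rng.standard_normal((KV, seq, D))
v = rng.standard_normal((KV, seq, D))

def gqa(q, k, v):
    group = q.shape[0] // k.shape[0]     # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)      # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

out = gqa(q, k, v)
```

The `np.repeat` is conceptual; real implementations index the shared KV heads without materializing copies.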
Regularization
logit softcap
parameters: {"value":30}
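Logit softcapping with value 30 squashes logits through a scaled tanh so their magnitude never exceeds the cap while leaving small logits almost unchanged:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # Smoothly bound logits to (-cap, cap); near-identity for |logits| << cap.
    return cap * np.tanh(logits / cap)

z = softcap(np.array([-100.0, -5.0, 0.0, 5.0, 100.0]))
```

This bounds the softmax temperature implicitly and keeps extreme logits from dominating gradients.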
layerwise LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"polar_express_ns_coefficients":true,"backend_steps":5}
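Muon orthogonalizes each gradient update via a quintic Newton-Schulz iteration; this record swaps Muon's fixed coefficients for Polar Express per-step coefficients. The sketch below uses the standard fixed Muon coefficients as placeholders, since the actual Polar Express values are not listed here:

```python
import numpy as np

def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    # Iteratively push the singular values of G toward 1 (approximate
    # orthogonalization). Polar Express would supply tuned per-step coeffs.
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-norm scaling
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic polynomial in the SVs
    return X

G = np.random.default_rng(3).standard_normal((8, 8))
O = newton_schulz(G, steps=5)  # steps=5 matches the record's backend_steps
```

Each iteration applies the polynomial a·s + b·s³ + c·s⁵ to every singular value s, so after a few steps the update matrix is close to orthogonal.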
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.015,"gradient_clip":1}
Weight Averaging
EMA
parameters: {"decay":0.9965}
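EMA weight averaging with decay 0.9965 maintains a slow-moving copy of the weights that is updated after every optimizer step:

```python
DECAY = 0.9965  # from the record; ~1/(1-decay) ≈ 286-step averaging horizon

def ema_update(ema, w, decay=DECAY):
    # ema <- decay * ema + (1 - decay) * w, elementwise per parameter.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, w)]

weights = [1.0, -2.0]
ema = list(weights)      # typically initialized from the current weights
weights = [0.0, 0.0]     # pretend an optimizer step moved the weights
ema = ema_update(ema, weights)
```

The EMA copy, not the raw weights, is what gets evaluated (and here, quantized and shipped).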
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1984}
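With stride 64 and context 1984, each evaluation window scores only the tokens not covered by the previous window, so every token after the first window is scored with near-maximal left context. A hedged sketch of the window layout (the helper is hypothetical):

```python
STRIDE, CONTEXT = 64, 1984  # from the record

def window_spans(n_tokens, stride=STRIDE, context=CONTEXT):
    # Returns (window_start, window_end, n_scored) triples covering all tokens;
    # only tokens beyond the previous window's end count toward the loss.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

# Small numbers for readability: stride 2, context 4 over 10 tokens.
spans = window_spans(10, stride=2, context=4)
```

Every token is scored exactly once, at the cost of re-running overlapping context for each stride step.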
Test-Time Training
full TTT
parameters: {"phases":3,"learning_rate":0.015,"momentum":0.9,"cosine_decay":true,"score_before_update":true}
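The parameters above describe a loop in which each chunk is scored with the current weights before the model adapts on it, repeated over 3 global phases, with the TTT learning rate following a cosine decay. A toy sketch under those assumptions (the scalar "model" is purely illustrative):

```python
import math

PHASES, BASE_LR = 3, 0.015  # from the record

def multi_phase_ttt(chunks, score, update, phases=PHASES, base_lr=BASE_LR):
    losses, step, total = [], 0, phases * len(chunks)
    for _ in range(phases):
        for chunk in chunks:
            losses.append(score(chunk))  # score BEFORE the update
            lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total))
            update(chunk, lr)            # then adapt on the same chunk
            step += 1
    return losses

# Toy usage: a single scalar weight nudged toward each chunk's value.
state = {"w": 0.0}
scores = multi_phase_ttt(
    chunks=[1.0, 2.0],
    score=lambda c: (state["w"] - c) ** 2,
    update=lambda c, lr: state.__setitem__("w", state["w"] + lr * (c - state["w"])),
)
```

Scoring before updating keeps the reported loss honest (no chunk is evaluated by a model that has already trained on it in the current phase), while later phases still benefit from adaptation on the full stream.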
LR Schedule
warmdown
parameters: {"min_lr_floor":0.1}
cosine decay
parameters: {"applied_to":"training and TTT chunks"}
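The warmdown floor above means the learning rate never decays below 10% of its peak, preserving meaningful updates late in training. A minimal sketch, assuming a linear warmdown shape (the exact decay curve is not stated in the record):

```python
def warmdown_lr(step, total_steps, peak_lr, floor_frac=0.1):
    # Linear decay from peak_lr toward zero, clipped at floor_frac * peak_lr.
    frac = max(0.0, 1.0 - step / total_steps)
    return peak_lr * max(frac, floor_frac)
```

Without the floor, the final steps contribute almost nothing; with it, the last ~10% of the schedule still moves the weights at a tenth of peak speed.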
Compression
brotli
level: 11
Novel Contributions
- Multi-Phase Global TTT that scores all windows globally, trains all chunks, and repeats across phases
- Polar Express Newton-Schulz coefficients replacing fixed Muon coefficients
- MIN_LR warmdown floor at 0.10 to preserve learning updates late in training
- Combined SP8192, GPTQ SDClip quantization, and depth recurrence into a sub-16MB submission