val_bpb: 1.1036
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.70 MB
Training Techniques
Architecture
weight tying
Tied input/output embeddings.
parameters: null
depth recurrence
12 physical layers with a 2-layer recurrence loop, yielding 16 effective layers.
parameters: {"physical_layers":12,"effective_layers":16,"recurrent_layers":2,"repeats":3}
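The depth-recurrence scheme above can be sketched as an execution schedule: a 2-layer block inside a 12-layer stack is looped 3 times, so 12 physical layers run as 16 effective layers. The loop's position in the stack (`loop_start`) is an illustrative assumption, not stated in the record.

```python
# Hedged sketch of depth recurrence: 12 physical layers, a 2-layer block
# repeated 3 times, 16 effective layers. `loop_start` is an assumption.

def build_schedule(physical_layers=12, recurrent_layers=2, repeats=3, loop_start=5):
    """Return the sequence of physical-layer indices actually executed."""
    schedule = list(range(loop_start))                        # layers before the loop
    block = list(range(loop_start, loop_start + recurrent_layers))
    schedule += block * repeats                               # recurrent block, repeated
    schedule += list(range(loop_start + recurrent_layers, physical_layers))
    return schedule

schedule = build_schedule()   # 16 entries drawn from 12 distinct layers
```

Weight reuse is what keeps the artifact small: extra depth costs compute at inference, not parameters in the checkpoint.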
XSA
Applied XSA in all layers.
parameters: null
U-Net skip connections
Used U-Net style encoder-decoder skip connections with learned gates.
parameters: null
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
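The GQA record above (8 query heads, 4 KV heads) means each KV head is shared by 2 query heads. A minimal numpy sketch, with sequence length and head dimension chosen only for illustration:

```python
import numpy as np

# Hedged sketch of grouped-query attention: KV heads are repeated so that
# each serves a group of query heads. Shapes here are illustrative.

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv                          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)              # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))   # stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

q = np.random.randn(8, 16, 32)   # 8 query heads
k = np.random.randn(4, 16, 32)   # 4 KV heads
v = np.random.randn(4, 16, 32)
out = gqa(q, k, v)               # shape (8, 16, 32)
```

Halving the KV heads halves the K/V projection parameters and KV-cache size relative to full multi-head attention.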
LeakyReLU
Used LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
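One plain reading of "LeakyReLU squared" with `negative_slope` 0.5 is LeakyReLU followed by squaring; whether the negative branch keeps its sign is not stated in the record, so the non-negative convention below is an assumption:

```python
# Hedged sketch of LeakyReLU squared (negative_slope=0.5): apply LeakyReLU,
# then square. The output is non-negative on both branches under this
# assumed convention.

def leaky_relu_squared(x, negative_slope=0.5):
    y = x if x >= 0 else negative_slope * x
    return y * y
```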
BigramHash
Added a zero-initialized bigram hash embedding trained during TTT.
parameters: {"dimensions":[16384,512]}
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"backend_steps":5,"warmup":"0.92->0.99 over 1500 steps"}
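The Muon momentum warmup "0.92->0.99 over 1500 steps" can be sketched as a ramp that then holds at the final value; the linear shape is an assumption, the endpoints come from the config above.

```python
# Hedged sketch of the Muon momentum warmup: ramp 0.92 -> 0.99 over the
# first 1500 steps (linear shape assumed), then hold at 0.99.

def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```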
AdamW
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"fused":true}
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"start":"last 33%","frequency":5,"blend":"50/50 with EMA"}
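The two averaging records above combine as: an EMA with decay 0.9965 runs throughout training, an SWA average collects every 5th checkpoint over the last 33% of steps, and the two are blended 50/50 at the end. A scalar stands in for the parameter tensors in this sketch:

```python
# Hedged sketch of EMA (decay 0.9965) + SWA (last 33%, every 5 steps)
# with a final 50/50 blend. A scalar weight stands in for each tensor.

def average_weights(weights, total_steps, ema_decay=0.9965, swa_every=5, swa_frac=0.33):
    ema = weights[0]
    swa_sum, swa_n = 0.0, 0
    swa_start = int(total_steps * (1 - swa_frac))
    for step, w in enumerate(weights):
        ema = ema_decay * ema + (1 - ema_decay) * w      # running EMA
        if step >= swa_start and step % swa_every == 0:  # SWA window
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    return 0.5 * ema + 0.5 * swa                         # 50/50 blend
```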
Evaluation
sliding window eval
parameters: {"stride":64}
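Sliding-window evaluation with stride 64 scores only the last 64 tokens of each window, so every token is evaluated exactly once with near-maximal left context. The index bookkeeping can be sketched as below; the window size here is illustrative.

```python
# Hedged sketch of sliding-window evaluation with stride 64: each token is
# scored once, inside a window giving it near-maximal left context.

def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_start, score_end) index triples."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_end - window)   # context preceding scored span
        spans.append((window_start, score_start, score_end))
    return spans
```

A small stride trades extra forward passes for better per-token context, which directly lowers val_bpb.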
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","momentum":0.9,"learning_rate":0.01,"epochs":3,"gradient_clip":1}
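"Score-first" TTT means each chunk is scored with the current weights before the model adapts to it, so the reported bits never benefit from having seen the chunk. A toy scalar sketch with the listed hyperparameters (SGD, momentum 0.9, lr 0.01, 3 epochs, clip 1); the real method operates on tensors and a language-model loss, which is assumed away here:

```python
# Hedged sketch of score-first TTT: score each chunk BEFORE training on it,
# then adapt with clipped SGD+momentum. Toy scalar model for illustration.

def ttt(chunks, score_fn, grad_fn, w0, lr=0.01, momentum=0.9, epochs=3, clip=1.0):
    w, v, scores = w0, 0.0, []
    for chunk in chunks:
        scores.append(score_fn(w, chunk))                    # score first
        for _ in range(epochs):                              # then adapt
            g = max(-clip, min(clip, grad_fn(w, chunk)))     # gradient clip at 1
            v = momentum * v + g
            w = w - lr * v
    return w, scores
```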
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
LR Schedule
warmdown
parameters: {"frac":0.72}
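One common reading of a warmdown schedule with frac 0.72 is: hold the base learning rate, then decay to zero over the final 72% of training. The linear decay shape is an assumption.

```python
# Hedged sketch of the warmdown LR schedule (frac=0.72): constant base LR,
# then an assumed-linear decay to zero over the last 72% of steps.

def warmdown_lr(step, total_steps, base_lr=1.0, frac=0.72):
    start = total_steps * (1 - frac)     # warmdown begins 28% of the way in
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```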
Regularization
logit softcap
parameters: {"value":30}
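Logit softcapping with value 30 squashes logits through a scaled tanh so they stay within ±30, bounding the loss and stabilizing training while remaining nearly identity for small logits:

```python
import math

# Hedged sketch of logit softcapping (cap=30): cap * tanh(logit / cap)
# bounds logits to (-30, 30) and is approximately identity near zero.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```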
weight decay
parameters: {"value":0.095}
Novel Contributions
- Custom sp9000 SentencePiece BPE tokenizer trained on competition data
- 12-layer Transformer with depth recurrence for 16 effective layers
- Code-level step-time optimization using foreach operations and layout/precomputation improvements
- Improved score-first TTT with tuned hyperparameters
- Zero-initialized bigram hash embedding trained during TTT