val_bpb: 1.1182
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.93 MB
Training Techniques
Architecture
depth recurrence
Repeats layers 4 and 5 to create 13 virtual layers from 11 physical layers at zero parameter cost.
parameters: {"layers":[4,5],"physical_layers":11,"virtual_layers":13}
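A minimal sketch of how the virtual-layer schedule could be built. The exact replay order (running the repeated block once more immediately after its first pass) is an assumption; the layer indices and counts come from the parameters above.

```python
def depth_recurrence_schedule(physical_layers, repeat_block):
    # Execution order: run layers 0..N-1 in order, replaying the
    # repeat_block once right after its first pass. This turns 11
    # physical layers into 13 virtual ones with no new parameters.
    schedule = []
    for i in range(physical_layers):
        schedule.append(i)
        if i == repeat_block[-1]:
            schedule.extend(repeat_block)
    return schedule

print(depth_recurrence_schedule(11, [4, 5]))
# -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]  (13 virtual layers)
```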
BigramHash
Adds a BigramHash module with vocabulary size 2048 and embedding dimension 128.
parameters: {"vocab_size":2048,"dim":128}
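One plausible reading of the module, sketched below: hash each (previous, current) token pair into a small auxiliary vocabulary of 2048 entries and look up an extra 128-dimensional vector. The multiplicative hash and the start-of-sequence id 0 are assumptions, not the submission's actual choices.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    # Hash each (previous, current) token pair into a small auxiliary
    # vocabulary and look up an extra embedding vector for it.
    vocab_size, dim = table.shape          # (2048, 128) per the parameters
    out = np.zeros((len(tokens), dim))
    prev = 0                               # assumed start-of-sequence id
    for i, tok in enumerate(tokens):
        h = (prev * 1000003 + tok) % vocab_size   # assumed hash function
        out[i] = table[h]
        prev = tok
    return out

rng = np.random.default_rng(0)
table = rng.standard_normal((2048, 128))
emb = bigram_hash_embed([7, 42, 7, 42], table)
```

Identical bigrams map to identical vectors, so positions 1 and 3 above (both preceded by token 7) receive the same embedding.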
XSA
Uses XSA on the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
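A minimal sketch of partial RoPE under the parameters above: only the first 16 of 64 channels are rotated, and the remaining channels pass through with no positional information. The choice of which channels to rotate and the 10000 frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16):
    # Rotate only the first `rot_dims` channels of each head;
    # the remaining channels carry no positional rotation.
    seq, dim = x.shape
    half = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq), freqs)       # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(0).standard_normal((32, 64))
y = partial_rope(x)
```

Position 0 has zero rotation angle, so its vector is unchanged, and the last 48 channels are identical at every position.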
MLP3x
Uses an MLP block with 3x hidden expansion and a squared LeakyReLU activation (negative slope 0.5).
parameters: {"activation":"LeakyReLU(0.5)^2"}
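A sketch of one reading of "LeakyReLU(0.5)^2": apply LeakyReLU with negative slope 0.5, then square elementwise (the same family as the squared-ReLU activation used in fast-training Transformer recipes). Whether the submission restores the sign after squaring is unknown; this version does not.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # One interpretation of LeakyReLU(0.5)^2: leaky ReLU with
    # negative slope 0.5, squared elementwise.
    return np.where(x > 0, x, slope * x) ** 2

def mlp3x(x, w1, w2):
    # MLP with a 3x hidden expansion instead of the usual 4x.
    return leaky_relu_sq(x @ w1) @ w2

d = 64
rng = np.random.default_rng(0)
w1 = rng.normal(scale=d ** -0.5, size=(d, 3 * d))
w2 = rng.normal(scale=(3 * d) ** -0.5, size=(3 * d, d))
y = mlp3x(rng.standard_normal((8, d)), w1, w2)
```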
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"all_blocks_unfrozen":true}
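The update rule itself is plain SGD with momentum and no weight decay; a minimal sketch using the hyperparameters listed above:

```python
def sgd_momentum_step(param, grad, buf, lr=0.002, momentum=0.9):
    # One SGD step with momentum 0.9 and no weight decay,
    # matching the hyperparameters in other_params above.
    buf = momentum * buf + grad
    return param - lr * buf, buf

p, buf = 1.0, 0.0
p, buf = sgd_momentum_step(p, 0.5, buf)
# p = 1.0 - 0.002 * 0.5 = 0.999, buf = 0.5
```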
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
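A sketch of one way to apply the 1/sqrt(layer+1) scale, assuming it multiplies the LayerNorm output directly (where exactly the scale is applied in the submission is not stated):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    # Standard LayerNorm whose output is damped by 1/sqrt(layer + 1),
    # so deeper layers contribute progressively smaller residuals.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)

x = np.random.default_rng(0).standard_normal((4, 64))
```

Layer 3's output is exactly half of layer 0's for the same input, since sqrt(4)/sqrt(1) = 2.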
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA weight averaging"}
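The two averaging rules can be sketched per-parameter as follows; the EMA decay 0.997 and the every-50-steps SWA frequency come from the parameters above, while everything else is a generic formulation:

```python
def ema_update(avg, param, decay=0.997):
    # Exponential moving average of weights with decay 0.997.
    return decay * avg + (1.0 - decay) * param

def swa_update(avg, param, n_averaged):
    # Running (stochastic weight averaging) mean; in this submission
    # a snapshot is folded in every 50 steps.
    return avg + (param - avg) / (n_averaged + 1)
```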
Compression
lzma
level: null
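Since the compression level is unspecified, a sketch with Python's standard `lzma` module; the `preset=9 | PRESET_EXTREME` setting shown here is an assumption, not the submission's actual choice:

```python
import lzma

# Compress the serialized artifact bytes with LZMA. The payload here
# is a placeholder standing in for the real weight bytes.
payload = b"\x00" * 4096
blob = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
restored = lzma.decompress(blob)
```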
Evaluation
sliding window eval
parameters: {"stride":64}
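A sketch of how the evaluation windows could be laid out: each window scores only its last `stride` = 64 tokens, so every token is evaluated exactly once with up to a full context window of left context. The window length of 2048 is an assumption; only the stride comes from the parameters above.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Yield (context_start, score_start, score_end) triples: each
    # window scores only its last `stride` tokens, giving every
    # token long left context while scoring each token once.
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        spans.append((max(0, end - window), start, end))
        start = end
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
```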
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_tokens":32768,"all_blocks_unfrozen":true}
LR Schedule
cosine decay
parameters: {"across_chunks":true}
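With decay applied across chunks rather than within each chunk's epochs, the per-chunk learning rate might look like this (base LR 0.002 and the chunk count 1893 come from the submission; decaying fully to zero is an assumption):

```python
import math

def cosine_lr(chunk_idx, total_chunks=1893, base_lr=0.002):
    # Cosine decay of the TTT learning rate from base_lr down to 0
    # across all chunks of the evaluation stream.
    progress = chunk_idx / max(1, total_chunks - 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```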
Other
other
Legal score-first test-time training with backward-looking chunk adaptation; each chunk is scored before being trained on, and the last chunk is scored but never trained on.
parameters: {"chunks":1893}
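The score-first loop described above can be sketched as follows: each chunk is scored with the current weights before any gradient step on it, so no chunk's score ever depends on having trained on that chunk, and the final chunk is scored only. The `score_fn`/`train_fn` callables are placeholders for the real model.

```python
def score_first_ttt(chunks, score_fn, train_fn):
    # Legal test-time training: score each chunk BEFORE updating on
    # it; the last chunk is scored but never trained on.
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score_fn(chunk)        # score with pre-update weights
        if i < len(chunks) - 1:
            train_fn(chunk)             # then adapt on the chunk
    return total

log = []
bits = score_first_ttt(
    ["c0", "c1", "c2"],
    score_fn=lambda c: (log.append(("score", c)), 1.0)[1],
    train_fn=lambda c: log.append(("train", c)),
)
```

Running the toy example confirms the ordering: every score event precedes the train event for the same chunk, and the last chunk has no train event.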
Novel Contributions
- Depth recurrence on layers 4 and 5 to create 13 virtual layers from 11 physical layers with zero parameter cost
- First successful use of depth recurrence on the leaderboard
- Legal score-first SGD test-time training applied on top of the base model
- Combination of depth recurrence with SGD TTT to improve BPB from 1.1208 to 1.1182