val_bpb: 1.0889
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.83 MB
Training Techniques
Architecture
depth recurrence
Replaced the stack of unique transformer blocks with 3 weight-shared blocks repeated across depth; the repeat count grows progressively over training to reach 15 effective layers.
parameters: {"shared_blocks":3,"repeats":[2,3,4,5],"effective_layers":15}
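A minimal sketch of what this recurrence might look like, assuming the `repeats` list is a per-stage schedule that ends at 5 repeats of the 3 shared blocks (3 × 5 = 15 effective layers); the block internals and schedule semantics here are illustrative guesses, not the submission's code:

```python
def make_block(scale):
    # Stand-in for a transformer block: a simple affine map on a scalar.
    return lambda x: x * scale + 0.1

# 3 shared (weight-tied) blocks reused at every depth position.
shared_blocks = [make_block(s) for s in (0.9, 1.0, 1.1)]

def forward(x, repeats):
    # Apply the same 3 blocks `repeats` times: depth recurrence with tying.
    for _ in range(repeats):
        for block in shared_blocks:
            x = block(x)
    return x

# Hypothetical progressive schedule: later stages unroll more repeats,
# ending at 5 * 3 = 15 effective layers.
for repeats in (2, 3, 4, 5):
    effective_layers = repeats * len(shared_blocks)
    y = forward(1.0, repeats)
```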
weight tying
Shared weights across repeated blocks instead of unique layers.
parameters: null
U-Net skip connections
Removed the baseline U-Net skip connections; the Cross-Repeat Skip (below) replaced them.
parameters: null
Value Residual
Added value embeddings mixed into the residual stream at each effective layer.
parameters: {"tables":2}
other
Cross-Repeat Skip: each block receives a weighted residual from its output in the previous repeat, making recurrence stateful.
parameters: {"learned_scales":true}
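The mechanism described above can be sketched as follows; this is one plausible reading of "weighted residual from its output in the previous repeat" (per-block learned scales, state carried between repeats), not a reproduction of the submission's implementation:

```python
def block(x):
    # Stand-in transformer block.
    return 0.9 * x + 0.1

def forward_with_cross_repeat_skip(x, repeats, n_blocks=3):
    scales = [0.5] * n_blocks         # learned per-block scales (assumed init)
    prev_outputs = [None] * n_blocks  # each block's output on the last repeat
    for _ in range(repeats):
        for b in range(n_blocks):
            if prev_outputs[b] is not None:
                # Skip connection from the same block's previous repeat:
                # this is what makes the recurrence stateful.
                x = x + scales[b] * prev_outputs[b]
            x = block(x)
            prev_outputs[b] = x
    return x
```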
other
Loop embedding: learned per-layer vector added before each block as depth-wise positional encoding.
parameters: null
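A minimal sketch of the loop embedding as described: one learned vector per effective layer, added to the activations before the block runs, so weight-tied blocks can condition on their depth position. The table shape and initialization are assumptions:

```python
dim = 4
n_effective_layers = 15

# One learned vector per effective layer (placeholder values).
loop_emb = [[0.01 * layer] * dim for layer in range(n_effective_layers)]

def apply_block_at_depth(x, layer, block):
    # Add the depth-wise positional signal, then run the (shared) block.
    x = [xi + e for xi, e in zip(x, loop_emb[layer])]
    return block(x)
```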
Quantization
int8
bits: 8
scope: model weights
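A sketch of symmetric per-tensor int8 weight quantization, the simplest scheme consistent with "bits: 8, scope: model weights"; the submission's exact scheme (per-tensor vs. per-channel, rounding mode) is not specified:

```python
def quantize_int8(weights):
    # Scale so the largest-magnitude weight maps to 127.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from int8 codes.
    return [qi * scale for qi in q]
```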
Weight Averaging
SWA
parameters: {"checkpoints":38}
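Stochastic weight averaging over saved checkpoints reduces to a uniform mean of each parameter across checkpoints. A toy version, with a "checkpoint" represented as a dict of parameter lists (the real run averaged 38 model checkpoints):

```python
def average_checkpoints(checkpoints):
    # Uniform SWA: elementwise mean of every parameter over all checkpoints.
    n = len(checkpoints)
    avg = {}
    for name in checkpoints[0]:
        length = len(checkpoints[0][name])
        avg[name] = [sum(ck[name][i] for ck in checkpoints) / n
                     for i in range(length)]
    return avg
```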
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
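With `window=1024` and `stride=256`, a sliding-window evaluation scores each window's final `stride` tokens so that every token (after the first window) sees up to `window - 1` tokens of context. The scoring rule below is an assumption about how these two parameters are used, not confirmed detail:

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    # Yield (start, end, score_from) spans: tokens before `score_from`
    # (relative to the window) are context only and are not scored.
    spans = []
    start = 0
    while start + window <= n_tokens:
        score_from = 0 if start == 0 else window - stride
        spans.append((start, start + window, score_from))
        start += stride
    return spans
```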
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
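A "warmdown" schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. A sketch, where `total_iters` and the linear shape are assumptions (only `warmdown_iters=3000` is given):

```python
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=3000):
    # Constant LR until the warmdown window begins.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    # Linear decay from base_lr down to 0 across the warmdown window.
    frac = (total_iters - step) / warmdown_iters
    return base_lr * frac
```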
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Adam":true}
Compression
zlib
level: null
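Since the level is unspecified, a sketch of artifact compression using `zlib.compress` at its default level (the stdlib maps the default to level 6):

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    # Default compression level; the submission's level is not recorded.
    return zlib.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```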
Novel Contributions
- Progressive depth recurrence scaling study with shared-weight recurrence
- Cross-Repeat Skip to make recurrence stateful
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding
- Large-scale SWA over 38 checkpoints
- Hedge Mixer evaluation adapted from prior submissions