val_bpb: 1.1980
Architecture: Transformer
Optimizer: Muon + Adam
Artifact Size: 12.83 MB
Training Techniques
Architecture
depth recurrence
Replaced unique transformer blocks with shared blocks repeated multiple times to increase effective depth.
parameters: {"blocks":3,"repeats":4,"effective_layers":12}
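The recurrence can be sketched as a stack of 3 unique blocks applied 4 times, giving 12 effective layers from 3 layers' worth of parameters. This is a minimal illustration, not the submission's actual code; `nn.TransformerEncoderLayer` stands in for whatever block the model really uses.

```python
import torch
import torch.nn as nn

class RecurrentTransformer(nn.Module):
    """Depth recurrence: 3 shared blocks applied 4 times -> 12 effective layers."""
    def __init__(self, dim=256, blocks=3, repeats=4):
        super().__init__()
        # Only `blocks` unique parameter sets exist, regardless of effective depth.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(blocks)
        )
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):   # repeat the whole shared stack
            for block in self.blocks:   # 3 shared blocks per pass
                x = block(x)
        return x                        # 3 * 4 = 12 block applications
```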
Cross-Repeat Skip
Adds a weighted residual connection from each block's output in the previous repeat to its input in the current repeat, making the recurrence stateful.
parameters: null
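One way to realize such a cross-repeat skip is to keep the previous repeat's hidden state and mix it back in with a learned weight. The names and the exact mixing rule below are assumptions, shown for a single shared block for brevity:

```python
import torch
import torch.nn as nn

class CrossRepeatSkip(nn.Module):
    """Stateful recurrence: each repeat also sees a weighted copy of the
    hidden state the block produced on the previous repeat (assumed form)."""
    def __init__(self, block, repeats=4):
        super().__init__()
        self.block = block
        self.repeats = repeats
        # One learned skip weight per repeat (the first repeat has no predecessor).
        self.skip_w = nn.Parameter(torch.zeros(repeats))

    def forward(self, x):
        prev = None
        for r in range(self.repeats):
            if prev is not None:
                # Weighted residual from the previous repeat's output.
                x = x + self.skip_w[r] * prev
            x = self.block(x)
            prev = x
        return x
```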
XSA
Exclusive self-attention applied to the last 4 layers.
parameters: {"layers":4}
value embeddings
Two extra embedding tables mixed into the residual stream at each effective layer with learned scales.
parameters: {"tables":2}
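A sketch of the idea: each extra table is a plain token-embedding lookup added to the residual stream with a learned per-layer scale. The class and parameter names, and zero-initialized scales, are assumptions, not the submission's code:

```python
import torch
import torch.nn as nn

class ValueEmbeddingMixer(nn.Module):
    """Two extra token-embedding tables mixed into the residual stream
    with a learned scale per (table, effective layer) pair."""
    def __init__(self, vocab=50304, dim=256, tables=2, eff_layers=12):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(vocab, dim) for _ in range(tables))
        self.scales = nn.Parameter(torch.zeros(tables, eff_layers))

    def forward(self, x, tokens, layer_idx):
        # x: residual stream (B, T, dim); tokens: input ids (B, T)
        for t, table in enumerate(self.tables):
            x = x + self.scales[t, layer_idx] * table(tokens)
        return x
```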
loop embedding
Learned per-layer vector added to the residual stream before each block, acting as a depth-wise positional encoding.
parameters: null
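A minimal sketch, assuming one learned `dim`-sized vector per effective layer broadcast-added to the residual stream (the real injection point may differ):

```python
import torch
import torch.nn as nn

class LoopEmbedding(nn.Module):
    """Depth-wise positional encoding: one learned vector per effective
    layer, added before that layer's block (assumed form)."""
    def __init__(self, eff_layers=12, dim=256):
        super().__init__()
        self.vecs = nn.Parameter(torch.zeros(eff_layers, dim))

    def forward(self, x, layer_idx):
        return x + self.vecs[layer_idx]  # broadcasts over (B, T, dim)
```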
KV head count
Grouped-query attention: 8 query heads share 4 KV heads, halving KV-cache size.
parameters: {"heads":8,"kv_heads":4}
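With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A common way to implement this is to expand the KV heads before standard attention (newer PyTorch versions can instead do this natively via `enable_gqa=True` in `scaled_dot_product_attention`):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch.
    q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd)."""
    group = n_heads // n_kv_heads          # queries per KV head (2 here)
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match q
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```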
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.012,"scalar_lr":0.012,"tied_embed_lr":0.015,"grad_clip_norm":0.3}
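Muon + Adam setups typically partition parameters by shape: 2D weight matrices go to Muon, while embeddings, scalars, and vectors go to Adam. The selection rule below is an assumption that mirrors the reported learning rates (`matrix_lr` 0.012, `scalar_lr` 0.012, `tied_embed_lr` 0.015); the submission's exact grouping may differ:

```python
import torch
import torch.nn as nn

def split_param_groups(model):
    """Partition parameters for a Muon + Adam split (assumed rule):
    matrices -> Muon (lr 0.012); scalars/vectors -> Adam (lr 0.012);
    tied embeddings -> Adam (lr 0.015)."""
    matrix, other, embed = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:
            embed.append(p)      # tied embedding table
        elif p.ndim >= 2:
            matrix.append(p)     # weight matrices: Muon
        else:
            other.append(p)      # biases, gains, learned scales: Adam
    return matrix, other, embed
```

Gradient clipping at the reported `grad_clip_norm` of 0.3 would then be applied across all groups, e.g. with `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.3)`.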
Weight Averaging
SWA
parameters: {"collected_only_at_full_depth":true}
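Given the progressive depth schedule, `collected_only_at_full_depth` means the average skips checkpoints taken while the model is still running at 2 or 3 repeats. A minimal running-mean sketch (the gating interface is assumed):

```python
import torch

class DepthGatedSWA:
    """Stochastic weight averaging that only folds in weights collected
    while the model is at full recurrence depth (4 repeats)."""
    def __init__(self, full_depth_repeats=4):
        self.full = full_depth_repeats
        self.n = 0
        self.avg = None  # dict: name -> averaged tensor

    def update(self, model, current_repeats):
        if current_repeats != self.full:
            return  # skip shallow-phase weights
        state = {k: v.detach().float().clone()
                 for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg = state
        else:
            for k in self.avg:  # running mean over collected checkpoints
                self.avg[k] += (state[k] - self.avg[k]) / (self.n + 1)
        self.n += 1
```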
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"window":1024,"stride":256}
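With window 1024 and stride 256, each evaluation window re-reads overlapping context but only scores the tokens past the overlap, so every token is scored exactly once with up to 768 tokens of prior context. A sketch of the standard bookkeeping (the submission's exact scheme may differ):

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Return (start, end, first_scored) triples: positions from
    `first_scored` to `end` contribute to the loss; earlier positions
    in the window are context only, so nothing is double-counted."""
    plans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        first_scored = start if start == 0 else start + (window - stride)
        plans.append((start, end, first_scored))
        if end == n_tokens:
            break
        start += stride
    return plans
```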
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
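A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps. The sketch below matches the reported `warmdown_iters=3000`; whether there is also a warmup phase is not stated, so none is assumed:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Multiplier on the base lr: flat, then linear decay to zero."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0                                 # flat phase: full lr
    return (total_iters - step) / warmdown_iters   # linear ramp to 0
```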
Quantization
int8
bits: 8
scope: all
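Quantizing every weight tensor to int8 (one byte per weight plus a scale) is what shrinks the artifact toward the reported 12.83 MB. A sketch of symmetric per-tensor quantization; the submission's exact scheme (per-tensor vs. per-channel, rounding mode) is not specified:

```python
def quantize_int8(weights):
    """Symmetric int8: store integer codes plus one float scale,
    reconstructing w ~ code * scale."""
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]
```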
Other
other
Progressive depth training schedule that increases recurrence depth during training from 2 repeats to 3 repeats to 4 repeats.
parameters: {"phases":[{"repeats":2,"eff_depth":6},{"repeats":3,"eff_depth":9},{"repeats":4,"eff_depth":12}]}
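The phase sequence above (2 → 3 → 4 repeats, i.e. 6 → 9 → 12 effective layers) can be driven by a simple step-to-depth lookup. The step boundaries between phases are not reported, so `phase_ends` is a free parameter in this sketch:

```python
def repeats_at(step, phase_ends):
    """Progressive depth schedule: 2 -> 3 -> 4 repeats.
    `phase_ends` lists the steps at which phases 1 and 2 end
    (boundaries are an assumption, not reported values)."""
    phases = [2, 3, 4]  # repeats per phase, per the reported parameters
    for end, r in zip(phase_ends, phases):
        if step < end:
            return r
    return phases[-1]   # final phase: full depth
```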
other
DDP race condition fix for phase switching using all_reduce synchronization across ranks.
parameters: null
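The described fix amounts to making all ranks agree on the phase before switching, so no replica changes depth on a different iteration than its peers. A sketch using `all_reduce`; the MIN reduction is an assumption (any deterministic reduction that makes ranks agree would serve), and the helper falls back to the local value when not running distributed:

```python
import torch
import torch.distributed as dist

def agreed_phase(local_phase):
    """Synchronize the depth phase across DDP ranks via all_reduce,
    so every replica switches recurrence depth on the same iteration."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_phase  # single-process fallback
    t = torch.tensor([local_phase])
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # ranks agree on the min phase
    return int(t.item())
```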
Novel Contributions
- Progressive depth training schedule that increases recurrence depth during training
- DDP phase-switch synchronization fix using all_reduce
- Stateful depth recurrence with Cross-Repeat Skip
- Use of XSA in the last 4 layers
- Value embeddings mixed into the residual stream
- Loop embedding as depth-wise positional encoding