val_bpb: 1.2066
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.73 MB
Training Techniques
Architecture
depth recurrence
8 physical layers are looped twice to form 16 logical layers.
parameters: {"physical_layers":8,"loops":2,"logical_layers":16}
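A minimal sketch of the loop structure these parameters imply: the same 8 physical layers are applied twice in sequence, giving 16 logical layers. The layer internals below are toy placeholders, not the actual model:

```python
def depth_recurrent_forward(x, physical_layers, loops=2):
    """Run the shared stack of physical layers `loops` times,
    yielding loops * len(physical_layers) logical layer applications."""
    for _ in range(loops):
        for layer in physical_layers:
            x = layer(x)
    return x

# Toy check: 8 shared "layers" looped twice -> 16 logical applications.
calls = []
layers = [lambda x, i=i: calls.append(i) or x + 1 for i in range(8)]
out = depth_recurrent_forward(0, layers, loops=2)
```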
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of standard ReLU^2 to improve gradient flow for negative pre-activations.
parameters: {"negative_slope":0.5}
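The activation as described, written out elementwise. Unlike ReLU², which is identically zero (and gradient-free) for negative inputs, LeakyReLU(0.5)² stays nonzero there:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)^2: for x >= 0 returns x^2; for x < 0 returns
    (0.5 * x)^2, keeping a nonzero gradient where ReLU^2 is flat."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```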
Partial RoPE
Applies rotary positional embeddings to only 16 of the 64 head dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
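A sketch of partial RoPE under these parameters: only the first 16 head dimensions are rotated, the rest pass through unchanged. The "rotate-half" pairing and the base of 10000 are assumptions (both are common RoPE conventions), not stated by the card:

```python
import math

def partial_rope(q, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of one head's
    query/key vector by position-dependent angles; the remaining
    dims are returned untouched."""
    out = list(q)
    half = rotary_dims // 2
    for i in range(half):
        theta = pos * base ** (-2 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = q[i], q[i + half]  # rotate-half pairing (assumed)
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out
```

Rotation preserves the norm of the rotated slice, so the unrotated 48 dimensions are the only part of the vector that attention sees position-free.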
untied embeddings
Input and output embeddings are untied (no weight tying) for greater expressivity.
parameters: null
depth embeddings
Learned embeddings are added at each logical depth step to provide depth awareness.
parameters: {"logical_layers":16}
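One plausible placement, sketched below as an assumption: a learned vector per logical depth is added to the hidden state before each of the 16 logical layer applications, so the shared weights can condition on which pass they are in:

```python
def forward_with_depth_emb(x, physical_layers, depth_embs, loops=2):
    """Add a learned per-depth embedding at each of the
    loops * len(physical_layers) logical steps of the recurrence."""
    d = 0
    for _ in range(loops):
        for layer in physical_layers:
            x = layer(x + depth_embs[d])  # depth_embs has one entry per logical layer
            d += 1
    return x
```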
XSA
Cross-layer state aggregation with zero-initialized skip connections across logical layers.
parameters: null
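A hedged sketch of what zero-initialized cross-layer aggregation can look like: each block also receives a learned weighted sum of all earlier layer outputs, with the weights initialized to zero so training starts from a plain residual stack. The exact aggregation rule is an assumption; the card only states the zero-init skip structure:

```python
def xsa_forward(x, blocks, skip_weights):
    """Cross-layer state aggregation: block i sees its input plus a
    weighted sum of all earlier states. skip_weights[i] has i+1
    entries and is zero-initialized, so the untrained model reduces
    to an ordinary sequential stack."""
    states = [x]
    for i, block in enumerate(blocks):
        agg = sum(w * s for w, s in zip(skip_weights[i], states))
        x = block(x + agg)
        states.append(x)
    return x
```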
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"muon_momentum_warmup_start":0.9,"muon_momentum_warmup_steps":1500}
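The momentum warmup implied by these parameters, assuming linear interpolation (the card gives only the endpoints and step count): momentum ramps from 0.9 to 0.95 over the first 1500 steps, then holds:

```python
def muon_momentum(step, start=0.9, end=0.95, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over
    `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```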
Compression
zlib
level: null
Regularization
logit softcap
parameters: {"value":15}
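Assuming the usual tanh form of logit softcapping (the card gives only the cap value of 15), each logit is smoothly squashed into (-15, 15) while staying approximately identity near zero:

```python
import math

def softcap(logit, cap=15.0):
    """Soft-bound a logit to (-cap, cap): cap * tanh(logit / cap).
    For |logit| << cap this is approximately the identity."""
    return cap * math.tanh(logit / cap)
```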
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
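A sketch of the schedule these parameters suggest, assuming the common trapezoidal shape: a 20-step linear warmup, a flat plateau, then a linear warmdown to zero over the final 3500 iterations. `total_steps` is a hypothetical argument, not a value from the card:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3500):
    """Trapezoidal LR multiplier: linear warmup, constant plateau,
    linear decay to 0 over the last `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters
    return 1.0
```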
Quantization
int8
bits: 8
scope: all
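How int8 plus zlib can combine to shrink the artifact, as a minimal sketch: symmetric per-tensor quantization (scale = max|w| / 127 is an assumed scheme; the card states only int8 scope "all" and zlib) followed by compression of the packed bytes:

```python
import zlib

def compress_weights(weights):
    """Symmetric int8 quantization, then zlib at max compression.
    Returns the dequantization scale and the compressed blob."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensors
    q = bytes((round(w / scale) & 0xFF) for w in weights)  # int8 as unsigned bytes
    return scale, zlib.compress(q, 9)

def decompress_weights(scale, blob):
    raw = zlib.decompress(blob)
    # reinterpret unsigned bytes as signed int8, then dequantize
    return [(b - 256 if b > 127 else b) * scale for b in raw]
```

Quantization bounds the per-weight error by half a scale step; zlib then exploits redundancy in the int8 byte stream to get under the size budget.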
Novel Contributions
- Depth recurrence with 8 physical layers looped into 16 logical layers
- LeakyReLU(0.5)^2 activation to improve gradient flow
- Partial RoPE on 1/4 of head dimensions
- Untied input and output embeddings
- Depth embeddings for logical layer awareness
- Cross-layer state aggregation with zero-initialized skips
- Muon optimization with hardware-aligned MLP multiplier
- Int8 quantization plus zlib compression to fit the 16 MB track