val_bpb: 1.1444
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,999,891 bytes
Training Techniques
Architecture
depth recurrence
Reuses MLP blocks across passes with learned scalar gating per reused pass.
parameters: {"reused_blocks":[4,5],"source_block":3}
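A minimal NumPy sketch of the reuse pattern, assuming a standard pre-norm residual stack; the MLP shape, gate values, and function names here are illustrative, not from the entry:

```python
import numpy as np

def mlp(x, w_in, w_out):
    # Toy two-layer MLP (ReLU for brevity; the real block is unspecified).
    return np.maximum(x @ w_in, 0.0) @ w_out

def depth_recurrent_stack(x, w_in, w_out, gates):
    """Run the source MLP once, then reuse the same weights for extra
    passes (blocks 4 and 5 reuse block 3 in the entry), each reused
    pass scaled by its own learned scalar gate."""
    x = x + mlp(x, w_in, w_out)          # source block
    for g in gates:                      # one gate per reused pass
        x = x + g * mlp(x, w_in, w_out)
    return x

rng = np.random.default_rng(0)
d, h = 8, 16
x = rng.standard_normal((2, d))
w_in, w_out = rng.standard_normal((d, h)), rng.standard_normal((h, d))
y = depth_recurrent_stack(x, w_in, w_out, gates=[0.5, 0.25])
```

With both gates at zero the stack collapses back to a single MLP pass, which is what makes the gating a cheap way to learn how much extra depth to use.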
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16,"total_dimensions":64}
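A sketch of rotating only the first 16 of 64 head dimensions, per the entry's parameters; the rotation layout (split-half pairing) and base frequency are common conventions, assumed here rather than specified by the entry:

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension; the remaining dims pass through unchanged.
    q: (seq_len, head_dim) array for one head."""
    seq, head_dim = q.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:, :half], q[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[:, rot_dims:]], axis=-1)

q = np.random.default_rng(1).standard_normal((5, 64))
out = partial_rope(q, rot_dims=16)
```

The untouched 48 dims carry position-independent content, which is the usual motivation for partial rather than full RoPE.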
BigramHash
Adds a bigram hash embedding feature.
parameters: {"buckets":3072,"dim":112}
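A sketch of a hashed bigram feature with the entry's 3072 buckets and 112-dim embeddings; the specific hash mixing constants and the padding convention at position 0 are assumptions:

```python
import numpy as np

BUCKETS, DIM = 3072, 112  # from the entry's parameters

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Illustrative mixing hash; the actual hash function is not specified.
    return (prev_tok * 1000003 + tok * 8191) % buckets

def bigram_hash_features(tokens, table):
    """Map each (previous, current) token pair to a hashed embedding
    that can be added to the regular token embedding. Position 0 pairs
    the first token with an assumed padding id of 0."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]  # (len(tokens), DIM)

table = np.random.default_rng(2).standard_normal((BUCKETS, DIM))
feats = bigram_hash_features([5, 17, 17, 99], table)
```

Hashing keeps the table at buckets × dim parameters regardless of vocabulary size, at the cost of bucket collisions.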
XSA
Uses XSA in the deepest layers.
parameters: {"layers":4}
weight tying
Ties input and output embeddings.
parameters: null
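Weight tying in one picture: the same matrix serves as the input embedding table and, transposed, as the output projection, halving the embedding parameter count. A minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, d_model = 1000, 64
embed = rng.standard_normal((vocab, d_model)) * 0.02   # the one shared table

def embed_tokens(tokens):
    return embed[tokens]            # input side: lookup rows

def lm_logits(hidden):
    return hidden @ embed.T         # output side: reuse the same rows

h = rng.standard_normal((4, d_model))
logits = lm_logits(h)
```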
KV head count
Uses fewer KV heads than attention heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
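A sketch of grouped-query attention with the entry's 8 query heads sharing 4 KV heads (so each KV head serves 2 query heads); sequence length and head dim are illustrative:

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: each group of query heads attends over
    one shared KV head, shrinking the KV cache by heads/kv_heads.
    Shapes: q (num_heads, seq, hd); k, v (num_kv_heads, seq, hd)."""
    group = num_heads // num_kv_heads
    hd = q.shape[-1]
    out = []
    for h in range(num_heads):
        kh = h // group                         # shared KV head for this query head
        scores = q[h] @ k[kh].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)           # softmax over keys
        out.append(w @ v[kh])
    return np.stack(out)                        # (num_heads, seq, hd)

rng = np.random.default_rng(4)
q = rng.standard_normal((8, 6, 16))
k = rng.standard_normal((4, 6, 16))
v = rng.standard_normal((4, 6, 16))
o = gqa_attention(q, k, v)
```

Halving the KV heads here halves the K/V projection parameters, which matters under a 16MB artifact cap.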
parallel residuals
Uses parallel residual connections in later layers.
parameters: {"start_layer":7}
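A sketch of the parallel-residual layout used from layer 7 onward: attention and MLP both read the same normalized input and their outputs are summed, instead of running in series. The single shared LayerNorm is a common convention for parallel blocks, assumed here:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attn_fn, mlp_fn):
    """Parallel residual: both branches see the same normed input;
    a sequential block would instead feed attn's output into the MLP."""
    h = layernorm(x)
    return x + attn_fn(h) + mlp_fn(h)

rng = np.random.default_rng(5)
x = rng.standard_normal((3, 8))
wa = rng.standard_normal((8, 8))
wm = rng.standard_normal((8, 8))
y = parallel_block(x, lambda h: h @ wa, lambda h: np.maximum(h @ wm, 0))
```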
Quantization
late QAT
bits: 6
scope: MLP and attention 2D weights
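The forward half of 6-bit QAT, as a sketch: weights are fake-quantized to the int6 grid during late training so the network adapts to the quantization error (the backward pass would use a straight-through estimator). Per-tensor symmetric scaling is assumed; the entry does not state the granularity:

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric fake quantization to 6 bits (integer levels -31..31).
    Returns the dequantized weights used in the forward pass, plus the
    integer codes and scale that would be serialized."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, q.astype(np.int8), scale

rng = np.random.default_rng(6)
w = rng.standard_normal((16, 16))
w_fq, q, scale = fake_quant_int6(w)
```

Int6 codes plus a scale per tensor are what make the artifact small enough to then compress under the 16MB cap.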
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"polar_express_coefficients":true,"aol_preconditioning":true,"newton_schulz_iters":5}
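The core of Muon is Newton-Schulz orthogonalization of the 2D gradient, run here for the entry's 5 iterations. This sketch uses the standard fixed quintic coefficients from the Muon write-up; the entry's Polar Express coefficients and AOL preconditioning would replace them with per-iteration tuned values, which are not reproduced here:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximate the orthogonal polar factor of a gradient matrix.
    Coefficients are the standard Muon quintic; Polar Express / AOL
    variants tune the polynomial per iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # scale so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                            # keep the Gram matrix small
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

g = np.random.default_rng(7).standard_normal((8, 12))
o = newton_schulz_orthogonalize(g, steps=5)
```

After 5 iterations the singular values of the update cluster near 1, which is the point: every direction in the gradient gets a similar step size.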
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars and embeddings"}
Weight Averaging
EMA + SWA
parameters: {"swa_start_scale":0.2,"swa_interval_steps":50}
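A pure-Python sketch of the schedule implied by the parameters: an EMA tracked every step, plus SWA snapshots every 50 steps once 20% of training has elapsed. The EMA decay and the final blending rule are not specified by the entry; both are placeholders here:

```python
class AveragedWeights:
    """Track an EMA of a scalar 'weight' every step and an SWA average
    of snapshots taken every `swa_interval` steps after `swa_start`."""
    def __init__(self, w0, total_steps, decay=0.999,
                 swa_start_scale=0.2, swa_interval=50):
        self.ema = w0
        self.decay = decay
        self.swa_sum, self.swa_n = 0.0, 0
        self.swa_start = int(total_steps * swa_start_scale)
        self.swa_interval = swa_interval

    def update(self, step, w):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step >= self.swa_start and step % self.swa_interval == 0:
            self.swa_sum += w
            self.swa_n += 1

    def final(self):
        swa = self.swa_sum / max(self.swa_n, 1)
        return 0.5 * (self.ema + swa)   # placeholder blend, not from the entry

avg = AveragedWeights(w0=0.0, total_steps=1000)
for step in range(1000):
    avg.update(step, w=1.0)             # pretend the weight settles at 1.0
```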
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
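The entry gives no parameters, but the standard sliding-window scheme can be sketched as index arithmetic: each window rescans up to `window` tokens of context while only the tokens past the previous window's end contribute to the loss, so every token is scored exactly once with ample left context. The window and stride values below are illustrative:

```python
def sliding_windows(n_tokens, window=2048, stride=512):
    """Yield (start, end, score_from) spans: the model runs on tokens
    [start, end) but only tokens [score_from, end) are counted toward
    val_bpb. After the first window, each span scores `stride` new
    tokens with `window - stride` tokens of overlapping context."""
    spans = []
    scored = 0                          # first token not yet scored
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored))
        scored = end
    return spans

spans = sliding_windows(5000, window=2048, stride=512)
```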
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: null
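The entry does not describe this technique or give parameters. One plausible reading, shown purely as an assumption, is a depth-dependent initialization of each layer's LayerNorm gain, shrinking deeper layers' residual contributions in the style of 1/sqrt(2L) residual scaling:

```python
import math

def layerwise_ln_scales(n_layers):
    """ASSUMPTION: per-layer LayerNorm gain init that decays with depth.
    The entry names the technique but specifies nothing further."""
    return [1.0 / math.sqrt(2 * (i + 1)) for i in range(n_layers)]

scales = layerwise_ln_scales(12)
```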
Novel Contributions
- Depth recurrence with learned scalar gating for reused MLP passes
- Polar Express Newton-Schulz iteration with AOL preconditioning inside Muon
- SWA blended with EMA
- Partial RoPE
- XSA on deep layers
- Parallel residuals in late blocks
- BigramHash feature
- Late int6 QAT
- Int6 + zstd-22 serialization fitting under the 16MB cap