val_bpb: 1.1442
Architecture: Transformer
Optimizer: Muon/AdamW
Artifact Size: under 16 MB
Training Techniques
Architecture
- XSA: applied to the last 4 layers
- MLP3x: 3x MLP width
- SmearGate: used in the model stack
- BigramHash: auxiliary component with vocabulary size 2048
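BigramHash is only named here along with its vocabulary size; a hypothetical sketch, assuming it hashes adjacent token pairs into 2048 buckets that index an auxiliary embedding table (the hash constants and BOS handling below are illustrative, not the submission's exact scheme):

```python
VOCAB_SIZE = 2048  # from the reported parameters

def bigram_bucket(prev_token: int, token: int, num_buckets: int = VOCAB_SIZE) -> int:
    """Map an adjacent token pair to one of num_buckets auxiliary embedding rows."""
    # Mix the pair with two large odd multipliers (hash choice is illustrative).
    h = (prev_token * 1000003) ^ (token * 2654435761)
    return h % num_buckets

def bigram_buckets(tokens: list[int]) -> list[int]:
    """Bucket index for every position; position 0 pairs with an assumed BOS id of 0."""
    return [bigram_bucket(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
```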
- KV head count: 8 attention heads, 4 KV heads
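With 8 query heads over 4 KV heads, each KV head is shared by two query heads; the manual GQA KV-head repeat mentioned under Novel Contributions amounts to duplicating each KV head before running standard attention. A minimal sketch, with per-head tensors abstracted as list entries:

```python
def repeat_kv(kv_heads: list, n_rep: int) -> list:
    """Repeat each KV head n_rep times so KV heads line up 1:1 with query heads.

    Equivalent to a repeat_interleave along the head dimension, the usual
    fallback when a fused GQA attention kernel is unavailable.
    """
    return [h for h in kv_heads for _ in range(n_rep)]

# heads=8, kv_heads=4 -> each KV head is repeated 8 // 4 = 2 times.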
Quantization
- int6 mixed (bits: 6, scope: all)
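The "int6 mixed" scheme is not fully specified here; a minimal sketch of one plausible building block, symmetric per-tensor 6-bit quantization, under the assumption of a signed range of [-32, 31]:

```python
def quantize_int6(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int6 quantization (sketch; the real scheme is 'mixed')."""
    qmax = 31  # signed 6-bit range is [-32, 31]
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int6 codes and the stored scale."""
    return [x * scale for x in q]
```

Round-trip error per weight is bounded by half the scale, which is what makes a later fine-tuning pass on the dequantized checkpoint (see Test-Time Training) worthwhile.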
Weight Averaging
- EMA
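EMA weight averaging keeps a decayed running average of the weights alongside the trained ones; the decay value below is illustrative, since the submission does not report one:

```python
def ema_update(avg: list[float], params: list[float], decay: float = 0.999) -> list[float]:
    """One EMA step: avg <- decay * avg + (1 - decay) * params.

    Unlike SWA's uniform average over checkpoints, EMA weights recent
    parameters more heavily. The 0.999 decay is an assumed default.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```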
Optimizer
- Muon (weight_decay: 0.04), used together with AdamW
- AdamW (weight_decay: 0.04)
Initialization
- OrthoInit: orthogonal initialization with muP-style output scaling
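Orthogonal initialization can be sketched with Gram-Schmidt on random Gaussian rows; the muP-style output scaling is reduced here to a free gain parameter, since the submission's exact scaling rule is not reported (muP typically shrinks output-layer scale as width grows):

```python
import math
import random

def orthogonal_init(n: int, m: int, gain: float = 1.0, seed: int = 0) -> list[list[float]]:
    """Return an n x m matrix (n <= m) with orthonormal rows, scaled by gain."""
    rng = random.Random(seed)
    rows: list[list[float]] = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for u in rows:  # subtract projections onto previously accepted rows
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # re-draw on (unlikely) near-degeneracy
            rows.append([a / norm for a in v])
    return [[gain * a for a in row] for row in rows]
```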
Evaluation
- Stride-based sliding-window evaluation (stride: 64)
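Stride-based sliding-window evaluation scores each token once while giving it up to a full window of left context; a sketch of the window planning, where only the stride of 64 is taken from the report and the window length is a free parameter:

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int) -> list[tuple[int, int, int]]:
    """Plan a strided sliding-window eval.

    Each (start, end, score_from) entry means: run the model on
    tokens[start:end], but count loss only on positions [score_from, end).
    Every token is scored exactly once; later windows score only their
    final `stride` new tokens with ~window of preceding context.
    """
    spans: list[tuple[int, int, int]] = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```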
Test-Time Training
- Full TTT (learning_rate: 0.002, epochs: 3, momentum: 0.9, freeze_blocks: 2)
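Full-model test-time training with the reported hyperparameters (lr 0.002, 3 epochs, momentum 0.9, first 2 blocks frozen) can be sketched as plain SGD with momentum; the `grads_fn` callback here is a toy stand-in for backprop on the dequantized checkpoint:

```python
def ttt_sgd(params: list[float], grads_fn, lr: float = 0.002, epochs: int = 3,
            momentum: float = 0.9, freeze: int = 2) -> list[float]:
    """SGD-with-momentum test-time training over per-block parameters.

    The first `freeze` blocks are left untouched; the rest are updated for
    `epochs` passes. Hyperparameter defaults match the reported values.
    """
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        grads = grads_fn(params)
        for i in range(freeze, len(params)):  # frozen blocks are skipped
            velocity[i] = momentum * velocity[i] + grads[i]
            params[i] -= lr * velocity[i]
    return params
```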
Compression
- zstd (level: 22)
Regularization
- Weight decay (Muon: 0.04, AdamW: 0.04)
Sequence Length
- train_length: 2048 (eval length unspecified)
LR Schedule
- Fixed learning rates (matrix_lr: 0.025, scalar_lr: 0.025, tied_embed_lr: 0.035)
Novel Contributions
- Adds full-model SGD test-time training on the dequantized checkpoint
- Uses EMA instead of SWA in the winning public training stack
- Applies XSA to the last 4 layers
- Uses stride-64 evaluation
- Tunes learning rates upward for matrix, scalar, and tied embedding parameters
- Includes compatibility fallbacks: FlashAttention-3 falls back to SDPA, and the GQA KV-head repeat is done manually when a fused kernel is unavailable