val_bpb
1.3900
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
—
Training Techniques
Test-Time Training
full TTT
parameters: {"document_isolated":true,"reset_at_bos":true}
Evaluation
temperature scaling
parameters: {"temperature_range":[0.9,1]}
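A minimal sketch of evaluation-time temperature scaling, assuming the temperature is chosen by a simple grid search over the listed range [0.9, 1]. The function names, the grid size, and the selection criterion (mean negative log-likelihood) are illustrative assumptions, not details from the run.

```python
import math

def nll_at_temperature(logits, target, temperature):
    """NLL (in nats) of `target` after dividing the logits by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[target]

def best_temperature(examples, low=0.9, high=1.0, steps=11):
    """Grid-search [low, high] for the temperature with the lowest mean NLL.

    `examples` is a list of (logits, target_index) pairs. The grid search
    itself is an assumption; the source only states the range.
    """
    grid = [low + (high - low) * i / (steps - 1) for i in range(steps)]

    def mean_nll(t):
        return sum(nll_at_temperature(l, y, t) for l, y in examples) / len(examples)

    return min(grid, key=mean_nll)
```

A temperature below 1 sharpens the predicted distribution, which lowers the loss when the model is correct but underconfident.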
Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the MLP
parameters: {"squared":true,"negative_slope":0.5}
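A scalar sketch of the activation with the listed negative_slope of 0.5, assuming "squared" means the LeakyReLU output is simply multiplied by itself. Whether the run preserves the sign of the negative branch is not stated, so plain squaring here is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring (assumed form; sign is not preserved)."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Under this reading, an input of -2.0 maps to (-1.0)^2 = 1.0, so negative inputs still contribute a positive, attenuated activation.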
XSA
Uses XSA attention variant
parameters: {"version":4}
Partial RoPE
Applies partial rotary positional embeddings
parameters: {"train":"16/64"}
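A sketch of partial rotary embeddings, reading "16/64" as 16 of 64 head dimensions being rotated while the remaining 48 pass through unchanged. That reading, the half-split pairing of dimensions, and the base of 10000 are assumptions for illustration.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved and position 0 leaves the vector unchanged.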
VE128
Uses VE128 component
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA + SWA
parameters: null
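The two averaging schemes named above can be sketched on flat weight lists as follows. The decay value is a placeholder, and how the run combines EMA with SWA is not stated, so the two functions are shown independently.

```python
def ema_update(ema, weights, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def swa_average(checkpoints):
    """Uniform (stochastic weight averaging) mean over saved checkpoints."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```

EMA weights recent steps more heavily, while SWA gives every collected checkpoint equal weight.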
Quantization
GPTQ-lite
bits: 6
scope: all
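To illustrate what a 6-bit budget means, here is plain symmetric round-to-nearest quantization. This is not GPTQ or the run's "GPTQ-lite" procedure, whose details are not given; it only shows the 64-level integer grid that any 6-bit scheme has to map weights onto.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (stand-in, not GPTQ-lite)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -32..31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is at most half a quantization step.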
Compression
lzma
level: null
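A minimal sketch of the compression step using Python's standard lzma module. Since the level is null, the library's default preset is used here; the helper names are hypothetical.

```python
import lzma

def compress_artifact(payload: bytes) -> bytes:
    """Compress serialized weights with LZMA at the library default preset."""
    return lzma.compress(payload)

def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original bytes from a compressed artifact."""
    return lzma.decompress(blob)
```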
Regularization
LN scale
parameters: null
Novel Contributions
- Document-isolated TTT by resetting optimizer state at BOS document boundaries
- Temperature scaling during evaluation on the quantized model
- Evaluates document isolation as a test-time adaptation improvement on the frontier architecture
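The document-isolation idea in the first contribution can be sketched with a toy one-parameter model: the optimizer's momentum buffer is cleared whenever a BOS token is seen, so adaptation statistics from one document do not carry into the next. The BOS id, the optimizer class, and the gradient function are all hypothetical; the source states only that optimizer state is reset at BOS boundaries, and does not say whether model weights are also restored.

```python
BOS = 0  # hypothetical BOS token id

class MomentumSGD:
    """Toy scalar optimizer standing in for the real one."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr, self.momentum = lr, momentum
        self.velocity = 0.0

    def reset_state(self):
        # Drop accumulated momentum so each document starts clean.
        self.velocity = 0.0

    def step(self, param, grad):
        self.velocity = self.momentum * self.velocity + grad
        return param - self.lr * self.velocity

def run_ttt(tokens, grad_fn, param=0.0, optimizer=None):
    """Adapt `param` over a token stream, resetting optimizer state at BOS."""
    optimizer = optimizer or MomentumSGD()
    resets = 0
    for tok in tokens:
        if tok == BOS:
            optimizer.reset_state()
            resets += 1
            continue
        param = optimizer.step(param, grad_fn(param, tok))
    return param, resets
```

For example, `run_ttt([0, 5, 5, 0, 7], lambda p, t: p - t)` adapts on two documents and performs two resets.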