val_bpb: 1.1124
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB
Training Techniques
Architecture
- XSA: extended self-attention applied in the last 4 layers. parameters: {"layers":4}
- MLP3x: 3x MLP with ReLU-squared activation.
- BigramHash: bigram hashing with a fixed bucket vocabulary. parameters: {"buckets":6144}
- SmearGate: learned token-blending mechanism.
- KV head count: 8 attention heads sharing 4 KV heads via GQA. parameters: {"heads":8,"kv_heads":4}
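As a rough sketch of how GQA shares KV heads across query heads (a hypothetical helper, not the submission's code), consecutive query heads are grouped onto a single KV head:

```python
# Minimal sketch of GQA head sharing: 8 query heads grouped onto 4 KV heads.
# Hypothetical helper for illustration only.
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the index of the KV head shared by a given query head."""
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return query_head // group_size

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = [kv_head_for(q) for q in range(8)]
```

With heads=8 and kv_heads=4, each K/V projection is computed once and reused by two query heads, halving KV-cache size relative to full multi-head attention.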
Weight Averaging
- EMA (exponential moving average of weights). parameters: {"decay":0.997}
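The EMA update can be sketched as follows, using the card's decay of 0.997; plain lists stand in for parameter tensors (a minimal illustration, not the submission's code):

```python
# Exponential moving average of model weights: avg <- d*avg + (1-d)*current.
def ema_update(avg, current, decay=0.997):
    """Elementwise EMA step over flat weight lists."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]

avg = [0.0, 0.0]
for step in range(3):
    weights = [1.0, 2.0]           # pretend these came from an optimizer step
    avg = ema_update(avg, weights)
```

With decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps, so the evaluated weights are a smoothed trail of recent checkpoints.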
Quantization
- Int6 QAT with a straight-through estimator (STE). bits: 6, scope: all weights
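A hedged sketch of symmetric int6 fake-quantization as used in QAT: during training an STE would pass gradients straight through the rounding step; here only the forward quantize/dequantize round trip is shown (assumed symmetric per-tensor scaling, not necessarily the submission's exact scheme):

```python
# Symmetric int6 fake-quantization: map floats to 6-bit integers and back.
def fake_quant_int6(values, bits=6):
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]            # weights seen in the forward pass

out = fake_quant_int6([1.0, -0.5, 0.03])
```

The rounding error per weight is bounded by half a quantization step (scale/2), which is what lets the network adapt to the int6 grid during late training.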
Compression
- zstd at level 22
Evaluation
- Sliding-window evaluation. parameters: {"stride":64}
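One common sliding-window scheme (assumed here, since the card only gives the stride) slides a fixed context window forward by 64 tokens at a time and scores only the new tokens of each window, so every token is scored exactly once with near-full context:

```python
# Sketch of sliding-window evaluation with a fixed context and stride 64.
def window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, score_from) for each evaluation window."""
    spans = []
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        start = max(0, end - window)
        score_from = end - stride if spans else start  # first window scores everything
        spans.append((start, end, score_from))
    return spans

spans = window_spans(1152)
```

The window size of 1024 matches the card's eval_length; each later window pays for a full forward pass but contributes only 64 scored tokens, trading compute for longer context per token.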
Test-Time Training
- Score-first full TTT (all blocks unfrozen). parameters: {"learning_rate":1,"epochs":30,"freeze_blocks":0,"momentum":0.9}
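The TTT recipe (SGD with momentum 0.9, learning rate 1.0, 30 epochs, no frozen parameters) can be illustrated on a toy quadratic loss; this is a stand-in objective, not the submission's model:

```python
# SGD with momentum, matching the card's TTT hyperparameters.
def ttt_sgd(w, grad_fn, lr=1.0, momentum=0.9, epochs=30):
    velocity = [0.0] * len(w)        # freeze_blocks=0: every parameter updates
    for _ in range(epochs):
        grads = grad_fn(w)
        velocity = [momentum * v + g for v, g in zip(velocity, grads)]
        w = [wi - lr * vi for wi, vi in zip(w, velocity)]
    return w

# Toy loss 0.5 * 0.05 * sum(w_i^2), so the gradient is 0.05 * w.
final = ttt_sgd([10.0, -4.0], lambda w: [0.05 * wi for wi in w])
```

The sketch shows why LR=1.0 can be stable: what matters is the product of learning rate and curvature, and with small effective curvature (as in a well-conditioned fine-tuning landscape) even a nominally huge LR keeps the momentum iteration convergent.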
Sequence Length
- train_length: 2048, eval_length: 1024
LR Schedule
- Warmdown. parameters: {"warmdown_iters":1600}
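A warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps; the exact shape is an assumption here, since the card only gives the iteration count:

```python
# Sketch of a warmdown LR multiplier: flat, then linear decay to zero
# over the last `warmdown_iters` steps of training.
def lr_scale(step, total_iters, warmdown_iters=1600):
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

For a hypothetical 5000-iteration run, the multiplier stays at 1.0 until step 3400, reaches 0.5 at step 4200, and hits 0.0 at step 5000.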
Regularization
- Weight decay. parameters: {"adamw_weight_decay":0.04}
Other
- Late QAT, enabled when lr_scale < 0.1. parameters: {"enabled":true,"threshold":0.1}
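The late-QAT gate can be sketched as a simple predicate on the LR schedule's current multiplier: quantization-aware training switches on only once the schedule has decayed below the 0.1 threshold (a minimal sketch of the stated rule, not the submission's code):

```python
# Gate for late QAT: active only when the LR multiplier has decayed
# below the configured threshold (0.1 per the card).
def qat_active(lr_scale, enabled=True, threshold=0.1):
    return bool(enabled and lr_scale < threshold)
```

Deferring QAT to the low-LR tail of training lets the full-precision weights settle first, so the quantization grid is learned against nearly final weights.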
Novel Contributions
- Aggressive TTT with SGD at LR = 1.0 instead of the conventional 0.002
- Unfreezing all blocks during TTT, which stabilizes and improves high-learning-rate adaptation
- An extensive TTT hyperparameter sweep showing strong gains from a higher LR and more epochs
- A 3-seed validation run demonstrating a new record-level score
- Combining int6 quantization with zstd compression to fit within the artifact size budget