val_bpb: 1.1978
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Optimizer: Muon (weight_decay: 0.04, momentum: 0.99, warmdown schedule: enabled)
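The Muon entry above can be sketched as a momentum step whose update direction is orthogonalized by a Newton-Schulz iteration. This is a minimal sketch using the classic cubic iteration (the published Muon uses a tuned quintic); `lr` is an assumption, while `momentum` and `weight_decay` come from the table:

```python
import numpy as np

def orthogonalize(g, steps=40):
    # Newton-Schulz iteration driving g toward the nearest (semi-)orthogonal
    # matrix; Frobenius normalization bounds the spectral norm by 1.
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def muon_step(param, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # Momentum accumulation, orthogonalized update direction, and decoupled
    # weight decay; lr is a placeholder value, not from the table.
    buf = momentum * buf + grad
    param = param * (1 - lr * weight_decay) - lr * orthogonalize(buf)
    return param, buf
```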
Quantization: mixed int6 (bits: 6, scope: artifact)
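"Mixed int6" presumably means some tensors stay at higher precision; the sketch below shows only the int6 path, assuming symmetric per-tensor scaling (the actual scheme is not specified in the table):

```python
import numpy as np

def quantize_int6(w):
    # Symmetric per-tensor quantization to 6-bit levels in [-31, 31],
    # stored in int8 containers here for simplicity.
    scale = max(float(np.abs(w).max()), 1e-12) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale
```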
Evaluation: sliding window eval (stride: 64)
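A sliding-window evaluation with stride 64 can be sketched as follows; each window scores only its last `stride` tokens so every scored token gets near-full left context. The stride is from the table; the window size of 2048 is an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Returns (context_start, end, n_scored) spans; each window scores only
    # its final `stride` tokens, so every scored token sees up to
    # `window - stride` tokens of left context.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, end - pos))
        pos = end
    return spans
```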
Weight Averaging: EMA (decay: [0.995, 0.997])
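The EMA weight average is the standard exponential moving average over parameters; the table lists decay values 0.995 and 0.997:

```python
def ema_update(avg, params, decay=0.995):
    # One exponential-moving-average step over named parameter tensors.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```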
Architecture
MLP width: expanded MLP width from 3x to 3.5x (from: 3, to: 3.5)
LeakyReLU: LeakyReLU squared activation (power: 2, slope: 0.5)
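The table does not specify whether the negative branch keeps its sign after squaring; this sketch assumes a sign-preserving form (plain squaring would make the activation non-monotonic on the negative side):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5, power=2):
    # Apply LeakyReLU, then raise its magnitude to `power` while keeping
    # the sign, so the negative branch stays negative. Sign preservation
    # is an assumption; slope and power are from the table.
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * np.abs(y) ** power
```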
BigramHash: character bigram hash embeddings (dimensions: 4096)
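A minimal sketch of hashed bigram lookup, reading `dimensions: 4096` as the number of hash buckets (it could instead mean the embedding width). CRC32 stands in for whatever hash the run actually used:

```python
import zlib

def bigram_hash_ids(text, n_buckets=4096):
    # Map each overlapping character bigram to one of n_buckets embedding
    # rows via a deterministic hash.
    return [zlib.crc32(text[i:i + 2].encode("utf-8")) % n_buckets
            for i in range(len(text) - 1)]
```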
MLP4x: removed the bigram embeddings and used a larger MLP (no parameters)
MHA: added full multi-head attention (kv_heads: 8)
Compression: zstd (level: 22)
Sequence Length: train_length: 8192, eval_length: —
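The contributions list below mentions sequence packing to 8192 tokens; a common way to produce such fixed-length sequences is to concatenate tokenized documents with a separator and slice the stream. `eos=0` is an assumed separator id, and dropping the remainder is an assumption:

```python
def pack_sequences(docs, seq_len=8192, eos=0):
    # Concatenate tokenized documents with an EOS separator, then slice
    # the stream into fixed-length training sequences; the trailing
    # remainder is dropped.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```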
LR Schedule: warmdown (warmdown_steps: 4000)
Test-Time Training: full TTT (chunk: 65536)
Novel Contributions
- Muon optimizer tuning with weight decay, momentum, and warmdown schedule
- Mixed-precision int6 quantization to fit the artifact under 16MB
- Sliding window evaluation with stride 64
- EMA weight averaging
- BigramHash character embeddings
- Sequence packing to 8192 tokens
- MLP width expansion and LeakyReLU squared activation
- Full multi-head attention with 8 KV heads
- Test-time training with large chunk size