val_bpb: 0.5588
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,302,060 bytes
Training Techniques
Architecture
LeakyReLU
Replaced ReLU(x)^2 with LeakyReLU(x, 0.5)^2 in all MLP blocks to avoid dead neurons while keeping outputs non-negative through squaring.
parameters: {"negative_slope":0.5}
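A minimal sketch of the activation swap described above (pure Python; the model presumably applies this elementwise over tensors):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU: non-negative like ReLU(x)**2, but negative
    inputs still produce output (and gradient), avoiding dead neurons."""
    y = x if x > 0 else negative_slope * x
    return y * y
```

For x = -2.0 this gives (0.5 * -2.0)^2 = 1.0, whereas ReLU(-2.0)^2 would be exactly 0 with zero gradient.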
BigramHash
Expanded BigramHashEmbedding capacity to reduce hash collisions.
parameters: {"buckets":3072}
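A sketch of how a bigram hash embedding might bucket token pairs; the hash constants here are illustrative, not the submission's actual hash function:

```python
BUCKETS = 3072  # expanded from 2048 to reduce collisions

def bigram_bucket(prev_id: int, cur_id: int, buckets: int = BUCKETS) -> int:
    """Hash a (previous token, current token) pair into an embedding bucket."""
    # Multiplicative mixing of the two ids (illustrative constants).
    h = (prev_id * 1000003 + cur_id) * 2654435761
    return (h & 0xFFFFFFFF) % buckets
```

With more buckets, fewer distinct bigrams share an embedding row, at the cost of a larger embedding table in the artifact.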
Sequence Length
sequence_length
parameters: {"train_length":2048,"eval_length":2048}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
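A sketch of a linear warmdown schedule consistent with the entry above: the learning rate stays constant, then decays linearly to zero over the final 3,500 steps. The total step count is hypothetical (not stated in this summary):

```python
TOTAL_STEPS = 20000    # hypothetical run length, not given in the summary
WARMDOWN_STEPS = 3500  # extended from 3000

def lr_multiplier(step: int) -> float:
    """Constant LR, then linear decay to 0 over the last WARMDOWN_STEPS."""
    warmdown_start = TOTAL_STEPS - WARMDOWN_STEPS
    if step < warmdown_start:
        return 1.0
    return max(0.0, (TOTAL_STEPS - step) / WARMDOWN_STEPS)
```

Lengthening the warmdown trades a few high-LR steps for a longer annealing tail, which often lowers final validation loss slightly.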
Test-Time Training
LoRA TTT
parameters: {"epochs":8}
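A minimal sketch of the LoRA adapter math on the K projection, assuming standard LoRA (frozen base weight plus a trainable low-rank delta); dimensions and rank are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and LoRA rank (illustrative)
W_k = rng.standard_normal((d, d))   # frozen K-projection weight
A = rng.standard_normal((r, d)) * 0.01  # LoRA down-projection (trained at test time)
B = np.zeros((d, r))                # LoRA up-projection, zero-initialized

def k_proj(x: np.ndarray) -> np.ndarray:
    """K projection with the LoRA delta: (W_k + B @ A) @ x."""
    return W_k @ x + B @ (A @ x)

# Min-NLL epoch selection (described above, not implemented here): after
# each of the 8 TTT epochs, record the NLL and keep the (A, B) snapshot
# with the lowest value.
```

Because B starts at zero, the adapted projection initially equals the frozen one, so test-time training can only move away from the base model as the NLL improves.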
Weight Averaging
SWA
parameters: null
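SWA maintains a running mean of checkpoint weights; a minimal sketch over flat parameter lists (the real implementation would operate on model tensors):

```python
def swa_update(avg: list[float], new: list[float], n_averaged: int) -> list[float]:
    """Incremental mean of checkpoints: avg <- avg + (new - avg) / (n + 1)."""
    return [a + (p - a) / (n_averaged + 1) for a, p in zip(avg, new)]

avg = [0.0, 0.0]
for i, ckpt in enumerate([[1.0, 2.0], [3.0, 4.0]]):
    avg = swa_update(avg, ckpt, i)
# avg is now the elementwise mean of the two checkpoints: [2.0, 3.0]
```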
Quantization
int6
bits: 6
scope: all
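A sketch of symmetric per-tensor int6 quantization: signed 6-bit values cover [-32, 31], and using the symmetric range [-31, 31] keeps the scale a single scalar. This is one plausible scheme, not necessarily the submission's exact one:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric int6 quantization (assumes w is not all zeros)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Each weight incurs at most half a quantization step of error, and the int6 codes pack well under the zlib stage below.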
Compression
zlib
level: null
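The final artifact stage is standard zlib compression of the serialized weights; a round-trip sketch using Python's stdlib (the payload is a placeholder, and `level: null` above suggests the default compression level):

```python
import zlib

payload = b"serialized quantized weights"  # placeholder bytes
compressed = zlib.compress(payload)        # default level; summary gives none
restored = zlib.decompress(compressed)
```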
Novel Contributions
- LeakyReLU(0.5)^2 activation replacement in MLP blocks
- Increased training and TTT context length from 1024 to 2048
- Expanded BigramHashEmbedding capacity from 2048 to 3072 buckets
- Extended warmdown schedule from 3000 to 3500 iterations
- LoRA-based test-time training with K-projection and Min-NLL epoch selection