val_bpb: 1.1890
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11MB
Training Techniques

Quantization: INT6 QAT
- bits: 6
- scope: all
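A minimal sketch of what the forward pass of INT6 quantization-aware training can look like: weights are snapped onto a signed 6-bit grid in the forward pass, while the backward pass would treat the rounding as identity (straight-through estimator). The symmetric per-tensor scaling used here is an assumption, not the recorded implementation.

```python
def fake_quant_int6(weights, bits=6):
    """Fake-quantize a list of floats onto a symmetric signed 6-bit grid.

    QAT sketch: the forward pass sees these snapped values; the backward
    pass would pass gradients through the rounding unchanged
    (straight-through). Symmetric per-tensor scaling is an assumption.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 positive levels for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    # snap each weight to the nearest representable 6-bit level
    return [round(w / scale) * scale for w in weights]
```

The round-trip error per weight is bounded by half a quantization step, i.e. `scale / 2`.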
Architecture
- BigramHash: bigram hashing with 4096 buckets and a 128-dimensional embedding (buckets: 4096, embedding_dim: 128)
- SmearGate: SmearGate mechanism applied
- U-Net skip connections: U-Net-style skip connections in the transformer
- Tied embeddings: input and output embeddings are tied
- MLP3x: MLP with 3x expansion (expansion_factor: 3)
- LeakyReLU(0.5)^2 activation: LeakyReLU with negative slope 0.5, squared, preserving gradient flow for negative inputs (negative_slope: 0.5)
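Two of the components above are simple enough to sketch directly. The pair-hash function for bigram bucketing is an assumption (any cheap hash of the token pair works); the activation follows its name literally: LeakyReLU with slope 0.5, then squared, so negative inputs still receive gradient.

```python
def bigram_bucket(prev_tok, cur_tok, buckets=4096):
    # Hash the (previous, current) token pair into one of 4096 buckets;
    # the bucket indexes a learned 128-dim embedding added to the stream.
    # The multiplicative hash constant here is an assumption.
    return (prev_tok * 1_000_003 + cur_tok) % buckets


def leaky_relu_squared(x, negative_slope=0.5):
    # f(x) = x^2 for x >= 0, (0.5*x)^2 for x < 0.
    # d/dx = 0.5*x for x < 0, so negative inputs keep a nonzero gradient
    # (unlike plain ReLU^2, whose gradient is exactly 0 there).
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```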
Optimizer: Muon + Adam
- Muon scope: matrices
- Adam scope: scalars
- weight_decay, momentum: not specified
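The Muon/Adam split can be expressed as a parameter-routing rule. The concrete rule below, routing by tensor rank (2-D and higher to Muon, everything else to Adam), is an assumption consistent with "matrices" vs. "scalars":

```python
def split_param_groups(named_shapes):
    """Route parameters to Muon or Adam by tensor rank.

    Assumption: "matrices" means 2-D+ tensors (linear/attention weights),
    "scalars" means 0-D/1-D tensors (gains, biases, norm scales).
    """
    muon_names, adam_names = [], []
    for name, shape in named_shapes.items():
        (muon_names if len(shape) >= 2 else adam_names).append(name)
    return muon_names, adam_names
```

The two name lists would then seed the two optimizers' parameter groups.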
Weight Averaging: EMA (decay: 0.997)
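EMA weight averaging keeps a shadow copy of the weights that smoothly lags the training trajectory; evaluation uses the shadow copy. A minimal sketch of the update with decay 0.997:

```python
def ema_update(shadow, weights, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * weights, elementwise
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

With decay 0.997 the effective averaging window is roughly 1 / (1 - 0.997) ≈ 333 steps.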
LR Schedule: auto warmdown (warmdown_fraction: 0.15)
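A warmdown schedule holds the learning rate constant and then decays it over the final fraction of training (0.15 here). The linear-to-zero shape below is an assumption; "auto" may refer to how the fraction is chosen.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_fraction=0.15):
    # Constant LR until the final `warmdown_fraction` of steps, then a
    # linear decay to zero (the linear shape is an assumption).
    warmdown_start = round(total_steps * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return base_lr * remaining
```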
Test-Time Training: LoRA TTT
- rank: 8
- target: attention Q and V projections
- epochs per document: 10
- learning rate: cosine decay from 0.01 to 0.0001
- mode: backward-looking (score-first)
- per-document reset: true
- last chunk excluded from training: true
- documents shorter than 512 tokens: no TTT
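The recorded parameters pin down the TTT control flow: reset the LoRA adapter per document, skip short documents, and score each chunk before training on it, so the adapter only ever sees past context. A sketch under those constraints; `score`, `train_step`, and `reset_lora` are hypothetical callables standing in for the model.

```python
import math


def cosine_lr(epoch, epochs, lr_max=0.01, lr_min=0.0001):
    # Cosine decay from 0.01 to 0.0001 across the per-document epochs.
    t = epoch / max(epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))


def score_first_ttt(chunks, score, train_step, reset_lora,
                    min_tokens=512, epochs=10):
    """Backward-looking (score-first) per-document LoRA TTT.

    Only the control flow comes from the recorded parameters; the three
    callables are hypothetical stand-ins for the model's scoring, LoRA
    training step, and adapter reset.
    """
    reset_lora()  # per-document reset: fresh adapter, no cross-contamination
    if sum(len(c) for c in chunks) < min_tokens:
        return sum(score(c) for c in chunks)  # short document: no TTT
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score(chunk)          # score BEFORE training on the chunk
        if i < len(chunks) - 1:        # the last chunk is never trained on
            for e in range(epochs):
                train_step(chunk, cosine_lr(e, epochs))
    return total
```

Scoring first guarantees the score for a chunk never benefits from training on that same chunk, which is what makes the adaptation backward-looking.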
Novel Contributions
- Backward-looking (score-first) per-document LoRA test-time training
- LoRA with rank 8 to constrain the adaptation subspace and prevent overfitting on quantized models
- Independent per-document LoRA adapters, reset between documents to avoid cross-document contamination
- INT6 quantization-aware training (QAT) applied uniformly to all weights
- Muon optimizer for matrix parameters combined with Adam for scalars
- LeakyReLU(0.5)^2 activation, preserving gradient flow for negative inputs
- U-Net-style skip connections in the transformer architecture