| Metric | Value |
| --- | --- |
| val_bpb | 1.1160 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact Size | 15.75 MB |
## Training Techniques

### Quantization
**Mixed int5/int6**: MLP weights quantized to int5, attention weights to int6.
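The mixed-precision scheme can be illustrated with a minimal symmetric per-tensor quantizer. This is a sketch only: the function names, rounding, and clamping details are assumptions, not the submission's actual quantization code.

```python
# Sketch of symmetric n-bit quantization (hypothetical helper, not the
# submission's implementation). Signed range: [-2**(bits-1), 2**(bits-1) - 1].

def quantize(weights, bits):
    """Map floats to signed integers at the given bit width, plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]

# MLP weights at int5, attention weights at int6, as in the recipe above.
mlp_q, mlp_scale = quantize([0.31, -0.8, 0.05, 0.47], bits=5)
attn_q, attn_scale = quantize([0.31, -0.8, 0.05, 0.47], bits=6)
```

Lower bit widths shrink the artifact at the cost of coarser weight resolution, which is presumably why the less sensitive MLP weights get the narrower int5 format.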
### Architecture
**BigramHash**: hashes token bigrams into 10240 buckets of 128-dimensional embeddings (`buckets: 10240`, `embedding_dim: 128`).
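The bucket-lookup step can be sketched as follows; the specific hash function and mixing constant are illustrative assumptions, since only the bucket count and embedding dimension are given.

```python
# Hypothetical sketch of BigramHash bucket lookup: each (previous, current)
# token pair is hashed into one of 10240 rows of a 10240 x 128 embedding table.

BUCKETS = 10240
EMBEDDING_DIM = 128

def bigram_bucket(prev_token: int, token: int) -> int:
    """Deterministically map a token bigram to an embedding-table row index."""
    # Simple multiplicative mix; the real hash may differ.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    return h % BUCKETS

tokens = [17, 4, 99, 4]
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

The hashing trick keeps the bigram table at a fixed size (here 10240 x 128 parameters) regardless of vocabulary size, at the cost of occasional bucket collisions.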
**SmearGate**: gating mechanism applied in the model (parameters not specified).

**Value residual**: residual connection on the value vectors.

**Gated attention**: attention mechanism with gating.
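Since these gating techniques are listed without parameters, only the generic pattern can be shown: a learned sigmoid gate modulating an output elementwise. Everything below is an assumption about the general shape, not the submission's design.

```python
import math

# Generic gating pattern (hypothetical): out = sigmoid(gate_logits) * value.
# The actual SmearGate / gated-attention formulations are not specified.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_output(attn_out, gate_logits):
    """Apply an elementwise sigmoid gate to an attention output vector."""
    return [o * sigmoid(g) for o, g in zip(attn_out, gate_logits)]

out = gated_output([0.5, -1.0, 2.0], [10.0, -10.0, 0.0])
# Large positive logit passes the value through; large negative suppresses it.
```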
**MLP3x with LeakyReLU(0.5)^2**: three-layer MLP with a squared LeakyReLU activation (negative slope 0.5).
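One plausible reading of "LeakyReLU(0.5)^2" is the LeakyReLU output squared, analogous to the squared-ReLU activations used in fast-training transformer recipes; the sketch below assumes that reading.

```python
# Assumed interpretation of the LeakyReLU(0.5)^2 activation:
# square the output of a LeakyReLU with negative slope 0.5.

def leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    return x if x >= 0 else negative_slope * x

def act(x: float) -> float:
    """Squared LeakyReLU(0.5): y = LeakyReLU_0.5(x) ** 2."""
    y = leaky_relu(x)
    return y * y
```

Unlike plain ReLU^2, the leaky variant keeps a nonzero gradient for negative inputs before squaring.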
**Weight tying**: tied input/output embeddings.

**U-Net skip connections**: skip connections inspired by the U-Net architecture.
### Optimizer
**Muon**: learning rate 0.02, momentum 0.99, weight decay 0.04.

**AdamW**: hyperparameters not specified.
### Weight Averaging
**EMA**: decay 0.995.
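The EMA update with decay 0.995 is standard; this sketch uses a plain dict as a stand-in for the model's weight tensors.

```python
# Exponential moving average of weights with decay 0.995, as listed above.
# A dict of floats stands in for the model's parameter tensors.

DECAY = 0.995

def ema_update(ema_weights, weights, decay=DECAY):
    """ema <- decay * ema + (1 - decay) * current, elementwise."""
    return {k: decay * ema_weights[k] + (1 - decay) * weights[k]
            for k in weights}

ema = {"w": 0.0}
for step_weights in ({"w": 1.0}, {"w": 1.0}):
    ema = ema_update(ema, step_weights)
```

The averaged weights, not the raw training weights, are typically what gets evaluated and shipped, since the EMA smooths out late-training noise.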
### Compression
**zstd**: level 22 (the maximum compression level).
### Test-Time Training
**LoRA TTT**: rank 8, learning rate 0.01, applied to the Q and V projections plus the LM head across all layers. 64 documents are batched in parallel with a per-document reset; the optimizer is Adam with betas (0.9, 0.95); chunks of 256 tokens; 3 epochs, with scoring on the final epoch only; documents are split at BOS boundaries.
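The per-document adapter mechanics can be sketched in a few lines: a rank-r update B @ A is added to a frozen weight W and re-initialized (along with the optimizer state) for every document. The helper names and the tiny dimensions below are illustrative; the recipe itself uses rank 8 with Adam across 64 parallel documents.

```python
import random

# Hypothetical, simplified sketch of per-document LoRA test-time training.
# W stays frozen; only the low-rank factors A and B would be trained.

RANK = 2   # rank 8 in the actual recipe; 2 keeps this demo tiny
DIM = 4

def init_lora(dim, rank):
    """Fresh adapter per document: A ~ small random, B = 0 (so delta = 0)."""
    a = [[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(dim)]
    return a, b

def effective_weight(w, a, b):
    """W_eff = W + B @ A (plain-Python matrix multiply)."""
    delta = [[sum(b[i][k] * a[k][j] for k in range(len(a)))
              for j in range(len(a[0]))] for i in range(len(b))]
    return [[w[i][j] + delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
for _document in range(3):
    a, b = init_lora(DIM, RANK)        # per-document adapter (and optimizer) reset
    w_eff = effective_weight(w, a, b)  # equals W at init, since B = 0
```

Initializing B to zero makes the adapted model exactly match the base model at the start of each document, so test-time training can only move away from the base predictions as evidence accumulates.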
## Novel Contributions
- Batched per-document LoRA test-time training: rank-8 LoRA on the Q/V projections and LM head across all layers, with 64 documents processed in parallel and fresh adapter initialization plus an optimizer reset for each document
- Mixed int5 (MLP) and int6 (attention) quantization combined with zstd level-22 compression of the artifact
- Architecture modifications: BigramHash, SmearGate, value residual, gated attention, U-Net skip connections, and a three-layer MLP with squared LeakyReLU(0.5) activation
- EMA weight averaging with decay 0.995
- Efficient training with the Muon optimizer combined with AdamW