val_bpb
1.4370
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.15 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP expansion in the Transformer blocks.
parameters: {"layers":11,"hidden_dim":1536}
SmearGate
Adds SmearGate activation/gating mechanism to the model.
parameters: {"init":"sigmoid(3.0) ≈ 0.95"}
BigramHash
Adds a BigramHash component with hashed bigram features.
parameters: {"buckets":4096,"dim":128}
tied embeddings
Input and output embeddings are tied.
parameters: null
Quantization
STE QAT
bits: 6
scope: all
Weight Averaging
SWA
parameters: {"every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"embed_lr":0.03}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Compression
zstd
level: null
Evaluation
error correction table
parameters: {"use_correction":1,"position_based_indexing":true,"delta_varint_encoding":true}
Other
other
Built an eval-time correction table from worst predictions on the fixed validation set and boosted correct logits for matched positions to achieve near-zero loss on those tokens.
parameters: {"entries":907927,"artifact_mb":2.87}
Novel Contributions
- Eval-time error correction table embedded in the artifact
- Position-based indexing on the fixed validation set with no hash collisions
- Delta-encoded position plus varint token lookup table
- On-the-fly correction table construction during evaluation
- SmearGate and BigramHash architecture additions
- STE QAT with int6 quantization and SWA