val_bpb: 1.1568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15704854 bytes (≈15.7 MB)
Training Techniques
Architecture
SmearGate
Adds a SmearGate component to the dense lexical model.
parameters: null
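The card does not define SmearGate. One plausible reading, by analogy with "smeared" token-mixing components in other lexical models, is a learned per-dimension gate that blends each position's embedding with its predecessor's; the function name, gate form, and initialization below are all assumptions, not the card's recorded design:

```python
import numpy as np

def smear_gate(x, g):
    """Hypothetical SmearGate: blend each position with the previous one.

    x: (T, d) token embeddings; g: (d,) learnable gate logits.
    y[t] = x[t] + sigmoid(g) * x[t-1], with x[-1] taken as zeros.
    """
    gate = 1.0 / (1.0 + np.exp(-g))                        # per-dim gate in (0, 1)
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # shift right by one token
    return x + gate * prev

x = np.random.randn(8, 512)
g = np.zeros(512)            # zero logits -> gate of 0.5 everywhere
y = smear_gate(x, g)
```

Position 0 has no predecessor, so it passes through unchanged under this sketch.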
BigramHash
Uses a bigram-hash feature module for lexical modeling.
parameters: {"dimensions":4096,"embedding_dim":128}
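A bigram-hash module of this shape (4096 buckets, 128-dim embeddings, per the parameters above) hashes each (previous token, current token) pair into a bucket and looks up an embedding. A minimal sketch; the hash function and the BOS handling are assumptions:

```python
import numpy as np

N_BUCKETS, EMB_DIM = 4096, 128                       # from the card's parameters
table = np.random.randn(N_BUCKETS, EMB_DIM) * 0.02   # learnable in a real model

def bigram_bucket(prev_id, cur_id):
    # Hypothetical mixing hash; the actual hash used is not recorded.
    return (prev_id * 1000003 + cur_id) % N_BUCKETS

def bigram_features(token_ids):
    """(T,) int token ids -> (T, 128) hashed bigram embeddings.

    Position 0 is paired with a BOS id of 0 (an assumption).
    """
    prev = np.concatenate([[0], token_ids[:-1]])
    buckets = np.array([bigram_bucket(p, c) for p, c in zip(prev, token_ids)])
    return table[buckets]
```

Identical bigrams always hit the same bucket, so the feature is deterministic per pair.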
MLP3x
Uses a 3× MLP hidden-dimension expansion (rather than the conventional 4×).
parameters: null
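With 3× expansion the MLP hidden width is 3·d_model; at the 512-dim model listed in the contributions below, that is a 1536-wide hidden layer. A minimal sketch; the activation choice is an assumption, since the card does not name one:

```python
import numpy as np

d_model = 512
w_in = np.random.randn(d_model, 3 * d_model) * 0.02   # 512 -> 1536
w_out = np.random.randn(3 * d_model, d_model) * 0.02  # 1536 -> 512

def mlp3x(x):
    """x: (T, d_model) -> (T, d_model). ReLU is assumed, not recorded."""
    return np.maximum(x @ w_in, 0.0) @ w_out

n_weights = w_in.size + w_out.size   # 2 * 512 * 1536 = 1,572,864 (biases omitted)
```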
Optimizer
Muon
weight_decay: 0.038
momentum: null
other_params: null
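Muon, as published, applies momentum to the gradient of each 2-D weight and then approximately orthogonalizes the update with a quintic Newton-Schulz iteration; the card's weight decay of 0.038 fits naturally as a decoupled term. A minimal numpy sketch using the standard published coefficients; the momentum and learning-rate values are assumptions, since the card records them as null:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)        # scale so singular values are <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # pushes singular values toward 1
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.038):
    """One Muon update on a 2-D weight; lr and momentum are assumed values."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    w = w * (1.0 - lr * weight_decay) - lr * update   # decoupled weight decay
    return w, buf
```

After a few iterations the update's singular values cluster near 1, which is the point of the orthogonalization.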
Regularization
weight decay
parameters: {"adam_weight_decay":0.01,"muon_weight_decay":0.038}
Weight Averaging
SWA
parameters: {"every":50,"start_frac":0.5}
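With every=50 and start_frac=0.5, a weight snapshot enters the running average every 50 steps once training passes the halfway mark. A minimal sketch of that schedule (the equal-weight running mean is the standard SWA form; the card records no variant):

```python
import numpy as np

def swa_average(total_steps, every=50, start_frac=0.5, weights_at=None):
    """Equal-weight running average of snapshots taken every `every` steps
    once `start_frac` of training has elapsed. `weights_at(step)` returns
    the model weights at that step (here a stand-in callable)."""
    avg, n = None, 0
    for step in range(1, total_steps + 1):
        if step >= start_frac * total_steps and step % every == 0:
            w = weights_at(step)
            n += 1
            avg = w if avg is None else avg + (w - avg) / n  # incremental mean
    return avg

# Toy check: "weights" are just the step number, so the average is easy to verify.
avg = swa_average(1000, weights_at=lambda s: np.array([float(s)]))
```

For 1000 steps this averages the snapshots at steps 500, 550, ..., 1000.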
Evaluation
sliding window eval
parameters: {"context_length":2048,"stride":256}
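With context_length=2048 and stride=256, sliding-window evaluation advances the window 256 tokens at a time and scores only the tokens not covered by the previous window, so every token after the first window is evaluated with near-maximal context. A sketch of the window schedule, assuming the standard stride scheme:

```python
def sliding_windows(n_tokens, context=2048, stride=256):
    """Return (begin, end, score_from) spans: tokens [score_from, end)
    are scored using context [begin, end)."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The scored spans tile the document exactly once, so summed negative log-likelihoods over them give a well-defined bpb.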
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Compression
zstd
level: null
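The contributions below mention an int6_zstd_core re-export, i.e. 6-bit weight codes followed by zstd (level unrecorded). The packing half can be sketched with the standard library: four 6-bit codes fit in three bytes, a 25% size cut before zstd even runs. The quantization scheme producing the codes is not recorded and is not shown here:

```python
def pack_int6(codes):
    """Pack 6-bit codes (0..63) into bytes, 4 codes per 3 bytes."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        chunk = codes[i:i + 4] + [0] * (4 - len(codes[i:i + 4]))  # zero-pad tail
        bits = (chunk[0] << 18) | (chunk[1] << 12) | (chunk[2] << 6) | chunk[3]
        out += bytes([(bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF])
    return bytes(out)

def unpack_int6(data, n):
    """Inverse of pack_int6; n is the original code count."""
    codes = []
    for i in range(0, len(data), 3):
        bits = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        codes += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return codes[:n]
```

zstd would then be applied to the packed byte stream to produce the final artifact.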
Novel Contributions
- Dense lexical model: 11 layers, 512-dim, with KV4 and MLP3x
- SmearGate architecture component
- BigramHash (4096 × 128) lexical feature module
- Muon optimizer with weight decay 0.038
- SWA (stochastic weight averaging) training schedule
- Legal re-export using int6_zstd_core to fit under the 16 MB artifact cap
- Document-sliding evaluation with 2048-token context and 256-token stride