| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 0.9076 | Transformer | Muon | 15.32 MB |

## Training Techniques

### Architecture

- **MLP3x**: 10-layer Transformer with 3x LeakyReLU MLP blocks. Parameters: `{"layers": 10, "d_model": 512}`.
- **GQA**: grouped-query attention with 8 query heads sharing 4 KV heads. Parameters: `{"heads": 8, "kv_heads": 4}`.

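A minimal sketch of how the 8 query heads map onto the 4 shared KV heads in grouped-query attention (the shapes come from the parameters above; the contiguous-group mapping convention is an assumption):

```python
N_HEADS = 8      # query heads
N_KV_HEADS = 4   # shared key/value heads
GROUP_SIZE = N_HEADS // N_KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that a given query head attends with."""
    return q_head // GROUP_SIZE

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = {q: kv_head_for(q) for q in range(N_HEADS)}
print(mapping)  # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```

Relative to full multi-head attention, this halves the KV cache here, since only 4 KV heads are stored.
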
- **BigramHash**: hash-based bigram component; token bigrams are hashed into 4096 buckets, each associated with a 128-dimensional vector. Parameters: `{"buckets": 4096, "dim": 128}`.

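One plausible reading of the BigramHash component, sketched under stated assumptions: the (previous, current) token pair is hashed into one of the 4096 buckets, which would index a 128-dim table. The hash function (crc32 of the packed pair) is a stand-in, not the submission's:

```python
import zlib

BUCKETS = 4096  # from the listed parameters
DIM = 128       # vector width per bucket (listed as "dim")

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hash the (prev, cur) token pair into one of BUCKETS buckets.
    # crc32 is an illustrative choice; the actual hash is not stated.
    key = prev_tok.to_bytes(4, "little") + cur_tok.to_bytes(4, "little")
    return zlib.crc32(key) % BUCKETS

# The model would look up a DIM-wide vector at this index.
idx = bigram_bucket(17, 42)
assert 0 <= idx < BUCKETS
```
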
- **SmearGate**: gating mechanism included in the architecture.
- **Value Residual**: residual pathway applied to the attention values.
- **Gated Attention**: attention mechanism with gating.
- **XSA**: used in the last 4 layers. Parameters: `{"layers": 4}`.
- **Partial RoPE**: rotary positional embeddings applied to 16 of the 64 head dimensions. Parameters: `{"dimensions": "16/64"}`.

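A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions and passing the rest through unchanged. The pair layout (adjacent pairs) and base 10000 are common conventions, assumed here rather than taken from the submission:

```python
import math

HEAD_DIM = 64
ROT_DIM = 16  # only 16 of the 64 dimensions receive rotary embeddings

def partial_rope(x, pos, base=10000.0):
    """Rotate the first ROT_DIM dims of one head vector by
    position-dependent angles; the remaining dims pass through untouched."""
    out = list(x)
    for i in range(0, ROT_DIM, 2):
        theta = pos / (base ** (i / ROT_DIM))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

v = [1.0] * HEAD_DIM
rotated = partial_rope(v, pos=3)
assert rotated[ROT_DIM:] == v[ROT_DIM:]  # tail dims are unchanged
assert rotated[:ROT_DIM] != v[:ROT_DIM]  # rotated dims moved
```
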
- **LN Scale**: LayerNorm scaling included in the architecture.
- **U-Net skip connections**: U-Net-inspired skip connections added to the Transformer.
- **Tied embeddings**: input and output embeddings share weights.
- **Logit softcap**: softcapping applied to the output logits, with cap value 30. Parameters: `{"value": 30}`.

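A logit softcap with value 30 is commonly implemented as a scaled tanh, smoothly bounding logits to (-30, 30); the tanh form is the usual definition and is assumed here:

```python
import math

CAP = 30.0  # "value": 30 from the parameters above

def softcap(logit: float, cap: float = CAP) -> float:
    # Near-identity for |logit| << cap, saturating smoothly at +/-cap.
    return cap * math.tanh(logit / cap)

print(round(softcap(5.0), 3))     # small logits pass through almost unchanged
print(round(softcap(1000.0), 3))  # huge logits are capped near 30.0
```
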
### Quantization

- Mixed int5/int6 (scope: MLP and attention).

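The submission states only the bit widths and scope, not which tensors get int5 versus int6 or the exact scheme. A generic symmetric round-to-nearest sketch that works for either width:

```python
def quantize(values, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    # The scheme itself is an assumption, not the submission's method.
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 0.9]
q5, s5 = quantize(w, bits=5)            # int5: integers in [-15, 15] here
approx = dequantize(q5, s5)
# Round-trip error is at most half a quantization step.
assert all(abs(a - b) <= s5 / 2 + 1e-12 for a, b in zip(w, approx))
```
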
### Optimizer

- **Muon**: lr 0.03, momentum 0.92, weight decay 0.04.

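A skeleton of the update as configured above, showing where lr, momentum, and decoupled weight decay enter. Muon proper also orthogonalizes the momentum buffer (via a Newton-Schulz iteration) before applying it; this scalar sketch omits that matrix-level step:

```python
LR = 0.03           # MATRIX_LR selected by the screening runs
MOMENTUM = 0.92
WEIGHT_DECAY = 0.04

def muon_like_step(w, grad, buf):
    # Momentum accumulation plus decoupled weight decay; the
    # orthogonalization step that defines Muon is omitted here.
    buf = MOMENTUM * buf + grad
    w = w * (1.0 - LR * WEIGHT_DECAY) - LR * buf
    return w, buf

w, buf = 1.0, 0.0
w, buf = muon_like_step(w, grad=0.1, buf=buf)
```
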
### Weight Averaging

- **EMA** with decay 0.997. Parameters: `{"decay": 0.997}`.

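EMA weight averaging with decay 0.997 is a one-line update; the averaged copy trails the live weights and is typically the set used for evaluation:

```python
DECAY = 0.997  # from the parameters above

def ema_update(avg, params, decay=DECAY):
    # avg <- decay * avg + (1 - decay) * params, elementwise.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0]
for _ in range(1000):
    avg = ema_update(avg, [1.0])
# After 1000 steps toward 1.0, avg has closed 1 - 0.997**1000 of the gap.
```
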
### Compression

- **zstd** at level 22.

### Evaluation

- **Multi-order n-gram backoff** with a score-first, backward-looking cache and entropy-adaptive alpha mixing. Parameters: `{"orders": "2-7", "score_first": true, "backward_looking": true, "entropy_adaptive_alpha": true}`.

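The parameters suggest interpolating model predictions with n-gram statistics. A heavily simplified sketch of longest-match backoff over orders 2-7 plus an illustrative entropy-based alpha; the submission's score-first, backward-looking cache and its exact mixing rule are not specified, so every formula below is an assumption:

```python
import math
from collections import defaultdict

ORDERS = range(2, 8)  # n-gram orders 2 through 7

class NgramBackoff:
    def __init__(self):
        # counts[context_tuple][next_token] -> occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        for n in ORDERS:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[ctx][tokens[i + n - 1]] += 1

    def prob(self, context, token):
        # Back off from the longest available context to shorter ones.
        for k in range(6, 0, -1):
            ctx = tuple(context[-k:])
            if ctx in self.counts:
                dist = self.counts[ctx]
                return dist[token] / sum(dist.values())
        return None  # no n-gram evidence for any context suffix

def adaptive_alpha(model_probs, base_alpha=0.25):
    # Illustrative rule only: weight the n-gram side more when the
    # model's predictive entropy (uncertainty) is high.
    h = -sum(p * math.log(p) for p in model_probs if p > 0.0)
    return min(1.0, base_alpha * h)

ng = NgramBackoff()
ng.observe([1, 2, 3, 1, 2, 3, 1, 2])
```
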
### Other

- Systematic hyperparameter screening (74 experiments, 10-12 screening steps) identified MATRIX_LR=0.03 as the best setting.

### LR Schedule

- **Warmdown**. Parameters: `{"warmdown_steps": 3500}`.

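A warmdown schedule holds the learning rate flat and then decays it over the final 3500 steps. The linear decay shape and the total step count below are assumptions (only warmdown_steps is given):

```python
PEAK_LR = 0.03        # MATRIX_LR from the optimizer settings
WARMDOWN_STEPS = 3500

def lr_at(step, total_steps):
    # Constant LR until the warmdown window, then linear decay to zero.
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return PEAK_LR
    return PEAK_LR * (total_steps - step) / WARMDOWN_STEPS

# Hypothetical 10000-step run: flat, then warmed down over the last 3500.
assert lr_at(0, 10000) == PEAK_LR
assert lr_at(10000, 10000) == 0.0
```
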
### Regularization

- **Weight decay** of 0.04.

## Novel Contributions

- Improved on the previous PR #802 result by raising MATRIX_LR from 0.02 to 0.03.
- Systematic hyperparameter screening identified MATRIX_LR=0.03 as the strongest training-hyperparameter improvement.
- Uses multi-order n-gram backoff evaluation with a score-first, backward-looking cache.
- Entropy-adaptive alpha mixing for the n-gram backoff evaluation.
- Combines a 10-layer Transformer with BigramHash, SmearGate, value residuals, gated attention, and mixed int5/int6 quantization.