val_bpb: 1.3267
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.3 MB
Training Techniques
Quantization
STE QAT int6
Quantization-aware training that fake-quantizes weights to int6, using a straight-through estimator (STE) for gradients.
bits: 6
scope: all weights
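A minimal NumPy sketch of the forward pass of STE fake-quantization to int6. The source states only bits=6 and scope=all weights; the group size of 64 and symmetric grid are assumptions:

```python
import numpy as np

def fake_quant_int6(w: np.ndarray, group_size: int = 64) -> np.ndarray:
    """Fake-quantize weights to a symmetric int6 grid with per-group scales.

    QAT forward pass: weights are rounded to the int6 grid and dequantized,
    so the loss sees quantized values. In a training framework the backward
    pass would use the straight-through estimator (STE): gradients flow
    through this op as if it were the identity.
    """
    flat = w.reshape(-1, group_size)
    qmax = 31  # symmetric int6 range used here: [-31, 31]
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # all-zero groups stay zero
    q = np.clip(np.round(flat / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)
```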
Architecture
MLP3x
Uses a 3x MLP expansion in an 11-layer Transformer backbone.
parameters: {"layers":11,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":3}
GQA
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
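The KV-head sharing can be sketched in NumPy (sequence length and head dimension below are illustrative; the 8/4 head split is from the parameters):

```python
import numpy as np

def gqa_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Attention scores under grouped-query attention.

    q: (T, num_heads, head_dim); k: (T, num_kv_heads, head_dim).
    Each KV head is shared by num_heads // num_kv_heads query heads
    (8 // 4 = 2 in this configuration), halving the KV cache.
    """
    num_heads, num_kv_heads = q.shape[1], k.shape[1]
    group = num_heads // num_kv_heads
    k_full = np.repeat(k, group, axis=1)  # (T, num_heads, head_dim)
    head_dim = q.shape[-1]
    # scores[h, t, s] = <q[t, h], k_full[s, h]> / sqrt(head_dim)
    return np.einsum("thd,shd->hts", q, k_full) / np.sqrt(head_dim)
```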
SmearGate
Adds a SmearGate module at the embedding layer to inject additional signal.
parameters: null
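The source gives no parameters for SmearGate, so the following is only a guess at the mechanism: a gated blend of each token's embedding with its predecessor's. The fixed scalar gate stands in for whatever learned gate the module actually uses:

```python
import numpy as np

def smear_gate(x: np.ndarray, gate: float = 0.1) -> np.ndarray:
    """Blend each position's embedding with the previous position's.

    x: (T, model_dim). A learned gate is replaced here by a fixed scalar
    for illustration; position 0 has no predecessor and is unchanged.
    """
    out = x.copy()
    out[1:] += gate * x[:-1]
    return out
```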
BigramHash
Adds a compact bigram hash embedding for extra context.
parameters: {"bigram_vocab_size":2048,"bigram_dim":96}
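A sketch of a hashed bigram embedding consistent with the stated parameters (the hash constant, the pad id for position 0, and the combination rule are assumptions):

```python
import numpy as np

BIGRAM_VOCAB_SIZE = 2048
BIGRAM_DIM = 96

def bigram_ids(tokens: np.ndarray) -> np.ndarray:
    """Hash each (previous, current) token pair into the small bigram vocab."""
    prev = np.concatenate([[0], tokens[:-1]])  # assume a pad/BOS id of 0
    mixed = prev.astype(np.int64) * 1000003 + tokens.astype(np.int64)
    return mixed % BIGRAM_VOCAB_SIZE

def bigram_embed(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up the (2048, 96) bigram table; the result would then be
    combined with the token embedding (e.g. added or concatenated)."""
    return table[bigram_ids(tokens)]
```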
Initialization
OrthoInit
Orthogonal initialization for large matrices with scaled projection weights.
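One common recipe for orthogonal initialization is QR decomposition of a Gaussian matrix; the gain and the exact "scaled projection weights" factor are not specified in the source, so the scaling here is an assumption:

```python
import numpy as np

def orthogonal_init(shape, gain: float = 1.0, seed: int = 0) -> np.ndarray:
    """Orthogonal initialization via QR of a Gaussian matrix.

    Returns a (rows, cols) matrix whose smaller dimension is orthonormal,
    scaled by `gain`.
    """
    rng = np.random.default_rng(seed)
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs for determinism
    if rows < cols:
        q = q.T
    return gain * q
```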
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_end":0.99}
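The top-level momentum is null because momentum is warmed up over training; a linear ramp between the endpoints in other_params is one plausible reading (the linear shape is an assumption):

```python
def muon_momentum(progress: float, start: float = 0.92, end: float = 0.99) -> float:
    """Momentum warmup for Muon: ramp from 0.92 to 0.99 as training
    progresses, with `progress` in [0, 1]. Endpoints come from
    other_params above; the linear schedule is assumed."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress
```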
AdamW
weight_decay: 0.01
momentum: null
other_params: {"used_for":"token/scalar optimizers"}
Weight Averaging
SWA
parameters: {"checkpoints":7}
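Stochastic weight averaging over the saved checkpoints is an elementwise mean of the parameter tensors; a minimal sketch with NumPy state dicts:

```python
import numpy as np

def swa_average(checkpoints: list[dict]) -> dict:
    """SWA: elementwise mean of parameter dicts.

    `checkpoints` is a list of {name: array} state dicts, e.g. the 7
    checkpoints saved during the warmdown phase.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```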
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
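Sliding-window evaluation advances a fixed context window in small strides and scores only the newly exposed tokens, so every token is predicted with near-full left context. A sketch of the span bookkeeping (the context window of 512 is an assumption; stride=64 is from the parameters):

```python
def sliding_eval_spans(n_tokens: int, window: int = 512, stride: int = 64):
    """Return (ctx_start, end, score_from) triples covering n_tokens.

    Each window scores only tokens in [score_from, end); after the first
    window, every scored token sees at least window - stride tokens of
    left context.
    """
    first_end = min(window, n_tokens)
    spans = [(0, first_end, 0)]
    pos = first_end
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), end, pos))
        pos = end
    return spans
```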
LR Schedule
warmdown
parameters: {"fraction":0.15,"wallclock_based":true}
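The warmdown can be sketched as an LR multiplier keyed to elapsed wallclock rather than iteration count (the linear decay shape is an assumption; fraction=0.15 is from the parameters):

```python
def warmdown_scale(elapsed_s: float, budget_s: float, fraction: float = 0.15) -> float:
    """LR multiplier for a wallclock-based warmdown.

    Full LR until the final `fraction` of the wallclock budget, then a
    linear decay to zero. Keying the schedule to elapsed time rather than
    iteration count keeps it well-behaved when torch.compile warm-up makes
    early iterations disproportionately slow.
    """
    progress = min(elapsed_s / budget_s, 1.0)
    if progress < 1.0 - fraction:
        return 1.0
    return max(0.0, (1.0 - progress) / fraction)
```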
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
Other
Wallclock-fraction warmdown
Keys the warmdown schedule to elapsed wallclock fraction rather than iteration count, avoiding scheduling issues when torch.compile overhead makes iteration timing uneven.
parameters: {"last_fraction":0.15}
Novel Contributions
- Int6 grouped quantization for all weights
- STE fake-quantization QAT during the last 15% of wallclock
- Wallclock-fraction warmdown that avoids iteration-based scheduling issues caused by torch.compile overhead
- SWA with 7 checkpoints during warmdown
- Compact BigramHash embedding and SmearGate additions
- Orthogonal initialization for large matrices
- Sliding-window evaluation with stride 64
- zstd-22 artifact compression