val_bpb: 1.1725
Architecture: MLP3x with SmearGate and BigramHash
Optimizer: Muon
Artifact Size: 16,399,881 bytes
Training Techniques

Quantization: int6
- bits: 6
- scope: baseline model weights
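The int6 setting can be illustrated with a minimal quantizer. The report only states bits=6 over the baseline weights, so the scheme below (symmetric, per-tensor, round-to-nearest) is an assumption, not the run's actual export code.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor round-to-nearest quantization to signed
    6-bit integers in [-32, 31]. Illustrative assumption; the run's
    exact scheme is not specified in the report."""
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = float(np.max(np.abs(w))) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reconstructs exactly
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

Storing 6-bit codes (packed, plus one scale per tensor) rather than float32 is what makes the artifact-size budget below attainable.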
Architecture
- MLP3x: a widened and deepened MLP-heavy baseline architecture (parameters: null)
- SmearGate: adds a gating mechanism to the baseline model (parameters: null)
- BigramHash: adds a bigram hashing component to the baseline (parameters: null)
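The BigramHash component is not specified beyond its name. One common construction, sketched below, hashes each (previous, current) token pair into a bucket id that indexes an auxiliary embedding table added to the baseline's hidden state; the bucket count and hash multiplier are illustrative assumptions.

```python
import numpy as np

def bigram_hash_ids(tokens, num_buckets=65536, mult=1_000_003):
    """Hash consecutive token pairs into embedding buckets.
    num_buckets and mult are assumed values, not from the report."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([np.int64(0)], tokens[:-1]))  # pad position 0
    return (prev * mult + tokens) % num_buckets
```

A lookup into a `num_buckets x d_model` table at these ids, summed into the token embedding, gives the model cheap bigram statistics without a full vocab-squared table.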
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: null
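Since the report leaves Muon's hyperparameters null, a simplified sketch of its core step may help: SGD-momentum followed by approximate orthogonalization of the 2D update via a quintic Newton–Schulz iteration. The coefficients match Muon's public reference implementation; the learning rate and momentum defaults here are assumptions.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a 2D update matrix with the quintic
    Newton-Schulz iteration used by Muon's reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    if G.shape[0] > G.shape[1]:          # iterate on the wide orientation
        return newton_schulz_orth(G.T, steps).T
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One simplified Muon update for a 2D weight matrix
    (lr and momentum are assumed defaults, not from the report)."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orth(buf), buf
```

The iteration pushes all singular values of the update toward 1, so every direction in the update gets a similar step size.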
Weight Averaging: EMA
- parameters: {"decay": 0.9998, "start_frac": 0.8, "enabled": true}
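The EMA config (decay 0.9998, started at 80% of training) amounts to keeping a shadow copy of the weights and blending it toward the live weights once per step; the training-loop wiring and total-step count below are assumptions for illustration.

```python
import numpy as np

class EMAWeights:
    """Exponential moving average of model weights, enabled late in
    training. decay and start_frac mirror the reported config;
    total_steps is a placeholder."""
    def __init__(self, decay=0.9998, start_frac=0.8, total_steps=10_000):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = None

    def update(self, step, weights):
        if step < self.start_step:
            return  # EMA not yet enabled
        if self.shadow is None:
            self.shadow = [np.array(w, dtype=np.float64) for w in weights]
        else:
            for s, w in zip(self.shadow, weights):
                s *= self.decay
                s += (1.0 - self.decay) * w
```

At export time the shadow weights replace the live ones, smoothing out late-training noise before quantization and pruning.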
Evaluation: sliding window eval
- parameters: null
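Sliding window eval is left unparameterized in the report. The usual construction, sketched below with assumed window and stride, re-scores overlapping windows but keeps only the newly covered tokens, so every token is scored with substantial left context; dividing total NLL in nats by ln 2 times the byte count then yields a bits-per-byte figure like val_bpb.

```python
import numpy as np

def sliding_window_nll(score_fn, tokens, window=256, stride=128):
    """Average per-token NLL under a sliding window (window >= stride).
    score_fn(chunk) must return one NLL per token in the chunk; it is a
    stand-in for a real model. window/stride values are assumptions."""
    nlls, prev_end = [], 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        new = end - prev_end  # tokens not scored by an earlier window
        nlls.extend(score_fn(tokens[begin:end])[-new:])
        prev_end = end
        if end == len(tokens):
            break
    return float(np.mean(nlls))
```

Compared with scoring disjoint chunks, this avoids penalizing tokens that would otherwise sit at a context-free chunk boundary.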
Other
- Adaptive export-time pruning search: chooses the smallest pruning ratio that meets an artifact byte budget.
- parameters: {"prune_candidates": [0, 0.01, 0.02, 0.03, 0.04, 0.05], "target_artifact_bytes": 15950000}
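The export-time search can be read directly off those parameters: try candidate ratios in ascending order and keep the first (smallest) one whose exported artifact fits the byte budget. The `export_size_fn` interface below is a hypothetical stand-in for the actual export-and-measure step.

```python
def pick_prune_ratio(export_size_fn, candidates, target_bytes):
    """Return the smallest pruning ratio whose exported artifact is
    within the byte budget; if none fits, fall back to the largest
    candidate. export_size_fn(ratio) -> artifact size in bytes
    (hypothetical interface, not from the report)."""
    for ratio in sorted(candidates):
        if export_size_fn(ratio) <= target_bytes:
            return ratio
    return max(candidates)
```

Trying ratios from smallest to largest means the model is pruned no more than the budget requires, preserving as much quality as the size constraint allows.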
Novel Contributions
- Late-stage EMA weight smoothing before export
- Adaptive export-time pruning search: budget-aware selection of the smallest pruning ratio that meets the target artifact size
- Validation of a strong (though non-record) result under constrained compute