PR #424

open

Add non-record EMA and adaptive export exploration

by someone114514
val_bpb
1.1725
Architecture
MLP3x with SmearGate and BigramHash
Optimizer
Muon
Artifact Size
16,399,881 bytes

Training Techniques

Quantization
int6
bits: 6
scope: baseline model weights
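
The summary gives only the bit width, so the following is a rough sketch of what 6-bit weight quantization could look like, not the PR's actual scheme. It assumes symmetric per-row quantization to the integer range [-31, 31] with one float scale per row; the function names `quantize_int6` and `dequantize_int6` are hypothetical.

```python
def quantize_int6(row):
    """Symmetric quantization of one row of floats to integers in [-31, 31].

    Assumption: one scale per row, chosen so the largest magnitude maps to 31.
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the signed 6-bit range used here.
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale


def dequantize_int6(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]
```

Under these assumptions the reconstruction error of each weight is at most half a quantization step, i.e. `scale / 2`.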
Architecture
MLP3x
Uses a widened/deeper MLP-heavy baseline architecture.
parameters: null
SmearGate
Adds a gating mechanism to the baseline model.
parameters: null
BigramHash
Adds a bigram hashing component to the baseline.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.9998,"start_frac":0.8,"enabled":true}
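
A minimal sketch of late-stage EMA using the parameters listed above. It assumes `start_frac` means the fraction of total training steps after which averaging begins, and it uses plain Python lists as a stand-in for tensors; the class name `LateEMA` and the `total_steps` argument are hypothetical.

```python
class LateEMA:
    """Exponential moving average of weights that starts late in training."""

    def __init__(self, decay=0.9998, start_frac=0.8, total_steps=1000):
        self.decay = decay
        # Assumption: averaging begins once this step index is reached.
        self.start_step = int(start_frac * total_steps)
        self.shadow = None  # averaged weights; None until averaging begins

    def update(self, step, weights):
        """weights: dict mapping parameter name -> list of floats."""
        if step < self.start_step:
            return
        if self.shadow is None:
            # First averaged step: copy the live weights as the starting point.
            self.shadow = {k: list(v) for k, v in weights.items()}
            return
        d = self.decay
        for name, values in weights.items():
            s = self.shadow[name]
            for i, x in enumerate(values):
                s[i] = d * s[i] + (1.0 - d) * x
```

In this sketch the exported model would take its weights from `shadow` rather than from the live training weights.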
Evaluation
sliding window eval
parameters: null
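
No parameters are listed for the sliding window evaluation, so the sketch below only illustrates the general idea: overlapping windows give later tokens more left context, and each token is scored exactly once. The function name and the `window` and `stride` arguments are assumptions, not values from the PR.

```python
def sliding_windows(n_tokens, window, stride):
    """Return a list of (start, end, score_from) triples.

    The model would see tokens[start:end] and only the predictions for
    tokens[score_from:end] would count toward the metric.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    windows = []
    scored_upto = 0  # every token before this index has already been scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows
```

For example, with 10 tokens, a window of 4, and a stride of 2, every window after the first scores only its last 2 tokens and uses the first 2 as context.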
Other
other
Adaptive export-time pruning search to choose the smallest pruning ratio that meets an artifact byte budget.
parameters: {"prune_candidates":[0,0.01,0.02,0.03,0.04,0.05],"target_artifact_bytes":15950000}
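
The search described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: it assumes magnitude pruning and zlib compression as stand-ins for the real pruning and export steps, and all function names are hypothetical.

```python
import struct
import zlib

PRUNE_CANDIDATES = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
TARGET_ARTIFACT_BYTES = 15_950_000


def magnitude_prune(weights, ratio):
    """Zero the `ratio` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * ratio)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def export_artifact(weights):
    """Assumed export step: pack as float32 and compress with zlib."""
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return zlib.compress(raw, 9)


def choose_prune_ratio(weights, candidates, target_bytes):
    """Return (ratio, artifact) for the smallest ratio that fits the budget.

    Returns (None, None) if no candidate meets the target.
    """
    for ratio in sorted(candidates):
        artifact = export_artifact(magnitude_prune(weights, ratio))
        if len(artifact) <= target_bytes:
            return ratio, artifact
    return None, None
```

Trying candidates in ascending order means the first one that fits is also the one that removes the fewest weights. How the PR handles the case where no candidate fits is not stated; returning `(None, None)` is a choice made for this sketch.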

Novel Contributions

  • Late-stage EMA for weight smoothing before export
  • Adaptive export-time pruning search under a byte budget
  • Budget-aware selection of the smallest pruning ratio that meets the target artifact size
  • Validation of a strong non-record result under constrained compute