PR #424

open

Add non-record EMA and adaptive export exploration

by someone114514

val_bpb

1.1725

Architecture

MLP3x with SmearGate and BigramHash

Optimizer

Muon

Artifact Size

16,399,881 bytes

Training Techniques

Quantization

int6

bits: 6

scope: baseline model weights
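
As a rough illustration of what 6-bit weight quantization can look like, the sketch below applies symmetric per-row scaling to the signed range [-31, 31]. The scaling scheme and the names `quantize_int6` and `dequantize_int6` are assumptions for illustration; the summary only states `bits: 6` and the scope, not how the PR implements it.

```python
def quantize_int6(row):
    """Symmetric quantization of one weight row to signed 6-bit integers.

    Hypothetical helper: per-row max-abs scaling is an assumption,
    not something the PR summary specifies.
    """
    qmax = 31  # signed 6-bit covers [-32, 31]; only the symmetric part is used
    max_abs = max((abs(w) for w in row), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp into the 6-bit range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_int6(q, scale):
    """Map 6-bit integer levels back to approximate float weights."""
    return [v * scale for v in q]
```

With this scheme the reconstruction error per weight is at most half a quantization step, `scale / 2`.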

Architecture

MLP3x

Uses a widened/deeper MLP-heavy baseline architecture.

parameters: null

SmearGate

Adds a gating mechanism to the baseline model.

parameters: null

BigramHash

Adds a bigram hashing component to the baseline.

parameters: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Weight Averaging

EMA

parameters: {"decay":0.9998,"start_frac":0.8,"enabled":true}
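
A minimal sketch of how a late-stage EMA with these parameters might behave. It assumes `start_frac` is the fraction of total training steps after which averaging begins, and that the shadow copy is initialized from the live weights at that point; neither detail is confirmed by the summary. The class name `LateEMA` is hypothetical, and plain floats stand in for weight tensors.

```python
class LateEMA:
    """Exponential moving average that only starts late in training.

    Hypothetical sketch: the meaning of start_frac is an assumption.
    """

    def __init__(self, total_steps, decay=0.9998, start_frac=0.8):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = None  # averaged weights, created on the first late step

    def update(self, step, weights):
        """Fold the current weights (name -> float) into the running average."""
        if step < self.start_step:
            return  # averaging has not started yet
        if self.shadow is None:
            self.shadow = dict(weights)  # initialize from the live weights
            return
        d = self.decay
        for name, w in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * w
```

At export time, the `shadow` values would be used in place of the live weights. With `decay = 0.9998`, each update moves the average only 0.02% of the way toward the current weights, so the average smooths over the last few thousand steps.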

Evaluation

sliding window eval

parameters: null
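
Sliding window evaluation commonly means scoring a long token stream with overlapping context windows so that each token is scored exactly once, with as much preceding context as the window allows. The sketch below only computes the window spans; the function name, the `window` and `stride` parameters, and the scoring convention are assumptions, since the summary lists no parameters for this step.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) spans covering n_tokens.

    Each window feeds tokens [start, end) to the model, but only tokens
    in [score_from, end) count toward the loss, so no token is scored twice.
    Hypothetical helper; not taken from the PR.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```

For example, `sliding_windows(10, 4, 2)` produces four windows; after the first, each one scores only its final two tokens while still seeing the two tokens before them as context.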

Other

other

Adaptive export-time pruning search to choose the smallest pruning ratio that meets an artifact byte budget.

parameters: {"prune_candidates":[0,0.01,0.02,0.03,0.04,0.05],"target_artifact_bytes":15950000}
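
The search described above can be sketched as a loop over the candidate ratios in ascending order that stops at the first one whose exported size fits the budget. Everything below is illustrative: magnitude pruning, the `zlib`-based size measurement, and all function names are assumptions standing in for whatever pruning criterion and export pipeline the PR actually uses.

```python
import struct
import zlib


def magnitude_prune(weights, ratio):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = round(len(weights) * ratio)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def artifact_bytes(weights):
    """Stand-in size measurement: float32 packing plus zlib compression."""
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return len(zlib.compress(raw, 9))


def choose_prune_ratio(weights, candidates, target_bytes, size_fn=artifact_bytes):
    """Return (ratio, size) for the smallest candidate that meets the budget.

    Returns (None, None) if no candidate fits.
    """
    for ratio in sorted(candidates):
        size = size_fn(magnitude_prune(weights, ratio))
        if size <= target_bytes:
            return ratio, size
    return None, None
```

With the parameters listed above, a call would look like `choose_prune_ratio(weights, [0, 0.01, 0.02, 0.03, 0.04, 0.05], 15950000)`. Trying candidates in ascending order means the least destructive pruning level that fits is chosen, at the cost of one export per candidate tried.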

Late-stage EMA for weight smoothing before export
Adaptive export-time pruning search under a byte budget
Budget-aware selection of the smallest pruning ratio that meets the target artifact size
Validation of a strong non-record result under constrained compute