PR #660

open

Non-record: Soft MoE Exploration — Dense Gating Fixes Sparse Router Collapse Under 16MB (WIP, val_bpb=1.1826)

by HugoOchoaLP
val_bpb
1.1826
Architecture
Transformer
Optimizer
Artifact Size
17.3MB

Training Techniques

Architecture
Soft MoE
Dense mixture-of-experts gating where all experts run on all tokens with learned soft weights, avoiding sparse router collapse and enabling compile-friendly execution.
parameters: {"num_experts":2,"moe_layers":"last 2 layers","moe_start_layer":8}
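A minimal sketch of the dense-gating idea, in plain Python with scalar toy experts (function and variable names here are illustrative, not the PR's actual code): every expert runs on every token, so there is no routing branch that can collapse and the computation graph stays static.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_moe(token, experts, gate_logits):
    """Dense (soft) MoE: all experts process the token; outputs are
    combined with softmax gate weights. No expert is ever skipped,
    so there is no top-k routing step to collapse, and the static
    graph is friendly to torch.compile in the real model."""
    weights = softmax(gate_logits)                    # one weight per expert
    outputs = [expert(token) for expert in experts]   # all experts run
    return sum(w * o for w, o in zip(weights, outputs))

# Toy usage with num_experts=2 (scalar "token" for clarity):
experts = [lambda x: 2.0 * x, lambda x: x + 1.0]
y = soft_moe(3.0, experts, gate_logits=[0.0, 0.0])   # equal weights -> 5.0
```

Contrast with a sparse router, which would pick one expert per token and can degenerate to always picking the same one.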
SmearGate
Gating mechanism used with the MoE setup.
parameters: null
BigramHash
Hashes adjacent token pairs into a fixed-size bucket table whose buckets index learned embedding features added to the model.
parameters: {"dimensions":128,"hash_size":10240}
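A rough sketch of what a bigram-hash feature lookup typically does with these parameters (the hash function and names here are illustrative assumptions, not the PR's implementation): each adjacent token pair is hashed into one of 10240 buckets, and the bucket id indexes a learned table of 128-dim embeddings, accepting collisions as lossy sharing.

```python
def bigram_hash_ids(tokens, hash_size=10240):
    """Map each adjacent token pair (bigram) to a bucket id in a
    fixed-size table. The id then indexes a learned embedding table
    of shape [hash_size, 128]; hash collisions share one embedding."""
    ids = []
    for prev, cur in zip(tokens, tokens[1:]):
        # simple multiplicative pair hash (illustrative choice)
        ids.append((prev * 1000003 + cur) % hash_size)
    return ids

ids = bigram_hash_ids([5, 17, 17, 9])  # 3 bigrams -> 3 bucket ids
```

The table costs only 10240 × 128 parameters, which is cheap relative to the model while injecting local n-gram information.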
Weight Averaging
EMA
parameters: {"decay":0.998}
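The EMA update at decay 0.998 is a one-liner; a sketch for intuition (plain Python over weight lists, purely illustrative):

```python
def ema_update(avg, new, decay=0.998):
    """Exponential moving average of weights: avg <- decay*avg + (1-decay)*new.
    Unlike SWA's uniform average over checkpoints, recent weights dominate,
    with an effective horizon of roughly 1/(1-decay) = 500 steps here."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]

avg = [1.0, 2.0]
avg = ema_update(avg, [2.0, 0.0])   # -> approximately [1.002, 1.996]
```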
Quantization
mixed int5/int6
bits: null
scope: int5 for MLP, int6 for attention
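A sketch of symmetric per-tensor quantization at these bit widths (the scheme here, scale from the max absolute weight, is a common choice and an assumption about this PR): int5 yields levels in [-16, 15] for MLP weights, int6 yields [-32, 31] for attention, and the resulting integer streams are what zstd compresses.

```python
def quantize(weights, bits):
    """Symmetric per-tensor quantization: store small ints plus one
    float scale instead of full-precision weights. bits=5 clamps to
    [-16, 15]; bits=6 clamps to [-32, 31]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero tensor
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return [x * scale for x in q]

q, s = quantize([0.4, -1.0, 0.2], bits=5)   # MLP weights -> int5
w = dequantize(q, s)
```

Lower bit width means coarser levels but a more compressible artifact; using int5 for MLP and int6 for attention trades precision where it is presumably cheapest.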
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
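Sliding-window evaluation with stride 64 generally means moving a fixed context window over the text in 64-token steps and scoring only the tokens that are new to each window, so every token is predicted with long context. A sketch of the window bookkeeping (the context length of 256 is an illustrative assumption; only the stride comes from the PR):

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Return (window_start, window_end, score_from) triples such that
    tokens in [score_from, window_end) are scored exactly once, each
    with as much preceding context as the window allows."""
    spans = []
    start, scored = 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, end, scored))  # score tokens [scored, end)
        scored = end
        start += stride
    return spans

spans = sliding_windows(400, context=256, stride=64)
```

Smaller strides give each scored token more context at the cost of more forward passes; stride 64 with a 256-token window scores 64 tokens per pass after the first window.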
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Selective MoE applied only to deeper layers to reduce parameter overhead and fit under the 16MB constraint.
parameters: {"moe_start_layer":8}
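With each expert being a full MLP copy, the parameter overhead of the selective placement is easy to bound; a back-of-envelope sketch (the 10-layer depth and per-layer MLP size are illustrative assumptions, only num_experts=2 and moe_start_layer=8 come from the PR):

```python
def moe_param_overhead(n_layers, mlp_params_per_layer, num_experts, moe_start_layer):
    """Extra parameters from replacing the MLP with a num_experts-way
    soft MoE only in layers >= moe_start_layer, where each added
    expert duplicates the layer's MLP."""
    moe_layers = n_layers - moe_start_layer
    return moe_layers * (num_experts - 1) * mlp_params_per_layer

# 2 experts in the last 2 layers of a hypothetical 10-layer model:
extra = moe_param_overhead(10, 1_000_000, 2, 8)   # -> 2_000_000
```

Applying MoE to all layers would instead add one MLP copy per layer, which is what the selective placement avoids under the 16MB budget.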

Novel Contributions

  • Dense Soft MoE variant that avoids sparse router collapse
  • Compile-friendly MoE design that works with torch.compile
  • Selective application of MoE only in the last layers to reduce overhead
  • Use of SmearGate and BigramHash in the model
  • EMA replacing SWA for weight averaging
  • Mixed int5 MLP / int6 attention quantization with zstd-22 compression