PR #660

open

Non-record: Soft MoE Exploration — Dense Gating Fixes Sparse Router Collapse Under 16MB (WIP, val_bpb=1.1826)

by HugoOchoaLP
val_bpb
1.1826
Architecture
Transformer
Optimizer
Artifact Size
17.3MB

Training Techniques

Architecture
Soft MoE
Dense mixture-of-experts gating where all experts run on all tokens with learned soft weights, avoiding sparse router collapse and enabling compile-friendly execution.
parameters: {"num_experts":2,"moe_layers":"last 2 layers","moe_start_layer":8}
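A minimal sketch of the dense-gating idea, in plain Python with scalar toy experts (function and variable names here are illustrative, not the PR's actual code): every expert runs on every token, so there is no routing branch that can collapse and the computation graph stays static.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_moe(token, experts, gate_logits):
    """Dense (soft) MoE: all experts process the token; outputs are
    combined with softmax gate weights. No expert is ever skipped,
    so there is no top-k routing step to collapse, and the static
    graph is friendly to torch.compile in the real model."""
    weights = softmax(gate_logits)                    # one weight per expert
    outputs = [expert(token) for expert in experts]   # all experts run
    return sum(w * o for w, o in zip(weights, outputs))

# Toy usage with num_experts=2 (scalar "token" for clarity):
experts = [lambda x: 2.0 * x, lambda x: x + 1.0]
y = soft_moe(3.0, experts, gate_logits=[0.0, 0.0])   # equal weights -> 5.0
```

Contrast with a sparse router, which would pick one expert per token and can degenerate to always picking the same one.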
SmearGate
Gating mechanism used with the MoE setup.
parameters: null
BigramHash
Hashes adjacent token pairs into a fixed-size bucket table whose buckets index learned embedding features added to the model.
parameters: {"dimensions":128,"hash_size":10240}
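A rough sketch of what a bigram-hash feature lookup typically does with these parameters (the hash function and names here are illustrative assumptions, not the PR's implementation): each adjacent token pair is hashed into one of 10240 buckets, and the bucket id indexes a learned table of 128-dim embeddings, accepting collisions as lossy sharing.

```python
def bigram_hash_ids(tokens, hash_size=10240):
    """Map each adjacent token pair (bigram) to a bucket id in a
    fixed-size table. The id then indexes a learned embedding table
    of shape [hash_size, 128]; hash collisions share one embedding."""
    ids = []
    for prev, cur in zip(tokens, tokens[1:]):
        # simple multiplicative pair hash (illustrative choice)
        ids.append((prev * 1000003 + cur) % hash_size)
    return ids

ids = bigram_hash_ids([5, 17, 17, 9])  # 3 bigrams -> 3 bucket ids
```

The table costs only 10240 × 128 parameters, which is cheap relative to the model while injecting local n-gram information.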
Weight Averaging
EMA
parameters: {"decay":0.998}
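The EMA update at decay 0.998 is a one-liner; a sketch for intuition (plain Python over weight lists, purely illustrative):

```python
def ema_update(avg, new, decay=0.998):
    """Exponential moving average of weights: avg <- decay*avg + (1-decay)*new.
    Unlike SWA's uniform average over checkpoints, recent weights dominate,
    with an effective horizon of roughly 1/(1-decay) = 500 steps here."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]

avg = [1.0, 2.0]
avg = ema_update(avg, [2.0, 0.0])   # -> approximately [1.002, 1.996]
```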
Quantization
mixed int5/int6
bits: null
scope: int5 for MLP, int6 for attention
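A sketch of symmetric per-tensor quantization at these bit widths (the scheme here, scale from the max absolute weight, is a common choice and an assumption about this PR): int5 yields levels in [-16, 15] for MLP weights, int6 yields [-32, 31] for attention, and the resulting integer streams are what zstd compresses.

```python
def quantize(weights, bits):
    """Symmetric per-tensor quantization: store small ints plus one
    float scale instead of full-precision weights. bits=5 clamps to
    [-16, 15]; bits=6 clamps to [-32, 31]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero tensor
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return [x * scale for x in q]

q, s = quantize([0.4, -1.0, 0.2], bits=5)   # MLP weights -> int5
w = dequantize(q, s)
```

Lower bit width means coarser levels but a more compressible artifact; using int5 for MLP and int6 for attention trades precision where it is presumably cheapest.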
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
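Sliding-window evaluation with stride 64 generally means moving a fixed context window over the text in 64-token steps and scoring only the tokens that are new to each window, so every token is predicted with long context. A sketch of the window bookkeeping (the context length of 256 is an illustrative assumption; only the stride comes from the PR):

```python
def sliding_windows(n_tokens, context=256, stride=64):
    """Return (window_start, window_end, score_from) triples such that
    tokens in [score_from, window_end) are scored exactly once, each
    with as much preceding context as the window allows."""
    spans = []
    start, scored = 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, end, scored))  # score tokens [scored, end)
        scored = end
        start += stride
    return spans

spans = sliding_windows(400, context=256, stride=64)
```

Smaller strides give each scored token more context at the cost of more forward passes; stride 64 with a 256-token window scores 64 tokens per pass after the first window.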
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Selective MoE applied only to deeper layers to reduce parameter overhead and fit under the 16MB constraint.
parameters: {"moe_start_layer":8}
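With each expert being a full MLP copy, the parameter overhead of the selective placement is easy to bound; a back-of-envelope sketch (the 10-layer depth and per-layer MLP size are illustrative assumptions, only num_experts=2 and moe_start_layer=8 come from the PR):

```python
def moe_param_overhead(n_layers, mlp_params_per_layer, num_experts, moe_start_layer):
    """Extra parameters from replacing the MLP with a num_experts-way
    soft MoE only in layers >= moe_start_layer, where each added
    expert duplicates the layer's MLP."""
    moe_layers = n_layers - moe_start_layer
    return moe_layers * (num_experts - 1) * mlp_params_per_layer

# 2 experts in the last 2 layers of a hypothetical 10-layer model:
extra = moe_param_overhead(10, 1_000_000, 2, 8)   # -> 2_000_000
```

Applying MoE to all layers would instead add one MLP copy per layer, which is what the selective placement avoids under the 16MB budget.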

Novel Contributions

  • Dense Soft MoE variant that avoids sparse router collapse
  • Compile-friendly MoE design that works with torch.compile
  • Selective application of MoE only in the last layers to reduce overhead
  • Use of SmearGate and BigramHash in the model
  • EMA replacing SWA for weight averaging
  • Mixed int5 MLP / int6 attention quantization with zstd-22 compression