| Metric | Value |
| --- | --- |
| val_bpb | 1.1456 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15.14 MB |
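The val_bpb metric above is bits per byte: mean cross-entropy converted from nats to bits, then rescaled by the tokens-to-bytes ratio of the eval text. A minimal sketch of that conversion; the loss, token count, and byte count below are made-up illustrative numbers, not the values behind the reported 1.1456:

```python
import math

# Hypothetical values for illustration only; the report does not state
# the raw loss or eval corpus size behind val_bpb = 1.1456.
mean_loss_nats = 0.9423   # mean cross-entropy per token, in nats
n_tokens = 250_000        # tokens in the eval split
n_bytes = 293_000         # UTF-8 bytes those tokens decode to

bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
val_bpb = bits_per_token * n_tokens / n_bytes   # bits per byte of raw text
print(round(val_bpb, 4))
```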
Training Techniques

- Quantization: mixed int6/int5/int4 (bits: 6; scope: MLP and attention)
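The mixed-bit-width quantization above can be sketched as symmetric round-to-nearest fake quantization. This is a minimal sketch assuming a per-tensor symmetric scheme; the report does not specify the exact quantization format:

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest fake quantization to `bits` bits.
    Illustrative only: the actual scheme (per-tensor vs per-channel,
    symmetric vs asymmetric) is not stated in the report."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)    # integer grid
    return q * scale                                 # back to float

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
for bits in (6, 5, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(bits, f"{err:.4f}")   # error grows as bit-width shrinks
```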
- Architecture: MoE, a 2-expert soft-routing mixture-of-experts replacing the dense MLP for parameter expansion (parameters: {"experts": 2, "expert_multiplier": 1.5})
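A 2-expert soft-routing MoE layer of the kind described here runs every token through both expert MLPs and blends their outputs with a softmax gate (no top-k routing). The sketch below uses made-up dimensions and initialization, not the actual run's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 96, 2   # illustrative; expert_multiplier would scale d_ff

W_gate = rng.normal(scale=0.02, size=(d_model, n_experts))
W1 = rng.normal(scale=0.02, size=(n_experts, d_model, d_ff))
W2 = rng.normal(scale=0.02, size=(n_experts, d_ff, d_model))

def soft_moe_mlp(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model). Soft routing: every token is processed by
    all experts and outputs are mixed by a per-token softmax gate."""
    logits = x @ W_gate                                    # (tokens, experts)
    g = np.exp(logits - logits.max(-1, keepdims=True))
    g /= g.sum(-1, keepdims=True)                          # gate weights sum to 1
    h = np.maximum(np.einsum("td,edf->etf", x, W1), 0.0)   # ReLU, per expert
    y = np.einsum("etf,efd->etd", h, W2)                   # (experts, tokens, d_model)
    return np.einsum("te,etd->td", g, y)                   # gate-weighted blend

out = soft_moe_mlp(rng.normal(size=(8, d_model)))
print(out.shape)
```

Soft routing keeps the layer fully differentiable (no load-balancing loss needed), at the cost of running every expert on every token.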
- Weight Averaging: SWA (parameters: {"checkpoints": 30})
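SWA over saved checkpoints reduces to a uniform average of parameter tensors. A minimal sketch, assuming checkpoints are plain dicts of arrays (the actual checkpoint format is not given in the report):

```python
import numpy as np

def average_checkpoints(checkpoints: list[dict]) -> dict:
    """Uniform SWA average of parameter dicts, accumulated in float64
    to avoid drift over many checkpoints."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: (v / len(checkpoints)).astype(np.float32) for k, v in avg.items()}

rng = np.random.default_rng(0)
ckpts = [{"w": rng.normal(size=(4, 4)).astype(np.float32)} for _ in range(30)]
swa = average_checkpoints(ckpts)
```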
- Compression: zstd (level: 22)
- Evaluation: sliding-window eval (parameters: {"stride": 64})
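Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and scores only the newly exposed positions, so nearly every token is predicted with close to a full window of left context. A sketch with a dummy stand-in for the model's per-position loss; the window size of 2048 is taken from the train_length entry below, since eval_length is null:

```python
import numpy as np

def token_loss(context: np.ndarray) -> np.ndarray:
    """Stand-in for the model: returns a per-position loss in nats.
    A real implementation would run the network on `context`."""
    return np.full(len(context), 0.9)

def sliding_window_eval(tokens: np.ndarray, window: int = 2048, stride: int = 64) -> float:
    losses = []
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)              # up to `window` tokens of context
        per_pos = token_loss(tokens[ctx_start:end])
        losses.extend(per_pos[-(end - start):])        # score only the new tokens
    return float(np.mean(losses))

print(sliding_window_eval(np.zeros(5000, dtype=np.int64)))
```

The stride trades compute for fidelity: stride 64 reruns the model ~32x more often than non-overlapping 2048-token chunks, but avoids scoring tokens with truncated context.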
- Sequence Length: train_length 2048, eval_length null
- Other: quantization comparison across multiple attention/MLP bit-width configurations on the same trained dense model (parameters: {"configurations": ["attn6_mlp6", "attn6_mlp5", "attn6_mlp4", "attn5_mlp5", "attn5_mlp4"]})
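The listed configurations quantize attention and MLP weights of the same dense model at independent bit-widths. A hypothetical sweep over those configuration names, using random stand-in weights and mean absolute reconstruction error as an illustrative proxy metric (the report compares val_bpb, not weight error):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest fake quantization (illustrative scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# Stand-in "model": one attention matrix and one MLP matrix.
model = {"attn": rng.normal(size=(128, 128)), "mlp": rng.normal(size=(128, 512))}

configs = ["attn6_mlp6", "attn6_mlp5", "attn6_mlp4", "attn5_mlp5", "attn5_mlp4"]
for name in configs:
    attn_bits, mlp_bits = int(name[4]), int(name[-1])   # parse "attnX_mlpY"
    err = sum(np.abs(fake_quant(model[k], b) - model[k]).mean()
              for k, b in (("attn", attn_bits), ("mlp", mlp_bits)))
    print(name, f"{err:.4f}")
```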
Novel Contributions
- Preliminary negative result for a 2-expert soft-routing MoE under the 16 MB artifact cap
- Leaderboard-relevant comparison of multi-bit post-training quantization on the same dense model
- Evidence that int5 MLP quantization is viable while int4 MLP quantization is destructive in this setup
- Partial MoE training log and checkpoint table documenting the observed degradation relative to dense control