PR #2102

open

Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis

by MaxIv25
val_bpb: 1.1092
Architecture: Transformer
Optimizer: null
Artifact Size: 15.02 MB

Training Techniques

Architecture
depth recurrence
Loops layers 3-5 twice, reusing the same block stack at multiple depths.
parameters: {"loop_start":3,"loop_end":5,"loops":2}
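The loop parameters can be read as a layer schedule. The sketch below is illustrative only: it assumes 0-indexed layers, an inclusive `loop_end`, and a hypothetical total of 8 blocks (the summary does not state the layer count). `layer_schedule` is a made-up helper name, not a function from the PR.

```python
def layer_schedule(num_layers, loop_start, loop_end, loops):
    """Return the order in which block indices are applied in one forward pass.

    Assumes 0-indexed blocks and an inclusive [loop_start, loop_end] range.
    """
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    # The looped span is repeated `loops` times with the same weights.
    return before + looped * loops + after


# num_layers=8 is a hypothetical value chosen for illustration.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, loops=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

Under these assumptions, 8 stored blocks yield an effective depth of 11, and blocks 3 to 5 each see activations from two different depths.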
U-Net skip connections
Skip connections used in the transformer backbone.
parameters: null
SmearGate
Gating mechanism included in the architecture.
parameters: null
BigramHash
Bigram-based hashing component included in the architecture.
parameters: null
XSA
Attention/sequence module included in the architecture.
parameters: null
MoE
Upcycles dense MLP layers into a 2-expert top-1 mixture-of-experts module on layers 4-5 during training.
parameters: {"layers":[4,5],"experts":2,"routing":"top-1","enable_at":0.3}
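A minimal sketch of what upcycling with top-1 routing could look like, in plain Python with toy weight matrices. It assumes that each expert starts as a copy of the dense MLP weights and that `enable_at` is a fraction of total training steps; neither detail is confirmed by the summary. All function names are hypothetical, and the sketch omits scaling the expert output by the router probability.

```python
import copy


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def upcycle(dense_w, num_experts=2):
    """Create experts as independent copies of the dense MLP weights."""
    return [copy.deepcopy(dense_w) for _ in range(num_experts)]


def moe_forward(experts, router_w, x):
    """Top-1 routing: send x to the single expert with the largest router logit."""
    logits = matvec(router_w, x)  # one logit per expert
    k = max(range(len(logits)), key=logits.__getitem__)
    return matvec(experts[k], x), k


def moe_enabled(step, total_steps, enable_at=0.3):
    """Assumed meaning of enable_at: switch to MoE after this fraction of training."""
    return step >= enable_at * total_steps
```

Right after upcycling, every expert computes the same function as the dense layer it came from, so the switch does not change the model output until the experts diverge through training.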
parallel residuals
Uses parallel residual connections from layer 6 onward.
parameters: {"start_layer":6}
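For contrast, the sketch below shows one common meaning of "parallel residuals": attention and MLP both read the same input and their outputs are summed into the residual stream. Whether this PR uses exactly this form is an assumption. `attn` and `mlp` are placeholder callables, and normalization layers are left out.

```python
def sequential_block(x, attn, mlp):
    """Standard residual block: the MLP sees the output of the attention step."""
    h = x + attn(x)
    return h + mlp(h)


def parallel_block(x, attn, mlp):
    """Parallel residual block: attention and MLP both read the same input."""
    return x + attn(x) + mlp(x)


def uses_parallel(layer_idx, start_layer=6):
    """Assumed reading of start_layer: layers at or above it use the parallel form."""
    return layer_idx >= start_layer
```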
Causal Bigram Blending
Eval-time online causal bigram prior blending.
parameters: {"lambda":0.03}
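One plausible form of this blending is a linear mix of the model probability with a bigram estimate built only from tokens already scored. The sketch below assumes that form, add-one smoothing, and a per-position list of model probabilities; the summary gives only the value of lambda, so treat the rest as illustrative.

```python
import math
from collections import defaultdict


def blended_bits(tokens, model_probs, vocab_size, lam=0.03):
    """Total bits to code tokens[1:] under a model/bigram mixture.

    model_probs[t] is the model's distribution over the token at position t.
    Bigram counts are updated only after a token is scored, so the prior
    never sees the token it is predicting (causal).
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for t in range(1, len(tokens)):
        prev, cur = tokens[t - 1], tokens[t]
        row = counts[prev]
        seen = sum(row.values())
        # Add-one smoothed bigram estimate from the evaluation prefix.
        p_bigram = (row[cur] + 1) / (seen + vocab_size)
        p = (1.0 - lam) * model_probs[t][cur] + lam * p_bigram
        total_bits += -math.log2(p)
        row[cur] += 1  # update after scoring
    return total_bits
```

With `lam=0.0` the function reduces to the model's own code length, which makes it easy to measure how much the bigram prior contributes on a given evaluation stream.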
Weight Averaging
EMA
parameters: null
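The standard EMA update keeps a shadow copy of the weights that moves a small step toward the live weights after each optimizer step. The decay value below is a hypothetical placeholder, since the summary lists no parameters.

```python
def ema_update(ema, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```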
Quantization
GPTQ
bits: 6
scope: all
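To make the 6-bit budget concrete, here is a simplified round-to-nearest quantizer with one symmetric scale per weight group. This is not GPTQ, which additionally uses calibration activations to compensate rounding error; it only illustrates the 6-bit grid that each weight is mapped onto.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (a stand-in, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Each weight is reconstructed to within half a quantization step, so the error per weight is bounded by `scale / 2`.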
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • MoE upcycling combined with depth recurrence
  • Analysis of a severe quantization gap when MoE and looping are used together
  • Causal Bigram Blending as an eval-time improvement
  • Observation that MoE quantizes normally without looping but degrades sharply with looping
  • Proposal of depth-aware or per-depth quantization as a potential fix
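One possible reading of the per-depth proposal is to keep calibration statistics per application of a block rather than per block, so that a looped layer gets separate quantization treatment at each depth it runs. The sketch below only shows the bookkeeping for that idea. The 8-block schedule is hypothetical and `calibration_keys` is not a function from the PR.

```python
from collections import defaultdict


def calibration_keys(schedule):
    """Label each block application with a (block_idx, visit) key."""
    visits = defaultdict(int)
    keys = []
    for block in schedule:
        keys.append((block, visits[block]))
        visits[block] += 1
    return keys


# Hypothetical schedule: blocks 3 to 5 applied twice in an 8-block model.
schedule = [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
keys = calibration_keys(schedule)
```

Here blocks 3 to 5 receive two keys each, for example `(3, 0)` and `(3, 1)`, while unlooped blocks receive one. Statistics gathered under each key could then drive a separate scale or quantized copy per depth, at the cost of extra artifact size.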