PR #2102

open

Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis

by MaxIv25

val_bpb

1.1092

Architecture

Transformer

Optimizer

—

Artifact Size

15.02 MB

Training Techniques

Architecture

depth recurrence

Loops layers 3-5 twice, reusing the same block stack at multiple depths.

parameters: {"loop_start":3,"loop_end":5,"loops":2}
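As a minimal sketch of what these parameters imply, the execution order of layers can be written out explicitly. The helper name `layer_schedule`, the inclusive reading of `loop_end`, and the total of 8 layers are assumptions made for illustration; the PR summary does not state the model depth.

```python
def layer_schedule(num_layers, loop_start, loop_end, loops):
    """Return the order in which layer indices are executed.

    Assumes loop_start..loop_end is inclusive and that `loops` is the
    total number of passes through that block (hypothetical reading).
    """
    order = list(range(0, loop_start))                    # layers before the loop
    for _ in range(loops):
        order.extend(range(loop_start, loop_end + 1))     # shared weights, reused
    order.extend(range(loop_end + 1, num_layers))         # layers after the loop
    return order


# num_layers=8 is a made-up value for illustration only.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, loops=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

Under these assumptions the effective depth is 11 block applications while only 8 blocks of weights are stored, which is the appeal of depth recurrence when artifact size is constrained.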

U-Net skip connections

Skip connections used in the transformer backbone.

parameters: null

SmearGate

Gating mechanism included in the architecture.

parameters: null

BigramHash

Bigram-based hashing component included in the architecture.

parameters: null

XSA

Attention/sequence module included in the architecture.

parameters: null

MoE

Upcycles dense MLP layers into a 2-expert top-1 mixture-of-experts module on layers 4-5 during training.

parameters: {"layers":[4,5],"experts":2,"routing":"top-1","enable_at":0.3}
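The sketch below illustrates the general idea of upcycling under stated assumptions: each expert starts as a copy of the dense MLP, and a zero-initialized router makes the upcycled module reproduce the dense output at the moment of the switch. All names (`upcycle`, `moe_forward`, `moe_enabled`) are hypothetical, the gate-probability scaling used by many MoE implementations is omitted, and reading `enable_at` as a fraction of total training steps is a guess rather than something the PR summary confirms.

```python
import copy


def linear(weights, x):
    """Multiply a row-major weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def dense_mlp(params, x):
    """Two-layer MLP with ReLU, standing in for the dense block."""
    hidden = [max(0.0, h) for h in linear(params["w_in"], x)]
    return linear(params["w_out"], hidden)


def upcycle(dense_params, num_experts=2):
    """Copy the dense MLP into every expert and start the router at zero."""
    d_model = len(dense_params["w_in"][0])
    experts = [copy.deepcopy(dense_params) for _ in range(num_experts)]
    router = [[0.0] * d_model for _ in range(num_experts)]
    return {"experts": experts, "router": router}


def moe_forward(moe, x):
    """Top-1 routing: run only the expert with the largest router logit."""
    logits = linear(moe["router"], x)
    chosen = max(range(len(logits)), key=lambda i: logits[i])
    return dense_mlp(moe["experts"][chosen], x), chosen


def moe_enabled(step, total_steps, enable_at=0.3):
    """Assumed meaning of enable_at: fraction of training before the switch."""
    return step >= enable_at * total_steps


# Toy weights, made up for illustration.
dense = {"w_in": [[1.0, -1.0], [0.5, 2.0]], "w_out": [[1.0, 1.0], [0.0, -1.0]]}
moe = upcycle(dense)
x = [1.0, 2.0]
out, chosen = moe_forward(moe, x)
print(out == dense_mlp(dense, x))  # True: function is preserved at the switch
```

Because the experts are independent copies, they can diverge during the remaining training while the model's behavior is unchanged at the instant of upcycling.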

parallel residuals

Uses parallel residual connections from layer 6 onward.

parameters: {"start_layer":6}
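For readers unfamiliar with the term, the following sketch contrasts a sequential residual block with one common parallel formulation, in which attention and MLP both read the same input. Whether the PR uses exactly this formulation is an assumption; normalization is omitted, and `attn` and `mlp` are toy scalar stand-ins.

```python
def sequential_block(x, attn, mlp):
    """Standard block: the MLP sees the output of the attention residual."""
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    """Parallel block: attention and MLP both read the same input x."""
    return x + attn(x) + mlp(x)


def run_block(layer_idx, x, attn, mlp, start_layer=6):
    """Use the parallel form from start_layer onward, sequential before it."""
    if layer_idx >= start_layer:
        return parallel_block(x, attn, mlp)
    return sequential_block(x, attn, mlp)


# Toy scalar sublayers, made up for illustration.
attn = lambda v: 2 * v
mlp = lambda v: v + 1

print(run_block(5, 1, attn, mlp))  # 7  (sequential: 1 + 2 = 3, then 3 + 4)
print(run_block(6, 1, attn, mlp))  # 5  (parallel: 1 + 2 + 2)
```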

Causal Bigram Blending

Eval-time online causal bigram prior blending.

parameters: {"lambda":0.03}
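One plausible reading of this technique is a linear interpolation between the model's next-token distribution and a bigram distribution built only from tokens already scored during evaluation. The sketch below follows that reading; the class name `CausalBigramPrior`, the add-one smoothing, and the interpolation form are assumptions, and only the value lambda = 0.03 comes from the summary.

```python
from collections import defaultdict


class CausalBigramPrior:
    """Online bigram counts; updated only with tokens that were already scored."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def prob(self, prev, token):
        # Add-one smoothing (an assumption) keeps unseen pairs at nonzero mass.
        return (self.counts[prev][token] + 1) / (self.totals[prev] + self.vocab_size)

    def update(self, prev, token):
        self.counts[prev][token] += 1
        self.totals[prev] += 1


def blended_probs(model_probs, prior, prev, lam=0.03):
    """Mix model probabilities with the bigram prior: (1 - lam) * p + lam * q."""
    return [(1 - lam) * p + lam * prior.prob(prev, t)
            for t, p in enumerate(model_probs)]


# Toy example: 4-token vocabulary and a uniform stand-in for the model.
prior = CausalBigramPrior(vocab_size=4)
uniform = [0.25] * 4
before = blended_probs(uniform, prior, prev=1)
prior.update(prev=1, token=2)          # token 2 has now been scored after token 1
after = blended_probs(uniform, prior, prev=1)
print(after[2] > before[2])  # True
```

To stay causal, `update` must be called only after the current token's loss has been recorded, so the prior never sees the token it is helping to predict.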

Weight Averaging

EMA

parameters: null
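The summary lists no EMA parameters, so the following is only a generic sketch of an exponential moving average over weights. The function name `ema_update` and the decay values are hypothetical.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]


# Toy run with a large step (decay=0.5) so the effect is visible.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
print(ema)  # [0.875, 1.75]
```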

Quantization

GPTQ

bits: 6

scope: all
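To make the 6-bit setting concrete, the sketch below shows plain round-to-nearest symmetric quantization to a signed 6-bit grid. This is deliberately not GPTQ: GPTQ additionally uses calibration data to compensate rounding error across the remaining weights, which is omitted here. The function names and the per-tensor scale are assumptions.

```python
def quantize_rtn(weights, bits=6):
    """Round-to-nearest symmetric quantization (a simpler stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1        # 31 for 6 bits
    qmin = -(2 ** (bits - 1))         # -32 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Toy weights, made up for illustration.
weights = [0.31, -0.12, 0.05, -0.31]
codes, scale = quantize_rtn(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```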

Sequence Length

sequence_length

train_length: null

eval_length: null

Novel Contributions

MoE upcycling combined with depth recurrence
Analysis of a severe quantization gap when MoE and looping are used together
Causal Bigram Blending as an eval-time improvement
Observation that MoE quantizes normally without looping but degrades sharply with looping
Proposal of depth-aware or per-depth quantization as a potential fix
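The summary does not spell out what depth-aware quantization would involve. One possible reading is that a looped layer sees different input statistics on each pass, so calibration statistics could be tracked per (layer, pass) rather than pooled per layer. The sketch below only illustrates that bookkeeping; the function names, the choice of mean squared input as the statistic, and the numbers are all made up.

```python
from collections import defaultdict


def stats_per_visit(schedule, inputs_per_step):
    """Mean squared input for every (layer, pass) visit in execution order."""
    visits = defaultdict(int)
    stats = {}
    for layer_idx, inputs in zip(schedule, inputs_per_step):
        pass_idx = visits[layer_idx]          # 0 on first pass, 1 on second, ...
        visits[layer_idx] += 1
        stats[(layer_idx, pass_idx)] = sum(v * v for v in inputs) / len(inputs)
    return stats


def pooled_stats(per_visit):
    """Average the per-visit statistics into a single value per layer."""
    grouped = defaultdict(list)
    for (layer_idx, _), value in per_visit.items():
        grouped[layer_idx].append(value)
    return {layer: sum(vals) / len(vals) for layer, vals in grouped.items()}


# Looped block 3-5 executed twice; activations are invented for illustration.
loop_schedule = [3, 4, 5, 3, 4, 5]
inputs_per_step = [[1.0, -1.0], [1.0, 1.0], [2.0, 0.0],
                   [3.0, -3.0], [2.0, 2.0], [4.0, 0.0]]
per_visit = stats_per_visit(loop_schedule, inputs_per_step)
pooled = pooled_stats(per_visit)
print(per_visit[(3, 0)], per_visit[(3, 1)], pooled[3])  # 1.0 9.0 5.0
```

In this toy case the pooled statistic for layer 3 matches neither pass, which is the kind of mismatch a per-depth scheme would try to avoid. Whether this explains the reported gap is not established by the summary.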