PR #2102
open
Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis
by MaxIv25
val_bpb
1.1092
Architecture
Transformer
Optimizer
—
Artifact Size
15.02 MB
Training Techniques
Architecture
depth recurrence
Loops layers 3-5 twice, reusing the same block stack at multiple depths.
parameters: {"loop_start":3,"loop_end":5,"loops":2}
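A minimal sketch of how the looped layer order could be derived from these parameters. It assumes 0-based layer indices, an inclusive `loop_end`, and that `loops` counts total passes through the looped span; the function name `layer_schedule` and the total depth of 8 are hypothetical, since the listing states neither.

```python
def layer_schedule(num_layers, loop_start, loop_end, loops):
    """Return the order in which layer indices are executed.

    Assumption: `loops` is the total number of passes through
    layers loop_start..loop_end (inclusive), so loops=2 runs them twice.
    """
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    # The same layer objects (and weights) are reused on every pass.
    return before + looped * loops + after


# num_layers=8 is a hypothetical depth chosen for illustration only.
schedule = layer_schedule(num_layers=8, loop_start=3, loop_end=5, loops=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

Under these assumptions the effective depth grows from 8 to 11 block applications while the parameter count stays that of 8 layers.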
U-Net skip connections
Skip connections used in the transformer backbone.
parameters: null
SmearGate
Gating mechanism included in the architecture.
parameters: null
BigramHash
Bigram-based hashing component included in the architecture.
parameters: null
XSA
Attention/sequence module included in the architecture.
parameters: null
MoE
Upcycles the dense MLPs in layers 4-5 into 2-expert, top-1 mixture-of-experts modules during training.
parameters: {"layers":[4,5],"experts":2,"routing":"top-1","enable_at":0.3}
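A rough sketch of what upcycling a dense MLP into a 2-expert top-1 MoE can look like. It assumes every expert starts as a copy of the dense weights and the router starts at zero, so the upcycled module initially reproduces the dense output. The helper names (`upcycle`, `moe_mlp`), the tiny weight matrices, and the omission of gate-probability scaling are all illustrative simplifications, not details from the PR. `enable_at: 0.3` presumably means the switch happens 30% of the way through training, but the listing does not say so.

```python
import copy


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def dense_mlp(params, x):
    return matvec(params["w_out"], relu(matvec(params["w_in"], x)))


def upcycle(params, d_model, num_experts=2):
    """Copy the dense MLP into each expert and start the router at zero."""
    experts = [copy.deepcopy(params) for _ in range(num_experts)]
    router = [[0.0] * d_model for _ in range(num_experts)]
    return {"experts": experts, "router": router}


def moe_mlp(moe, x):
    """Top-1 routing: send the token to the single highest-scoring expert."""
    logits = matvec(moe["router"], x)
    best = max(range(len(logits)), key=lambda i: logits[i])  # ties pick expert 0
    return dense_mlp(moe["experts"][best], x), best


# Hypothetical toy weights: d_model=2, hidden=3.
dense = {
    "w_in": [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]],
    "w_out": [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]],
}
moe = upcycle(dense, d_model=2)
x = [1.0, 2.0]
dense_out = dense_mlp(dense, x)
moe_out, chosen = moe_mlp(moe, x)
print(dense_out, moe_out, chosen)
```

Because the experts are identical at the moment of upcycling, the function computed by the network does not change; the experts only diverge as training continues.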
parallel residuals
Uses parallel residual connections from layer 6 onward.
parameters: {"start_layer":6}
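The listing does not define "parallel residuals". One common reading is that attention and MLP both read the same block input and their outputs are summed into the residual stream, instead of the MLP reading the attention-updated stream. The sketch below contrasts the two under that assumption, using scalar stand-ins for the sublayers and omitting normalization.

```python
def sequential_block(x, attn, mlp):
    # Standard ordering: the MLP sees the stream after attention was added.
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    # Parallel ordering: both sublayers read the same input.
    return x + attn(x) + mlp(x)


# Hypothetical scalar stand-ins for the attention and MLP sublayers.
attn = lambda v: 0.5 * v
mlp = lambda v: v * v

seq = sequential_block(2.0, attn, mlp)
par = parallel_block(2.0, attn, mlp)
print(seq, par)  # 12.0 7.0
```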
Causal Bigram Blending
Blends an online, causal bigram prior into the model's predictions at eval time.
parameters: {"lambda":0.03}
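One plausible form of this blending, sketched under stated assumptions: the final probability is a linear mixture `(1 - lambda) * p_model + lambda * p_bigram`, and the bigram counts are built online from tokens that have already been scored, so the prior never sees the token it is predicting. The mixing rule, the uniform fallback for unseen contexts, and the function name `blended_bits` are illustrative guesses, not the PR's confirmed method.

```python
import math
from collections import defaultdict


def blended_bits(tokens, model_probs, vocab_size, lam=0.03):
    """Total code length in bits under a model/bigram mixture.

    model_probs[t] is the model's distribution (a list over the vocab)
    for predicting tokens[t].
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    prev = None
    for tok, probs in zip(tokens, model_probs):
        if prev is not None and totals[prev] > 0:
            p_bigram = counts[prev][tok] / totals[prev]
        else:
            p_bigram = 1.0 / vocab_size  # no history yet: fall back to uniform
        p = (1.0 - lam) * probs[tok] + lam * p_bigram
        bits += -math.log2(p)
        # Update counts only AFTER scoring, which keeps the prior causal.
        if prev is not None:
            counts[prev][tok] += 1
            totals[prev] += 1
        prev = tok
    return bits


# Toy example: a uniform "model" on a repeating two-token sequence.
tokens = [0, 1, 0, 1]
uniform = [[0.5, 0.5] for _ in tokens]
baseline = blended_bits(tokens, uniform, vocab_size=2, lam=0.0)
blended = blended_bits(tokens, uniform, vocab_size=2, lam=0.03)
print(baseline, blended)
```

In the toy example the bigram prior only helps on the last token, the first position where the context has been seen before, which is why the gain is small.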
Weight Averaging
EMA
parameters: null
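For reference, a minimal sketch of an exponential moving average over weights. The listing gives no parameters, so the decay of 0.5 and the single scalar weight are hypothetical values chosen to make the arithmetic easy to follow; practical decays are usually much closer to 1.

```python
def ema_update(ema, params, decay):
    """Blend the current weights into the running average."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}


params = {"w": 0.0}
ema = dict(params)
for step in range(3):
    params["w"] += 1.0  # stand-in for an optimizer step
    ema = ema_update(ema, params, decay=0.5)  # hypothetical decay

print(params["w"], ema["w"])  # 3.0 2.125
```

The averaged weights lag the live weights and smooth out step-to-step noise. In a setup like this one, the averaged copy would typically be the one evaluated and quantized, though the listing does not confirm that.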
Quantization
GPTQ
bits: 6
scope: all
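To show what a 6-bit grid means for the weights, the sketch below uses plain round-to-nearest symmetric quantization. This is a simplified stand-in and not GPTQ itself: GPTQ additionally uses second-order information from calibration data to compensate rounding error across the remaining weights. The per-row scale, the helper names, and the sample values are assumptions for illustration.

```python
def quantize_rtn(weights, bits=6):
    """Round-to-nearest symmetric quantization (simplified stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits; codes live in [-32, 31]
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero row: any scale works
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


# Hypothetical weight row.
row = [0.31, -0.12, 0.05, -0.31]
codes, scale = quantize_rtn(row)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(codes, scale, max_err)
```

With round-to-nearest, the per-weight error is bounded by half a quantization step. Under depth recurrence the same quantized weights are applied on every pass, which is one reason a loop-aware scheme might behave differently from a per-layer one.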
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- MoE upcycling combined with depth recurrence
- Analysis of a severe quantization gap when MoE and looping are used together
- Causal Bigram Blending as an eval-time improvement
- Observation that MoE quantizes normally without looping but degrades sharply with looping
- Proposal of depth-aware or per-depth quantization as a potential fix