PR #340 (open)

V2 Prototype: SwiGLU + Dropout + MuonWD + MidLayerLoop

by starfly-webView on GitHub
val_bpb: 1.2182
Architecture: Transformer
Optimizer: Muon
Artifact Size: 4.8 MB

Training Techniques

Optimizer
Muon
weight_decay: 0.1
momentum: null
other_params: null
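Muon can be sketched as momentum SGD whose buffered gradient is approximately orthogonalized by a Newton-Schulz iteration before being applied, with weight decay decoupled from the gradient. A minimal NumPy sketch; only weight_decay=0.1 comes from this PR (the PR lists momentum as null, so `lr` and `momentum` below are assumed placeholder values):

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration that drives the singular values
    # of G toward 1, approximately orthogonalizing the update.
    # Coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.1):
    # One Muon update with decoupled weight decay: shrink W toward
    # zero, then step along the orthogonalized momentum buffer.
    buf = momentum * buf + grad
    W = W * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return W, buf
```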
Regularization
dropout
parameters: {"rate":0.1,"scope":"attention and MLP blocks"}
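The dropout entry corresponds to standard inverted dropout at rate 0.1; a minimal sketch (applying it at the listed scope, attention and MLP block outputs, is left to the caller):

```python
import numpy as np

def dropout(x, rate=0.1, training=True, rng=None):
    # Inverted dropout: zero each activation with probability `rate`
    # and scale survivors by 1/(1-rate) so the expected value of the
    # output matches the input. rate=0.1 matches the PR.
    if not training or rate == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At evaluation time (`training=False`) the input passes through unchanged, so no rescaling is needed at inference.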
Architecture
SwiGLU
Replaces squared-ReLU MLP activation with SwiGLU.
parameters: null
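A minimal sketch of the swap this entry describes, with the squared-ReLU baseline alongside for contrast (the weight shapes and the absence of biases are assumptions, not PR details):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP block: a SiLU-gated branch multiplies an ungated
    # "up" projection elementwise before the down projection.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

def squared_relu_mlp(x, W_in, W_out):
    # The baseline this PR replaces: ReLU(x W)**2, then down-project.
    return (np.maximum(x @ W_in, 0.0) ** 2) @ W_out
```

Note SwiGLU carries three weight matrices where squared-ReLU carries two, so the hidden width is usually narrowed to keep parameter count comparable.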
depth recurrence
Loops only the middle layers rather than all layers uniformly.
parameters: null
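The middle-layer loop can be sketched as a forward pass in which the prefix and suffix of the stack run once while the middle slice is reused; the loop boundaries and count below are illustrative, since the PR leaves the parameters null:

```python
def looped_forward(x, layers, loop_start, loop_end, n_loops):
    # Targeted depth recurrence: layers[loop_start:loop_end] are
    # reused n_loops times, growing effective depth without adding
    # parameters, while the early and late layers each run once.
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x
```

With six layers, a loop over layers 2–3, and n_loops=3, the effective depth is 2 + 3*2 + 2 = 10 layer applications.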
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
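The int8 and zlib stages compose as a post-training pipeline: quantize weights to int8, then deflate the raw bytes. The exact scheme below (symmetric per-tensor scale) is an assumption; the PR only names int8 with scope "all" and zlib with an unspecified level:

```python
import numpy as np
import zlib

def pack_weights(W):
    # Symmetric per-tensor int8 quantization, then zlib on the raw
    # bytes. scale maps the largest |weight| to 127.
    scale = max(float(np.abs(W).max()), 1e-12) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, W.shape

def unpack_weights(blob, scale, shape):
    # Inverse: inflate, reinterpret as int8, rescale to float32.
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

Round-trip error is bounded by half the quantization step, and zlib then shrinks the int8 buffer losslessly.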
Weight Averaging
EMA
parameters: null
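EMA weight averaging keeps a slow-moving copy of the parameters that is typically evaluated in place of the raw weights. A one-line sketch; decay=0.999 is an assumed typical value, since the PR lists the parameters as null:

```python
def ema_update(avg, params, decay=0.999):
    # Exponential moving average of model weights: each step blends
    # the running average toward the current parameters.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```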

Novel Contributions

  • SwiGLU MLP upgrade
  • 10% dropout applied to attention and MLP blocks
  • Muon weight decay regularization
  • middle-layer looping / targeted depth recurrence
  • post-training int8 + zlib artifact compression
  • EMA weight averaging, as described in the branch README