PR #340 (open)

V2 Prototype: SwiGLU + Dropout + MuonWD + MidLayerLoop

by starfly-webView on GitHub
val_bpb: 1.2182
Architecture: Transformer
Optimizer: Muon
Artifact Size: 4.8 MB

Training Techniques

Optimizer
Muon
weight_decay: 0.1
momentum: null
other_params: null
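Muon can be sketched as momentum SGD whose buffered gradient is approximately orthogonalized by a Newton-Schulz iteration before being applied, with weight decay decoupled from the gradient. A minimal NumPy sketch; only weight_decay=0.1 comes from this PR (the PR lists momentum as null, so `lr` and `momentum` below are assumed placeholder values):

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Quintic Newton-Schulz iteration that drives the singular values
    # of G toward 1, approximately orthogonalizing the update.
    # Coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.1):
    # One Muon update with decoupled weight decay: shrink W toward
    # zero, then step along the orthogonalized momentum buffer.
    buf = momentum * buf + grad
    W = W * (1.0 - lr * weight_decay) - lr * newton_schulz(buf)
    return W, buf
```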
Regularization
dropout
parameters: {"rate":0.1,"scope":"attention and MLP blocks"}
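The dropout entry corresponds to standard inverted dropout at rate 0.1; a minimal sketch (applying it at the listed scope, attention and MLP block outputs, is left to the caller):

```python
import numpy as np

def dropout(x, rate=0.1, training=True, rng=None):
    # Inverted dropout: zero each activation with probability `rate`
    # and scale survivors by 1/(1-rate) so the expected value of the
    # output matches the input. rate=0.1 matches the PR.
    if not training or rate == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At evaluation time (`training=False`) the input passes through unchanged, so no rescaling is needed at inference.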
Architecture
SwiGLU
Replaces squared-ReLU MLP activation with SwiGLU.
parameters: null
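A minimal sketch of the swap this entry describes, with the squared-ReLU baseline alongside for contrast (the weight shapes and the absence of biases are assumptions, not PR details):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP block: a SiLU-gated branch multiplies an ungated
    # "up" projection elementwise before the down projection.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

def squared_relu_mlp(x, W_in, W_out):
    # The baseline this PR replaces: ReLU(x W)**2, then down-project.
    return (np.maximum(x @ W_in, 0.0) ** 2) @ W_out
```

Note SwiGLU carries three weight matrices where squared-ReLU carries two, so the hidden width is usually narrowed to keep parameter count comparable.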
depth recurrence
Loops only the middle layers rather than all layers uniformly.
parameters: null
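The middle-layer loop can be sketched as a forward pass in which the prefix and suffix of the stack run once while the middle slice is reused; the loop boundaries and count below are illustrative, since the PR leaves the parameters null:

```python
def looped_forward(x, layers, loop_start, loop_end, n_loops):
    # Targeted depth recurrence: layers[loop_start:loop_end] are
    # reused n_loops times, growing effective depth without adding
    # parameters, while the early and late layers each run once.
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x
```

With six layers, a loop over layers 2–3, and n_loops=3, the effective depth is 2 + 3*2 + 2 = 10 layer applications.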
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
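The int8 and zlib stages compose as a post-training pipeline: quantize weights to int8, then deflate the raw bytes. The exact scheme below (symmetric per-tensor scale) is an assumption; the PR only names int8 with scope "all" and zlib with an unspecified level:

```python
import numpy as np
import zlib

def pack_weights(W):
    # Symmetric per-tensor int8 quantization, then zlib on the raw
    # bytes. scale maps the largest |weight| to 127.
    scale = max(float(np.abs(W).max()), 1e-12) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, W.shape

def unpack_weights(blob, scale, shape):
    # Inverse: inflate, reinterpret as int8, rescale to float32.
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

Round-trip error is bounded by half the quantization step, and zlib then shrinks the int8 buffer losslessly.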
Weight Averaging
EMA
parameters: null
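EMA weight averaging keeps a slow-moving copy of the parameters that is typically evaluated in place of the raw weights. A one-line sketch; decay=0.999 is an assumed typical value, since the PR lists the parameters as null:

```python
def ema_update(avg, params, decay=0.999):
    # Exponential moving average of model weights: each step blends
    # the running average toward the current parameters.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```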

Novel Contributions

  • SwiGLU MLP upgrade
  • 10% dropout applied to attention and MLP blocks
  • Muon weight decay regularization
  • middle-layer looping / targeted depth recurrence
  • post-training int8 + zlib artifact compression
  • EMA weight averaging, as described in the branch README