PR #1573

open

Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337)

by shivangbaveja
val_bpb: 1.1464
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Architecture
depth recurrence
Recycled-core depth: selected physical layers are replayed in the forward pass, yielding more virtual layers than physical layers.
parameters: {"layers":12,"virtual_layers":14,"replayed_layers":[3,4]}
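A minimal sketch of how the replay schedule could expand 12 physical layers into 14 virtual forward passes, given the listed parameters (layers 3 and 4 replayed). The function name and the choice to replay each layer immediately after its first pass are assumptions; the PR does not specify the replay order.

```python
def layer_schedule(layers=12, replayed=(3, 4)):
    """Map virtual layer positions to physical layer indices.

    Each physical layer runs once; layers in `replayed` run a second
    time (here, immediately after their first pass - an assumption).
    """
    schedule = []
    for i in range(layers):
        schedule.append(i)
        if i in replayed:  # replay this physical layer's weights again
            schedule.append(i)
    return schedule
```

With the defaults this produces 14 entries over 12 distinct layer indices, matching the "14 virtual layers from 12 physical layers" claim.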
LeakyReLU
LeakyReLU(0.5)^2 MLP activation.
parameters: {"negative_slope":0.5}
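The activation as described, elementwise: LeakyReLU with negative slope 0.5, then squared. Note the square makes the output nonnegative even for negative inputs, while the leaky slope keeps gradient flowing there (a sketch of the stated formula, not the PR's code):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring, elementwise."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```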
Gated Attention
Per-head sigmoid gate on attention output.
parameters: null
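A sketch of the per-head sigmoid gate: each head's output is scaled by sigmoid of a gate logit before the output projection. In practice the gate logits would be learned (and possibly input-dependent); the scalar-per-head form here is an assumption.

```python
import math

def gated_heads(head_outputs, gate_logits):
    """Scale head h's output vector by sigmoid(gate_logits[h])."""
    gated = []
    for head, g in zip(head_outputs, gate_logits):
        s = 1.0 / (1.0 + math.exp(-g))  # sigmoid gate in (0, 1)
        gated.append([s * v for v in head])
    return gated
```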
Value Residual
First-layer values are blended into subsequent layers.
parameters: null
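The value-residual blend, sketched: the first layer's value vectors are mixed into each later layer's values. The mixing weight is typically learned per layer; the fixed 0.5 default below is illustrative only.

```python
def value_residual(v_layer, v_first, lam=0.5):
    """Blend first-layer values into the current layer's values:
    v = lam * v_first + (1 - lam) * v_layer (lam usually learned)."""
    return [lam * a + (1 - lam) * b for a, b in zip(v_first, v_layer)]
```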
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
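A sketch of partial RoPE on one head vector: only the first 16 of 64 dimensions are rotated; the rest pass through unchanged. The adjacent-pair convention and the frequency base are assumptions, not taken from the PR.

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of `q` by position-dependent
    angles; dims beyond `rot_dims` are left untouched."""
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s      # 2D rotation of each dim pair
        out[i + 1] = x * s + y * c
    return out
```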
LN scale
LayerNorm scale uses 1/sqrt(layer+1).
parameters: null
U-Net skip connections
Learned skip weights provide U-Net style skip connections.
parameters: null
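The U-Net-style skip as a sketch: an early layer's activations are added into a mirrored late layer, scaled by a learned scalar. The additive form and the mirrored pairing are assumptions about the design.

```python
def unet_skip_mix(x_late, x_early, skip_weight):
    """Add an early layer's activations into a late layer's stream,
    scaled by a learned per-connection weight (scalar here)."""
    return [h + skip_weight * e for h, e in zip(x_late, x_early)]
```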
BigramHash
Bigram hash embedding with 2048 buckets.
parameters: {"buckets":2048}
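A sketch of the bigram hash: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The multiplier and mixing scheme below are illustrative assumptions; only the bucket count comes from the PR.

```python
def bigram_bucket(prev_token, token, buckets=2048, mult=1000003):
    """Hash a token bigram to a bucket index in [0, buckets)."""
    return (prev_token * mult + token) % buckets
```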
SmearGate
Per-dimension sigmoid gating.
parameters: null
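One plausible reading of SmearGate, sketched: the previous token's embedding is "smeared" into the current one with a per-dimension sigmoid gate. The exact blend form is an assumption; only "per-dimension sigmoid gating" is stated.

```python
import math

def smear(x_t, x_prev, gate_logits):
    """Blend the previous token's embedding into the current one,
    dimension-wise, gated by sigmoid of learned logits (assumed form)."""
    out = []
    for cur, prev, g in zip(x_t, x_prev, gate_logits):
        s = 1.0 / (1.0 + math.exp(-g))
        out.append(cur + s * prev)
    return out
```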
VE128
Value embeddings used in later layers.
parameters: {"dimensions":128,"layers":[10,11]}
Quantization
GPTQ-lite
bits: 5
scope: MLP and attention weights
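The int5 part, sketched as symmetric per-tensor quantization to the signed 5-bit range [-15, 15]. This is a minimal illustration of the bit width; GPTQ-lite's error-compensating rounding and any per-group scales are not reproduced here.

```python
def quantize_int5(weights):
    """Symmetric quantization to signed 5-bit integers in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1                      # 15
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int5 codes."""
    return [v * scale for v in q]
```

Round-trip error is bounded by half a quantization step (scale / 2) per weight.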
Compression
lzma
level: 9
Weight Averaging
SWA
parameters: {"every":50,"snapshots":13}
EMA
parameters: {"decay":0.997}
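The two averaging schemes, sketched with the parameters listed above (EMA decay 0.997; SWA snapshots every 50 steps). How the PR combines them - e.g. whether SWA averages EMA shadows or raw weights - is not stated, so the two are shown independently.

```python
class EMA:
    """Exponential moving average of a flat parameter list."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]

def swa_average(snapshots):
    """Uniform average of parameter snapshots (taken every 50 steps)."""
    n = len(snapshots)
    return [sum(vals) / n for vals in zip(*snapshots)]
```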
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025}
Adam
weight_decay: 0.04
momentum: null
other_params: {"scope":"embeddings/scalars"}
Evaluation
sliding window eval
parameters: {"stride":64}
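A sketch of the stride-64 sliding-window evaluation: window start offsets advance by the stride, so each position is scored with long left context. Scoring only the final `stride` tokens of each window is a common convention, assumed here rather than stated in the PR.

```python
def window_starts(seq_len, window, stride=64):
    """Start offsets of evaluation windows over a sequence of
    `seq_len` tokens, each window `window` tokens wide."""
    return list(range(0, max(seq_len - window, 0) + 1, stride))
```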
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
logit softcap
parameters: {"value":30}
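The softcap with the listed value 30, sketched: logits are squashed smoothly into (-cap, cap) via a scaled tanh, which is approximately the identity for small logits.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```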
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}

Novel Contributions

  • Recycled-core depth with replayed layers to create 14 virtual layers from 12 physical layers
  • Int5 quantization of both MLP and attention weights to fit under the 16 MB limit
  • Gated attention with value residual blending
  • LeakyReLU(0.5)^2 MLP activation
  • Combination of SWA and EMA with Muon optimization
  • Partial RoPE, XSA, BigramHash, SmearGate, and value embeddings in later layers