PR #175 (open)

Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1229 (3-seed mean)

by anthony-maio
val_bpb: 1.1229
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.89 MB

Training Techniques

Architecture

MLP3x: Expanded MLP to 3x width with LeakyReLU(0.5)^2 activation in place of the standard ReLU^2.
parameters: {"expansion":3,"hidden_dim":1536,"negative_slope":0.5}
VRL: Value Residual Learning. The layer-0 value output is blended into subsequent attention layers through learned sigmoid gates.
parameters: {"layers":11,"gate_init":-1.5,"initial_mixing":0.18}
BigramHash: BigramHash embeddings to improve token representation capacity.
parameters: {"dimensions":2048}
Partial RoPE: Rotary positional embeddings applied to only a subset of each head's dimensions.
parameters: {"train_length":null,"eval_length":null}
XSA: Uses the XSA4 attention variant.
parameters: null

Weight Averaging

EMA: exponential moving average of model weights.
parameters: {"decay":0.997}
SWA: stochastic weight averaging.
parameters: {"tight":true}
Quantization

GPTQ-lite: bits 6, scope all.
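
"GPTQ-lite" is not defined in the PR. The sketch below shows only the symmetric per-channel 6-bit round-to-nearest core such a scheme would quantize into, without GPTQ's Hessian-based error compensation; per-output-channel scales are an assumption.

```python
import torch

def quantize_6bit(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # w: (out_features, in_features). Signed 6-bit range is [-32, 31].
    qmax = 2 ** (6 - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale
```
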
Compression

lzma: level 6.
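
The compression call itself is just the standard library at the listed preset; the helper name is hypothetical. lzma generally compresses weight bytes harder than zstd at the cost of much slower (de)compression, which is the headroom trade noted in the contributions below.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # preset=6 matches the listed level.
    return lzma.compress(raw, preset=6)
```
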
Initialization

OrthoInit: Orthogonal initialization for model weights.
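
A minimal sketch; restricting the init to nn.Linear weights and keeping the default gain of 1.0 are assumptions, since the PR only says the weights are orthogonally initialized.

```python
import torch.nn as nn

def ortho_init(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)  # default gain 1.0
```
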
LR Schedule

warmdown
parameters: {"warmdown_steps":3500}
Regularization

layerwise LN scale
parameters: {"scale":"1/sqrt(i+1)"}
Other

Late QAT: quantization-aware training enabled late in the run, with a straight-through estimator (STE) and threshold 0.15.
parameters: {"threshold":0.15}

Novel Contributions

  • LeakyReLU(0.5)^2 activation swap to preserve negative gradient flow and improve BPB
  • Value Residual Learning (VRL) with sigmoid-gated mixing of layer 0 values into later attention layers
  • Switch from zstd to lzma compression to recover artifact headroom
  • Restoring MLP 3x expansion and BigramHash 2048 capacity within the 16MB limit