PR #657

open

Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1234

by anthony-maio
val_bpb
1.1234
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.89 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
Architecture
LeakyReLU(0.5)^2
One-line activation swap that replaces the standard relu^2 with LeakyReLU(0.5)^2, preserving gradient flow for negative inputs
parameters: null
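A minimal plain-Python sketch of the activation (the PR presumably implements this in the model's framework; the function name here is illustrative):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """LeakyReLU(slope) followed by squaring. Unlike relu^2, the
    negative branch (slope * x) keeps a nonzero gradient path."""
    y = x if x > 0 else slope * x
    return y * y
```

For example, `leaky_relu_sq(2.0)` gives `4.0` while `leaky_relu_sq(-2.0)` gives `1.0` instead of the `0.0` a plain relu^2 would produce.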
Value Residual Learning (VRL)
Layer 0's value output blended into all subsequent layers via learned sigmoid gates to combat attention concentration
parameters: {"layers":11,"initial_gate_bias":-1.5,"initial_mixing":"approx 18%"}
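The listed parameters are self-consistent: a sigmoid gate initialized at bias -1.5 opens to sigmoid(-1.5) ≈ 0.18, matching the "approx 18%" initial mixing. A hedged sketch of the per-layer blend (vector shapes and the learned-gate machinery are simplified to scalars and lists):

```python
import math

def vrl_mix(v_layer, v_first, gate_bias=-1.5):
    """Value Residual Learning: blend layer 0's value output into a
    later layer's values through a sigmoid gate. At the initial bias
    of -1.5 the gate passes ~18% of v_first."""
    g = 1.0 / (1.0 + math.exp(-gate_bias))  # sigmoid gate, learned in practice
    return [(1.0 - g) * a + g * b for a, b in zip(v_layer, v_first)]
```

In training, `gate_bias` would be a learned parameter per layer (11 gated layers per the listed parameters).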
BigramHash
Hashed bigram embedding table with 2048 buckets
parameters: {"buckets":2048}
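A sketch of the bucket lookup, assuming the usual hashed-embedding scheme: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The specific hash function is an assumption; the PR only fixes the bucket count.

```python
import hashlib

def bigram_bucket(prev_tok: int, tok: int, buckets: int = 2048) -> int:
    """Map a token bigram to one of `buckets` embedding rows.
    blake2b is an illustrative choice of hash, not the PR's."""
    key = f"{prev_tok}:{tok}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets
```

The bucket index then selects a learned embedding that is added alongside the regular token embedding.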
XSA4
Cross-Shaped Attention with 4 heads
parameters: {"heads":4}
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"dimensions":"16/64"}
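A pure-Python sketch of the partial application: only the first 16 of a 64-dim head vector are rotated, the rest pass through unchanged. The adjacent-pair rotation convention and frequency schedule are assumptions (standard RoPE defaults), not details from the PR.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of a head vector by
    position-dependent angles; leave the remaining dims untouched."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity, and dimensions 16..63 never change regardless of position.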
MLP 3×
MLP block applied three times, each using the LeakyReLU(0.5)^2 activation
parameters: {"count":3}
SmearGate
SmearGate mechanism included
parameters: null
U-Net skips
U-Net style skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_skips":5,"decoder_skips":6}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_scale_max":0.2}
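The EMA half of the scheme is a one-line update with the listed decay of 0.997; a minimal sketch (the "Tight SWA" pass with swa_scale_max 0.2 is not shown, since the PR does not spell out its update rule):

```python
def ema_update(avg, weights, decay=0.997):
    """One exponential-moving-average step over flattened weights:
    avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```

Applied after each optimizer step, the averaged copy is what gets evaluated and shipped.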
Compression
lzma
level: 6
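Since the artifact is packed with the standard-library lzma module at preset 6, the compression step amounts to (function name illustrative):

```python
import lzma

def pack_artifact(raw: bytes, level: int = 6) -> bytes:
    """Compress serialized (quantized) weights with stdlib lzma at
    preset level 6, per the PR's compression setting."""
    return lzma.compress(raw, preset=level)
```

Unpacking is the symmetric `lzma.decompress(blob)`; the 15.89 MB artifact size is measured on the compressed output.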
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmdown_steps":3500}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(i+1)"}
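The listed rule scales each layer's LayerNorm output by 1/sqrt(i+1), damping deeper layers' residual contributions; as a tiny sketch:

```python
import math

def ln_scale(layer_idx: int) -> float:
    """Layerwise LayerNorm output scale 1/sqrt(i+1): layer 0 is
    unscaled, layer 3 is halved, and so on."""
    return 1.0 / math.sqrt(layer_idx + 1)
```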
Initialization
OrthoInit
Orthogonal initialization used
Other
other
Late Quantization Aware Training (QAT) with STE at threshold 0.15
parameters: null
other
FlashAttention 3 Hopper native kernels used
parameters: null

Novel Contributions

  • LeakyReLU(0.5)^2 activation replacing standard relu^2 to preserve negative gradient flow and improve BPB by ~0.002
  • Value Residual Learning (VRL) blending layer 0's value output into all subsequent layers via learned sigmoid gates to combat attention concentration
  • Switching compression from zstd-22 to stdlib lzma, achieving 2-5% tighter compression on quantized weights and freeing capacity for a larger MLP and BigramHash under the 16 MB artifact limit