PR #657

open

Record: 11L LeakyReLU² + VRL + lzma — val_bpb 1.1234

by anthony-maio
val_bpb
1.1234
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.89 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
Architecture
LeakyReLU(0.5)^2
One-line activation swap that replaces the standard relu^2 with LeakyReLU(0.5)^2, preserving gradient flow for negative inputs
parameters: null
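A minimal plain-Python sketch of the activation (the PR presumably implements this in the model's framework; the function name here is illustrative):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """LeakyReLU(slope) followed by squaring. Unlike relu^2, the
    negative branch (slope * x) keeps a nonzero gradient path."""
    y = x if x > 0 else slope * x
    return y * y
```

For example, `leaky_relu_sq(2.0)` gives `4.0` while `leaky_relu_sq(-2.0)` gives `1.0` instead of the `0.0` a plain relu^2 would produce.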
Value Residual Learning (VRL)
Layer 0's value output blended into all subsequent layers via learned sigmoid gates to combat attention concentration
parameters: {"layers":11,"initial_gate_bias":-1.5,"initial_mixing":"approx 18%"}
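The listed parameters are self-consistent: a sigmoid gate initialized at bias -1.5 opens to sigmoid(-1.5) ≈ 0.18, matching the "approx 18%" initial mixing. A hedged sketch of the per-layer blend (vector shapes and the learned-gate machinery are simplified to scalars and lists):

```python
import math

def vrl_mix(v_layer, v_first, gate_bias=-1.5):
    """Value Residual Learning: blend layer 0's value output into a
    later layer's values through a sigmoid gate. At the initial bias
    of -1.5 the gate passes ~18% of v_first."""
    g = 1.0 / (1.0 + math.exp(-gate_bias))  # sigmoid gate, learned in practice
    return [(1.0 - g) * a + g * b for a, b in zip(v_layer, v_first)]
```

In training, `gate_bias` would be a learned parameter per layer (11 gated layers per the listed parameters).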
BigramHash
Hashed bigram embedding table with 2048 buckets
parameters: {"buckets":2048}
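A sketch of the bucket lookup, assuming the usual hashed-embedding scheme: each (previous token, current token) pair is hashed into one of 2048 embedding rows. The specific hash function is an assumption; the PR only fixes the bucket count.

```python
import hashlib

def bigram_bucket(prev_tok: int, tok: int, buckets: int = 2048) -> int:
    """Map a token bigram to one of `buckets` embedding rows.
    blake2b is an illustrative choice of hash, not the PR's."""
    key = f"{prev_tok}:{tok}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets
```

The bucket index then selects a learned embedding that is added alongside the regular token embedding.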
XSA4
Cross-Shaped Attention with 4 heads
parameters: {"heads":4}
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"dimensions":"16/64"}
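A pure-Python sketch of the partial application: only the first 16 of a 64-dim head vector are rotated, the rest pass through unchanged. The adjacent-pair rotation convention and frequency schedule are assumptions (standard RoPE defaults), not details from the PR.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of a head vector by
    position-dependent angles; leave the remaining dims untouched."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity, and dimensions 16..63 never change regardless of position.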
MLP 3×
MLP block applied three times, each using the LeakyReLU(0.5)^2 activation
parameters: {"count":3}
SmearGate
SmearGate mechanism included
parameters: null
U-Net skips
U-Net style skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_skips":5,"decoder_skips":6}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_scale_max":0.2}
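The EMA half of the scheme is a one-line update with the listed decay of 0.997; a minimal sketch (the "Tight SWA" pass with swa_scale_max 0.2 is not shown, since the PR does not spell out its update rule):

```python
def ema_update(avg, weights, decay=0.997):
    """One exponential-moving-average step over flattened weights:
    avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]
```

Applied after each optimizer step, the averaged copy is what gets evaluated and shipped.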
Compression
lzma
level: 6
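Since the artifact is packed with the standard-library lzma module at preset 6, the compression step amounts to (function name illustrative):

```python
import lzma

def pack_artifact(raw: bytes, level: int = 6) -> bytes:
    """Compress serialized (quantized) weights with stdlib lzma at
    preset level 6, per the PR's compression setting."""
    return lzma.compress(raw, preset=level)
```

Unpacking is the symmetric `lzma.decompress(blob)`; the 15.89 MB artifact size is measured on the compressed output.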
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"warmdown_steps":3500}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(i+1)"}
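The listed rule scales each layer's LayerNorm output by 1/sqrt(i+1), damping deeper layers' residual contributions; as a tiny sketch:

```python
import math

def ln_scale(layer_idx: int) -> float:
    """Layerwise LayerNorm output scale 1/sqrt(i+1): layer 0 is
    unscaled, layer 3 is halved, and so on."""
    return 1.0 / math.sqrt(layer_idx + 1)
```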
Initialization
OrthoInit
Orthogonal initialization used
Other
other
Late Quantization Aware Training (QAT) with STE at threshold 0.15
parameters: null
other
FlashAttention 3 Hopper native kernels used
parameters: null

Novel Contributions

  • LeakyReLU(0.5)^2 activation replacing standard relu^2 to preserve negative gradient flow and improve BPB by ~0.002
  • Value Residual Learning (VRL) blending layer 0's value output into all subsequent layers via learned sigmoid gates to combat attention concentration
  • Switching compression from zstd-22 to stdlib lzma, achieving 2-5% tighter compression on quantized weights and freeing capacity for a larger MLP and BigramHash under the 16 MB artifact limit