PR #638
openRecord: 11L XSA-all + LeakyReLU(0.5)² + VR + GA (val_bpb=1.1164, pending 3-seed)
by Asukabot0
val_bpb
1.1164
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,941,860 bytes
Training Techniques
Quantization
int6 per-row
bits: 6
scope: all
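A per-row int6 scheme can be sketched as follows; only bits=6 and the per-row scope come from the record, so the rounding rule and float scale format are assumptions:

```python
import numpy as np

def quantize_int6_per_row(w):
    # Signed int6 range is [-32, 31]; use a per-row scale so each row's
    # largest magnitude maps to the edge of the range.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 31.0)
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_per_row(q, scale):
    return q.astype(np.float32) * scale
```

Quantizing per row rather than per tensor keeps the rounding error proportional to each row's own magnitude.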
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers; each token is excluded from attending to itself, removing the self-position bias
parameters: {"layers":11}
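The core of XSA can be sketched as a causal attention mask that also excludes the diagonal; how the first token (which then has nothing to attend to) is handled is an assumption left to the real implementation:

```python
import numpy as np

def xsa_mask(T):
    # Standard causal mask, plus the diagonal: position i may attend to
    # j < i only, never to itself. Row 0 ends up fully masked, so the
    # real model needs a fallback there (e.g. an attention sink).
    return np.tril(np.ones((T, T), dtype=bool), k=-1)

def xsa_scores(q, k):
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    return np.where(xsa_mask(T), scores, -np.inf)
```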
LeakyReLU(0.5)^2
LeakyReLU with negative slope 0.5, squared, replacing ReLU^2; preserves gradient flow for negative inputs
parameters: {"negative_slope":0.5}
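As a drop-in for ReLU², the activation can be written directly; whether the PR keeps the sign after squaring is an assumption, and the plain square is shown:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring. Unlike ReLU^2, whose gradient
    # is exactly zero for x < 0, the derivative here is 2 * slope^2 * x
    # on the negative side, so negative inputs still receive gradient.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```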
Value Residual
Layer 0 value output mixed into subsequent layers via learned sigmoid gates
parameters: null
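Value Residual can be sketched as mixing the cached layer-0 values into each later layer through a learned sigmoid gate; a scalar gate per layer is shown, and the gate's actual granularity is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value_residual(v, v0, gate_logit):
    # v: this layer's value projection, v0: layer-0 values (same shape),
    # gate_logit: learned parameter; g in (0, 1) interpolates between them.
    g = sigmoid(gate_logit)
    return g * v + (1.0 - g) * v0
```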
Gated Attention
Per-head sigmoid gates on attention output
parameters: null
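Per-head output gating is a small addition on top of standard attention. Input-independent scalar gates are shown here; the PR may instead condition the gates on the hidden state, so either form is an assumption:

```python
import numpy as np

def gate_heads(attn_out, gate_logits):
    # attn_out: (heads, T, d_head); gate_logits: (heads,) learned params.
    # Each head's output is scaled by its sigmoid gate before the output
    # projection, letting the model damp unhelpful heads.
    g = 1.0 / (1.0 + np.exp(-gate_logits))
    return attn_out * g[:, None, None]
```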
SmearGate
Additional gating mechanism
parameters: null
BigramHash
Bigram hashing of (previous, current) token pairs into 4096 buckets
parameters: {"buckets":4096}
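A bigram hash of this kind maps each (previous token, current token) pair to one of 4096 buckets indexing an auxiliary embedding table; the mixing constant below is an illustrative assumption, as only buckets=4096 comes from the record:

```python
def bigram_hash(prev_id, cur_id, buckets=4096):
    # Multiply-xor mix of the pair, reduced modulo the bucket count.
    return ((prev_id * 1000003) ^ cur_id) % buckets
```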
Partial RoPE
Rotary positional embeddings applied to only 16 of the 64 head dimensions
parameters: {"train_dims":16,"total_dims":64}
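Partial RoPE rotates only a prefix of the head dimension and passes the rest through unchanged. The 16/64 split is from the record; the frequency base of 10000 is the standard RoPE choice and an assumption here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims of the head dimension (16 of 64);
    # the remaining dims are left untouched.
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```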
U-Net skip connections
Skip connections inspired by U-Net architecture
parameters: null
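One common way to wire U-Net-style skips in a decoder stack is to pair each early layer with its mirror-image late layer; the record does not spell the wiring out, so this pairing is an assumption:

```python
def unet_skip_pairs(n_layers=11):
    # Pair layer i with layer n-1-i: the early layer's output is saved and
    # added back (typically with a learned weight) at the late layer's
    # input. The middle layer of an odd-depth stack gets no skip.
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]
```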
Weight Averaging
EMA
parameters: {"decay":0.997}
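With decay 0.997 the per-step EMA update is simply:

```python
def ema_update(avg, params, decay=0.997):
    # avg <- decay * avg + (1 - decay) * params, applied per tensor.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```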
Compression
zstd
level: 21
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
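The other_params indicate momentum is warmed from 0.92 to its final 0.99 over the first 1500 steps; linear interpolation between the two is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linear warmup of Muon's momentum coefficient, then hold at `end`.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```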
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
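The usual "warmdown" shape holds the learning rate flat and then decays linearly over the final 3500 steps; the flat phase and zero endpoint are assumptions, as only warmdown_steps comes from the record:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    # Constant LR until the last warmdown_steps, then linear decay to zero.
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```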
Evaluation
sliding window eval
parameters: {"stride":64}
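Sliding-window eval with stride 64 scores each token exactly once while giving it as much left context as the window allows; the context length of 1024 below is an assumption:

```python
def sliding_window_spans(n_tokens, context=1024, stride=64):
    # Each span (window_start, score_from, score_to) scores `stride` new
    # tokens with up to `context` tokens of left context; the scored
    # ranges tile the sequence without overlap.
    spans, pos = [], 0
    while pos < n_tokens:
        window_start = max(0, pos + stride - context)
        spans.append((window_start, pos, min(pos + stride, n_tokens)))
        pos += stride
    return spans
```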
Initialization
OrthoInit
Orthogonal initialization
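Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix; any gain factor applied on top is an assumption:

```python
import numpy as np

def orthogonal_init(rows, cols, rng=None):
    # QR of a Gaussian matrix yields orthonormal columns; transpose when
    # the requested shape is wide rather than tall.
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)
    return q if rows >= cols else q.T
```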
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- Applying Exclusive Self-Attention (XSA) on all 11 layers instead of just last 4, improving BPB by 0.006
- Replacing ReLU^2 with a LeakyReLU(0.5)^2 activation to preserve gradient flow for negative inputs, at zero overhead, for a 0.003 BPB improvement
- Introducing Value Residual (VR) where layer 0 value output is mixed into subsequent layers via learned sigmoid gates, improving BPB by 0.002
- Using Gated Attention (GA) with per-head sigmoid gates on attention output
- Combining SmearGate, BigramHash(4096), Partial RoPE (16/64 dims), and U-Net skip connections for architectural improvements
- Employing int6 per-row quantization combined with zstd-21 compression to fit artifact under 16MB
- Using Muon optimizer with momentum warmup and warmdown schedule of 3500 steps
- Demonstrating a non-TTT submission within 0.001 BPB of current non-TTT SOTA