PR #638
openRecord: 11L XSA-all + LeakyReLU(0.5)² + VR + GA (val_bpb=1.1164, pending 3-seed)
by Asukabot0
val_bpb
1.1164
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,941,860 bytes
Training Techniques
Quantization
int6 per-row
bits: 6
scope: all
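A per-row int6 scheme can be sketched as follows; only bits=6 and the per-row scope come from the record, so the rounding rule and float scale format are assumptions:

```python
import numpy as np

def quantize_int6_per_row(w):
    # Signed int6 range is [-32, 31]; use a per-row scale so each row's
    # largest magnitude maps to the edge of the range.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 31.0)
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_per_row(q, scale):
    return q.astype(np.float32) * scale
```

Quantizing per row rather than per tensor keeps the rounding error proportional to each row's own magnitude.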
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers; each token is excluded from attending to itself, removing the self-position bias
parameters: {"layers":11}
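The core of XSA can be sketched as a causal attention mask that also excludes the diagonal; how the first token (which then has nothing to attend to) is handled is an assumption left to the real implementation:

```python
import numpy as np

def xsa_mask(T):
    # Standard causal mask, plus the diagonal: position i may attend to
    # j < i only, never to itself. Row 0 ends up fully masked, so the
    # real model needs a fallback there (e.g. an attention sink).
    return np.tril(np.ones((T, T), dtype=bool), k=-1)

def xsa_scores(q, k):
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    return np.where(xsa_mask(T), scores, -np.inf)
```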
LeakyReLU(0.5)^2
LeakyReLU with negative slope 0.5, squared, replacing ReLU^2; preserves gradient flow for negative inputs
parameters: {"negative_slope":0.5}
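As a drop-in for ReLU², the activation can be written directly; whether the PR keeps the sign after squaring is an assumption, and the plain square is shown:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring. Unlike ReLU^2, whose gradient
    # is exactly zero for x < 0, the derivative here is 2 * slope^2 * x
    # on the negative side, so negative inputs still receive gradient.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```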
Value Residual
Layer 0 value output mixed into subsequent layers via learned sigmoid gates
parameters: null
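Value Residual can be sketched as mixing the cached layer-0 values into each later layer through a learned sigmoid gate; a scalar gate per layer is shown, and the gate's actual granularity is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value_residual(v, v0, gate_logit):
    # v: this layer's value projection, v0: layer-0 values (same shape),
    # gate_logit: learned parameter; g in (0, 1) interpolates between them.
    g = sigmoid(gate_logit)
    return g * v + (1.0 - g) * v0
```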
Gated Attention
Per-head sigmoid gates on attention output
parameters: null
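Per-head output gating is a small addition on top of standard attention. Input-independent scalar gates are shown here; the PR may instead condition the gates on the hidden state, so either form is an assumption:

```python
import numpy as np

def gate_heads(attn_out, gate_logits):
    # attn_out: (heads, T, d_head); gate_logits: (heads,) learned params.
    # Each head's output is scaled by its sigmoid gate before the output
    # projection, letting the model damp unhelpful heads.
    g = 1.0 / (1.0 + np.exp(-gate_logits))
    return attn_out * g[:, None, None]
```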
SmearGate
Additional gating mechanism
parameters: null
BigramHash
Bigram hashing of (previous, current) token pairs into 4096 buckets
parameters: {"buckets":4096}
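A bigram hash of this kind maps each (previous token, current token) pair to one of 4096 buckets indexing an auxiliary embedding table; the mixing constant below is an illustrative assumption, as only buckets=4096 comes from the record:

```python
def bigram_hash(prev_id, cur_id, buckets=4096):
    # Multiply-xor mix of the pair, reduced modulo the bucket count.
    return ((prev_id * 1000003) ^ cur_id) % buckets
```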
Partial RoPE
Rotary positional embeddings applied to only 16 of the 64 head dimensions
parameters: {"train_dims":16,"total_dims":64}
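Partial RoPE rotates only a prefix of the head dimension and passes the rest through unchanged. The 16/64 split is from the record; the frequency base of 10000 is the standard RoPE choice and an assumption here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims of the head dimension (16 of 64);
    # the remaining dims are left untouched.
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```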
U-Net skip connections
Skip connections inspired by U-Net architecture
parameters: null
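One common way to wire U-Net-style skips in a decoder stack is to pair each early layer with its mirror-image late layer; the record does not spell the wiring out, so this pairing is an assumption:

```python
def unet_skip_pairs(n_layers=11):
    # Pair layer i with layer n-1-i: the early layer's output is saved and
    # added back (typically with a learned weight) at the late layer's
    # input. The middle layer of an odd-depth stack gets no skip.
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]
```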
Weight Averaging
EMA
parameters: {"decay":0.997}
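With decay 0.997 the per-step EMA update is simply:

```python
def ema_update(avg, params, decay=0.997):
    # avg <- decay * avg + (1 - decay) * params, applied per tensor.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```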
Compression
zstd
level: 21
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
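The other_params indicate momentum is warmed from 0.92 to its final 0.99 over the first 1500 steps; linear interpolation between the two is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linear warmup of Muon's momentum coefficient, then hold at `end`.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```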
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
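The usual "warmdown" shape holds the learning rate flat and then decays linearly over the final 3500 steps; the flat phase and zero endpoint are assumptions, as only warmdown_steps comes from the record:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    # Constant LR until the last warmdown_steps, then linear decay to zero.
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```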
Evaluation
sliding window eval
parameters: {"stride":64}
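Sliding-window eval with stride 64 scores each token exactly once while giving it as much left context as the window allows; the context length of 1024 below is an assumption:

```python
def sliding_window_spans(n_tokens, context=1024, stride=64):
    # Each span (window_start, score_from, score_to) scores `stride` new
    # tokens with up to `context` tokens of left context; the scored
    # ranges tile the sequence without overlap.
    spans, pos = [], 0
    while pos < n_tokens:
        window_start = max(0, pos + stride - context)
        spans.append((window_start, pos, min(pos + stride, n_tokens)))
        pos += stride
    return spans
```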
Initialization
OrthoInit
Orthogonal initialization
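Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix; any gain factor applied on top is an assumption:

```python
import numpy as np

def orthogonal_init(rows, cols, rng=None):
    # QR of a Gaussian matrix yields orthonormal columns; transpose when
    # the requested shape is wide rather than tall.
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)
    return q if rows >= cols else q.T
```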
Regularization
layerwise LN scale
parameters: null
Novel Contributions
- Applying Exclusive Self-Attention (XSA) on all 11 layers instead of just last 4, improving BPB by 0.006
- Replacing ReLU^2 with a LeakyReLU(0.5)^2 activation to preserve gradient flow for negative inputs, at zero overhead, for a 0.003 BPB improvement
- Introducing Value Residual (VR) where layer 0 value output is mixed into subsequent layers via learned sigmoid gates, improving BPB by 0.002
- Using Gated Attention (GA) with per-head sigmoid gates on attention output
- Combining SmearGate, BigramHash(4096), Partial RoPE (16/64 dims), and U-Net skip connections for architectural improvements
- Employing int6 per-row quantization combined with zstd-21 compression to fit artifact under 16MB
- Using Muon optimizer with momentum warmup and warmdown schedule of 3500 steps
- Demonstrating a non-TTT submission within 0.001 BPB of current non-TTT SOTA