PR #535
openRecord: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204)
by raahilshah
val_bpb
1.1204
Architecture
Transformer
Optimizer
Muon (matrices) and AdamW (embeddings and scalars)
Artifact Size
15.85 MB
Training Techniques
Quantization
Full GPTQ
bits: 6
scope: all weights except small tensors and tok_emb.weight (fp16)
QAT-export alignment
bits: 6
scope: per-row clipping with quantile(0.9995) in STE and export quantizer
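A minimal sketch of what a shared quantizer with per-row quantile(0.9995) clipping could look like (function name and the epsilon guard are illustrative, not the PR's actual code); the same routine would serve as the STE fake-quantizer in training and the export quantizer, which is the alignment being claimed:

```python
import numpy as np

def rowwise_quantize(w, bits=6, q=0.9995):
    # Per-row clip threshold at the 0.9995 quantile of |w|, then
    # symmetric uniform quantization; dequantized result is returned.
    clip = np.quantile(np.abs(w), q, axis=1, keepdims=True)
    levels = 2 ** (bits - 1) - 1              # 31 levels per sign for 6 bits
    scale = np.maximum(clip, 1e-8) / levels   # guard against all-zero rows
    wc = np.clip(w, -clip, clip)
    return np.round(wc / scale) * scale
```

During QAT the forward pass would use `rowwise_quantize(w)` while the backward pass passes gradients straight through (the STE), so training sees exactly the weights the export path produces.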
Architecture
LeakyReLU(0.5)² activation
Replaces relu² in the MLP to prevent dead neurons and double the effective MLP capacity
parameters: null
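Read literally, the activation squares a LeakyReLU with negative slope 0.5 (a sketch; the PR's exact form may differ). Unlike relu², negative pre-activations still produce output and gradient, which is the stated dead-neuron fix:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(slope) followed by squaring: a negative input x yields
    # (slope * x)^2 instead of the hard zero of relu(x)^2, so the unit
    # keeps a gradient signal on both sides of zero.
    y = x if x >= 0 else slope * x
    return y * y
```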
XSA4
Exclusive Self-Attention on the last 4 layers
parameters: {"layers":4}
Partial RoPE
Partial Rotary Positional Embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
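A sketch of partial RoPE on 16 of 64 head dimensions with the common NTK-aware base rescaling (base' = base · α^(d/(d−2))); the function name, base, and α are assumptions, not values from the PR:

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0, ntk_alpha=1.0):
    # Rotate only the first `rope_dims` channels of each head (16/64 here);
    # the remaining channels pass through position-independent.
    half = rope_dims // 2
    scaled_base = base * ntk_alpha ** (rope_dims / (rope_dims - 2))
    freqs = scaled_base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```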
LN Scale
LayerNorm scale factor 1/sqrt(layer_idx+1)
parameters: null
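The per-layer scale is simple enough to state directly; a plausible reading is that it damps each layer's contribution to the residual stream so variance stays roughly constant with depth:

```python
import math

def ln_scale(layer_idx):
    # Scale factor 1/sqrt(layer_idx + 1): layer 0 contributes at full
    # strength, deeper layers progressively less.
    return 1.0 / math.sqrt(layer_idx + 1)
```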
SmearGate
Temporal gating mechanism
parameters: null
BigramHash
Bigram hashing with 2048 buckets and 128-dim embedding
parameters: {"buckets":2048,"dimensions":128}
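A sketch of the bucket lookup (the mixing constants are illustrative assumptions; the PR's hash may differ): each (previous, current) token pair hashes into one of 2048 buckets, and each bucket indexes a learned 128-dim embedding that augments the token embedding.

```python
def bigram_bucket(prev_token, token, buckets=2048):
    # Hash the ordered bigram (prev_token, token) to a bucket id in
    # [0, buckets); the multiplier and xor-shift are illustrative.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```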
U-Net skips
U-Net style skip connections with 5 encoder and 6 decoder skips
parameters: {"encoder_skips":5,"decoder_skips":6}
EMA
Exponential Moving Average with decay 0.997
parameters: {"decay":0.997}
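The EMA update with decay 0.997 is the standard shadow-weight rule; at this decay the effective averaging window is roughly 1/(1 − 0.997) ≈ 333 steps:

```python
def ema_update(ema_params, params, decay=0.997):
    # Shadow weights: ema <- decay * ema + (1 - decay) * current.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```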
Weight Averaging
Tight SWA
parameters: {"frequency_steps":50,"scale_threshold":0.2}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"scope":"matrices"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025,"scope":"embeddings and scalars"}
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
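A common reading of "warmdown" with warmdown_steps=3500 is a constant learning rate followed by a linear decay to zero over the final 3500 steps (a sketch under that assumption; the PR may shape the tail differently):

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    # Multiplier on the base LR: 1.0 until the warmdown begins, then a
    # linear ramp down to 0.0 at the final step.
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```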
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
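Sliding-window evaluation with stride 64 typically advances overlapping windows by the stride and scores only each window's final tokens, so every scored token sees near-full left context. A sketch of the window placement (the window length of 1024 is an assumption, only the stride comes from the record):

```python
def window_starts(n_tokens, window=1024, stride=64):
    # Start offsets for overlapping eval windows; the tail window is
    # pinned so the last tokens of the sequence are still covered.
    starts = list(range(0, max(n_tokens - window, 0) + 1, stride))
    if starts[-1] + window < n_tokens:
        starts.append(n_tokens - window)
    return starts
```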
Test-Time Training
none
parameters: null
Initialization
Orthogonal init
Novel Contributions
- LeakyReLU(0.5)² activation replacing relu² to prevent dead neurons and double the effective MLP capacity
- Full GPTQ quantization with Hessian calibration, reducing the quantization gap by 31%
- QAT-export alignment using quantile(0.9995) clipping so the STE fake-quantizer matches the export quantizer