PR #979
Record: 1.1387 BPB — 11L LeakyReLU² + Early QAT@0.5 + GPTQ-lite + EMA
by 0xadvait
val_bpb
1.1387
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.6 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU(0.5) squared activation in the MLPs.
parameters: {"squared":true,"negative_slope":0.5}
MLP3x
Uses 3x MLP expansion.
parameters: {"expansion":3}
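A minimal NumPy sketch of the MLP block as recorded (3x expansion, LeakyReLU(0.5) followed by squaring). Residual path, norms, and biases are omitted, and the exact placement of the square is a plausible reading rather than confirmed:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5), then elementwise square
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

def mlp(x, w_in, w_out):
    # x: (T, d), w_in: (d, 3d), w_out: (3d, d) -- the 3x expansion
    return leaky_relu_sq(x @ w_in) @ w_out
```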
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
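Sharing 4 KV heads across 8 query heads can be sketched as below (single sequence, no causal mask or batching; the repeat-based KV sharing is the standard GQA formulation, but shapes are illustrative):

```python
import numpy as np

def gqa(q, k, v):
    # q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads
    group = q.shape[1] // k.shape[1]      # query heads per KV head (2)
    k = np.repeat(k, group, axis=1)       # expand KV heads to match q
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)         # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)
```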
U-Net skip connections
Adds U-Net style encoder-decoder skip connections.
parameters: {"encoder_layers":5,"decoder_layers":6}
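One plausible reading of the asymmetric 5-encoder/6-decoder scheme, with additive skips consumed in reverse order; which decoder layer goes without a skip is an assumption:

```python
def unet_forward(x, encoder_layers, decoder_layers):
    # Push each encoder output; decoder layers add them back in reverse.
    # With 5 encoder and 6 decoder layers, one decoder layer has no skip.
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer in decoder_layers:
        if skips:
            x = x + skips.pop()
        x = layer(x)
    return x
```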
weight tying
Ties input embeddings and output embeddings.
parameters: null
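Weight tying means a single parameter matrix serves as both the input embedding table and the output projection; a minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 256, 32
W = rng.normal(scale=0.02, size=(vocab_size, d_model))  # one shared table

def embed(token_ids):
    return W[token_ids]          # input embedding lookup

def unembed(hidden):
    return hidden @ W.T          # output head reuses the same matrix
```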
RoPE
Uses rotary positional embeddings.
parameters: null
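A self-contained RoPE sketch: each adjacent pair of channels is rotated by a position-dependent angle, so token positions are encoded as rotations (base frequency 10000 is the common default, not stated in the record):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) with even d; rotate channel pairs by position-scaled angles
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = np.outer(np.arange(T), inv_freq)        # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated, vector norms are preserved per position.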
Regularization
logit softcap
Caps output logits smoothly via tanh.
parameters: {"value":30}
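The tanh softcap with value 30 is near-identity for small logits and smoothly bounds large ones to ±30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # near-identity for |logits| << cap, saturates toward +/- cap
    return cap * np.tanh(logits / cap)
```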
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_lr_embeddings":0.035,"adamw_lr_scalars":0.025,"momentum_warmup":"0.85->0.95"}
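The record lists a Muon momentum warmup of 0.85 -> 0.95. A linear schedule is one plausible reading; the warmup length below is an illustrative value, not from the record:

```python
def muon_momentum(step, warmup_steps=300, lo=0.85, hi=0.95):
    # linearly warm momentum from 0.85 to 0.95 over warmup_steps
    # (warmup_steps=300 is an assumed, illustrative value)
    t = min(1.0, step / warmup_steps)
    return lo + t * (hi - lo)
```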
Weight Averaging
EMA
parameters: {"decay":0.997}
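EMA with decay 0.997 maintains a shadow copy of each parameter, updated per step as `shadow = 0.997 * shadow + 0.003 * current`; a minimal dict-based sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    # shadow = decay * shadow + (1 - decay) * current, per tensor
    return {k: decay * v + (1.0 - decay) * params[k]
            for k, v in ema_params.items()}
```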
Quantization
STE QAT
Quantization-aware training with straight-through-estimator rounding.
bits: 6
scope: attn/MLP weights
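A sketch of the fake-quantization step used during QAT: weights are rounded to the symmetric int6 grid in the forward pass, while in a framework the gradient would bypass the rounding (e.g. `w + (fake_quant(w) - w).detach()` in PyTorch). Per-tensor scaling is an assumption:

```python
import numpy as np

def fake_quant(w, bits=6):
    # symmetric fake quantization to the int6 grid [-31, 31]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```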
GPTQ-lite
Post-training round-to-nearest export with a per-row clip percentile search.
bits: 6
scope: attn/MLP weights
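The per-row clip percentile search (see Novel Contributions) can be sketched as picking, per weight row, the clipping percentile that minimizes reconstruction error on the int6 grid; the candidate percentiles below are illustrative:

```python
import numpy as np

def best_clip_quant(row, bits=6, percentiles=(99.0, 99.5, 99.9, 100.0)):
    # per-row search over clipping percentiles, minimizing MSE
    qmax = 2 ** (bits - 1) - 1
    best, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(row), p)
        scale = clip / qmax if clip > 0 else 1.0
        q = np.clip(np.round(row / scale), -qmax, qmax)
        err = ((q * scale - row) ** 2).sum()
        if err < best_err:
            best, best_err = q * scale, err
    return best
```

Clipping at < 100% sacrifices the outliers in a row to get a finer grid for the bulk of its weights.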
int8
bits: 8
scope: embeddings
Evaluation
sliding window eval
parameters: {"stride":64}
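Stride-64 sliding-window evaluation scores each chunk of new tokens with as much left context as the window allows; a sketch of the span layout (the window size below is illustrative, only stride=64 is from the record):

```python
def window_spans(n_tokens, window=512, stride=64):
    # (ctx_start, ctx_end, n_new): each step scores only the tokens not
    # covered by the previous window, with up to `window` of left context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```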
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
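A warmdown schedule holds the LR constant and then decays linearly to zero over the final warmdown_steps; a minimal sketch returning the LR multiplier:

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    # constant LR, then linear "warmdown" to zero over the final steps
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```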
Compression
zstd
level: 22
Novel Contributions
- Early QAT: fake quantization is enabled once the LR scale drops below 0.5, leaving ~1400 QAT steps before the end of training
- Reduced post-quantization gap from 0.28 BPB to 0.004 BPB
- 11-layer Transformer with LeakyReLU(0.5)^2 MLPs and U-Net skip connections
- GPTQ-lite per-row clip percentile search for int6 export
- Achieved 1.1387 BPB mean over 3 seeds with stride-64 sliding window evaluation