PR #1263 (open)

Record: 11L LeakyReLU² + XSA-all + QK-Gain 4.0 + Full GPTQ + SLOT — val_bpb 0.9354 (3-seed mean)

val_bpb: 0.9354
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.8 MB

Training Techniques

Architecture
  • LeakyReLU: LeakyReLU(0.5) squared MLP activation with 3x expansion; parameters: {"negative_slope":0.5,"squared":true,"mlp_expansion":3}
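A minimal sketch of the per-element activation, assuming the conventional definition of a squared LeakyReLU (function names here are illustrative, not from the PR):

```python
def leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    return x if x >= 0.0 else negative_slope * x

def act(x: float) -> float:
    """LeakyReLU(0.5)^2: squaring keeps outputs non-negative while the
    0.5 slope preserves gradient signal for negative inputs."""
    y = leaky_relu(x)
    return y * y

print(act(2.0))   # 4.0
print(act(-2.0))  # (0.5 * -2.0)^2 = 1.0
```

Inside the MLP block this would sit between the 3x-expansion up-projection and the down-projection.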
  • SmearGate: embedding augmentation using SmearGate; parameters: null
  • BigramHash: embedding augmentation using BigramHash; parameters: null
  • XSA: cross-sequence attention applied on all 11 layers; parameters: {"layers":11}
  • GQA: grouped query attention with 8 query heads and 4 KV heads; parameters: {"heads":8,"kv_heads":4}
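With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A sketch of the standard GQA head-to-group mapping (the index convention is an assumption, not taken from the PR):

```python
def kv_head_for(q_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head to the KV head it shares under GQA."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return q_head // group_size

print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the KV cache without reducing the number of query heads.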
  • weight tying: tied input embeddings and output projection; parameters: null
  • RoPE: rotary positional embeddings; parameters: null
  • U-Net skip connections: U-Net style skip connections with learned skip weights; parameters: null
  • logit softcap: softcapped logits with scale 30.0; parameters: {"softcap":30}
  • QK-Gain: attention QK gain initialized to 4.0; parameters: {"init":4}
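The PR records only the initialization value; one plausible reading is a learnable per-head gain multiplying the QK attention logits, initialized to 4.0. A hypothetical sketch under that assumption (placement and learnability are not confirmed by the PR):

```python
def init_qk_gains(n_heads: int = 8, init: float = 4.0) -> list[float]:
    """One (hypothetically learnable) gain per attention head."""
    return [init] * n_heads

def scaled_score(raw_qk: float, gain: float, head_dim: int = 64) -> float:
    """Attention logit: gain * (q . k) / sqrt(head_dim)."""
    return gain * raw_qk / head_dim ** 0.5

gains = init_qk_gains()
print(scaled_score(8.0, gains[0]))  # 4.0 * 8.0 / 8.0 = 4.0
```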
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"adam_for_scalars_embeddings":true})
  • AdamW (weight_decay: null, momentum: null, other_params: {"used_for_slot":true})
Weight Averaging
  • EMA + Tight SWA; parameters: {"decay":0.997,"swa_every":50,"swa_start_step":4600}
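A sketch of the two running averages using the PR's hyperparameters (decay 0.997, an SWA snapshot every 50 steps starting at step 4600). How the EMA and SWA weights are combined is not specified, so only the update rule and schedule are shown:

```python
DECAY, SWA_EVERY, SWA_START = 0.997, 50, 4600

def ema_update(ema: float, w: float, decay: float = DECAY) -> float:
    """Standard exponential moving average of a parameter."""
    return decay * ema + (1.0 - decay) * w

def swa_steps(total_steps: int) -> list[int]:
    """Steps at which a 'tight' (late-window) SWA snapshot is taken."""
    return list(range(SWA_START, total_steps + 1, SWA_EVERY))

print(swa_steps(4750))  # [4600, 4650, 4700, 4750]
```

Starting SWA late keeps the average "tight" around the end-of-training basin instead of mixing in early, higher-loss checkpoints.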
Quantization
  • GPTQ (bits: 6, scope: full model)
  • late QAT (bits: 6, scope: full model)
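Int6 weights need bit-packing to realize the size win: 4 six-bit values fit in 3 bytes. An illustrative packer (the actual GPTQ storage layout in the PR may differ):

```python
def pack_int6(vals: list[int]) -> bytes:
    """Pack unsigned 6-bit values (0..63) MSB-first into bytes."""
    assert all(0 <= v < 64 for v in vals)
    bits, n, out = 0, 0, bytearray()
    for v in vals:
        bits = (bits << 6) | v
        n += 6
        while n >= 8:
            n -= 8
            out.append((bits >> n) & 0xFF)
    if n:  # left-align any trailing partial byte
        out.append((bits << (8 - n)) & 0xFF)
    return bytes(out)

print(len(pack_int6([0] * 4)))  # 4 six-bit values -> 3 bytes
```

The packed stream is then what zstd (below) compresses into the 15.8 MB artifact.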
Compression
  • zstd (level: 22)
Evaluation
  • sliding window eval; parameters: {"stride":64,"seq_len":2048}
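With stride 64 over a 2048-token window, each window re-reads up to the full context but only the newest 64 tokens are scored, so every token is evaluated with near-maximal left context. A sketch of the window schedule (the exact scoring boundaries are an assumption):

```python
def eval_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Return (start, end, score_from) triples: the model sees tokens
    [start, end) and only tokens [score_from, end) contribute to bpb."""
    windows, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, pos))
        pos = end
    return windows

ws = eval_windows(4096)
print(ws[0], ws[-1])  # every token is scored exactly once
```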
Test-Time Training
  • score-first TTT; parameters: {"steps":16,"learning_rate":0.008,"min_learning_rate":0.0008}
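The PR gives only the step count and LR endpoints (16 steps, 0.008 down to 0.0008); a cosine decay between them is an assumption. A sketch of that schedule:

```python
import math

STEPS, LR, MIN_LR = 16, 0.008, 0.0008

def ttt_lr(step: int) -> float:
    """Cosine decay from LR at step 0 to MIN_LR at the last step."""
    t = step / (STEPS - 1)  # 0.0 at the first step, 1.0 at the last
    return MIN_LR + 0.5 * (LR - MIN_LR) * (1.0 + math.cos(math.pi * t))
```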
Regularization
  • logit softcap; parameters: {"softcap":30}
Sequence Length
  • train_length: null; eval_length: 2048

Novel Contributions

  • LeakyReLU(0.5)^2 MLP with 3x expansion
  • XSA applied on all 11 layers
  • QK-Gain initialization at 4.0
  • Full GPTQ int6 quantization with zstd compression
  • SLOT evaluation with per-sample delta and logit bias optimization
  • Scored-position masking for SLOT loss
  • EMA plus Tight SWA training recipe