PR #792

open

11L LeakyReLU² + XSA-all + Full GPTQ + 5-gram Backoff (1.0340 BPB)

val_bpb
1.0340
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,903,061 bytes

Training Techniques

Architecture
LeakyReLU²
Uses LeakyReLU(0.5) squared in the MLP in place of ReLU², so negative pre-activations keep a nonzero gradient and gradient flow improves.
parameters: {"negative_slope":0.5}
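A minimal sketch of the activation, assuming the squared LeakyReLU is applied elementwise in the MLP; the function name is illustrative, not from the PR:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    # LeakyReLU keeps a scaled copy of negative inputs, so the squared
    # activation still passes gradient there (plain ReLU^2 zeroes it out).
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```

With negative_slope=0.5, an input of -2.0 maps to (-1.0)² = 1.0 rather than 0.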
XSA
Cross-sequence attention applied to all 11 transformer layers rather than only the last few.
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: all
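The contributions list full Hessian-based GPTQ with actorder and Cholesky error compensation; the sketch below shows only the core per-column error-feedback loop (no activation ordering, no grouping, a dense damped inverse Hessian instead of a Cholesky factor), with illustrative names:

```python
import numpy as np

def gptq_quantize(W: np.ndarray, H: np.ndarray, bits: int = 6) -> np.ndarray:
    """Quantize W (rows = output channels) column by column, folding each
    column's quantization error into the not-yet-quantized columns via the
    inverse Hessian. Simplified GPTQ sketch, not the PR's implementation."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    qmax = 2 ** (bits - 1) - 1                          # signed 6-bit grid: [-32, 31]
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / qmax, 1e-12)
    Hinv = np.linalg.inv(H + 1e-6 * np.eye(d))          # damping for stability
    Q = np.zeros_like(W)
    for j in range(d):
        w = W[:, j]
        q = np.clip(np.round(w / scale[:, 0]), -qmax - 1, qmax) * scale[:, 0]
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]                      # error scaled by local curvature
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate later columns
    return Q
```

With an identity Hessian the compensation term vanishes and this reduces to plain round-to-nearest on a per-row symmetric grid.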
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
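Both averaging schemes reduce to one-line updates. A sketch assuming standard EMA with the listed decay of 0.997 and an equal-weight running mean for SWA (the PR's "Tight SWA" variant is not specified here, so this is the generic form):

```python
def ema_update(avg, param, decay=0.997):
    # Exponential moving average: recent weights dominate, old ones decay.
    return decay * avg + (1.0 - decay) * param

def swa_update(avg, param, n):
    # Stochastic weight averaging: equal-weight running mean after n checkpoints.
    return avg + (param - avg) / (n + 1)
```

Both functions apply unchanged whether `avg`/`param` are scalars or numpy/torch tensors.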
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
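One way to realize a stride-64 sliding-window evaluation at seq_len 2048 is to score each token exactly once while giving it close to the full context window; the span generator below is an illustrative sketch, not the PR's code:

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (start, end, n_scored) spans: each window sees up to seq_len
    tokens of context, but only the final `stride` tokens are newly scored,
    so every token is evaluated once with near-maximal left context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - seq_len)  # context begins here
        end = min(pos + stride, n_tokens)
        yield start, end, end - pos             # score tokens [pos, end)
        pos = end
```

A smaller stride gives more context per scored token at the cost of proportionally more forward passes.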
n-gram backoff
parameters: {"order":5,"backoff_orders":[5,4,3,2],"entropy_adaptive":true}
Test-Time Training
score-first TTT
parameters: {"cache_update_after_scoring":true}
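`cache_update_after_scoring` suggests the protocol scores each token against the cache as it stood *before* that token, then inserts it, so the cache never leaks the answer into its own prediction. A sketch with a toy Laplace-smoothed unigram cache (all names illustrative):

```python
from collections import Counter

class UnigramCache:
    def __init__(self, vocab_size=100):
        self.counts = Counter()
        self.total = 0
        self.vocab_size = vocab_size

    def prob(self, t):
        # Laplace-smoothed probability under the current cache state.
        return (self.counts[t] + 1) / (self.total + self.vocab_size)

    def update(self, t):
        self.counts[t] += 1
        self.total += 1

def score_first(tokens, cache):
    """Score-first protocol: score with the stale cache, then update it."""
    probs = []
    for t in tokens:
        probs.append(cache.prob(t))  # score first ...
        cache.update(t)              # ... fold the token in afterwards
    return probs
```

The alternative ordering (update, then score) would let every token condition on itself and inflate the measured likelihood.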
Other
other
Entropy-adaptive alpha blending between the model's predictions and the n-gram cache distribution.
parameters: {"alpha_low":0.05,"alpha_high":0.4,"entropy_threshold":4}
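A sketch of the blending rule under the simplest reading of the parameters: a hard switch on the model's predictive entropy at the threshold (the actual implementation may interpolate between the two alphas instead):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    # Shannon entropy in bits of a probability distribution.
    return float(-np.sum(p * np.log2(np.maximum(p, 1e-12))))

def adaptive_alpha(h: float, alpha_low=0.05, alpha_high=0.4, threshold=4.0) -> float:
    # Confident model (low entropy): trust the n-gram cache only a little.
    # Uncertain model (high entropy): lean on the cache more heavily.
    return alpha_low if h < threshold else alpha_high

def blend(p_model: np.ndarray, p_ngram: np.ndarray, alpha: float) -> np.ndarray:
    # Linear mixture; stays a valid distribution for alpha in [0, 1].
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

A uniform distribution over 16 tokens has entropy exactly 4 bits, i.e. it sits right at the listed threshold and gets alpha_high.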

Novel Contributions

  • LeakyReLU(0.5)² MLP activation
  • XSA applied to all 11 layers
  • Full Hessian-based GPTQ with actorder and Cholesky error compensation
  • 5-gram multi-order backoff with separate hash tables per order
  • Entropy-adaptive alpha for n-gram/model mixing
  • Score-first n-gram cache protocol