PR #763

open

Record: 11L XSA-all + backoff 7-gram (mean val_bpb=0.9917)

by hypery11
val_bpb
0.9917
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.99 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers
parameters: {"layers":11}
LeakyReLU^2 MLP
MLP uses LeakyReLU(0.5)^2 activation with 3x expansion
parameters: {"expansion":3}
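A minimal numpy sketch of the activation and expansion described above (the weight-init scheme and class names are illustrative, not the PR's actual code):

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared, per the MLP description."""
    return np.where(x >= 0, x, slope * x) ** 2

class MLP:
    """Position-wise MLP with 3x hidden expansion (initialization is illustrative)."""
    def __init__(self, d_model, expansion=3, seed=0):
        rng = np.random.default_rng(seed)
        d_hidden = expansion * d_model
        self.w_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

    def __call__(self, x):
        return sq_leaky_relu(x @ self.w_in) @ self.w_out
```

Note the activation is nonnegative everywhere (it is a square), so the negative slope only controls how much signal from negative pre-activations survives.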
BigramHash
BigramHash feature module
parameters: {"dimensions":10240}
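The PR gives only the bucket count (10240); a plausible sketch of a hashed-bigram feature module, where the hash constants and the table initialization are assumptions:

```python
import numpy as np

def bigram_bucket(prev_tok, cur_tok, n_buckets=10240):
    """Hash a (prev, cur) token pair into one of n_buckets feature slots.
    The multiplicative mixing constants are illustrative, not the PR's hash."""
    h = (prev_tok * 0x9E3779B1 + cur_tok * 0x85EBCA77) & 0xFFFFFFFF
    return h % n_buckets

class BigramHash:
    """Table of hashed-bigram feature vectors, one row added per position."""
    def __init__(self, d_model, n_buckets=10240, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((n_buckets, d_model)) * 0.02
        self.n_buckets = n_buckets

    def __call__(self, tokens):
        # Feature at position t depends on the pair (t-1, t); position 0 reuses token 0.
        prev = np.concatenate(([tokens[0]], tokens[:-1]))
        idx = [bigram_bucket(int(p), int(c), self.n_buckets)
               for p, c in zip(prev, tokens)]
        return self.table[idx]
```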
SmearGate
SmearGate gating mechanism
parameters: null
Value Residual
Adds value residual connections
parameters: null
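Value residual connections typically mix each later layer's value projection with the first layer's values; since the PR gives no parameters, the mixing weight below is illustrative:

```python
import numpy as np

def mix_value_residual(v_layer, v_first, lam=0.5):
    """Blend the current layer's values with the first attention layer's values.
    lam is a per-layer (often learned) mixing weight; 0.5 here is an assumption."""
    return lam * v_layer + (1.0 - lam) * v_first
```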
Gated Attention
Uses gated attention mechanism
parameters: null
tied embeddings
Input and output embeddings are tied
parameters: null
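Tying the embeddings means one matrix serves as both input lookup and output head, which matters for a 13.99 MB artifact; a minimal sketch:

```python
import numpy as np

class TiedEmbedding:
    """One matrix E is both the input embedding and the output projection."""
    def __init__(self, vocab, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.standard_normal((vocab, d_model)) * 0.02

    def embed(self, tokens):
        return self.E[tokens]

    def logits(self, h):
        # The output head reuses E transposed, so no separate unembedding matrix.
        return h @ self.E.T
```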
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
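The parameters suggest each layer's LayerNorm output is damped by 1/sqrt(layer+1), so deeper layers contribute progressively less to the residual stream; a sketch:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scaled_layernorm(x, layer_idx):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1), per the PR's formula."""
    return layernorm(x) / np.sqrt(layer_idx + 1)
```

Layer 0 keeps scale 1; layer 3 emits half the magnitude; the deepest layer (index 10) emits 1/sqrt(11) of it.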
Quantization
GPTQ-lite
bits: 6
scope: all
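"GPTQ-lite" is not further specified; full GPTQ corrects rounding error with second-order information, so the sketch below shows only the int6 storage format via plain per-channel round-to-nearest, as a hedged stand-in:

```python
import numpy as np

def quantize_int6(w):
    """Per-output-channel symmetric round-to-nearest to 6 bits (range [-31, 31])."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```

Round-to-nearest bounds the per-element reconstruction error by scale/2, which is what makes the 6-bit weights compress well afterwards.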
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
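The two averaging schemes above are standard; a minimal sketch with the PR's decay of 0.997 ("Tight SWA" presumably averages checkpoints over a narrow window near the end of training; the uniform average below is the plain form):

```python
import numpy as np

class WeightEMA:
    """Exponential moving average of model weights, decay 0.997 per the PR."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

def swa_average(checkpoints):
    """Plain SWA: uniform average of a list of checkpoint dicts."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```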
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: null
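Muon applies its learning rate to an orthogonalized momentum update, computed by a Newton-Schulz iteration on each 2-D weight's gradient statistics. The sketch below uses the classic cubic iteration for clarity; Muon's reference implementation uses a tuned quintic variant:

```python
import numpy as np

def orthogonalize(g, steps=25):
    """Approximate the nearest (semi-)orthogonal matrix to g via the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X. Converges once the
    singular values are scaled into (0, 1], hence the spectral-norm division."""
    x = g / np.linalg.norm(g, 2)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

After enough steps, every singular value of the result is driven to 1, so the update direction is preserved while its per-direction magnitudes are equalized.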
Evaluation
multi-order backoff n-gram eval cache
parameters: {"orders":[2,3,4,5,6,7],"fallback":"highest-order-first","alpha":0.4,"buckets_per_order":"4M","score_first":true,"deterministic":true,"no_ttt":true}
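A hedged reading of these parameters: counts for orders 2-7 are hashed into fixed bucket tables (~4M per order in the PR; tiny here), the longest context with any mass wins ("highest-order-first"), the cache prediction is mixed into the model's with fixed alpha=0.4, and "score_first" is taken to mean each token is scored before the cache is updated with it, keeping evaluation deterministic with no test-time training:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Multi-order backoff n-gram evaluation cache (orders 2..7)."""
    def __init__(self, orders=(2, 3, 4, 5, 6, 7), n_buckets=1 << 12, alpha=0.4):
        self.orders = sorted(orders, reverse=True)   # highest-order-first fallback
        self.n_buckets = n_buckets
        self.alpha = alpha
        # per order: context-bucket -> {next_token: count}
        self.counts = {o: defaultdict(lambda: defaultdict(int)) for o in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def prob(self, context, token):
        """Back off from the longest context whose bucket has any counts."""
        for o in self.orders:
            if len(context) < o - 1:
                continue
            dist = self.counts[o].get(self._bucket(tuple(context[-(o - 1):])))
            if dist:
                return dist.get(token, 0) / sum(dist.values())
        return None

    def update(self, context, token):
        for o in self.orders:
            if len(context) >= o - 1:
                self.counts[o][self._bucket(tuple(context[-(o - 1):]))][token] += 1

    def blend(self, p_model, context, token):
        """Score first (cache not yet updated with this token), then the caller
        calls update(); fixed alpha=0.4 mixes cache and model probabilities."""
        p_cache = self.prob(context, token)
        if p_cache is None:
            return p_model
        return self.alpha * p_cache + (1 - self.alpha) * p_model
```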

Novel Contributions

  • 11-layer Transformer with XSA applied to all layers
  • Multi-order backoff n-gram evaluation cache from orders 2 through 7
  • Highest-order-first fallback with fixed alpha=0.40
  • Score-first deterministic evaluation with no test-time training
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • EMA plus Tight SWA plus Late QAT training recipe
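"Late QAT" is not detailed in the techniques list above; presumably fake-quantization is switched on for the last stretch of training so the network adapts to int6 rounding before the GPTQ-lite export. A minimal fake-quant sketch (the straight-through estimator used in the backward pass is omitted):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Forward pass sees int6-rounded weights while master weights stay float."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(w).max()
    scale = m / qmax if m > 0 else 1.0
    return np.round(w / scale) * scale
```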