PR #758 (open)

Record: 11L XSA-all + 7-gram cache (mean val_bpb=1.0465)

by hypery11
val_bpb: 1.0465
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.99 MB

Training Techniques

Architecture
  • XSA: Exclusive Self-Attention applied on all 11 layers (parameters: {"layers":11})
  • LeakyReLU(0.5)^2 MLP: MLP uses LeakyReLU with slope 0.5, squared, with 3x expansion (parameters: {"expansion":3})
  • BigramHash: bigram hash feature module (parameters: {"dimensions":10240})
  • SmearGate: gating mechanism used in the architecture
  • Value Residual: adds value residual connections
  • Gated Attention: attention mechanism with gating
  • U-Net skip: U-Net style skip connections
  • Tied embeddings: input and output embeddings are tied
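The MLP entry above (LeakyReLU with slope 0.5, squared, 3x expansion) can be sketched as follows. This is a minimal stand-in, not the PR's code: the function and weight names are made up, and real layers would carry biases and learned parameters.

```python
import numpy as np

def leaky_relu_sq_mlp(x, w_in, w_out, slope=0.5):
    """MLP block: project up 3x, apply LeakyReLU(slope) squared, project down.
    Shapes and names are illustrative, not taken from the PR's code."""
    h = x @ w_in                         # (d,) -> (3d,), the 3x expansion
    h = np.where(h >= 0, h, slope * h)   # LeakyReLU with negative slope 0.5
    h = h * h                            # squared activation
    return h @ w_out                     # (3d,) -> (d,)

d = 4
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 3 * d)) * 0.1
w_out = rng.standard_normal((3 * d, d)) * 0.1
y = leaky_relu_sq_mlp(np.ones(d), w_in, w_out)
```

Squaring a leaky activation keeps the block smooth at zero while remaining sign-blind only in the output, not in the gating, since negative pre-activations are damped by 0.5 before the square.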
Regularization
  • LN scaling (parameters: {"scale":"1/sqrt(layer+1)"})
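The LN scaling entry lists a per-layer factor of 1/sqrt(layer+1). A minimal sketch of what that could look like, assuming the factor is applied to the LayerNorm output (the PR does not say where exactly it is applied, and the function name is made up):

```python
import numpy as np

def scaled_layer_norm(x, layer, eps=1e-5):
    """Parameter-free LayerNorm whose output is damped by 1/sqrt(layer+1),
    matching the Regularization entry. Placement is an assumption."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    return normed / np.sqrt(layer + 1)

x = np.array([1.0, 2.0, 3.0, 4.0])
out0 = scaled_layer_norm(x, layer=0)   # factor 1 at the first layer
out3 = scaled_layer_norm(x, layer=3)   # factor 1/2 at layer index 3
```

Deeper layers thus contribute progressively smaller normalized activations, which acts as a depth-dependent regularizer.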
Quantization
  • GPTQ-lite (bits: 6, scope: all)
Compression
  • zstd (level: 22)
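The PR does not spell out the GPTQ-lite algorithm, so the sketch below shows only the simplest int6 scheme it could reduce to: symmetric per-tensor round-to-nearest quantization. The packed 6-bit values would then be compressed with zstd at level 22 (e.g. via the `zstandard` package's `ZstdCompressor(level=22)`), which is presumably how the 13.99 MB artifact size is reached.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor round-to-nearest int6 quantization: a
    simplification of GPTQ-lite, whose exact algorithm the PR omits.
    int6 covers [-32, 31]; a shared scale maps the tensor into it."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by half a step
```

Real GPTQ-style methods additionally use second-order (Hessian) information to reorder and compensate columns; this sketch skips that entirely.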
Weight Averaging
  • EMA (parameters: {"decay":0.997})
  • SWA (parameters: {"type":"Tight SWA"})
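The EMA entry fixes decay at 0.997; one update step is just an exponential blend of the running average toward the current weights. A dict-of-floats stand-in for the real tensors:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over model parameters (decay from the PR's settings)."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}

ema = {"w": 0.0}
for step in range(3):
    ema = ema_update(ema, {"w": 1.0})   # after n steps: 1 - 0.997**n
```

"Tight SWA" is not defined in the PR; by analogy with stochastic weight averaging it presumably means a plain average of checkpoints over a short final window, applied on top of (or instead of) the EMA weights.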
Evaluation
  • backward-looking eval cache (parameters: {"order":7,"alpha":0.4,"buckets":4000000,"min_count":2,"deterministic":true,"score_first":true})
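The evaluation cache parameters suggest the following mechanism: counts of 7-grams are accumulated over tokens already scored, hashed into 4M buckets, and blended into the model's probability with weight alpha=0.4 once a context has been seen at least min_count times; "score_first" means each token is scored before being inserted, keeping the evaluation causal and deterministic. The class below is a sketch under those assumptions, since only the hyperparameters are given:

```python
from collections import defaultdict

class NgramEvalCache:
    """Backward-looking hashed n-gram cache. Score each token against
    counts gathered from earlier tokens only ("score first"), then update.
    The hashing and blending details are assumptions, not the PR's code."""

    def __init__(self, order=7, alpha=0.4, buckets=4_000_000, min_count=2):
        self.order, self.alpha = order, alpha
        self.buckets, self.min_count = buckets, min_count
        self.ctx_counts = defaultdict(int)    # context bucket -> total count
        self.pair_counts = defaultdict(int)   # (context bucket, token) -> count

    def _ctx(self, history):
        return hash(tuple(history[-(self.order - 1):])) % self.buckets

    def blended_prob(self, model_prob, history, token):
        c = self._ctx(history)
        hits = self.pair_counts[(c, token)]
        if hits >= self.min_count:
            cache_prob = hits / self.ctx_counts[c]
            return (1 - self.alpha) * model_prob + self.alpha * cache_prob
        return model_prob

    def update(self, history, token):
        c = self._ctx(history)
        self.ctx_counts[c] += 1
        self.pair_counts[(c, token)] += 1

cache = NgramEvalCache(order=3, alpha=0.4, buckets=1000, min_count=2)
history = [5, 9]
p_cold = cache.blended_prob(0.5, history, token=7)  # cache empty: model prob only
cache.update(history, 7)
cache.update(history, 7)
p_warm = cache.blended_prob(0.5, history, token=7)  # 2 hits: blend toward 1.0
```

In a scoring loop, `blended_prob` is always called before `update` for the same position, so the cache never sees the token it is currently scoring.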
Test-Time Training
  • score-first TTT (parameters: {"deterministic":true,"enabled":false})
Optimizer
  • Muon (lr: 0.025, momentum: 0.99, weight_decay: 0.04)
  • AdamW (no parameters listed)
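Muon applies momentum SGD and then approximately orthogonalizes each 2D weight update with a Newton-Schulz iteration. The sketch below is illustrative only: the quintic coefficients follow the openly published Muon reference implementation, and the step function simply wires in this PR's listed hyperparameters (lr 0.025, momentum 0.99, weight decay 0.04).

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push the singular values of G toward 1 with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference; this is a re-sketch, not the PR's code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize spectral norm <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(w, grad, buf, lr=0.025, momentum=0.99, wd=0.04):
    """One Muon update with the PR's hyperparameters: momentum buffer,
    orthogonalized update, decoupled weight decay."""
    buf = momentum * buf + grad
    w = w * (1 - lr * wd) - lr * newton_schulz_orth(buf)
    return w, buf

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
X = newton_schulz_orth(G)
w, buf = muon_step(G.copy(), rng.standard_normal((4, 4)), np.zeros((4, 4)))
```

The AdamW entry carries no parameters here; in Muon-based setups AdamW typically handles the parameters Muon does not cover (embeddings, norms, scalars), but this PR does not state that split.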

Novel Contributions

  • 11-layer Transformer with XSA applied to all layers
  • 7-gram backward-looking evaluation cache with fixed alpha and hash buckets
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • EMA, Tight SWA, and Late QAT training pipeline
  • Use of BigramHash and SmearGate architectural components
  • Score-first deterministic evaluation without TTT