PR #1185
open [10min_16mb] 0.9641 BPB: LeakyReLU² + Score-First TTT + N-gram Backoff Cache
by skoustav35 (View on GitHub)
val_bpb
0.9641
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,989,583 bytes
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
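A minimal sketch of the squared LeakyReLU activation with the listed negative_slope of 0.5. Whether the implementation preserves the sign after squaring is not stated; here the output is taken as the plain square, so the negative branch contributes small positive values.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU: identity for x >= 0, negative_slope * x otherwise,
    # followed by an elementwise square (cf. squared-ReLU activations).
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```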
XSA
Exclusive self-attention used in later layers.
parameters: {"layers":[6,7,8,9,10]}
weight tying
Tied input and output embeddings.
parameters: null
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Uses gated attention blocks.
parameters: null
SmearGate
Applies SmearGate gating mechanism.
parameters: null
VE128
Value embedding enabled on selected layers.
parameters: {"layers":[8,9,10],"dimension":128}
BigramHash
Uses hashed bigram features.
parameters: {"size":2048}
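Hashed bigram features can be sketched as hashing the (previous, current) token pair into one of the listed 2048 buckets, which then indexes a learned embedding; the multiplicative hash constant here is an assumption, not the PR's hash.

```python
def bigram_hash(prev_token, token, table_size=2048):
    # Hash the ordered token pair into a bucket in [0, table_size);
    # the bucket id would index an embedding table added to the
    # usual token embedding.
    return (prev_token * 1000003 + token) % table_size
```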
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
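A sketch of partial RoPE with the listed 16/64 split: only the first 16 of 64 head dimensions are rotated and the rest pass through unchanged. Rotating the leading slice and the base of 10000 are assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    # x: (seq_len, head_dim). Rotate the first `rotary_dims` dims with
    # standard rotary embeddings; leave the remaining dims untouched.
    seq_len, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = x[:, :half]
    x2 = x[:, half:rotary_dims]
    rest = x[:, rotary_dims:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           rest], axis=1)
```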
MLP3x
Uses 3x MLP width.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.9985}
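EMA weight averaging with the listed decay can be sketched as a per-parameter exponential moving average (the dict-of-tensors form is an assumption about how parameters are stored):

```python
def ema_update(avg_params, params, decay=0.9985):
    # avg <- decay * avg + (1 - decay) * current, applied per parameter.
    # The averaged copy, not the live weights, is used for evaluation.
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in avg_params}
```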
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Muon
weight_decay: null
momentum: null
other_params: null
Adam
weight_decay: null
momentum: null
other_params: {"split":true}
Quantization
int6
bits: 6
scope: per-row
Compression
lzma
level: null
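The two stages above can be sketched together: symmetric 6-bit per-row quantization (integer levels in [-31, 31] with one scale per row) followed by LZMA over the code bytes. Storing codes in int8 rather than bit-packing them, and the symmetric scheme itself, are assumptions.

```python
import lzma
import numpy as np

def quantize_int6_per_row(w):
    # One scale per row so that the row's max magnitude maps to level 31.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int6_per_row(w)
blob = lzma.compress(q.tobytes())        # entropy-code the int6 codes
recon = q.astype(np.float32) * scale     # dequantize for inference
```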
Test-Time Training
score-first TTT
parameters: {"epochs":3,"chunk_size":32000,"stride":64,"optimizer":"SGD"}
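The "score-first" ordering can be sketched with a toy model (TinyModel is a hypothetical stand-in for the network, and the chunk stride handling is omitted): each chunk is scored with the current weights before any SGD step on it, so evaluation never uses a model already adapted on the tokens being scored.

```python
class TinyModel:
    """Toy stand-in: predicts a running mean; 'loss' is squared error."""
    def __init__(self):
        self.mu = 0.0
    def score(self, chunk):
        return sum((x - self.mu) ** 2 for x in chunk)
    def sgd_step(self, chunk, lr=0.002):
        grad = sum(2 * (self.mu - x) for x in chunk) / len(chunk)
        self.mu -= lr * grad

def score_first_ttt(model, chunks, epochs=3, lr=0.002):
    total_loss = 0.0
    for chunk in chunks:
        total_loss += model.score(chunk)   # score BEFORE any update
        for _ in range(epochs):
            model.sgd_step(chunk, lr=lr)   # then adapt on the same chunk
    return total_loss
```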
Evaluation
n-gram backoff cache
parameters: {"orders":[2,3,4,5,6,7,8,9],"backoff":"highest matching order","smoothing":"Laplace add-1","entropy_adaptive_alpha":true}
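A minimal sketch of the eval-time cache, assuming online counts over previously scored tokens: prediction backs off to the highest order whose context has been seen, with Laplace add-1 smoothing within that order. The byte-level vocab_size and the uniform fallback when no context matches are assumptions; blending with the neural model is omitted here.

```python
from collections import defaultdict

class NGramBackoffCache:
    def __init__(self, orders=range(2, 10), vocab_size=256):
        self.orders = sorted(orders, reverse=True)  # try highest first
        self.vocab_size = vocab_size
        # counts[n][context][token]: context is the preceding n-1 tokens
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}
        self.history = []

    def prob(self, token):
        for n in self.orders:
            if len(self.history) < n - 1:
                continue
            ctx = tuple(self.history[-(n - 1):])
            if ctx in self.counts[n]:
                c = self.counts[n][ctx]
                total = sum(c.values())
                return (c[token] + 1) / (total + self.vocab_size)  # add-1
        return 1.0 / self.vocab_size  # no matching context yet

    def update(self, token):
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][token] += 1
        self.history.append(token)
```

Note that `prob` is called before `update` for each token, matching the score-first discipline of the rest of the submission.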
Regularization
LN scale
parameters: null
Novel Contributions
- LeakyReLU squared architecture with gated attention and value residuals
- Rule-legal score-first test-time training that scores each token before any update on it
- Eval-time multi-order n-gram backoff cache with Laplace smoothing
- Entropy-adaptive alpha scaling for blending neural and n-gram probabilities
- Int6 per-row quantization with LZMA compression to fit the 16MB limit
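The entropy-adaptive blending of neural and n-gram probabilities can be sketched as follows: the mixing weight alpha grows with the neural model's normalized entropy, so the cache gets more weight where the model is uncertain. The linear schedule and the alpha_max cap are assumptions.

```python
import math

def blend(p_model, p_cache, alpha_max=0.5):
    # Normalized entropy of the neural distribution, in [0, 1].
    h = -sum(p * math.log(p) for p in p_model if p > 0)
    h_norm = h / math.log(len(p_model))
    alpha = alpha_max * h_norm
    # Convex combination keeps the result a valid distribution.
    return [(1 - alpha) * pm + alpha * pc
            for pm, pc in zip(p_model, p_cache)]
```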