PR #1033

Status: open

Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT

by Naazimsnh02
val_bpb: 0.4311
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads sharing 4 key/value heads.
parameters: {"heads":"8/4"}
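A minimal NumPy sketch of the listed 8/4 grouped-query attention: 8 query heads share 4 key/value heads, halving the K/V projection size. The head dimension, causal masking, and scaling here are illustrative assumptions, not details from the submission.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: each K/V head serves
    n_q_heads // n_kv_heads = 2 query heads."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)              # expand K/V to match query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal mask
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum('hts,shd->thd', w, v)
    return out.reshape(T, d)
```

Note the K/V weight matrices have half the columns of the query matrix, which is where GQA's parameter and cache savings come from.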
BigramHash
Bigram hash cache/embedding used in the model and eval stack.
parameters: {"buckets":2048}
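One plausible reading of the bigram hash embedding, as a sketch: each (previous, current) token pair is hashed into one of the listed 2048 buckets, which indexes a small learned table. The hash multiplier and embedding dimension below are illustrative assumptions.

```python
import numpy as np

N_BUCKETS = 2048   # matches the listed bucket count
EMB_DIM = 32       # illustrative; the real dimension isn't stated

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix the token pair into a bucket; the odd multiplier is an
    # arbitrary constant chosen here for illustration.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

# Learned table in the real model; zeros here just for shape.
bigram_emb = np.zeros((N_BUCKETS, EMB_DIM))

def bigram_feature(tokens, t):
    """Feature for position t from the (token[t-1], token[t]) pair."""
    return bigram_emb[bigram_bucket(tokens[t - 1], tokens[t])]
```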
XSA
XSA attention applied in the last 4 layers of the architecture.
parameters: {"last_layers":4}
LeakyReLU
LeakyReLU squared activation.
parameters: {"slope":0.5}
Partial RoPE
Rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":16}
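Partial RoPE can be sketched as rotating only the first 16 dimensions of each per-head query/key vector and passing the rest through unrotated. The pairing convention (first half against second half of the rotated slice) and the base frequency are assumptions.

```python
import numpy as np

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` dimensions of a
    per-head query/key vector; remaining dims pass through unchanged."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[:half], q[half:rot_dims]            # the rotated slice, in pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos])
    return np.concatenate([rotated, q[rot_dims:]]) # rest is position-free
```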
U-Net skip connections
Skip connections in a U-Net style arrangement.
parameters: null
SmearGate
Learned gate that smears each token's representation with the previous token's.
parameters: null
Value Residual
Value residual learning / value residual connections.
parameters: null
depth recurrence
Repeated layers to create virtual depth without extra parameters.
parameters: {"layers":[4,5]}
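Depth recurrence, as listed, reuses layers 4 and 5 with shared weights so the effective depth grows without adding parameters. A minimal sketch; the repeat count of 2 is an assumption (only the layer indices are listed).

```python
def forward_with_recurrence(x, layers, recur_idx=(4, 5), repeats=2):
    """Run a layer stack, passing through the recurrent layers `repeats`
    times each. Weights are shared across repeats, so virtual depth grows
    but the parameter count does not."""
    for i, layer in enumerate(layers):
        n = repeats if i in recur_idx else 1
        for _ in range(n):
            x = layer(x)
    return x
```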
VE128
128-dimensional value embedding module applied at layers 9 and 10.
parameters: {"dimension":128,"layers":[9,10]}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
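The EMA + SWA combination can be sketched as two running statistics over the weights: an exponential moving average updated every step with the listed decay of 0.997, and an SWA-style plain average snapshotted every 50 steps. Shown on a scalar weight for clarity; how the two averages are combined at the end is not stated.

```python
class AveragedWeights:
    """Track an exponential moving average every step and a simple
    (SWA-style) average snapshotted every `swa_every` steps."""
    def __init__(self, w0, ema_decay=0.997, swa_every=50):
        self.ema = float(w0)
        self.swa_sum, self.swa_n = 0.0, 0
        self.decay, self.every, self.step = ema_decay, swa_every, 0

    def update(self, w):
        self.step += 1
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if self.step % self.every == 0:       # take an SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```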
Quantization
GPTQ-lite
parameters: {"bits":6,"scope":"all"}
Compression
lzma
parameters: {"level":null}
Evaluation
sliding window eval
parameters: {"stride":64}
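Sliding-window evaluation with stride 64 can be sketched as follows: the first window scores all of its tokens, and each subsequent window scores only its final 64 tokens, so those tokens get near-full left context. The window length of 1024 is an assumption; only the stride is listed.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Plan eval windows as (ctx_start, end, score_from) triples:
    tokens in [score_from, end) are scored, earlier ones are context.
    Every token is scored exactly once across all windows."""
    spans = []
    scored = min(window, n_tokens)
    spans.append((0, scored, 0))
    while scored < n_tokens:
        new_end = min(scored + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, scored))
        scored = new_end
    return spans
```

The trade-off is compute: a small stride means many overlapping forward passes per token in exchange for longer effective context at each scored position.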
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"epochs":3}
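LoRA test-time training freezes the base weights and adapts only a low-rank delta. A minimal NumPy sketch with the listed rank 8 and learning rate 0.01; the dimensions, initialization scales, and the single-sample SGD step are illustrative assumptions, not the submission's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8

W = rng.normal(size=(d_out, d_in)) * 0.05   # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01    # LoRA factors: B @ A starts at 0,
B = np.zeros((d_out, rank))                 # so TTT begins from the base model

def adapted_forward(x):
    return W @ x + B @ (A @ x)              # only A and B change at test time

def ttt_step(x, grad_out, lr=0.01):
    """One SGD step on the LoRA factors given dL/dy for one input."""
    global A, B
    gB = np.outer(grad_out, A @ x)          # dL/dB
    gA = np.outer(B.T @ grad_out, x)        # dL/dA (uses pre-update B)
    B -= lr * gB
    A -= lr * gA
```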
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
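A warmdown schedule holds the learning rate constant and then decays it to zero over the final 3500 steps. The linear decay shape below is an assumption; only the step count is listed.

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then a linear 'warmdown' to 0 over the
    final warmdown_steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```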
Other
other
Complementary training that downweights tokens a bigram predictor would already get right.
parameters: {"complement_alpha":0.5}
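One plausible reading of complementary training, as a sketch: tokens the bigram predictor already gets right have their loss downweighted by the listed complement_alpha of 0.5, pushing the neural model's capacity toward tokens the statistical cache cannot handle. Whether the weight is applied exactly this way is an assumption.

```python
def complementary_weights(targets, bigram_preds, complement_alpha=0.5):
    """Per-token loss weights: downweight tokens the bigram
    predictor already predicts correctly."""
    return [1.0 - complement_alpha if p == t else 1.0
            for t, p in zip(targets, bigram_preds)]
```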
other
Entropy-adaptive alpha for mixing neural and n-gram predictions during evaluation.
parameters: {"formula":"0.20 + 0.55 * sigmoid(2 * (H - 3.0))"}
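The listed formula maps the model's predictive entropy H to a mixing weight between 0.20 and 0.75: confident (low-entropy) positions mix in less of the other distribution, uncertain ones more. Which distribution alpha weights, and the entropy units (nats assumed), are assumptions in this sketch.

```python
import math

def mix_alpha(H):
    """Mixing weight from the listed formula:
    0.20 + 0.55 * sigmoid(2 * (H - 3.0))."""
    return 0.20 + 0.55 / (1.0 + math.exp(-2.0 * (H - 3.0)))

def mix(p_neural, p_ngram, H):
    """Blend two next-token distributions; alpha is assumed here to
    weight the n-gram side."""
    a = mix_alpha(H)
    return [(1 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
```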
other
Backoff n-gram mixer with orders 2-10 and greedy cascade.
parameters: {"orders":"2-10","buckets":4000000}
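The backoff mixer's greedy cascade can be sketched as: try the longest n-gram order first (10 down to 2) and use the first order whose hashed context has observed counts. Storing full per-order tables rather than a shared 4M-bucket array, and the Python `hash`-based bucketing, are simplifying assumptions.

```python
from collections import defaultdict

class BackoffNgram:
    """Greedy backoff cascade over n-gram orders 2-10 with hashed contexts."""
    def __init__(self, orders=range(2, 11), buckets=4_000_000):
        self.orders = sorted(orders, reverse=True)   # longest order first
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, ctx):
        return hash(ctx) % self.buckets

    def train(self, tokens):
        for i in range(1, len(tokens)):
            for n in self.orders:
                if i - (n - 1) < 0:
                    continue
                ctx = tuple(tokens[i - (n - 1):i])
                self.counts[(n, self._key(ctx))][tokens[i]] += 1

    def predict(self, context):
        for n in self.orders:                        # greedy: longest match wins
            ctx = tuple(context[-(n - 1):])
            if len(ctx) < n - 1:
                continue
            dist = self.counts.get((n, self._key(ctx)))
            if dist:
                total = sum(dist.values())
                return {t: c / total for t, c in dist.items()}
        return None                                  # no order matched
```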

Novel Contributions

  • Complementary training that focuses the neural model on tokens statistical caches cannot predict well
  • Backoff n-gram mixer with adaptive entropy-based alpha
  • Score-first LoRA TTT on already-evaluated tokens
  • Depth recurrence to increase virtual depth without extra parameters