PR #1172

closed

Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)

by dexhunter
val_bpb
1.1015
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.65 MB

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
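A minimal sketch of what stride-64 sliding-window evaluation looks like: overlapping windows give every token full left context, but only the tokens not yet scored count toward the loss. `score_tokens` is a hypothetical stand-in for the model's per-token negative log-likelihoods; window size and the bpb conversion are illustrative.

```python
import math

def sliding_window_bpb(tokens, score_tokens, window=256, stride=64):
    """Slide a context window over the sequence with the given stride;
    each window re-scores its full context but only the new tokens
    (at most `stride` of them) are counted, so every token is scored
    exactly once with as much left context as the window allows."""
    nll_sum, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        n_new = end - prev_end  # tokens not scored by an earlier window
        if n_new <= 0:
            break
        nlls = score_tokens(tokens[begin:end])  # per-token NLLs in nats
        nll_sum += sum(nlls[-n_new:])
        scored += n_new
        prev_end = end
        if end == len(tokens):
            break
    return nll_sum / (scored * math.log(2))  # bits per token (bpb for bytes)
```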
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"steps":8}
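The SLOT-style step above (lr 0.005, 8 steps) can be sketched on a toy linear head: the hidden state and output weights stay frozen, and only an additive delta vector is optimized against the observed token's cross-entropy. The analytic gradient and the 2-D toy shapes are illustrative, not the record's model.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def slot_delta(h, W, target, lr=0.005, steps=8):
    """SLOT-style test-time optimization (sketch): with hidden state h
    and output matrix W frozen, learn an additive delta so that the
    logits W @ (h + delta) better predict the observed target token."""
    d = len(h)
    delta = [0.0] * d
    for _ in range(steps):
        hd = [h[i] + delta[i] for i in range(d)]
        logits = [sum(w[i] * hd[i] for i in range(d)) for w in W]
        p = softmax(logits)
        # cross-entropy gradient wrt delta: W^T (p - onehot(target))
        g = [sum(W[k][i] * (p[k] - (1.0 if k == target else 0.0))
                 for k in range(len(W))) for i in range(d)]
        delta = [delta[i] - lr * g[i] for i in range(d)]
    return delta
```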
Architecture
XSA
Applied XSA to all layers of the model.
parameters: {"layers":11}
BigramHash
Expanded bigram embedding representation.
parameters: {"buckets":2816,"dimensions":160}
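A sketch of the BigramHash idea with the record's parameters (2816 buckets, 160 dims): each (previous, current) token pair hashes into a bucket that indexes a learned embedding. The mixing constant and table initialization are illustrative assumptions.

```python
def bigram_bucket(prev_tok, tok, buckets=2816):
    """Hash a (previous, current) token pair into one of `buckets`
    slots (sketch; the multiplier is illustrative, not the record's)."""
    return (prev_tok * 1000003 + tok) % buckets

# Embedding table: buckets x dimensions, matching the record's parameters.
EMB = [[0.0] * 160 for _ in range(2816)]

def bigram_embedding(prev_tok, tok):
    """Look up the bucketed bigram embedding; in the model this vector
    would be combined with the regular token embedding."""
    return EMB[bigram_bucket(prev_tok, tok)]
```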
U-Net skip connections
Used sigmoid-gated lerp skip connections instead of simple addition.
parameters: null
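The sigmoid-gated lerp skip replaces the plain addition `x + y` of a U-Net skip with a learned interpolation. A minimal sketch, assuming a single learned gate logit per connection:

```python
import math

def gated_skip(x, y, gate_logit):
    """Sigmoid-gated lerp skip connection (sketch): out = g*x + (1-g)*y
    with g = sigmoid(gate_logit), a learned scalar, instead of the
    plain x + y used by standard U-Net skips."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * xi + (1.0 - g) * yi for xi, yi in zip(x, y)]
```

At `gate_logit = 0` the gate is 0.5 and the connection averages the two paths; large positive or negative logits let the model learn to favor either path.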
LeakyReLU
Used LeakyReLU^2 MLP activation.
parameters: {"slope":0.5}
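One plausible reading of the LeakyReLU^2 activation with slope 0.5, sketched below. Whether the record squares with or without sign preservation is an assumption; this version keeps the sign so negative pre-activations stay negative.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU^2 MLP activation (sketch): apply LeakyReLU with the
    given slope, then square while preserving the sign. The
    sign-preserving square is an assumption, not confirmed by the record."""
    y = x if x > 0 else slope * x
    return y * y if y >= 0 else -(y * y)
```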
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: late
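The soft-round QAT with alpha ramp listed under Novel Contributions can be sketched with the standard soft-rounding surrogate (Agustsson & Theis style): a differentiable stand-in for `round()` whose sharpness is ramped up over training. The ramp endpoints and linear schedule below are assumptions.

```python
import math

def soft_round(x, alpha):
    """Differentiable soft rounding: alpha -> 0 approaches the identity,
    alpha -> infinity approaches hard round(), so quantization can be
    introduced gradually during training."""
    f = math.floor(x)
    r = x - f
    return f + 0.5 + 0.5 * math.tanh(alpha * (r - 0.5)) / math.tanh(alpha / 2)

def alpha_ramp(step, total_steps, lo=1.0, hi=12.0):
    """Linearly ramp alpha over training so rounding hardens gradually
    (sketch; lo/hi and the linear shape are assumptions)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return lo + t * (hi - lo)
```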
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"split_lr":true,"early_layers_lr":0.025,"late_layers_lr":0.03}
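The split-LR setting above amounts to building two optimizer parameter groups, early layers at 0.025 and late layers at 0.03. A sketch; `split_at` is illustrative, since the record does not say where the split falls.

```python
def muon_param_groups(layer_params, split_at, early_lr=0.025, late_lr=0.03):
    """Split early/late learning rates for Muon (sketch): layers before
    `split_at` get the early LR, the rest the late LR. Group dicts follow
    the common optimizer param-group convention."""
    groups = []
    for i, params in enumerate(layer_params):
        lr = early_lr if i < split_at else late_lr
        groups.append({"params": params, "lr": lr})
    return groups
```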
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
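The warmdown schedule with 4000 warmdown steps: hold the base LR, then decay linearly over the final steps. Decaying all the way to zero is an assumption; only the warmdown length is given by the record.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Warmdown LR schedule (sketch): constant base LR until the last
    `warmdown_steps`, then linear decay to zero at the final step."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```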
Compression
brotli
level: 11
lzma
level: 2
Other
other
Code minification with pyminify and a self-extracting wrapper to reduce code size.
parameters: null

Novel Contributions

  • SLOT test-time optimization on frozen hidden states with an additive delta vector
  • Split early/late Muon learning rates
  • Sigmoid-gated skip connections
  • Soft-round QAT with alpha ramp
  • BigramHash dimension expansion to 160
  • Brotli-11 compression with byte-shuffle
  • Reduced GPTQ calibration reserve time
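The byte-shuffle paired with Brotli-11 above can be sketched as follows: regroup same-significance bytes of fixed-width elements before compression, as in the HDF5/Blosc shuffle filter. Since brotli is not in the standard library, stdlib lzma (also in this record's compression list, at level 2) stands in so the sketch is self-contained.

```python
import lzma

def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte i of every itemsize-wide element together; numeric
    weight blobs often compress better after this transposition."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i] for i in range(itemsize) for j in range(n))

def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle, restoring the original element layout."""
    n = len(data) // itemsize
    return bytes(data[i * n + j] for j in range(n) for i in range(itemsize))

def pack(data: bytes, itemsize: int = 2) -> bytes:
    # The record uses brotli at level 11 here; lzma preset 2 is a
    # stdlib stand-in (the record also ships an lzma level-2 stage).
    return lzma.compress(byte_shuffle(data, itemsize), preset=2)
```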