PR #1172 (closed)
Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)
by dexhunter
val_bpb: 1.1015
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.65 MB
Training Techniques
Evaluation
  sliding window eval (stride=64)
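Sliding-window evaluation with a short stride scores each token with as much left context as the window allows, at the cost of re-running overlapping windows. A minimal sketch of the span bookkeeping; only stride=64 comes from the record, the window size and function names are illustrative:

```python
def sliding_window_spans(n_tokens, window, stride=64):
    """Return (begin, end, target_begin) triples: the model sees
    tokens[begin:end] and loss is scored only on tokens[target_begin:end],
    so every token is evaluated exactly once with maximal left context.
    Assumes stride <= window."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Each span's loss region starts where the previous span's ended, so the scored regions tile the sequence without gaps or double counting.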
Test-Time Training
  score-first TTT (learning_rate=0.005, steps=8)
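The SLOT contribution below describes this step: the model stays frozen and only a small additive delta vector on the final hidden states is optimized against next-token loss on the prompt. A numpy sketch under that reading; the analytic softmax gradient stands in for autograd, learning_rate=0.005 and steps=8 are the record's parameters, and everything else (shapes, names) is illustrative:

```python
import numpy as np

def xent(h, W, targets):
    """Mean next-token cross-entropy for logits (h @ W.T)."""
    logits = h @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def slot_adapt(h, W, targets, learning_rate=0.005, steps=8):
    """Learn one additive delta over frozen hidden states h (T, d) and a
    frozen unembedding W (V, d), given observed next-token ids (T,)."""
    delta = np.zeros(W.shape[1])
    for _ in range(steps):
        logits = (h + delta) @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(targets)), targets] -= 1.0   # dL/dlogits
        delta -= learning_rate * (p @ W).mean(axis=0)  # dL/ddelta
    return delta
```

Because the logits are linear in delta, this objective is convex in delta, so a few small gradient steps reliably lower the prompt loss.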
Architecture
  XSA: applied to all layers of the model (layers=11)
  BigramHash: expanded bigram embedding representation (buckets=2816, dimensions=160)
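A hashed bigram embedding maps each (previous token, current token) pair into a fixed-size bucket table and looks up a learned vector per bucket. A sketch of that lookup; buckets=2816 and dimensions=160 are from the record, while the hash function and how the result is combined with the token embedding are assumptions:

```python
import numpy as np

BUCKETS, DIM = 2816, 160  # parameters reported in this record

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Multiplicative pair hash; the submission's actual hash is not shown,
    # so this mixing constant is an assumption.
    return (prev_tok * 1000003 + tok) % buckets

def bigram_embed(tokens, table):
    """Look up a hashed-bigram embedding for each position; the result is
    typically added to (or concatenated with) the token embedding."""
    out = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # assume id 0 stands in for "no predecessor" at position 0
    for i, t in enumerate(tokens):
        out[i] = table[bigram_bucket(prev, t)]
        prev = t
    return out
```

Hash collisions between rare bigrams are the price paid for keeping the table at 2816 rows instead of vocab² entries.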
  U-Net skip connections: sigmoid-gated lerp skip connections instead of simple addition
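A gated lerp merges the two streams with a learned blend weight rather than summing them, so the model can decide per channel how much of the skip path to pass through. A minimal sketch; which stream receives the gate, and whether the gate is scalar or per-channel, are assumptions:

```python
import numpy as np

def gated_skip(x, skip, g):
    """U-Net skip merge by sigmoid-gated lerp instead of x + skip:
    a learned gate g (scalar or per-channel) blends the two streams."""
    a = 1.0 / (1.0 + np.exp(-g))       # sigmoid gate in (0, 1)
    return a * x + (1.0 - a) * skip
```

At g=0 this is a plain average of the two paths; as the gate saturates it recovers either stream alone, which addition cannot do.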
  LeakyReLU^2: used as the MLP activation (slope=0.5)
Quantization
  GPTQ: bits=6, scope=all
  QAT: bits=null, scope=late
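The soft-round QAT listed in the contributions replaces hard rounding with a differentiable surrogate whose sharpness alpha is ramped up during late training, annealing from identity toward true rounding so the network adapts to quantization gradually. A sketch following the standard soft-round formulation; the linear ramp shape and its endpoints are assumptions:

```python
import math

def soft_round(x, alpha):
    """Differentiable rounding surrogate: identity as alpha -> 0,
    hard rounding as alpha -> inf."""
    if alpha < 1e-6:
        return x
    f = math.floor(x)
    r = x - f - 0.5
    return f + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5

def alpha_ramp(step, ramp_start, ramp_end, alpha_max):
    """Linearly ramp alpha from 0 to alpha_max over the late-training
    window [ramp_start, ramp_end] (window bounds are illustrative)."""
    t = min(max((step - ramp_start) / (ramp_end - ramp_start), 0.0), 1.0)
    return t * alpha_max
```

With a large alpha the surrogate matches hard rounding to machine precision, so the final weights can be rounded with essentially no train/deploy mismatch.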
Optimizer
  Parallel Muon: weight_decay=null, momentum=null, split_lr=true, early_layers_lr=0.025, late_layers_lr=0.03
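Split-LR means the Muon-optimized block weights are partitioned into early and late layers with separate learning rates. A sketch of the grouping; 0.025 and 0.03 are the record's values, while the split point and the "blocks.&lt;i&gt;." naming scheme are assumptions:

```python
def split_lr_groups(param_names, n_layers, early_lr=0.025, late_lr=0.03,
                    split=None):
    """Build two optimizer param groups with separate learning rates for
    early vs late transformer blocks (midpoint split is an assumption)."""
    if split is None:
        split = n_layers // 2
    early = [n for n in param_names if int(n.split(".")[1]) < split]
    late = [n for n in param_names if int(n.split(".")[1]) >= split]
    return [{"params": early, "lr": early_lr},
            {"params": late, "lr": late_lr}]
```

The resulting list is the param-group format that PyTorch-style optimizers, including Muon implementations, accept directly.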
LR Schedule
  warmdown (warmdown_steps=4000)
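A warmdown schedule holds the learning rate flat and then decays it over the final steps of training. A sketch; warmdown_steps=4000 is from the record, and the linear decay shape is an assumption (it is the usual speedrun convention):

```python
def warmdown_scale(step, total_steps, warmdown_steps=4000):
    """LR multiplier: 1.0 until the final warmdown_steps, then linear
    decay to zero at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```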
Compression
  brotli (level=11)
  lzma (level=2)
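The contributions pair Brotli-11 with a byte-shuffle: grouping the k-th byte of every fixed-size item together before entropy coding, so similar high and low bytes form long compressible runs. A sketch of the shuffle filter using stdlib lzma as the compressor stand-in (brotli is a third-party package; itemsize=2 is an assumption):

```python
import lzma

def byte_shuffle(raw: bytes, itemsize: int = 2) -> bytes:
    """Shuffle filter: concatenate byte plane k of every item, for each k."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))

def byte_unshuffle(shuf: bytes, itemsize: int = 2) -> bytes:
    """Inverse of byte_shuffle (assumes len(shuf) % itemsize == 0)."""
    n = len(shuf) // itemsize
    out = bytearray(len(shuf))
    for k in range(itemsize):
        out[k::itemsize] = shuf[k * n:(k + 1) * n]
    return bytes(out)

def pack(raw: bytes) -> bytes:
    # preset=2 matches the record's "lzma, level 2"
    return lzma.compress(byte_shuffle(raw), preset=2)
```

The shuffle is lossless and cheap; the same trick is the standard shuffle filter used ahead of general-purpose compressors in array storage formats.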
Other
  Code minification with pyminify and a self-extracting wrapper to reduce code size.
Novel Contributions
- SLOT test-time optimization on frozen hidden states with an additive delta vector
- Split early/late Muon learning rates
- Sigmoid-gated skip connections
- Soft-round QAT with alpha ramp
- BigramHash dimension expansion to 160
- Brotli-11 compression with byte-shuffle
- Reduced GPTQ calibration reserve time