PR #1488
openRecord: SP1024 + SLOT-24 + QK5.25 + Pre-Quant AdamW TTT — val_bpb 0.8265 (3-seed mean)
by ndokutovich
val_bpb: 0.8265
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.76 MB
Training Techniques

Architecture
- GQA — grouped query attention with an 8/4 head configuration (parameters: {"heads":"8/4"})
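A minimal, non-causal sketch of the 8/4 grouped-query attention layout (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads). Weight shapes, the lack of a causal mask, and all names are illustrative assumptions, not the PR's implementation:

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads, 4 shared KV heads (non-causal sketch)."""
    T, d = x.shape
    hd = d // n_q_heads                       # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)    # (T, 8, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)   # (T, 4, hd) -- half the KV projection
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads           # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)           # broadcast KV heads to (T, 8, hd)
    v = np.repeat(v, group, axis=1)
    att = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)         # softmax over keys
    return np.einsum('hqk,khd->qhd', att, v).reshape(T, d)
```

The memory saving comes from the smaller K/V projections and KV cache: with 8/4 heads, K and V are half the size of full multi-head attention.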
- BigramHash — bigram hash embedding component (parameters: {"dimensions":128,"size":1024})
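A hypothetical sketch of a bigram hash embedding matching the recorded parameters (table size 1024, dimension 128): each (previous token, current token) pair is hashed into a small learned table, and the resulting vector would typically be added to the regular token embedding. The hashing scheme and function names are assumptions:

```python
import numpy as np

TABLE_SIZE, DIM = 1024, 128  # size and dimensions from the record

def bigram_hash_embed(tokens, table):
    """Look up one row of `table` (TABLE_SIZE x DIM) per bigram via a cheap mixing hash."""
    prev = np.concatenate(([0], tokens[:-1]))       # shifted tokens; pad position 0
    idx = (prev * 1000003 + tokens) % TABLE_SIZE    # assumed hash; collisions are allowed
    return table[idx]                               # (T, DIM)
```

Because the table is hashed rather than indexed by the full bigram vocabulary, distinct bigrams may collide; the table stays tiny (1024 × 128 entries) regardless of vocabulary size.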
- SmearGate — SmearGate architectural component
- U-Net skip connections — U-Net-style skip connections in the model
- Value Residual — value residual learning
- ReLU² — squared LeakyReLU / relu-squared style MLP activation
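The record describes the activation as "squared LeakyReLU / relu-squared style"; the plain relu² variant (no leaky slope, an assumption here) is simply:

```python
import numpy as np

def relu2(x):
    """relu-squared MLP activation: zero for negatives, x**2 for positives."""
    return np.square(np.maximum(x, 0.0))
```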
- XSA — XSA-all attention/sequence component
- MLP3x — 3× MLP expansion
Weight Averaging
- EMA (parameters: {"decay":0.997})
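A minimal sketch of parameter EMA with the record's decay of 0.997; the flat parameter dict and update hook are simplifications of whatever the PR actually uses:

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (decay 0.997 from the record)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: np.array(v, dtype=float) for k, v in params.items()}

    def update(self, params):
        """Call after each optimizer step: shadow <- d*shadow + (1-d)*current."""
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * np.asarray(v, dtype=float)
```

At eval (and, here, before quantization) the shadow weights replace the raw weights; with decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.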
Quantization
- late QAT (bits: unspecified; scope: all)

Optimizer
- Muon (weight decay, momentum, other params: unspecified)
Compression
- lzma (level: unspecified)
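Packing the artifact then reduces to lzma-compressing the serialized (already quantized) weights; since the record leaves the level unspecified, the library default is assumed:

```python
import lzma

def pack_artifact(raw_weights: bytes) -> bytes:
    """lzma-compress the serialized quantized weights (default preset assumed)."""
    return lzma.compress(raw_weights)
```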
Test-Time Training
- AdamW TTT (parameters: {"epochs":10,"learning_rate":0.00045,"freeze_blocks":1})
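A hypothetical sketch of pre-quant test-time training: a few AdamW epochs on the evaluation stream itself, applied before the weights are quantized and packed. The model, gradient function, and parameter layout are stand-ins; the record's settings are epochs=10, learning_rate=4.5e-4, and freeze_blocks=1 (modeled here by skipping frozen keys):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=4.5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay and bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    p = p - lr * (mhat / (np.sqrt(vhat) + eps) + wd * p)
    return p, m, v

def ttt(params, grad_fn, frozen=frozenset(), epochs=10):
    """Test-time-train `params` with AdamW, leaving keys in `frozen` untouched."""
    state = {k: (np.zeros_like(p), np.zeros_like(p)) for k, p in params.items()}
    for t in range(1, epochs + 1):
        grads = grad_fn(params)         # stand-in for a backward pass on eval text
        for k in params:
            if k in frozen:             # freeze_blocks=1: e.g. skip the first block
                continue
            m, v = state[k]
            params[k], m, v = adamw_step(params[k], grads[k], m, v, t)
            state[k] = (m, v)
    return params
```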
Evaluation
- sliding window eval (parameters: {"stride":96})
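One common way to implement sliding-window evaluation with stride 96 (assumed here, not confirmed by the record) is to score only the last `stride` tokens of each window, keeping the rest as context, so every token is scored exactly once:

```python
def sliding_windows(n_tokens, window=1024, stride=96):
    """Return (ctx_start, end, score_from) spans: tokens [score_from, end) are
    scored with context [ctx_start, end); each token is scored exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

A small stride relative to the window (96 vs. 1024) means almost every scored token sees close to the full 1024-token context, at the cost of many forward passes.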
LR Schedule
- cosine decay (parameters: {"start":0.012,"end":0.001})
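The cosine decay schedule with the record's endpoints (0.012 down to 0.001) is, in its standard form:

```python
import math

def cosine_lr(step, total_steps, lr_start=0.012, lr_end=0.001):
    """Cosine decay from lr_start to lr_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * frac))
```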
Sequence Length
- sequence_length (train_length: 1024; eval_length: unspecified)
Other
- SLOT evaluation with a per-window delta and logit_bias optimized for 24 AdamW steps per window, then discarded (parameters: {"steps":24})
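A hypothetical sketch of the logit_bias half of the SLOT step: a fresh per-window bias over the vocabulary is optimized for 24 Adam-style steps to lower that window's NLL, then thrown away before the next window. The learning rate and the bias-only formulation are assumptions (the record also mentions a per-window delta, not modeled here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def slot_logit_bias(logits, targets, steps=24, lr=0.01):
    """Optimize a per-window logit bias for `steps` AdamW-style updates (wd=0),
    minimizing mean NLL of `targets` under logits + b; b is discarded afterwards."""
    V = logits.shape[-1]
    b = np.zeros(V)
    m = np.zeros(V)
    v = np.zeros(V)
    onehot_mean = np.bincount(targets, minlength=V) / len(targets)
    for t in range(1, steps + 1):
        g = softmax(logits + b).mean(0) - onehot_mean  # exact grad of mean NLL wrt b
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g * g
        b -= lr * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)
    return b
```

Discarding the bias after each window keeps the method honest as evaluation-time adaptation: no information leaks forward between windows.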
Novel Contributions
- First combination of pre-quant AdamW TTT with SLOT hidden-state optimization
- Pre-quant TTT: the test-time-trained weights are baked into the artifact before quantization
- Improved base sliding BPB before SLOT, enabling a stronger final SLOT score
- QK_GAIN_INIT swept to 5.25 (vs. prior PR #1313)