PR #1376 (open)

Record: SLOT-24 + Pre-quant TTT — val_bpb 0.7094 (3-seed mean)

by stukenov
val_bpb: 0.7094
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,930,472 bytes

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":6,"freeze_first_blocks":2,"learning_rate":0.0005}
score-first TTT
parameters: {"steps":24,"learning_rate_start":0.024,"learning_rate_min":0.001,"stride":96}
Quantization
GPTQ
bits: 6
scope: all layers (full Hessian)
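
For reference, a simplified per-layer GPTQ outline (single scale, no grouping or column reordering), with the full Hessian accumulated from calibration inputs; this is the generic algorithm, not the submission's implementation:

```python
import torch

def gptq_layer(W, X, bits=6):
    """Quantize W (out x in) column by column, spreading each column's
    quantization error over the not-yet-quantized columns via H^-1.
    X (n x in) holds calibration inputs to this layer."""
    H = X.T @ X                                              # full Hessian (up to a constant)
    H += 0.01 * H.diagonal().mean() * torch.eye(H.size(0))   # dampening
    U = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    W = W.clone()
    qmax = 2 ** (bits - 1) - 1                               # int6 -> [-32, 31]
    scale = W.abs().max() / qmax
    Q = torch.empty_like(W)
    for j in range(W.size(1)):
        Q[:, j] = (W[:, j] / scale).round().clamp(-qmax - 1, qmax)
        err = (W[:, j] - Q[:, j] * scale) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]
    return Q, scale
```

At 6 bits, weights round into [-32, 31]; the packed integer payload then feeds the lzma stage below.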
Compression
lzma
level: null
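
The artifact is presumably the lzma-compressed stream of the packed int6 weights; with `level: null`, the Python standard library would fall back to its default preset. A minimal sketch:

```python
import lzma

def compress_artifact(packed_int6: bytes) -> bytes:
    # level null in the config -> default preset of the xz container
    return lzma.compress(packed_int6)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```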
Evaluation
sliding window eval
parameters: {"stride":96}
Architecture
BigramHash
Bigram hash embedding used in the base architecture
parameters: {"vocab_size":1536,"dimension":128}
XSA
XSA enabled in all layers
parameters: {"layers":"all"}
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP
parameters: {"slope":0.5,"mlp_multiplier":3}
Partial RoPE
Partial rotary positional embedding
parameters: {"partial":"16/64"}
VE128
VE128 component in the architecture
parameters: null
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
Regularization
LN scale
parameters: null

Novel Contributions

  • Per-sample SLOT-24 optimization with frozen model weights
  • Pre-quant AdamW test-time training before GPTQ quantization
  • Stride-96 SLOT evaluation to reduce the number of evaluation windows and fit more per-window optimization within the compute budget
  • Full Hessian GPTQ int6 with lzma compression
  • Record 3-seed mean val_bpb of 0.7094