PR #1364
openRecord: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)
by stukenov
val_bpb
1.1025
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,985,137 bytes
Training Techniques
Test-Time Training
full TTT
parameters: {"epochs":6,"freeze_first_blocks":2,"learning_rate_start":0.0005,"learning_rate_end":0.00005}
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"TTT"}
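The TTT recipe above is fully specified by its parameters: 6 epochs of AdamW (weight_decay 0.04) with the first 2 blocks frozen and the learning rate annealed from 5e-4 to 5e-5. A minimal sketch of that schedule and freeze mask, assuming a linear decay and taking the block count of 11 from the layers parameter listed under Architecture (the helper names are mine, not the PR's):

```python
# Sketch of the TTT schedule from the parameters above (hypothetical helpers).
# Assumptions: 11 blocks (from the Architecture section), linear LR decay.

N_BLOCKS = 11
FREEZE_FIRST = 2               # freeze_first_blocks
EPOCHS = 6
LR_START, LR_END = 5e-4, 5e-5  # learning_rate_start / learning_rate_end

def trainable(block_idx: int) -> bool:
    """Blocks 0 .. FREEZE_FIRST-1 stay frozen during TTT."""
    return block_idx >= FREEZE_FIRST

def lr_at(epoch: int) -> float:
    """Linearly interpolate from LR_START to LR_END across the epochs."""
    t = epoch / max(EPOCHS - 1, 1)
    return LR_START + t * (LR_END - LR_START)

print([round(lr_at(e), 6) for e in range(EPOCHS)])
print(sum(trainable(b) for b in range(N_BLOCKS)), "blocks trainable")
```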
Quantization
GPTQ
bits: 6
scope: all
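For context on what "bits: 6" buys, here is a minimal symmetric round-to-nearest 6-bit quantizer. This is deliberately not GPTQ itself: GPTQ additionally compensates each column's rounding error using second-order (Hessian) information, which is what the "full Hessian" claim below refers to.

```python
# Minimal stand-in for 6-bit weight quantization (NOT GPTQ: no Hessian-based
# error compensation). Shows the value range and reconstruction error of 6 bits.

def quant6(weights):
    """Quantize a list of floats to signed 6-bit ints plus a scale."""
    qmax = 2 ** (6 - 1) - 1                    # 31 for signed 6-bit
    m = max(abs(w) for w in weights)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.31, 0.02]
q, s = quant6(w)
w_hat = dequant(q, s)
print(q, s)
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # bounded by scale / 2
```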
Compression
lzma
level: null
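The artifact is the serialized quantized weights compressed with lzma ("level: null" means the library default preset). A self-contained sketch with a dummy payload:

```python
import lzma

# Sketch of the final compression stage: serialized quantized weights are
# packed with lzma at the library's default preset. The payload here is a
# dummy repetitive byte string, not the actual artifact.
payload = bytes(range(256)) * 64
packed = lzma.compress(payload)        # level: null -> default preset
restored = lzma.decompress(packed)
print(len(payload), "->", len(packed), "bytes")
```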
Evaluation
sliding window eval
parameters: {"stride":64}
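With stride-64 sliding-window evaluation, each window scores only its final 64 tokens so that every scored token gets near-maximal left context. A sketch of the window bookkeeping, assuming a context window of 256 (the PR only states the stride; the window size here is an assumption):

```python
# Sliding-window eval bookkeeping: score only the last STRIDE tokens of each
# window. WINDOW = 256 is an assumed context length; the PR specifies stride 64.

WINDOW, STRIDE = 256, 64

def windows(n_tokens):
    """Return (start, end, score_from) triples covering tokens [0, n_tokens)."""
    out = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + STRIDE - WINDOW)
        end = min(pos + STRIDE, n_tokens)
        out.append((start, end, pos))      # tokens [pos, end) are scored
        pos = end
    return out

spans = windows(300)
scored = sum(end - score_from for _, end, score_from in spans)
print(spans)
print(scored, "tokens scored")
```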
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: {"every_steps":50}
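The two averaging passes compose as follows: an EMA with decay 0.997 updated every step, and a "tight" SWA that averages a snapshot every 50 steps. A scalar-weight sketch of both running averages (real weights are tensors; the update rules are the same):

```python
# Sketch of EMA (decay 0.997, every step) plus tight SWA (snapshot every 50
# steps). Weights are scalars here purely for illustration.

EMA_DECAY, SWA_EVERY = 0.997, 50

def run(weights_per_step):
    ema, swa, n_snap = weights_per_step[0], 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        ema = EMA_DECAY * ema + (1.0 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:
            n_snap += 1
            swa += (w - swa) / n_snap      # running mean of snapshots
    return ema, swa, n_snap

ema, swa, n_snap = run([1.0] * 200)
print(ema, swa, n_snap)
```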
Architecture
QK-Gain
Per-head QK gain scaling
parameters: {"gain":4}
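One plausible reading of "per-head QK gain scaling" with gain 4.0 is a per-head multiplier on the attention logits on top of the usual 1/sqrt(d_head) factor; the sketch below assumes that reading (single head, pure Python, the gain fixed rather than learned):

```python
import math

# Assumed interpretation of QK-Gain: attention logits are multiplied by a
# per-head gain (4.0 here) in addition to the standard 1/sqrt(d_head) scaling.

GAIN = 4.0
D_HEAD = 64

def attn_logit(q, k, gain=GAIN):
    dot = sum(a * b for a, b in zip(q, k))
    return gain * dot / math.sqrt(len(q))

q = [0.1] * D_HEAD
k = [0.2] * D_HEAD
print(attn_logit(q, k))
```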
XSA
XSA applied to all layers
parameters: {"layers":11}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"mlp_multiplier":3}
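A sketch of the activation and hidden width implied by these parameters, assuming "LeakyReLU squared" means squaring the LeakyReLU output (the negative slope is not stated in the PR; 0.01 below is an assumption):

```python
# LeakyReLU-squared activation with a 3x MLP hidden multiplier.
# SLOPE = 0.01 is an assumed negative slope, not stated in the PR.

SLOPE = 0.01
MLP_MULT = 3            # mlp_multiplier: hidden = 3 * d_model

def lrelu_sq(x: float) -> float:
    y = x if x >= 0 else SLOPE * x
    return y * y

def hidden_dim(d_model: int) -> int:
    return MLP_MULT * d_model

print(lrelu_sq(2.0), lrelu_sq(-2.0), hidden_dim(64))
```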
GQA
Grouped query attention
parameters: {"layers":11,"heads":8,"kv_heads":4}
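With 8 query heads over 4 KV heads, each KV head serves 8 / 4 = 2 query heads. A sketch of that head-to-group mapping:

```python
# GQA head layout from the parameters above: 8 query heads share 4 KV heads,
# so each KV head serves heads // kv_heads = 2 query heads.

HEADS, KV_HEADS = 8, 4

def kv_head_for(q_head: int) -> int:
    group = HEADS // KV_HEADS          # query heads per KV head
    return q_head // group

print([kv_head_for(h) for h in range(HEADS)])
```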
RoPE
Partial rotary positional embedding
parameters: {"dimensions":16,"total_dimensions":64}
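Partial RoPE here rotates only the first 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A sketch, assuming the standard RoPE pairing and a base of 10000 (the frequency schedule is an assumption, not taken from the PR):

```python
import math

# Partial rotary embedding: rotate the first ROT_DIMS of HEAD_DIM dimensions
# in pairs, identity on the rest. base=10000.0 is an assumed frequency base.

ROT_DIMS, HEAD_DIM = 16, 64

def partial_rope(x, pos, base=10000.0):
    out = list(x)
    for i in range(0, ROT_DIMS, 2):
        theta = pos / base ** (i / ROT_DIMS)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out

x = [1.0] * HEAD_DIM
y = partial_rope(x, pos=3)
print(y[ROT_DIMS:] == x[ROT_DIMS:])   # un-rotated tail is untouched
```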
VE128
VE128 used in selected layers
parameters: {"layers":[9,10]}
SmearGate
SmearGate with BigramHash embedding
parameters: {"embedding_size":"2048x128"}
BigramHash
Bigram hash embedding
parameters: {"embedding_size":"2048x128"}
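The 2048x128 table means each (previous token, current token) pair is hashed into one of 2048 rows of a 128-dimensional embedding, which SmearGate then consumes. A sketch of the bucketing step only; the mixing constants below are assumptions, not the PR's actual hash:

```python
# Bigram hash bucketing into a 2048x128 table: hash (prev_tok, tok) to one of
# ROWS embedding rows. The multiplicative constants are illustrative only.

ROWS, DIM = 2048, 128

def bigram_bucket(prev_tok: int, tok: int) -> int:
    h = (prev_tok * 1000003 + tok) * 2654435761
    return (h >> 7) % ROWS

tokens = [5, 17, 17, 3]
buckets = [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
print(buckets)
```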
Regularization
LN scale
parameters: {"rule":"1/sqrt(layer+1)"}
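The rule sets the LayerNorm scale of layer L to 1/sqrt(L+1), damping contributions from deeper layers. A one-function sketch:

```python
import math

# LN scale rule from the parameters above: layer L's norm scale is
# 1 / sqrt(L + 1), so deeper layers start with smaller scales.

def ln_scale(layer: int) -> float:
    return 1.0 / math.sqrt(layer + 1)

print([round(ln_scale(l), 4) for l in range(4)])
```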
Novel Contributions
- Pre-quantization AdamW test-time training applied to the full-precision EMA model, ahead of GPTQ
- TTT-adapted weights quantize cleanly under 6-bit full-Hessian GPTQ
- Record val_bpb of 1.1025, averaged over 3 seeds
- Combination of pre-quant TTT, QK-Gain 4.0, and full-Hessian GPTQ