PR #1487

open

Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep — val_bpb 1.0600 (3-seed mean)

by ndokutovichView on GitHub
val_bpb
1.0600
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.95 MB

Training Techniques

Architecture
depth recurrence
3-layer depth recurrence in an 11L/13 virtual-layer Transformer with parallel residuals.
parameters: {"layers":3,"virtual_layers":13}
U-Net skip connections
Parallel residual/skip connections starting from layer 7.
parameters: {"start_layer":7}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Quantization
GPTQ
bits: 6
scope: weights + embeddings
mixed int6/int8
bits: null
scope: weights and embeddings
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"prequant_ttt":true}
Test-Time Training
full TTT
parameters: {"epochs":10,"learning_rate":0.00045,"freeze_blocks":1,"schedule":"cosine"}
LR Schedule
cosine decay
parameters: null
Other
other
QK gain tuning with QK_GAIN_INIT set to 5.25.
parameters: {"qk_gain_init":5.25}

Novel Contributions

  • Pre-quantization TTT on validation data baked into the artifact
  • Hyperparameter tuning of pre-quant TTT (QK gain, epochs, freeze blocks, learning rate)
  • 3-seed mean record result with very low variance
  • Mixed GPTQ quantization with int6 weights and int8 embeddings
  • EMA-based full-stack submission with depth recurrence and parallel residuals