val_bpb: 1.1124
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB
Training Techniques
Architecture
- XSA: extended self-attention applied in the last 4 layers. parameters: {"layers":4}
- MLP3x: 3x MLP with ReLU-squared activation.
- BigramHash: bigram hashing with a fixed bucket vocabulary. parameters: {"buckets":6144}
- SmearGate: learned token-blending mechanism.
- KV head count: 8 attention heads sharing 4 KV heads via GQA. parameters: {"heads":8,"kv_heads":4}
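As a rough sketch of how GQA shares KV heads across query heads (a hypothetical helper, not the submission's code), consecutive query heads are grouped onto a single KV head:

```python
# Minimal sketch of GQA head sharing: 8 query heads grouped onto 4 KV heads.
# Hypothetical helper for illustration only.
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the index of the KV head shared by a given query head."""
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return query_head // group_size

# Query heads 0-1 share KV head 0, heads 2-3 share KV head 1, and so on.
mapping = [kv_head_for(q) for q in range(8)]
```

With heads=8 and kv_heads=4, each K/V projection is computed once and reused by two query heads, halving KV-cache size relative to full multi-head attention.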
Weight Averaging
- EMA (exponential moving average of weights). parameters: {"decay":0.997}
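The EMA update can be sketched as follows, using the card's decay of 0.997; plain lists stand in for parameter tensors (a minimal illustration, not the submission's code):

```python
# Exponential moving average of model weights: avg <- d*avg + (1-d)*current.
def ema_update(avg, current, decay=0.997):
    """Elementwise EMA step over flat weight lists."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(avg, current)]

avg = [0.0, 0.0]
for step in range(3):
    weights = [1.0, 2.0]           # pretend these came from an optimizer step
    avg = ema_update(avg, weights)
```

With decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps, so the evaluated weights are a smoothed trail of recent checkpoints.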
Quantization
- Int6 QAT with a straight-through estimator (STE). bits: 6, scope: all weights
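A hedged sketch of symmetric int6 fake-quantization as used in QAT: during training an STE would pass gradients straight through the rounding step; here only the forward quantize/dequantize round trip is shown (assumed symmetric per-tensor scaling, not necessarily the submission's exact scheme):

```python
# Symmetric int6 fake-quantization: map floats to 6-bit integers and back.
def fake_quant_int6(values, bits=6):
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]            # weights seen in the forward pass

out = fake_quant_int6([1.0, -0.5, 0.03])
```

The rounding error per weight is bounded by half a quantization step (scale/2), which is what lets the network adapt to the int6 grid during late training.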
Compression
- zstd at level 22
Evaluation
- Sliding-window evaluation. parameters: {"stride":64}
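One common sliding-window scheme (assumed here, since the card only gives the stride) slides a fixed context window forward by 64 tokens at a time and scores only the new tokens of each window, so every token is scored exactly once with near-full context:

```python
# Sketch of sliding-window evaluation with a fixed context and stride 64.
def window_spans(n_tokens, window=1024, stride=64):
    """Yield (start, end, score_from) for each evaluation window."""
    spans = []
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        start = max(0, end - window)
        score_from = end - stride if spans else start  # first window scores everything
        spans.append((start, end, score_from))
    return spans

spans = window_spans(1152)
```

The window size of 1024 matches the card's eval_length; each later window pays for a full forward pass but contributes only 64 scored tokens, trading compute for longer context per token.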
Test-Time Training
- Score-first full TTT (all blocks unfrozen). parameters: {"learning_rate":1,"epochs":30,"freeze_blocks":0,"momentum":0.9}
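The TTT recipe (SGD with momentum 0.9, learning rate 1.0, 30 epochs, no frozen parameters) can be illustrated on a toy quadratic loss; this is a stand-in objective, not the submission's model:

```python
# SGD with momentum, matching the card's TTT hyperparameters.
def ttt_sgd(w, grad_fn, lr=1.0, momentum=0.9, epochs=30):
    velocity = [0.0] * len(w)        # freeze_blocks=0: every parameter updates
    for _ in range(epochs):
        grads = grad_fn(w)
        velocity = [momentum * v + g for v, g in zip(velocity, grads)]
        w = [wi - lr * vi for wi, vi in zip(w, velocity)]
    return w

# Toy loss 0.5 * 0.05 * sum(w_i^2), so the gradient is 0.05 * w.
final = ttt_sgd([10.0, -4.0], lambda w: [0.05 * wi for wi in w])
```

The sketch shows why LR=1.0 can be stable: what matters is the product of learning rate and curvature, and with small effective curvature (as in a well-conditioned fine-tuning landscape) even a nominally huge LR keeps the momentum iteration convergent.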
Sequence Length
- train_length: 2048, eval_length: 1024
LR Schedule
- Warmdown. parameters: {"warmdown_iters":1600}
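A warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps; the exact shape is an assumption here, since the card only gives the iteration count:

```python
# Sketch of a warmdown LR multiplier: flat, then linear decay to zero
# over the last `warmdown_iters` steps of training.
def lr_scale(step, total_iters, warmdown_iters=1600):
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

For a hypothetical 5000-iteration run, the multiplier stays at 1.0 until step 3400, reaches 0.5 at step 4200, and hits 0.0 at step 5000.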
Regularization
- Weight decay. parameters: {"adamw_weight_decay":0.04}
Other
- Late QAT, enabled when lr_scale < 0.1. parameters: {"enabled":true,"threshold":0.1}
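The late-QAT gate can be sketched as a simple predicate on the LR schedule's current multiplier: quantization-aware training switches on only once the schedule has decayed below the 0.1 threshold (a minimal sketch of the stated rule, not the submission's code):

```python
# Gate for late QAT: active only when the LR multiplier has decayed
# below the configured threshold (0.1 per the card).
def qat_active(lr_scale, enabled=True, threshold=0.1):
    return bool(enabled and lr_scale < threshold)
```

Deferring QAT to the low-LR tail of training lets the full-precision weights settle first, so the quantization grid is learned against nearly final weights.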
Novel Contributions
- Aggressive TTT with SGD at LR = 1.0 instead of the conventional 0.002
- Unfreezing all blocks during TTT, which stabilizes and improves high-learning-rate adaptation
- An extensive TTT hyperparameter sweep showing strong gains from a higher LR and more epochs
- A 3-seed validation run demonstrating a new record-level score
- Combining int6 quantization with zstd compression to fit within the artifact size budget