PR #301

open

Non-record: Int6 QAT + MLP1472 + SlidingWindow + TTT (val_bpb=1.1807)

by lookin-zz
val_bpb: 1.1807
Architecture: GPT
Optimizer: SGD
Artifact Size: 15,781,354 bytes

Training Techniques

Quantization: STE QAT
  bits: 6
  scope: all weights
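The forward pass of STE QAT can be sketched as symmetric per-tensor fake quantization; the function name and the per-tensor scaling choice are illustrative assumptions, not the PR's exact code. The straight-through estimator simply treats the rounding as identity in the backward pass:

```python
def fake_quant_int6(weights, bits=6):
    # Symmetric per-tensor fake quantization (illustrative sketch):
    # snap each weight to one of the 2**bits evenly spaced int levels.
    qmax = 2 ** (bits - 1) - 1  # 31 representable magnitudes for int6
    scale = max(abs(w) for w in weights) / qmax if any(weights) else 1.0
    # Round in the quantized domain, then dequantize back to floats.
    # Under the straight-through estimator, the backward pass treats
    # round() as the identity, so gradients reach the fp master weights.
    return [round(w / scale) * scale for w in weights]
```

Training against the quantized forward pass is what keeps the quantization gap small at 6 bits.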
Architecture: MLP
  Increased MLP hidden size to 1472.
  parameters: {"hidden_size":1472}
Architecture: tied embeddings
  Used FP16 tied embeddings.
  parameters: null
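Tied embeddings reuse one matrix as both the input lookup table and the output head, and storing it in FP16 halves its share of the artifact budget. A minimal NumPy sketch; the toy sizes and function names are hypothetical:

```python
import numpy as np

vocab, d_model = 8, 4  # toy sizes, purely illustrative
# One fp16 matrix serves as both the embedding table and the unembedding,
# so it is stored (and counted against the 16MB budget) only once.
emb = np.random.randn(vocab, d_model).astype(np.float16)

def embed(token_ids):
    # input side: table lookup, (n,) -> (n, d_model)
    return emb[token_ids]

def logits(hidden):
    # output side: the same matrix, transposed, acts as the output head
    return hidden @ emb.T  # (n, d_model) -> (n, vocab)
```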
Optimizer: SGD
  momentum: 0.9
  weight_decay: null
  other_params: {"learning_rate":0.002}
LR Schedule: warmdown
  parameters: {"warmdown_iters":20000}
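A warmdown schedule holds the learning rate constant and then decays it over the final warmdown_iters steps. A sketch under stated assumptions: the PR does not give the decay shape or total iteration count, so the linear ramp and total_iters here are guesses:

```python
def warmdown_lr(step, total_iters, warmdown_iters=20000, base_lr=0.002):
    # Constant LR for most of training, then a linear ramp down to 0
    # over the last `warmdown_iters` steps (decay shape is an assumption).
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```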
Compression: zstd
  level: null
Evaluation: sliding window eval
  parameters: {"stride":64}
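Sliding-window evaluation re-runs the model on overlapping windows and scores only the tokens past the previous window's end, so every scored token sees long left context. A sketch of the span bookkeeping; only stride=64 comes from the PR, while the context length of 1024 is an assumed placeholder:

```python
def sliding_window_spans(n_tokens, context=1024, stride=64):
    # Returns (start, end, n_scored) triples: each window covers up to
    # `context` tokens, but only the tokens past the previous window's
    # end contribute to the loss, so they get near-maximal left context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, at the cost of roughly context/stride forward passes per token.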
Test-Time Training: full TTT
  parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":2}
Regularization: weight decay
  parameters: {"adam_wd":0,"muon_wd":0}

Novel Contributions

  • Int6 STE QAT with a small quantization gap
  • MLP hidden size increased to 1472 while fitting within the 16MB artifact budget
  • Aggressive warmdown learning-rate schedule (warmdown_iters: 20000)
  • FP16 tied embeddings
  • Batched sliding-window evaluation with stride 64
  • Full-weight test-time training on validation data
  • Freezing the first two blocks during TTT