PR #263

Status: open

Non-record: TTT + QAT on Consumer GPU (val_bpb=1.5382)

by Dannybc123
val_bpb: 1.5382
Architecture: Transformer
Optimizer: SGD
Artifact Size: 11.5 MB

Training Techniques

Quantization
QAT
bits: 8
scope: weights during training
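A minimal sketch of what 8-bit fake quantization of weights during training can look like. The PR applies this inside CastedLinear.forward() via a straight-through estimator; the symmetric per-tensor scaling scheme below is an assumption, shown only to illustrate the pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    # STE: forward sees the quantized weights, backward sees the identity.
    return w + (w_q - w).detach()

class CastedLinear(nn.Linear):
    """Linear layer that fake-quantizes its weights on every forward pass."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quantize(self.weight, bits=8), self.bias)
```

The `detach()` trick is what makes this "quantization-aware": the loss is computed against the rounded weights, but gradients still flow to the full-precision master copy.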
Test-Time Training
full TTT
parameters: {"learning_rate":0.0001,"steps":1,"scope":["attn.proj.weight","mlp.proj.weight"]}
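The TTT configuration above (one SGD step at lr 1e-4, scoped to the attention and MLP output projections) can be sketched as follows. The model structure, loss function, and adaptation-on-a-copy strategy are illustrative assumptions, not the PR's exact implementation.

```python
import copy
import torch

def ttt_eval_step(model, loss_fn, batch, lr=1e-4, steps=1,
                  scope=("attn.proj.weight", "mlp.proj.weight")):
    """One round of test-time training on a copy of the model, then re-evaluate."""
    model = copy.deepcopy(model)                 # never mutate the shared weights
    params = [p for name, p in model.named_parameters()
              if any(name.endswith(s) for s in scope)]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):                       # steps=1 in the PR's config
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model, batch)             # loss after adaptation
```

Restricting `scope` to the two output projections keeps the per-batch adaptation cheap relative to updating the full parameter set.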
Architecture
tied embeddings
Uses tied input/output embeddings in the baseline Transformer architecture.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4,"mlp_multiplier":2}
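Embedding tying as described amounts to sharing one weight matrix between the token embedding and the output head. A toy sketch (module names and the omitted transformer body are assumptions; the real model uses 9 layers, dim 512, 8 heads, 4 KV heads):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy LM showing input/output embedding tying only."""
    def __init__(self, vocab_size=256, dim=512):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)       # input token embeddings
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight          # tie: one shared Parameter
    def forward(self, idx):
        # transformer blocks omitted; only the tied ends are shown
        return self.lm_head(self.wte(idx))
```

Tying halves the embedding parameter count, which matters for a small artifact (11.5 MB here).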
Other
other
Autoresearch loop with AI-assisted autonomous experimentation, iterating on train_gpt.py via one-variable-at-a-time experiments.
parameters: {"experiments":15}

Novel Contributions

  • Quantization-aware training (QAT) via fake quantization with a straight-through estimator in CastedLinear.forward()
  • Test-time training on both attention and MLP output projections during evaluation
  • Finding that QAT made training faster while improving post-quantization quality
  • Consumer-GPU-only autonomous experimentation workflow