PR #263

Status: open

Non-record: TTT + QAT on Consumer GPU (val_bpb=1.5382)

by Dannybc123
val_bpb: 1.5382
Architecture: Transformer
Optimizer: SGD
Artifact Size: 11.5 MB

Training Techniques

Quantization
QAT
bits: 8
scope: weights during training
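A minimal sketch of what 8-bit fake quantization of weights during training can look like. The PR applies this inside CastedLinear.forward() via a straight-through estimator; the symmetric per-tensor scaling scheme below is an assumption, shown only to illustrate the pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    # STE: forward sees the quantized weights, backward sees the identity.
    return w + (w_q - w).detach()

class CastedLinear(nn.Linear):
    """Linear layer that fake-quantizes its weights on every forward pass."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quantize(self.weight, bits=8), self.bias)
```

The `detach()` trick is what makes this "quantization-aware": the loss is computed against the rounded weights, but gradients still flow to the full-precision master copy.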
Test-Time Training
full TTT
parameters: {"learning_rate":0.0001,"steps":1,"scope":["attn.proj.weight","mlp.proj.weight"]}
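The TTT configuration above (one SGD step at lr 1e-4, scoped to the attention and MLP output projections) can be sketched as follows. The model structure, loss function, and adaptation-on-a-copy strategy are illustrative assumptions, not the PR's exact implementation.

```python
import copy
import torch

def ttt_eval_step(model, loss_fn, batch, lr=1e-4, steps=1,
                  scope=("attn.proj.weight", "mlp.proj.weight")):
    """One round of test-time training on a copy of the model, then re-evaluate."""
    model = copy.deepcopy(model)                 # never mutate the shared weights
    params = [p for name, p in model.named_parameters()
              if any(name.endswith(s) for s in scope)]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):                       # steps=1 in the PR's config
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model, batch)             # loss after adaptation
```

Restricting `scope` to the two output projections keeps the per-batch adaptation cheap relative to updating the full parameter set.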
Architecture
tied embeddings
Uses tied input/output embeddings in the baseline Transformer architecture.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4,"mlp_multiplier":2}
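Embedding tying as described amounts to sharing one weight matrix between the token embedding and the output head. A toy sketch (module names and the omitted transformer body are assumptions; the real model uses 9 layers, dim 512, 8 heads, 4 KV heads):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy LM showing input/output embedding tying only."""
    def __init__(self, vocab_size=256, dim=512):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)       # input token embeddings
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight          # tie: one shared Parameter
    def forward(self, idx):
        # transformer blocks omitted; only the tied ends are shown
        return self.lm_head(self.wte(idx))
```

Tying halves the embedding parameter count, which matters for a small artifact (11.5 MB here).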
Other
other
Autoresearch loop with AI-assisted autonomous experimentation, iterating on train_gpt.py via one-variable-at-a-time experiments.
parameters: {"experiments":15}

Novel Contributions

  • Quantization-aware training (QAT) via fake quantization with a straight-through estimator in CastedLinear.forward()
  • Test-time training on both attention and MLP output projections during evaluation
  • Finding that QAT made training faster while improving post-quantization quality
  • Consumer-GPU-only autonomous experimentation workflow