val_bpb: 1.5382
Architecture: Transformer
Optimizer: SGD
Artifact Size: 11.5 MB
Training Techniques

Quantization: QAT
- bits: 8
- scope: weights during training
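A minimal sketch of how the fake-quantization QAT described here might look in PyTorch. `FakeQuantLinear` is a hypothetical stand-in for the actual `CastedLinear`; the symmetric per-tensor scale is an assumption, but the core idea matches the record: quantize-dequantize the weights to an 8-bit grid in the forward pass, and use a straight-through estimator so gradients bypass the non-differentiable rounding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Module):
    """Hypothetical stand-in for CastedLinear with 8-bit fake-quant QAT."""

    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.bits = bits

    def forward(self, x):
        w = self.linear.weight
        qmax = 2 ** (self.bits - 1) - 1           # 127 for 8-bit symmetric
        scale = w.abs().max().clamp(min=1e-8) / qmax
        # Fake quantization: round onto the int8 grid, then dequantize.
        w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        # Straight-through estimator: the forward pass sees quantized
        # weights, while backward treats rounding as identity, so
        # gradients flow to the full-precision master weights.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste)
```

Because the master weights stay in full precision, the post-training int8 export sees exactly the weight grid the network was optimized against.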
Test-Time Training: full TTT
- parameters: {"learning_rate": 0.0001, "steps": 1, "scope": ["attn.proj.weight", "mlp.proj.weight"]}
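A sketch of what the full-TTT step could look like, assuming a PyTorch model: one SGD step at evaluation time that updates only parameters whose names match the recorded scope (`attn.proj.weight`, `mlp.proj.weight`). The `Attn`/`MLP`/`Block` modules below are illustrative stand-ins, arranged only so that parameter names end with those suffixes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attn(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)  # -> "attn.proj.weight"
    def forward(self, x):
        return self.proj(x)

class MLP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)  # -> "mlp.proj.weight"
    def forward(self, x):
        return self.proj(F.relu(x))

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = Attn(dim)
        self.mlp = MLP(dim)
        self.head = nn.Linear(dim, dim, bias=False)  # frozen during TTT
    def forward(self, x):
        return self.head(x + self.mlp(self.attn(x)))

def ttt_adapt(model, x, y, lr=1e-4, steps=1):
    # Restrict updates to the attention and MLP output projections,
    # matching the recorded TTT scope.
    params = [p for n, p in model.named_parameters()
              if n.endswith(("attn.proj.weight", "mlp.proj.weight"))]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)  # placeholder for the LM loss
        loss.backward()
        opt.step()
    return loss.item()
```

Selecting parameters by name suffix keeps the adaptation cheap and leaves the rest of the network (here, `head`) untouched between steps.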
Architecture: tied embeddings
Uses tied input/output embeddings in the baseline Transformer architecture.
- parameters: {"layers": 9, "dim": 512, "heads": 8, "kv_heads": 4, "mlp_multiplier": 2}
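A minimal sketch of the tied input/output embeddings, assuming PyTorch: the LM head shares one weight tensor with the token embedding, which halves the embedding parameter count and helps keep the artifact small. `dim=512` matches the recorded architecture; the vocabulary size and class name are illustrative placeholders.

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Illustrative tied-embedding wrapper (not the actual train_gpt.py code)."""

    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size, bias=False)
        # Weight tying: both modules point at one shared (vocab, dim) tensor.
        self.head.weight = self.embed.weight

    def forward(self, idx, hidden):
        # idx -> input embeddings; hidden -> logits via the same matrix.
        return self.embed(idx), self.head(hidden)
```

Since `named_parameters()` deduplicates shared tensors, the tied pair counts (and serializes) as a single parameter.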
Other
Autoresearch loop with AI-assisted autonomous experimentation, iterating on train_gpt.py via one-variable-at-a-time experiments.
- parameters: {"experiments": 15}
Novel Contributions
- Quantization-aware training (QAT) via fake quantization with a straight-through estimator in CastedLinear.forward()
- Test-time training on both attention and MLP output projections during evaluation
- Finding that QAT made training faster while improving post-quantization quality
- Consumer-GPU-only autonomous experimentation workflow