PR #301
Non-record: Int6 QAT + MLP1472 + SlidingWindow + TTT (val_bpb=1.1807)
Status: open
by lookin-zz
val_bpb
1.1807
Architecture
GPT
Optimizer
SGD
Artifact Size
15,781,354 bytes
Training Techniques
Quantization
STE QAT
bits: 6
scope: all weights
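The STE QAT settings above (6 bits, all weights) can be sketched as symmetric per-tensor fake quantization with a straight-through estimator: the forward pass uses weights rounded onto the int6 grid, while the gradient update is applied to the full-precision master weights as if quantization were the identity. This is a minimal NumPy illustration, not the PR's actual implementation; the per-tensor symmetric scaling is an assumption.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization onto the int6 grid.

    Forward: round w to one of 2**bits levels, then dequantize.
    Backward (STE): treat this op as identity, so gradients flow
    straight through to the full-precision master weights.
    """
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)            # dequantized weights

# One illustrative training step: forward with quantized weights,
# update the full-precision master copy (STE), lr from the PR config.
w = np.random.randn(4, 4).astype(np.float32)      # master weights
w_q = fake_quant_int6(w)                          # used in the forward pass
grad = np.ones_like(w)                            # placeholder gradient
w -= 0.002 * grad                                 # SGD step on master weights
```

Because rounding happens on a 64-level grid scaled to the largest weight, the worst-case per-weight error is half a quantization step, which is the "small quantization gap" the contributions list refers to.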
Architecture
MLP
Increased MLP hidden size to 1472.
parameters: {"hidden_size":1472}
tied embeddings
Used FP16 tied embeddings.
parameters: null
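Tied FP16 embeddings can be sketched as a single half-precision matrix reused as both the input embedding table and the output projection, which halves embedding storage relative to FP32 and avoids a separate unembedding matrix. The vocabulary and model dimensions below are illustrative placeholders, not values from the PR.

```python
import numpy as np

vocab, d = 1000, 64                               # hypothetical sizes
# One FP16 matrix serves as both the input embedding table and the
# tied output head, so it is stored (and counted in the artifact) once.
emb = (np.random.randn(vocab, d) * 0.02).astype(np.float16)

def embed(token_ids):
    return emb[token_ids]                         # input embedding lookup

def logits(hidden):
    return hidden.astype(np.float16) @ emb.T      # tied output projection
```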
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
LR Schedule
warmdown
parameters: {"warmdown_iters":20000}
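The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final 20,000 iterations. The PR only specifies warmdown_iters=20000; the constant-then-linear shape and the base LR of 0.002 (taken from the optimizer block) are assumptions.

```python
def warmdown_lr(step, total_iters, warmdown_iters=20000, base_lr=0.002):
    """Constant LR, then linear decay to 0 over the final warmdown_iters.

    The exact schedule shape is an assumption; the PR metadata only
    records warmdown_iters=20000 and learning_rate=0.002.
    """
    if step < total_iters - warmdown_iters:
        return base_lr                            # constant phase
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters   # linear warmdown to 0
```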
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
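Sliding-window evaluation with stride 64 can be sketched as follows: the window advances 64 tokens at a time, and only the newest 64 positions of each window are scored, so every scored token sees close to a full window of left context. The window size and the batching details are assumptions; `logprob_fn` is a hypothetical stand-in for a model call returning per-token log-probabilities.

```python
import numpy as np

def sliding_window_eval(logprob_fn, tokens, window=512, stride=64):
    """Mean per-token log-prob over a long sequence, sliding by `stride`.

    Each step scores only the last `stride` positions of the current
    window, so scored tokens get up to (window - stride) tokens of
    context. `window` here is an illustrative choice.
    """
    total, count = 0.0, 0
    pos = 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        chunk = tokens[start:pos + stride]
        lp = logprob_fn(chunk)                    # per-token log-probs
        new = min(stride, len(tokens) - pos)      # tokens scored this step
        total += lp[-new:].sum()
        count += new
        pos += stride
    return total / count
```

Converting the mean log-prob (base e) to bits-per-byte is then a matter of dividing by ln(2) and by the average bytes per token.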
Test-Time Training
full TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":2}
Regularization
weight decay
parameters: {"adam_wd":0,"muon_wd":0}
Novel Contributions
- Int6 STE QAT with a small quantization gap
- MLP hidden size increased to 1472 while fitting within the 16MB artifact budget
- Aggressive warmdown training schedule
- FP16 tied embeddings
- Batched sliding-window evaluation with stride 64
- Full-weight test-time training on validation data
- Freezing the first two blocks during TTT