PR #601

open

Non-record: VR + GA + Late QAT + Full GPTQ — 1.1418 BPB, 15.7 MB

by anantdgoel
val_bpb
1.1418
Architecture
11-layer GPT
Optimizer
Muon (for matrices) and Adam (for scalars/embeddings)
Artifact Size
15.7 MB

Training Techniques

Quantization
STE QAT (late QAT) + Full GPTQ + Int5 MLP re-quantization + GPTQ-lite
bits: 6
scope: all linear layers with special Int5 re-quantization for MLP
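The late-QAT step amounts to fake quantization: weights are rounded to the int6 grid in the forward pass while gradients flow to the full-precision copy via a straight-through estimator. A minimal numpy sketch, assuming symmetric per-tensor scaling (the PR may scale per-channel):

```python
import numpy as np

def fake_quantize_int6(w, bits=6):
    """Symmetric fake quantization: snap weights to a signed
    `bits`-bit grid, then dequantize back to float. During QAT the
    rounding is treated as identity in the backward pass (STE), so
    the full-precision weights learn to sit near grid points."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

The forward pass uses the fake-quantized weights, so the loss already reflects int6 rounding error before the real post-training quantization runs.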
Architecture
Value Residual (VR)
Layer-0 V vector shortcut blended with current layer V to improve deep attention signal flow
parameters: null
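One common formulation of a value residual, sketched below under the assumption of a single learned mixing logit `lam` per layer (the PR's exact blend may differ):

```python
import numpy as np

def value_residual(v_current, v_layer0, lam):
    """Value Residual: blend the current layer's value vectors with
    the layer-0 values via a learned coefficient. Attention then
    uses the blended values, giving deep layers a shortcut to the
    raw layer-0 signal."""
    mix = 1 / (1 + np.exp(-lam))   # sigmoid keeps the blend convex
    return mix * v_current + (1 - mix) * v_layer0
```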
Gated Attention (GA)
Per-head learned sigmoid gate after scaled dot-product attention to modulate head contributions
parameters: null
XSA
Cross-sequence attention applied in the first 4 layers
parameters: {"layers":4}
BigramHash embeddings
Bigram hash embeddings with 1024 buckets
parameters: {"buckets":1024}
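A bigram hash embedding maps each (previous, current) token pair into one of 1024 buckets and looks up a learned vector, typically added to the ordinary token embedding. A sketch, where the hash mixing constant is illustrative rather than the PR's exact scheme:

```python
import numpy as np

def bigram_hash_embed(token_ids, emb_table, n_buckets=1024):
    """Hash each (prev, cur) token pair into a bucket and look up
    its embedding. Position 0 is padded with a dummy previous token.
    The multiplier 1000003 is an arbitrary illustrative mixer."""
    prev = np.concatenate([[0], token_ids[:-1]])
    h = (prev * 1000003 + token_ids) % n_buckets
    return emb_table[h]
```

Collisions are accepted by design: with only 1024 buckets the table stays tiny, which matters for the 15.7 MB artifact budget.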
Partial RoPE
Rotary positional embeddings applied to only 16 dimensions; the rest pass through unrotated
parameters: {"dimensions":16}
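A sketch of partial RoPE under the assumption that the 16 rotated dimensions are the leading ones of each head (the split point is the PR's only stated parameter):

```python
import numpy as np

def partial_rope(x, n_rot=16, base=10000.0):
    """Partial RoPE: rotate only the first `n_rot` dims of each
    position's vector; remaining dims pass through unchanged.
    x: (seq_len, head_dim)"""
    seq_len, head_dim = x.shape
    x_rot, x_pass = x[:, :n_rot], x[:, n_rot:]
    half = n_rot // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.arange(seq_len)[:, None] * freqs     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Rotating only a slice keeps some channels position-free, which tends to help short-context models spend capacity on content rather than position.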
SmearGate
Attention gating mechanism
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
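Muon's distinguishing step is orthogonalizing the momentum-averaged gradient of each weight matrix with a few Newton-Schulz iterations before applying it; `backend_steps: 5` above plausibly refers to that iteration count. A numpy sketch using the quintic coefficients from the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize matrix G (drive all singular
    values toward 1) via the quintic Newton-Schulz iteration used
    by Muon. Coefficients are from the standard implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # spectral norm <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T                          # keep X @ X.T small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X
```

The orthogonalized update equalizes the scale of all directions in the weight matrix, which is why Muon is used only for matrices while Adam handles scalars and embeddings.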
Adam
weight_decay: 0.04
momentum: null
other_params: {"lr_scalars":0.025,"lr_embeddings":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
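The EMA with decay 0.997 keeps a shadow copy of the weights that is nudged toward the live weights after every optimizer step; evaluation and export use the shadow copy. A minimal sketch over a flat parameter dict:

```python
def ema_update(ema_params, model_params, decay=0.997):
    """Exponential moving average of weights: move the EMA copy a
    fraction (1 - decay) toward the live weights each step."""
    return {k: decay * ema_params[k] + (1 - decay) * model_params[k]
            for k in ema_params}
```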
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"warmup_steps":20}
Regularization
weight decay
parameters: {"value":0.04}
Initialization
OrthoInit
Orthogonal initialization of weights
Compression
zstd
level: null
Evaluation
stride-based eval
parameters: {"stride":128}
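Stride-based eval slides a full-length window over the validation stream in steps of 128 tokens, scoring only the tokens not yet counted, so every scored token (after the first window) sees close to the full 1024-token context. A sketch of the window schedule, assuming the standard strided-perplexity layout:

```python
def stride_eval_windows(n_tokens, window=1024, stride=128):
    """Return (begin, end, score_from) triples: the model reads
    tokens[begin:end] as context and only tokens[score_from:end]
    contribute to the BPB sum."""
    out, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        out.append((begin, end, scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return out
```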
Test-Time Training
SGD TTT (legal, cosine, per-layer)
parameters: null

Novel Contributions

  • Value Residual (VR): layer-0 V shortcut for deep attention signal flow, reducing BPB by 0.015
  • Gated Attention (GA): per-head learned sigmoid gate after SDPA, reducing BPB by 0.003
  • Late QAT: fake quantization enabled once the learning rate falls below a threshold (the final ~5% of training), so the weights adapt to the int6 grid before export
  • Full GPTQ + Int5 MLP post-training quantization: Hessian-aware quantization, with MLP weights re-quantized to int5, reducing BPB by 0.028 and artifact size by 3.6 MB
  • Finding that Test-Time Training (TTT) hurts performance on GPTQ-quantized models, as gradient-based adaptation disturbs the Hessian-aware weight rounding
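The Full GPTQ step admits a compact sketch: columns of a weight matrix are quantized one at a time, and each column's rounding error is folded into the not-yet-quantized columns via the inverse Hessian H⁻¹ with H = XᵀX built from calibration activations. The version below is a simplified illustration (per-tensor int6 grid, dense matrix inverse); real GPTQ uses a Cholesky formulation and per-group scales:

```python
import numpy as np

def gptq_lite(W, X, bits=6, damp=0.01):
    """GPTQ-style quantization sketch. W: (n_out, n_in) weights,
    X: (n_samples, n_in) calibration inputs. Quantize columns
    sequentially, compensating remaining columns for each
    column's rounding error using H^-1."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    H = X.T @ X
    H += damp * np.trace(H) / n_in * np.eye(n_in)   # damping
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    Q = np.zeros_like(W)
    for j in range(n_in):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j:] -= np.outer(err, Hinv[j, j:])      # error feedback
    return Q, scale
```

The error-feedback loop is why GPTQ beats plain rounding at the same bit width, and why the MLP weights can drop to int5 with a tolerable BPB cost.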