PR #601

open

Non-record: VR + GA + Late QAT + Full GPTQ — 1.1418 BPB, 15.7 MB

by anantdgoel
val_bpb
1.1418
Architecture
11-layer GPT
Optimizer
Muon (for matrices) and Adam (for scalars/embeddings)
Artifact Size
15.7 MB

Training Techniques

Quantization
STE QAT (late QAT) + Full GPTQ + Int5 MLP re-quantization + GPTQ-lite
bits: 6
scope: all linear layers with special Int5 re-quantization for MLP
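The late-QAT step amounts to fake quantization: weights are rounded to the int6 grid in the forward pass while gradients flow to the full-precision copy via a straight-through estimator. A minimal numpy sketch, assuming symmetric per-tensor scaling (the PR may scale per-channel):

```python
import numpy as np

def fake_quantize_int6(w, bits=6):
    """Symmetric fake quantization: snap weights to a signed
    `bits`-bit grid, then dequantize back to float. During QAT the
    rounding is treated as identity in the backward pass (STE), so
    the full-precision weights learn to sit near grid points."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

The forward pass uses the fake-quantized weights, so the loss already reflects int6 rounding error before the real post-training quantization runs.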
Architecture
Value Residual (VR)
Layer-0 V vector shortcut blended with current layer V to improve deep attention signal flow
parameters: null
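One common formulation of a value residual, sketched below under the assumption of a single learned mixing logit `lam` per layer (the PR's exact blend may differ):

```python
import numpy as np

def value_residual(v_current, v_layer0, lam):
    """Value Residual: blend the current layer's value vectors with
    the layer-0 values via a learned coefficient. Attention then
    uses the blended values, giving deep layers a shortcut to the
    raw layer-0 signal."""
    mix = 1 / (1 + np.exp(-lam))   # sigmoid keeps the blend convex
    return mix * v_current + (1 - mix) * v_layer0
```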
Gated Attention (GA)
Per-head learned sigmoid gate after scaled dot-product attention to modulate head contributions
parameters: null
XSA
Cross-sequence attention applied in the first 4 layers
parameters: {"layers":4}
BigramHash embeddings
Bigram hash embeddings with 1024 buckets
parameters: {"buckets":1024}
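A bigram hash embedding maps each (previous, current) token pair into one of 1024 buckets and looks up a learned vector, typically added to the ordinary token embedding. A sketch, where the hash mixing constant is illustrative rather than the PR's exact scheme:

```python
import numpy as np

def bigram_hash_embed(token_ids, emb_table, n_buckets=1024):
    """Hash each (prev, cur) token pair into a bucket and look up
    its embedding. Position 0 is padded with a dummy previous token.
    The multiplier 1000003 is an arbitrary illustrative mixer."""
    prev = np.concatenate([[0], token_ids[:-1]])
    h = (prev * 1000003 + token_ids) % n_buckets
    return emb_table[h]
```

Collisions are accepted by design: with only 1024 buckets the table stays tiny, which matters for the 15.7 MB artifact budget.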
Partial RoPE
Rotary positional embeddings applied to only 16 dimensions; the rest pass through unrotated
parameters: {"dimensions":16}
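A sketch of partial RoPE under the assumption that the 16 rotated dimensions are the leading ones of each head (the split point is the PR's only stated parameter):

```python
import numpy as np

def partial_rope(x, n_rot=16, base=10000.0):
    """Partial RoPE: rotate only the first `n_rot` dims of each
    position's vector; remaining dims pass through unchanged.
    x: (seq_len, head_dim)"""
    seq_len, head_dim = x.shape
    x_rot, x_pass = x[:, :n_rot], x[:, n_rot:]
    half = n_rot // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.arange(seq_len)[:, None] * freqs     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Rotating only a slice keeps some channels position-free, which tends to help short-context models spend capacity on content rather than position.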
SmearGate
Attention gating mechanism
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"backend_steps":5}
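Muon's distinguishing step is orthogonalizing the momentum-averaged gradient of each weight matrix with a few Newton-Schulz iterations before applying it; `backend_steps: 5` above plausibly refers to that iteration count. A numpy sketch using the quintic coefficients from the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize matrix G (drive all singular
    values toward 1) via the quintic Newton-Schulz iteration used
    by Muon. Coefficients are from the standard implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # spectral norm <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T                          # keep X @ X.T small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X
```

The orthogonalized update equalizes the scale of all directions in the weight matrix, which is why Muon is used only for matrices while Adam handles scalars and embeddings.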
Adam
weight_decay: 0.04
momentum: null
other_params: {"lr_scalars":0.025,"lr_embeddings":0.035}
Weight Averaging
EMA
parameters: {"decay":0.997}
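The EMA with decay 0.997 keeps a shadow copy of the weights that is nudged toward the live weights after every optimizer step; evaluation and export use the shadow copy. A minimal sketch over a flat parameter dict:

```python
def ema_update(ema_params, model_params, decay=0.997):
    """Exponential moving average of weights: move the EMA copy a
    fraction (1 - decay) toward the live weights each step."""
    return {k: decay * ema_params[k] + (1 - decay) * model_params[k]
            for k in ema_params}
```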
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"warmup_steps":20}
Regularization
weight decay
parameters: {"value":0.04}
Initialization
OrthoInit
Orthogonal initialization of weights
Compression
zstd
level: null
Evaluation
stride-based eval
parameters: {"stride":128}
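Stride-based eval slides a full-length window over the validation stream in steps of 128 tokens, scoring only the tokens not yet counted, so every scored token (after the first window) sees close to the full 1024-token context. A sketch of the window schedule, assuming the standard strided-perplexity layout:

```python
def stride_eval_windows(n_tokens, window=1024, stride=128):
    """Return (begin, end, score_from) triples: the model reads
    tokens[begin:end] as context and only tokens[score_from:end]
    contribute to the BPB sum."""
    out, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        out.append((begin, end, scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return out
```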
Test-Time Training
SGD TTT (legal, cosine, per-layer)
parameters: null

Novel Contributions

  • Value Residual (VR): layer-0 V shortcut for deep attention signal flow, reducing BPB by 0.015
  • Gated Attention (GA): per-head learned sigmoid gate after SDPA, reducing BPB by 0.003
  • Late QAT: fake quantization enabled once the learning rate falls below a threshold (the final ~5% of training), so the weights adapt to the int6 grid before export
  • Full GPTQ + Int5 MLP post-training quantization: Hessian-aware quantization, with MLP weights re-quantized to int5, reducing BPB by 0.028 and artifact size by 3.6 MB
  • Finding that Test-Time Training (TTT) hurts performance on GPTQ-quantized models, as gradient-based adaptation disturbs the Hessian-aware weight rounding
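The Full GPTQ step admits a compact sketch: columns of a weight matrix are quantized one at a time, and each column's rounding error is folded into the not-yet-quantized columns via the inverse Hessian H⁻¹ with H = XᵀX built from calibration activations. The version below is a simplified illustration (per-tensor int6 grid, dense matrix inverse); real GPTQ uses a Cholesky formulation and per-group scales:

```python
import numpy as np

def gptq_lite(W, X, bits=6, damp=0.01):
    """GPTQ-style quantization sketch. W: (n_out, n_in) weights,
    X: (n_samples, n_in) calibration inputs. Quantize columns
    sequentially, compensating remaining columns for each
    column's rounding error using H^-1."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    H = X.T @ X
    H += damp * np.trace(H) / n_in * np.eye(n_in)   # damping
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    Q = np.zeros_like(W)
    for j in range(n_in):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j:] -= np.outer(err, Hinv[j, j:])      # error feedback
    return Q, scale
```

The error-feedback loop is why GPTQ beats plain rounding at the same bit width, and why the MLP weights can drop to int5 with a tolerable BPB cost.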