PR #297
openLate STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)
by davidpuertolas
val_bpb
1.1629
Architecture
GPT-style Transformer
Optimizer
Muon + AdamW
Artifact Size
15,948,643 bytes
Training Techniques
Quantization
STE QAT
bits: 6
scope: MLP and attention weight matrices; the full model is quantized in the final artifact
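A minimal sketch of int6 per-row fake quantization as used in STE QAT. The forward pass rounds weights to a 6-bit grid; in training, the backward pass would use the straight-through estimator (gradients pass through as if this op were the identity). The exact rounding and scale convention in the PR is not shown here, so this is illustrative only.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Per-row symmetric fake quantization to a signed 6-bit grid.

    Forward: round to the int6 grid and dequantize. Under STE QAT the
    backward pass treats this op as the identity, so gradients are not
    blocked by the rounding. (Illustrative; not the PR's exact scheme.)
    """
    qmax = 31  # symmetric 6-bit grid: codes in [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes
    return q * scale                                # dequantized weights

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.0, -2.0]])
wq = fake_quant_int6_per_row(w)
```

Per-row scales keep the quantization error proportional to each row's magnitude, which is why rows whose extremes land exactly on the grid (like the second row above) round-trip losslessly.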
Architecture
MLP3x
Expanded feed-forward network width to 3x the model dimension.
parameters: {"hidden":1536,"model_dim":512,"layers":9}
SmearGate
Learned gate blending current token embedding with previous token embedding for cheap bigram-like signal.
parameters: null
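A sketch of the SmearGate idea under simple assumptions: a single learned scalar gate (sigmoid-squashed) blends each token's embedding with its predecessor's. The PR may use a per-channel or per-head gate; the names and shapes here are illustrative.

```python
import numpy as np

def smear_gate(x, alpha):
    """Blend each token embedding with the previous token's embedding.

    x: (seq, dim) embeddings; alpha: raw learned gate parameter, squashed
    through a sigmoid. Position 0 has no predecessor and is left unchanged.
    (Illustrative sketch, not the PR's actual module.)
    """
    g = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                    # no previous token at position 0
    return (1 - g) * x + g * prev

x = np.arange(8.0).reshape(4, 2)
y = smear_gate(x, alpha=0.0)  # sigmoid(0) = 0.5, an equal blend
```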
BigramHash
Hashed bigram embedding path keyed by adjacent token pairs.
parameters: {"buckets":4096,"dim":128}
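The bigram path can be sketched as a hash of each adjacent token pair into one of 4096 buckets, each bucket owning a 128-dim embedding. The mixing constant below is an assumption; the PR's exact hash is not specified.

```python
import numpy as np

def bigram_hash_embed(tokens, table, buckets=4096):
    """Look up one embedding per adjacent token pair via a hash bucket.

    tokens: (seq,) int token ids; table: (buckets, dim) embedding matrix.
    The multiplier 1000003 is an illustrative mixing constant.
    """
    t = np.asarray(tokens, dtype=np.int64)
    prev = np.roll(t, 1)
    prev[0] = 0                                # pad the first position
    h = (prev * 1000003 + t) % buckets         # cheap bigram hash
    return table[h]                            # (seq, dim)

rng = np.random.default_rng(0)
table = rng.standard_normal((4096, 128))
emb = bigram_hash_embed([5, 7, 5, 7], table)
```

Identical bigrams hash to the same bucket, so repeated pairs (here `(5, 7)` at positions 1 and 3) receive identical embeddings; collisions between distinct pairs are accepted as noise.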
Initialization
OrthoInit
Orthogonal initialization with Overtone-style / muP-style scaling.
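A standard orthogonal initializer, sketched below: QR-decompose a Gaussian matrix, fix column signs for uniqueness, and apply a gain (which a muP-style scheme would derive from fan-in). The gain rule the PR actually uses is not shown here.

```python
import numpy as np

def ortho_init(out_dim, in_dim, gain=1.0, rng=None):
    """Orthogonal initialization via QR decomposition of a Gaussian matrix.

    The sign fix makes the decomposition unique; `gain` stands in for an
    Overtone/muP-style scale factor (assumed, not the PR's exact rule).
    """
    rng = rng or np.random.default_rng()
    a = rng.standard_normal((out_dim, in_dim))
    q, r = np.linalg.qr(a if out_dim >= in_dim else a.T)
    q *= np.sign(np.diag(r))      # canonicalize column signs
    if out_dim < in_dim:
        q = q.T                   # restore (out_dim, in_dim) shape
    return gain * q

w = ortho_init(64, 32, rng=np.random.default_rng(1))
```

For a tall matrix the columns come out orthonormal (`w.T @ w` is the identity); for a wide matrix the rows do.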
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every":200}
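With `start_frac: 0.5` and `every: 200`, SWA takes a snapshot every 200 steps starting halfway through training and keeps a running equal-weight average. A minimal sketch on flat float lists (a real run would average torch tensors):

```python
def swa_schedule(total_steps, start_frac=0.5, every=200):
    """Steps at which a weight snapshot joins the running SWA average."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps + 1) if (s - start) % every == 0]

class SWA:
    """Running equal-weight average of model snapshots (lists of floats)."""
    def __init__(self):
        self.n = 0
        self.avg = None

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # incremental mean: avg += (p - avg) / n
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]

swa = SWA()
for snapshot in ([1.0], [3.0], [5.0]):
    swa.update(snapshot)
```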
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.02,"tied_embed_lr":0.03}
AdamW
weight_decay: 0.01
momentum: null
other_params: null
Compression
zstd
level: 22
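To fit the artifact cap, int6 codes must be bit-packed (four 6-bit codes per three bytes) before entropy coding. The packing layout below is illustrative, not the PR's exact format; the packed bytes would then be compressed with zstd at level 22 (e.g. via the `zstandard` package's `ZstdCompressor(level=22)`).

```python
def pack_int6(codes):
    """Pack signed int6 codes (-32..31) into a compact bitstream."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 6) | ((c + 32) & 0x3F)  # offset-binary 6-bit code
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)

def unpack_int6(data, n):
    """Recover the first n signed int6 codes from a packed bitstream."""
    bits, nbits, codes = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 6 and len(codes) < n:
            nbits -= 6
            codes.append(((bits >> nbits) & 0x3F) - 32)
    return codes

codes = [-32, -1, 0, 31, 7]
packed = pack_int6(codes)
```

Packing alone gives a fixed 25% saving over int8; zstd then exploits any remaining statistical redundancy in the code stream.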
Evaluation
sliding window eval
parameters: {"stride":64}
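Sliding-window evaluation with stride 64 can be sketched as follows: each window advances by the stride, and only the newly uncovered positions are scored, so every token past the first window is evaluated with near-full left context. Window size 2048 is assumed from the training length; the PR's eval length is not stated.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window spans [start, end); only tokens in [score_from, end) are
    scored, so no token is counted twice and each (after the first window)
    sees close to `window` tokens of left context.
    """
    spans, scored_to, start = [], 0, 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

spans = sliding_windows(2176)
```

The cost is one forward pass per stride-sized chunk rather than per window-sized chunk, trading compute for a tighter bpb estimate.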
Test-Time Training
full TTT
parameters: {"learning_rate":0.0003,"momentum":0.95}
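"Full TTT" here means adapting every parameter at test time with plain SGD plus momentum (lr 3e-4, momentum 0.95), rather than restricting updates to a low-rank adapter. A pure-Python sketch of the update rule on flat float lists (a real run would update torch tensors in place):

```python
def sgd_momentum_step(params, grads, bufs, lr=3e-4, momentum=0.95):
    """One SGD-with-momentum update applied to every parameter."""
    new_params, new_bufs = [], []
    for p, g, b in zip(params, grads, bufs):
        b = momentum * b + g          # update momentum buffer
        new_bufs.append(b)
        new_params.append(p - lr * b)
    return new_params, new_bufs

params = [1.0, -2.0]
bufs = [0.0, 0.0]
params, bufs = sgd_momentum_step(params, [0.5, -0.5], bufs)
```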
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"qat_start_frac":0.85,"qat_lr_factor":0.5,"warmdown_iters":3000}
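The schedule parameters combine two things: a linear warmdown to zero over the last 3000 iterations, and a late-QAT switch at 85% of training that also halves the learning rate. A sketch under the assumption that the base LR is otherwise constant (the PR's warmup, if any, is not shown):

```python
def lr_and_qat(step, total_steps, base_lr, warmdown_iters=3000,
               qat_start_frac=0.85, qat_lr_factor=0.5):
    """LR and QAT flag for a given step.

    Constant LR, then linear warmdown to 0 over the final `warmdown_iters`
    steps; once past `qat_start_frac` of training, QAT turns on and the LR
    is additionally scaled by `qat_lr_factor`. (Illustrative combination.)
    """
    lr = base_lr
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        lr = base_lr * (total_steps - step) / warmdown_iters
    qat_on = step >= qat_start_frac * total_steps
    if qat_on:
        lr *= qat_lr_factor
    return lr, qat_on

lr0, qat0 = lr_and_qat(0, 10000, 0.025)
lr1, qat1 = lr_and_qat(8500, 10000, 0.025)
```

Halving the LR once fake-quant noise enters the gradients is a common stabilizer: the rounding in the forward pass adds variance that a full-size step would amplify.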
Regularization
weight decay
parameters: {"muon_weight_decay":0.038,"adamw_weight_decay":0.01}
Novel Contributions
- Late STE QAT, activated only in the last ~15% of wall-clock time, so that most of training proceeds free of quantization noise.
- Int6 per-row quantization with zstd level 22 compression to fit under the 16MB artifact cap.
- 3x MLP expansion (hidden size 1536) combined with SmearGate and BigramHash architectural additions.
- Orthogonal / Overtone-style initialization for large matrices.
- SWA over the second half of warmdown before quantization.
- Full-model SGD test-time training instead of LoRA TTT.