val_bpb: 1.1352
Architecture: GPT
Optimizer: Parallel Muon
Artifact Size: 15.44 MB
Training Techniques
- Quantization: STE QAT (bits: 6, scope: all bank parameters)
- Architecture: XSA, expanded from the last 4 layers to the last 5 layers (parameters: {"layers":5})
- Regularization: label smoothing (parameters: {"value":0.05})
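Label smoothing at 0.05 mixes the one-hot target with a uniform distribution over the classes before taking the cross-entropy. A minimal pure-Python sketch of the smoothed loss (in PyTorch this is simply `F.cross_entropy(..., label_smoothing=0.05)`):

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.05):
    # log-softmax via the log-sum-exp trick for numerical stability
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - lse for l in logits]
    n = len(logits)
    # mix the one-hot target with a uniform distribution over all classes
    return -sum(((1 - eps) * (1.0 if i == target else 0.0) + eps / n) * lp
                for i, lp in enumerate(log_probs))
```

With eps=0 this reduces to the standard cross-entropy; smoothing always increases the loss on confident, correct predictions, which is the regularizing effect.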
- Test-Time Training: full TTT (parameters: {"learning_rate":0.003,"momentum":0.95,"epochs":3,"chunk_tokens":32768})
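Full TTT adapts the model on each evaluation chunk before scoring it. A hypothetical sketch of the two pieces implied by these hyperparameters, the 32768-token chunking and a heavy-ball momentum-SGD inner update at lr 0.003 / momentum 0.95; function names and the scalar-parameter form are illustrative, not from the actual code:

```python
def split_chunks(tokens, chunk_tokens=32768):
    # fixed-size chunks; the model is adapted on each chunk for several
    # epochs before that chunk is scored
    return [tokens[i:i + chunk_tokens] for i in range(0, len(tokens), chunk_tokens)]

def sgd_momentum_step(w, grad, vel, lr=0.003, momentum=0.95):
    # classic heavy-ball update used for the inner TTT loop
    vel = momentum * vel + grad
    return w - lr * vel, vel
```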
- Optimizer: Parallel Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500})
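The other_params suggest Muon's momentum is warmed up from 0.92 to its final 0.99 over the first 1500 steps. A sketch of that schedule; the linear interpolation shape is an assumption, since the log only records the start value and warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    # linearly interpolate momentum during warmup, then hold the final value
    t = min(1.0, step / warmup_steps)
    return start + t * (final - start)
```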
- Weight Averaging: EMA (parameters: {"decay":0.997}) and SWA (parameters: {"every":50})
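Both averaging schemes maintain a shadow copy of the weights: EMA decays toward the live weights every step with decay 0.997, while SWA keeps an equal-weight running average of snapshots taken every 50 steps. Minimal sketches over flat parameter lists (the real versions would operate on model state dicts):

```python
def ema_update(shadow, params, decay=0.997):
    # exponential moving average: shadow <- decay*shadow + (1-decay)*params
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

def swa_update(avg, params, n_snapshots):
    # equal-weight running average over the snapshots collected so far
    return [(a * n_snapshots + p) / (n_snapshots + 1) for a, p in zip(avg, params)]
```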
- Evaluation: sliding window eval (parameters: {"stride":64})
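Sliding-window eval with stride 64 re-scores the sequence in overlapping windows, counting only the final 64 tokens of each window toward the loss so that every scored token sees near-full left context. A sketch of the window bookkeeping; the window size of 1024 is illustrative, not from the log:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # (start, end, score_from): tokens in [score_from, end) count toward the
    # loss; earlier tokens in the window serve as context only
    out = [(0, min(window, n_tokens), 0)]
    end = out[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        out.append((new_end - window, new_end, end))
        end = new_end
    return out
```

The scored spans tile the token stream exactly once, so the per-token loss is comparable to a single full-context pass, at the cost of re-running the model once per stride.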
- LR Schedule: warmdown (parameters: {"warmdown_steps":3500})
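Warmdown here means holding the base learning rate flat and then decaying it over the final 3500 steps. A sketch, assuming the usual linear decay to zero (the decay shape is not recorded in the log):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # constant LR, then linear decay to zero over the final warmdown_steps
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```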
- Other: Bank QAT fix implemented directly in GPT.forward() using STE int6 fake-quantization for the bank parameters, with a torch.compile reset/recompile (parameters: {"recompile_cost_seconds":50,"overhead_ms_per_step":5})
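The STE fake-quantization rounds each bank parameter onto a symmetric int6 grid in the forward pass while the backward pass treats the rounding as the identity, so gradients flow through unchanged. A pure-Python sketch of the forward math (the real version runs on tensors inside GPT.forward(); the per-tensor absmax scaling is an assumption):

```python
def absmax_scale(values, bits=6):
    # symmetric per-tensor scale: the largest magnitude maps to 2**(bits-1)-1
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

def fake_quant_int6(x, scale):
    # round to the nearest int6 level and clamp to the signed 6-bit range
    q = max(-32, min(31, round(x / scale)))
    # dequantize; with an STE, backward would use dL/dx = dL/dy (identity)
    return q * scale
```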
Novel Contributions
- Fixed broken Bank QAT by implementing STE int6 fake-quantization directly in GPT.forward() for bank parameters.
- Expanded XSA from 4 layers to 5 layers.
- Added label smoothing of 0.05.
- Tuned TTT hyperparameters to learning rate 0.003 and momentum 0.95.
- Found the QAT fix too expensive in practice: the torch.compile recompilation (about 50 s) plus roughly 5 ms of overhead per step reduced the number of training steps achievable in the run.