PR #488 (open)

Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)

  • val_bpb: 1.3267
  • Architecture: Transformer
  • Optimizer: Muon
  • Artifact Size: 13.3 MB

Training Techniques

Quantization
STE QAT int6
Quantization-aware training with straight-through-estimator (STE) fake quantization to 6-bit integers.
bits: 6
scope: all weights
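The PR records int6 QAT over all weights, and the contributions list calls it grouped quantization. A minimal sketch of the forward pass, assuming symmetric per-group scales and a group size of 64 (the group size and symmetric scheme are assumptions, not stated in the record):

```python
import numpy as np

def fake_quant_int6(w, group_size=64):
    """Fake-quantize weights to int6 with one symmetric scale per group.

    Forward pass only: round to the int6 grid, then dequantize. During QAT
    the straight-through estimator passes gradients through unchanged,
    i.e. grad wrt w is taken to equal grad wrt the quantized output.
    """
    qmax = 2 ** (6 - 1) - 1                       # int6 grid: [-32, 31]
    flat = w.reshape(-1, group_size)              # one scale per group
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)           # dequantized ("fake") weights
```

Per-group scales keep the quantization error proportional to each group's own magnitude rather than the whole tensor's, which matters at 6 bits.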
Architecture
MLP3x
Uses a 3x MLP expansion in an 11-layer Transformer backbone.
parameters: {"layers":11,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":3}
GQA
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
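With 8 attention heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the KV-head sharing step (the function name and layout are illustrative, not from the PR):

```python
import numpy as np

def repeat_kv(kv, num_heads, num_kv_heads):
    """Expand grouped KV heads so each query head has a KV head to attend to.

    kv: (batch, num_kv_heads, seq, head_dim)
    returns: (batch, num_heads, seq, head_dim)
    """
    groups = num_heads // num_kv_heads   # 8 // 4 = 2 query heads per KV head
    return np.repeat(kv, groups, axis=1)
```

The KV cache (and KV projection parameters) shrink by num_heads / num_kv_heads, here 2x, while query capacity is unchanged.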
SmearGate
Adds a SmearGate module at the embedding layer to inject additional signal.
parameters: null
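The PR does not spell out SmearGate's mechanics. One plausible reading — an assumption, not the PR's definition — is a learned per-channel gate that smears each token's embedding with its predecessor's:

```python
import numpy as np

def smear_gate(emb, gate):
    """Hypothetical SmearGate: add a gated copy of the previous token's
    embedding to each position, injecting extra signal at the embedding layer.

    emb: (seq, dim); gate: (dim,), assumed already squashed into [0, 1].
    Position 0 has no predecessor and is left unchanged.
    """
    out = emb.copy()
    out[1:] = emb[1:] + gate * emb[:-1]
    return out
```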
BigramHash
Adds a compact bigram hash embedding for extra context.
parameters: {"bigram_vocab_size":2048,"bigram_dim":96}
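A sketch of how a hashed bigram embedding with the recorded sizes (2048 buckets, 96 dims) could work: hash each (previous token, current token) pair into a small table and add the looked-up vector to the token embedding. The hash constant, the BOS id of 0, and the add-vs-concat choice are all assumptions:

```python
import numpy as np

def bigram_hash_ids(tokens, bigram_vocab_size=2048):
    """Map each (prev, cur) token pair to a bucket in a small hash table.

    Position 0 has no predecessor; a BOS id of 0 is assumed. The odd
    multiplier is an arbitrary mixing constant, not taken from the PR.
    """
    prev = [0] + list(tokens[:-1])
    return [(p * 1000003 + t) % bigram_vocab_size for p, t in zip(prev, tokens)]

# Look up a (2048, 96) table with these ids; the result would be added to
# the per-token embeddings to give the model cheap bigram context.
rng = np.random.default_rng(0)
table = rng.normal(size=(2048, 96))
ids = bigram_hash_ids([5, 17, 17, 9])
extra = table[ids]                      # (seq, 96) bigram features
```

At 2048 x 96 parameters this adds well under 1 MB even unquantized, consistent with the "compact" framing.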
Initialization
OrthoInit
Orthogonal initialization for large matrices with scaled projection weights.
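A minimal sketch of orthogonal initialization via QR decomposition of a Gaussian matrix; the `gain` used for the scaled projection weights is an assumption, since the PR does not record the exact scaling:

```python
import numpy as np

def ortho_init(fan_out, fan_in, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix, fix column signs so
    the result is uniformly distributed, and scale by `gain`.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=(fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q = q * np.sign(np.diag(r))          # sign fix for a uniform distribution
    if fan_out < fan_in:
        q = q.T                          # wide matrix: orthonormal rows instead
    return gain * q
```

Orthonormal columns (or rows, for wide matrices) keep activation norms stable at initialization, which large matrices benefit from most.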
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_end":0.99}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"used_for":"token/scalar optimizers"}
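The Muon entry records a momentum warmup from 0.92 to 0.99 but only the endpoints; a minimal sketch, assuming a linear schedule over the training run:

```python
def muon_momentum(frac_done, start=0.92, end=0.99):
    """Warm Muon's momentum from `start` to `end` as training progresses.

    frac_done is the fraction of training completed, in [0, 1]. The linear
    shape is an assumption; the PR records only the two endpoints.
    """
    frac_done = min(max(frac_done, 0.0), 1.0)
    return start + (end - start) * frac_done
```

Lower momentum early in training tolerates noisy initial gradients; raising it later smooths updates once the loss surface statistics stabilize.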
Weight Averaging
SWA
parameters: {"checkpoints":7}
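SWA over the 7 checkpoints can be done as a running equal-weight average, so only one extra copy of the weights is held at a time. A minimal sketch with plain lists standing in for weight tensors:

```python
def swa_update(avg, new, n_seen):
    """Fold one more checkpoint into a running equal-weight average.

    avg: current averaged weights; new: freshly saved checkpoint;
    n_seen: how many checkpoints avg already contains.
    """
    return [a + (w - a) / (n_seen + 1) for a, w in zip(avg, new)]

# Fold 7 checkpoints collected during warmdown into one averaged model.
checkpoints = [[float(i), float(2 * i)] for i in range(1, 8)]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
```

Averaging checkpoints taken while the learning rate decays tends to land closer to the center of the final loss basin than any single checkpoint.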
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
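A sketch of the sliding-window evaluation plan: after the first window, each window scores only its last 64 tokens, so every scored token gets (nearly) full left context while no token is scored twice. The context length of 512 here is an assumption; the PR records only stride=64:

```python
def sliding_eval_spans(n_tokens, context=512, stride=64):
    """Plan sliding-window eval spans as (start, end, score_from) triples.

    The first window scores everything it covers; each later window slides
    forward by `stride` and scores only the tokens in [score_from, end),
    conditioning on the preceding context tokens.
    """
    spans = [(0, min(context, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - context)
        spans.append((start, new_end, end))
        end = new_end
    return spans
```

A small stride costs more forward passes but gives each scored token more left context, which lowers measured bpb relative to naive chunked evaluation.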
LR Schedule
warmdown
parameters: {"fraction":0.15,"wallclock_based":true}
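A minimal sketch of the wallclock-based warmdown: the learning-rate multiplier is driven by elapsed time against the run's time budget rather than by iteration count, decaying linearly over the last 15% (the linear shape is the usual reading of "warmdown"; the PR records only the fraction):

```python
import time

def warmdown_lr_mult(t_start, t_budget, fraction=0.15, now=None):
    """LR multiplier from elapsed wallclock: 1.0 for the first 85% of the
    time budget, then a linear warmdown to 0 over the final `fraction`.

    Time-based progress stays correct even when torch.compile makes early
    iterations much slower than later ones, which skews iter-based schedules.
    """
    now = time.time() if now is None else now
    done = min(max((now - t_start) / t_budget, 0.0), 1.0)
    if done < 1.0 - fraction:
        return 1.0
    return max(0.0, (1.0 - done) / fraction)
```

Usage: multiply the base learning rate by `warmdown_lr_mult(t_start, t_budget)` each step, with `t_start` captured once at launch.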
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
Other
other
Wallclock-fraction warmdown to avoid iter-based scheduling issues under torch.compile overhead.
parameters: {"last_fraction":0.15}

Novel Contributions

  • Int6 grouped quantization for all weights
  • STE fake-quantization QAT during the last 15% of wallclock
  • Wallclock-fraction warmdown, sidestepping the iteration-count scheduling drift caused by torch.compile overhead
  • SWA with 7 checkpoints during warmdown
  • Compact BigramHash embedding and SmearGate additions
  • Orthogonal initialization for large matrices
  • Sliding-window evaluation with stride 64
  • zstd-22 artifact compression