PR #192 (open)

Record: 11L Int6 QAT + SmearGate + WD 0.038 (val_bpb=1.1502)

by baudrillardsgh0st
val_bpb: 1.1502
Architecture: GPT
Optimizer: Muon
Artifact Size: 15.50 MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: all
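The STE QAT entry (6 bits, scope: all) implies every weight tensor is fake-quantized to the int6 grid in the forward pass while gradients bypass the rounding via the straight-through estimator. A minimal numpy sketch of symmetric per-tensor int6 fake quantization; the symmetric scheme, the per-tensor scale, and the helper name are assumptions, not the PR's actual code:

```python
import numpy as np

def fake_quant_int6(w, q_min=-32, q_max=31):
    # Symmetric per-tensor scale mapping max |w| to the top int6 code.
    max_abs = np.abs(w).max()
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), q_min, q_max)
    # q * scale is the value the network sees in the forward pass; in an
    # autodiff framework the STE trick is written
    #   w + (q * scale - w).detach()
    # so the gradient flows through the rounding as identity.
    return q * scale, q.astype(np.int8)

w = np.array([0.5, -0.25, 0.1])
w_q, codes = fake_quant_int6(w)
```

With scope "all", this would be applied to every weight tensor each step; the int8-typed codes are what the storage container below would hold.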
Architecture
SmearGate
Learned gate blending current and previous token embeddings
parameters: {"params":513}
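The 513-parameter count is consistent with a per-token scalar gate computed from a 512-dim embedding (512 weights + 1 bias), though that mapping is an assumption. A sketch of such a gate blending each token's embedding with its predecessor's:

```python
import numpy as np

def smear_gate(x, w, b):
    # Per-token scalar gate in (0, 1) from a linear probe of the current
    # embedding. 513 parameters matches d = 512 weights + 1 bias,
    # an assumption about this PR's SmearGate.
    g = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                       # no previous token at position 0
    return (1.0 - g)[:, None] * x + g[:, None] * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # (T, d) toy embeddings
out = smear_gate(x, np.zeros(8), 0.0)   # zero init -> gate = 0.5 everywhere
```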
MLP3x
Transformer MLP widened to 3x
parameters: {"multiplier":3}
tied embeddings
Input and output embeddings are tied
parameters: null
KV head count
Uses fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
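With 8 attention heads but only 4 KV heads, each key/value head is shared by two query heads (grouped-query attention), halving the KV projection parameters and KV cache. The sharing step can be sketched as a repeat along the head axis (function name is illustrative):

```python
import numpy as np

def repeat_kv(kv, n_heads, n_kv_heads):
    # Each of the n_kv_heads key/value heads serves
    # n_heads // n_kv_heads query heads.
    group = n_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)

k = np.arange(24, dtype=float).reshape(4, 3, 2)  # (kv_heads, T, head_dim)
k8 = repeat_kv(k, n_heads=8, n_kv_heads=4)       # (8, 3, 2)
```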
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: null
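Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it, and the "decoupled" weight decay highlighted in the contributions subtracts lr * wd * p directly from the weights rather than folding the decay into the gradient. A sketch under assumptions: the quintic coefficients follow the public Muon implementation, and lr = 0.02 is an assumed value; only momentum 0.99 and weight decay 0.038 come from this PR:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Quintic Newton-Schulz iteration driving singular values toward 1;
    # coefficients follow the public Muon implementation (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_step(p, grad, buf, lr=0.02, momentum=0.99, wd=0.038):
    # Decoupled weight decay: lr * wd * p is subtracted directly,
    # independent of the orthogonalized momentum update.
    buf = momentum * buf + grad
    p = p - lr * wd * p - lr * newton_schulz_orth(buf)
    return p, buf
```

With a zero gradient this reduces to pure multiplicative shrinkage, p * (1 - lr * wd), which is what "decoupled" buys: the decay strength is independent of the update's orthogonalized scale.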
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"batch":32}
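Sliding-window evaluation with stride 64 re-runs the model on overlapping windows but scores each token only once, in the window where it has the longest left context (the batch parameter presumably groups 32 such windows per forward pass). The index bookkeeping, as a sketch:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    # Each window [begin, end) is fed to the model, but only tokens in
    # [prev_end, end) are scored, so every token is counted exactly once
    # with the longest left context the window allows.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(13, seq_len=8, stride=2)  # toy sizes
```

At the PR's settings (seq_len 2048, stride 64), every scored token after the first window sees at least 1984 tokens of context, at the cost of roughly 32x the forward passes of a non-overlapping eval.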
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
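The PR records only warmdown_steps = 3000; assuming the common constant-then-linear-to-zero shape, the schedule looks like:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3000, base_lr=1.0):
    # Assumed shape: hold base_lr, then decay linearly to zero over the
    # final warmdown_steps (the PR only records warmdown_steps=3000).
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```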
Regularization
weight decay
parameters: {"value":0.038}
Other
other
Int6-in-int8 container storage with restricted-range zstd compression
parameters: {"container":"int8","value_range":[-32,31]}
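The container trick: each int6 code occupies a full int8 byte, but restricting the byte values to [-32, 31] leaves only 64 distinct symbols, so the entropy coder can recover most of the wasted 2 bits. A sketch; the PR uses zstd at level 22, and zlib stands in here only to keep the example stdlib-only:

```python
import zlib
import numpy as np

def pack_int6_container(codes):
    # Each int6 code occupies a full int8 byte; clamping to [-32, 31]
    # leaves only 64 distinct byte values for the entropy coder to
    # exploit. The PR uses zstd level 22; zlib is a stdlib stand-in.
    codes = np.asarray(codes)
    assert codes.min() >= -32 and codes.max() <= 31
    return zlib.compress(codes.astype(np.int8).tobytes(), 9)

rng = np.random.default_rng(0)
codes = rng.integers(-32, 32, size=20000)   # uniform over the int6 range
blob = pack_int6_container(codes)
```

Even on uniform random codes the 64-symbol alphabet compresses below 8 bits per weight; real trained weights are far from uniform, so the gain is larger.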
other
FP16 tied embedding passthrough
parameters: null
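Tied embeddings store one matrix serving as both input embedding and output head, and the "FP16 tied embedding passthrough" suggests that single matrix is kept in FP16 and skipped by the int6 container (how this interacts with the scope: all QAT setting is not specified in the PR). A sketch with illustrative toy dimensions:

```python
import numpy as np

d_model, vocab = 8, 50                 # toy sizes, not the PR's config
rng = np.random.default_rng(0)
wte = rng.standard_normal((vocab, d_model)).astype(np.float16)

def embed(ids):
    # Input side of the tied matrix, upcast for the transformer blocks.
    return wte[ids].astype(np.float32)

def lm_logits(h):
    # Output head reuses the same FP16 matrix transposed: stored once,
    # and (per the PR) passed through rather than int6-quantized.
    return h @ wte.T.astype(np.float32)

logits = lm_logits(embed(np.array([1, 2, 3])))
```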

Novel Contributions

  • 11-layer GPT enabled by int6 compression
  • STE int6 quantization-aware training
  • SmearGate learned embedding blend between current and previous token
  • Decoupled Muon weight decay tuned for int6 quantization
  • Int6-in-int8 storage with zstd-22 compression
  • Sliding window evaluation with stride 64