PR #194
Record: 11L Int6 QAT + SmearGate + SWA + SAM: 1.1480 BPB (3-seed mean)
by baudrillardsgh0st
val_bpb
1.1480
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.33 MiB
Training Techniques
Quantization
STE QAT
bits: 6
scope: all weights except the tied embeddings, which are kept in fp16
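A minimal numpy sketch of int6 QAT with a straight-through estimator (STE). Per-tensor symmetric scaling is an assumption; the record does not specify the quantization granularity.

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Symmetric per-tensor fake quantization to the int6 range [-32, 31].
    The forward pass sees the quantized weights; under a straight-through
    estimator the backward pass treats this op as identity, so the
    underlying fp weights receive the unmodified gradient."""
    scale = np.abs(w).max() / 31.0 + eps
    codes = np.clip(np.round(w / scale), -32, 31)
    return codes * scale
```

At export time only the integer codes and the scale are stored; during training the full-precision weights keep being updated.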
Architecture
SmearGate
Per-dimension learned gate blending current and previous token embeddings.
parameters: {"dimensions":512}
MLP3x
Expanded MLP hidden size to 3x the model dimension.
parameters: {"multiplier":3}
tied embeddings
Input embeddings and the output projection share one weight matrix, which is kept in fp16 and passed through quantization unchanged.
parameters: null
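A minimal sketch of the tying, assuming the usual scheme where the embedding table doubles as the output projection (class and method names are illustrative):

```python
import numpy as np

class TiedEmbedding:
    """One shared matrix serves as both the input embedding lookup and
    the output projection (logits = h @ E.T). Kept in fp16 and excluded
    from int6 quantization, per the record."""
    def __init__(self, vocab_size, d_model, rng=None):
        rng = rng or np.random.default_rng(0)
        self.E = (rng.standard_normal((vocab_size, d_model)) * 0.02).astype(np.float16)

    def embed(self, token_ids):
        return self.E[token_ids]

    def logits(self, h):
        return h.astype(np.float16) @ self.E.T
```

Tying halves the parameter count of the largest matrices, which matters directly for the artifact size limit.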
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
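The listed other_params imply a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A sketch, assuming a linear ramp (the record gives only the endpoints, not the schedule shape):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp from `start` to `end` over the
    first `warmup_steps` optimizer steps, then hold `end`.
    The linear shape is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * t
```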
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.5}
Compression
zstd
level: 22
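Per the contributions list, the int6 codes are stored one per int8 byte rather than bit-packed. A sketch of that container format; since only 64 distinct byte values occur, the stream is low-entropy and zstd at level 22 can exploit it:

```python
import numpy as np

def to_int8_container(w):
    """Quantize to int6 codes ([-32, 31]) but store each code in a full
    int8 byte; the resulting byte stream has at most 64 symbols, which
    zstd's entropy coding compresses well."""
    scale = np.abs(w).max() / 31.0 + 1e-8
    codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return codes, scale

def from_int8_container(codes, scale):
    return codes.astype(np.float32) * scale
```

The artifact would then be compressed with something like `zstandard.ZstdCompressor(level=22).compress(codes.tobytes())` (using the common `zstandard` Python binding; the submission's exact tooling is not stated).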
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Initialization
OrthoInit
Orthogonal initialization used to support SmearGate and improve training stability.
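Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix; a sketch (the submission's exact init code is not shown):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix and sign-correct
    by diag(R) so the result is uniform over the orthogonal group."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs
    if rows < cols:
        q = q.T
    return gain * q
```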
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Regularization
weight decay
parameters: {"weight_decay":0.038}
Other
other
Sharpness-Aware Minimization (SAM) applied during training to flatten the loss landscape and improve quantization robustness.
parameters: {"rho":0.05,"frac":0.03}
Novel Contributions
- First introduction of SAM to the competition
- Per-dimension SmearGate with learned sigmoid gating over embedding dimensions
- Int6 QAT with int6 values stored in int8 containers for better zstd compression
- Combination of SWA and SAM to improve post-quantization robustness
- Use of sliding-window evaluation to recover additional BPB
- 11-layer architecture that fits under the artifact size limit with int6 compression