val_bpb: 1.1575
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.73 MB
Training Techniques
Quantization: STE QAT (quantization-aware training with a straight-through estimator)
  bits: 6
  scope: all
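As a rough sketch of what int6 STE QAT does in the forward pass, assuming symmetric per-tensor quantization (the exact scheme used here is not specified): weights are rounded onto the 6-bit grid, and the straight-through estimator treats that rounding as identity in the backward pass so gradients flow to the underlying float weights.

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Forward pass of symmetric per-tensor int6 fake quantization.

    With a straight-through estimator, the backward pass would treat
    this entire function as identity (gradients pass through the
    rounding unchanged). Per-tensor symmetric scaling is an assumption.
    """
    qmax = 2 ** (6 - 1) - 1                    # int6 symmetric range: [-31, 31]
    scale = np.abs(w).max() / qmax             # per-tensor scale
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax, qmax)   # quantize to the grid
    return q * scale                                # dequantize back to float

w = np.array([0.31, -0.155, 0.02])
wq = fake_quant_int6(w)
```

The quantization error per element is bounded by half the grid spacing, which is what makes training through the rounding tolerable.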
Architecture: SmearGate
  A per-dimension learned gate that blends each token's representation with its predecessor's.
  parameters: null
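A hypothetical reconstruction of the smear operation (the gate parameterization and the sigmoid squashing are assumptions, as no parameters are given): each channel independently mixes the current token with the previous one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Blend each token with its predecessor, gated per dimension.

    x: (seq_len, dim) token activations
    g: (dim,) learned gate logits; sigmoid(g) in (0, 1) controls how
       much of the previous token leaks into the current one.
    The first token has no predecessor and is left unchanged.
    """
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # shift sequence right by one
    a = sigmoid(g)                                  # per-dimension blend weight
    return (1.0 - a) * x + a * prev

x = np.array([[1.0, 2.0], [3.0, 4.0]])
g = np.array([0.0, 100.0])   # blend weights ~ [0.5, 1.0]
y = smear_gate(x, g)
```

With one scalar gate per channel, the cost is a single extra vector of parameters, which fits the tight artifact-size budget.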
MLP3x
  Feed-forward network width expanded to 3x the model dimension.
  parameters: {"multiplier": 3}
KV head count
  Grouped-query attention (GQA): 8 attention heads share 4 KV heads.
  parameters: {"heads": 8, "kv_heads": 4}
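One standard way to realize 8 query heads over 4 KV heads is to let each KV head serve two query heads; a minimal single-sequence attention sketch (head dimension and sequence length below are illustrative assumptions):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads.

    q: (n_heads, seq, head_dim)
    k, v: (n_kv_heads, seq, head_dim)
    """
    group = n_heads // n_kv_heads            # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)          # broadcast KV to (n_heads, seq, head_dim)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over keys
    return w @ v                             # (n_heads, seq, head_dim)

q = np.random.randn(8, 16, 32)
k = np.random.randn(4, 16, 32)
v = np.random.randn(4, 16, 32)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves the KV projection parameters, which matters under a strict artifact-size budget.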
Tied embeddings
  Input and output embeddings are tied; the shared embedding matrix is kept in FP16.
  parameters: null
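In a minimal sketch, tying means the output head reuses the input embedding matrix as its projection, so only one matrix is stored in the artifact (shapes below are illustrative):

```python
import numpy as np

vocab, dim = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, dim)).astype(np.float16)  # one shared FP16 matrix

def embed(token_ids):
    """Input side: look up rows of the shared embedding matrix."""
    return E[token_ids]

def logits(hidden):
    """Output side: project with the transpose of the same matrix."""
    return hidden @ E.T
```

Keeping this one matrix in FP16 (rather than the int6 used elsewhere) trades a little size for embedding fidelity.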
Weight Averaging: SWA (stochastic weight averaging)
  parameters: {"every_steps": 50, "start_frac": 0.5, "num_checkpoints": 27}
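The parameters read as: starting halfway through training, take a snapshot every 50 steps and fold it into a running average. A minimal running-mean sketch (the dict-of-lists weight representation is purely illustrative):

```python
def swa_update(avg, weights, n_averaged):
    """Fold one checkpoint into the running SWA average.

    avg: current averaged weights (dict of name -> list of floats)
    weights: the new checkpoint, same structure
    n_averaged: how many checkpoints are already in avg
    """
    if n_averaged == 0:
        return {k: list(v) for k, v in weights.items()}, 1
    new_avg = {
        k: [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg[k], weights[k])]
        for k in avg
    }
    return new_avg, n_averaged + 1
```

The running-mean form avoids holding all 27 checkpoints in memory at once.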
Optimizer: Muon
  weight_decay: 0.038
  momentum: 0.99
  other_params: {"momentum_warmup": "0.92 -> 0.99 over 1500 steps"}
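The momentum warmup reads as a ramp from 0.92 to 0.99 over the first 1500 steps; a sketch, assuming the ramp is linear (the interpolation shape is not stated):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly ramp momentum from `start` to `end` over `warmup_steps`,
    then hold it at `end` for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients.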
Compression: zstd
  level: 22
Evaluation: sliding window eval
  parameters: {"stride": 64}
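Sliding-window evaluation with a small stride lets most tokens be scored with near-full context: the 2048-token window advances 64 tokens at a time and only the newly exposed tokens are scored. A sketch of the bookkeeping (the exact accounting used here is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans: score tokens
    [score_from, end) using context [start, end), so every token is
    scored exactly once while later tokens see near-full context."""
    spans = []
    pos = 0                               # next token index to score
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_windows(4096)
```

The cost is roughly window/stride = 32 forward passes over each token's worth of text, which is why the stride is an eval-time-only choice.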
Sequence Length
  train_length: 2048
  eval_length: 2048
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
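"Warmdown" can be sketched as a constant learning rate followed by a decay to zero over the final 3000 steps; the linear-to-zero shape below is an assumption, since only the step count is given:

```python
def lr_at(step, base_lr, total_steps=9156, warmdown_steps=3000):
    """Constant LR, then linear decay to 0 over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```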
Regularization: weight decay
  parameters: {"weight_decay": 0.038}
Other
  Per-dimension SmearGate, plus a 10-layer depth chosen for step throughput: the shallower model maximizes the number of training steps that fit in a 10-minute wall-clock budget.
  parameters: {"layers": 10, "step_time_ms": 65.49, "steps": 9156}
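The reported step count is consistent with the wall-clock budget: 600,000 ms at 65.49 ms/step allows at most 9161 steps, and 9156 were run (the small gap plausibly covers setup and eval overhead). A quick check:

```python
budget_ms = 10 * 60 * 1000          # 10-minute wall-clock budget in ms
step_time_ms = 65.49                # reported per-step time
max_steps = int(budget_ms // step_time_ms)
```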
Novel Contributions
- 10-layer configuration chosen to improve step throughput under the 10-minute wall-clock constraint
- Systematic analysis across 17 experiments comparing architectures, LR schedules, quantization settings, and data scaling
- Int6 QAT with a straight-through estimator, combined with per-dimension SmearGate and SWA
- Demonstration that 10 layers outperform 11: the faster step time yields more training steps within the budget
- Sliding-window evaluation with stride 64 and zstd level-22 artifact compression