PR #385

Status: open

Non-record: 11L Int6 QAT + SmearGate + SWA(0.4) + WD=0.04 (3-seed mean val_bpb=1.1488)

by dentity007

val_bpb: 1.1488
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Quantization
  STE QAT (bits: 6, scope: all)
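The quantization entry above (STE QAT at 6 bits across all weights) corresponds to training against fake-quantized weights with a straight-through estimator. A minimal sketch, assuming symmetric per-row scaling; the function name and details are illustrative, not the PR's actual code:

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Symmetric per-row fake quantization to int6 (codes in [-31, 31]).

    Forward pass of QAT: weights are rounded onto the int6 grid and
    immediately dequantized, so training sees the quantization error.
    In an autodiff framework the straight-through estimator would pass
    gradients through unchanged, e.g. in PyTorch:
        w_q = w + (fake_quant_int6(w) - w).detach()
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for int6
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax + eps
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.random.randn(4, 8).astype(np.float32)
w_q = fake_quant_int6(w)
# w_q lies on a 63-level grid per row and stays close to w
assert np.abs(w - w_q).max() <= np.abs(w).max() / 31
```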
Architecture
  SmearGate: per-dim gating mechanism used in the model.
  Tied embeddings: input and output embeddings are tied, with FP16 passthrough.
  RoPE: rotary positional embeddings.
  MLP3x: 3x MLP expansion (mlp_mult: 3, hidden_size: 1536).
  KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
  U-Net skip connections: skip connections added in a U-Net-like pattern.
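The KV head setting (heads: 8, kv_heads: 4) is standard grouped-query attention: each KV head is shared by two query heads, halving KV-cache size. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves a group of
    n_heads // n_kv_heads query heads (here 8 // 4 = 2).

    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)
    """
    group = n_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=0)              # (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (n_heads, T, d)

T, d = 5, 16
q = np.random.randn(8, T, d)
k = np.random.randn(4, T, d)
v = np.random.randn(4, T, d)
out = gqa_attention(q, k, v)
assert out.shape == (8, T, d)
```

The memory saving comes entirely from storing 4 KV heads instead of 8; the `np.repeat` here materializes them only for clarity.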
Optimizer
  Muon (weight_decay: 0.04, momentum: 0.99, lr: 0.02)
  AdamW (lr: 0.03, scope: embeddings)
Weight Averaging
  SWA (start_frac: 0.4, every: 50)
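The SWA settings (start_frac: 0.4, every: 50) amount to averaging every 50th checkpoint once 40% of training has elapsed; lowering start_frac from 0.5 folds in more checkpoints, per the contributions below. A minimal sketch (helper names are hypothetical):

```python
import numpy as np

def swa_average(checkpoints, total_steps, start_frac=0.4, every=50):
    """Average parameter snapshots taken every `every` steps once
    training has passed `start_frac` of its total length.

    `checkpoints` maps step -> flat parameter vector.
    """
    start = int(start_frac * total_steps)
    picked = [w for step, w in sorted(checkpoints.items())
              if step >= start and step % every == 0]
    return np.mean(picked, axis=0)

# Toy run: parameters at step s are the vector [s, s, s].
ckpts = {step: np.full(3, float(step)) for step in range(0, 1001, 50)}
avg = swa_average(ckpts, total_steps=1000)
# steps 400..1000 inclusive are averaged -> mean step 700
assert np.allclose(avg, 700.0)
```

In practice the average is usually maintained as a running mean rather than by storing every checkpoint.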
Compression
  zstd (level: 22)
Evaluation
  Sliding window eval (stride: 64, batch_seqs: 32)
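Sliding window eval with stride 64 presumably scores each chunk of 64 tokens conditioned on a long preceding context, so every token sees close to a full window. This is an assumed reading, sketched with a placeholder `score_fn`; batching windows across 32 sequences (batch_seqs) is omitted:

```python
def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    """Score `tokens` in chunks of `stride`, each conditioned on up to
    `window - stride` prior tokens. score_fn(context, targets) must
    return the total NLL of `targets` in bits given `context`.
    Returns mean bits per token.
    """
    total_bits = 0.0
    n_scored = 0
    for start in range(0, len(tokens), stride):
        targets = tokens[start:start + stride]
        context = tokens[max(0, start - (window - stride)):start]
        total_bits += score_fn(context, targets)
        n_scored += len(targets)
    return total_bits / n_scored

# Dummy scorer charging 1 bit per target token, just to exercise the loop.
tokens = list(range(100))
bpb = sliding_window_eval(lambda ctx, tgt: float(len(tgt)), tokens)
assert bpb == 1.0
```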
Sequence Length
  train_length: 2048, eval_length: null
LR Schedule
  Warmdown (warmdown_iters: 3000)
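The schedule entry lists only warmdown_iters: 3000. Assuming the common warmdown shape of a constant LR followed by a linear decay to zero over the final warmdown_iters steps (the exact shape is not stated in the PR), it might look like:

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_iters steps. base_lr=0.02 matches the Muon lr above;
    the decay shape itself is an assumption."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac

assert warmdown_lr(0, 10_000) == 0.02       # flat phase
assert warmdown_lr(10_000, 10_000) == 0.0   # fully decayed
assert abs(warmdown_lr(8_500, 10_000) - 0.01) < 1e-12  # halfway down
```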
Regularization
  Weight decay (value: 0.04)

Novel Contributions

  • Muon weight decay increased from 0.038 to 0.04 to improve int6 quantization quality
  • SWA start fraction reduced from 0.5 to 0.4 to average more checkpoints and smooth weights
  • 3-seed verified int6 QAT submission with low variance (std=0.0006)
  • SmearGate-based architecture combined with SWA and int6 quantization
  • Per-row symmetric int6 quantization in int8 containers with FP16 passthrough for tied embeddings
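The last bullet describes per-row symmetric int6 quantization stored in int8 containers, with FP16 passthrough for the tied embeddings. A minimal sketch, under the assumption that each int6 code occupies one int8 byte (the PR's actual packing is not shown) and that scales are kept in FP16; the FP16 passthrough would simply skip quantization for the tied embedding tensors:

```python
import numpy as np

def quantize_int6_rows(w):
    """Per-row symmetric int6 quantization in int8 containers.

    Each row gets its own scale; codes lie in [-31, 31] but are stored
    as ordinary int8 values (one int6 code per byte, no bit-packing).
    """
    qmax = 31                                   # 2**(6-1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # guard all-zero rows
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale.astype(np.float16)      # FP16 scales

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale.astype(np.float32)

w = np.random.randn(16, 64).astype(np.float32)
codes, scale = quantize_int6_rows(w)
assert codes.dtype == np.int8
assert codes.min() >= -31 and codes.max() <= 31
w_hat = dequantize(codes, scale)
```

Storing int6 codes in int8 containers trades a third of the container bits for simplicity; the zstd level-22 pass listed above would then reclaim much of that redundancy at rest.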