PR #385

Status: open

Non-record: 11L Int6 QAT + SmearGate + SWA(0.4) + WD=0.04 (3-seed mean val_bpb=1.1488)

by dentity007

val_bpb: 1.1488
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Quantization
  STE QAT (bits: 6, scope: all)
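The quantization entry above (STE QAT at 6 bits across all weights) corresponds to training against fake-quantized weights with a straight-through estimator. A minimal sketch, assuming symmetric per-row scaling; the function name and details are illustrative, not the PR's actual code:

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Symmetric per-row fake quantization to int6 (codes in [-31, 31]).

    Forward pass of QAT: weights are rounded onto the int6 grid and
    immediately dequantized, so training sees the quantization error.
    In an autodiff framework the straight-through estimator would pass
    gradients through unchanged, e.g. in PyTorch:
        w_q = w + (fake_quant_int6(w) - w).detach()
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for int6
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax + eps
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

w = np.random.randn(4, 8).astype(np.float32)
w_q = fake_quant_int6(w)
# w_q lies on a 63-level grid per row and stays close to w
assert np.abs(w - w_q).max() <= np.abs(w).max() / 31
```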
Architecture
  SmearGate: per-dim gating mechanism used in the model.
  Tied embeddings: input and output embeddings are tied, with FP16 passthrough.
  RoPE: rotary positional embeddings.
  MLP3x: 3x MLP expansion (mlp_mult: 3, hidden_size: 1536).
  KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
  U-Net skip connections: skip connections added in a U-Net-like pattern.
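The KV head setting (heads: 8, kv_heads: 4) is standard grouped-query attention: each KV head is shared by two query heads, halving KV-cache size. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves a group of
    n_heads // n_kv_heads query heads (here 8 // 4 = 2).

    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)
    """
    group = n_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=0)              # (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (n_heads, T, d)

T, d = 5, 16
q = np.random.randn(8, T, d)
k = np.random.randn(4, T, d)
v = np.random.randn(4, T, d)
out = gqa_attention(q, k, v)
assert out.shape == (8, T, d)
```

The memory saving comes entirely from storing 4 KV heads instead of 8; the `np.repeat` here materializes them only for clarity.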
Optimizer
  Muon (weight_decay: 0.04, momentum: 0.99, lr: 0.02)
  AdamW (lr: 0.03, scope: embeddings)
Weight Averaging
  SWA (start_frac: 0.4, every: 50)
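The SWA settings (start_frac: 0.4, every: 50) amount to averaging every 50th checkpoint once 40% of training has elapsed; lowering start_frac from 0.5 folds in more checkpoints, per the contributions below. A minimal sketch (helper names are hypothetical):

```python
import numpy as np

def swa_average(checkpoints, total_steps, start_frac=0.4, every=50):
    """Average parameter snapshots taken every `every` steps once
    training has passed `start_frac` of its total length.

    `checkpoints` maps step -> flat parameter vector.
    """
    start = int(start_frac * total_steps)
    picked = [w for step, w in sorted(checkpoints.items())
              if step >= start and step % every == 0]
    return np.mean(picked, axis=0)

# Toy run: parameters at step s are the vector [s, s, s].
ckpts = {step: np.full(3, float(step)) for step in range(0, 1001, 50)}
avg = swa_average(ckpts, total_steps=1000)
# steps 400..1000 inclusive are averaged -> mean step 700
assert np.allclose(avg, 700.0)
```

In practice the average is usually maintained as a running mean rather than by storing every checkpoint.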
Compression
  zstd (level: 22)
Evaluation
  Sliding window eval (stride: 64, batch_seqs: 32)
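Sliding window eval with stride 64 presumably scores each chunk of 64 tokens conditioned on a long preceding context, so every token sees close to a full window. This is an assumed reading, sketched with a placeholder `score_fn`; batching windows across 32 sequences (batch_seqs) is omitted:

```python
def sliding_window_eval(score_fn, tokens, window=2048, stride=64):
    """Score `tokens` in chunks of `stride`, each conditioned on up to
    `window - stride` prior tokens. score_fn(context, targets) must
    return the total NLL of `targets` in bits given `context`.
    Returns mean bits per token.
    """
    total_bits = 0.0
    n_scored = 0
    for start in range(0, len(tokens), stride):
        targets = tokens[start:start + stride]
        context = tokens[max(0, start - (window - stride)):start]
        total_bits += score_fn(context, targets)
        n_scored += len(targets)
    return total_bits / n_scored

# Dummy scorer charging 1 bit per target token, just to exercise the loop.
tokens = list(range(100))
bpb = sliding_window_eval(lambda ctx, tgt: float(len(tgt)), tokens)
assert bpb == 1.0
```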
Sequence Length
  train_length: 2048, eval_length: null
LR Schedule
  Warmdown (warmdown_iters: 3000)
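The schedule entry lists only warmdown_iters: 3000. Assuming the common warmdown shape of a constant LR followed by a linear decay to zero over the final warmdown_iters steps (the exact shape is not stated in the PR), it might look like:

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_iters steps. base_lr=0.02 matches the Muon lr above;
    the decay shape itself is an assumption."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac

assert warmdown_lr(0, 10_000) == 0.02       # flat phase
assert warmdown_lr(10_000, 10_000) == 0.0   # fully decayed
assert abs(warmdown_lr(8_500, 10_000) - 0.01) < 1e-12  # halfway down
```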
Regularization
  Weight decay (value: 0.04)

Novel Contributions

  • Muon weight decay increased from 0.038 to 0.04 to improve int6 quantization quality
  • SWA start fraction reduced from 0.5 to 0.4 to average more checkpoints and smooth weights
  • 3-seed verified int6 QAT submission with low variance (std=0.0006)
  • SmearGate-based architecture combined with SWA and int6 quantization
  • Per-row symmetric int6 quantization in int8 containers with FP16 passthrough for tied embeddings
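The last bullet describes per-row symmetric int6 quantization stored in int8 containers, with FP16 passthrough for the tied embeddings. A minimal sketch, under the assumption that each int6 code occupies one int8 byte (the PR's actual packing is not shown) and that scales are kept in FP16; the FP16 passthrough would simply skip quantization for the tied embedding tensors:

```python
import numpy as np

def quantize_int6_rows(w):
    """Per-row symmetric int6 quantization in int8 containers.

    Each row gets its own scale; codes lie in [-31, 31] but are stored
    as ordinary int8 values (one int6 code per byte, no bit-packing).
    """
    qmax = 31                                   # 2**(6-1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # guard all-zero rows
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale.astype(np.float16)      # FP16 scales

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale.astype(np.float32)

w = np.random.randn(16, 64).astype(np.float32)
codes, scale = quantize_int6_rows(w)
assert codes.dtype == np.int8
assert codes.min() >= -31 and codes.max() <= 31
w_hat = dequantize(codes, scale)
```

Storing int6 codes in int8 containers trades a third of the container bits for simplicity; the zstd level-22 pass listed above would then reclaim much of that redundancy at rest.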