PR #488 (open)

Record: 11L Int6 QAT + Warmdown (val_bpb=1.3267, 1xH100)

  • val_bpb: 1.3267
  • Architecture: Transformer
  • Optimizer: Muon
  • Artifact Size: 13.3 MB

Training Techniques

Quantization
STE QAT int6
Quantization-aware training with straight-through-estimator (STE) fake quantization to 6-bit integers.
bits: 6
scope: all weights
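The PR records int6 QAT over all weights, and the contributions list calls it grouped quantization. A minimal sketch of the forward pass, assuming symmetric per-group scales and a group size of 64 (the group size and symmetric scheme are assumptions, not stated in the record):

```python
import numpy as np

def fake_quant_int6(w, group_size=64):
    """Fake-quantize weights to int6 with one symmetric scale per group.

    Forward pass only: round to the int6 grid, then dequantize. During QAT
    the straight-through estimator passes gradients through unchanged,
    i.e. grad wrt w is taken to equal grad wrt the quantized output.
    """
    qmax = 2 ** (6 - 1) - 1                       # int6 grid: [-32, 31]
    flat = w.reshape(-1, group_size)              # one scale per group
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)           # dequantized ("fake") weights
```

Per-group scales keep the quantization error proportional to each group's own magnitude rather than the whole tensor's, which matters at 6 bits.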
Architecture
MLP3x
Uses a 3x MLP expansion in an 11-layer Transformer backbone.
parameters: {"layers":11,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":3}
GQA
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
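With 8 attention heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the KV-head sharing step (the function name and layout are illustrative, not from the PR):

```python
import numpy as np

def repeat_kv(kv, num_heads, num_kv_heads):
    """Expand grouped KV heads so each query head has a KV head to attend to.

    kv: (batch, num_kv_heads, seq, head_dim)
    returns: (batch, num_heads, seq, head_dim)
    """
    groups = num_heads // num_kv_heads   # 8 // 4 = 2 query heads per KV head
    return np.repeat(kv, groups, axis=1)
```

The KV cache (and KV projection parameters) shrink by num_heads / num_kv_heads, here 2x, while query capacity is unchanged.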
SmearGate
Adds a SmearGate module at the embedding layer to inject additional signal.
parameters: null
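The PR does not spell out SmearGate's mechanics. One plausible reading — an assumption, not the PR's definition — is a learned per-channel gate that smears each token's embedding with its predecessor's:

```python
import numpy as np

def smear_gate(emb, gate):
    """Hypothetical SmearGate: add a gated copy of the previous token's
    embedding to each position, injecting extra signal at the embedding layer.

    emb: (seq, dim); gate: (dim,), assumed already squashed into [0, 1].
    Position 0 has no predecessor and is left unchanged.
    """
    out = emb.copy()
    out[1:] = emb[1:] + gate * emb[:-1]
    return out
```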
BigramHash
Adds a compact bigram hash embedding for extra context.
parameters: {"bigram_vocab_size":2048,"bigram_dim":96}
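A sketch of how a hashed bigram embedding with the recorded sizes (2048 buckets, 96 dims) could work: hash each (previous token, current token) pair into a small table and add the looked-up vector to the token embedding. The hash constant, the BOS id of 0, and the add-vs-concat choice are all assumptions:

```python
import numpy as np

def bigram_hash_ids(tokens, bigram_vocab_size=2048):
    """Map each (prev, cur) token pair to a bucket in a small hash table.

    Position 0 has no predecessor; a BOS id of 0 is assumed. The odd
    multiplier is an arbitrary mixing constant, not taken from the PR.
    """
    prev = [0] + list(tokens[:-1])
    return [(p * 1000003 + t) % bigram_vocab_size for p, t in zip(prev, tokens)]

# Look up a (2048, 96) table with these ids; the result would be added to
# the per-token embeddings to give the model cheap bigram context.
rng = np.random.default_rng(0)
table = rng.normal(size=(2048, 96))
ids = bigram_hash_ids([5, 17, 17, 9])
extra = table[ids]                      # (seq, 96) bigram features
```

At 2048 x 96 parameters this adds well under 1 MB even unquantized, consistent with the "compact" framing.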
Initialization
OrthoInit
Orthogonal initialization for large matrices with scaled projection weights.
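A minimal sketch of orthogonal initialization via QR decomposition of a Gaussian matrix; the `gain` used for the scaled projection weights is an assumption, since the PR does not record the exact scaling:

```python
import numpy as np

def ortho_init(fan_out, fan_in, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix, fix column signs so
    the result is uniformly distributed, and scale by `gain`.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=(fan_out, fan_in))
    q, r = np.linalg.qr(a if fan_out >= fan_in else a.T)
    q = q * np.sign(np.diag(r))          # sign fix for a uniform distribution
    if fan_out < fan_in:
        q = q.T                          # wide matrix: orthonormal rows instead
    return gain * q
```

Orthonormal columns (or rows, for wide matrices) keep activation norms stable at initialization, which large matrices benefit from most.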
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_end":0.99}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"used_for":"token/scalar optimizers"}
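The Muon entry records a momentum warmup from 0.92 to 0.99 but only the endpoints; a minimal sketch, assuming a linear schedule over the training run:

```python
def muon_momentum(frac_done, start=0.92, end=0.99):
    """Warm Muon's momentum from `start` to `end` as training progresses.

    frac_done is the fraction of training completed, in [0, 1]. The linear
    shape is an assumption; the PR records only the two endpoints.
    """
    frac_done = min(max(frac_done, 0.0), 1.0)
    return start + (end - start) * frac_done
```

Lower momentum early in training tolerates noisy initial gradients; raising it later smooths updates once the loss surface statistics stabilize.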
Weight Averaging
SWA
parameters: {"checkpoints":7}
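SWA over the 7 checkpoints can be done as a running equal-weight average, so only one extra copy of the weights is held at a time. A minimal sketch with plain lists standing in for weight tensors:

```python
def swa_update(avg, new, n_seen):
    """Fold one more checkpoint into a running equal-weight average.

    avg: current averaged weights; new: freshly saved checkpoint;
    n_seen: how many checkpoints avg already contains.
    """
    return [a + (w - a) / (n_seen + 1) for a, w in zip(avg, new)]

# Fold 7 checkpoints collected during warmdown into one averaged model.
checkpoints = [[float(i), float(2 * i)] for i in range(1, 8)]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
```

Averaging checkpoints taken while the learning rate decays tends to land closer to the center of the final loss basin than any single checkpoint.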
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
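A sketch of the sliding-window evaluation plan: after the first window, each window scores only its last 64 tokens, so every scored token gets (nearly) full left context while no token is scored twice. The context length of 512 here is an assumption; the PR records only stride=64:

```python
def sliding_eval_spans(n_tokens, context=512, stride=64):
    """Plan sliding-window eval spans as (start, end, score_from) triples.

    The first window scores everything it covers; each later window slides
    forward by `stride` and scores only the tokens in [score_from, end),
    conditioning on the preceding context tokens.
    """
    spans = [(0, min(context, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - context)
        spans.append((start, new_end, end))
        end = new_end
    return spans
```

A small stride costs more forward passes but gives each scored token more left context, which lowers measured bpb relative to naive chunked evaluation.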
LR Schedule
warmdown
parameters: {"fraction":0.15,"wallclock_based":true}
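A minimal sketch of the wallclock-based warmdown: the learning-rate multiplier is driven by elapsed time against the run's time budget rather than by iteration count, decaying linearly over the last 15% (the linear shape is the usual reading of "warmdown"; the PR records only the fraction):

```python
import time

def warmdown_lr_mult(t_start, t_budget, fraction=0.15, now=None):
    """LR multiplier from elapsed wallclock: 1.0 for the first 85% of the
    time budget, then a linear warmdown to 0 over the final `fraction`.

    Time-based progress stays correct even when torch.compile makes early
    iterations much slower than later ones, which skews iter-based schedules.
    """
    now = time.time() if now is None else now
    done = min(max((now - t_start) / t_budget, 0.0), 1.0)
    if done < 1.0 - fraction:
        return 1.0
    return max(0.0, (1.0 - done) / fraction)
```

Usage: multiply the base learning rate by `warmdown_lr_mult(t_start, t_budget)` each step, with `t_start` captured once at launch.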
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
Other
other
Wallclock-fraction warmdown to avoid iter-based scheduling issues under torch.compile overhead.
parameters: {"last_fraction":0.15}

Novel Contributions

  • Int6 grouped quantization for all weights
  • STE fake-quantization QAT during the last 15% of wallclock
  • Wallclock-fraction warmdown, sidestepping the iteration-count scheduling drift caused by torch.compile overhead
  • SWA with 7 checkpoints during warmdown
  • Compact BigramHash embedding and SmearGate additions
  • Orthogonal initialization for large matrices
  • Sliding-window evaluation with stride 64
  • zstd-22 artifact compression