PR #170

open

Record: Int6 QAT + SmearGate + Muon WD (val_bpb=1.1669)

by baudrillardsgh0st
val_bpb: 1.1669
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.77 MB

Training Techniques

Quantization
STE QAT (bits: 6, scope: all weights)
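A minimal sketch of the fake-quantization forward pass this describes: int6 quantization-aware training with per-row symmetric scaling. The rounding details are assumptions; in QAT, the backward pass treats this op as the identity (the straight-through estimator).

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Fake-quantize a weight matrix to int6 with per-row symmetric scaling.

    Symmetric int6 range: [-31, 31] (2**(6-1) - 1 = 31). During training,
    gradients pass straight through (STE), so the model learns weights
    that survive rounding to 63 levels per row.
    """
    qmax = 2 ** (6 - 1) - 1                        # 31
    # Per-row scale: the largest |w| in each row maps to qmax.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard empty/zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer int6 levels
    return q * scale                               # dequantized fp weights

w = np.random.randn(4, 8).astype(np.float32)
w_q = fake_quant_int6(w)
```

Per-row (rather than per-tensor) scaling keeps the quantization error of each row bounded by half a level of that row's own range.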
Compression
zstd (level: 22)
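The storage trick, sketched below: int6 levels sit in int8 containers, wasting 2 bits per value, but the low entropy of the payload lets a strong entropy coder reclaim most of the slack. The PR uses zstd at level 22; stdlib zlib stands in here so the sketch is self-contained.

```python
import zlib
import numpy as np

# Quantized int6 levels occupy [-31, 31]; int8 containers keep decode
# trivial, and compression recovers the 2 unused bits per value.
levels = np.clip(np.round(np.random.randn(1024, 256) * 10), -31, 31).astype(np.int8)

raw = levels.tobytes()
# The PR compresses with zstd at level 22; zlib is a stdlib stand-in.
compressed = zlib.compress(raw, level=9)

restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(levels.shape)
```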
Architecture
SmearGate (513 parameters)
Learned gate blending current and previous token embeddings to add cheap bigram context.
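A sketch of what such a gate could look like. The 513-parameter count is consistent with a per-token scalar gate computed as a linear map from a d=512 embedding plus a bias; that layout is an assumption, as is the convex-blend form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w, b):
    """Blend each token embedding with its predecessor via a learned gate.

    x: (seq, d) token embeddings; w: (d,) and b: scalar give 513
    parameters at d=512 (an assumed layout, matching the listed count).
    """
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                    # first token has no predecessor
    g = sigmoid(x @ w + b)[:, None]  # per-token scalar gate in (0, 1)
    return (1.0 - g) * x + g * prev  # convex blend adds bigram context

d = 512
x = np.random.randn(16, d)
w = np.zeros(d); b = 0.0             # gate = 0.5 everywhere at this init
y = smear_gate(x, w, b)
```

"Smearing" the previous token into the current position gives every layer above cheap bigram context for 513 parameters, versus the cost of widening attention.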
Tied embeddings
Input/output embeddings are tied, with fp16 passthrough to avoid compounding quantization errors.
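A sketch of the tied-embedding passthrough as described: one matrix serves as both input lookup and output projection, kept in fp16 and excluded from int6 quantization so its error is not incurred at both ends of the network. Shapes and the compute dtype are illustrative assumptions.

```python
import numpy as np

# One shared matrix: input embedding lookup and output projection.
# Kept in fp16 and skipped by the int6 fake-quant pass (passthrough),
# so quantization error does not compound at both ends of the model.
vocab, d = 1000, 64
emb = np.random.randn(vocab, d).astype(np.float16)  # fp16 passthrough

tokens = np.array([3, 17, 42])
h = emb[tokens].astype(np.float32)       # input side: embedding lookup
logits = h @ emb.T.astype(np.float32)    # output side: same weights, tied
```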
Optimizer
Muon (weight_decay: 0.01, decoupled_weight_decay: true)
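Decoupled weight decay, AdamW-style, shrinks weights directly rather than folding the decay into the gradient. A minimal sketch with a stand-in for Muon's orthogonalized momentum update (the actual Muon step is not reproduced here):

```python
import numpy as np

def decoupled_wd_step(p, update, lr=0.02, wd=0.01):
    """One optimizer step with decoupled weight decay.

    `update` stands in for Muon's orthogonalized momentum step. The decay
    term lr * wd * p is applied separately from it, so decay strength is
    independent of gradient scale -- which also keeps weight magnitudes
    small and friendly to per-row int6 scaling. lr is illustrative;
    wd=0.01 matches the PR.
    """
    p = p - lr * update   # the optimizer's own step
    p = p - lr * wd * p   # decoupled decay, applied separately
    return p

p = np.ones((4, 4))
p_new = decoupled_wd_step(p, update=np.zeros((4, 4)))
```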
Evaluation
Sliding window eval (stride: 64, batch_seqs: 32)
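A sketch of the window bookkeeping for sliding-window evaluation: windows advance by the stride and only the final stride's worth of tokens is scored in each (the first window scores everything), so every token is predicted with close to full context. The scoring convention is an assumption; stride=64 matches the PR.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers `window` tokens; only positions from `score_from`
    to `end` contribute to the loss, so tokens are evaluated with nearly
    the full training context rather than a cold chunk boundary.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        score_from = start if start == 0 else start + window - stride
        spans.append((start, start + window, score_from))
        start += stride
    return spans

spans = sliding_windows(4096, window=2048, stride=64)
```

The scored regions tile the sequence exactly once, so the bits-per-byte figure covers every token without double counting.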
Sequence Length
train_length: 2048
LR Schedule
Warmdown (warmdown_steps: 3000)
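One common warmdown shape, sketched below: hold the learning rate constant, then decay linearly to zero over the final warmdown steps. Only warmdown_steps=3000 comes from the PR; the base rate, total steps, and linear shape are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to zero over the final steps.

    warmdown_steps=3000 matches the PR; base_lr and the linear shape
    are illustrative.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lrs = [warmdown_lr(s, total_steps=10000) for s in range(10000)]
```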
Regularization
Weight decay (value: 0.01, decoupled: true)

Novel Contributions

  • Int6 QAT with STE fake quantization and per-row symmetric scaling
  • Int6 values stored in int8 containers with zstd-22 compression
  • SmearGate learned embedding-level bigram context
  • Decoupled Muon weight decay for improved generalization and quantization robustness
  • Sliding-window full-context evaluation
  • FP16 tied embedding passthrough