PR #192 (open)

Record: 11L Int6 QAT + SmearGate + WD 0.038 (val_bpb=1.1502)

by baudrillardsgh0st
val_bpb: 1.1502
Architecture: GPT
Optimizer: Muon
Artifact Size: 15.50 MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: all
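The STE QAT entry (6 bits, scope: all) implies every weight tensor is fake-quantized to the int6 grid in the forward pass while gradients bypass the rounding via the straight-through estimator. A minimal numpy sketch of symmetric per-tensor int6 fake quantization; the symmetric scheme, the per-tensor scale, and the helper name are assumptions, not the PR's actual code:

```python
import numpy as np

def fake_quant_int6(w, q_min=-32, q_max=31):
    # Symmetric per-tensor scale mapping max |w| to the top int6 code.
    max_abs = np.abs(w).max()
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), q_min, q_max)
    # q * scale is the value the network sees in the forward pass; in an
    # autodiff framework the STE trick is written
    #   w + (q * scale - w).detach()
    # so the gradient flows through the rounding as identity.
    return q * scale, q.astype(np.int8)

w = np.array([0.5, -0.25, 0.1])
w_q, codes = fake_quant_int6(w)
```

With scope "all", this would be applied to every weight tensor each step; the int8-typed codes are what the storage container below would hold.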
Architecture
SmearGate
Learned gate blending current and previous token embeddings
parameters: {"params":513}
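The 513-parameter count is consistent with a per-token scalar gate computed from a 512-dim embedding (512 weights + 1 bias), though that mapping is an assumption. A sketch of such a gate blending each token's embedding with its predecessor's:

```python
import numpy as np

def smear_gate(x, w, b):
    # Per-token scalar gate in (0, 1) from a linear probe of the current
    # embedding. 513 parameters matches d = 512 weights + 1 bias,
    # an assumption about this PR's SmearGate.
    g = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                       # no previous token at position 0
    return (1.0 - g)[:, None] * x + g[:, None] * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # (T, d) toy embeddings
out = smear_gate(x, np.zeros(8), 0.0)   # zero init -> gate = 0.5 everywhere
```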
MLP3x
Transformer MLP widened to 3x
parameters: {"multiplier":3}
tied embeddings
Input and output embeddings are tied
parameters: null
KV head count
Uses fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
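With 8 attention heads but only 4 KV heads, each key/value head is shared by two query heads (grouped-query attention), halving the KV projection parameters and KV cache. The sharing step can be sketched as a repeat along the head axis (function name is illustrative):

```python
import numpy as np

def repeat_kv(kv, n_heads, n_kv_heads):
    # Each of the n_kv_heads key/value heads serves
    # n_heads // n_kv_heads query heads.
    group = n_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)

k = np.arange(24, dtype=float).reshape(4, 3, 2)  # (kv_heads, T, head_dim)
k8 = repeat_kv(k, n_heads=8, n_kv_heads=4)       # (8, 3, 2)
```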
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: null
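Muon orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it, and the "decoupled" weight decay highlighted in the contributions subtracts lr * wd * p directly from the weights rather than folding the decay into the gradient. A sketch under assumptions: the quintic coefficients follow the public Muon implementation, and lr = 0.02 is an assumed value; only momentum 0.99 and weight decay 0.038 come from this PR:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # Quintic Newton-Schulz iteration driving singular values toward 1;
    # coefficients follow the public Muon implementation (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_step(p, grad, buf, lr=0.02, momentum=0.99, wd=0.038):
    # Decoupled weight decay: lr * wd * p is subtracted directly,
    # independent of the orthogonalized momentum update.
    buf = momentum * buf + grad
    p = p - lr * wd * p - lr * newton_schulz_orth(buf)
    return p, buf
```

With a zero gradient this reduces to pure multiplicative shrinkage, p * (1 - lr * wd), which is what "decoupled" buys: the decay strength is independent of the update's orthogonalized scale.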
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"batch":32}
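Sliding-window evaluation with stride 64 re-runs the model on overlapping windows but scores each token only once, in the window where it has the longest left context (the batch parameter presumably groups 32 such windows per forward pass). The index bookkeeping, as a sketch:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    # Each window [begin, end) is fed to the model, but only tokens in
    # [prev_end, end) are scored, so every token is counted exactly once
    # with the longest left context the window allows.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(13, seq_len=8, stride=2)  # toy sizes
```

At the PR's settings (seq_len 2048, stride 64), every scored token after the first window sees at least 1984 tokens of context, at the cost of roughly 32x the forward passes of a non-overlapping eval.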
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
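The PR records only warmdown_steps = 3000; assuming the common constant-then-linear-to-zero shape, the schedule looks like:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3000, base_lr=1.0):
    # Assumed shape: hold base_lr, then decay linearly to zero over the
    # final warmdown_steps (the PR only records warmdown_steps=3000).
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```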
Regularization
weight decay
parameters: {"value":0.038}
Other
other
Int6-in-int8 container storage with restricted-range zstd compression
parameters: {"container":"int8","value_range":[-32,31]}
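The container trick: each int6 code occupies a full int8 byte, but restricting the byte values to [-32, 31] leaves only 64 distinct symbols, so the entropy coder can recover most of the wasted 2 bits. A sketch; the PR uses zstd at level 22, and zlib stands in here only to keep the example stdlib-only:

```python
import zlib
import numpy as np

def pack_int6_container(codes):
    # Each int6 code occupies a full int8 byte; clamping to [-32, 31]
    # leaves only 64 distinct byte values for the entropy coder to
    # exploit. The PR uses zstd level 22; zlib is a stdlib stand-in.
    codes = np.asarray(codes)
    assert codes.min() >= -32 and codes.max() <= 31
    return zlib.compress(codes.astype(np.int8).tobytes(), 9)

rng = np.random.default_rng(0)
codes = rng.integers(-32, 32, size=20000)   # uniform over the int6 range
blob = pack_int6_container(codes)
```

Even on uniform random codes the 64-symbol alphabet compresses below 8 bits per weight; real trained weights are far from uniform, so the gain is larger.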
other
FP16 tied embedding passthrough
parameters: null
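Tied embeddings store one matrix serving as both input embedding and output head, and the "FP16 tied embedding passthrough" suggests that single matrix is kept in FP16 and skipped by the int6 container (how this interacts with the scope: all QAT setting is not specified in the PR). A sketch with illustrative toy dimensions:

```python
import numpy as np

d_model, vocab = 8, 50                 # toy sizes, not the PR's config
rng = np.random.default_rng(0)
wte = rng.standard_normal((vocab, d_model)).astype(np.float16)

def embed(ids):
    # Input side of the tied matrix, upcast for the transformer blocks.
    return wte[ids].astype(np.float32)

def lm_logits(h):
    # Output head reuses the same FP16 matrix transposed: stored once,
    # and (per the PR) passed through rather than int6-quantized.
    return h @ wte.T.astype(np.float32)

logits = lm_logits(embed(np.array([1, 2, 3])))
```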

Novel Contributions

  • 11-layer GPT enabled by int6 compression
  • STE int6 quantization-aware training
  • SmearGate learned embedding blend between current and previous token
  • Decoupled Muon weight decay tuned for int6 quantization
  • Int6-in-int8 storage with zstd-22 compression
  • Sliding window evaluation with stride 64