PR #619

open

non-record 16MB A100 SXM run (10L mixed int5/int6 + EMA + QAT)

by zeal175
val_bpb
1.4222
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,576,677 bytes

Training Techniques

Quantization
mixed int5/int6 with QAT
bits: null
scope: int5 for MLP weights, int6 for attention/bigram-sensitive weights
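A minimal sketch of the fake-quantization step that QAT applies during the forward pass, assuming symmetric per-tensor scaling (the PR does not specify the exact scheme):

```python
def fake_quant(ws, bits):
    """Symmetric per-tensor fake quantization (sketch).

    During QAT the forward pass uses these rounded values while the
    backward pass treats the rounding as identity (straight-through).
    """
    qmax = 2 ** (bits - 1) - 1                     # 15 for int5, 31 for int6
    scale = max(max(abs(w) for w in ws), 1e-8) / qmax
    return [round(w / scale) * scale for w in ws]

# Routing per the PR: int5 for MLP weights, int6 for
# attention/bigram-sensitive weights.
mlp_q  = fake_quant([0.31, -0.02, 0.55], bits=5)
attn_q = fake_quant([0.31, -0.02, 0.55], bits=6)
```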
Architecture
BigramHash
BigramHash embedding added to model
parameters: {"BIGRAM_VOCAB_SIZE":10240,"BIGRAM_DIM":128}
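The BigramHash component is an extra embedding table indexed by a hash of each (previous, current) token pair, added to the usual token embedding. A sketch with hypothetical mixing constants (the PR specifies only the table sizes):

```python
BIGRAM_VOCAB_SIZE = 10240   # from the PR parameters
BIGRAM_DIM = 128            # from the PR parameters

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash a (prev, cur) token pair into the bigram table.

    The multipliers below are hypothetical; the PR does not state the hash.
    """
    h = (prev_tok * 1_000_003 + cur_tok * 8191) & 0xFFFFFFFF
    return h % BIGRAM_VOCAB_SIZE

# At each position t, the BIGRAM_DIM-wide row for
# bigram_bucket(tokens[t-1], tokens[t]) is looked up and added
# to the model's input representation.
```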
MLP3x
3x MLP expansion
parameters: {"MLP_MULT":3,"NUM_LAYERS":10}
Weight Averaging
EMA
parameters: {"decay":0.9999}
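The EMA copy tracks the training weights with decay 0.9999 and is what gets exported; a minimal sketch over flat parameter lists:

```python
EMA_DECAY = 0.9999   # from the PR parameters

def ema_update(ema_params, params, decay=EMA_DECAY):
    """One EMA step: the exported weights are this running average,
    not the raw training weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Usage: after every optimizer step,
#   ema = ema_update(ema, model_params)
```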
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"MATRIX_LR":0.02,"SCALAR_LR":0.04,"TIED_EMBED_LR":0.04}
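Muon's distinguishing step is orthogonalizing each weight matrix's momentum-averaged gradient before applying the update. The actual optimizer uses a tuned quintic Newton-Schulz iteration; the classic cubic variant below shows the idea (pure-Python sketch, not the PR's implementation):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz_orth(G, steps=20):
    """Cubic Newton-Schulz iteration driving the singular values of G
    toward 1, i.e. an approximate orthogonalization of the update."""
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]   # Frobenius normalization
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, XXtX)]
    return X
```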
Regularization
weight decay
parameters: {"weight_decay":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":160}
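A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps. A sketch (the total step count is a hypothetical argument; the PR only gives warmdown_iters=160):

```python
WARMDOWN_ITERS = 160   # from the PR parameters

def lr_scale(step, total_steps, warmdown_iters=WARMDOWN_ITERS):
    """Multiplier on the base LR: 1.0 until the warmdown window opens,
    then linear decay to 0 at the final step."""
    return min(1.0, (total_steps - step) / warmdown_iters)
```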

Novel Contributions

  • Mixed quantization using int5 for MLP weights and int6 for attention/bigram-sensitive weights
  • Export weights taken from an EMA (Exponential Moving Average) of the training weights with a high decay (0.9999)
  • Final-fraction QAT (Quantization Aware Training) with QAT_FINAL_FRAC=0.15
  • Incorporation of a BigramHash embedding with a 10,240-entry bigram vocabulary and 128-dimensional embeddings
  • 3x MLP expansion in a 10-layer Transformer model
  • Use of the Muon optimizer with per-group learning rates (matrix 0.02, scalar 0.04, tied embedding 0.04) and momentum 0.99
  • Compression of the final artifact to under 16 MB via int8 storage plus zlib
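The sub-16MB artifact stores the quantized weights in int8 containers and then applies zlib. A hypothetical sketch of such an export path (the header layout and function name are assumptions, not the PR's actual serializer):

```python
import struct
import zlib

def pack_tensor(ws):
    """Quantize a float list to signed int8 and zlib-compress it.

    Assumed layout: 4-byte little-endian float scale, then the
    zlib-compressed int8 payload.
    """
    scale = max(max(abs(w) for w in ws), 1e-8) / 127.0
    q = bytes(round(w / scale) & 0xFF for w in ws)   # two's-complement int8
    return struct.pack("<f", scale) + zlib.compress(q, level=9)
```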