PR #200

open

Record: SP4096 + Int6 QAT + NorMuon (val_bpb=1.2012)

by khasinski
val_bpb: 1.2012
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14,342,773 bytes

Training Techniques

Quantization
STE QAT
bits: 6
scope: all
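
As a rough illustration of what 6-bit STE QAT means, the sketch below fake-quantizes a weight in the forward pass and copies the gradient straight through the rounding step to the full-precision weight. It is a hand-rolled scalar example, not the PR's training code: the function names, the fixed scale, the learning rate, and the toy regression target are all assumptions made for illustration.

```python
def fake_quant(w, scale, qmax=31):
    """Round w onto the int6 grid {-qmax..qmax} * scale and return the dequantized value."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def ste_step(w, x, target, scale, lr):
    """One SGD step on loss = 0.5 * (fake_quant(w) * x - target) ** 2.

    Straight-through estimator: the derivative of the rounding op is treated
    as 1, so the gradient for the quantized weight is applied to the raw weight.
    """
    w_q = fake_quant(w, scale)      # forward pass sees the quantized weight
    err = w_q * x - target
    grad_wq = err * x               # exact gradient w.r.t. the quantized weight
    grad_w = grad_wq                # STE: pass it through the rounding unchanged
    return w - lr * grad_w, 0.5 * err * err


def train(w=0.0, x=1.0, target=0.5, scale=0.1, lr=0.05, steps=200):
    # Hypothetical toy setup: fit one weight so that fake_quant(w) * x ~= target.
    losses = []
    for _ in range(steps):
        w, loss = ste_step(w, x, target, scale, lr)
        losses.append(loss)
    return w, losses
```

Calling `train()` moves the latent weight in small steps until its quantized value lands on the grid point matching the target. The latent weight keeps accumulating small updates even while the quantized value stays put, which is the reason for keeping a full-precision copy during QAT.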
Architecture
tied embeddings
Uses tied input/output embeddings.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"variant":"NorMuon"}
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_start_momentum":0.92,"warmup_steps":1500}
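
The schedule parameters above could be read as in the sketch below. This is one plausible interpretation, not the PR's code: it assumes the warmdown is a linear decay of the learning-rate multiplier over the final `warmdown_iters` steps, and that `warmup_steps` and `warmup_start_momentum` describe a linear Muon momentum warmup toward the listed momentum of 0.99. `total_steps` is a hypothetical argument, since the run length is not given here.

```python
def lr_scale(step, total_steps, warmdown_iters=3000):
    """Learning-rate multiplier: 1.0, then linear decay to 0 over the last warmdown_iters steps."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)


def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Momentum rises linearly from start to end over warmup_steps, then stays at end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```

For a hypothetical 10,000-step run, `lr_scale(8500, 10000)` sits halfway through the warmdown, and `muon_momentum(750)` sits halfway between 0.92 and 0.99.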
Other
other
SP4096 SentencePiece BPE tokenizer with improved text compression over sp1024.
parameters: {"vocab_size":4096,"compression_improvement":"26%"}
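
Tokenizer compression matters here because val_bpb is normalized by bytes rather than tokens. The sketch below uses the standard conversion from mean per-token loss in nats to bits per byte. The function name and arguments are illustrative, and how the 26% figure was measured is not stated on this page.

```python
import math


def bits_per_byte(mean_loss_nats_per_token, num_tokens, num_bytes):
    """Convert a mean per-token cross-entropy (nats) into bits per byte of raw text."""
    total_bits = mean_loss_nats_per_token * num_tokens / math.log(2)
    return total_bits / num_bytes
```

At a fixed per-token loss, encoding the same bytes with fewer tokens lowers bits per byte. In practice a larger vocabulary usually raises the per-token loss somewhat, so the net gain has to be measured.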
other
Per-row int6 quantization with fp16 embedding passthrough and zstd-22 artifact compression.
parameters: {"range":"[-31,31]"}
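
A minimal sketch of per-row quantization onto the stated range [-31, 31] follows. It assumes symmetric absmax scaling with one scale per row, which is a common choice but is not confirmed by this page. The function names are hypothetical, and the fp16 embedding passthrough (leaving the embedding matrix unquantized) is not shown.

```python
def quantize_rows(matrix, qmax=31):
    """Quantize each row to integers in [-qmax, qmax] using that row's own scale."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        # Assumed scaling rule: map the row's largest magnitude to qmax.
        scale = absmax / qmax if absmax > 0 else 1.0  # all-zero rows keep scale 1.0
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate float weights from the integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

With per-row scales, a row of small weights is not forced onto the coarse grid needed by a row with large weights; the rounding error in each row is bounded by half of that row's scale.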

Novel Contributions

  • SP4096 tokenizer with a reported 26% improvement in text compression over sp1024
  • Int6 STE QAT with fp16 embedding passthrough
  • zstd-22 compression to keep the artifact under 16 MB
  • NorMuon optimizer with tuned learning rates and momentum
  • Extended warmdown schedule