PR #63

RECORD · closed

Record: 10L Int6 QAT + Zstd MLP2.6x Muon0.99 Sliding Window (val_bpb 1.1598)

by yahya010
val_bpb: 1.1598
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.56 MB

Training Techniques

Architecture
Transformer depth
Increased model depth from 9 to 10 transformer layers.
parameters: {"layers":10}
MLP 2.625x
Widened the MLP hidden size to 1344 (2.625x the model dimension), paid for by quantization and compression savings.
parameters: {"hidden_size":1344,"multiplier":2.625}
tied embeddings
Kept the tied input/output embeddings in FP16 and passed them through unquantized.
parameters: null
Quantization
STE QAT
bits: 6
scope: all 2D block weights
fp16
bits: 16
scope: tied embeddings passthrough
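A minimal sketch of the 6-bit symmetric fake quantization used in STE QAT, in numpy. The per-tensor scale and the exact clip range are assumptions; in an autograd framework the straight-through estimator would be applied as `w + stop_gradient(wq - w)` so gradients flow past the rounding.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization onto the int6 grid.

    Forward: round w onto 64 levels in [-32, 31] and dequantize.
    During QAT the straight-through estimator copies the gradient of
    the dequantized weights directly onto w (identity backward).
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6
    scale = np.abs(w).max() / qmax              # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8), scale
```

Because evaluation then runs on exactly the rounded weights seen during training, the rounding error stays within half a quantization step.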
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
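The momentum warmup above can be sketched as a small schedule function. Linear interpolation is an assumption; the record only specifies the endpoints (0.92 to 0.99) and the 1500-step window.

```python
def muon_momentum(step, warmup_from=0.92, target=0.99, warmup_steps=1500):
    """Warm Muon's momentum from 0.92 to 0.99 over the first 1500
    steps, then hold it at 0.99 (linear ramp is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return warmup_from + frac * (target - warmup_from)
```

Starting at a lower momentum keeps early updates less correlated while gradients are still noisy, then leans harder on the running direction once training stabilizes.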
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
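A sketch of the warmdown schedule: constant learning rate, then a decay over the final 3600 steps. The linear-to-zero shape and unit base LR are assumptions; only `warmdown_steps: 3600` comes from the record.

```python
def lr_warmdown(step, total_steps, warmdown_steps=3600, base_lr=1.0):
    """Hold base_lr, then decay linearly to zero over the last
    warmdown_steps (shape and base_lr are assumptions)."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```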
Regularization
gradient clipping
parameters: {"max_norm":0.3}
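The clipping rule at `max_norm: 0.3` is presumably the standard clip-by-global-norm; a numpy sketch:

```python
import numpy as np

def clip_global_norm(grads, max_norm=0.3):
    """Rescale all gradients by max_norm / total_norm when the global
    L2 norm over every tensor exceeds max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [g * scale for g in grads], total
```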
Evaluation
sliding window eval
parameters: {"stride":64}
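A sketch of the sliding-window evaluation span logic, assuming the common scheme where each 2048-token window advances by the stride and only the newest tokens are scored, so every token after the first window sees near-full left context. The exact windowing details are an assumption; only `stride: 64` is from the record.

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: loss is computed only on
    tokens in [score_from, end), so each token past the first window
    is scored with at least window - stride tokens of context."""
    spans = [(0, min(window, n_tokens), 0)]
    end = spans[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)
        spans.append((start, new_end, end))   # score only the new tokens
        end = new_end
    return spans
```

This trades eval compute (window/stride forward passes per token, roughly 32x here) for a lower, context-faithful val_bpb.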

Novel Contributions

  • STE int6 QAT that trains against the quantized weights, removing the usual post-training quantization gap
  • Full int6 quantization of block weights with zstd-22 compression
  • Wider MLP hidden size enabled by compression savings
  • 10-layer Transformer variant
  • Muon momentum tuning with warmup from 0.92 to 0.99
  • Sliding window evaluation with stride 64
  • FP16 tied embedding passthrough