PR #547

Status: open

Record: Int5/Int6+Zstd+MLP3x — mean val_bpb=1.1752 (10L, seq4096, sliding window)

by shajalahamedcse
• val_bpb: 1.1752
• Architecture: Transformer
• Optimizer: Muon
• Artifact size: ≤ 16,000,000 B

Training Techniques

• Quantization: int5/int6
  scope: MLP matrices (int5), attention matrices (int6), embeddings (int6)
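The PR does not include its quantizer, so as a sketch of what quantization to these odd bit widths typically looks like: a symmetric per-tensor scheme (one scale, zero-point fixed at 0). The function names and the per-tensor scaling choice here are assumptions, not taken from the submission.

```python
def quantize_symmetric(weights, bits):
    """Quantize a list of floats to a signed integer grid of `bits` width.

    Symmetric per-tensor scheme: one scale, zero-point fixed at 0.
    For int5 the grid is [-15, 15]; for int6 it is [-31, 31].
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max((abs(w) for w in weights), default=0.0) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale round-trips correctly
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [v * scale for v in q]
```

With this scheme the worst-case rounding error per weight is scale/2, which is the precision-vs-size trade the int5/int6 split is balancing.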
• Architecture: MLP3x
  MLP hidden units raised from the baseline 1024 to 1536 (expansion factor 3), enabled by quantization savings
  parameters: {"mlp_hidden_units": 1536, "expansion_factor": 3}
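Back-of-envelope arithmetic for how quantization can fund the wider MLP. The record does not state the model width or the baseline weight precision, so the d_model of 512 and the int8 baseline below are purely illustrative assumptions; only the hidden sizes (1024 → 1536) and bit width (int5) come from the PR.

```python
def matrix_bytes(rows, cols, bits):
    """Raw storage for a rows x cols matrix packed at `bits` bits per weight."""
    return (rows * cols * bits + 7) // 8

# Per-layer MLP (up- and down-projection), hypothetical d_model = 512.
d_model = 512
baseline = matrix_bytes(d_model, 1024, 8) + matrix_bytes(1024, d_model, 8)  # int8, hidden 1024
expanded = matrix_bytes(d_model, 1536, 5) + matrix_bytes(1536, d_model, 5)  # int5, hidden 1536
```

Under these assumptions the 1.5x-wider MLP at int5 is still smaller per layer than the narrower int8 one, which is the sense in which the expansion is "funded" by quantization.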
• Compression: zstd
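A sketch of the storage pipeline the record implies: bit-pack the quantized integers into a contiguous byte stream, then entropy-code the result. The packing scheme below (LSB-first, values biased to be non-negative) is an assumption. zstd is only in the Python stdlib from 3.14, so zlib stands in here just to show the pipeline shape; the PR's point is that zstd compresses these streams better than zlib.

```python
import zlib

def pack_bits(values, bits):
    """Pack small signed ints into a bytes object, `bits` bits per value."""
    offset = 1 << (bits - 1)          # bias so every value is non-negative
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        acc |= (v + offset) << nbits  # append `bits` bits to the accumulator
        nbits += bits
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)        # flush the final partial byte
    return bytes(out)

# 1000 int5 values -> 625 bytes before entropy coding.
values = [(i % 31) - 15 for i in range(1000)]
packed = pack_bits(values, 5)
compressed = zlib.compress(packed)    # the PR uses zstd at this step instead
```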
• Sequence Length: train 4096, eval 4096
• Evaluation: sliding-window eval (stride 64)
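With stride 64 over 4096-token windows, each token is scored once with up to 4095 tokens of left context instead of being truncated at chunk boundaries. A sketch of the window schedule that implies; the function name and the exact edge handling are assumptions, not from the PR.

```python
def sliding_eval_spans(n_tokens, window, stride):
    """Yield (start, end, score_from) triples for sliding-window eval.

    Each window covers tokens [start, end); only tokens at index
    >= score_from contribute to the loss, so every token is scored
    exactly once with the longest available left context.
    """
    spans, scored_to = [], 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        start = max(0, end - window)
        spans.append((start, end, scored_to))
        scored_to = end
    # cover a ragged tail if the stride doesn't divide the remainder evenly
    if scored_to < n_tokens:
        spans.append((max(0, n_tokens - window), n_tokens, scored_to))
    return spans
```

The trade-off is cost: a small stride like 64 means each token's logits are recomputed roughly window/stride times.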
• Optimizer: Muon (momentum 0.95)
• LR Schedule: warmdown (warmdown_iters 3600)
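"Warmdown" in this lineage of speedrun configs usually means holding the learning rate constant and then decaying it linearly to zero over the final iterations. A minimal sketch under that assumption; the record gives warmdown_iters = 3600 but not the total iteration count, so num_iters below is a placeholder.

```python
def get_lr(it, base_lr, num_iters, warmdown_iters=3600):
    """Constant LR, then linear warmdown to 0 over the final iterations."""
    if it < num_iters - warmdown_iters:
        return base_lr
    frac = (num_iters - it) / warmdown_iters  # 1.0 -> 0.0 across the warmdown
    return base_lr * frac
```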

Novel Contributions

  • Int5 quantization of the MLP weight matrices, saving roughly 1.5 MB
  • Int6 quantization of the attention matrices, balancing precision against size
  • zstd compression in place of zlib, for a better compression ratio on the quantized integer arrays
  • 3x MLP expansion (hidden = 1536), funded by the quantization savings without exceeding the 16 MB artifact limit
  • Training at sequence length 4096, with sliding-window evaluation (stride 64) for full-context scoring