PR #225

open

Non-record: Int6 QAT + 11L 512d + Sliding Window, val_bpb=1.2089

by dibdabo
val_bpb
1.2089
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,190,812 bytes

Training Techniques

Quantization
STE QAT
bits: 6
scope: large matrices / model weights
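A minimal sketch of the fake-quantization forward step, using NumPy and assuming a symmetric per-row scale (the PR does not document the exact int6 scheme). Under STE, the backward pass treats the round-and-clip as identity, so gradients flow to the fp32 master weights unchanged:

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Fake-quantize a weight matrix to signed 6-bit values, per row.

    Forward-pass sketch only: a full STE QAT step would apply this in the
    forward pass and copy gradients straight through to the fp32 weights.
    Symmetric per-row scaling is an assumption, not the PR's exact scheme.
    """
    qmax = 2 ** (6 - 1) - 1  # 31 representable magnitudes per sign
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # dequantized values the forward pass actually sees
```

The round-trip error per element is at most half a quantization step, which is what QAT lets the model adapt to during training.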
Architecture
MLP3x
An 11-layer Transformer with a 512-dimensional hidden size and a 1024-dimensional MLP hidden size; originally targeted a 1536 MLP hidden size but was reduced to fit the budget.
parameters: {"layers":11,"dimensions":512,"mlp_hidden":1024}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.025,"warmdown":3000,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
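The momentum parameters above read as a ramp from 0.92 to 0.99 over the first 1500 steps, then a hold at 0.99. A sketch, assuming the ramp is linear (the PR only lists the endpoints and step count):

```python
def muon_momentum(step: int,
                  start: float = 0.92,
                  final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum used by Muon at `step`: linear ramp, then hold at `final`."""
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return start + frac * (final - start)
```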
Evaluation
sliding window eval
parameters: {"context_length":4096,"chunk_size":512,"stride":64}
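A sketch of how such a sliding-window pass can be laid out: each window sees up to 4096 tokens of left context, but after the first window only the newest 64 tokens (the stride) are scored, so every token is evaluated exactly once with near-full context. This is the standard strided-perplexity layout; how chunk_size=512 interacts with it (e.g. batching of windows) is not specified in the PR and is omitted here:

```python
def sliding_windows(n_tokens: int, context: int = 4096, stride: int = 64):
    """Return (start, end, n_scored) spans covering n_tokens.

    Each span is one model forward over tokens [start, end); only the last
    n_scored tokens of the window contribute to the loss.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + context, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```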
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
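Warmdown here is a trapezoidal schedule: the learning rate holds at its base value and then decays to zero over the final 3000 steps. A sketch using the matrix LR from above as the base, and assuming the decay is linear (the usual convention for this schedule; the PR does not state the shape):

```python
def lr_at(step: int, total_steps: int,
          base_lr: float = 0.02, warmdown_steps: int = 3000) -> float:
    """Constant LR, then a linear warmdown to zero over the last steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * max(steps_left, 0) / warmdown_steps
```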
Compression
zlib
level: null
Other
other
Flat tensor storage for packed int6 bytes (int6_mixed_per_row_v2) to improve compression by avoiding pickle metadata interleaving.
parameters: {"format":"int6_mixed_per_row_v2"}
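A sketch of the flat-bytes idea: pack four 6-bit codes into three bytes and hand zlib one contiguous buffer, instead of pickling tensors (which interleaves metadata with the payload and disrupts zlib's match finding). The MSB-first bit layout below is hypothetical; the actual int6_mixed_per_row_v2 format is not documented in this PR:

```python
import zlib

def pack_int6(codes) -> bytes:
    """Pack unsigned 6-bit codes (0..63) MSB-first into a flat byte string."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 6) | (c & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # pad the final partial byte with zeros
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

# One homogeneous stream for the compressor, no per-tensor metadata:
payload = pack_int6([i % 64 for i in range(4096)])
blob = zlib.compress(payload, 9)
```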

Novel Contributions

  • Flat tensor storage for packed int6 weights to improve zlib compression
  • STE fake-int6 QAT activated at step 200 with fp32 weight restore after backward
  • Sliding window evaluation with ctx=4096, chunk=512, stride=64
  • Tuned Muon optimizer settings for the 8×H100, 10-minute budget
  • Observation that more training steps can worsen compression due to near-orthogonal, high-entropy weights
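The last observation can be checked directly: zlib's gain is bounded by the byte-level entropy of the packed weights, and if training pushes the quantized codes toward a near-uniform distribution, that entropy approaches 8 bits/byte and the artifact stops shrinking. An illustrative checker (not from the PR):

```python
import math

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (max 8.0).

    Near 8.0 means a byte-oriented compressor like zlib has little to gain.
    """
    n = len(data)
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```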