| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.2089 | Transformer | Muon | 15,190,812 bytes |
Training Techniques
Quantization: STE QAT
- bits: 6
- scope: large matrices / model weights
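The STE QAT entry above can be sketched as follows. This is a hedged illustration, not the submission's code: it assumes symmetric per-tensor scaling onto the signed int6 range [-32, 31], with the straight-through estimator treating the rounding step as the identity in the backward pass.

```python
def fake_int6_quantize(w):
    """Fake int6 quantization with a straight-through estimator (STE).

    Assumed scheme (not taken from the submission): a symmetric
    per-tensor scale maps weights onto the signed int6 range [-32, 31].
    The forward pass returns dequantized weights; the returned backward
    function is the STE, passing upstream gradients through unchanged
    as if round() were the identity.
    """
    scale = max(max(abs(x) for x in w), 1e-8) / 31.0
    codes = [min(max(round(x / scale), -32), 31) for x in w]  # int6 codes
    w_q = [c * scale for c in codes]  # dequantized forward values

    def backward(grad_out):
        # STE: treat d(w_q)/dw as 1, so the gradient flows untouched and
        # the optimizer keeps updating the fp32 master weights (which the
        # run restores after each backward pass).
        return grad_out

    return w_q, backward
```

In the run described here, this fake quantization reportedly activates at step 200, with fp32 weights restored after backward so optimizer state stays full precision.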
Architecture: MLP3x
- 11-layer Transformer with a 512-dim hidden size and a 1024-dim MLP hidden size; the original target of 1536 for the MLP hidden size was reduced to fit the budget.
- parameters: {"layers":11,"dimensions":512,"mlp_hidden":1024}
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"matrix_lr":0.02,"scalar_lr":0.025,"warmdown":3000,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
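The `muon_momentum_warmup_*` values above imply a momentum ramp from 0.92 to the final 0.99 over 1,500 steps. A minimal sketch, assuming linear interpolation (the schedule shape is not stated in the report):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup consistent with the muon_momentum_warmup_* values
    above. The linear interpolation is an assumption about the schedule
    shape, not taken from the submission."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```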
Evaluation: sliding window eval
- parameters: {"context_length":4096,"chunk_size":512,"stride":64}
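One plausible reading of the sliding-window parameters above can be sketched as a window planner. This is an assumption about the scheme, not the submission's code: each window advances by `stride` tokens, scores only the newly revealed tokens, and conditions them on up to `context_length` preceding tokens; `chunk_size` is taken here to be the number of windows batched per forward pass.

```python
def sliding_eval_plan(n_tokens, context_length=4096, stride=64):
    """Plan token scoring for a sliding-window evaluation.

    Assumed scheme: overlapping windows of up to `context_length`
    tokens advance by `stride`; each window scores only its `stride`
    newly revealed tokens, so every token past the first window sees
    at least context_length - stride tokens of context.
    """
    plan = []
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - context_length)
        # (context start, score start, score end) for one forward pass
        plan.append((ctx_start, start, end))
        start = end
    return plan
```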
Sequence Length
- train_length: 4096
- eval_length: 4096
LR Schedule: warmdown
- parameters: {"warmdown_steps":3000}
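The warmdown schedule above can be sketched as a constant learning rate followed by a ramp to zero over the final 3,000 steps. The linear shape is an assumption, and `base_lr` here reuses the `matrix_lr` from the Muon settings:

```python
def lr_with_warmdown(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then a ramp to zero over the last warmdown_steps
    steps. Linear shape is an assumption; base_lr mirrors matrix_lr."""
    steps_left = max(0, total_steps - step)
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```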
Compression: zlib
- level: null
Other
- Flat tensor storage for the packed int6 bytes (int6_mixed_per_row_v2), which improves compression by avoiding pickle metadata interleaved with the weight data.
- parameters: {"format":"int6_mixed_per_row_v2"}
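The flat-storage idea above can be illustrated with a simple packer. This is a sketch, not the actual `int6_mixed_per_row_v2` layout (which is not specified here): four signed int6 codes are packed into three bytes, tensors are concatenated into one contiguous stream, and zlib compresses the whole blob, so no pickle or container metadata is interleaved with the weight bytes.

```python
import zlib

def pack_int6(codes):
    """Pack signed int6 codes in [-32, 31], four codes per three bytes."""
    out = bytearray()
    u = [(c + 32) & 0x3F for c in codes]  # map to unsigned 0..63
    for i in range(0, len(u), 4):
        group = u[i:i + 4] + [0] * (4 - len(u[i:i + 4]))  # zero-pad tail
        a, b, c, d = group
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(data, n):
    """Recover the first n signed int6 codes from a packed byte stream."""
    codes = []
    for i in range(0, len(data), 3):
        a, b, c = data[i], data[i + 1], data[i + 2]
        codes += [a >> 2, ((a & 0x3) << 4) | (b >> 4),
                  ((b & 0xF) << 2) | (c >> 6), c & 0x3F]
    return [u - 32 for u in codes[:n]]

def compress_flat(tensors):
    """Concatenate packed tensors into one flat blob and zlib it.
    Default compression level is used here; the run's level is listed
    as null above."""
    blob = b"".join(pack_int6(t) for t in tensors)
    return zlib.compress(blob)
```

Keeping the stream flat lets zlib find long matches across rows instead of being interrupted by serializer framing, which is the stated motivation for this format.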
Novel Contributions
- Flat tensor storage for packed int6 weights to improve zlib compression
- STE fake-int6 QAT activated at step 200 with fp32 weight restore after backward
- Sliding window evaluation with ctx=4096, chunk=512, stride=64
- Tuned Muon optimizer settings for the 8×H100, 10-minute budget
- Observation that more training steps can worsen compression due to near-orthogonal, high-entropy weights