PR #107
openInt6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)
by m0at
val_bpb
1.1648
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.93MB
Training Techniques
Quantization
mixed int6 quantization
bits: 6
scope: MLP/Q/V/proj weights
fp16
bits: 16
scope: tied embeddings
STE QAT
bits: null
scope: post-training quantization-aware training
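The quantization scheme above can be sketched as symmetric fake quantization: weights are rounded to the int6 grid in the forward pass, while QAT with a straight-through estimator (STE) lets gradients flow to the underlying fp weights as if rounding were the identity. A minimal sketch, assuming per-tensor symmetric scales (the PR may use per-channel scales; the function name is illustrative):

```python
def quantize_int6(weights):
    """Fake-quantize a list of floats to int6 and dequantize back.

    Signed int6 spans [-32, 31]; we use the symmetric range [-31, 31]
    so that zero maps exactly to zero.
    """
    qmax = 31  # 2**(6 - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid div-by-zero scale
    ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in ints], scale

# Under STE-based QAT, the forward pass uses the dequantized values while
# the backward pass treats quantization as identity, so the fp weights
# learn to sit near representable int6 points before final export.
```

The round-trip error per weight is at most half a quantization step, which is why the most sensitive tensor (the tied embedding) is left in fp16 instead.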
Architecture
MLP3x
Widened MLP hidden size to improve capacity under the artifact budget.
parameters: {"hidden_size":1488}
tied embeddings
Kept the tied embedding/output head in fp16 instead of quantizing it.
parameters: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
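Sliding-window evaluation with stride 64 means each window shares most of its context with the previous one, and only the newest tokens are scored, so nearly every scored position sees close to the full 2048 tokens of left context. A minimal sketch of the window planning (function name is illustrative):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Plan evaluation windows as (begin, end, first_scored) tuples.

    The window advances by `stride`; after the first window only the
    newest `stride` tokens are scored, with the preceding
    seq_len - stride tokens serving purely as context. Loss (and hence
    val_bpb) is averaged only over scored positions.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Scored regions are disjoint and cover every token exactly once, so the metric stays comparable to a non-overlapping evaluation while using far more context per position (at the cost of roughly seq_len / stride times more forward-pass compute).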
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
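The warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final warmdown_iters steps. A minimal sketch of the multiplier (assuming linear decay, the usual form of this schedule; the function name is illustrative):

```python
def warmdown_scale(step, total_iters, warmdown_iters=3000):
    """LR multiplier: 1.0 until the warmdown begins, then linear decay
    to 0.0 over the final `warmdown_iters` steps."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```

With a short training budget, stretching the warmdown means a larger fraction of total steps is spent at reduced LR, which tends to land the weights in a flatter, lower-loss region before quantization.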
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"grad_clip_norm":0.3,"qk_gain_init":1.7}
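The momentum_warmup_start / momentum_warmup_steps parameters above describe ramping Muon's momentum from 0.92 to its final 0.99 over the first 1500 steps. A minimal sketch, assuming a linear ramp (the PR does not state the interpolation; the function name is illustrative):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp linearly from `start` to `end`
    over the first `warmup_steps` steps, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Lower momentum early keeps the optimizer from amplifying noisy initial gradients; the high final momentum then smooths updates once the loss landscape is better behaved.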
Compression
zstd
level: 22
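Level 22 is zstd's maximum standard compression level; with the reference CLI, levels above 19 additionally require the `--ultra` flag. A hedged sketch of packing the exported checkpoint (the filename is illustrative):

```shell
# Compress the serialized int6 checkpoint at zstd's maximum level.
# --ultra unlocks levels 20-22; higher levels trade compression time
# and memory for a smaller artifact, which counts against the budget.
zstd --ultra -22 checkpoint.bin -o checkpoint.bin.zst
```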
Regularization
gradient clipping
parameters: {"norm":0.3}
Other
other
Fallback from FA3 to SDPA when FA3 is unavailable.
parameters: null
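The fallback can be sketched as an import probe: try the FlashAttention-3 kernel, and drop to PyTorch's built-in SDPA when it is unavailable. A minimal sketch assuming a PyTorch model; `flash_attn_interface` is FA3's package-level module, and the `[0]` index assumes its `flash_attn_func` returns `(out, lse)`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 kernel (Hopper GPUs)
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False

def attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim), SDPA's layout."""
    if HAVE_FA3 and q.is_cuda:
        # FA3 expects (batch, seq, heads, head_dim) and returns (out, lse).
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=True)[0]
        return out.transpose(1, 2)
    # Fallback: PyTorch's fused scaled_dot_product_attention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Since both paths compute the same causal attention, the fallback changes speed but not the model's outputs, which keeps the run reproducible on hardware without FA3.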
Novel Contributions
- Mixed int6 quantization of MLP/Q/V/proj weights to fit a larger model under the artifact budget
- Wider MLP hidden size (1488) enabled by quantization savings
- Sliding-window evaluation with stride 64 to use more context per scored position
- Post-training QAT with STE to reduce quantization penalty
- Tuned learning rates for matrix, scalar, and tied embedding parameters
- Kept tied embedding in fp16 to avoid quantizing the most sensitive tensor
- Longer warmdown schedule to better match the short training budget
- FA3 fallback to SDPA for robustness