PR #114
openRecord: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training
by saml212
val_bpb
1.1574
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.98MB
Training Techniques
Quantization
int6
bits: 6
scope: weight matrices
fp16
bits: 16
scope: tied embedding and last 2 layers' key projections
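As a sketch of the int6 scheme above, assuming symmetric per-tensor quantization with round-to-nearest (the PR does not specify the exact scaling or rounding):

```python
def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization to the range [-31, 31].
    `weights` is a flat list of floats; returns (int codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return [v * scale for v in q]
```

With this scheme the worst-case reconstruction error per weight is half the scale, which is what makes sensitive tensors (handled separately at fp16 below) worth excluding.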
Architecture
MLP3x
MLP hidden dimension raised to 3x the model width (512 → 1536, versus the default 1024), fitting within the artifact budget freed by int6 compression.
parameters: {"mlp_hidden":1536,"default_mlp_hidden":1024}
tied embeddings
Input embedding and output projection share the same weight matrix.
parameters: null
KV head count
Model uses 4 KV heads with 8 attention heads and 9 layers.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4}
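Given the listed dimensions, the per-layer attention parameter count under grouped-query attention (4 KV heads shared across 8 query heads) can be checked with a small sketch; the projection layout (separate Q/K/V/output matrices, no biases) is assumed, not taken from the PR:

```python
def attn_param_count(dim, heads, kv_heads):
    """Parameters in one grouped-query attention layer:
    full-width Q and output projections, narrower shared K/V projections."""
    head_dim = dim // heads
    q_proj = dim * dim                       # queries: one head_dim slice per head
    kv_proj = 2 * dim * (kv_heads * head_dim)  # K and V: only kv_heads slices each
    out_proj = dim * dim
    return q_proj + kv_proj + out_proj
```

Halving the KV heads (8 → 4) shrinks the K/V projections to half width, which is part of how the model fits the artifact budget.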
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
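A minimal sketch of the momentum warmup implied by `muon_momentum_warmup_start` and `muon_momentum_warmup_steps`, assuming linear interpolation from 0.92 to the final 0.99 (the interpolation shape is an assumption):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly ramp Muon momentum from `start` to `end` over `warmup_steps`,
    then hold it constant at `end`."""
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + frac * (end - start)
```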
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
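A warmdown schedule of this kind is typically a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps; a sketch under that assumption (`base_lr` and `total_steps` are placeholders, not values from the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold `base_lr` until the last `warmdown_iters` steps, then decay
    linearly to zero at `total_steps`."""
    if step < total_steps - warmdown_iters:
        return base_lr
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac
```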
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
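Global-norm gradient clipping at 0.3 can be sketched as follows (pure-Python illustration over a flat gradient list; real training code would operate on tensors):

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Scale `grads` so their global L2 norm does not exceed `max_norm`.
    Returns the (possibly rescaled) gradients and the pre-clip norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```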
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":2048}
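With stride 256 and context 2048, sliding-window evaluation scores each token exactly once while giving all but the earliest tokens at least 2048 − 256 = 1792 tokens of context. A sketch of the window layout (the exact windowing in the PR is assumed; requires stride ≤ context_length):

```python
def sliding_windows(n_tokens, context_length=2048, stride=256):
    """Return (context_start, score_start, end) triples: each window feeds
    [context_start, end) to the model but only scores [score_start, end),
    so every token is scored exactly once."""
    windows = []
    scored = 0
    start = 0
    while scored < n_tokens:
        end = min(start + context_length, n_tokens)
        windows.append((start, scored, end))
        scored = end
        start += stride
    return windows
```

Larger strides mean fewer forward passes, which is consistent with the reduced eval time reported versus smaller strides.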
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Selective precision preservation for sensitive tensors, including fp16 tied embedding and fp16 passthrough for late-layer key projections.
parameters: {"fp16_tied_embedding":true,"fp16_late_k_passthrough_layers":2}
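The selective fp16 passthrough can be expressed as a predicate over tensor names that the quantizer consults before converting to int6; the name scheme below (`embed.weight`, `layers.{i}.attn.k_proj.weight`) is hypothetical, not taken from the PR:

```python
def keep_fp16(name, n_layers=9, fp16_late_k_layers=2):
    """Return True for tensors kept at fp16 instead of int6: the tied
    embedding and the key projections of the last `fp16_late_k_layers` layers."""
    if name == "embed.weight":  # tied input embedding / output projection
        return True
    parts = name.split(".")
    if len(parts) >= 4 and parts[0] == "layers" and parts[3] == "k_proj":
        layer = int(parts[1])
        return layer >= n_layers - fp16_late_k_layers
    return False
```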
Novel Contributions
- Int6 post-training quantization to reduce artifact size and free space for a 3x larger MLP.
- Selective precision preservation for the tied embedding and last two layers' key projections.
- Training at sequence length 2048 instead of 4096 while retaining performance under sliding-window evaluation.
- Gradient clipping at 0.3 to stabilize long-sequence training.
- Batch size of 786,432 tokens found to be optimal for train@2048.
- Sliding-window evaluation with stride 256, which improved val_bpb and reduced eval time versus smaller strides.