PR #1940
opennon record submission: 11 l + int6 + tuned LR + fp16 embed (1.3066 bpb local)
by skamalad
val_bpb: 1.3066
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.99 MB
Training Techniques
Architecture
weight tying
Tied input/output embeddings with FP16 embedding passthrough.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads; see the sketch after this list.
parameters: {"heads":8,"kv_heads":4}
RoPE
Uses RoPE positional encoding.
parameters: null
ReLU²
ReLU squared MLP activation.
parameters: null
MLP3x
Explored wider MLP configurations via MLP_MULT overrides; the baseline architecture uses 2x expansion, with 3x covered in ablations.
parameters: {"mlp_mult":2}
depth
Increased transformer depth from 9 to 11 layers.
parameters: {"layers":11}
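The architecture entries above (GQA, RoPE, ReLU², tied embeddings, 11 layers) combine into a fairly standard decoder block. Below is a minimal sketch under those listed hyperparameters, not the submission's code; module names and the RoPE placement are illustrative.

```python
# Minimal sketch: GQA (8 query heads, 4 KV heads) and a ReLU^2 MLP with 2x expansion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wkv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.wkv(x).chunk(2, dim=-1)
        k = k.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE would be applied to q and k here (omitted for brevity).
        # Each KV head is shared by n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))

class ReLU2MLP(nn.Module):
    # ReLU^2 activation: relu(x) squared, with mlp_mult=2 hidden expansion.
    def __init__(self, dim: int, mlp_mult: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(dim, mlp_mult * dim, bias=False)
        self.fc2 = nn.Linear(mlp_mult * dim, dim, bias=False)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)).square())
```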
Quantization
int6
bits: 6
scope: all weights except embeddings
fp16
bits: 16
scope: embeddings
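A minimal sketch of how an int6 export with FP16 embedding passthrough could work, assuming symmetric per-tensor quantization with QUANT_MAX = 31 (see Novel Contributions below); the function names and numpy-array weight format are assumptions, not the submission's exporter.

```python
# Minimal sketch: symmetric int6 quantization (QUANT_MAX = 31, values in
# [-31, 31]) for all weights, with embeddings stored in FP16 instead.
import numpy as np

QUANT_MAX = 31  # 6-bit symmetric range

def quantize_int6(w: np.ndarray):
    scale = np.abs(w).max() / QUANT_MAX
    q = np.clip(np.round(w / scale), -QUANT_MAX, QUANT_MAX).astype(np.int8)
    return q, scale  # dequantize later as q * scale

def export_weights(state_dict):
    # state_dict values are assumed to be numpy arrays here.
    out = {}
    for name, w in state_dict.items():
        if "embed" in name:          # FP16 passthrough: no quantization tax
            out[name] = w.astype(np.float16)
        else:                        # all other weights -> int6 + scale
            out[name] = quantize_int6(w)
    return out
```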
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
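The listed momentum warmup ramps from 0.92 to 0.99 over 1500 steps. A minimal sketch of that ramp, assuming a simple linear interpolation applied per step; the param-group update in the trailing comment is an assumption about the optimizer's interface, not the submission's code.

```python
# Minimal sketch: linear Muon momentum warmup from 0.92 to 0.99 over 1500 steps.
def muon_momentum(step: int,
                  start: float = 0.92,
                  end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

# e.g. before each optimizer step (assumed param-group layout):
# for group in muon_optimizer.param_groups:
#     group["momentum"] = muon_momentum(step)
```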
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
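A minimal sketch of a warmdown schedule with warmdown_steps = 3000, assuming the common form that holds the base LR and then decays it linearly to zero over the final 3000 steps; total_steps is an assumed parameter not given above.

```python
# Minimal sketch: constant LR, then linear warmdown to zero over the last
# `warmdown_steps` steps. Multiply the base LR by this factor each step.
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```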
Novel Contributions
- 11-layer transformer instead of 9 layers
- Int6 quantization export with QUANT_MAX=31 for better compression
- FP16 embedding passthrough to eliminate quantization tax
- Tuned Muon optimizer settings with lower matrix LR and higher momentum
- MLP_HIDDEN environment override for finer MLP width control (sketched below)
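A minimal sketch of how an MLP_HIDDEN environment override could sit alongside the existing MLP_MULT factor; the function name and fallback logic are assumptions, not the submission's config code.

```python
# Minimal sketch: MLP_HIDDEN sets the hidden width exactly; otherwise fall
# back to the MLP_MULT expansion factor (default 2x).
import os

def mlp_hidden_dim(model_dim: int) -> int:
    if "MLP_HIDDEN" in os.environ:           # exact width override
        return int(os.environ["MLP_HIDDEN"])
    mult = float(os.getenv("MLP_MULT", 2))   # expansion-factor fallback
    return int(mult * model_dim)
```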