val_bpb: 1.1807
Architecture: Transformer
Optimizer: —
Artifact Size: ~3.9 MB
Training Techniques
Architecture
Transformer depth
Increased model depth from 9 to 11 transformer layers.
parameters: {"layers":11}
MLP3x
Expanded the MLP hidden size to 3x the base width instead of 2x.
parameters: {"mlp_multiplier":3}
GQA
Used grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
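Grouped-query attention shares each KV head across several query heads, shrinking the KV cache while keeping the full query-head count. A minimal NumPy sketch of the 8-query/4-KV configuration above; the function name, shapes, and looped form are illustrative, not taken from the training code:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: q is (n_q_heads, T, d),
    k and v are (n_kv_heads, T, d); each KV head serves
    n_q_heads // n_kv_heads query heads."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads            # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (T, T)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With 8 query heads and 4 KV heads, the KV cache halves relative to full multi-head attention.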
Other
LeakyReLU(0.5)^2 activation function replacing ReLU^2.
parameters: {"negative_slope":0.5}
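One plausible scalar reading of the LeakyReLU(0.5)^2 activation, by direct analogy with ReLU^2 (the repo's actual elementwise form may differ, e.g. it could preserve sign on the negative branch):

```python
def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU followed by squaring: negative inputs are scaled by
    0.5 before the square, instead of being zeroed as in ReLU^2."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Note that squaring makes the negative branch non-monotonic, which is one reason the sign-preserving variant is also seen in practice.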
Quantization
GPTQ-lite
bits: 6
scope: per-row weights
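A sketch of per-row symmetric int6 quantization with a simple clip search: for each weight row, several clipping ratios are tried and the one with the lowest reconstruction error wins. This illustrates the clip-search idea only, not the full GPTQ error-compensation algorithm; all names and the grid size are assumptions:

```python
import numpy as np

def quantize_per_row_int6(W, n_grid=20):
    """Per-row symmetric int6 quant with clip search.
    Returns the dequantized weights and one scale per row."""
    qmax = 2**5 - 1                      # int6 symmetric range [-32, 31]
    Wq = np.empty_like(W)
    scales = np.empty(W.shape[0])
    for i, row in enumerate(W):
        amax = np.abs(row).max()
        best_err, best = np.inf, None
        for r in np.linspace(0.5, 1.0, n_grid):   # candidate clip ratios
            s = (r * amax) / qmax
            q = np.clip(np.round(row / s), -qmax - 1, qmax)
            err = ((q * s - row) ** 2).sum()
            if err < best_err:
                best_err, best = err, (q * s, s)
        Wq[i], scales[i] = best
    return Wq, scales
```

Clipping trades a little range error on outliers for finer resolution on the bulk of each row, which is where the search usually helps at 6 bits.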
STE QAT
bits: null
scope: all
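The straight-through estimator (STE) idea behind this QAT pass, as a toy scalar sketch: the forward pass sees fake-quantized weights, while the backward pass treats quantization as the identity, so the gradient computed at the quantized point updates the latent full-precision weight. Function names are illustrative:

```python
def fake_quant(x, scale):
    """Quantize-dequantize; the forward pass uses this value."""
    return round(x / scale) * scale

def ste_update(w, grad_at_quantized, lr):
    """STE backward: pretend d(fake_quant)/dw == 1, so the gradient
    computed at fake_quant(w) is applied directly to the latent w."""
    return w - lr * grad_at_quantized
```

Keeping the latent weight in full precision lets small gradients accumulate across steps even when each individual step is too small to move the quantized value.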
Weight Averaging
EMA
parameters: {"decay":0.997}
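EMA weight averaging keeps a slow-moving shadow copy of the parameters; with decay 0.997, each update blends in 0.3% of the current weights. A minimal sketch over plain Python floats (a real implementation would operate on parameter tensors in place):

```python
class EMA:
    """Exponential moving average of model parameters."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)      # slow copy, used at eval time

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

A decay of 0.997 gives an effective averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.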
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
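Sliding-window evaluation scores a long sequence in overlapping windows: each token is scored exactly once, but (after the first window) only the final `stride` tokens of each window are scored, so every scored token keeps most of the window as context. A sketch of the window bookkeeping; the window size of 256 is an assumption for illustration, only the stride of 64 comes from the config above:

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Return (begin, end, n_scored) spans covering n_tokens,
    scoring each token exactly once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - prev_end        # only tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give more context per scored token (and hence lower measured bpb) at the cost of proportionally more forward passes.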
LR Schedule
late QAT activation based on LR scale threshold
parameters: {"lr_scale_threshold":0.15}
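The late-QAT trigger keys off the LR schedule rather than a fixed step count: once the scheduled learning rate falls below 15% of its peak, fake-quantized training switches on so the weights settle onto the quantization grid during the low-LR tail. A sketch of the gating logic (function name is illustrative):

```python
def qat_active(lr_now, lr_max, lr_scale_threshold=0.15):
    """Enable QAT once the LR schedule has decayed the learning
    rate below lr_scale_threshold of its peak value."""
    return lr_now / lr_max < lr_scale_threshold
```

Tying the trigger to LR scale instead of step count keeps the QAT phase aligned with the schedule even when total training length changes.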
Novel Contributions
- 11 transformer layers instead of the 9-layer baseline
- 3x MLP expansion
- LeakyReLU(0.5)^2 activation
- Int6 per-row GPTQ-lite quantization with clip search
- Late QAT via STE triggered when LR scale drops below 0.15
- EMA weight averaging with decay 0.997
- Grouped-query attention with 8 query heads and 4 KV heads
- Sliding window evaluation with stride 64