PR #805 (open)

Int6 GPTQ-lite + LeakyReLU(0.5)^2 + EMA + 11L MLP3x

val_bpb: 1.1807
Architecture: Transformer
Optimizer:
Artifact Size: ~3.9 MB

Training Techniques

Architecture
Transformer depth
Increased model depth from 9 to 11 transformer layers.
parameters: {"layers":11}
MLP3x
Expanded the MLP hidden size to 3x the base width instead of 2x.
parameters: {"mlp_multiplier":3}
GQA
Used grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
Other
Replaced the ReLU^2 activation with LeakyReLU(0.5)^2.
parameters: {"negative_slope":0.5}
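The activation swap can be sketched in plain Python. This assumes "LeakyReLU(0.5)^2" means applying LeakyReLU with negative slope 0.5 and then squaring, by analogy with the ReLU^2 baseline (relu(x)**2); the PR does not spell out the composition order.

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise
    y = x if x >= 0 else negative_slope * x
    # Square the result (assumed reading of "LeakyReLU(0.5)^2").
    # Unlike ReLU^2, negative inputs contribute a nonzero (positive) output.
    return y * y
```

Note that squaring discards the sign of the leaky branch, so negative inputs map to small positive outputs rather than zero.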
Quantization
GPTQ-lite
bits: 6
scope: per-row weights
STE QAT
bits: null
scope: all
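A minimal sketch of int6 per-row round-to-nearest quantization with a clip search, which is the shape of technique the "GPTQ-lite" entry and the clip-search bullet describe. This is a simplification: the PR's actual GPTQ-lite procedure is not shown here, and the clip grid is an illustrative assumption.

```python
def quantize_row(row, bits=6, clip_grid=(1.0, 0.9, 0.8, 0.7)):
    """Symmetric per-row quantization: pick the clip ratio from a small
    grid that minimizes row reconstruction MSE. Returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    absmax = max(abs(v) for v in row) or 1.0
    best = None
    for clip in clip_grid:
        scale = (clip * absmax) / qmax
        codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
        err = sum((v - c * scale) ** 2 for v, c in zip(row, codes))
        if best is None or err < best[0]:
            best = (err, codes, scale)
    return best[1], best[2]

def dequantize_row(codes, scale):
    return [c * scale for c in codes]
```

Per-row scope (as listed above) means each weight-matrix row gets its own scale, so one outlier row cannot degrade the resolution of the others.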
Weight Averaging
EMA
parameters: {"decay":0.997}
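EMA weight averaging with the listed decay of 0.997 is a standard update; a minimal sketch over flat parameter lists (real implementations track per-tensor shadows):

```python
class EMA:
    """Exponential moving average of model weights.

    shadow <- decay * shadow + (1 - decay) * current, applied each step;
    the shadow weights are what get evaluated/exported.
    """
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

With decay 0.997 the shadow averages over roughly the last 1 / (1 - 0.997) ≈ 333 steps.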
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
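Sliding-window evaluation advances a fixed context window by the stride and scores only the tokens not yet covered, so every token is evaluated exactly once with long left context. The window size of 256 below is illustrative; the PR only specifies stride=64.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (begin, end, score_from) spans: the model sees tokens
    [begin, end) but only tokens [score_from, end) contribute to the
    loss, avoiding double-counting across overlapping windows."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give later tokens more context per evaluation at the cost of proportionally more forward passes.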
LR Schedule
late QAT activation based on LR scale threshold
parameters: {"lr_scale_threshold":0.15}
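The late-QAT trigger can be sketched as a simple predicate on the LR schedule, plus the fake-quantization it switches on. Function names are hypothetical; in a framework with autograd, the straight-through estimator (STE) would pass gradients through the rounding unchanged, and only the forward pass is sketched here.

```python
def qat_active(current_lr, base_lr, threshold=0.15):
    """Enable QAT once the schedule has decayed the LR below
    `threshold` of its base value (lr_scale < 0.15)."""
    return current_lr / base_lr < threshold

def fake_quant_ste(w, scale, qmax=31):
    """Forward pass of STE fake-quantization: round-to-nearest
    quantize then dequantize, keeping weights in float."""
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale
```

Deferring QAT until the LR is small lets most of training run at full precision, with quantization noise introduced only during the low-LR refinement phase.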

Novel Contributions

  • 11 transformer layers instead of the 9-layer baseline
  • 3x MLP expansion
  • LeakyReLU(0.5)^2 activation
  • Int6 per-row GPTQ-lite quantization with clip search
  • Late QAT via STE triggered when LR scale drops below 0.15
  • EMA weight averaging with decay 0.997
  • Grouped-query attention with 8 query heads and 4 KV heads
  • Sliding window evaluation with stride 64
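The grouped-query attention configuration above (8 query heads, 4 KV heads) amounts to sharing each KV head across 8 // 4 = 2 query heads, which halves KV-cache size. A minimal sketch of that sharing; `repeat_kv` is a hypothetical helper name:

```python
def repeat_kv(kv_heads, query_heads=8, n_kv_heads=4):
    """Expand KV heads so each is reused by query_heads // n_kv_heads
    consecutive query heads, as in grouped-query attention."""
    group = query_heads // n_kv_heads
    return [h for h in kv_heads for _ in range(group)]
```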