PR #805 (open)

Int6 GPTQ-lite + LeakyReLU(0.5)^2 + EMA + 11L MLP3x

val_bpb: 1.1807
Architecture: Transformer
Optimizer:
Artifact Size: ~3.9 MB

Training Techniques

Architecture
Transformer depth
Increased model depth from 9 to 11 transformer layers.
parameters: {"layers":11}
MLP3x
Expanded the MLP hidden size to 3x the base width instead of 2x.
parameters: {"mlp_multiplier":3}
GQA
Used grouped-query attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
Other
Replaced the ReLU^2 activation with LeakyReLU(0.5)^2.
parameters: {"negative_slope":0.5}
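The activation swap can be sketched in plain Python. This assumes "LeakyReLU(0.5)^2" means applying LeakyReLU with negative slope 0.5 and then squaring, by analogy with the ReLU^2 baseline (relu(x)**2); the PR does not spell out the composition order.

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise
    y = x if x >= 0 else negative_slope * x
    # Square the result (assumed reading of "LeakyReLU(0.5)^2").
    # Unlike ReLU^2, negative inputs contribute a nonzero (positive) output.
    return y * y
```

Note that squaring discards the sign of the leaky branch, so negative inputs map to small positive outputs rather than zero.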
Quantization
GPTQ-lite
bits: 6
scope: per-row weights
STE QAT
bits: null
scope: all
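A minimal sketch of int6 per-row round-to-nearest quantization with a clip search, which is the shape of technique the "GPTQ-lite" entry and the clip-search bullet describe. This is a simplification: the PR's actual GPTQ-lite procedure is not shown here, and the clip grid is an illustrative assumption.

```python
def quantize_row(row, bits=6, clip_grid=(1.0, 0.9, 0.8, 0.7)):
    """Symmetric per-row quantization: pick the clip ratio from a small
    grid that minimizes row reconstruction MSE. Returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    absmax = max(abs(v) for v in row) or 1.0
    best = None
    for clip in clip_grid:
        scale = (clip * absmax) / qmax
        codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
        err = sum((v - c * scale) ** 2 for v, c in zip(row, codes))
        if best is None or err < best[0]:
            best = (err, codes, scale)
    return best[1], best[2]

def dequantize_row(codes, scale):
    return [c * scale for c in codes]
```

Per-row scope (as listed above) means each weight-matrix row gets its own scale, so one outlier row cannot degrade the resolution of the others.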
Weight Averaging
EMA
parameters: {"decay":0.997}
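EMA weight averaging with the listed decay of 0.997 is a standard update; a minimal sketch over flat parameter lists (real implementations track per-tensor shadows):

```python
class EMA:
    """Exponential moving average of model weights.

    shadow <- decay * shadow + (1 - decay) * current, applied each step;
    the shadow weights are what get evaluated/exported.
    """
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p
                       for s, p in zip(self.shadow, params)]
```

With decay 0.997 the shadow averages over roughly the last 1 / (1 - 0.997) ≈ 333 steps.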
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
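Sliding-window evaluation advances a fixed context window by the stride and scores only the tokens not yet covered, so every token is evaluated exactly once with long left context. The window size of 256 below is illustrative; the PR only specifies stride=64.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (begin, end, score_from) spans: the model sees tokens
    [begin, end) but only tokens [score_from, end) contribute to the
    loss, avoiding double-counting across overlapping windows."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Smaller strides give later tokens more context per evaluation at the cost of proportionally more forward passes.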
LR Schedule
late QAT activation based on LR scale threshold
parameters: {"lr_scale_threshold":0.15}
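The late-QAT trigger can be sketched as a simple predicate on the LR schedule, plus the fake-quantization it switches on. Function names are hypothetical; in a framework with autograd, the straight-through estimator (STE) would pass gradients through the rounding unchanged, and only the forward pass is sketched here.

```python
def qat_active(current_lr, base_lr, threshold=0.15):
    """Enable QAT once the schedule has decayed the LR below
    `threshold` of its base value (lr_scale < 0.15)."""
    return current_lr / base_lr < threshold

def fake_quant_ste(w, scale, qmax=31):
    """Forward pass of STE fake-quantization: round-to-nearest
    quantize then dequantize, keeping weights in float."""
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale
```

Deferring QAT until the LR is small lets most of training run at full precision, with quantization noise introduced only during the low-LR refinement phase.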

Novel Contributions

  • 11 transformer layers instead of the 9-layer baseline
  • 3x MLP expansion
  • LeakyReLU(0.5)^2 activation
  • Int6 per-row GPTQ-lite quantization with clip search
  • Late QAT via STE triggered when LR scale drops below 0.15
  • EMA weight averaging with decay 0.997
  • Grouped-query attention with 8 query heads and 4 KV heads
  • Sliding window evaluation with stride 64
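The grouped-query attention configuration above (8 query heads, 4 KV heads) amounts to sharing each KV head across 8 // 4 = 2 query heads, which halves KV-cache size. A minimal sketch of that sharing; `repeat_kv` is a hypothetical helper name:

```python
def repeat_kv(kv_heads, query_heads=8, n_kv_heads=4):
    """Expand KV heads so each is reused by query_heads // n_kv_heads
    consecutive query heads, as in grouped-query attention."""
    group = query_heads // n_kv_heads
    return [h for h in kv_heads for _ in range(group)]
```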