PR #426 (open)

Record: 10L Int5-MLP + Mixed Quant + GradClip + Warmdown3k (mean val_bpb=1.20262)

by aniketio-ctrl
val_bpb: 1.2026
Architecture: Transformer
Optimizer:
Artifact Size: ~15.7MB

Training Techniques

Quantization
  • mixed int5/int6 with fp16 embeddings
    bits: null
    scope: MLP, attention, embeddings
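The PR does not spell out the exact quantization scheme, so the sketch below assumes symmetric per-tensor quantization; the scale and rounding details are illustrative, not the author's code. It shows how the same routine yields int5 (MLP) or int6 (attention) codes, while embeddings stay fp16 and skip this path entirely.

```python
def quantize(weights, bits):
    """Map floats to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero tensor
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

# MLP weights -> int5, attention weights -> int6; fp16 embeddings bypass this.
mlp_q, mlp_scale = quantize([0.5, -1.2, 0.03, 0.9], bits=5)
attn_q, attn_scale = quantize([0.5, -1.2, 0.03, 0.9], bits=6)
```

Per-tensor symmetric quantization keeps only one fp scale per tensor, so the overhead on top of the packed integer codes is negligible at this model size.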
Architecture
  • depth and MLP width increase: increased model depth from 9 to 10 layers and widened the MLP from 2x to 3x expansion to fit within the artifact budget.
    parameters: {"layers":10,"mlp_mult":3,"hidden_size":1536,"dim":512,"heads":8,"kv_heads":4}
  • tied embeddings: input and output embeddings share one fp16 matrix, so the embedding weights carry zero quantization error.
    parameters: null
  • GQA: grouped-query attention with 8 query heads sharing 4 KV heads.
    parameters: {"kv_heads":4}
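With heads=8 and kv_heads=4 from the parameters above, each KV head serves two query heads. A minimal sketch of the head mapping and the parameter saving (helper names are mine, not from the PR):

```python
HEADS, KV_HEADS, DIM = 8, 4, 512
HEAD_DIM = DIM // HEADS            # 64
GROUP = HEADS // KV_HEADS          # 2 query heads share each KV head

def kv_head_for(q_head):
    """Which shared KV head a given query head attends with."""
    return q_head // GROUP         # heads 0,1 -> kv 0; heads 2,3 -> kv 1; ...

# K/V projections shrink from DIM x DIM to DIM x (KV_HEADS * HEAD_DIM),
# halving K/V weight count and KV-cache size at this config.
kv_proj_cols = KV_HEADS * HEAD_DIM  # 256 instead of 512 for full MHA
```

Halving the K and V projections is one of the places this record recovers parameter budget for the tenth layer.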
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":3000}
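The PR only gives warmdown_steps=3000, so this sketch assumes a linear decay-to-zero tail; the base LR and total step count are placeholders, not values from the record:

```python
BASE_LR, TOTAL_STEPS, WARMDOWN_STEPS = 3e-4, 20_000, 3_000  # placeholder values

def lr_at(step):
    """Flat LR, then a linear 'warmdown' ramp to zero over the last steps."""
    if step < TOTAL_STEPS - WARMDOWN_STEPS:
        return BASE_LR                               # constant phase
    remaining = TOTAL_STEPS - step
    return BASE_LR * remaining / WARMDOWN_STEPS      # linear ramp to 0
```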
Regularization
  • gradient clipping
    parameters: {"grad_clip_norm":0.3}
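Clipping at a global norm of 0.3 rescales the whole gradient vector when it is too long, rather than clamping elements individually. A self-contained sketch of the idea (equivalent in spirit to `torch.nn.utils.clip_grad_norm_`, though the PR does not say which implementation it uses):

```python
def clip_grad_norm(grads, max_norm=0.3):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total = sum(g * g for g in grads) ** 0.5
    if total <= max_norm:
        return grads, total                  # already short enough
    scale = max_norm / total
    return [g * scale for g in grads], total # direction preserved, length capped
```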
Compression
  • zlib
    level: null
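The final artifact step is a lossless zlib pass over the serialized weights. Since the level is unspecified in the record, this sketch uses the library default; the payload is a stand-in for packed quantized weights, not the PR's actual format:

```python
import zlib

payload = bytes([17, 17, 17, 200] * 1000)  # stand-in for packed int5/int6 codes
blob = zlib.compress(payload)              # default compression level
restored = zlib.decompress(blob)           # lossless round trip
```

Low-bit quantized weights tend to have skewed code distributions, which is what gives zlib room to shave the artifact under the 16MB cap.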

Novel Contributions

  • Mixed precision quantization to fund an extra transformer layer within the 16MB budget
  • Int5 quantization for MLP weights
  • Int6 quantization for attention weights
  • FP16 tied embeddings
  • Increased depth from 9 to 10 layers
  • Wider MLP expansion from 2x to 3x
  • Longer warmdown schedule
  • Gradient clipping for more stable training
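Int5 has no native dtype, so realizing the size win from the contributions above requires bit-packing the 5-bit codes into bytes. The LSB-first packing scheme below is my illustration, not the PR's code:

```python
def pack5(values):
    """Pack unsigned 5-bit codes (0..31) into bytes, LSB first."""
    acc = nbits = 0
    out = bytearray()
    for v in values:
        acc |= (v & 0x1F) << nbits
        nbits += 5
        while nbits >= 8:          # flush whole bytes as they fill
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # flush the partial final byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack5(data, count):
    """Recover `count` 5-bit codes from a pack5() byte string."""
    acc = nbits = 0
    vals = []
    it = iter(data)
    for _ in range(count):
        while nbits < 5:           # refill the bit accumulator
            acc |= next(it) << nbits
            nbits += 8
        vals.append(acc & 0x1F)
        acc >>= 5
        nbits -= 5
    return vals

codes = [0, 31, 7, 19, 4, 25, 12, 3]   # 8 codes * 5 bits = 40 bits = 5 bytes
packed = pack5(codes)
```

Eight 5-bit codes fit exactly in five bytes, which is where the 5/16 size ratio versus fp16 comes from before zlib is applied.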