PR #426 (open)

Record: 10L Int5-MLP + Mixed Quant + GradClip + Warmdown3k (mean val_bpb=1.20262)

by aniketio-ctrl
val_bpb: 1.2026
Architecture: Transformer
Optimizer:
Artifact Size: ~15.7MB

Training Techniques

Quantization
  • mixed int5/int6 with fp16 embeddings
    bits: null
    scope: MLP, attention, embeddings
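The PR does not spell out the exact quantization scheme, so the sketch below assumes symmetric per-tensor quantization; the scale and rounding details are illustrative, not the author's code. It shows how the same routine yields int5 (MLP) or int6 (attention) codes, while embeddings stay fp16 and skip this path entirely.

```python
def quantize(weights, bits):
    """Map floats to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero tensor
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

# MLP weights -> int5, attention weights -> int6; fp16 embeddings bypass this.
mlp_q, mlp_scale = quantize([0.5, -1.2, 0.03, 0.9], bits=5)
attn_q, attn_scale = quantize([0.5, -1.2, 0.03, 0.9], bits=6)
```

Per-tensor symmetric quantization keeps only one fp scale per tensor, so the overhead on top of the packed integer codes is negligible at this model size.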
Architecture
  • depth and MLP width increase: increased model depth from 9 to 10 layers and widened the MLP from 2x to 3x expansion to fit within the artifact budget.
    parameters: {"layers":10,"mlp_mult":3,"hidden_size":1536,"dim":512,"heads":8,"kv_heads":4}
  • tied embeddings: input and output embeddings share one fp16 matrix, so the embedding weights carry zero quantization error.
    parameters: null
  • GQA: grouped-query attention with 8 query heads sharing 4 KV heads.
    parameters: {"kv_heads":4}
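With heads=8 and kv_heads=4 from the parameters above, each KV head serves two query heads. A minimal sketch of the head mapping and the parameter saving (helper names are mine, not from the PR):

```python
HEADS, KV_HEADS, DIM = 8, 4, 512
HEAD_DIM = DIM // HEADS            # 64
GROUP = HEADS // KV_HEADS          # 2 query heads share each KV head

def kv_head_for(q_head):
    """Which shared KV head a given query head attends with."""
    return q_head // GROUP         # heads 0,1 -> kv 0; heads 2,3 -> kv 1; ...

# K/V projections shrink from DIM x DIM to DIM x (KV_HEADS * HEAD_DIM),
# halving K/V weight count and KV-cache size at this config.
kv_proj_cols = KV_HEADS * HEAD_DIM  # 256 instead of 512 for full MHA
```

Halving the K and V projections is one of the places this record recovers parameter budget for the tenth layer.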
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":3000}
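The PR only gives warmdown_steps=3000, so this sketch assumes a linear decay-to-zero tail; the base LR and total step count are placeholders, not values from the record:

```python
BASE_LR, TOTAL_STEPS, WARMDOWN_STEPS = 3e-4, 20_000, 3_000  # placeholder values

def lr_at(step):
    """Flat LR, then a linear 'warmdown' ramp to zero over the last steps."""
    if step < TOTAL_STEPS - WARMDOWN_STEPS:
        return BASE_LR                               # constant phase
    remaining = TOTAL_STEPS - step
    return BASE_LR * remaining / WARMDOWN_STEPS      # linear ramp to 0
```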
Regularization
  • gradient clipping
    parameters: {"grad_clip_norm":0.3}
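Clipping at a global norm of 0.3 rescales the whole gradient vector when it is too long, rather than clamping elements individually. A self-contained sketch of the idea (equivalent in spirit to `torch.nn.utils.clip_grad_norm_`, though the PR does not say which implementation it uses):

```python
def clip_grad_norm(grads, max_norm=0.3):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total = sum(g * g for g in grads) ** 0.5
    if total <= max_norm:
        return grads, total                  # already short enough
    scale = max_norm / total
    return [g * scale for g in grads], total # direction preserved, length capped
```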
Compression
  • zlib
    level: null
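The final artifact step is a lossless zlib pass over the serialized weights. Since the level is unspecified in the record, this sketch uses the library default; the payload is a stand-in for packed quantized weights, not the PR's actual format:

```python
import zlib

payload = bytes([17, 17, 17, 200] * 1000)  # stand-in for packed int5/int6 codes
blob = zlib.compress(payload)              # default compression level
restored = zlib.decompress(blob)           # lossless round trip
```

Low-bit quantized weights tend to have skewed code distributions, which is what gives zlib room to shave the artifact under the 16MB cap.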

Novel Contributions

  • Mixed precision quantization to fund an extra transformer layer within the 16MB budget
  • Int5 quantization for MLP weights
  • Int6 quantization for attention weights
  • FP16 tied embeddings
  • Increased depth from 9 to 10 layers
  • Wider MLP expansion from 2x to 3x
  • Longer warmdown schedule
  • Gradient clipping for more stable training
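Int5 has no native dtype, so realizing the size win from the contributions above requires bit-packing the 5-bit codes into bytes. The LSB-first packing scheme below is my illustration, not the PR's code:

```python
def pack5(values):
    """Pack unsigned 5-bit codes (0..31) into bytes, LSB first."""
    acc = nbits = 0
    out = bytearray()
    for v in values:
        acc |= (v & 0x1F) << nbits
        nbits += 5
        while nbits >= 8:          # flush whole bytes as they fill
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # flush the partial final byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack5(data, count):
    """Recover `count` 5-bit codes from a pack5() byte string."""
    acc = nbits = 0
    vals = []
    it = iter(data)
    for _ in range(count):
        while nbits < 5:           # refill the bit accumulator
            acc |= next(it) << nbits
            nbits += 8
        vals.append(acc & 0x1F)
        acc >>= 5
        nbits -= 5
    return vals

codes = [0, 31, 7, 19, 4, 25, 12, 3]   # 8 codes * 5 bits = 40 bits = 5 bytes
packed = pack5(codes)
```

Eight 5-bit codes fit exactly in five bytes, which is where the 5/16 size ratio versus fp16 comes from before zlib is applied.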