PR #305
open · 12L Full-INT4 (MLP + Attn) + BigramHash(4096) — val_bpb 1.1672
by Naazimsnh02
val_bpb
1.1672
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.4 MB
Training Techniques
Quantization
int4
bits: 4
scope: MLP and attention weights
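A minimal NumPy sketch of what this entry describes: symmetric group INT4 quantization with two weights nibble-packed per byte and one fp16 scale per group of 64 (the contribution list gives group size 64 and fp16 scales; everything else here is an illustrative assumption, not the PR's code):

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 64):
    """Symmetric INT4 quantization: per-group fp16 scales, two values per byte."""
    flat = w.reshape(-1, group_size)
    # Per-group scale maps the max magnitude onto the INT4 range [-7, 7];
    # the floor guards all-zero groups against division by zero.
    scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True) / 7.0, 1e-4).astype(np.float16)
    q = np.clip(np.round(flat / scales.astype(np.float32)), -7, 7).astype(np.int8)
    # Pack two signed 4-bit values per byte (offset by 8 to make them unsigned).
    u = (q + 8).astype(np.uint8)
    packed = (u[:, 0::2] << 4) | u[:, 1::2]
    return packed, scales

def dequantize_int4_grouped(packed: np.ndarray, scales: np.ndarray, shape):
    """Unpack nibbles, undo the +8 offset, and rescale per group."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = hi, lo
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(shape)
```

At group size 64, the fp16 scales add 16 bits per 64 weights, i.e. about 0.25 extra bits per weight on top of the 4-bit payload.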
Architecture
BigramHash
Adds a hashed bigram embedding table as an auxiliary token-interaction representation
parameters: {"vocab":4096,"dim":64}
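The PR doesn't show the hashing scheme, so the sketch below uses a simple multiplicative hash as a stand-in; the table size of 4096 and dim of 64 follow the listed parameters:

```python
import numpy as np

def bigram_hash_lookup(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up a hashed (prev_token, token) embedding for every position.

    `table` has shape (4096, 64) per the listed parameters; the multiplier-31
    hash is an illustrative assumption, not the PR's actual function.
    """
    prev = np.roll(tokens, 1)
    prev[0] = 0  # first position has no predecessor
    idx = (prev.astype(np.int64) * 31 + tokens) % table.shape[0]
    return table[idx]  # (seq_len, 64), added to the hidden stream elsewhere
```

At 4096 × 64 fp16 entries this table costs about 0.5 MB, which is consistent with the contribution note about shrinking it to fit the 16 MB budget.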
weight tying
Tied embeddings
parameters: {"dim":512}
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
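With 8 attention heads sharing 4 KV heads, each cached K/V head serves two query heads. A minimal sketch of the head expansion (shapes are illustrative):

```python
import numpy as np

def expand_kv_heads(kv: np.ndarray, n_heads: int = 8, n_kv_heads: int = 4) -> np.ndarray:
    """Repeat each KV head so 8 query heads can attend over 4 shared KV heads.

    kv: (n_kv_heads, seq_len, head_dim) -> (n_heads, seq_len, head_dim)
    """
    assert n_heads % n_kv_heads == 0
    return np.repeat(kv, n_heads // n_kv_heads, axis=0)
```

Halving the KV head count halves the K/V projection parameters, one place the weight budget saved by quantization can come from.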
MLP3x
3x MLP expansion
parameters: {"hidden":1536}
RoPE
Rotary positional encoding
parameters: {"base":10000}
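RoPE rotates each pair of channels by an angle proportional to position, with per-pair frequencies derived from the listed base of 10000; a minimal sketch:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles."""
    seq, dim = x.shape
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per channel pair
    ang = pos * freqs                               # (seq, dim // 2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

Because each pair is only rotated, vector norms are preserved and relative position falls out of the dot product between rotated queries and keys.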
U-Net skip connections
Symmetric skip connections between encoder and decoder halves across layers
parameters: {"layers":12}
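The exact wiring isn't shown; a common symmetric scheme over 12 blocks adds the output of each first-half block back into the input of its mirrored second-half block, sketched here with plain callables:

```python
def unet_forward(x, blocks):
    """Run blocks with symmetric skips: activations from the first half are
    saved and added back, last-saved-first, before each second-half block
    (a common scheme; the PR's exact wiring is an assumption)."""
    n = len(blocks)
    saved = []
    for blk in blocks[: n // 2]:      # encoder half
        x = blk(x)
        saved.append(x)
    for blk in blocks[n // 2:]:       # decoder half
        x = x + saved.pop()           # skip from the mirrored encoder block
        x = blk(x)
    return x
```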
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_for":"scalars, embeddings"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
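A sketch of the listed SWA config as a running average: snapshots are folded in every 50 steps once 40% of training has passed (the parameter-dict interface is illustrative):

```python
import numpy as np

class SWA:
    """Stochastic weight averaging: running mean of parameter snapshots
    taken every `every` steps after `start_frac` of training."""
    def __init__(self, total_steps: int, start_frac: float = 0.4, every: int = 50):
        self.start = int(total_steps * start_frac)
        self.every = every
        self.avg, self.n = None, 0

    def update(self, step: int, params: dict) -> None:
        if step < self.start or step % self.every != 0:
            return  # before the SWA window, or off the sampling cadence
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64) for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n  # incremental mean
```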
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
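Sliding-window evaluation re-scores a long sequence in overlapping windows and keeps only the losses for the final `stride` tokens of each window, so most tokens are conditioned on near-full context. A sketch with the listed stride of 64 (the `nll_fn` interface is an assumption):

```python
def sliding_window_nll(nll_fn, tokens, window: int = 2048, stride: int = 64) -> float:
    """Mean per-token NLL over `tokens`, scored in overlapping windows.

    nll_fn(chunk) must return per-token NLLs for chunk[1:] given its prefixes.
    """
    nlls, scored = [], 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        chunk = tokens[max(0, end - window): end]
        per_tok = nll_fn(chunk)
        new = end - scored                       # positions not yet scored
        nlls.extend(per_tok if scored == 0 else per_tok[-new:])
        scored = end
        if end == len(tokens):
            break
    return sum(nlls) / len(nlls)
```

Dividing the mean NLL (in nats) by ln(2) and by the average bytes per token gives bits-per-byte, the val_bpb metric above.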
Initialization
Orthogonal
Orthogonal initialization with muP-scaled output projections
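A sketch of orthogonal init via QR decomposition; the muP-style downscaling of output projections is shown as 1/sqrt(width), which is an assumption since the exact factor isn't listed:

```python
import numpy as np

def orthogonal_init(shape, is_output_proj: bool = False, rng=None) -> np.ndarray:
    """Draw a (semi-)orthogonal matrix via QR; optionally downscale output
    projections muP-style (the 1/sqrt(width) factor is an assumption)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs for a uniform distribution
    if is_output_proj:
        q = q / shape[1] ** 0.5
    return q
```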
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
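The listed schedule is trapezoidal: 20 linear warmup steps, a flat plateau, then a 3000-step linear warmdown. A sketch returning the LR multiplier (the total step count is an assumption, not listed in the PR):

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_iters: int = 3000) -> float:
    """Trapezoidal schedule: linear warmup -> flat -> linear warmdown to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```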
Regularization
weight decay
parameters: {"value":0.04}
Other
other
10% magnitude pruning before quantization to create zero runs that compress better
parameters: {"pruning_percentile":10}
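A sketch of the pruning step: zero the smallest-magnitude 10% of weights before quantization, so the packed INT4 stream contains long zero runs for zstd to exploit (the percentile is taken globally here; a per-tensor threshold is equally plausible):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, percentile: float = 10.0) -> np.ndarray:
    """Zero out the smallest-magnitude `percentile`% of entries, leaving the
    rest untouched."""
    thresh = np.percentile(np.abs(w), percentile)
    return np.where(np.abs(w) < thresh, 0.0, w)
```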
Novel Contributions
- Group INT4 nibble-packing applied to both MLP and attention weights, with group size 64 and fp16 scales
- Freed quantization budget to enable 12 transformer layers instead of 10
- U-Net skip connections across the 12-layer model
- 10% magnitude pruning before quantization to improve zstd compression
- BigramHash reduced to 4096 to fit within the 16 MB budget