PR #1009

open

12L INT4 bQAT + Value Embeddings — val_bpb 1.1588

by SoHarshh
val_bpb
1.1574
Architecture
Transformer
Optimizer
Artifact Size
16.41 MB

Training Techniques

Quantization
QAT
bits: 4
scope: MLP + bigram
late QAT
bits: 4
scope: MLP + bigram
INT6
bits: 6
scope: attention
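The PR's quantization code isn't shown; a minimal sketch of the symmetric per-tensor fake quantization typically used in QAT, with the bit width as a parameter so the same helper covers both the INT4 (MLP + bigram) and INT6 (attention) scopes. In real QAT the backward pass would use a straight-through estimator; only the forward rounding is shown here.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Symmetric per-tensor fake quantization: snap weights onto a
    2**bits-level integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for INT4, 31 for INT6
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([-1.0, -0.3, 0.0, 0.25, 1.0])
w4 = fake_quant(w, bits=4)   # INT4 grid (MLP + bigram scope)
w6 = fake_quant(w, bits=6)   # INT6 grid (attention scope)
```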
Architecture
BigramHash
Bigram embedding with 10240 buckets
parameters: {"buckets":10240}
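A sketch of a hashed bigram embedding with the PR's 10240 buckets. The embedding dimension and the exact hash function are not stated in the PR, so both are assumptions here (a simple multiplicative hash of the consecutive token pair stands in for whatever hash the PR uses).

```python
import numpy as np

N_BUCKETS = 10240   # from the PR's {"buckets": 10240}
EMB_DIM = 32        # illustrative; the real dimension isn't stated

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((N_BUCKETS, EMB_DIM)) * 0.02

def bigram_embed(tokens):
    """Hash each (previous, current) token pair into a bucket and look
    up its embedding; position 0 uses a sentinel previous token."""
    prev = np.concatenate(([0], tokens[:-1]))
    buckets = (prev * 1000003 + tokens) % N_BUCKETS   # assumed hash
    return bigram_table[buckets]

emb = bigram_embed(np.array([5, 17, 17, 9]))
```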
MLP3x
Three-layer MLP with LeakyReLU squared activation
parameters: {"layers":3}
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
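The two entries above can be read together as one block: a three-layer MLP whose activation is a LeakyReLU (negative slope 0.5, per the parameters) followed by a square. The sign-preserving square below is one plausible reading of "LeakyReLU squared"; the PR's exact definition, widths, and use of biases are not shown.

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5, then a sign-preserving square
    (assumed interpretation of 'LeakyReLU squared')."""
    y = np.where(x >= 0, x, negative_slope * x)
    return np.sign(y) * y ** 2

def mlp3(x, Ws):
    """Three-layer MLP: linear -> activation twice, then a final
    linear. Ws holds three weight matrices (biases omitted)."""
    h = x
    for W in Ws[:-1]:
        h = leaky_relu_sq(h @ W)
    return h @ Ws[-1]

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 16)) * 0.1,
      rng.standard_normal((16, 16)) * 0.1,
      rng.standard_normal((16, 8)) * 0.1]
out = mlp3(rng.standard_normal((2, 8)), Ws)
```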
XSA
Cross-layer shared attention in the last 4 layers
parameters: {"layers":4}
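A sketch of cross-layer shared attention: one set of attention projections is reused by the last 4 layers, which is where the parameter saving comes from, while each application still sees a different residual-stream input. Single-head, and the causal mask is omitted for brevity; both are simplifications of whatever the PR actually runs.

```python
import numpy as np

D, N_SHARED = 16, 4   # model dim (illustrative) and shared layer count

rng = np.random.default_rng(0)
# One projection set shared across the last N_SHARED layers.
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

def shared_attn(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(D)
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return (a @ v) @ Wo

def last_layers(x):
    for _ in range(N_SHARED):   # same weights, applied 4 times
        x = x + shared_attn(x)
    return x

y = last_layers(rng.standard_normal((5, D)))
```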
U-Net skip connections
U-Net style skip connections
parameters: null
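A sketch of U-Net-style skip connections in a layer stack: activations from the first half are saved and added back into the second half. The PR doesn't show its pairing, so mirrored (first-to-last) pairing is assumed.

```python
import numpy as np

def unet_stack(x, blocks):
    """Run a list of blocks; save first-half outputs and add them back
    to the mirrored second-half layers (assumed pairing)."""
    n = len(blocks)
    saved = []
    for f in blocks[: n // 2]:
        x = f(x)
        saved.append(x)
    for f in blocks[n // 2:]:
        x = f(x) + saved.pop()
    return x

# Toy blocks that each add a constant, to make the flow visible.
blocks = [lambda x, s=s: x + s for s in (1.0, 2.0, 3.0, 4.0)]
y = unet_stack(np.zeros(2), blocks)
```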
RoPE
Partial rotary position embeddings
parameters: {"dimensions":16}
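Partial RoPE with the PR's 16 rotary dimensions can be sketched as rotating only the first 16 channels and passing the rest through. The 10000 frequency base and the channel layout are standard-RoPE assumptions, not taken from the PR.

```python
import numpy as np

ROT_DIMS = 16   # per the PR: only 16 dimensions are rotated

def partial_rope(x, pos):
    """Rotate the first ROT_DIMS channels by position-dependent angles;
    remaining channels pass through unrotated."""
    half = ROT_DIMS // 2
    inv_freq = 1.0 / (10000 ** (np.arange(half) / half))
    ang = pos[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[:, :half], x[:, half:ROT_DIMS], x[:, ROT_DIMS:]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, rest], axis=-1)

x = np.ones((4, 32))
y = partial_rope(x, np.arange(4))
```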
VE128
Value embeddings reinject token identity into V at layers 10-11 using a shared embedding table
parameters: {"ve_dim":128,"layers":[10,11]}
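A sketch of the VE128 idea: a single 128-dim embedding table shared by layers 10 and 11 (the stated parameter saving), projected into the model dimension and added to the attention V vectors at those layers only. The per-layer projection, model dimension, and vocab size are illustrative assumptions.

```python
import numpy as np

VE_DIM, D_MODEL = 128, 256   # ve_dim from the PR; d_model illustrative
VE_LAYERS = (10, 11)
VOCAB = 1000                 # illustrative

rng = np.random.default_rng(0)
ve_table = rng.standard_normal((VOCAB, VE_DIM)) * 0.02   # shared table
ve_proj = {l: rng.standard_normal((VE_DIM, D_MODEL)) * 0.02
           for l in VE_LAYERS}                           # per-layer map

def add_value_embedding(v, tokens, layer):
    """Re-inject token identity into V at layers 10-11; every other
    layer passes V through unchanged."""
    if layer not in VE_LAYERS:
        return v
    return v + ve_table[tokens] @ ve_proj[layer]

v = np.zeros((6, D_MODEL))
toks = np.array([1, 2, 3, 4, 5, 6])
v10 = add_value_embedding(v, toks, layer=10)
v05 = add_value_embedding(v, toks, layer=5)
```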
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
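The stated scale rule is a one-liner: multiply each layer's normalized output by 1/sqrt(layer+1), damping deeper layers' contribution to the residual stream. A minimal sketch:

```python
import numpy as np

def ln_scale(layer):
    """Depth-dependent LayerNorm output scale, per the PR's
    scale = 1/sqrt(layer+1)."""
    return 1.0 / np.sqrt(layer + 1)

def scaled_layernorm(x, layer, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return ln_scale(layer) * (x - mu) / np.sqrt(var + eps)

y = scaled_layernorm(np.random.default_rng(0).standard_normal((2, 8)),
                     layer=3)   # scale 1/sqrt(4) = 0.5
```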
Initialization
resid mix
Learnable blend between residual stream and initial state
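One way to realize a learnable blend between the residual stream and the initial (embedding) state is a scalar gate; the sigmoid parameterization below, which keeps the blend convex, is an assumption about how the PR implements it.

```python
import numpy as np

def resid_mix(x, x0, lam):
    """Blend the current residual stream x with the initial state x0.
    lam is a trained scalar; sigmoid gating (assumed) keeps the mix
    a convex combination."""
    g = 1.0 / (1.0 + np.exp(-lam))
    return g * x + (1.0 - g) * x0

x, x0 = np.full(4, 2.0), np.zeros(4)
y = resid_mix(x, x0, lam=0.0)   # gate 0.5 -> halfway blend
```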
Weight Averaging
EMA
parameters: {"decay":0.997}
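The EMA update with the PR's decay of 0.997 is standard; the sketch below also notes, in a comment, the "QAT activation reset fix" from the contributions list, since averaging pre-QAT full-precision weights with post-QAT quantized weights would otherwise mix incompatible parameter states.

```python
import numpy as np

DECAY = 0.997   # from the PR's {"decay": 0.997}

def ema_update(ema_params, params, decay=DECAY):
    """Exponential moving average of weights; evaluation uses the EMA
    copy. Per the PR's fix, the EMA copy should be reset when late QAT
    activates so full-precision and quantized weights don't mix."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, params)]

params = [np.ones(3)]
ema = [np.zeros(3)]
for _ in range(1000):
    ema = ema_update(ema, params)
```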
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3}
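Full test-time training with the stated hyperparameters (learning rate 0.002, 3 epochs) amounts to taking a few gradient steps on the evaluation data before predicting. The sketch below uses a linear least-squares model so the gradient can be written by hand; the PR fine-tunes the full network on the language-modeling loss instead.

```python
import numpy as np

LR, EPOCHS = 0.002, 3   # from the PR's TTT parameters

def ttt(w, X, y, lr=LR, epochs=EPOCHS):
    """A few full-batch gradient steps on the test data itself
    (linear least-squares stand-in for the real LM objective)."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0])
w0 = np.zeros(4)
w1 = ttt(w0, X, y)
loss0 = np.mean((X @ w0 - y) ** 2)
loss1 = np.mean((X @ w1 - y) ** 2)
```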

Novel Contributions

  • INT4 QAT for MLP and bigram components
  • Value embeddings added to V vectors at layers 10-11
  • Shared value embedding table to reduce parameter cost
  • EMA with QAT activation reset fix
  • Combination of U-Net skips, XSA, partial RoPE, LN scale, and resid_mix under a tight size budget