PR #1002

open

12L INT4 bQAT + EMA Fix + Deterministic QAT — val_bpb ~1.165

by SoHarshh
val_bpb
1.1650
Architecture
Transformer
Optimizer
Artifact Size
15.97 MB

Training Techniques

Architecture
BigramHash
Bigram hash table used in the model, quantized and trained with INT4 bQAT.
parameters: {"buckets":10240}
MLP3x
Three-layer MLP with LeakyReLU activation.
parameters: null
LeakyReLU
LeakyReLU(0.5) squared activation used in the MLP.
parameters: {"slope":0.5}
XSA
Cross-layer shared attention applied to the last 4 layers.
parameters: {"last_n_layers":4}
RoPE
Partial rotary positional embedding.
parameters: {"dimensions":16,"total_dimensions":64}
U-Net skip connections
U-Net style skip connections in the residual stream.
parameters: null
resid_mix
Learnable x/x0 blend always active.
parameters: null
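As a concrete illustration of the BigramHash entry above, here is a minimal sketch of a hashed bigram table with 10,240 buckets (the `buckets` parameter). Everything else here, including the hash constants, `DIM`, and all function names, is an assumption for illustration, not the PR's actual code.

```python
BUCKETS = 10240  # {"buckets": 10240} from the parameters above
DIM = 8          # illustrative embedding width

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    # Stable mixing hash over the (prev, cur) token pair; constants are arbitrary.
    h = (prev_tok * 1_000_003 + cur_tok) * 2_654_435_761
    return (h & 0xFFFFFFFF) % buckets

def bigram_features(tokens, table):
    # One hashed-bucket row per position; position 0 pairs with a BOS id of 0.
    feats, prev = [], 0
    for t in tokens:
        feats.append(table[bigram_bucket(prev, t)])
        prev = t
    return feats

# A learnable table would live inside the model; plain lists keep the sketch
# dependency-free.
table = [[0.0] * DIM for _ in range(BUCKETS)]
```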
Regularization
LN scale
Per-layer LayerNorm scaling by 1/sqrt(layer+1).
parameters: {"formula":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997,"qat_activation_reset":true}
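The `qat_activation_reset` flag above can be sketched as an EMA that re-seeds its shadow weights from the live weights at the moment QAT turns on, so stale pre-QAT averages cannot leak into the quantized model. A minimal sketch with the listed decay of 0.997; the class and method names are assumptions.

```python
class EMA:
    """EMA of model weights (decay=0.997 above) with reset-on-QAT-activation."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # shadow copy of the live weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

    def reset(self, params):
        # qat_activation_reset: re-seed the average from the live (now
        # fake-quantized) weights so pre-QAT history cannot degrade the
        # quantized model.
        self.shadow = list(params)
```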
Quantization
QAT
bits: 4
scope: MLP and bigram; INT6 attention
late QAT
bits: 4
scope: training
INT4
bits: 4
scope: BigramHash
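A minimal sketch of what INT4 QAT does in the forward pass: weights are snapped to a symmetric 4-bit grid (at most 15 levels) while staying in float, so the network learns to tolerate quantization error. Per-tensor symmetric scaling is an assumption here; the PR may use a different granularity.

```python
def fake_quant(w, bits=4):
    # Symmetric per-tensor fake quantization: snap each weight to an INT grid,
    # but return floats so training continues in full precision.
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid div-by-zero on all-zeros
    return [round(x / scale) * scale for x in w]
```

In a full QAT setup the backward pass would treat the rounding as a straight-through estimator; this sketch only shows the forward snapping.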
Compression
zstd
level: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"legal_score_first":true}
LR Schedule
warmdown
parameters: {"late_qat_frac":0.65,"late_qat_threshold":0.9}
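A minimal sketch of a warmdown LR schedule: hold the base LR, then decay linearly to zero over a final fraction of training. The `warmdown_frac` value is illustrative; the `late_qat_*` parameters above govern when QAT switches on and are not part of the LR curve in this sketch.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.35):
    # Hold base_lr, then decay linearly to zero over the last warmdown_frac
    # of training steps.
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```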

Novel Contributions

  • INT4 bigram QAT to quantize the bigram table below INT6 and fit all 12 layers within the 16 MB artifact budget
  • EMA reset when QAT activates to avoid quantization degradation from pre-QAT EMA weights
  • Deterministic wallclock-based QAT trigger to remove seed-to-seed timing variance on multi-GPU runs
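The deterministic QAT trigger in the last bullet can be sketched as replacing per-rank wallclock checks with a single step index that every worker derives identically from the step budget, so all GPUs and all seeds flip to QAT at the same step. The 0.65 fraction comes from `late_qat_frac` above; the function names are assumptions, not the PR's code.

```python
def qat_start_step(total_steps, late_qat_frac=0.65):
    # Every rank computes the same switch-on step from the step budget alone,
    # instead of each rank polling its own (rank- and run-dependent) wallclock.
    return int(total_steps * late_qat_frac)

def qat_enabled(step, total_steps, late_qat_frac=0.65):
    return step >= qat_start_step(total_steps, late_qat_frac)
```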