PR #861

open

Non-record: 11L Int5 QAT + Score-First TTT — val_bpb 1.1326 (15.51 MB)

by JoeProAI
val_bpb: 1.1326
Architecture: 11-layer U-Net GPT
Optimizer: Muon
Artifact Size: 15.51 MB

Training Techniques

Quantization: int5 QAT
  bits: 5
  scope: all weights
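The int5 quantizer can be sketched as below. Per-row scaling and percentile clipping are stated in the contributions list; the 99.5th-percentile threshold and the straight-through handling during training are illustrative assumptions.

```python
import numpy as np

def quantize_int5(w, pct=99.5):
    """Quantize a 2-D weight matrix to signed int5 levels in [-15, 15].

    Each row gets its own clip threshold (a high percentile of |w|, which
    discards outliers) and its own scale. During QAT the forward pass would
    use dequantize(q, scale) with a straight-through gradient estimator.
    """
    clip = np.percentile(np.abs(w), pct, axis=1, keepdims=True)
    w_clipped = np.clip(w, -clip, clip)
    scale = clip / 15.0                              # 15 levels on each side of zero
    q = np.round(w_clipped / scale).astype(np.int8)  # int5 values in [-15, 15]
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512)).astype(np.float32)
q, scale = quantize_int5(w)
w_hat = dequantize(q, scale)   # low reconstruction error away from the clipped tail
```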
Architecture: XSA
  Cross-layer shared attention applied across all 11 layers.
  parameters: {"layers":11}
Architecture: BigramHash
  Bigram hash embedding added to token embeddings.
  parameters: {"buckets":4096,"dimensions":128}
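A minimal sketch of the hashed-bigram augmentation. The bucket count (4096) and embedding width (128) come from the PR; the hash function, the toy vocabulary size, and adding the 128-dim bigram vector into the first 128 of the model's 512 channels are assumptions (the PR does not say how the two widths are reconciled).

```python
import numpy as np

VOCAB, DIM = 1000, 512      # toy vocab; model dim from the PR
BUCKETS, BDIM = 4096, 128   # bigram-hash config from the PR

rng = np.random.default_rng(0)
tok_emb = rng.normal(0, 0.02, (VOCAB, DIM)).astype(np.float32)
bigram_emb = rng.normal(0, 0.02, (BUCKETS, BDIM)).astype(np.float32)

def bigram_bucket(prev_tok, cur_tok):
    # Illustrative multiplicative hash of the (prev, cur) token pair.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def embed(tokens):
    """Token embedding plus a hashed-bigram embedding at each position."""
    toks = np.asarray(tokens)
    e = tok_emb[toks].copy()
    prev = np.concatenate([[0], toks[:-1]])   # position 0 pairs with token 0
    e[:, :BDIM] += bigram_emb[bigram_bucket(prev, toks)]
    return e

x = embed([5, 9, 5, 9])   # positions 1 and 3 share the bigram (5, 9)
```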
Optimizer: Muon
  weight_decay: null
  momentum: 0.95
  other_params: {"matrix_lr":0.025}
Optimizer: AdamW
  weight_decay: null
  momentum: null
  other_params: {"used_for":"scalars"}
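Muon's matrix update (momentum accumulation followed by Newton-Schulz orthogonalization of the update) can be sketched as below. matrix_lr = 0.025 and momentum = 0.95 are the PR's values; the quintic coefficients follow the common Muon reference implementation and are assumptions about this run.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.025, momentum=0.95):
    """One Muon update for a 2-D weight (scalar params go to AdamW instead)."""
    buf = momentum * buf + grad                 # momentum accumulation
    return w - lr * newton_schulz(buf), buf     # orthogonalized update

rng = np.random.default_rng(0)
w = np.zeros((8, 16))
grad = rng.normal(size=(8, 16))
w, buf = muon_step(w, grad, np.zeros_like(grad))
```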
Test-Time Training: score-first TTT
  parameters: {"learning_rate":0.0004,"epochs":1,"params":"MLP-only (up_proj, down_proj, gate_proj, scale)"}
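"Score-first" means every position is scored with the current parameters before any gradient step sees it, so evaluation never trains on a token it has yet to predict. A toy sketch of that ordering, with a small bigram logit table standing in for the MLP-only parameter subset (up_proj, down_proj, gate_proj, scale); only the learning rate (4e-4) and the single epoch come from the PR.

```python
import numpy as np

def score_first_ttt(logits_fn, update_fn, tokens):
    """Score token t with the CURRENT params, then adapt on it."""
    scores = []
    for t in range(1, len(tokens)):
        logits = logits_fn(tokens[:t])                 # predict token t
        logp = logits - np.log(np.exp(logits).sum())   # log-softmax
        scores.append(-logp[tokens[t]])                # score first ...
        update_fn(tokens[:t + 1])                      # ... then train on it
    return np.array(scores)

V = 4
W = np.zeros((V, V))   # toy trainable table (stand-in for the MLP-only params)

def logits_fn(ctx):
    return W[ctx[-1]]

def sgd_update(ctx, lr=4e-4):          # learning_rate from the PR
    prev, nxt = ctx[-2], ctx[-1]
    p = np.exp(W[prev]); p /= p.sum()
    p[nxt] -= 1.0                      # d(cross-entropy)/d(logits)
    W[prev] -= lr * p                  # in-place SGD step on one row

scores = score_first_ttt(logits_fn, sgd_update, [0, 1, 2, 1, 2, 1])
```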
LR Schedule: warmdown
  parameters: {"warmdown_steps":6000}
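The warmdown holds the learning rate constant and then decays it over the final 6,000 steps. warmdown_steps comes from the PR; the linear shape and the base LR (reusing matrix_lr = 0.025) are assumptions, since warmdowns in nanoGPT-style runs are typically linear.

```python
def lr_at(step, total_steps, base_lr=0.025, warmdown_steps=6000):
    """Constant LR, then a linear warmdown to zero over the last steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps   # 1 -> 0 across the warmdown
    return base_lr * frac
```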
Regularization: weight pruning
  parameters: {"sparsity":0.15,"description":"Prune smallest weights before quantization for better compression."}
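Pruning the smallest-magnitude 15% of weights before quantization leaves runs of exact zeros that zstd compresses well. A sketch using a global magnitude threshold (whether the PR prunes globally or per-tensor is not stated):

```python
import numpy as np

def prune_smallest(w, sparsity=0.15):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w).ravel(), k)[k]   # k-th smallest |w|
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
w_pruned = prune_smallest(w)
```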
Compression: zstd
  level: 22
Other
  Full U-Net style model with skip connections and SwiGLU MLPs.
  parameters: {"encoder_layers":[0,1,2,3,4,5],"decoder_layers":[6,7,8,9,10],"dim":512,"heads":8,"mlp_hidden":1536}
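The SwiGLU MLP in each block, with the reduced hidden width (1536 rather than the usual 4 × 512 = 2048) that the contributions list credits for fitting under 16 MB. dim and mlp_hidden come from the PR's config; the initialization here is illustrative.

```python
import numpy as np

DIM, HIDDEN = 512, 1536   # dim and mlp_hidden from the PR

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    """SwiGLU MLP: down(silu(gate(x)) * up(x))."""
    return (silu(x @ gate_proj) * (x @ up_proj)) @ down_proj

rng = np.random.default_rng(0)
gate = rng.normal(0, 0.02, (DIM, HIDDEN)).astype(np.float32)
up = rng.normal(0, 0.02, (DIM, HIDDEN)).astype(np.float32)
down = rng.normal(0, 0.02, (HIDDEN, DIM)).astype(np.float32)
y = swiglu_mlp(rng.normal(size=(3, DIM)).astype(np.float32), gate, up, down)
```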

Novel Contributions

  • Int5 QAT with per-row scaling and percentile clipping
  • Score-first legal test-time training
  • Reduced MLP hidden size to fit under 16 MB
  • 15% pre-quantization weight pruning for improved compression
  • Bigram hash embedding augmentation
  • XSA (cross-layer shared attention) on all 11 layers
  • Extended warmdown for better int5 clustering