PR #861

open

Non-record: 11L Int5 QAT + Score-First TTT — val_bpb 1.1326 (15.51 MB)

by JoeProAI
val_bpb: 1.1326
Architecture: 11-layer U-Net GPT
Optimizer: Muon
Artifact Size: 15.51 MB

Training Techniques

Quantization: int5 QAT
  bits: 5
  scope: all weights
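The int5 quantizer can be sketched as below. Per-row scaling and percentile clipping are stated in the contributions list; the 99.5th-percentile threshold and the straight-through handling during training are illustrative assumptions.

```python
import numpy as np

def quantize_int5(w, pct=99.5):
    """Quantize a 2-D weight matrix to signed int5 levels in [-15, 15].

    Each row gets its own clip threshold (a high percentile of |w|, which
    discards outliers) and its own scale. During QAT the forward pass would
    use dequantize(q, scale) with a straight-through gradient estimator.
    """
    clip = np.percentile(np.abs(w), pct, axis=1, keepdims=True)
    w_clipped = np.clip(w, -clip, clip)
    scale = clip / 15.0                              # 15 levels on each side of zero
    q = np.round(w_clipped / scale).astype(np.int8)  # int5 values in [-15, 15]
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512)).astype(np.float32)
q, scale = quantize_int5(w)
w_hat = dequantize(q, scale)   # low reconstruction error away from the clipped tail
```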
Architecture: XSA
  Cross-layer shared attention applied across all 11 layers.
  parameters: {"layers":11}
Architecture: BigramHash
  Bigram hash embedding added to token embeddings.
  parameters: {"buckets":4096,"dimensions":128}
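A minimal sketch of the hashed-bigram augmentation. The bucket count (4096) and embedding width (128) come from the PR; the hash function, the toy vocabulary size, and adding the 128-dim bigram vector into the first 128 of the model's 512 channels are assumptions (the PR does not say how the two widths are reconciled).

```python
import numpy as np

VOCAB, DIM = 1000, 512      # toy vocab; model dim from the PR
BUCKETS, BDIM = 4096, 128   # bigram-hash config from the PR

rng = np.random.default_rng(0)
tok_emb = rng.normal(0, 0.02, (VOCAB, DIM)).astype(np.float32)
bigram_emb = rng.normal(0, 0.02, (BUCKETS, BDIM)).astype(np.float32)

def bigram_bucket(prev_tok, cur_tok):
    # Illustrative multiplicative hash of the (prev, cur) token pair.
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def embed(tokens):
    """Token embedding plus a hashed-bigram embedding at each position."""
    toks = np.asarray(tokens)
    e = tok_emb[toks].copy()
    prev = np.concatenate([[0], toks[:-1]])   # position 0 pairs with token 0
    e[:, :BDIM] += bigram_emb[bigram_bucket(prev, toks)]
    return e

x = embed([5, 9, 5, 9])   # positions 1 and 3 share the bigram (5, 9)
```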
Optimizer: Muon
  weight_decay: null
  momentum: 0.95
  other_params: {"matrix_lr":0.025}
Optimizer: AdamW
  weight_decay: null
  momentum: null
  other_params: {"used_for":"scalars"}
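Muon's matrix update (momentum accumulation followed by Newton-Schulz orthogonalization of the update) can be sketched as below. matrix_lr = 0.025 and momentum = 0.95 are the PR's values; the quintic coefficients follow the common Muon reference implementation and are assumptions about this run.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.025, momentum=0.95):
    """One Muon update for a 2-D weight (scalar params go to AdamW instead)."""
    buf = momentum * buf + grad                 # momentum accumulation
    return w - lr * newton_schulz(buf), buf     # orthogonalized update

rng = np.random.default_rng(0)
w = np.zeros((8, 16))
grad = rng.normal(size=(8, 16))
w, buf = muon_step(w, grad, np.zeros_like(grad))
```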
Test-Time Training: score-first TTT
  parameters: {"learning_rate":0.0004,"epochs":1,"params":"MLP-only (up_proj, down_proj, gate_proj, scale)"}
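"Score-first" means every position is scored with the current parameters before any gradient step sees it, so evaluation never trains on a token it has yet to predict. A toy sketch of that ordering, with a small bigram logit table standing in for the MLP-only parameter subset (up_proj, down_proj, gate_proj, scale); only the learning rate (4e-4) and the single epoch come from the PR.

```python
import numpy as np

def score_first_ttt(logits_fn, update_fn, tokens):
    """Score token t with the CURRENT params, then adapt on it."""
    scores = []
    for t in range(1, len(tokens)):
        logits = logits_fn(tokens[:t])                 # predict token t
        logp = logits - np.log(np.exp(logits).sum())   # log-softmax
        scores.append(-logp[tokens[t]])                # score first ...
        update_fn(tokens[:t + 1])                      # ... then train on it
    return np.array(scores)

V = 4
W = np.zeros((V, V))   # toy trainable table (stand-in for the MLP-only params)

def logits_fn(ctx):
    return W[ctx[-1]]

def sgd_update(ctx, lr=4e-4):          # learning_rate from the PR
    prev, nxt = ctx[-2], ctx[-1]
    p = np.exp(W[prev]); p /= p.sum()
    p[nxt] -= 1.0                      # d(cross-entropy)/d(logits)
    W[prev] -= lr * p                  # in-place SGD step on one row

scores = score_first_ttt(logits_fn, sgd_update, [0, 1, 2, 1, 2, 1])
```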
LR Schedule: warmdown
  parameters: {"warmdown_steps":6000}
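The warmdown holds the learning rate constant and then decays it over the final 6,000 steps. warmdown_steps comes from the PR; the linear shape and the base LR (reusing matrix_lr = 0.025) are assumptions, since warmdowns in nanoGPT-style runs are typically linear.

```python
def lr_at(step, total_steps, base_lr=0.025, warmdown_steps=6000):
    """Constant LR, then a linear warmdown to zero over the last steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps   # 1 -> 0 across the warmdown
    return base_lr * frac
```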
Regularization: weight pruning
  parameters: {"sparsity":0.15,"description":"Prune smallest weights before quantization for better compression."}
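Pruning the smallest-magnitude 15% of weights before quantization leaves runs of exact zeros that zstd compresses well. A sketch using a global magnitude threshold (whether the PR prunes globally or per-tensor is not stated):

```python
import numpy as np

def prune_smallest(w, sparsity=0.15):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w).ravel(), k)[k]   # k-th smallest |w|
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
w_pruned = prune_smallest(w)
```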
Compression: zstd
  level: 22
Other
  Full U-Net style model with skip connections and SwiGLU MLPs.
  parameters: {"encoder_layers":[0,1,2,3,4,5],"decoder_layers":[6,7,8,9,10],"dim":512,"heads":8,"mlp_hidden":1536}
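The SwiGLU MLP in each block, with the reduced hidden width (1536 rather than the usual 4 × 512 = 2048) that the contributions list credits for fitting under 16 MB. dim and mlp_hidden come from the PR's config; the initialization here is illustrative.

```python
import numpy as np

DIM, HIDDEN = 512, 1536   # dim and mlp_hidden from the PR

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    """SwiGLU MLP: down(silu(gate(x)) * up(x))."""
    return (silu(x @ gate_proj) * (x @ up_proj)) @ down_proj

rng = np.random.default_rng(0)
gate = rng.normal(0, 0.02, (DIM, HIDDEN)).astype(np.float32)
up = rng.normal(0, 0.02, (DIM, HIDDEN)).astype(np.float32)
down = rng.normal(0, 0.02, (HIDDEN, DIM)).astype(np.float32)
y = swiglu_mlp(rng.normal(size=(3, DIM)).astype(np.float32), gate, up, down)
```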

Novel Contributions

  • Int5 QAT with per-row scaling and percentile clipping
  • Score-first legal test-time training
  • Reduced MLP hidden size to fit under 16 MB
  • 15% pre-quantization weight pruning for improved compression
  • Bigram hash embedding augmentation
  • XSA (cross-layer shared attention) on all 11 layers
  • Extended warmdown for better int5 clustering