PR #1009

open

12L INT4 bQAT + Value Embeddings — val_bpb 1.1588

by SoHarshh
val_bpb
1.1574
Architecture
Transformer
Optimizer
Artifact Size
16.41 MB

Training Techniques

Quantization
QAT
bits: 4
scope: MLP + bigram
late QAT
bits: 4
scope: MLP + bigram
INT6
bits: 6
scope: attention
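The PR's quantization code isn't shown; a minimal sketch of the symmetric per-tensor fake quantization typically used in QAT, with the bit width as a parameter so the same helper covers both the INT4 (MLP + bigram) and INT6 (attention) scopes. In real QAT the backward pass would use a straight-through estimator; only the forward rounding is shown here.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Symmetric per-tensor fake quantization: snap weights onto a
    2**bits-level integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for INT4, 31 for INT6
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([-1.0, -0.3, 0.0, 0.25, 1.0])
w4 = fake_quant(w, bits=4)   # INT4 grid (MLP + bigram scope)
w6 = fake_quant(w, bits=6)   # INT6 grid (attention scope)
```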
Architecture
BigramHash
Bigram embedding with 10240 buckets
parameters: {"buckets":10240}
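A sketch of a hashed bigram embedding with the PR's 10240 buckets. The embedding dimension and the exact hash function are not stated in the PR, so both are assumptions here (a simple multiplicative hash of the consecutive token pair stands in for whatever hash the PR uses).

```python
import numpy as np

N_BUCKETS = 10240   # from the PR's {"buckets": 10240}
EMB_DIM = 32        # illustrative; the real dimension isn't stated

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((N_BUCKETS, EMB_DIM)) * 0.02

def bigram_embed(tokens):
    """Hash each (previous, current) token pair into a bucket and look
    up its embedding; position 0 uses a sentinel previous token."""
    prev = np.concatenate(([0], tokens[:-1]))
    buckets = (prev * 1000003 + tokens) % N_BUCKETS   # assumed hash
    return bigram_table[buckets]

emb = bigram_embed(np.array([5, 17, 17, 9]))
```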
MLP3x
Three-layer MLP with LeakyReLU squared activation
parameters: {"layers":3}
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"negative_slope":0.5}
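The two entries above can be read together as one block: a three-layer MLP whose activation is a LeakyReLU (negative slope 0.5, per the parameters) followed by a square. The sign-preserving square below is one plausible reading of "LeakyReLU squared"; the PR's exact definition, widths, and use of biases are not shown.

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5, then a sign-preserving square
    (assumed interpretation of 'LeakyReLU squared')."""
    y = np.where(x >= 0, x, negative_slope * x)
    return np.sign(y) * y ** 2

def mlp3(x, Ws):
    """Three-layer MLP: linear -> activation twice, then a final
    linear. Ws holds three weight matrices (biases omitted)."""
    h = x
    for W in Ws[:-1]:
        h = leaky_relu_sq(h @ W)
    return h @ Ws[-1]

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 16)) * 0.1,
      rng.standard_normal((16, 16)) * 0.1,
      rng.standard_normal((16, 8)) * 0.1]
out = mlp3(rng.standard_normal((2, 8)), Ws)
```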
XSA
Cross-layer shared attention in the last 4 layers
parameters: {"layers":4}
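A sketch of cross-layer shared attention: one set of attention projections is reused by the last 4 layers, which is where the parameter saving comes from, while each application still sees a different residual-stream input. Single-head, and the causal mask is omitted for brevity; both are simplifications of whatever the PR actually runs.

```python
import numpy as np

D, N_SHARED = 16, 4   # model dim (illustrative) and shared layer count

rng = np.random.default_rng(0)
# One projection set shared across the last N_SHARED layers.
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

def shared_attn(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(D)
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return (a @ v) @ Wo

def last_layers(x):
    for _ in range(N_SHARED):   # same weights, applied 4 times
        x = x + shared_attn(x)
    return x

y = last_layers(rng.standard_normal((5, D)))
```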
U-Net skip connections
U-Net style skip connections
parameters: null
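A sketch of U-Net-style skip connections in a layer stack: activations from the first half are saved and added back into the second half. The PR doesn't show its pairing, so mirrored (first-to-last) pairing is assumed.

```python
import numpy as np

def unet_stack(x, blocks):
    """Run a list of blocks; save first-half outputs and add them back
    to the mirrored second-half layers (assumed pairing)."""
    n = len(blocks)
    saved = []
    for f in blocks[: n // 2]:
        x = f(x)
        saved.append(x)
    for f in blocks[n // 2:]:
        x = f(x) + saved.pop()
    return x

# Toy blocks that each add a constant, to make the flow visible.
blocks = [lambda x, s=s: x + s for s in (1.0, 2.0, 3.0, 4.0)]
y = unet_stack(np.zeros(2), blocks)
```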
RoPE
Partial rotary position embeddings
parameters: {"dimensions":16}
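Partial RoPE with the PR's 16 rotary dimensions can be sketched as rotating only the first 16 channels and passing the rest through. The 10000 frequency base and the channel layout are standard-RoPE assumptions, not taken from the PR.

```python
import numpy as np

ROT_DIMS = 16   # per the PR: only 16 dimensions are rotated

def partial_rope(x, pos):
    """Rotate the first ROT_DIMS channels by position-dependent angles;
    remaining channels pass through unrotated."""
    half = ROT_DIMS // 2
    inv_freq = 1.0 / (10000 ** (np.arange(half) / half))
    ang = pos[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2, rest = x[:, :half], x[:, half:ROT_DIMS], x[:, ROT_DIMS:]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, rest], axis=-1)

x = np.ones((4, 32))
y = partial_rope(x, np.arange(4))
```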
VE128
Value embeddings reinject token identity into V at layers 10-11 using a shared embedding table
parameters: {"ve_dim":128,"layers":[10,11]}
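A sketch of the VE128 idea: a single 128-dim embedding table shared by layers 10 and 11 (the stated parameter saving), projected into the model dimension and added to the attention V vectors at those layers only. The per-layer projection, model dimension, and vocab size are illustrative assumptions.

```python
import numpy as np

VE_DIM, D_MODEL = 128, 256   # ve_dim from the PR; d_model illustrative
VE_LAYERS = (10, 11)
VOCAB = 1000                 # illustrative

rng = np.random.default_rng(0)
ve_table = rng.standard_normal((VOCAB, VE_DIM)) * 0.02   # shared table
ve_proj = {l: rng.standard_normal((VE_DIM, D_MODEL)) * 0.02
           for l in VE_LAYERS}                           # per-layer map

def add_value_embedding(v, tokens, layer):
    """Re-inject token identity into V at layers 10-11; every other
    layer passes V through unchanged."""
    if layer not in VE_LAYERS:
        return v
    return v + ve_table[tokens] @ ve_proj[layer]

v = np.zeros((6, D_MODEL))
toks = np.array([1, 2, 3, 4, 5, 6])
v10 = add_value_embedding(v, toks, layer=10)
v05 = add_value_embedding(v, toks, layer=5)
```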
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
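The stated scale rule is a one-liner: multiply each layer's normalized output by 1/sqrt(layer+1), damping deeper layers' contribution to the residual stream. A minimal sketch:

```python
import numpy as np

def ln_scale(layer):
    """Depth-dependent LayerNorm output scale, per the PR's
    scale = 1/sqrt(layer+1)."""
    return 1.0 / np.sqrt(layer + 1)

def scaled_layernorm(x, layer, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return ln_scale(layer) * (x - mu) / np.sqrt(var + eps)

y = scaled_layernorm(np.random.default_rng(0).standard_normal((2, 8)),
                     layer=3)   # scale 1/sqrt(4) = 0.5
```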
Initialization
resid mix
Learnable blend between residual stream and initial state
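One way to realize a learnable blend between the residual stream and the initial (embedding) state is a scalar gate; the sigmoid parameterization below, which keeps the blend convex, is an assumption about how the PR implements it.

```python
import numpy as np

def resid_mix(x, x0, lam):
    """Blend the current residual stream x with the initial state x0.
    lam is a trained scalar; sigmoid gating (assumed) keeps the mix
    a convex combination."""
    g = 1.0 / (1.0 + np.exp(-lam))
    return g * x + (1.0 - g) * x0

x, x0 = np.full(4, 2.0), np.zeros(4)
y = resid_mix(x, x0, lam=0.0)   # gate 0.5 -> halfway blend
```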
Weight Averaging
EMA
parameters: {"decay":0.997}
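The EMA update with the PR's decay of 0.997 is standard; the sketch below also notes, in a comment, the "QAT activation reset fix" from the contributions list, since averaging pre-QAT full-precision weights with post-QAT quantized weights would otherwise mix incompatible parameter states.

```python
import numpy as np

DECAY = 0.997   # from the PR's {"decay": 0.997}

def ema_update(ema_params, params, decay=DECAY):
    """Exponential moving average of weights; evaluation uses the EMA
    copy. Per the PR's fix, the EMA copy should be reset when late QAT
    activates so full-precision and quantized weights don't mix."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, params)]

params = [np.ones(3)]
ema = [np.zeros(3)]
for _ in range(1000):
    ema = ema_update(ema, params)
```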
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3}
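Full test-time training with the stated hyperparameters (learning rate 0.002, 3 epochs) amounts to taking a few gradient steps on the evaluation data before predicting. The sketch below uses a linear least-squares model so the gradient can be written by hand; the PR fine-tunes the full network on the language-modeling loss instead.

```python
import numpy as np

LR, EPOCHS = 0.002, 3   # from the PR's TTT parameters

def ttt(w, X, y, lr=LR, epochs=EPOCHS):
    """A few full-batch gradient steps on the test data itself
    (linear least-squares stand-in for the real LM objective)."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0])
w0 = np.zeros(4)
w1 = ttt(w0, X, y)
loss0 = np.mean((X @ w0 - y) ** 2)
loss1 = np.mean((X @ w1 - y) ** 2)
```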

Novel Contributions

  • INT4 QAT for MLP and bigram components
  • Value embeddings added to V vectors at layers 10-11
  • Shared value embedding table to reduce parameter cost
  • EMA with QAT activation reset fix
  • Combination of U-Net skips, XSA, partial RoPE, LN scale, and resid_mix under a tight size budget