PR #1002

open

12L INT4 bQAT + EMA Fix + Deterministic QAT — val_bpb ~1.165

by SoHarshh
val_bpb
1.1650
Architecture
Transformer
Optimizer
Artifact Size
15.97 MB

Training Techniques

Architecture
BigramHash
Bigram hash table used in the model, quantized and trained with INT4 bQAT.
parameters: {"buckets":10240}
MLP3x
Three-layer MLP with LeakyReLU activation.
parameters: null
LeakyReLU
LeakyReLU(0.5) squared activation used in the MLP.
parameters: {"slope":0.5}
XSA
Cross-layer shared attention applied to the last 4 layers.
parameters: {"last_n_layers":4}
RoPE
Partial rotary positional embedding.
parameters: {"dimensions":16,"total_dimensions":64}
U-Net skip connections
U-Net style skip connections in the residual stream.
parameters: null
resid_mix
Learnable x/x0 blend always active.
parameters: null
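As a concrete illustration of the BigramHash entry above, here is a minimal sketch of a hashed bigram table with 10,240 buckets (the `buckets` parameter). Everything else here, including the hash constants, `DIM`, and all function names, is an assumption for illustration, not the PR's actual code.

```python
BUCKETS = 10240  # {"buckets": 10240} from the parameters above
DIM = 8          # illustrative embedding width

def bigram_bucket(prev_tok: int, cur_tok: int, buckets: int = BUCKETS) -> int:
    # Stable mixing hash over the (prev, cur) token pair; constants are arbitrary.
    h = (prev_tok * 1_000_003 + cur_tok) * 2_654_435_761
    return (h & 0xFFFFFFFF) % buckets

def bigram_features(tokens, table):
    # One hashed-bucket row per position; position 0 pairs with a BOS id of 0.
    feats, prev = [], 0
    for t in tokens:
        feats.append(table[bigram_bucket(prev, t)])
        prev = t
    return feats

# A learnable table would live inside the model; plain lists keep the sketch
# dependency-free.
table = [[0.0] * DIM for _ in range(BUCKETS)]
```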
Regularization
LN scale
Per-layer LayerNorm scaling by 1/sqrt(layer+1).
parameters: {"formula":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997,"qat_activation_reset":true}
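The `qat_activation_reset` flag above can be sketched as an EMA that re-seeds its shadow weights from the live weights at the moment QAT turns on, so stale pre-QAT averages cannot leak into the quantized model. A minimal sketch with the listed decay of 0.997; the class and method names are assumptions.

```python
class EMA:
    """EMA of model weights (decay=0.997 above) with reset-on-QAT-activation."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # shadow copy of the live weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

    def reset(self, params):
        # qat_activation_reset: re-seed the average from the live (now
        # fake-quantized) weights so pre-QAT history cannot degrade the
        # quantized model.
        self.shadow = list(params)
```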
Quantization
QAT
bits: 4
scope: MLP and bigram; INT6 attention
late QAT
bits: 4
scope: training
INT4
bits: 4
scope: BigramHash
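A minimal sketch of what INT4 QAT does in the forward pass: weights are snapped to a symmetric 4-bit grid (at most 15 levels) while staying in float, so the network learns to tolerate quantization error. Per-tensor symmetric scaling is an assumption here; the PR may use a different granularity.

```python
def fake_quant(w, bits=4):
    # Symmetric per-tensor fake quantization: snap each weight to an INT grid,
    # but return floats so training continues in full precision.
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid div-by-zero on all-zeros
    return [round(x / scale) * scale for x in w]
```

In a full QAT setup the backward pass would treat the rounding as a straight-through estimator; this sketch only shows the forward snapping.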
Compression
zstd
level: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"legal_score_first":true}
LR Schedule
warmdown
parameters: {"late_qat_frac":0.65,"late_qat_threshold":0.9}
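A minimal sketch of a warmdown LR schedule: hold the base LR, then decay linearly to zero over a final fraction of training. The `warmdown_frac` value is illustrative; the `late_qat_*` parameters above govern when QAT switches on and are not part of the LR curve in this sketch.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.35):
    # Hold base_lr, then decay linearly to zero over the last warmdown_frac
    # of training steps.
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```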

Novel Contributions

  • INT4 bigram QAT to quantize the bigram table below INT6 and fit all 12 layers within the 16 MB artifact budget
  • EMA reset when QAT activates to avoid quantization degradation from pre-QAT EMA weights
  • Deterministic wallclock-based QAT trigger to remove seed-to-seed timing variance on multi-GPU runs
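The deterministic QAT trigger in the last bullet can be sketched as replacing per-rank wallclock checks with a single step index that every worker derives identically from the step budget, so all GPUs and all seeds flip to QAT at the same step. The 0.65 fraction comes from `late_qat_frac` above; the function names are assumptions, not the PR's code.

```python
def qat_start_step(total_steps, late_qat_frac=0.65):
    # Every rank computes the same switch-on step from the step budget alone,
    # instead of each rank polling its own (rank- and run-dependent) wallclock.
    return int(total_steps * late_qat_frac)

def qat_enabled(step, total_steps, late_qat_frac=0.65):
    return step >= qat_start_step(total_steps, late_qat_frac)
```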