PR #2089 (open)

[Non-Record 16MB] Int8 QAT 7L d512 + Sliding Window + N-gram Backoff + TTT

by AlirezaAlampour
val_bpb: 1.2093
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.5MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"negative_slope":0.5}
weight tying
The input embedding and output projection share one weight matrix.
parameters: null
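
Weight tying in the usual decoder-only sense: the token embedding matrix doubles as the output projection. A minimal sketch (vocab size is illustrative):

    import torch.nn as nn

    d_model, vocab = 512, 50304          # vocab size is illustrative
    emb = nn.Embedding(vocab, d_model)
    lm_head = nn.Linear(d_model, vocab, bias=False)
    lm_head.weight = emb.weight          # one shared parameter tensor
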
U-Net skip connections
U-Net style encoder/decoder skip connections with learned skip weights.
parameters: null
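
One plausible wiring for a 7-layer stack (the entry doesn't spell it out): layers 0-2 act as the encoder, layer 3 as the bottleneck, and layers 4-6 as the decoder, with each encoder output added into its mirrored decoder layer through a learned scalar:

    import torch
    import torch.nn as nn

    class UNetStack(nn.Module):
        # Encoder output i feeds decoder layer (n - 1 - i) through a
        # learned scalar skip weight.
        def __init__(self, blocks: nn.ModuleList):   # 7 transformer blocks
            super().__init__()
            self.blocks = blocks
            self.half = len(blocks) // 2              # 3 skips for 7 layers
            self.skip_w = nn.Parameter(torch.ones(self.half))

        def forward(self, x):
            cache, n = [], len(self.blocks)
            for i, blk in enumerate(self.blocks):
                if i >= n - self.half:                # decoder layers 4..6
                    j = n - 1 - i                     # mirrored encoder index
                    x = x + self.skip_w[j] * cache[j]
                x = blk(x)
                if i < self.half:                     # encoder layers 0..2
                    cache.append(x)
            return x
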
Quantization
QAT
bits: 8
scope: per-row for weight matrices; per-tensor for vectors and scalars
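
A sketch of the int8 fake-quantization used during QAT for the per-row case (weight matrices), with a straight-through estimator so gradients reach the full-precision master weights; the per-tensor path for vectors and scalars is analogous:

    import torch

    def fake_quant_per_row(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
        # Symmetric per-row int8 fake quantization for a 2-D weight matrix.
        qmax = 2 ** (bits - 1) - 1                                # 127
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
        wq = (w / scale).round().clamp(-qmax, qmax) * scale
        # Straight-through estimator: forward sees wq, backward sees w.
        return w + (wq - w).detach()
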
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: null
momentum: 0.9382982028913158
other_params: null
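
For reference, a condensed sketch of the standard Muon update (momentum buffer, then approximate orthogonalization via Newton-Schulz iterations); Nesterov momentum and the shape-dependent LR scaling of the reference implementation are omitted here:

    import torch

    def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Approximately orthogonalize G (coefficients from the public Muon code).
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + 1e-7)
        if G.size(0) > G.size(1):
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X
        if G.size(0) > G.size(1):
            X = X.T
        return X

    @torch.no_grad()
    def muon_step(p, buf, lr: float, momentum: float = 0.9382982028913158):
        buf.mul_(momentum).add_(p.grad)                 # momentum buffer
        update = newton_schulz(buf.bfloat16()).to(p.dtype)
        p.add_(update, alpha=-lr)
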
Compression
zlib
level: 9
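
Artifact packing as implied by the entry: serialize the quantized weights and zlib-compress at level 9 to land under the 16MB budget. The file name and serialization format are illustrative:

    import io
    import zlib
    import torch

    def pack(state_dict, path="artifact.bin.zlib"):
        buf = io.BytesIO()
        torch.save(state_dict, buf)
        blob = zlib.compress(buf.getvalue(), level=9)
        with open(path, "wb") as f:
            f.write(blob)
        return len(blob) / 2**20        # compressed size in MiB
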
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Token-level order-5 n-gram backoff cache, mixed with the neural model's probabilities at evaluation time.
parameters: {"max_order":5,"alpha":0.2}
Test-Time Training
score-first TTT
parameters: {"steps":1,"target":"last block + ln_f per chunk"}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
logit softcap
parameters: {"softcap":30}

Novel Contributions

  • Int8 per-row QAT with zlib-compressed artifact under the 16MB limit
  • Sliding window evaluation with stride 64
  • Order-5 n-gram backoff cache mixed into the neural distribution at evaluation time
  • Score-first test-time training on the last block and ln_f per chunk
  • GQA-based 7-layer decoder-only transformer with U-Net skip connections