PR #836

Status: open

Full-Training QAT: 1.1219 bpb

by autocode-rayes
val_bpb: 1.1219
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: (not reported)

Training Techniques

Quantization
QAT
bits: 6
scope: all
Architecture
LeakyReLU_LegalTTT_ParallelMuon
Existing SOTA Transformer architecture with LeakyReLU, LegalTTT, Parallel Muon, and related custom components.
parameters: null
XSA
Cross/self-attention variant used in the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":"16/64"}
SmearGate
Custom gating mechanism included in the architecture.
parameters: null
BigramHash
Bigram hashing with bucketed representation.
parameters: {"buckets":2048}
MLP3x
MLP with 3x expansion and LeakyReLU(0.5)^2.
parameters: {"expansion":3}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"during":"warmdown"}
Evaluation
sliding-window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","learning_rate":0.002,"epochs":3,"chunk_size":"32K"}
Compression
LZMA
level: null
LR Schedule
warmdown
parameters: null
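For context on the sliding-window eval entry above (stride 64): a minimal sketch of how stride-based window scheduling might work. The function name and the rule "score only the tokens not covered by the previous window" are illustrative assumptions, not taken from the submission's code:

```python
def sliding_window_spans(seq_len, max_len, stride):
    """Return (begin, end, n_scored) spans for sliding-window eval.

    Each window sees up to `max_len` tokens of context, but only the
    tokens past the previous window's end (at most `stride` of them)
    contribute to the loss, so every token is scored exactly once
    with near-full left context. Assumes stride <= max_len.
    """
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans
```

Summing the per-token negative log-likelihoods over the scored portion of each span and dividing by the byte count would then give a bits-per-byte figure comparable to val_bpb.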

Novel Contributions

  • Full-training QAT with int6 fake quantization enabled from step 1
  • Removing the mismatch between full-precision training and late-stage quantization noise
  • Using QAT_ENABLED=1 with LATE_QAT_THRESHOLD=1.0 to activate quantization immediately
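A minimal sketch of what int6 fake quantization enabled from step 1 could look like. This is plain Python with per-tensor symmetric scaling; the function name and scaling rule are illustrative assumptions, and a real QAT training loop would apply the rounding inside autograd with a straight-through estimator so gradients bypass it:

```python
def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap each weight to a
    signed `bits`-bit grid, then dequantize, so the forward pass sees
    quantization noise from the very first training step."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0    # 1.0 guards the all-zero case
    # Round to the integer grid, clipping to the signed range [-32, 31]
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```

With LATE_QAT_THRESHOLD=1.0, per the bullet above, this noise is active for the entire run rather than only a late fraction of training, so the weights the model converges to are already compatible with the int6 grid used at artifact-export time.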