PR #1683

open

10-min record: 13L int4 MLP + qTTT + QAT Precompile + ANS Hybrid (val…

by yunoshev
val_bpb: 1.1280
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
GQA
13-layer transformer with grouped query attention: 8 query heads sharing 4 key/value heads.
parameters: {"layers":13,"d_model":512,"heads":8,"kv":4}
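A minimal sketch of the head sharing implied by `{"heads":8,"kv":4}` (illustrative only, not the PR's code): each KV head serves a group of 8 // 4 = 2 query heads.

```python
# Grouped query attention head mapping: 8 query heads share 4 KV heads,
# so each KV head serves a contiguous group of 2 query heads.
def kv_head_for(query_head: int, n_heads: int = 8, n_kv: int = 4) -> int:
    group_size = n_heads // n_kv   # 2 query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(8)]
# query heads 0,1 -> KV head 0; 2,3 -> 1; 4,5 -> 2; 6,7 -> 3
```

Halving the KV heads shrinks the KV cache without reducing the number of query projections.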
Partial RoPE
Uses partial rotary position embeddings covering 25% of head dimensions.
parameters: {"ratio":0.25}
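Partial RoPE can be sketched as follows: rotate only the leading fraction of each head's dimensions and pass the rest through unchanged. This toy version works on a single flat vector; real kernels operate on batched tensors, and the frequency layout here is one common convention, not necessarily the PR's.

```python
import math

def partial_rope(x, pos, ratio=0.25, base=10000.0):
    """Apply rotary position embedding to the first `ratio` of dims;
    leave the remaining dims untouched (sketch, single head vector)."""
    d = len(x)
    d_rot = int(d * ratio)                # e.g. 16 of 64 dims at ratio 0.25
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = pos * base ** (-i / d_rot)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At `pos=0` the rotation is the identity; non-rotated dimensions carry position-free content channels.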
Weight Tying
The input embeddings and output projection share weights.
parameters: null
Gated Attention
Attention uses a gating mechanism.
parameters: null
Value Residual
Includes a value residual pathway.
parameters: null
BigramHash
Adds a bigram hash table embedding head.
parameters: {"dimensions":2048}
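A hash-table bigram embedding can be sketched like this: hash the (previous token, current token) pair into a fixed 2048-slot table and add that slot's embedding to the token embedding. The mixing constant below is an arbitrary choice for illustration, not taken from the PR.

```python
def bigram_slot(prev_token: int, token: int, table_size: int = 2048) -> int:
    """Hash the (prev_token, token) bigram into a fixed-size embedding table.
    1000003 is an arbitrary odd mixing constant, not from the PR."""
    return ((prev_token * 1000003) ^ token) % table_size

# At each position the model adds table[bigram_slot(prev, cur)] to the
# token embedding, giving cheap direct access to bigram statistics.
```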
VE128
Uses a value embedding head.
parameters: {"dimensions":96}
Quantization
QAT
bits: 4
scope: MLP
QAT
bits: 5
scope: attention
late QAT
bits: 4
scope: model
GPTQ
bits: null
scope: per-layer
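The QAT entries above train with fake-quantized weights. A minimal sketch of symmetric per-tensor fake quantization at 4 bits (the MLP setting); in real training code the rounding is paired with a straight-through gradient estimator, omitted here.

```python
def fake_quant(w, bits=4):
    """Symmetric per-tensor fake quantization: snap weights to a
    bits-wide signed integer grid, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for int4
    scale = max(abs(v) for v in w) / qmax or 1.0        # avoid div-by-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in w]
    return [qi * scale for qi in q], q                  # (dequantized, int codes)
```

Training against the quantization grid is what lets the final artifact store int4 MLP weights with little quality loss.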
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":200}
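The EMA + SWA combination with the listed parameters can be sketched as below; scalar "weights" keep the example minimal, and the class name is hypothetical.

```python
class AveragedWeights:
    """EMA updated every step (decay 0.997) plus a plain running
    average (SWA) snapshotted every 200 steps."""
    def __init__(self, w, ema_decay=0.997, swa_every=200):
        self.ema, self.decay = w, ema_decay
        self.swa_sum, self.swa_n, self.every = 0.0, 0, swa_every

    def update(self, w, step):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step % self.every == 0:          # periodic SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)
```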
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"adamw_for_scalars":true}
Evaluation
sliding window eval
parameters: {"stride":256}
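Sliding-window evaluation with stride 256 scores each chunk of new tokens while conditioning on up to a full context window of preceding tokens. A sketch of the span plan (function name hypothetical):

```python
def eval_spans(n_tokens, window=2048, stride=256):
    """Plan sliding-window eval: each step scores `stride` fresh tokens
    [pos, end) while feeding the model context [start, end)."""
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)       # context window ending at `end`
        spans.append((start, pos, end))
        pos = end
    return spans
```

Every token is scored exactly once, and the stride trades compute (more forward passes) for context length per scored token.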
Test-Time Training
qTTT
parameters: {"learning_rate":0.002,"epochs":3,"target":"qo_bank"}
Compression
ANS + brotli
level: null
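The hybrid scheme compresses each tensor with both codecs and keeps whichever stream is smaller. The stdlib has neither an ANS coder nor Brotli, so zlib and lzma stand in below; the selection-plus-tag logic is the point, not the codecs.

```python
import lzma
import zlib

def pack_tensor(raw: bytes) -> bytes:
    """Per-tensor hybrid compression: try both codecs, keep the smaller
    stream, and prepend a 1-byte tag recording which codec won.
    (zlib/lzma stand in for the PR's ANS/Brotli pair.)"""
    a, b = zlib.compress(raw, 9), lzma.compress(raw)
    return (b"\x00" + a) if len(a) <= len(b) else (b"\x01" + b)

def unpack_tensor(blob: bytes) -> bytes:
    codec = zlib.decompress if blob[:1] == b"\x00" else lzma.decompress
    return codec(blob[1:])
```

Per-tensor selection helps because quantized integer tensors and float scale tables have very different statistics, so no single codec wins everywhere.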
Sequence Length
sequence_length
train_length: 2048
eval_length: 32000
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
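The warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 3500 steps. As a multiplier on the base LR (function name hypothetical):

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Trapezoidal 'warmdown' schedule: constant LR, then a linear
    decay to zero over the final `warmdown_steps` steps."""
    steps_left = total_steps - step
    return 1.0 if steps_left >= warmdown_steps else steps_left / warmdown_steps
```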
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • 13-layer int4 MLP transformer with extra depth to offset aggressive quantization
  • qTTT test-time training on the Q-projection bank
  • QAT precompile warmup to avoid torch.compile recompilation stalls when late QAT activates
  • Hybrid ANS/Brotli artifact compression choosing the smaller encoding per tensor
  • Adaptive Hessian-weighted GPTQ auto-clip over multiple sigma candidates
  • Document-boundary (varlen) attention used during training only, with varlen disabled at eval time so the fused TTT sliding-window evaluation can run
  • Fused TTT plus sliding-window evaluation within the 600-second budget