PR #577

open

GPTQ + Short TTT — val_bpb 1.1207 (seed 1337)

by newjordan
val_bpb
1.1207
Architecture
11L/512d/8H/4KV/3xMLP (relu²), U-Net skip, Partial RoPE (16/64), XSA last 4, BigramHash(2048), VE128 on layers 9-10, SmearGate
Optimizer
Muon
Artifact Size
15.60 MB

Training Techniques

Quantization
int6 QAT + GPTQ
bits: 6
scope: all
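The QAT half of this pipeline can be sketched as symmetric per-tensor int6 fake quantization: weights are rounded onto a 6-bit grid in the forward pass while training continues in floats (GPTQ then post-quantizes with error compensation, which is not shown). This is a minimal illustrative sketch, not the PR's code; the function name and the symmetric [-31, 31] range are assumptions.

```python
# Hedged sketch of int6 QAT fake quantization (symmetric, per-tensor).
# During QAT the dequantized values feed the forward pass; gradients flow
# through via a straight-through estimator (not shown here).
def fake_quant_int6(weights):
    """Quantize floats to 6-bit signed codes and dequantize back."""
    qmax = 2 ** (6 - 1) - 1                     # 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # round-to-nearest, clamp to the symmetric int6 range [-31, 31]
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], q

deq, q = fake_quant_int6([0.5, -1.2, 0.031, 2.0])
# q holds int6 codes in [-31, 31]; deq is what the forward pass actually sees
```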
Architecture
Partial RoPE
Rotary positional embeddings applied partially with NTK scaling
parameters: {"scaling":"16/64"}
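A minimal sketch of partial RoPE, reading "16/64" as 16 rotary dimensions out of a 64-dim head: only the first 16 dims of each head are rotated, the rest pass through, and NTK scaling inflates the frequency base. The NTK factor of 4.0 and the base formula are assumptions for illustration, not values from the PR.

```python
import math

# Hedged sketch: rotate only the first ROT_DIM of HEAD_DIM dimensions.
HEAD_DIM, ROT_DIM = 64, 16
NTK_FACTOR = 4.0                                   # assumed, not from the PR
# NTK-aware scaling inflates the base frequency
base = 10000.0 * NTK_FACTOR ** (ROT_DIM / (ROT_DIM - 2))

def partial_rope(x, pos):
    """x: one head's vector (len HEAD_DIM); rotate pairs in the first ROT_DIM dims."""
    out = list(x)
    for i in range(0, ROT_DIM, 2):
        theta = pos / base ** (i / ROT_DIM)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

rotated = partial_rope([1.0] * HEAD_DIM, pos=10)
# dims 16..63 are unchanged; each rotated pair keeps its norm
```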
SmearGate
Gating mechanism in MLP layers
parameters: null
BigramHash
Hashing mechanism with 2048 buckets for bigrams
parameters: {"buckets":2048}
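The bigram hash can be sketched as mapping each consecutive token pair into one of the 2048 buckets, whose learned embedding is then added to the current token's representation. The mixing constants below are illustrative; the PR does not specify the actual hash function.

```python
# Hedged sketch of BigramHash bucketing: hash (prev_token, token) pairs
# into 2048 buckets. The multiplicative mixing constants are assumptions.
NUM_BUCKETS = 2048

def bigram_bucket(prev_token, token):
    # simple 32-bit multiplicative mix; deterministic per (prev, cur) pair
    h = (prev_token * 0x9E3779B1 + token * 0x85EBCA77) & 0xFFFFFFFF
    return h % NUM_BUCKETS

tokens = [5, 17, 17, 5, 99]
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
# one bucket index per bigram; the bucket's embedding augments the token
```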
XSA
Cross self-attention applied in the last 4 layers
parameters: {"layers":4}
Weight Averaging
EMA
parameters: {"decay":0.995,"usage":"previous submission #508 (disabled in this PR)"}
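For reference, the EMA weight averaging used in submission #508 (disabled in this PR) amounts to keeping a shadow copy of the weights that trails the live ones. A minimal sketch with the stated decay of 0.995:

```python
# Minimal EMA weight-averaging sketch (decay 0.995, as in #508; disabled here).
DECAY = 0.995

def ema_update(ema_weights, weights, decay=DECAY):
    """Move the shadow copy a small step toward the live weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]
for _ in range(100):
    ema = ema_update(ema, [1.0, 2.0])
# the shadow copy converges toward the (here static) live weights
```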
Test-Time Training
full TTT with SGD
parameters: {"learning_rate":0.002,"epochs":3,"max_train_chunks":50,"EMA_decay":0,"freeze_blocks":2,"optimizer":"SGD"}
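The short-TTT loop described by these parameters can be sketched as follows: plain SGD at lr 0.002, 3 epochs per chunk, the first 2 blocks frozen, no EMA smoothing (EMA_decay 0), and a hard stop after 50 chunks. The model, loss, and gradient function below are toys for illustration, not the PR's pipeline.

```python
# Hedged sketch of short TTT: SGD, per-chunk epochs, frozen leading blocks,
# and an early stop after MAX_CHUNKS to avoid late-chunk degradation.
LR, EPOCHS, MAX_CHUNKS, FREEZE_BLOCKS = 0.002, 3, 50, 2

def ttt(params, chunks, grad_fn):
    """params: list of per-block parameter lists; grad_fn -> per-block grads."""
    for chunk_idx, chunk in enumerate(chunks):
        if chunk_idx >= MAX_CHUNKS:                # short-TTT early stop
            break
        for _ in range(EPOCHS):
            grads = grad_fn(params, chunk)
            for b in range(FREEZE_BLOCKS, len(params)):   # skip frozen blocks
                params[b] = [p - LR * g for p, g in zip(params[b], grads[b])]
    return params

# toy demo: 3 "blocks" of one scalar each; gradient of p^2 is 2p
params = ttt([[1.0], [1.0], [1.0]], [None] * 60,
             lambda ps, _: [[2.0 * p for p in blk] for blk in ps])
# blocks 0-1 stay frozen at 1.0; block 2 decays toward 0
```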
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
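A single Muon-style step combines a momentum buffer with a Newton-Schulz iteration that approximately orthogonalizes the update matrix. The sketch below uses the quintic coefficients from the commonly published Muon implementation; treat it as illustrative under those assumptions, not the PR's exact code.

```python
# Hedged sketch of a Muon-style update: momentum accumulation followed by
# Newton-Schulz orthogonalization of the update matrix, then a decoupled
# weight-decay step. Coefficients are from the widely used reference impl.
MOMENTUM, LR, WD = 0.99, 0.025, 0.04

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    """Drive the singular values of G toward 1 via X <- aX + (bA + cA^2)X, A = XX^T."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, [list(col) for col in zip(*X)])                 # X @ X^T
        B = [[b * y + c * z for y, z in zip(r1, r2)]
             for r1, r2 in zip(A, matmul(A, A))]                      # bA + cA^2
        X = [[a * x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf):
    """One optimizer step for a 2-D weight matrix; returns (new_W, new_buf)."""
    buf = [[MOMENTUM * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    O = newton_schulz(buf)
    newW = [[w * (1 - LR * WD) - LR * o for w, o in zip(rw, ro)]
            for rw, ro in zip(W, O)]
    return newW, buf
```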
Evaluation
sliding window eval
parameters: {"stride":64}
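Sliding-window eval with stride 64 typically means each evaluation window scores only its last 64 tokens, so every scored token gets long left context. A sketch of the index bookkeeping, with an assumed context length of 256 (only the stride is given in the PR):

```python
# Hedged sketch of sliding-window evaluation indexing with stride 64.
# CTX is an assumed context length; only STRIDE comes from the PR.
CTX, STRIDE = 256, 64

def sliding_window_spans(n_tokens):
    """Yield (window_start, score_start, score_end): score only the final
    STRIDE tokens of each window so scored tokens keep long context."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + STRIDE, n_tokens)
        win_start = max(0, end - CTX)
        spans.append((win_start, start, end))
        start = end
    return spans

spans = sliding_window_spans(300)
# the scored ranges tile [0, 300) exactly once, with overlapping contexts
```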

Novel Contributions

  • Short TTT strategy: SGD-based test-time training with no EMA smoothing, stopped after 50 chunks to avoid late-chunk degradation
  • Demonstrated that EMA smoothing during TTT can wash out adaptation gains
  • Used zstd level-22 compression to cut artifact size by ~2 MB versus the previous fallback
  • Disabled the int8_sensitive flag to stay within the 16 MB artifact-size limit
  • Shared a detailed TTT chunk-trajectory analysis showing adaptation and distribution-shift effects
  • Kept the same base architecture and GPTQ pipeline while marginally improving val_bpb over the previous submission