PR #577

open

GPTQ + Short TTT — val_bpb 1.1207 (seed 1337)

by newjordan
val_bpb
1.1207
Architecture
11L/512d/8H/4KV/3xMLP (relu²), U-Net skip, Partial RoPE (16/64), XSA last 4, BigramHash(2048), VE128 on layers 9-10, SmearGate
Optimizer
Muon
Artifact Size
15.60 MB

Training Techniques

Quantization
int6 QAT + GPTQ
bits: 6
scope: all
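The QAT half of this pipeline can be sketched as symmetric per-tensor int6 fake quantization: weights are rounded onto a 6-bit grid in the forward pass while training continues in floats (GPTQ then post-quantizes with error compensation, which is not shown). This is a minimal illustrative sketch, not the PR's code; the function name and the symmetric [-31, 31] range are assumptions.

```python
# Hedged sketch of int6 QAT fake quantization (symmetric, per-tensor).
# During QAT the dequantized values feed the forward pass; gradients flow
# through via a straight-through estimator (not shown here).
def fake_quant_int6(weights):
    """Quantize floats to 6-bit signed codes and dequantize back."""
    qmax = 2 ** (6 - 1) - 1                     # 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # round-to-nearest, clamp to the symmetric int6 range [-31, 31]
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], q

deq, q = fake_quant_int6([0.5, -1.2, 0.031, 2.0])
# q holds int6 codes in [-31, 31]; deq is what the forward pass actually sees
```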
Architecture
Partial RoPE
Rotary positional embeddings applied partially with NTK scaling
parameters: {"scaling":"16/64"}
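A minimal sketch of partial RoPE, reading "16/64" as 16 rotary dimensions out of a 64-dim head: only the first 16 dims of each head are rotated, the rest pass through, and NTK scaling inflates the frequency base. The NTK factor of 4.0 and the base formula are assumptions for illustration, not values from the PR.

```python
import math

# Hedged sketch: rotate only the first ROT_DIM of HEAD_DIM dimensions.
HEAD_DIM, ROT_DIM = 64, 16
NTK_FACTOR = 4.0                                   # assumed, not from the PR
# NTK-aware scaling inflates the base frequency
base = 10000.0 * NTK_FACTOR ** (ROT_DIM / (ROT_DIM - 2))

def partial_rope(x, pos):
    """x: one head's vector (len HEAD_DIM); rotate pairs in the first ROT_DIM dims."""
    out = list(x)
    for i in range(0, ROT_DIM, 2):
        theta = pos / base ** (i / ROT_DIM)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

rotated = partial_rope([1.0] * HEAD_DIM, pos=10)
# dims 16..63 are unchanged; each rotated pair keeps its norm
```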
SmearGate
Gating mechanism in MLP layers
parameters: null
BigramHash
Hashing mechanism with 2048 buckets for bigrams
parameters: {"buckets":2048}
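The bigram hash can be sketched as mapping each consecutive token pair into one of the 2048 buckets, whose learned embedding is then added to the current token's representation. The mixing constants below are illustrative; the PR does not specify the actual hash function.

```python
# Hedged sketch of BigramHash bucketing: hash (prev_token, token) pairs
# into 2048 buckets. The multiplicative mixing constants are assumptions.
NUM_BUCKETS = 2048

def bigram_bucket(prev_token, token):
    # simple 32-bit multiplicative mix; deterministic per (prev, cur) pair
    h = (prev_token * 0x9E3779B1 + token * 0x85EBCA77) & 0xFFFFFFFF
    return h % NUM_BUCKETS

tokens = [5, 17, 17, 5, 99]
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
# one bucket index per bigram; the bucket's embedding augments the token
```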
XSA
Cross self-attention applied in the last 4 layers
parameters: {"layers":4}
Weight Averaging
EMA
parameters: {"decay":0.995,"usage":"previous submission #508 (disabled in this PR)"}
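For reference, the EMA weight averaging used in submission #508 (disabled in this PR) amounts to keeping a shadow copy of the weights that trails the live ones. A minimal sketch with the stated decay of 0.995:

```python
# Minimal EMA weight-averaging sketch (decay 0.995, as in #508; disabled here).
DECAY = 0.995

def ema_update(ema_weights, weights, decay=DECAY):
    """Move the shadow copy a small step toward the live weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]
for _ in range(100):
    ema = ema_update(ema, [1.0, 2.0])
# the shadow copy converges toward the (here static) live weights
```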
Test-Time Training
full TTT with SGD
parameters: {"learning_rate":0.002,"epochs":3,"max_train_chunks":50,"EMA_decay":0,"freeze_blocks":2,"optimizer":"SGD"}
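The short-TTT loop described by these parameters can be sketched as follows: plain SGD at lr 0.002, 3 epochs per chunk, the first 2 blocks frozen, no EMA smoothing (EMA_decay 0), and a hard stop after 50 chunks. The model, loss, and gradient function below are toys for illustration, not the PR's pipeline.

```python
# Hedged sketch of short TTT: SGD, per-chunk epochs, frozen leading blocks,
# and an early stop after MAX_CHUNKS to avoid late-chunk degradation.
LR, EPOCHS, MAX_CHUNKS, FREEZE_BLOCKS = 0.002, 3, 50, 2

def ttt(params, chunks, grad_fn):
    """params: list of per-block parameter lists; grad_fn -> per-block grads."""
    for chunk_idx, chunk in enumerate(chunks):
        if chunk_idx >= MAX_CHUNKS:                # short-TTT early stop
            break
        for _ in range(EPOCHS):
            grads = grad_fn(params, chunk)
            for b in range(FREEZE_BLOCKS, len(params)):   # skip frozen blocks
                params[b] = [p - LR * g for p, g in zip(params[b], grads[b])]
    return params

# toy demo: 3 "blocks" of one scalar each; gradient of p^2 is 2p
params = ttt([[1.0], [1.0], [1.0]], [None] * 60,
             lambda ps, _: [[2.0 * p for p in blk] for blk in ps])
# blocks 0-1 stay frozen at 1.0; block 2 decays toward 0
```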
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
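A single Muon-style step combines a momentum buffer with a Newton-Schulz iteration that approximately orthogonalizes the update matrix. The sketch below uses the quintic coefficients from the commonly published Muon implementation; treat it as illustrative under those assumptions, not the PR's exact code.

```python
# Hedged sketch of a Muon-style update: momentum accumulation followed by
# Newton-Schulz orthogonalization of the update matrix, then a decoupled
# weight-decay step. Coefficients are from the widely used reference impl.
MOMENTUM, LR, WD = 0.99, 0.025, 0.04

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    """Drive the singular values of G toward 1 via X <- aX + (bA + cA^2)X, A = XX^T."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, [list(col) for col in zip(*X)])                 # X @ X^T
        B = [[b * y + c * z for y, z in zip(r1, r2)]
             for r1, r2 in zip(A, matmul(A, A))]                      # bA + cA^2
        X = [[a * x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf):
    """One optimizer step for a 2-D weight matrix; returns (new_W, new_buf)."""
    buf = [[MOMENTUM * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    O = newton_schulz(buf)
    newW = [[w * (1 - LR * WD) - LR * o for w, o in zip(rw, ro)]
            for rw, ro in zip(W, O)]
    return newW, buf
```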
Evaluation
sliding window eval
parameters: {"stride":64}
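Sliding-window eval with stride 64 typically means each evaluation window scores only its last 64 tokens, so every scored token gets long left context. A sketch of the index bookkeeping, with an assumed context length of 256 (only the stride is given in the PR):

```python
# Hedged sketch of sliding-window evaluation indexing with stride 64.
# CTX is an assumed context length; only STRIDE comes from the PR.
CTX, STRIDE = 256, 64

def sliding_window_spans(n_tokens):
    """Yield (window_start, score_start, score_end): score only the final
    STRIDE tokens of each window so scored tokens keep long context."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + STRIDE, n_tokens)
        win_start = max(0, end - CTX)
        spans.append((win_start, start, end))
        start = end
    return spans

spans = sliding_window_spans(300)
# the scored ranges tile [0, 300) exactly once, with overlapping contexts
```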

Novel Contributions

  • Short TTT strategy: SGD-based test-time training with no EMA smoothing, stopped after 50 chunks to avoid late-chunk degradation
  • Demonstrated that EMA smoothing during TTT can wash out adaptation gains
  • Used zstd level-22 compression to cut artifact size by ~2 MB versus the previous fallback
  • Disabled the int8_sensitive flag to stay within the 16 MB artifact-size limit
  • Shared a detailed TTT chunk-trajectory analysis showing adaptation and distribution-shift effects
  • Kept the same base architecture and GPTQ pipeline while marginally improving val_bpb over the previous submission