PR #1655

open

14L QEP GPTQ + Per-Window SGD TTT (1.1135 BPB, 3-seed)

by himanalot
val_bpb
1.1135
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.80MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"layers":14,"dimensions":512,"query_heads":8,"kv_heads":4}
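The head grouping above (8 query heads sharing 4 KV heads, so 2 query heads per KV head) can be sketched in NumPy. The head counts come from the listed parameters; the projection shapes, scaling, and causal-masking details are illustrative assumptions, not taken from the PR:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_q_heads // n_kv_heads
    query heads, shrinking the K/V projections and cache."""
    T, d = x.shape
    hd = d // n_q_heads                # per-head dimension
    group = n_q_heads // n_kv_heads    # query heads per KV head
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Broadcast each KV head to its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    mask = np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd) + mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```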
U-Net skip connections
Encoder-decoder skip connections across the 14-layer network
parameters: {"encoder_layers":7,"decoder_layers":7}
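The 7+7 encoder/decoder split with skip connections can be sketched as follows. The mirror-image push/pop wiring and additive merge are assumptions about the skip pattern; the blocks are arbitrary callables standing in for transformer layers:

```python
def unet_skip_forward(x, layers):
    """U-Net-style skips over an even stack of blocks: the first half
    ('encoder') push their outputs on a stack, and each second-half
    ('decoder') block adds the popped activation from its mirror layer
    before running."""
    assert len(layers) % 2 == 0
    half = len(layers) // 2
    skips = []
    for f in layers[:half]:        # encoder: run and remember
        x = f(x)
        skips.append(x)
    for f in layers[half:]:        # decoder: merge mirrored skip, then run
        x = x + skips.pop()
        x = f(x)
    return x
```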
BigramHash
Bigram hash embedding module
parameters: {"vocab_size":8192,"dimensions":64}
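A bigram hash embedding maps each (previous token, current token) pair to one of the 8192 buckets above and looks up a 64-dimensional vector. The bucket count and dimension come from the listed parameters; the hash mixing constant and zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

def bigram_hash_embed(tokens, table, n_buckets=8192):
    """Hash each (prev, cur) token bigram into a bucket and return its
    embedding row. The multiplicative constant is illustrative."""
    toks = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], toks[:-1]))       # pad the first position
    h = (prev * 1000003 + toks) % n_buckets       # simple multiplicative hash
    return table[h]
```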
LeakyReLU
Leaky ReLU squared activation
parameters: {"slope":0.5}
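One plausible reading of "leaky ReLU squared" with slope 0.5 is a sign-preserving square of the leaky ReLU output; the exact formulation in the PR may differ (e.g. squaring only the positive branch):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """Apply leaky ReLU, then square the magnitude while keeping the sign,
    so the negative branch stays negative. One plausible interpretation."""
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y ** 2
```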
Weight Averaging
EMA
parameters: {"decay":0.997}
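The EMA weight averaging above is a single standard update per step; only the 0.997 decay is from the PR:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over a dict of parameter arrays:
    avg <- decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```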
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"adam_weight_decay":0.02}
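Muon's core step is momentum accumulation followed by an approximate orthogonalization of the 2-D gradient via Newton-Schulz iteration. The sketch below uses the simple cubic iteration rather than the tuned quintic coefficients Muon uses in practice, and omits weight decay; momentum 0.99 matches the listed setting while the learning rate is a placeholder:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix (drive singular values to 1).
    Cubic iteration X <- 1.5 X - 0.5 X X^T X after Frobenius normalization;
    Muon's production version uses tuned quintic coefficients."""
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.99):
    """One Muon update: momentum buffer, then an orthogonalized step."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orthogonalize(buf)
    return w, buf
```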
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
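A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps; only warmdown_steps=3500 is from the PR, the total step count and base LR below are placeholders:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the last
    `warmdown_steps` steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```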
Quantization
GPTQ
bits: 6
scope: attention + MLP weights
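GPTQ's inner loop quantizes one weight at a time and folds the rounding error into the not-yet-quantized weights using the inverse Hessian from calibration data. The sketch below shows that error-feedback idea for a single row with symmetric 6-bit levels; it omits GPTQ's blocking, Cholesky factorization, and the QEP-aware sequential calibration described in the contributions:

```python
import numpy as np

def gptq_quantize_row(w, H_inv, bits=6):
    """Greedy per-weight quantization with inverse-Hessian error feedback
    (the core GPTQ idea), simplified: one symmetric per-row scale."""
    w = w.astype(np.float64).copy()
    levels = 2 ** (bits - 1) - 1           # 31 levels each side for 6 bits
    scale = np.abs(w).max() / levels
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = np.clip(np.round(w[j] / scale), -levels, levels) * scale
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]  # compensate remaining weights
    return q, scale
```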
Compression
Brotli
level: 11
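Packing the serialized weights with Brotli at its maximum quality (11) is a one-liner, assuming the third-party `brotli` Python bindings:

```python
import brotli  # third-party "Brotli" bindings; assumed available

def compress_artifact(raw: bytes, quality: int = 11) -> bytes:
    """Compress serialized model weights with Brotli at max quality,
    as used here to fit the artifact under the 16MB limit."""
    return brotli.compress(raw, quality=quality)
```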
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"stride":76,"freeze_layers":2,"epochs":1}
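Score-first test-time training scores each window with the current weights before taking an SGD-with-momentum step on it, so evaluation never sees data the model has already adapted to. The learning rate and momentum match the listed parameters; the toy linear least-squares "model" below stands in for the transformer, and the windowing is simplified:

```python
import numpy as np

def score_first_ttt(w, windows, lr=0.002, momentum=0.9):
    """Per-window score-first TTT on a toy linear model: record the loss
    first, then adapt the weights with one SGD-with-momentum step."""
    buf = np.zeros_like(w)
    losses = []
    for X, y in windows:
        pred = X @ w
        losses.append(float(np.mean((pred - y) ** 2)))  # score first
        grad = 2 * X.T @ (pred - y) / len(y)            # then adapt
        buf = momentum * buf + grad
        w = w - lr * buf
    return w, losses
```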
Evaluation
stride-based eval
parameters: {"stride":76}
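Stride-based evaluation slides a full-context window forward by 76 tokens at a time and scores only the tokens new to each window, so every token is scored exactly once with the longest available left context. The stride is from the parameters; the context length and span bookkeeping are illustrative:

```python
def stride_eval_spans(n_tokens, context, stride=76):
    """Plan sliding-window evaluation: each span is
    (context start, score start, score end); scored spans are disjoint
    and cover all n_tokens."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - context)
        spans.append((ctx_start, start, end))
        start = end
    return spans
```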
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • QEP-aware GPTQ with sequential block calibration using partially quantized model outputs
  • Online per-window score-first SGD test-time training
  • Critical-depth 14-layer 512d GQA architecture
  • Brotli-compressed artifact under the 16MB limit