PR #1655

open

14L QEP GPTQ + Per-Window SGD TTT (1.1135 BPB, 3-seed)

by himanalot
val_bpb
1.1135
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.80MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"layers":14,"dimensions":512,"query_heads":8,"kv_heads":4}
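The head grouping above (8 query heads sharing 4 KV heads, so 2 query heads per KV head) can be sketched in NumPy. The head counts come from the listed parameters; the projection shapes, scaling, and causal-masking details are illustrative assumptions, not taken from the PR:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_q_heads // n_kv_heads
    query heads, shrinking the K/V projections and cache."""
    T, d = x.shape
    hd = d // n_q_heads                # per-head dimension
    group = n_q_heads // n_kv_heads    # query heads per KV head
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Broadcast each KV head to its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    mask = np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd) + mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)
```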
U-Net skip connections
Encoder-decoder skip connections across the 14-layer network
parameters: {"encoder_layers":7,"decoder_layers":7}
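The 7+7 encoder/decoder split with skip connections can be sketched as follows. The mirror-image push/pop wiring and additive merge are assumptions about the skip pattern; the blocks are arbitrary callables standing in for transformer layers:

```python
def unet_skip_forward(x, layers):
    """U-Net-style skips over an even stack of blocks: the first half
    ('encoder') push their outputs on a stack, and each second-half
    ('decoder') block adds the popped activation from its mirror layer
    before running."""
    assert len(layers) % 2 == 0
    half = len(layers) // 2
    skips = []
    for f in layers[:half]:        # encoder: run and remember
        x = f(x)
        skips.append(x)
    for f in layers[half:]:        # decoder: merge mirrored skip, then run
        x = x + skips.pop()
        x = f(x)
    return x
```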
BigramHash
Bigram hash embedding module
parameters: {"vocab_size":8192,"dimensions":64}
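A bigram hash embedding maps each (previous token, current token) pair to one of the 8192 buckets above and looks up a 64-dimensional vector. The bucket count and dimension come from the listed parameters; the hash mixing constant and zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

def bigram_hash_embed(tokens, table, n_buckets=8192):
    """Hash each (prev, cur) token bigram into a bucket and return its
    embedding row. The multiplicative constant is illustrative."""
    toks = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], toks[:-1]))       # pad the first position
    h = (prev * 1000003 + toks) % n_buckets       # simple multiplicative hash
    return table[h]
```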
LeakyReLU
Leaky ReLU squared activation
parameters: {"slope":0.5}
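One plausible reading of "leaky ReLU squared" with slope 0.5 is a sign-preserving square of the leaky ReLU output; the exact formulation in the PR may differ (e.g. squaring only the positive branch):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """Apply leaky ReLU, then square the magnitude while keeping the sign,
    so the negative branch stays negative. One plausible interpretation."""
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y ** 2
```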
Weight Averaging
EMA
parameters: {"decay":0.997}
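The EMA weight averaging above is a single standard update per step; only the 0.997 decay is from the PR:

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over a dict of parameter arrays:
    avg <- decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}
```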
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"adam_weight_decay":0.02}
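Muon's core step is momentum accumulation followed by an approximate orthogonalization of the 2-D gradient via Newton-Schulz iteration. The sketch below uses the simple cubic iteration rather than the tuned quintic coefficients Muon uses in practice, and omits weight decay; momentum 0.99 matches the listed setting while the learning rate is a placeholder:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix (drive singular values to 1).
    Cubic iteration X <- 1.5 X - 0.5 X X^T X after Frobenius normalization;
    Muon's production version uses tuned quintic coefficients."""
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.99):
    """One Muon update: momentum buffer, then an orthogonalized step."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orthogonalize(buf)
    return w, buf
```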
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
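A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps; only warmdown_steps=3500 is from the PR, the total step count and base LR below are placeholders:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the last
    `warmdown_steps` steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```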
Quantization
GPTQ
bits: 6
scope: attention + MLP weights
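GPTQ's inner loop quantizes one weight at a time and folds the rounding error into the not-yet-quantized weights using the inverse Hessian from calibration data. The sketch below shows that error-feedback idea for a single row with symmetric 6-bit levels; it omits GPTQ's blocking, Cholesky factorization, and the QEP-aware sequential calibration described in the contributions:

```python
import numpy as np

def gptq_quantize_row(w, H_inv, bits=6):
    """Greedy per-weight quantization with inverse-Hessian error feedback
    (the core GPTQ idea), simplified: one symmetric per-row scale."""
    w = w.astype(np.float64).copy()
    levels = 2 ** (bits - 1) - 1           # 31 levels each side for 6 bits
    scale = np.abs(w).max() / levels
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = np.clip(np.round(w[j] / scale), -levels, levels) * scale
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]  # compensate remaining weights
    return q, scale
```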
Compression
Brotli
level: 11
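Packing the serialized weights with Brotli at its maximum quality (11) is a one-liner, assuming the third-party `brotli` Python bindings:

```python
import brotli  # third-party "Brotli" bindings; assumed available

def compress_artifact(raw: bytes, quality: int = 11) -> bytes:
    """Compress serialized model weights with Brotli at max quality,
    as used here to fit the artifact under the 16MB limit."""
    return brotli.compress(raw, quality=quality)
```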
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"stride":76,"freeze_layers":2,"epochs":1}
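Score-first test-time training scores each window with the current weights before taking an SGD-with-momentum step on it, so evaluation never sees data the model has already adapted to. The learning rate and momentum match the listed parameters; the toy linear least-squares "model" below stands in for the transformer, and the windowing is simplified:

```python
import numpy as np

def score_first_ttt(w, windows, lr=0.002, momentum=0.9):
    """Per-window score-first TTT on a toy linear model: record the loss
    first, then adapt the weights with one SGD-with-momentum step."""
    buf = np.zeros_like(w)
    losses = []
    for X, y in windows:
        pred = X @ w
        losses.append(float(np.mean((pred - y) ** 2)))  # score first
        grad = 2 * X.T @ (pred - y) / len(y)            # then adapt
        buf = momentum * buf + grad
        w = w - lr * buf
    return w, losses
```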
Evaluation
stride-based eval
parameters: {"stride":76}
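Stride-based evaluation slides a full-context window forward by 76 tokens at a time and scores only the tokens new to each window, so every token is scored exactly once with the longest available left context. The stride is from the parameters; the context length and span bookkeeping are illustrative:

```python
def stride_eval_spans(n_tokens, context, stride=76):
    """Plan sliding-window evaluation: each span is
    (context start, score start, score end); scored spans are disjoint
    and cover all n_tokens."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - context)
        spans.append((ctx_start, start, end))
        start = end
    return spans
```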
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • QEP-aware GPTQ with sequential block calibration using partially quantized model outputs
  • Online per-window score-first SGD test-time training
  • Critical-depth 14-layer 512d GQA architecture
  • Brotli-compressed artifact under the 16MB limit