PR #610

open

GPTQ Int6 + SGD Test-Time Training — A800 1.1190 bpb

by ChaosCodes
val_bpb
1.1190
Architecture
GPT
Optimizer
SGD
Artifact Size
15,750,888 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
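
The GPTQ entry above can be sketched as column-wise quantization with Hessian-guided error feedback. This is a minimal illustration only: it uses a plain inverse Hessian from calibration inputs, with no Cholesky reordering, blocking, or grouping, and the int6 scale scheme is an assumption, not taken from the PR.

```python
import numpy as np

def gptq_int6(W, X, damp=0.01):
    """Simplified GPTQ-style int6 quantization of W (out, in).

    Quantizes one input column at a time and spreads each column's
    rounding error over the not-yet-quantized columns using the inverse
    Hessian H = X^T X of the calibration inputs X.
    """
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    H = X.T @ X + damp * np.eye(W.shape[1])
    Hinv = np.linalg.inv(H)
    qmax = 31                                        # int6 levels in [-32, 31]
    for j in range(W.shape[1]):
        scale = np.abs(W[:, j]).max() / qmax + 1e-12
        Q[:, j] = np.clip(np.round(W[:, j] / scale), -32, qmax) * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # propagate the rounding error onto the remaining columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(64, 16))        # calibration inputs
Q = gptq_int6(W, X)
```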
Architecture
XSA4
Last 4 layers attend across batch sequences (Cross-Sequence Attention)
parameters: {"layers":4}
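
One plausible reading of Cross-Sequence Attention, sketched below: in the last 4 layers, the batch dimension is folded into the sequence so every token can attend to tokens from other sequences in the batch. The single-head, unmasked form here is an assumption; the PR does not specify heads or masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_sequence_attention(x):
    """Attention over the whole batch rather than per sequence (sketch)."""
    B, T, D = x.shape
    flat = x.reshape(1, B * T, D)               # merge batch into one long sequence
    scores = flat @ flat.transpose(0, 2, 1) / np.sqrt(D)
    out = softmax(scores) @ flat
    return out.reshape(B, T, D)

x = np.random.default_rng(1).normal(size=(4, 8, 16))
y = cross_sequence_attention(x)
```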
EMA
Exponential Moving Average weight averaging for smoother convergence
parameters: null
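
EMA weight averaging in its generic form, for reference; the PR gives no decay value, so the defaults here are illustrative only.

```python
class EMA:
    """Exponential moving average of model weights (generic sketch)."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * float(v)

ema = EMA({"w": 0.0}, decay=0.5)
ema.update({"w": 1.0})
ema.update({"w": 1.0})
print(ema.shadow["w"])  # 0.75
```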
U-Net skip
Residual skip connections between early and late layers
parameters: null
SmearGate
Learned gating for token mixing
parameters: null
BigramHash
2048-vocab bigram hash embeddings for local context
parameters: {"vocab_size":2048,"embedding_dim":128}
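
The BigramHash idea, matching the stated vocab_size=2048 and embedding_dim=128: hash each (previous, current) token pair into a small embedding table. The hash mixing constant and the padding of the first position are assumptions for illustration.

```python
import numpy as np

def bigram_hash(tokens, vocab_size=2048):
    """Hash each (prev, cur) token pair into a 2048-entry table."""
    prev = np.concatenate(([0], tokens[:-1]))    # pad first position with 0
    return (prev * 1000003 + tokens) % vocab_size

rng = np.random.default_rng(2)
table = rng.normal(size=(2048, 128))             # embedding_dim=128 per the PR
toks = np.array([5, 17, 42, 17, 42])
emb = table[bigram_hash(toks)]
```

The same bigram always hashes to the same embedding row, giving the model a cheap local-context signal alongside the unigram embedding.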
PartialRoPE
Partial Rotary Positional Embeddings on 16 dims, base 10000
parameters: {"dimensions":16,"base":10000}
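
Partial RoPE as listed (16 rotated dims, base 10000) can be sketched as rotating only the first 16 dimensions of each position and passing the rest through unchanged. The half-split pairing used here is one common convention, assumed rather than taken from the PR.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotary embeddings on the first `rot_dims` dims only; rest untouched."""
    T, D = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), freqs)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(3).normal(size=(10, 64))
y = partial_rope(x)
```

Because the transform is a pure rotation, the norm of the rotated 16-dim slice is preserved per position, and the remaining 48 dims are identical to the input.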
LNScale
Learnable LayerNorm scaling
parameters: null
ValueEmbed
128-dim value embeddings on layers 9-10
parameters: {"dimensions":128,"layers":[9,10]}
LateQAT
Quantization-aware training enabled after loss threshold 0.15
parameters: {"loss_threshold":0.15}
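
LateQAT's gating logic reduces to a latch on the training loss: quantization-aware training stays off until the loss first crosses the 0.15 threshold, then stays on. A minimal sketch:

```python
def late_qat_gate(losses, threshold=0.15):
    """Per-step flags: QAT is off until loss first drops below the
    threshold, then on for the rest of training (LateQAT sketch)."""
    active, flags = False, []
    for loss in losses:
        if loss < threshold:
            active = True
        flags.append(active)
    return flags

print(late_qat_gate([0.4, 0.2, 0.12, 0.18]))  # [False, False, True, True]
```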
SWA
Stochastic Weight Averaging checkpoint averaging every 50 steps
parameters: {"frequency_steps":50}
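
SWA here means a plain average over checkpoints collected every 50 steps; the averaging itself is just an elementwise mean over the saved weight dicts:

```python
def swa_average(checkpoints):
    """Elementwise mean over a list of weight dicts (SWA sketch;
    checkpoints are assumed to be collected every 50 steps per the PR)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = swa_average([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
print(avg["w"])  # 2.0
```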
Activation
Squared LeakyReLU (negative slope 0.5) replacing GELU²
parameters: {"negative_slope":0.5}
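
Reading the activation entry as LeakyReLU(0.5) followed by squaring (the same shape of construction as the GELU² it replaces), the drop-in is:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared."""
    return np.where(x > 0, x, negative_slope * x) ** 2

print(squared_leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```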
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"lr_schedule":"cosine","epochs_per_chunk":3,"chunk_size_tokens":32768,"freeze_blocks":2,"score_first":true}
Compression
zstd
level: 21
Evaluation
sliding window eval
parameters: {"stride":64}
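
Sliding-window evaluation with stride 64 typically means overlapping context windows advanced 64 tokens at a time, scoring only the fresh tokens of each window. The window length below is an assumption; the PR only states the stride.

```python
def sliding_window_positions(seq_len, window, stride=64):
    """(start, end) spans for overlapping eval windows (sketch; only
    the final `stride` tokens of each window would be scored)."""
    starts = range(0, max(seq_len - window, 0) + 1, stride)
    return [(s, s + window) for s in starts]

print(sliding_window_positions(256, 128, 64))  # [(0, 128), (64, 192), (128, 256)]
```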
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"cosine_lr_schedule":true,"max_chunks":900,"chunk_size_tokens":32768,"freeze_blocks":2,"epochs_per_chunk":3}
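
The score-first TTT loop implied by the parameters above: each chunk is scored with the current weights before the model adapts to it, so the reported bpb never benefits from training on the tokens being measured. The `score`/`adapt` callables are placeholders standing in for the real eval and SGD steps.

```python
def score_first_ttt(chunks, score, adapt, max_chunks=900):
    """Score-first test-time training loop (sketch).

    `score(chunk)` returns (bpb, num_tokens) under the current weights;
    `adapt(chunk)` then runs the SGD epochs (3 per chunk, per the PR)."""
    total_bpb, total_toks = 0.0, 0
    for i, chunk in enumerate(chunks):
        if i >= max_chunks:
            break
        bpb, n = score(chunk)        # evaluate before adapting
        total_bpb += bpb * n
        total_toks += n
        adapt(chunk)                 # then train on the chunk
    return total_bpb / total_toks

seen = []
result = score_first_ttt(
    [["tok"] * 4] * 3,
    score=lambda c: (1.0, len(c)),
    adapt=lambda c: seen.append(len(c)),
    max_chunks=2,
)
```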

Novel Contributions

  • LeakyReLU(0.5)² activation replacing GELU², improving gradient flow and saving 0.0026 bpb
  • GPTQ int6 Hessian-guided column-wise quantization replacing naive per-row rounding, reducing quantization error by 33.6% and saving 0.0029 bpb
  • SGD test-time training (TTT) adapting the last 9 of 11 layers with cosine LR decay, improving evaluation bpb by ~0.0024