PR #610

open

GPTQ Int6 + SGD Test-Time Training — A800 1.1190 bpb

by ChaosCodes
val_bpb
1.1190
Architecture
GPT
Optimizer
SGD
Artifact Size
15,750,888 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
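
The GPTQ entry above can be sketched as column-wise quantization with Hessian-guided error feedback. This is a minimal illustration only: it uses a plain inverse Hessian from calibration inputs, with no Cholesky reordering, blocking, or grouping, and the int6 scale scheme is an assumption, not taken from the PR.

```python
import numpy as np

def gptq_int6(W, X, damp=0.01):
    """Simplified GPTQ-style int6 quantization of W (out, in).

    Quantizes one input column at a time and spreads each column's
    rounding error over the not-yet-quantized columns using the inverse
    Hessian H = X^T X of the calibration inputs X.
    """
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    H = X.T @ X + damp * np.eye(W.shape[1])
    Hinv = np.linalg.inv(H)
    qmax = 31                                        # int6 levels in [-32, 31]
    for j in range(W.shape[1]):
        scale = np.abs(W[:, j]).max() / qmax + 1e-12
        Q[:, j] = np.clip(np.round(W[:, j] / scale), -32, qmax) * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # propagate the rounding error onto the remaining columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(64, 16))        # calibration inputs
Q = gptq_int6(W, X)
```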
Architecture
XSA4
Last 4 layers attend across batch sequences (Cross-Sequence Attention)
parameters: {"layers":4}
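
One plausible reading of Cross-Sequence Attention, sketched below: in the last 4 layers, the batch dimension is folded into the sequence so every token can attend to tokens from other sequences in the batch. The single-head, unmasked form here is an assumption; the PR does not specify heads or masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_sequence_attention(x):
    """Attention over the whole batch rather than per sequence (sketch)."""
    B, T, D = x.shape
    flat = x.reshape(1, B * T, D)               # merge batch into one long sequence
    scores = flat @ flat.transpose(0, 2, 1) / np.sqrt(D)
    out = softmax(scores) @ flat
    return out.reshape(B, T, D)

x = np.random.default_rng(1).normal(size=(4, 8, 16))
y = cross_sequence_attention(x)
```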
EMA
Exponential Moving Average weight averaging for smoother convergence
parameters: null
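
EMA weight averaging in its generic form, for reference; the PR gives no decay value, so the defaults here are illustrative only.

```python
class EMA:
    """Exponential moving average of model weights (generic sketch)."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * float(v)

ema = EMA({"w": 0.0}, decay=0.5)
ema.update({"w": 1.0})
ema.update({"w": 1.0})
print(ema.shadow["w"])  # 0.75
```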
U-Net skip
Residual skip connections between early and late layers
parameters: null
SmearGate
Learned gating for token mixing
parameters: null
BigramHash
2048-vocab bigram hash embeddings for local context
parameters: {"vocab_size":2048,"embedding_dim":128}
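
The BigramHash idea, matching the stated vocab_size=2048 and embedding_dim=128: hash each (previous, current) token pair into a small embedding table. The hash mixing constant and the padding of the first position are assumptions for illustration.

```python
import numpy as np

def bigram_hash(tokens, vocab_size=2048):
    """Hash each (prev, cur) token pair into a 2048-entry table."""
    prev = np.concatenate(([0], tokens[:-1]))    # pad first position with 0
    return (prev * 1000003 + tokens) % vocab_size

rng = np.random.default_rng(2)
table = rng.normal(size=(2048, 128))             # embedding_dim=128 per the PR
toks = np.array([5, 17, 42, 17, 42])
emb = table[bigram_hash(toks)]
```

The same bigram always hashes to the same embedding row, giving the model a cheap local-context signal alongside the unigram embedding.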
PartialRoPE
Partial Rotary Positional Embeddings on 16 dims, base 10000
parameters: {"dimensions":16,"base":10000}
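
Partial RoPE as listed (16 rotated dims, base 10000) can be sketched as rotating only the first 16 dimensions of each position and passing the rest through unchanged. The half-split pairing used here is one common convention, assumed rather than taken from the PR.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotary embeddings on the first `rot_dims` dims only; rest untouched."""
    T, D = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), freqs)          # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(3).normal(size=(10, 64))
y = partial_rope(x)
```

Because the transform is a pure rotation, the norm of the rotated 16-dim slice is preserved per position, and the remaining 48 dims are identical to the input.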
LNScale
Learnable LayerNorm scaling
parameters: null
ValueEmbed
128-dim value embeddings on layers 9-10
parameters: {"dimensions":128,"layers":[9,10]}
LateQAT
Quantization-aware training enabled after loss threshold 0.15
parameters: {"loss_threshold":0.15}
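
LateQAT's gating logic reduces to a latch on the training loss: quantization-aware training stays off until the loss first crosses the 0.15 threshold, then stays on. A minimal sketch:

```python
def late_qat_gate(losses, threshold=0.15):
    """Per-step flags: QAT is off until loss first drops below the
    threshold, then on for the rest of training (LateQAT sketch)."""
    active, flags = False, []
    for loss in losses:
        if loss < threshold:
            active = True
        flags.append(active)
    return flags

print(late_qat_gate([0.4, 0.2, 0.12, 0.18]))  # [False, False, True, True]
```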
SWA
Stochastic Weight Averaging checkpoint averaging every 50 steps
parameters: {"frequency_steps":50}
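
SWA here means a plain average over checkpoints collected every 50 steps; the averaging itself is just an elementwise mean over the saved weight dicts:

```python
def swa_average(checkpoints):
    """Elementwise mean over a list of weight dicts (SWA sketch;
    checkpoints are assumed to be collected every 50 steps per the PR)."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = swa_average([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}])
print(avg["w"])  # 2.0
```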
Activation
Squared LeakyReLU (negative slope 0.5) replacing GELU²
parameters: {"negative_slope":0.5}
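
Reading the activation entry as LeakyReLU(0.5) followed by squaring (the same shape of construction as the GELU² it replaces), the drop-in is:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared."""
    return np.where(x > 0, x, negative_slope * x) ** 2

print(squared_leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```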
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"lr_schedule":"cosine","epochs_per_chunk":3,"chunk_size_tokens":32768,"freeze_blocks":2,"score_first":true}
Compression
zstd
level: 21
Evaluation
sliding window eval
parameters: {"stride":64}
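
Sliding-window evaluation with stride 64 typically means overlapping context windows advanced 64 tokens at a time, scoring only the fresh tokens of each window. The window length below is an assumption; the PR only states the stride.

```python
def sliding_window_positions(seq_len, window, stride=64):
    """(start, end) spans for overlapping eval windows (sketch; only
    the final `stride` tokens of each window would be scored)."""
    starts = range(0, max(seq_len - window, 0) + 1, stride)
    return [(s, s + window) for s in starts]

print(sliding_window_positions(256, 128, 64))  # [(0, 128), (64, 192), (128, 256)]
```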
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"cosine_lr_schedule":true,"max_chunks":900,"chunk_size_tokens":32768,"freeze_blocks":2,"epochs_per_chunk":3}
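
The score-first TTT loop implied by the parameters above: each chunk is scored with the current weights before the model adapts to it, so the reported bpb never benefits from training on the tokens being measured. The `score`/`adapt` callables are placeholders standing in for the real eval and SGD steps.

```python
def score_first_ttt(chunks, score, adapt, max_chunks=900):
    """Score-first test-time training loop (sketch).

    `score(chunk)` returns (bpb, num_tokens) under the current weights;
    `adapt(chunk)` then runs the SGD epochs (3 per chunk, per the PR)."""
    total_bpb, total_toks = 0.0, 0
    for i, chunk in enumerate(chunks):
        if i >= max_chunks:
            break
        bpb, n = score(chunk)        # evaluate before adapting
        total_bpb += bpb * n
        total_toks += n
        adapt(chunk)                 # then train on the chunk
    return total_bpb / total_toks

seen = []
result = score_first_ttt(
    [["tok"] * 4] * 3,
    score=lambda c: (1.0, len(c)),
    adapt=lambda c: seen.append(len(c)),
    max_chunks=2,
)
```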

Novel Contributions

  • LeakyReLU(0.5)² activation replacing GELU², improving gradient flow and saving 0.0026 bpb
  • GPTQ int6 Hessian-guided column-wise quantization replacing naive per-row rounding, reducing quantization error by 33.6% and saving 0.0029 bpb
  • SGD test-time training (TTT) adapting the last 9 of 11 layers with cosine LR decay, improving evaluation bpb by ~0.0024