PR #1416
openRecord: SP8192 + Pre-Quant TTT — val_bpb 1.07948 (3-seed mean)
by erichroepke
val_bpb: 1.0795
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.12 MB
Training Techniques
Quantization
GPTQ (bits: null; scope: embeddings and weights)
GPTQ (bits: null; scope: embeddings)
Architecture
Depth recurrence: loops layers 4-5 twice to increase effective depth without increasing parameter count (parameters: null).
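A minimal sketch of that schedule, assuming a straightforward replay of the looped span (the function names are hypothetical; the PR only states that layers 4-5 run twice):

```python
def execution_order(n_layers, loop_start, loop_end, loops):
    """Return layer indices in the order they run under depth recurrence.

    Layers loop_start..loop_end (inclusive) are applied `loops` times in
    total; every other layer runs once. No parameters are added, only reuse.
    """
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == loop_end:  # replay the looped span (loops - 1) extra times
            order.extend(list(range(loop_start, loop_end + 1)) * (loops - 1))
    return order


def run_stack(x, layers, loop_start, loop_end, loops=2):
    """Apply `layers` (a list of callables) in the recurrent order."""
    for i in execution_order(len(layers), loop_start, loop_end, loops):
        x = layers[i](x)
    return x
```

For a 6-layer stack looping layers 4-5 twice, the execution order is `[0, 1, 2, 3, 4, 5, 4, 5]`: eight layer applications from six layers' worth of parameters.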
XSA: removes self-attention redundancy via projection across all layers (parameters: {"layers": "all"}).
U-Net skip connections: learned gating on the skip connections (parameters: null).
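The entry above only says the skips are gated; one plausible minimal form is a sigmoid-gated additive skip, with `gate_logit` standing in for a learned scalar (the exact gating form is an assumption, not stated in the PR):

```python
import numpy as np

def gated_skip(decoder_h, encoder_h, gate_logit):
    """Sketch of a learned-gated U-Net skip connection (assumed form).

    Instead of plain addition, the skip contribution is scaled by a learned
    scalar passed through a sigmoid; gate_logit would be trainable.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid, in [0, 1]
    return decoder_h + gate * encoder_h
```

At `gate_logit = 0` the skip contributes at half strength; training can push the gate toward 0 or 1 per connection.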
Optimizer
Muon (weight_decay: null; momentum: null; variant: MuonEq-R)
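The MuonEq-R variant's details are not given in the PR; as background, a hedged sketch of the base Muon update, which orthogonalizes the momentum buffer via a Newton-Schulz iteration (the simple cubic iteration here, rather than Muon's tuned quintic):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=10):
    """Approximately orthogonalize a matrix (the core of a Muon-style update).

    Cubic Newton-Schulz iteration; real Muon uses a tuned quintic, and the
    MuonEq-R variant named in the PR is not specified here.
    """
    x = g / (np.linalg.norm(g) + 1e-7)  # scale so singular values are <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # drives singular values toward 1
    return x

def muon_step(w, momentum, grad, lr=0.02, beta=0.95):
    """One Muon-style step: momentum accumulation, then orthogonalized update."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum
```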
Weight Averaging
EMA (decay: 0.997)
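The EMA rule itself is standard; with the PR's decay of 0.997, each step keeps 99.7% of the running average:

```python
def ema_update(avg_params, params, decay=0.997):
    """Exponential moving average of weights (decay from the PR: 0.997).

    avg <- decay * avg + (1 - decay) * current, applied after each training
    step; the averaged copy is what gets evaluated.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```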
Test-Time Training
Full TTT (optimizer: AdamW; epochs: 6; timing: pre-quant)
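The ordering is the point: adapt in full precision first, quantize after. An illustrative sketch, where plain gradient steps stand in for AdamW and uniform rounding stands in for the GPTQ stage (only the 6-epoch pre-quant ordering comes from the PR):

```python
import numpy as np

def ttt_then_quantize(w, grad_fn, lr=1e-3, epochs=6, n_levels=256):
    """Pre-quant TTT ordering: adapt full-precision weights, then quantize.

    grad_fn(w) returns the gradient of the test-time loss; the quantizer
    here is simple symmetric uniform rounding for illustration only.
    """
    for _ in range(epochs):  # test-time training in full precision
        w = w - lr * grad_fn(w)
    # uniform quantization stands in for the GPTQ stage
    scale = np.abs(w).max() / (n_levels // 2 - 1)
    return np.round(w / scale) * scale
```

Quantizing first and adapting after would leave the TTT updates fighting the rounding grid; adapting first lets the quantizer snapshot the already-adapted weights.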
Other
SDClip: quantization clipping using threshold = k × std(row) instead of grid search (parameters: null).
SP8192 tokenizer / vocabulary (vocab_size: 8192).
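The SDClip rule above replaces a per-row grid search over clip values with a closed-form range; a minimal sketch (the default `k` is hypothetical, since the PR does not state its value):

```python
import numpy as np

def sdclip(weight_rows, k=3.0):
    """Clip each weight row to +/- k * std(row) before quantization.

    One statistic per row replaces the grid search over candidate clip
    thresholds; k is a hypothetical default.
    """
    thresholds = k * weight_rows.std(axis=1, keepdims=True)
    return np.clip(weight_rows, -thresholds, thresholds)
```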
Novel Contributions
- Combined SP8192 base architecture with pre-quant AdamW TTT.
- Showed that pre-quant TTT and the SP8192 + SDClip + GPTQ pipeline stack without interfering.
- Achieved a new record val_bpb of 1.07948 using a 3-seed mean.
- Applied TTT before quantization so the adapted full-precision weights compress cleanly.