PR #1488
openRecord: SP1024 + SLOT-24 + QK5.25 + Pre-Quant AdamW TTT — val_bpb 0.8265 (3-seed mean)
by ndokutovich
val_bpb: 0.8265
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.76 MB
Training Techniques

Architecture
- GQA — grouped query attention with an 8/4 head configuration (parameters: {"heads":"8/4"})
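A minimal, non-causal sketch of the 8/4 grouped-query attention layout (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads). Weight shapes, the lack of a causal mask, and all names are illustrative assumptions, not the PR's implementation:

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads, 4 shared KV heads (non-causal sketch)."""
    T, d = x.shape
    hd = d // n_q_heads                       # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)    # (T, 8, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)   # (T, 4, hd) -- half the KV projection
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads           # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)           # broadcast KV heads to (T, 8, hd)
    v = np.repeat(v, group, axis=1)
    att = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)         # softmax over keys
    return np.einsum('hqk,khd->qhd', att, v).reshape(T, d)
```

The memory saving comes from the smaller K/V projections and KV cache: with 8/4 heads, K and V are half the size of full multi-head attention.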
- BigramHash — bigram hash embedding component (parameters: {"dimensions":128,"size":1024})
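A hypothetical sketch of a bigram hash embedding matching the recorded parameters (table size 1024, dimension 128): each (previous token, current token) pair is hashed into a small learned table, and the resulting vector would typically be added to the regular token embedding. The hashing scheme and function names are assumptions:

```python
import numpy as np

TABLE_SIZE, DIM = 1024, 128  # size and dimensions from the record

def bigram_hash_embed(tokens, table):
    """Look up one row of `table` (TABLE_SIZE x DIM) per bigram via a cheap mixing hash."""
    prev = np.concatenate(([0], tokens[:-1]))       # shifted tokens; pad position 0
    idx = (prev * 1000003 + tokens) % TABLE_SIZE    # assumed hash; collisions are allowed
    return table[idx]                               # (T, DIM)
```

Because the table is hashed rather than indexed by the full bigram vocabulary, distinct bigrams may collide; the table stays tiny (1024 × 128 entries) regardless of vocabulary size.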
- SmearGate — SmearGate architectural component
- U-Net skip connections — U-Net-style skip connections in the model
- Value Residual — value residual learning
- ReLU² — squared LeakyReLU / relu-squared style MLP activation
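The record describes the activation as "squared LeakyReLU / relu-squared style"; the plain relu² variant (no leaky slope, an assumption here) is simply:

```python
import numpy as np

def relu2(x):
    """relu-squared MLP activation: zero for negatives, x**2 for positives."""
    return np.square(np.maximum(x, 0.0))
```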
- XSA — XSA-all attention/sequence component
- MLP3x — 3× MLP expansion
Weight Averaging
- EMA (parameters: {"decay":0.997})
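A minimal sketch of parameter EMA with the record's decay of 0.997; the flat parameter dict and update hook are simplifications of whatever the PR actually uses:

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (decay 0.997 from the record)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: np.array(v, dtype=float) for k, v in params.items()}

    def update(self, params):
        """Call after each optimizer step: shadow <- d*shadow + (1-d)*current."""
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * np.asarray(v, dtype=float)
```

At eval (and, here, before quantization) the shadow weights replace the raw weights; with decay 0.997 the average has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.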
Quantization
- late QAT (bits: unspecified; scope: all)

Optimizer
- Muon (weight decay, momentum, other params: unspecified)
Compression
- lzma (level: unspecified)
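Packing the artifact then reduces to lzma-compressing the serialized (already quantized) weights; since the record leaves the level unspecified, the library default is assumed:

```python
import lzma

def pack_artifact(raw_weights: bytes) -> bytes:
    """lzma-compress the serialized quantized weights (default preset assumed)."""
    return lzma.compress(raw_weights)
```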
Test-Time Training
- AdamW TTT (parameters: {"epochs":10,"learning_rate":0.00045,"freeze_blocks":1})
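A hypothetical sketch of pre-quant test-time training: a few AdamW epochs on the evaluation stream itself, applied before the weights are quantized and packed. The model, gradient function, and parameter layout are stand-ins; the record's settings are epochs=10, learning_rate=4.5e-4, and freeze_blocks=1 (modeled here by skipping frozen keys):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=4.5e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay and bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    p = p - lr * (mhat / (np.sqrt(vhat) + eps) + wd * p)
    return p, m, v

def ttt(params, grad_fn, frozen=frozenset(), epochs=10):
    """Test-time-train `params` with AdamW, leaving keys in `frozen` untouched."""
    state = {k: (np.zeros_like(p), np.zeros_like(p)) for k, p in params.items()}
    for t in range(1, epochs + 1):
        grads = grad_fn(params)         # stand-in for a backward pass on eval text
        for k in params:
            if k in frozen:             # freeze_blocks=1: e.g. skip the first block
                continue
            m, v = state[k]
            params[k], m, v = adamw_step(params[k], grads[k], m, v, t)
            state[k] = (m, v)
    return params
```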
Evaluation
- sliding window eval (parameters: {"stride":96})
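One common way to implement sliding-window evaluation with stride 96 (assumed here, not confirmed by the record) is to score only the last `stride` tokens of each window, keeping the rest as context, so every token is scored exactly once:

```python
def sliding_windows(n_tokens, window=1024, stride=96):
    """Return (ctx_start, end, score_from) spans: tokens [score_from, end) are
    scored with context [ctx_start, end); each token is scored exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans
```

A small stride relative to the window (96 vs. 1024) means almost every scored token sees close to the full 1024-token context, at the cost of many forward passes.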
LR Schedule
- cosine decay (parameters: {"start":0.012,"end":0.001})
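The cosine decay schedule with the record's endpoints (0.012 down to 0.001) is, in its standard form:

```python
import math

def cosine_lr(step, total_steps, lr_start=0.012, lr_end=0.001):
    """Cosine decay from lr_start to lr_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * frac))
```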
Sequence Length
- sequence_length (train_length: 1024; eval_length: unspecified)
Other
- SLOT evaluation with a per-window delta and logit_bias optimized for 24 AdamW steps per window, then discarded (parameters: {"steps":24})
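A hypothetical sketch of the logit_bias half of the SLOT step: a fresh per-window bias over the vocabulary is optimized for 24 Adam-style steps to lower that window's NLL, then thrown away before the next window. The learning rate and the bias-only formulation are assumptions (the record also mentions a per-window delta, not modeled here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def slot_logit_bias(logits, targets, steps=24, lr=0.01):
    """Optimize a per-window logit bias for `steps` AdamW-style updates (wd=0),
    minimizing mean NLL of `targets` under logits + b; b is discarded afterwards."""
    V = logits.shape[-1]
    b = np.zeros(V)
    m = np.zeros(V)
    v = np.zeros(V)
    onehot_mean = np.bincount(targets, minlength=V) / len(targets)
    for t in range(1, steps + 1):
        g = softmax(logits + b).mean(0) - onehot_mean  # exact grad of mean NLL wrt b
        m = 0.9 * m + 0.1 * g
        v = 0.999 * v + 0.001 * g * g
        b -= lr * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)
    return b
```

Discarding the bias after each window keeps the method honest as evaluation-time adaptation: no information leaks forward between windows.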
Novel Contributions
- First combination of pre-quant AdamW TTT with SLOT hidden-state optimization
- Pre-quant TTT: the test-time-trained weights are baked into the artifact before quantization
- Improved base sliding BPB before SLOT, enabling a stronger final SLOT score
- QK_GAIN_INIT swept to 5.25 (vs. prior PR #1313)