PR #1209

open

Record: Full GPTQ + Score-First TTT + SLOT — val_bpb 1.1064 (3-seed mean)

by andrewbaggio1

val_bpb

1.1064
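
For reference, bits per byte is conventionally the summed negative log-likelihood, converted from nats to bits, divided by the number of raw bytes scored. The sketch below assumes the loss is accumulated in nats. The function and argument names are illustrative and not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the denominator counts bytes rather than tokens, the metric stays comparable across tokenizers.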

Architecture

Transformer

Optimizer

AdamW

Artifact Size

—

Training Techniques

Quantization

GPTQ

bits: 6

scope: all
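
To make the 6-bit setting concrete, the sketch below maps one row of weights onto a signed 6-bit integer grid (-32 to 31) with a single per-row scale. It uses plain round-to-nearest, which is not GPTQ: GPTQ additionally uses Hessian information to push each weight's rounding error into the weights that are still unquantized. The per-row scale and the function names are assumptions made for illustration.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization of one weight row (not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is at most half a quantization step, which is the baseline that error-compensating methods such as GPTQ aim to improve on at the layer-output level.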

late QAT

bits: null

scope: all

Test-Time Training

score-first TTT

parameters: {"epochs":3,"chunk_tokens":65536,"learning_rate":0.002}
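
The ordering implied by "score-first" can be sketched as a loop: each chunk is scored with weights that have not yet been trained on it, and only afterwards is the model adapted on that chunk. The callbacks score_fn and update_fn are hypothetical placeholders for the model's loss evaluation and one training pass; the chunk size and epoch count would come from the parameters above.

```python
def score_first_ttt(tokens, chunk_tokens, epochs, score_fn, update_fn):
    """Score each chunk before adapting on it; return the summed score."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score with weights that have never been updated on this chunk.
        total_loss += score_fn(chunk)
        # 2) Only then run the adaptation passes on the same chunk.
        for _ in range(epochs):
            update_fn(chunk)
    return total_loss
```

Every token is scored exactly once, so the evaluation remains a single pass even though the weights change between chunks.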

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: {"used_for":"TTT"}

AdamW

weight_decay: null

momentum: null

other_params: {"used_for":"SLOT","steps":8,"learning_rate":0.005}
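
The Novel Contributions list describes SLOT as optimizing a per-batch delta vector in hidden space. A minimal sketch of that idea, under stated assumptions: one delta vector is added to every hidden state in the batch, the model weights (here only a toy output head) stay frozen, and the delta is fitted with a few gradient steps on cross-entropy. Plain gradient descent stands in for AdamW to keep the example dependency-free. Which tokens the PR uses for the update and which for scoring is not specified here, so that part is left out.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def batch_loss_and_grad(hidden, targets, head, delta):
    """Mean cross-entropy over the batch and its gradient w.r.t. delta."""
    dim = len(delta)
    grad = [0.0] * dim
    loss = 0.0
    for h, t in zip(hidden, targets):
        shifted = [hi + di for hi, di in zip(h, delta)]
        logits = [sum(w * x for w, x in zip(row, shifted)) for row in head]
        probs = softmax(logits)
        loss -= math.log(probs[t])
        for v, row in enumerate(head):
            err = probs[v] - (1.0 if v == t else 0.0)
            for j in range(dim):
                grad[j] += err * row[j]
    n = len(hidden)
    return loss / n, [g / n for g in grad]


def fit_delta(hidden, targets, head, steps=8, lr=0.005):
    """Fit one shared delta vector for the batch; the head stays frozen."""
    delta = [0.0] * len(hidden[0])  # a fresh delta for every batch
    for _ in range(steps):
        _, grad = batch_loss_and_grad(hidden, targets, head, delta)
        delta = [d - lr * g for d, g in zip(delta, grad)]  # plain GD, not AdamW
    return delta
```

The delta is reinitialized for each batch, so nothing learned on one batch carries over to the next.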

Architecture

LeakyReLU

Uses LeakyReLU squared activation in the model stack.

parameters: {"negative_slope":0.5}
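
One plausible reading of "LeakyReLU squared" is to apply LeakyReLU with the listed negative slope and then square the result. The sketch below follows that reading on a single scalar; whether the PR preserves the sign of negative inputs is not stated, so this is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Square of LeakyReLU(x); the output is always non-negative."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Under this reading a negative input of -2 gives (0.5 * -2)^2 = 1, so negative pre-activations still contribute, at a quarter of the magnitude of their positive counterparts.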

GQA

Grouped query attention with reduced KV heads.

parameters: {"kv_heads":4,"query_groups":8}
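
The parameters above can be read in more than one way. The sketch below assumes 8 query heads sharing 4 KV heads, so that each KV head serves two consecutive query heads. The function name and the head-to-group mapping are illustrative, not taken from the PR.

```python
def kv_head_for_query(query_head, num_query_heads=8, num_kv_heads=4):
    """Return the index of the KV head that a given query head attends with."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size
```

Sharing KV heads this way shrinks the key and value projections by the group size while leaving the number of query heads unchanged.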

MLP3x

Expanded MLP width to 3x.

parameters: null

BigramHash

Bigram hash embedding component.

parameters: {"vocab_size":2816,"dim":112}
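
A bigram hash embedding typically maps each (previous token, current token) pair to a bucket in a small table and adds the bucket's vector to the token embedding. The sketch below only computes bucket indices for a table with 2816 rows; the mixing constant and the start-of-sequence token are hypothetical, since the PR's actual hash is not given here.

```python
def bigram_bucket(prev_token, token, num_buckets=2816):
    """Hash a token pair to a bucket index (hypothetical mixing function)."""
    return (prev_token * 1_000_003 + token) % num_buckets


def bigram_bucket_ids(tokens, num_buckets=2816, start_token=0):
    """Bucket index per position; the first token is paired with start_token."""
    prev = start_token
    ids = []
    for tok in tokens:
        ids.append(bigram_bucket(prev, tok, num_buckets))
        prev = tok
    return ids
```

Each index would then select one 112-dimensional row of the embedding table. Distinct pairs can collide in the same bucket, which is the trade-off for keeping the table small.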

SmearGate

SmearGate gating mechanism.

parameters: null

XSA

XSA attention variant.

parameters: null

Partial RoPE

Partial rotary positional embeddings.

parameters: {"dimensions":16}
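
Partial RoPE rotates only a subset of each head's dimensions and passes the rest through unchanged. The sketch below rotates the first 16 dimensions of one head vector; the frequency base of 10000 and the pairing of dimension i with dimension i + 8 are common conventions assumed here, not details confirmed by the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec; leave the tail untouched."""
    if rotary_dims % 2 != 0 or len(vec) < rotary_dims:
        raise ValueError("rotary_dims must be even and fit inside vec")
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out
```

The untouched dimensions carry no positional signal, so attention over them depends on content alone.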

VE128

VE128 architectural component.

parameters: null

Regularization

LN scale

parameters: null

logit softcap

parameters: null
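
Logit softcapping is commonly implemented by squashing each logit through a scaled tanh so that its magnitude can never exceed a cap. The sketch below shows that form; the cap of 30.0 is a placeholder, since no value is listed above.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap.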

Weight Averaging

EMA

parameters: null

SWA

parameters: null
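
The two averaging schemes differ in how they weight past checkpoints: EMA keeps a running average that decays older weights geometrically, while SWA takes a uniform mean over saved snapshots. The sketch below shows both on flat lists of weights; the decay of 0.999 is a placeholder, and how the PR combines the two is not stated.

```python
def ema_update(ema_weights, new_weights, decay=0.999):
    """One EMA step: move the running average slightly toward the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, new_weights)]


def swa_average(snapshots):
    """Uniform average over a list of weight snapshots."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    count = len(snapshots)
    return [sum(values) / count for values in zip(*snapshots)]
```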

Initialization

OrthoInit

Orthogonal initialization.

Compression

lzma

level: null
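
Since no compression level is listed, the sketch below uses the default preset of Python's standard lzma module to pack and unpack a byte string. How the PR serializes the quantized weights before compression is not described here, so raw bytes stand in for the artifact.

```python
import lzma


def pack(raw: bytes) -> bytes:
    """Compress serialized bytes with the default LZMA preset."""
    return lzma.compress(raw)


def unpack(blob: bytes) -> bytes:
    """Recover the original bytes from a compressed blob."""
    return lzma.decompress(blob)
```

Compression is lossless, so any loss in model quality comes from quantization rather than from this step.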

Evaluation

sliding window eval

parameters: {"stride":64}
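
Sliding window evaluation with a stride scores each token once while giving it as much left context as the window allows. The sketch below computes the window boundaries: each tuple holds the start and end of the context window and the index from which tokens are scored. The window length is a hypothetical argument, since only the stride is listed above.

```python
def sliding_windows(num_tokens, window, stride=64):
    """Return (start, end, score_from) spans that score every token exactly once."""
    if num_tokens <= 0:
        return []
    spans = []
    end = min(window, num_tokens)
    scored_upto = 0
    while True:
        start = max(0, end - window)
        # Context is tokens[start:end]; only tokens[scored_upto:end] are scored.
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= num_tokens:
            break
        end = min(end + stride, num_tokens)
    return spans
```

A smaller stride gives every scored token more context at the cost of more forward passes: after the first window, each pass scores at most stride new tokens.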

Novel Contributions

Combines full Hessian GPTQ with score-first chunked TTT and SLOT
Uses single-pass evaluation that complies with the score-before-update rule
Applies per-batch delta vector optimization in hidden space (SLOT)
Reports a 3-seed mean val_bpb of 1.1064