PR #1318
openRecord: TTT-AdamW + SLOT L-BFGS25 LogitDelta + GPTQ DAMP=0.005 — val_bpb 1.00955
by renqianluo
val_bpb
1.00955
Architecture
Transformer
Optimizer
AdamW
Artifact Size
~15.71 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: all
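The record's title advertises GPTQ with DAMP=0.005, i.e. a smaller diagonal-damping fraction than GPTQ's common default of 0.01. A minimal numpy sketch of what that damping does (the function name is illustrative; real GPTQ applies this to the layer-input Hessian before its Cholesky-based column sweep):

```python
import numpy as np

def damp_hessian(H, percdamp=0.005):
    """Add percdamp * mean(diag(H)) to the diagonal so the Cholesky
    factorization used by GPTQ stays numerically stable. Smaller
    percdamp perturbs the Hessian less, at some stability risk."""
    damp = percdamp * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

# A rank-deficient Hessian, as arises when calibration samples < columns.
X = np.random.default_rng(0).normal(size=(4, 8))  # 4 samples, 8 columns
H = X.T @ X                                        # rank <= 4, singular
H_damped = damp_hessian(H, percdamp=0.005)
L = np.linalg.cholesky(H_damped)                   # succeeds after damping
```

The trade-off implied by the record: less damping keeps the quantization error closer to the true second-order objective, which evidently helps at int6.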
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.001}
Test-Time Training
full TTT
parameters: {"learning_rate":0.001,"epochs":1,"frozen_blocks":"0-9"}
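The TTT entry above amounts to one AdamW epoch at lr 1e-3 over the test sequence with blocks 0-9 frozen. A toy sketch of the loop shape, with a scalar quadratic standing in for the LM loss (the decoupled-weight-decay update itself is standard AdamW; everything around it is illustrative):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.0):
    """One AdamW update with decoupled weight decay."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

# Toy "test-time training": one epoch of AdamW on a stand-in loss
# (p - target)^2. In the record, only blocks 10+ would receive these
# updates; blocks 0-9 stay frozen.
target = 3.0
p, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):        # ~one pass over the test sequence
    g = 2.0 * (p - target)     # gradient of the stand-in loss
    p, m, v = adamw_step(p, g, m, v, t)
```

Freezing the early blocks limits how far one epoch on a single sequence can drift the model away from its pretrained solution.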
Other
other
Sliding-window logit-space delta optimization with L-BFGS warm-start and focal loss on the last 128 tokens per window
parameters: {"method":"L-BFGS25","history":20,"warm_start":true,"delta_clip":5,"logit_space":true,"focal_tokens":128}
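A sketch of the per-window logit-delta idea, using scipy's stock L-BFGS-B as a stand-in for the record's "L-BFGS25" routine (history size 20 maps to `maxcor`; the ±5 bound implements `delta_clip`; the window's last tokens play the focal-token role). The simplification here is a single vocab-wide delta with plain cross-entropy rather than a focal loss, and the warm start is just reusing the previous window's delta as `delta0`:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_delta(logits, targets, delta0, clip=5.0, history=20):
    """Find a logit-space offset minimizing cross-entropy on the
    window's focal tokens, via bounded L-BFGS with warm start."""
    T, V = logits.shape

    def nll(delta):
        z = logits + delta                          # broadcast over time
        z = z - z.max(axis=1, keepdims=True)        # stable softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(T), targets].mean()

    res = minimize(nll, delta0, method="L-BFGS-B",
                   bounds=[(-clip, clip)] * V,      # delta_clip
                   options={"maxcor": history})     # L-BFGS history
    return res.x, res.fun

rng = np.random.default_rng(0)
V, T = 16, 32                       # toy vocab size / focal-token count
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
delta0 = np.zeros(V)                # next window warm-starts from delta
delta, loss = optimize_delta(logits, targets, delta0)
```

Warm-starting across overlapping windows is what makes the repeated L-BFGS solves cheap: consecutive windows share most of their context, so the previous delta is already close to optimal.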
Regularization
logit softcap
parameters: {"delta_clip":5}
Architecture
BigramHash
Bigram hash embedding/vocabulary component
parameters: {"shape":"3072x112"}
GQA
Grouped query attention
parameters: {"heads":"8/4"}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions
parameters: {"dimensions":16}
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
SmearGate
SmearGate gating mechanism
parameters: null
U-Net skip connections
U-Net style skip connections across layers
parameters: null
XSA
XSA applied to all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
Compression
lzma
level: 9
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
Evaluation
sliding window eval
parameters: {"window_size":2048,"focal_tokens":128}
Novel Contributions
- Test-time training with AdamW on the test sequence before scoring
- Sliding-window logit-space delta optimization using L-BFGS with warm-start
- Using GPTQ with reduced damping (0.005) to improve int6 quantization quality
- Combining TTT, SLOT, and GPTQ into a single high-performing evaluation pipeline