PR #1328
closed
Record: Per-Sample SLOT + TTT + LR=0.024 + Stride=96 — val_bpb 0.63614 (3-seed mean)
by renqianluo
val_bpb
0.6361
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,991,012 bytes
Training Techniques
Other
Per-sample SLOT
Per-sample SLOT optimization with a dedicated hidden-state delta and logit bias for each input sequence
parameters: {"hidden_state_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params_per_sequence":1536,"steps":24}
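A minimal sketch of the per-sample SLOT idea: each sequence in the batch gets its own tiny parameter tensor (here only the logit bias of shape `[bsz, 1, 1024]`; the record additionally optimizes a hidden-state delta of shape `[bsz, 1, 512]`, for 1536 params per sequence). Plain gradient descent stands in for the AdamW optimizer used in the record, and the frozen model's logits are random placeholders.

```python
import numpy as np

bsz, T, V = 2, 8, 1024
rng = np.random.default_rng(0)
logits = rng.normal(size=(bsz, T, V))        # stand-in for frozen model outputs
targets = rng.integers(0, V, size=(bsz, T))  # next-token targets

bias = np.zeros((bsz, 1, V))                 # per-sequence logit bias
lr, steps = 0.024, 24                        # record's LR and step budget

def nll(logits, targets):
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -logp[np.arange(bsz)[:, None], np.arange(T)[None, :], targets].mean()

loss0 = nll(logits + bias, targets)
for _ in range(steps):
    z = logits + bias
    z = z - z.max(-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(-1, keepdims=True)
    onehot = np.zeros_like(p)
    onehot[np.arange(bsz)[:, None], np.arange(T)[None, :], targets] = 1.0
    # gradient of the mean NLL w.r.t. the bias shared across the sequence
    grad = (p - onehot).mean(axis=1, keepdims=True) / bsz
    bias -= lr * grad
loss1 = nll(logits + bias, targets)
```

Because the parameters are indexed by batch position, every validation sequence descends toward its own minimum independently.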
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.024,"lr_min":0.001,"steps":24}
Evaluation
sliding window eval
parameters: {"stride":96}
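A sketch of how stride=96 sliding-window evaluation reduces the window count: each window scores only the tokens not already covered by the previous window, so every token is scored exactly once, and a larger stride means fewer windows (and fewer per-window optimization passes). The window length of 512 here is an assumption, not from the record.

```python
def sliding_windows(n_tokens, window=512, stride=96):
    """Return (begin, end, n_scored) spans; window size is hypothetical."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tail
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(1000, window=512, stride=96)
```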
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":10}
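The freeze pattern above can be sketched as a simple name filter: with freeze_blocks=10, the first 10 transformer blocks are excluded from test-time training and only the later blocks receive AdamW updates during the single epoch. The total block count of 12 is an assumption for illustration.

```python
n_blocks, freeze_blocks = 12, 10   # n_blocks is assumed, not from the record

# Hypothetical parameter registry: one weight entry per transformer block.
params = {f"block{i}.weight": None for i in range(n_blocks)}
trainable = {name for name in params
             if int(name.split(".")[0].lstrip("block")) >= freeze_blocks}
```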
LR Schedule
cosine decay
parameters: {"start_lr":0.024,"end_lr":0.001}
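The schedule is standard cosine decay over the 24 SLOT steps, from 0.024 down to 0.001; a sketch:

```python
import math

def cosine_lr(step, total_steps=24, start_lr=0.024, end_lr=0.001):
    # Cosine decay: start_lr at step 0, end_lr at the final step.
    t = step / max(total_steps - 1, 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```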
Quantization
GPTQ
bits: 6
scope: all
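For context on what 6-bit quantization buys: GPTQ proper uses Hessian-aware, error-compensating rounding, but the weights still land on a uniform grid of 2^6 levels. This sketch shows only that grid with plain round-to-nearest, not the GPTQ algorithm itself.

```python
import numpy as np

w = np.random.default_rng(3).normal(size=256)   # stand-in weight vector
levels = 2 ** 6                                 # 6-bit grid: 64 levels
scale = (w.max() - w.min()) / (levels - 1)
q = np.round((w - w.min()) / scale)             # integer codes in [0, 63]
deq = q * scale + w.min()                       # dequantized weights
```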
Compression
lzma
level: 9
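The artifact is stored LZMA-compressed at preset level 9, which the Python standard library exposes directly:

```python
import lzma

# Round-trip a compressible blob through LZMA at the record's preset level.
blob = bytes(range(256)) * 64
packed = lzma.compress(blob, preset=9)
restored = lzma.decompress(packed)
```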
Architecture
LeakyReLU
Uses LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
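One plausible reading of "LeakyReLU squared" is LeakyReLU followed by squaring (the leaky analogue of the common ReLU^2 activation); note that squaring makes negative inputs contribute positively:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU with the record's slope of 0.5, then elementwise square.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

x = np.array([-2.0, 0.0, 3.0])   # leaky: [-1, 0, 3] -> squared: [1, 0, 9]
out = leaky_relu_sq(x)
```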
SmearGate
Includes SmearGate in the architecture
parameters: null
U-Net skip connections
Uses U-Net style skip connections
parameters: null
Partial RoPE
Applies RoPE to only part of the embedding dimensions
parameters: {"dimensions":16}
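A sketch of partial RoPE under a common layout assumption: the rotation is applied to the first 16 dimensions of each head (split into 8 cos/sin pairs) and the remaining dimensions pass through unchanged. The exact split and pairing used in this record are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` dims; leave the rest untouched.
    T, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]       # [T, half]
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).normal(size=(4, 64))
y = partial_rope(x)
```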
BigramHash
Uses a bigram hash table for token interactions
parameters: {"vocab":3072,"dim":112}
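A hypothetical sketch of the bigram hash table: each (previous token, current token) pair is hashed into one of 3072 buckets and looked up as a 112-dim feature. The hash function and the padding id for the first position are illustrative assumptions.

```python
import numpy as np

vocab_buckets, dim = 3072, 112
table = np.random.default_rng(2).normal(size=(vocab_buckets, dim))

def bigram_features(tokens):
    prev = np.concatenate([[0], tokens[:-1]])          # assumed pad id 0
    idx = (prev * 1000003 + tokens) % vocab_buckets    # assumed hash mix
    return table[idx]

feats = bigram_features(np.array([1, 2, 1, 2]))
```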
XSA
Adds extra self-attention across all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
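The two averaging schemes combined here differ only in their weighting: EMA keeps an exponential moving average of checkpoints, SWA an equal-weight running mean. A sketch with an assumed decay of 0.99:

```python
import numpy as np

decay = 0.99                       # assumed EMA decay, not from the record
ema, swa = np.zeros(3), np.zeros(3)
checkpoints = [np.full(3, v) for v in (1.0, 2.0, 3.0)]
for step, w in enumerate(checkpoints, start=1):
    ema = decay * ema + (1 - decay) * w    # exponential moving average
    swa += (w - swa) / step                # equal-weight running mean
```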
Novel Contributions
- Per-sample SLOT optimization with separate parameters for each validation sequence
- Higher SLOT learning rate (0.024) to reach a better per-sequence minimum within the step budget
- Stride=96 sliding window evaluation to reduce the number of evaluation windows, leaving more of the compute budget for optimization steps per window
- Test-time training with AdamW and freezing most transformer blocks before SLOT