PR #1329

Status: open

Record: Per-Sample SLOT + TTT + LR=0.024 + Stride=96 — val_bpb 0.63614 (3-seed mean)

by renqianluo
val_bpb: 0.6361
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,991,012 bytes

Training Techniques

Other
Per-sample SLOT
Per-sample SLOT optimization with a dedicated hidden-state delta and logit bias for each input sequence
parameters: {"hidden_state_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params_per_sequence":1536,"steps":24}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.024,"learning_rate_min":0.001,"steps":24}
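A 0.024 to 0.001 decay over 24 steps is described elsewhere in the PR as a cosine schedule; the exact endpoint handling is not stated, so the sketch below assumes one common formulation that hits both endpoints exactly:

```python
import math

def cosine_lr(step, steps=24, lr_max=0.024, lr_min=0.001):
    """Cosine decay from lr_max at step 0 to lr_min at step steps-1.
    Endpoint convention is an assumption, not stated in the PR."""
    t = step / (steps - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```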
Evaluation
sliding window eval
parameters: {"stride":96}
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":10}
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: 9
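LZMA at the maximum preset maps directly onto Python's stdlib:

```python
import lzma

def compress_artifact(data: bytes) -> bytes:
    # Preset 9 is the slowest/strongest setting, as used for the artifact.
    return lzma.compress(data, preset=9)
```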
Architecture
LeakyReLU
Uses squared LeakyReLU, i.e. LeakyReLU(x, 0.5)^2, as the MLP activation
parameters: {"negative_slope":0.5}
SmearGate
Uses SmearGate in the model architecture
parameters: null
U-Net skip connections
Uses U-Net style skip connections
parameters: null
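U-Net style skips over a block stack typically add early-block outputs into later-block inputs in reverse order. A minimal sketch of one common wiring; the record's exact pairing of blocks is not stated:

```python
def unet_forward(blocks, x):
    """Run a stack of blocks; outputs of the first half are saved and
    added (last-saved first) to the inputs of the second half."""
    n = len(blocks)
    skips = []
    for i, f in enumerate(blocks):
        if i < n // 2:
            x = f(x)
            skips.append(x)          # save encoder-side activation
        else:
            if skips:
                x = x + skips.pop()  # skip connection, reversed order
            x = f(x)
    return x
```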
Partial RoPE
Applies rotary position embeddings to only part of the dimensions
parameters: {"dimensions":16}
BigramHash
Uses a bigram hash table for token representations
parameters: {"vocab":3072,"dim":112}
XSA
Adds extra self-attention across all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
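EMA and SWA are both running averages over weights, differing in how checkpoints are weighted: exponentially decayed versus uniform. The single-parameter update rules (decay value shown is illustrative; the PR gives no hyperparameters):

```python
def ema_update(avg, new, decay=0.99):
    # Exponential moving average: recent weights dominate.
    return decay * avg + (1 - decay) * new

def swa_update(avg, new, n_seen):
    # Stochastic weight averaging: equal weight for every checkpoint;
    # n_seen is how many checkpoints avg already covers.
    return avg + (new - avg) / (n_seen + 1)
```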

Novel Contributions

  • Per-sample SLOT optimization with separate parameters for each validation sequence
  • Higher SLOT learning rate of 0.024, decayed to 0.001 over a 24-step cosine schedule
  • Stride=96 evaluation to reduce windows and fit more optimization steps within budget
  • Test-time training with AdamW and freezing 10 of 11 transformer blocks
  • 3-seed mean validation BPB of 0.63614