PR #1329

Status: open

Record: Per-Sample SLOT + TTT + LR=0.024 + Stride=96 — val_bpb 0.63614 (3-seed mean)

by renqianluo
val_bpb: 0.6361
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,991,012 bytes

Training Techniques

Other
Per-sample SLOT
Per-sample SLOT optimization with a dedicated hidden-state delta and logit bias for each input sequence
parameters: {"hidden_state_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params_per_sequence":1536,"steps":24}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.024,"learning_rate_min":0.001,"steps":24}
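A 0.024 to 0.001 decay over 24 steps is described elsewhere in the PR as a cosine schedule; the exact endpoint handling is not stated, so the sketch below assumes one common formulation that hits both endpoints exactly:

```python
import math

def cosine_lr(step, steps=24, lr_max=0.024, lr_min=0.001):
    """Cosine decay from lr_max at step 0 to lr_min at step steps-1.
    Endpoint convention is an assumption, not stated in the PR."""
    t = step / (steps - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```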
Evaluation
sliding window eval
parameters: {"stride":96}
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":10}
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: 9
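LZMA at the maximum preset maps directly onto Python's stdlib:

```python
import lzma

def compress_artifact(data: bytes) -> bytes:
    # Preset 9 is the slowest/strongest setting, as used for the artifact.
    return lzma.compress(data, preset=9)
```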
Architecture
LeakyReLU
Uses squared LeakyReLU, i.e. LeakyReLU(x, 0.5)^2, as the MLP activation
parameters: {"negative_slope":0.5}
SmearGate
Uses SmearGate in the model architecture
parameters: null
U-Net skip connections
Uses U-Net style skip connections
parameters: null
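U-Net style skips over a block stack typically add early-block outputs into later-block inputs in reverse order. A minimal sketch of one common wiring; the record's exact pairing of blocks is not stated:

```python
def unet_forward(blocks, x):
    """Run a stack of blocks; outputs of the first half are saved and
    added (last-saved first) to the inputs of the second half."""
    n = len(blocks)
    skips = []
    for i, f in enumerate(blocks):
        if i < n // 2:
            x = f(x)
            skips.append(x)          # save encoder-side activation
        else:
            if skips:
                x = x + skips.pop()  # skip connection, reversed order
            x = f(x)
    return x
```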
Partial RoPE
Applies rotary position embeddings to only part of the dimensions
parameters: {"dimensions":16}
BigramHash
Uses a bigram hash table for token representations
parameters: {"vocab":3072,"dim":112}
XSA
Adds extra self-attention across all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
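EMA and SWA are both running averages over weights, differing in how checkpoints are weighted: exponentially decayed versus uniform. The single-parameter update rules (decay value shown is illustrative; the PR gives no hyperparameters):

```python
def ema_update(avg, new, decay=0.99):
    # Exponential moving average: recent weights dominate.
    return decay * avg + (1 - decay) * new

def swa_update(avg, new, n_seen):
    # Stochastic weight averaging: equal weight for every checkpoint;
    # n_seen is how many checkpoints avg already covers.
    return avg + (new - avg) / (n_seen + 1)
```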

Novel Contributions

  • Per-sample SLOT optimization with separate parameters for each validation sequence
  • Higher SLOT learning rate of 0.024, decayed to 0.001 over a 24-step cosine schedule
  • Stride=96 evaluation to reduce windows and fit more optimization steps within budget
  • Test-time training with AdamW and freezing 10 of 11 transformer blocks
  • 3-seed mean validation BPB of 0.63614