PR #1328
closed
Record: Per-Sample SLOT + TTT + LR=0.024 + Stride=96 — val_bpb 0.63614 (3-seed mean)
by renqianluo
val_bpb
0.6361
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,991,012 bytes
Training Techniques
Other
Per-sample SLOT
Per-sample SLOT optimization with a dedicated hidden-state delta and logit bias for each input sequence
parameters: {"hidden_state_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params_per_sequence":1536,"steps":24}
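A minimal sketch of the per-sample SLOT idea: each sequence in the batch gets its own tiny parameter tensor (here only the logit bias of shape `[bsz, 1, 1024]`; the record additionally optimizes a hidden-state delta of shape `[bsz, 1, 512]`, for 1536 params per sequence). Plain gradient descent stands in for the AdamW optimizer used in the record, and the frozen model's logits are random placeholders.

```python
import numpy as np

bsz, T, V = 2, 8, 1024
rng = np.random.default_rng(0)
logits = rng.normal(size=(bsz, T, V))        # stand-in for frozen model outputs
targets = rng.integers(0, V, size=(bsz, T))  # next-token targets

bias = np.zeros((bsz, 1, V))                 # per-sequence logit bias
lr, steps = 0.024, 24                        # record's LR and step budget

def nll(logits, targets):
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -logp[np.arange(bsz)[:, None], np.arange(T)[None, :], targets].mean()

loss0 = nll(logits + bias, targets)
for _ in range(steps):
    z = logits + bias
    z = z - z.max(-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(-1, keepdims=True)
    onehot = np.zeros_like(p)
    onehot[np.arange(bsz)[:, None], np.arange(T)[None, :], targets] = 1.0
    # gradient of the mean NLL w.r.t. the bias shared across the sequence
    grad = (p - onehot).mean(axis=1, keepdims=True) / bsz
    bias -= lr * grad
loss1 = nll(logits + bias, targets)
```

Because the parameters are indexed by batch position, every validation sequence descends toward its own minimum independently.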
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.024,"lr_min":0.001,"steps":24}
Evaluation
sliding window eval
parameters: {"stride":96}
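A sketch of how stride=96 sliding-window evaluation reduces the window count: each window scores only the tokens not already covered by the previous window, so every token is scored exactly once, and a larger stride means fewer windows (and fewer per-window optimization passes). The window length of 512 here is an assumption, not from the record.

```python
def sliding_windows(n_tokens, window=512, stride=96):
    """Return (begin, end, n_scored) spans; window size is hypothetical."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tail
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(1000, window=512, stride=96)
```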
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":10}
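The freeze pattern above can be sketched as a simple name filter: with freeze_blocks=10, the first 10 transformer blocks are excluded from test-time training and only the later blocks receive AdamW updates during the single epoch. The total block count of 12 is an assumption for illustration.

```python
n_blocks, freeze_blocks = 12, 10   # n_blocks is assumed, not from the record

# Hypothetical parameter registry: one weight entry per transformer block.
params = {f"block{i}.weight": None for i in range(n_blocks)}
trainable = {name for name in params
             if int(name.split(".")[0].lstrip("block")) >= freeze_blocks}
```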
LR Schedule
cosine decay
parameters: {"start_lr":0.024,"end_lr":0.001}
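The schedule is standard cosine decay over the 24 SLOT steps, from 0.024 down to 0.001; a sketch:

```python
import math

def cosine_lr(step, total_steps=24, start_lr=0.024, end_lr=0.001):
    # Cosine decay: start_lr at step 0, end_lr at the final step.
    t = step / max(total_steps - 1, 1)
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```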
Quantization
GPTQ
bits: 6
scope: all
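For context on what 6-bit quantization buys: GPTQ proper uses Hessian-aware, error-compensating rounding, but the weights still land on a uniform grid of 2^6 levels. This sketch shows only that grid with plain round-to-nearest, not the GPTQ algorithm itself.

```python
import numpy as np

w = np.random.default_rng(3).normal(size=256)   # stand-in weight vector
levels = 2 ** 6                                 # 6-bit grid: 64 levels
scale = (w.max() - w.min()) / (levels - 1)
q = np.round((w - w.min()) / scale)             # integer codes in [0, 63]
deq = q * scale + w.min()                       # dequantized weights
```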
Compression
lzma
level: 9
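The artifact is stored LZMA-compressed at preset level 9, which the Python standard library exposes directly:

```python
import lzma

# Round-trip a compressible blob through LZMA at the record's preset level.
blob = bytes(range(256)) * 64
packed = lzma.compress(blob, preset=9)
restored = lzma.decompress(packed)
```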
Architecture
LeakyReLU
Uses LeakyReLU squared MLP activation
parameters: {"negative_slope":0.5}
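One plausible reading of "LeakyReLU squared" is LeakyReLU followed by squaring (the leaky analogue of the common ReLU^2 activation); note that squaring makes negative inputs contribute positively:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU with the record's slope of 0.5, then elementwise square.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

x = np.array([-2.0, 0.0, 3.0])   # leaky: [-1, 0, 3] -> squared: [1, 0, 9]
out = leaky_relu_sq(x)
```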
SmearGate
Includes SmearGate in the architecture
parameters: null
U-Net skip connections
Uses U-Net style skip connections
parameters: null
Partial RoPE
Applies RoPE to only part of the embedding dimensions
parameters: {"dimensions":16}
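A sketch of partial RoPE under a common layout assumption: the rotation is applied to the first 16 dimensions of each head (split into 8 cos/sin pairs) and the remaining dimensions pass through unchanged. The exact split and pairing used in this record are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` dims; leave the rest untouched.
    T, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]       # [T, half]
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(1).normal(size=(4, 64))
y = partial_rope(x)
```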
BigramHash
Uses a bigram hash table for token interactions
parameters: {"vocab":3072,"dim":112}
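A hypothetical sketch of the bigram hash table: each (previous token, current token) pair is hashed into one of 3072 buckets and looked up as a 112-dim feature. The hash function and the padding id for the first position are illustrative assumptions.

```python
import numpy as np

vocab_buckets, dim = 3072, 112
table = np.random.default_rng(2).normal(size=(vocab_buckets, dim))

def bigram_features(tokens):
    prev = np.concatenate([[0], tokens[:-1]])          # assumed pad id 0
    idx = (prev * 1000003 + tokens) % vocab_buckets    # assumed hash mix
    return table[idx]

feats = bigram_features(np.array([1, 2, 1, 2]))
```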
XSA
Adds extra self-attention across all layers
parameters: {"layers":11}
Weight Averaging
EMA + SWA
parameters: null
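The two averaging schemes combined here differ only in their weighting: EMA keeps an exponential moving average of checkpoints, SWA an equal-weight running mean. A sketch with an assumed decay of 0.99:

```python
import numpy as np

decay = 0.99                       # assumed EMA decay, not from the record
ema, swa = np.zeros(3), np.zeros(3)
checkpoints = [np.full(3, v) for v in (1.0, 2.0, 3.0)]
for step, w in enumerate(checkpoints, start=1):
    ema = decay * ema + (1 - decay) * w    # exponential moving average
    swa += (w - swa) / step                # equal-weight running mean
```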
Novel Contributions
- Per-sample SLOT optimization with separate parameters for each validation sequence
- Higher SLOT learning rate (0.024) to reach a better per-sequence minimum within the step budget
- Stride=96 sliding window evaluation to reduce the number of evaluation windows, leaving more of the compute budget for optimization steps per window
- Test-time training with AdamW and freezing most transformer blocks before SLOT