PR #1306

open

Record: Causal SLOT + Pre-quant TTT — val_bpb 1.0846 (3-seed mean)

by resouer
val_bpb: 1.0846
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.95 MB

Training Techniques

Evaluation
sliding window eval
parameters: null
Other
other
Causal SLOT: eval-time delta optimization restricted to context-only positions to preserve causal dependence
parameters: {"steps":8,"learning_rate":0.005}
Test-Time Training
full TTT
parameters: {"epochs":6,"learning_rate":0.0005,"freeze_first_blocks":2,"batch_size":32}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"ttt_learning_rate":0.0005,"ttt_epochs":6}
LR Schedule
cosine decay
parameters: null
Architecture
BigramHash
BigramHash 3072 used in the base merged SOTA configuration referenced by the submission
parameters: {"size":3072}
XSA
XSA-all used in the base merged SOTA configuration referenced by the submission
parameters: null
Weight Averaging
EMA
parameters: null
Quantization
GPTQ
bits: null
scope: all
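
To make the "context-only positions" restriction in the Causal SLOT entry concrete, here is a minimal sketch of how a sliding-window evaluation might split each window into positions that already have a final score (context) and positions being scored for the first time. This is an illustration under assumptions, not the submission's code: the function name `window_masks` and the `window` and `stride` values are hypothetical, and the summary does not state the actual window geometry. Under this reading, the delta would be fit using a loss over the context positions only, so no target that is about to be scored influences its own prediction.

```python
def window_masks(n_tokens, window, stride):
    """Split each sliding window into context and newly scored positions.

    Hypothetical helper: returns a list of (start, context, scored) tuples,
    where `context` holds positions scored by an earlier window and `scored`
    holds positions that receive their final score in this window.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    out = []
    scored_upto = 0  # tokens in [0, scored_upto) already have a final score
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        # Context: the part of this window that earlier windows already scored.
        context = list(range(start, min(scored_upto, end)))
        # Scored: the part of this window that no earlier window has scored.
        scored = list(range(max(start, scored_upto), end))
        out.append((start, context, scored))
        scored_upto = end
        start += stride
    return out


# Assumed toy sizes: 10 tokens, window of 6, stride of 2.
masks = window_masks(10, window=6, stride=2)
for start, context, scored in masks:
    # An eval-time delta would be optimized on `context` only (assumption),
    # then used to score `scored`.
    print(start, context, scored)
```

In this toy run the first window has no context positions, so there is nothing to optimize against and it would be scored with a zero delta; later windows optimize on the overlap and score only the new tail.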

Novel Contributions

  • Causal SLOT with provably causal eval-time delta optimization using only context-scored positions
  • Pre-quant AdamW test-time training before GPTQ quantization
  • Coprime-stride multi-shard data loader
  • Combined 3-seed mean val_bpb of 1.0846
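
The summary does not describe how the coprime-stride multi-shard data loader works. One plausible reading, sketched below as an assumption, is that shards are visited by stepping through shard indices with a stride that is coprime to the shard count, which guarantees every shard is visited exactly once per pass while avoiding a purely sequential order. The function name `coprime_stride_order` and its parameters are hypothetical.

```python
from math import gcd


def coprime_stride_order(n_shards, stride, offset=0):
    """Return a shard visiting order using a stride coprime to n_shards.

    Hypothetical helper: because gcd(stride, n_shards) == 1, the sequence
    (offset + i * stride) mod n_shards hits every shard index exactly once.
    """
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    if gcd(stride, n_shards) != 1:
        raise ValueError("stride must be coprime to n_shards")
    return [(offset + i * stride) % n_shards for i in range(n_shards)]


# Assumed toy setup: 8 shards, stride 3.
order = coprime_stride_order(8, 3)
print(order)  # [0, 3, 6, 1, 4, 7, 2, 5]
```

A different `offset` per seed or per worker would give each one a distinct starting shard over the same full-coverage cycle; whether the submission does this is not stated.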