PR #152

closed

Add TTT (Test-Time Training) submission: 1.1767 BPB

by timowhite88
val_bpb: 1.1744
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,328,877 bytes

Training Techniques

Architecture — tied embeddings
Uses tied input/output embeddings in a 9-layer Transformer. (parameters: not recorded)
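Weight tying means one matrix serves as both the input embedding table and the output (unembedding) projection. A minimal numpy sketch, with illustrative sizes not taken from the submission:

```python
import numpy as np

vocab, d_model = 1000, 64  # illustrative sizes, not from the submission
rng = np.random.default_rng(0)

# One shared matrix is used for both the input lookup and the
# output projection, so the embedding parameters are stored once.
E = rng.normal(0, 0.02, size=(vocab, d_model))

token_ids = np.array([3, 17, 42])
h = E[token_ids]      # input embedding lookup: (3, d_model)
logits = h @ E.T      # tied output projection reuses E: (3, vocab)

# An untied model would need a second (vocab, d_model) output matrix:
tied_params = vocab * d_model
untied_params = 2 * vocab * d_model
```

Tying saves `vocab * d_model` parameters, which matters for a submission judged partly on artifact size.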
Optimizer — Muon
Combined with Adam during pretraining. (weight_decay and momentum: not recorded)
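The card only records that Muon was combined with Adam during pretraining. For context, Muon's core update is momentum followed by approximate orthogonalization of the 2-D gradient buffer via a quintic Newton-Schulz iteration; the sketch below uses the coefficients from the public reference implementation, with toy sizes and a hypothetical `muon_step` helper not taken from this submission:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by Muon (coefficients from the reference impl)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization
    transpose = G.shape[0] > G.shape[1]  # iterate on the wide orientation
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update for a 2-D weight: momentum, then orthogonalize."""
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 48))
buf = np.zeros_like(W)
W, buf = muon_step(W, rng.normal(size=W.shape), buf)
```

In practice Muon is applied to the 2-D weight matrices, while scalar, bias, and embedding parameters are typically handed to Adam, which matches the "combined with Adam" note above.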
Quantization — int8
bits: 8; scope: full model artifact
Compression — zlib
level: not recorded
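A minimal sketch of the int8-plus-zlib artifact pipeline, assuming per-tensor symmetric quantization and a pickle container; the submission's actual serialization format is not recorded on the card:

```python
import numpy as np
import zlib
import pickle

def compress_weights(weights):
    """Per-tensor symmetric int8 quantization followed by zlib.

    Each tensor is scaled so its max magnitude maps to 127; the scale
    and shape are stored alongside the int8 payload so the artifact
    can be restored later. Assumes no tensor is all zeros."""
    blobs = {}
    for name, w in weights.items():
        scale = float(np.abs(w).max()) / 127.0
        q = np.round(w / scale).astype(np.int8)
        blobs[name] = (scale, w.shape, q.tobytes())
    return zlib.compress(pickle.dumps(blobs), level=9)

rng = np.random.default_rng(0)
weights = {"embed": rng.normal(0, 0.02, (256, 64)).astype(np.float32)}
artifact = compress_weights(weights)
raw_bytes = weights["embed"].nbytes  # fp32 size before compression
```

Int8 alone cuts fp32 storage by 4x; zlib then squeezes whatever redundancy remains in the quantized bytes, which is how the 15 MB artifact fits under the size cap.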
Test-Time Training — full TTT
learning_rate: 0.002, epochs: 2, momentum: 0.9, batch_size: 32
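The hyperparameters above describe full-model SGD with momentum run on the evaluation stream. A toy sketch of that loop, with a linear least-squares model standing in for the Transformer (the model, data, and `ttt_adapt` helper are illustrative, not from the submission):

```python
import numpy as np

def ttt_adapt(params, batches, lr=0.002, momentum=0.9, epochs=2):
    """Full-model SGD-with-momentum adaptation at test time.

    `params` is a dict of numpy arrays updated in full (no LoRA or
    adapter subset); `batches` yields (X, y) drawn from the eval
    stream. Hyperparameters mirror the card above."""
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    for _ in range(epochs):
        for X, y in batches:
            pred = X @ params["W"] + params["b"]
            err = pred - y
            grads = {"W": X.T @ err / len(X), "b": err.mean(0)}
            for k in params:
                vel[k] = momentum * vel[k] - lr * grads[k]
                params[k] = params[k] + vel[k]
    return params

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 1))
X = rng.normal(size=(320, 8))
y = X @ W_true
batches = [(X[i:i + 32], y[i:i + 32]) for i in range(0, 320, 32)]
params = {"W": np.zeros((8, 1)), "b": np.zeros(1)}
loss_before = float(((X @ params["W"] + params["b"] - y) ** 2).mean())
params = ttt_adapt(params, batches)
loss_after = float(((X @ params["W"] + params["b"] - y) ** 2).mean())
```

The key design point is that every parameter tensor is touched, rather than restricting adaptation to low-rank adapters as LoRA-style TTT does.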
Evaluation — sliding window eval
stride: 64, seq_len: 1024
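With stride 64 and a 1024-token window, each evaluation step scores only 64 new tokens while the preceding up-to-960 tokens serve as context. A sketch of that scoring loop, assuming a hypothetical `nll_token` callback supplies per-token negative log-likelihoods in nats:

```python
import numpy as np

def sliding_window_bpb(nll_token, tokens, n_bytes, seq_len=1024, stride=64):
    """Bits-per-byte with a sliding evaluation window.

    The stream is scored `stride` tokens at a time; each chunk sees up
    to `seq_len - stride` tokens of left context, so every token after
    the first window is scored with near-full context.
    `nll_token(window, j)` returns the model's negative log-likelihood
    (nats) of window[j] given window[:j]."""
    total_nll = 0.0
    for i in range(0, len(tokens), stride):
        ctx_start = max(0, i + stride - seq_len)
        window = tokens[ctx_start:i + stride]
        new = min(stride, len(tokens) - i)  # tokens scored this step
        for j in range(len(window) - new, len(window)):
            total_nll += nll_token(window, j)
    return total_nll / (np.log(2) * n_bytes)  # nats -> bits, per byte

# Dummy uniform model over a 256-symbol vocabulary: ln(256) nats, i.e.
# 8 bits per token; 200 tokens over 400 bytes gives exactly 4.0 BPB.
uniform_nll = lambda window, j: float(np.log(256.0))
bpb = sliding_window_bpb(uniform_nll, list(range(200)), n_bytes=400)
```

A small stride trades evaluation compute for accuracy: more forward passes, but almost no token is scored with a short context.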
Sequence Length
train_length: not recorded; eval_length: 1024
Other
Decompresses the int8+zlib artifact back to full precision before test-time adaptation.
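Restoring the artifact inverts the two stages in order: zlib decompression, then dequantization by the stored per-tensor scales, yielding float32 weights that the TTT loop can update. A sketch assuming the same illustrative per-tensor int8 format (scale, shape, raw bytes in a pickled dict), which is not necessarily the submission's actual container:

```python
import numpy as np
import zlib
import pickle

def pack(weights):
    """Illustrative int8+zlib artifact: (scale, shape, int8 bytes) per tensor."""
    blobs = {}
    for name, w in weights.items():
        scale = float(np.abs(w).max()) / 127.0
        blobs[name] = (scale, w.shape,
                       np.round(w / scale).astype(np.int8).tobytes())
    return zlib.compress(pickle.dumps(blobs), level=9)

def restore(artifact):
    """Decompress and dequantize back to float32 before TTT runs."""
    blobs = pickle.loads(zlib.decompress(artifact))
    return {name: np.frombuffer(raw, dtype=np.int8)
                    .reshape(shape).astype(np.float32) * scale
            for name, (scale, shape, raw) in blobs.items()}

rng = np.random.default_rng(0)
weights = {"w": rng.normal(0, 0.02, (128, 64)).astype(np.float32)}
restored = restore(pack(weights))
max_err = float(np.abs(restored["w"] - weights["w"]).max())
```

The round trip is lossy only through quantization (error at most half a quantization step per weight); TTT then adapts the full-precision copy, so the int8 artifact is purely a transport format.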

Novel Contributions

  • Test-time training during evaluation to adapt the full model on validation data
  • Full-model SGD adaptation instead of LoRA-based TTT
  • Use of the evaluation budget as additional optimization time for improved BPB
  • Int8 plus zlib artifact compression to fit within the submission size cap