PR #1547

closed

Non record: Single H100 10 min 1.24 BPB

by adityasasidhar
val_bpb
1.1928
Architecture
Transformer
Optimizer
Adam
Artifact Size

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
Evaluation
stride-based eval
parameters: {"chunk_size":256,"eval_seq_len":1024,"batch_size":64}
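The stride-based evaluation above can be sketched as a window schedule: successive windows of up to `eval_seq_len` tokens advance by `chunk_size`, so each later window scores only its final `chunk_size` tokens while the preceding tokens serve as context. This is a minimal sketch of that schedule (the function name and return shape are illustrative, not from the PR):

```python
def strided_windows(doc_len, eval_seq_len=1024, stride=256):
    """Return (begin, end, n_scored) spans for strided overlapping evaluation.

    Windows cover up to eval_seq_len tokens and advance by `stride`;
    only tokens past the previous window's end are newly scored, so the
    per-window scored counts sum to doc_len with no double counting.
    """
    windows = []
    prev_end = 0
    for begin in range(0, doc_len, stride):
        end = min(begin + eval_seq_len, doc_len)
        n_scored = end - prev_end  # tokens newly scored in this window
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == doc_len:
            break
    return windows
```

With the listed parameters (stride 256, window 1024), a 2048-token document yields one full window followed by four overlapping windows that each score 256 new tokens.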
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"betas":[0.9,0.95]}
Architecture
weight tying
LoRA adapters target lm_head and attention projections across transformer blocks; base architecture remains a standard Transformer.
parameters: null
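A LoRA adapter on a frozen linear layer (such as lm_head or an attention projection) adds a trainable low-rank update alongside the frozen base weight. This is a minimal NumPy sketch assuming the standard LoRA form y = xWᵀ + (α/r)·x(BA)ᵀ; the rank matches the `rank: 8` parameter listed above, while `alpha` and the init scheme here are generic LoRA conventions, not values taken from the PR:

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable LoRA adapter.

    At test-time training only A and B are updated; W stays frozen.
    B is initialized to zero, so the adapter starts as an identity
    perturbation (output equals the base layer's output).
    """
    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_f, in_f = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0.0, 0.02, (rank, in_f))  # trainable down-projection
        self.B = np.zeros((out_f, rank))              # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.W.T
        lora = (x @ self.A.T) @ self.B.T * self.scale
        return base + lora
```

Because only the rank-8 factors are trained, a per-document adaptation step touches a tiny fraction of the parameters the full layer would require.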

Novel Contributions

  • Per-document LoRA test-time training at evaluation
  • Document-isolated evaluation with no leakage across validation sequences
  • Strided overlapping chunk evaluation
  • Batching documents by length for faster evaluation
  • Using LoRA adapters instead of full fine-tuning to make test-time training significantly faster
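The length-batching contribution above amounts to sorting documents by length before forming evaluation batches, so each batch of 64 pads to roughly the same length. A minimal sketch (the function name is hypothetical; only the grouping logic is from the PR description):

```python
def batch_by_length(doc_lens, batch_size=64):
    """Group document indices into batches of similar length.

    Sorting by length first means each batch pads only to its own
    longest member, cutting wasted compute versus arbitrary batching.
    """
    order = sorted(range(len(doc_lens)), key=lambda i: doc_lens[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

The last batch may be smaller than `batch_size` when the document count is not a multiple of it.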