val_bpb: 1.1928
Architecture: Transformer
Optimizer: Adam
Artifact Size: —
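The val_bpb figure above is bits per byte on the validation text. Below is a minimal sketch of the conversion, assuming the evaluation sums token-level negative log-likelihood in nats and divides by the UTF-8 byte count of the scored text; the numbers in the example call are purely illustrative, not this run's statistics.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte of UTF-8 text."""
    return (total_nll_nats / math.log(2)) / total_bytes

# Hypothetical totals, for illustration only.
print(bits_per_byte(total_nll_nats=1.65e6, total_bytes=2_000_000))  # ~1.19
```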
Training Techniques
Test-Time Training: LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
Evaluation
stride-based eval
parameters: {"chunk_size":256,"eval_seq_len":1024,"batch_size":64}
Sequence Length
train_length: null
eval_length: 1024
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"betas":[0.9,0.95]}
Architecture
weight tying
LoRA adapters target the lm_head and the attention projections across all transformer blocks; the base architecture remains a standard Transformer.
parameters: null
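A from-scratch sketch of the adapter placement described above, not the report's implementation: a frozen linear layer plus a trainable low-rank update, attached by attribute name to the attention projections and the lm_head. The module names (q_proj, k_proj, v_proj, o_proj, lm_head, tok_emb) and lora_alpha=16 are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable rank-r update: y = W x + (alpha/r) * B (A x)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def add_lora(model: nn.Module,
             targets=("q_proj", "k_proj", "v_proj", "o_proj", "lm_head")) -> nn.Module:
    """Wrap every nn.Linear whose attribute name is in `targets` with a LoRA adapter."""
    to_wrap = []
    for _, module in model.named_modules():
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear) and child_name in targets:
                to_wrap.append((module, child_name, child))
    for parent, child_name, child in to_wrap:
        setattr(parent, child_name, LoRALinear(child))
    return model

# Weight tying as listed above: the output head shares the embedding matrix, e.g.
#   model.lm_head.weight = model.tok_emb.weight      (illustrative attribute names)
# With LoRA on lm_head the tied weight stays frozen and only the low-rank delta adapts.
```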
Novel Contributions
- Per-document LoRA test-time training at evaluation time
- Document-isolated evaluation with no leakage across validation sequences
- Strided overlapping chunk evaluation
- Batching documents by length for faster evaluation (see the sketch after this list)
- Using LoRA adapters to make test-time training significantly faster
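The length-based batching mentioned above cuts padding waste when many independent documents are evaluated together. A minimal sketch, assuming tokenized documents are sorted by length and grouped into fixed-size batches padded to the longest member; the pad id and default batch size are illustrative (the report's eval batch_size is 64). Document isolation is preserved because each row is a separate document with its own attention mask.

```python
import torch

def length_bucketed_batches(docs: list[torch.Tensor], batch_size: int = 64, pad_id: int = 0):
    """Yield (padded_batch, attention_mask) with documents of similar length grouped together.

    Each doc is a 1-D LongTensor of token ids. Sorting by length before batching keeps
    padding, and therefore wasted compute, to a minimum.
    """
    order = sorted(range(len(docs)), key=lambda i: docs[i].numel())
    for start in range(0, len(order), batch_size):
        group = [docs[i] for i in order[start : start + batch_size]]
        width = max(d.numel() for d in group)
        batch = torch.full((len(group), width), pad_id, dtype=torch.long)
        mask = torch.zeros((len(group), width), dtype=torch.long)
        for row, d in enumerate(group):
            batch[row, : d.numel()] = d
            mask[row, : d.numel()] = 1
        yield batch, mask
```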