PR #77

RECORD closed

[record bpb=1.195] sliding window + LoRA TTT

by samacqua
val_bpb
1.1950
Architecture
Transformer
Optimizer
Adam
Artifact Size

Training Techniques

Evaluation
sliding window eval
parameters: {"chunk_size":256,"eval_seq_len":1024}
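
One plausible reading of these parameters, sketched below, is that each document is scored in chunks of 256 tokens and every chunk sees up to 1024 tokens of context ending at the chunk's last token. The function name `sliding_windows` and the exact indexing are assumptions made for illustration; the PR's own scheduling may differ.

```python
def sliding_windows(n_tokens, chunk_size=256, eval_seq_len=1024):
    """Yield (window_start, score_start, score_end) index triples.

    The model would be run on tokens[window_start:score_end], and only the
    positions in [score_start, score_end) contribute to the loss, so every
    token is scored exactly once.
    """
    if chunk_size <= 0 or eval_seq_len < chunk_size:
        raise ValueError("need 0 < chunk_size <= eval_seq_len")
    for score_start in range(0, n_tokens, chunk_size):
        score_end = min(score_start + chunk_size, n_tokens)
        # Keep as much left context as fits in the evaluation window.
        window_start = max(0, score_end - eval_seq_len)
        yield window_start, score_start, score_end


if __name__ == "__main__":
    for triple in sliding_windows(1500):
        print(triple)
```

Under this reading, the last chunk of a 1,500-token document scores tokens 1280 to 1499 with context starting at token 476, so later chunks are predicted with a nearly full 1024-token window rather than a cold start.
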
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"betas":[0.9,0.95]}
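
The sketch below illustrates the two ingredients these parameters suggest: a rank-8 low-rank adapter added to a frozen weight, and an Adam update with the listed learning rate and betas. It is pure Python for readability, not the PR's code. The class `LoRALinear`, the function `adam_step`, the 0.02 init scale, and the absence of an alpha/rank scaling factor are all assumptions, as is the reading that `learning_rate` and `betas` are Adam hyperparameters.

```python
import math
import random


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical)."""

    def __init__(self, weight, rank=8, seed=0):
        self.weight = weight  # frozen base weight, shape [out][in]
        self.rank = rank
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Re-initialise the adapter, e.g. at the start of a new document."""
        out_dim, in_dim = len(self.weight), len(self.weight[0])
        # A is small and random, B is zero, so B @ A == 0 and the adapted
        # layer initially matches the frozen base layer exactly.
        self.A = [[self.rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                  for _ in range(self.rank)]
        self.B = [[0.0] * self.rank for _ in range(out_dim)]

    def forward(self, x):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]
        low = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        delta = [sum(b * li for b, li in zip(row, low)) for row in self.B]
        return [b + d for b, d in zip(base, delta)]


def adam_step(param, grad, state, lr=0.01, betas=(0.9, 0.95), eps=1e-8):
    """One scalar Adam update; state holds the step count and both moments."""
    b1, b2 = betas
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)
```

Because B starts at zero, the adapted model begins each document identical to the base model, and only the small A and B matrices would be updated during test-time training.
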
Sequence Length
sequence_length
train_length: null
eval_length: 1024
Other
other
Document masking / document-isolated evaluation: document boundaries are detected from BOS tokens, and the LoRA parameters are reset for each document to avoid leakage across validation sequences.
parameters: {"batch_size":64,"targets":["lm_head","c_q","c_v"]}
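
As an illustration of BOS-based boundary detection, the sketch below splits a flat token stream into per-document spans. The function name `split_documents` and the `bos_id` argument are hypothetical, and treating tokens before the first BOS as their own document is an assumption, not something the PR states.

```python
def split_documents(tokens, bos_id):
    """Return (start, end) spans, one per document, using BOS as the boundary.

    Each span starts at a BOS token and runs up to, but not including, the
    next BOS. Tokens before the first BOS are treated as a separate document.
    """
    starts = [i for i, tok in enumerate(tokens) if tok == bos_id]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(tokens)]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


if __name__ == "__main__":
    stream = [1, 5, 6, 1, 7, 1, 8, 9, 10]  # 1 plays the role of BOS here
    print(split_documents(stream, bos_id=1))
```

Each span would then be evaluated on its own, with adapter state re-initialised at the start of the span so that nothing learned on one document carries into the next.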

Novel Contributions

  • Per-document LoRA test-time training during evaluation
  • Sliding window / strided evaluation over overlapping chunks
  • Document-aware evaluation with BOS-based boundary detection and no leakage across documents
  • Batching and sorting documents by length for faster per-sequence adaptation
  • Applying LoRA adapters to lm_head, c_q, and c_v projections in all transformer blocks
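
The length-sorted batching mentioned above can be sketched as follows. The helper names `batch_by_length` and `padded_tokens` are hypothetical, and padding each batch to its longest document is an assumption about how the batches would be processed.

```python
def batch_by_length(doc_lengths, batch_size=64):
    """Group document indices into batches of similar length."""
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_tokens(batches, doc_lengths):
    """Total tokens processed if each batch is padded to its longest document."""
    return sum(max(doc_lengths[i] for i in batch) * len(batch)
               for batch in batches)


if __name__ == "__main__":
    lengths = [5, 1, 9, 3, 7]
    sorted_batches = batch_by_length(lengths, batch_size=2)
    naive_batches = [[0, 1], [2, 3], [4]]
    print(sorted_batches, padded_tokens(sorted_batches, lengths))
    print(naive_batches, padded_tokens(naive_batches, lengths))
```

Sorting keeps documents of similar length together, so less compute is spent on padding and documents in a batch finish their adaptation steps at roughly the same time.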