PR #77

RECORD closed

[record bpb=1.195] sliding window + LoRA TTT

by samacqua

val_bpb

1.1950
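
Assuming val_bpb stands for validation bits per byte (the page does not define it), the metric can be sketched as the summed token cross-entropy, converted from nats to bits and divided by the byte count of the scored text. The function name and arguments below are illustrative and do not come from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Assumption: the loss is summed over all scored tokens and the byte count
    covers the same text in its raw encoded form.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers.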

Architecture

Transformer

Optimizer

Adam

Artifact Size

—

Training Techniques

Evaluation

sliding window eval

parameters: {"chunk_size":256,"eval_seq_len":1024}
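
One plausible reading of these parameters, offered as a sketch rather than the PR's actual code: the validation stream is scored in chunks of 256 tokens, and each chunk is given up to 1024 tokens of context ending at the chunk's last token. The helper name and the exact windowing rule are assumptions.

```python
def sliding_windows(n_tokens, chunk_size=256, eval_seq_len=1024):
    """Yield (ctx_start, score_start, score_end) for each scored chunk.

    Tokens in [score_start, score_end) are scored once; tokens in
    [ctx_start, score_start) are fed only as context.
    """
    if chunk_size <= 0 or eval_seq_len < chunk_size:
        raise ValueError("need 0 < chunk_size <= eval_seq_len")
    for score_start in range(0, n_tokens, chunk_size):
        score_end = min(score_start + chunk_size, n_tokens)
        # Extend the window backwards, but never past the start of the stream.
        ctx_start = max(0, score_end - eval_seq_len)
        yield ctx_start, score_start, score_end


windows = list(sliding_windows(2000))
```

Under this scheme every token is scored exactly once, and all but the earliest chunks see at least 768 tokens of preceding context.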

Test-Time Training

LoRA TTT

parameters: {"rank":8,"learning_rate":0.01,"betas":[0.9,0.95]}
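
To make the listed hyperparameters concrete, here is a minimal pure-Python sketch of a rank-8 LoRA layer and a single Adam update with learning rate 0.01 and betas (0.9, 0.95). The class, the initialization scale, and the scalar Adam helper are assumptions for illustration; the PR's implementation is not shown on this page.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """y = W x + B (A x). W stays frozen; only A and B would be trained."""

    def __init__(self, W, rank=8, seed=0):
        self.W = W
        self.rank = rank
        self.d_out, self.d_in = len(W), len(W[0])
        self._rng = random.Random(seed)
        self.reset()

    def reset(self):
        # A gets small random values, B starts at zero, so the adapter
        # contributes nothing until it has been trained.
        self.A = [[self._rng.gauss(0.0, 0.01) for _ in range(self.d_in)]
                  for _ in range(self.rank)]
        self.B = [[0.0] * self.rank for _ in range(self.d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]


def adam_step(p, g, m, v, t, lr=0.01, betas=(0.9, 0.95), eps=1e-8):
    """One bias-corrected Adam update for a scalar parameter (t starts at 1)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (v_hat ** 0.5 + eps), m, v
```

Because B starts at zero, a freshly reset adapter reproduces the base model's output exactly, which is what makes a reset between documents safe.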

Sequence Length

train_length: null

eval_length: 1024

Other

Document-isolated evaluation (document masking): document boundaries are detected from BOS tokens, and the LoRA parameters are reset at the start of each document so that adaptation does not leak across validation documents.

parameters: {"batch_size":64,"targets":["lm_head","c_q","c_v"]}
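
The boundary detection and per-document reset described above can be sketched as follows. This assumes each document begins with a BOS token; the BOS id, the function names, and the callback interface are hypothetical stand-ins, not the PR's API.

```python
def document_spans(tokens, bos_id):
    """Return (start, end) spans, one per document, splitting at BOS tokens."""
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    # A stream that does not open with BOS still has a leading fragment.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(tokens)]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


def evaluate_isolated(tokens, bos_id, make_state, score_document):
    """Score each document with freshly created adapter state.

    make_state() stands in for resetting the LoRA parameters, and
    score_document(doc_tokens, state) for adapting on and scoring one document.
    """
    total = 0.0
    for start, end in document_spans(tokens, bos_id):
        state = make_state()  # nothing carries over from the previous document
        total += score_document(tokens[start:end], state)
    return total
```

Creating the state inside the loop is the design point: anything learned on one document is discarded before the next one is scored.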

Novel Contributions

Per-document LoRA test-time training during evaluation
Sliding window / strided evaluation over overlapping chunks
Document-aware evaluation with BOS-based boundary detection and no leakage across documents
Batching and sorting documents by length for faster per-sequence adaptation
Applying LoRA adapters to lm_head, c_q, and c_v projections in all transformer blocks
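
The length-sorted batching in the list above might look like the sketch below. The sort direction and the helper name are assumptions; only the batch size of 64 comes from the listed parameters.

```python
def length_sorted_batches(doc_lengths, batch_size=64):
    """Group document indices into batches of similar length.

    Sorting by length first means documents in a batch need little padding
    and take a similar number of adaptation steps.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


batches = length_sorted_batches([5, 100, 7, 98, 6], batch_size=2)
```

Each batch element would still carry its own adapter state, so batching changes throughput only, not the per-document isolation.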