PR #658

open

Non-record: LoRA TTT exploration on SOTA base (negative result)

by hmlizama
val_bpb: 1.1734
Architecture: GPT

Training Techniques

Architecture
  • SmearGate: component used in the SOTA base to inject strong local context into the embeddings (parameters: null)
  • BigramHash: component used in the SOTA base to inject strong bigram/local context into the embeddings (parameters: null)
Weight Averaging
  • SWA (parameters: null)
Test-Time Training
  • LoRA TTT (parameters: {"rank": 8, "learning_rate": 0.01, "chunk_size": 256, "eval_seq_len": 1024})
Sequence Length
  • train_length: 2048, eval_length: 2048
  • train_length: null, eval_length: 1024
Evaluation
  • sliding window eval (parameters: {"context_length": 1024, "chunk_size": 256})
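
The sliding window eval parameters listed above (context_length 1024, chunk_size 256) imply a chunking scheme along these lines. This is a minimal sketch, not the PR's code, and the function name is hypothetical:

```python
def sliding_window_chunks(n_tokens, context_length=1024, chunk_size=256):
    """Yield (ctx_start, chunk_start, chunk_end) triples: each chunk_size-token
    span is scored while the model sees at most context_length tokens ending
    at the chunk boundary, so every token is scored exactly once."""
    for chunk_start in range(0, n_tokens, chunk_size):
        chunk_end = min(chunk_start + chunk_size, n_tokens)
        ctx_start = max(0, chunk_end - context_length)
        yield ctx_start, chunk_start, chunk_end
```

Scoring only chunk_size tokens per forward pass while re-feeding up to context_length tokens of left context is what makes this eval more expensive than a single full-context pass, but it gives every token comparable context.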

Novel Contributions

  • Exploration of combining LoRA test-time training with the current SOTA base model
  • Added batched LoRA support for Q/V projections during evaluation
  • Implemented per-document LoRA adaptation and reset between documents
  • Added a dedicated TTT evaluation loop and standalone eval script
  • Reported a negative result: LoRA TTT does not improve val_bpb over the SOTA base
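
The per-document adapt/reset pattern described above can be sketched as follows. This is a minimal numpy illustration, not the PR's implementation: `LoRALinear`, `ttt_step`, and `evaluate_with_ttt` are hypothetical names, an MSE proxy stands in for the language-model loss, and a single linear layer stands in for the Q/V projections.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank adapter:
    y = x @ W.T + scale * (x @ A.T) @ B.T
    Only A and B are updated at test time; W never changes."""

    def __init__(self, d_out, d_in, rank=8, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        self.rank, self.scale = rank, scale
        self.reset()

    def reset(self):
        # Per-document reset: B = 0 makes the adapter an exact no-op,
        # so a fresh document starts from the unmodified base model.
        rng = np.random.default_rng(1)
        self.A = 0.01 * rng.standard_normal((self.rank, self.W.shape[1]))
        self.B = np.zeros((self.W.shape[0], self.rank))

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def ttt_step(self, x, target, lr=0.01):
        """Score a chunk, then take one SGD step on an MSE proxy loss
        0.5 * ||y - target||^2, updating only the adapter factors."""
        y = self(x)
        g = y - target                              # dL/dy
        grad_B = self.scale * g.T @ (x @ self.A.T)  # (d_out, rank)
        grad_A = self.scale * (g @ self.B).T @ x    # (rank, d_in)
        self.B -= lr * grad_B
        self.A -= lr * grad_A
        return 0.5 * float((g * g).sum())           # loss before the update


def evaluate_with_ttt(layer, documents, chunk_size=256, lr=0.01):
    """Chunked TTT eval loop: adapt on each chunk in order, and reset
    the adapter at every document boundary."""
    losses = []
    for x, target in documents:
        layer.reset()  # no adapted state leaks across documents
        for start in range(0, len(x), chunk_size):
            losses.append(layer.ttt_step(x[start:start + chunk_size],
                                         target[start:start + chunk_size], lr))
    return losses
```

The reset between documents is what makes the adaptation per-document: it reinitializes A and zeroes B, so each document is scored starting from the frozen base model.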