PR #617

closed

Add 11L TTT LoRA submission: SOTA architecture + per-document LoRA te…

by ryanadamsai
val_bpb
1.1228
Architecture
11L EMA + GPTQ-lite
Optimizer
Adam
Artifact Size

Training Techniques

Architecture
EMA
Uses an exponential moving average of model weights in the 11-layer architecture's training setup.
parameters: null
Quantization
GPTQ-lite
bits: null
scope: all
QAT
bits: null
scope: all
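The card lists GPTQ-lite quantization over all weights but leaves the bit width unspecified. As a hedged illustration of the general idea (not the PR's actual GPTQ-lite code), a plain round-to-nearest per-output-channel weight quantizer looks like this; the `bits=4` default is a placeholder, since the real setting is not given:

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Illustrative round-to-nearest per-channel quantization sketch.

    Each output row gets its own scale so that the largest-magnitude
    weight maps to the top of the signed integer range. Returns the
    dequantized weights so the rounding error can be inspected.
    NOTE: `bits=4` is a placeholder; the PR does not state its bit width.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale
```

GPTQ proper additionally corrects for rounding error using second-order (Hessian) information; the sketch above shows only the quantization grid itself.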
LR Schedule
warmdown3500
parameters: {"warmdown_steps":3500}
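The `warmdown3500` schedule presumably holds the learning rate flat and then decays it over the final 3,500 steps. A minimal sketch of that shape, assuming a linear decay to zero (the exact decay curve is not stated in the card):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to 0 over the last `warmdown_steps`.

    Assumes a linear ramp; the PR only specifies warmdown_steps=3500,
    not the decay shape.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```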
Weight Averaging
EMA
parameters: null
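The EMA weight-averaging update itself is standard even though the card leaves its decay unspecified. A minimal sketch on a plain parameter dict, with `decay=0.999` as an assumed placeholder:

```python
def ema_update(ema_params, model_params, decay=0.999):
    """In-place EMA of weights: ema <- decay * ema + (1 - decay) * param.

    `decay` is a placeholder; the card lists no EMA parameters.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
```

At evaluation time the EMA copy, not the raw training weights, would be scored.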
Evaluation
sliding window eval
parameters: {"stride":64}
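Sliding-window evaluation with `stride=64` advances a fixed context window 64 tokens at a time and scores only the tokens not yet covered, so each token is evaluated once with near-maximal left context. A sketch of the window bookkeeping (the scoring itself is model code and omitted here):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (begin, end, n_scored) spans for strided sliding-window eval.

    The first window scores all of its tokens; each later window scores
    only the `n_scored` new tokens past the previous window's end, so
    every token is scored exactly once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With `window=2048` and `stride=64` this costs roughly `window / stride` = 32 forward passes per scored token's worth of text, which is the usual accuracy/compute trade-off of small strides.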
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":256,"eval_seq_len":2048,"batch_size":32}
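Per-document LoRA test-time training splits each document into 256-token chunks and interleaves scoring with adapter updates. A control-flow sketch under one plausible reading of the parameters (score a chunk with the current adapters, then take an update step on it; the exact ordering inside the PR is an assumption, and `score_fn`/`update_fn`/`reset_fn` are hypothetical callbacks standing in for the model code):

```python
def chunk_document(tokens, chunk_size=256):
    """Split one document's tokens into fixed-size chunks (last may be short)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def ttt_eval_document(tokens, score_fn, update_fn, reset_fn, chunk_size=256):
    """Hypothetical per-document TTT loop.

    Adapters are reset at the start of every document so no state leaks
    across documents; each chunk is scored before the adapters train on
    it, so a token is never evaluated by adapters that already saw it.
    Returns the mean per-token loss accumulated by `score_fn`.
    """
    reset_fn()                          # fresh LoRA state per document
    total_nll, n_tokens = 0.0, 0
    for chunk in chunk_document(tokens, chunk_size):
        total_nll += score_fn(chunk)    # evaluate first ...
        update_fn(chunk)                # ... then adapt on the same chunk
        n_tokens += len(chunk)
    return total_nll / max(n_tokens, 1)
```

The listed `learning_rate=0.01` and `rank=8` would live inside `update_fn`, and `batch_size=32` refers to how many documents are adapted in parallel, per the batched-evaluation bullet below.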
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"betas":[0.9,0.95]}

Novel Contributions

  • Combines the PR #401 SOTA architecture with per-document LoRA test-time training at evaluation.
  • Adds a forward_with_lora path and attention/block hooks to support per-batch LoRA adapters for Q/V projections.
  • Introduces BatchedTTTLoRA for rank-8 LoRA adapters on Q, V, and optionally the LM head.
  • Implements per-document chunked evaluation that resets LoRA parameters between documents to avoid leakage.
  • Uses document boundary detection via BOS tokens and batched length-sorted evaluation for efficiency.
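The LoRA math behind the `forward_with_lora` path is standard even though the PR's exact module layout isn't shown here. A minimal NumPy sketch of a rank-8 adapter on a linear projection (as would be applied to Q and V): with `B` zero-initialized, the adapted layer starts out identical to the base model, and resetting `B` between documents restores it exactly.

```python
import numpy as np

class LoRAAdapter:
    """Rank-r LoRA for a (d_out x d_in) weight: W_eff = W + (alpha / r) * B @ A.

    `alpha=16` is an illustrative default, not a value stated in the PR.
    B starts at zero, so the adapted projection equals the base one.
    """
    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.02, size=(rank, d_in))  # small random init
        self.B = np.zeros((d_out, rank))                   # zero init
        self.scale = alpha / rank

    def delta(self):
        return self.scale * (self.B @ self.A)

    def reset(self):
        self.B[:] = 0.0  # zeroing B restores the base weights exactly

def lora_linear(x, W, adapter):
    """Base projection plus the low-rank correction: x (n, d_in) -> (n, d_out)."""
    return x @ (W + adapter.delta()).T
```

Batching this per document (the `BatchedTTTLoRA` idea) amounts to giving each sequence in the batch its own `A`/`B` pair and applying them via batched matmuls inside the attention hooks.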
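The last bullet's preprocessing step can be sketched directly: split the token stream into documents at BOS markers, then group documents of similar length so padded batches waste little compute. This is a plain-Python illustration of the idea, not the PR's implementation:

```python
def split_documents(tokens, bos_id):
    """Split a flat token stream into documents at BOS tokens.

    Each document keeps its leading BOS; tokens before the first BOS
    (if any) form their own document.
    """
    docs, cur = [], []
    for t in tokens:
        if t == bos_id and cur:
            docs.append(cur)
            cur = []
        cur.append(t)
    if cur:
        docs.append(cur)
    return docs

def length_sorted_batches(docs, batch_size=32):
    """Group document indices by length so each batch pads minimally."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

`batch_size=32` matches the TTT parameter listed above: up to 32 documents are adapted and evaluated in parallel, each with its own LoRA state.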