PR #617

closed

Add 11L TTT LoRA submission: SOTA architecture + per-document LoRA te…

by ryanadamsai
val_bpb
1.1228
Architecture
11L EMA + GPTQ-lite
Optimizer
Adam
Artifact Size

Training Techniques

Architecture
EMA
Uses an exponential moving average of model weights in the 11-layer architecture's training setup.
parameters: null
Quantization
GPTQ-lite
bits: null
scope: all
QAT
bits: null
scope: all
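The card lists GPTQ-lite quantization over all weights but leaves the bit width unspecified. As a hedged illustration of the general idea (not the PR's actual GPTQ-lite code), a plain round-to-nearest per-output-channel weight quantizer looks like this; the `bits=4` default is a placeholder, since the real setting is not given:

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Illustrative round-to-nearest per-channel quantization sketch.

    Each output row gets its own scale so that the largest-magnitude
    weight maps to the top of the signed integer range. Returns the
    dequantized weights so the rounding error can be inspected.
    NOTE: `bits=4` is a placeholder; the PR does not state its bit width.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale
```

GPTQ proper additionally corrects for rounding error using second-order (Hessian) information; the sketch above shows only the quantization grid itself.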
LR Schedule
warmdown3500
parameters: {"warmdown_steps":3500}
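The `warmdown3500` schedule presumably holds the learning rate flat and then decays it over the final 3,500 steps. A minimal sketch of that shape, assuming a linear decay to zero (the exact decay curve is not stated in the card):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear warmdown to 0 over the last `warmdown_steps`.

    Assumes a linear ramp; the PR only specifies warmdown_steps=3500,
    not the decay shape.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```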
Weight Averaging
EMA
parameters: null
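The EMA weight-averaging update itself is standard even though the card leaves its decay unspecified. A minimal sketch on a plain parameter dict, with `decay=0.999` as an assumed placeholder:

```python
def ema_update(ema_params, model_params, decay=0.999):
    """In-place EMA of weights: ema <- decay * ema + (1 - decay) * param.

    `decay` is a placeholder; the card lists no EMA parameters.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
```

At evaluation time the EMA copy, not the raw training weights, would be scored.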
Evaluation
sliding window eval
parameters: {"stride":64}
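Sliding-window evaluation with `stride=64` advances a fixed context window 64 tokens at a time and scores only the tokens not yet covered, so each token is evaluated once with near-maximal left context. A sketch of the window bookkeeping (the scoring itself is model code and omitted here):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (begin, end, n_scored) spans for strided sliding-window eval.

    The first window scores all of its tokens; each later window scores
    only the `n_scored` new tokens past the previous window's end, so
    every token is scored exactly once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With `window=2048` and `stride=64` this costs roughly `window / stride` = 32 forward passes per scored token's worth of text, which is the usual accuracy/compute trade-off of small strides.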
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":256,"eval_seq_len":2048,"batch_size":32}
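Per-document LoRA test-time training splits each document into 256-token chunks and interleaves scoring with adapter updates. A control-flow sketch under one plausible reading of the parameters (score a chunk with the current adapters, then take an update step on it; the exact ordering inside the PR is an assumption, and `score_fn`/`update_fn`/`reset_fn` are hypothetical callbacks standing in for the model code):

```python
def chunk_document(tokens, chunk_size=256):
    """Split one document's tokens into fixed-size chunks (last may be short)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def ttt_eval_document(tokens, score_fn, update_fn, reset_fn, chunk_size=256):
    """Hypothetical per-document TTT loop.

    Adapters are reset at the start of every document so no state leaks
    across documents; each chunk is scored before the adapters train on
    it, so a token is never evaluated by adapters that already saw it.
    Returns the mean per-token loss accumulated by `score_fn`.
    """
    reset_fn()                          # fresh LoRA state per document
    total_nll, n_tokens = 0.0, 0
    for chunk in chunk_document(tokens, chunk_size):
        total_nll += score_fn(chunk)    # evaluate first ...
        update_fn(chunk)                # ... then adapt on the same chunk
        n_tokens += len(chunk)
    return total_nll / max(n_tokens, 1)
```

The listed `learning_rate=0.01` and `rank=8` would live inside `update_fn`, and `batch_size=32` refers to how many documents are adapted in parallel, per the batched-evaluation bullet below.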
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"betas":[0.9,0.95]}

Novel Contributions

  • Combines the PR #401 SOTA architecture with per-document LoRA test-time training at evaluation.
  • Adds a forward_with_lora path and attention/block hooks to support per-batch LoRA adapters for Q/V projections.
  • Introduces BatchedTTTLoRA for rank-8 LoRA adapters on Q, V, and optionally the LM head.
  • Implements per-document chunked evaluation that resets LoRA parameters between documents to avoid leakage.
  • Uses document boundary detection via BOS tokens and batched length-sorted evaluation for efficiency.
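The LoRA math behind the `forward_with_lora` path is standard even though the PR's exact module layout isn't shown here. A minimal NumPy sketch of a rank-8 adapter on a linear projection (as would be applied to Q and V): with `B` zero-initialized, the adapted layer starts out identical to the base model, and resetting `B` between documents restores it exactly.

```python
import numpy as np

class LoRAAdapter:
    """Rank-r LoRA for a (d_out x d_in) weight: W_eff = W + (alpha / r) * B @ A.

    `alpha=16` is an illustrative default, not a value stated in the PR.
    B starts at zero, so the adapted projection equals the base one.
    """
    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.02, size=(rank, d_in))  # small random init
        self.B = np.zeros((d_out, rank))                   # zero init
        self.scale = alpha / rank

    def delta(self):
        return self.scale * (self.B @ self.A)

    def reset(self):
        self.B[:] = 0.0  # zeroing B restores the base weights exactly

def lora_linear(x, W, adapter):
    """Base projection plus the low-rank correction: x (n, d_in) -> (n, d_out)."""
    return x @ (W + adapter.delta()).T
```

Batching this per document (the `BatchedTTTLoRA` idea) amounts to giving each sequence in the batch its own `A`/`B` pair and applying them via batched matmuls inside the attention hooks.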
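The last bullet's preprocessing step can be sketched directly: split the token stream into documents at BOS markers, then group documents of similar length so padded batches waste little compute. This is a plain-Python illustration of the idea, not the PR's implementation:

```python
def split_documents(tokens, bos_id):
    """Split a flat token stream into documents at BOS tokens.

    Each document keeps its leading BOS; tokens before the first BOS
    (if any) form their own document.
    """
    docs, cur = [], []
    for t in tokens:
        if t == bos_id and cur:
            docs.append(cur)
            cur = []
        cur.append(t)
    if cur:
        docs.append(cur)
    return docs

def length_sorted_batches(docs, batch_size=32):
    """Group document indices by length so each batch pads minimally."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

`batch_size=32` matches the TTT parameter listed above: up to 32 documents are adapted and evaluated in parallel, each with its own LoRA state.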