val_bpb: 1.1734
Architecture: GPT
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
- SmearGate — architecture component used in the SOTA base to inject strong local context into the embeddings (parameters: null)
- BigramHash — architecture component used in the SOTA base to inject strong bigram/local context into the embeddings (parameters: null)
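The card does not spell out how BigramHash works internally; a plausible reading of "inject bigram/local context into embeddings" is an auxiliary embedding table indexed by a hash of each (previous, current) token pair. The sketch below is a hypothetical illustration under that assumption — the class name, hash function, and bucket count are all invented here, not taken from the SOTA base.

```python
# Hypothetical sketch of a bigram-hash embedding: hash each (prev, cur)
# token pair into a bucket and add that bucket's embedding to the token
# embedding. The multiplicative hash and bucket count are illustrative.
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_buckets: int = 4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.bigram_emb = nn.Embedding(n_buckets, d_model)
        self.n_buckets = n_buckets

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (batch, seq). Shift right to form (prev, cur) pairs; position 0
        # has no predecessor, so pad with the current token itself.
        prev = torch.cat([idx[:, :1], idx[:, :-1]], dim=1)
        # Cheap multiplicative hash of the pair into a bucket id (fits int64).
        bucket = (prev * 1000003 + idx) % self.n_buckets
        return self.tok_emb(idx) + self.bigram_emb(bucket)

emb = BigramHashEmbedding(vocab_size=50257, d_model=64)
x = torch.randint(0, 50257, (2, 16))
out = emb(x)
print(out.shape)  # torch.Size([2, 16, 64])
```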
Weight Averaging
- SWA (stochastic weight averaging; parameters: null)
Test-Time Training
- LoRA TTT (parameters: {"rank": 8, "learning_rate": 0.01, "chunk_size": 256, "eval_seq_len": 1024})
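The LoRA TTT configuration above can be sketched as a per-document loop: freeze the base weights, attach low-rank adapters, and take one gradient step on each 256-token chunk's loss before scoring the next chunk, resetting the adapters between documents. This is a minimal illustration assuming the card's hyperparameters (rank 8, learning rate 0.01, chunk size 256); the model, `LoRALinear`, and `ttt_eval` names are illustrative, not the repo's actual API.

```python
# Minimal per-document LoRA test-time-training sketch (illustrative names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W + B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def reset(self):
        # Per-document reset: zeroing B makes the update a no-op again.
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def ttt_eval(model, lora_layers, doc_tokens, chunk_size=256, lr=0.01):
    """Score a document chunk by chunk, adapting LoRA params as we go."""
    for l in lora_layers:
        l.reset()
    opt = torch.optim.SGD([p for l in lora_layers for p in (l.A, l.B)], lr=lr)
    total_loss, total_tok = 0.0, 0
    for start in range(0, doc_tokens.size(1) - 1, chunk_size):
        x = doc_tokens[:, start : start + chunk_size]
        y = doc_tokens[:, start + 1 : start + chunk_size + 1]
        x = x[:, : y.size(1)]
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total_loss += loss.item() * y.numel()    # record loss before adapting
        total_tok += y.numel()
        opt.zero_grad()
        loss.backward()                          # gradient step on LoRA params only
        opt.step()
    return total_loss / total_tok

# Toy usage: a tiny embedding + LoRA head standing in for the real model.
torch.manual_seed(0)
vocab = 100
emb = nn.Embedding(vocab, 32)
head = LoRALinear(nn.Linear(32, vocab), rank=8)
model = lambda x: head(emb(x))
doc = torch.randint(0, vocab, (1, 512))
loss = ttt_eval(model, [head], doc, chunk_size=256)
```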
Sequence Length
- sequence_length: train_length 2048, eval_length 2048
- sequence_length: train_length null, eval_length 1024
Evaluation
- sliding window eval (parameters: {"context_length": 1024, "chunk_size": 256})
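A common reading of these evaluation parameters is a strided sliding window: each 256-token chunk is scored conditioned on up to 1024 tokens of preceding context, so most positions see a near-full window. The sketch below assumes that interpretation; the function name and toy model are illustrative, and a trailing partial chunk is simply skipped.

```python
# Sliding-window evaluation sketch: score chunk_size tokens per forward
# pass, each conditioned on up to context_length tokens of history.
import torch
import torch.nn.functional as F

def sliding_window_nll(model, tokens, context_length=1024, chunk_size=256):
    """Average negative log-likelihood per scored token."""
    total, count = 0.0, 0
    T = tokens.size(1)
    for chunk_end in range(chunk_size, T + 1, chunk_size):
        ctx_start = max(0, chunk_end - context_length)
        inp = tokens[:, ctx_start:chunk_end]
        with torch.no_grad():
            logits = model(inp[:, :-1])          # logit at i predicts token i+1
        targets = inp[:, 1:]
        n_score = min(chunk_size, targets.size(1))
        nll = F.cross_entropy(
            logits[:, -n_score:].reshape(-1, logits.size(-1)),
            targets[:, -n_score:].reshape(-1),
            reduction="sum",
        )
        total += nll.item()
        count += n_score * tokens.size(0)
    return total / count

# Toy usage with a stand-in model (embedding + linear head).
torch.manual_seed(0)
vocab = 50
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 16), torch.nn.Linear(16, vocab))
tokens = torch.randint(0, vocab, (1, 1024))
nll = sliding_window_nll(model, tokens, context_length=1024, chunk_size=256)
```

With this scheme every chunk after the first is scored with at least 768 tokens of context, avoiding the pessimistic perplexity of naive non-overlapping evaluation.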
Novel Contributions
- Explored combining LoRA test-time training with the current SOTA base model
- Added batched LoRA support for the Q/V projections during evaluation
- Implemented per-document LoRA adaptation, with adapters reset between documents
- Added a dedicated TTT evaluation loop and a standalone eval script
- Reported a negative result: TTT does not improve the SOTA base
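The "batched LoRA on Q/V projections" contribution can be sketched as follows: give each document in the eval batch its own low-rank (A, B) pair so per-document adaptation runs in one batched forward pass. This is a hedged illustration — the class and parameter names are invented here, and the repo's actual implementation may differ.

```python
# Batched-LoRA sketch: one independent low-rank adapter per batch element,
# intended for the Q and V projections of an attention block.
import torch
import torch.nn as nn

class BatchedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, batch_size: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # shared base stays frozen
        # One (A, B) pair per document in the batch.
        self.A = nn.Parameter(torch.randn(batch_size, rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(batch_size, base.out_features, rank))

    def reset(self):
        # Called between documents so adapters start from a no-op state.
        nn.init.normal_(self.A, std=0.01)
        nn.init.zeros_(self.B)

    def forward(self, x):                        # x: (batch, seq, in_features)
        # Per-batch-element low-rank update: contract in_features with A,
        # then rank with B, all in one batched einsum.
        delta = torch.einsum("bsi,bri,bor->bso", x, self.A, self.B)
        return self.base(x) + delta

# Toy usage on stand-in Q/V projections.
torch.manual_seed(0)
q_proj = BatchedLoRALinear(nn.Linear(64, 64), batch_size=4, rank=8)
v_proj = BatchedLoRALinear(nn.Linear(64, 64), batch_size=4, rank=8)
x = torch.randn(4, 32, 64)
q_out, v_out = q_proj(x), v_proj(x)
```

Because B initializes to zero, a freshly reset adapter leaves the base projection's output unchanged, which makes the per-document reset safe.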