PR #611

closed

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)

val_bpb

0.5601

Architecture

Transformer

Optimizer

—

Artifact Size

1498 lines

Training Techniques

Test-Time Training

LoRA TTT

parameters: {"rank":8,"k_projection_lora":true,"ttt_epochs":8}

Architecture

K projection LoRA

Adds LoRA adapters to key projections in attention, in addition to the usual Q/V LoRA, with a reduced learning-rate multiplier.

parameters: {"lr_multiplier":0.3}

Evaluation

min-NLL epoch selection

parameters: {"select_best_epoch_per_document":true}

LR Schedule

cosine decay

parameters: {"ttt_epochs":8}

Adds LoRA adapters to K projections during test-time training, not just Q and V.
Tracks the minimum average NLL per document across TTT epochs instead of using only the last epoch.
Extends TTT from 6 to 8 epochs while avoiding late-epoch overfitting via min-NLL selection.
Uses a conservative 0.3x learning-rate multiplier for K-projection LoRA.