PR #611

closed

Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)

by teddyoweh
val_bpb: 0.5601
Architecture: Transformer
Optimizer:
Artifact Size: 1498 lines

Training Techniques

  • Test-Time Training: LoRA TTT
    parameters: {"rank":8,"k_projection_lora":true,"ttt_epochs":8}
  • Architecture: K-projection LoRA
    Adds LoRA adapters to the key projections in attention, in addition to the usual Q/V LoRA, with a reduced learning-rate multiplier.
    parameters: {"lr_multiplier":0.3}
  • Evaluation: min-NLL epoch selection
    parameters: {"select_best_epoch_per_document":true}
  • LR Schedule: cosine decay
    parameters: {"ttt_epochs":8}
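The K-projection LoRA idea above can be sketched in a few lines. This is a minimal, hypothetical numpy illustration (class and variable names are not from the PR): each attention projection keeps a frozen base weight W and adds a trainable low-rank update B @ A, with the K adapter carrying the PR's reduced 0.3x learning-rate multiplier.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 8  # model dim (illustrative) and LoRA rank from {"rank": 8}

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (sketch only)."""
    def __init__(self, d, r, lr_mult=1.0):
        self.W = rng.normal(size=(d, d))        # frozen base projection
        self.A = rng.normal(size=(r, d)) * 0.01 # trainable down-projection
        self.B = np.zeros((d, r))               # B starts at zero, so the adapter is a no-op at init
        self.lr_mult = lr_mult                  # 0.3 for K, per {"lr_multiplier": 0.3}

    def __call__(self, x):
        return x @ self.W.T + x @ self.A.T @ self.B.T

# Q and V get the usual adapters; K gets one too, at a conservative rate.
q = LoRALinear(d, r, lr_mult=1.0)
k = LoRALinear(d, r, lr_mult=0.3)
v = LoRALinear(d, r, lr_mult=1.0)

x = rng.normal(size=(4, d))
# With B zero-initialized, the LoRA path contributes nothing yet:
assert np.allclose(k(x), x @ k.W.T)
```

In a real optimizer setup, `lr_mult` would map to a separate parameter group whose learning rate is scaled by 0.3 relative to the Q/V adapters.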

Novel Contributions

  • Adds LoRA adapters to K projections during test-time training, not just Q and V.
  • Tracks the minimum average NLL per document across TTT epochs instead of using only the last epoch.
  • Extends TTT from 6 to 8 epochs while avoiding late-epoch overfitting via min-NLL selection.
  • Uses a conservative 0.3x learning-rate multiplier for K-projection LoRA.
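The min-NLL selection and cosine schedule can be sketched as follows. This is an illustrative stand-alone example, not the PR's code: it decays the TTT learning rate over the 8 epochs and, for a document whose simulated per-epoch NLLs improve and then overfit, picks the epoch with the lowest mean NLL rather than the last one.

```python
import math

TTT_EPOCHS = 8  # from the PR's {"ttt_epochs": 8}

def cosine_lr(base_lr: float, epoch: int, total: int = TTT_EPOCHS) -> float:
    """Cosine decay from base_lr at epoch 0 toward 0 at the final epoch."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total))

def select_min_nll(nll_per_epoch: list) -> tuple:
    """min-NLL selection: return (epoch, NLL) of the best epoch for a document."""
    best_epoch = min(range(len(nll_per_epoch)), key=nll_per_epoch.__getitem__)
    return best_epoch, nll_per_epoch[best_epoch]

# Hypothetical per-document mean NLLs across 8 TTT epochs:
# the document improves, then overfits after epoch 5.
nlls = [0.62, 0.58, 0.56, 0.55, 0.545, 0.54, 0.55, 0.56]
epoch, best = select_min_nll(nlls)
print(epoch, best)  # → 5 0.54 (last-epoch selection would have reported 0.56)
print(cosine_lr(1e-3, 0), cosine_lr(1e-3, TTT_EPOCHS))
```

This is why extending TTT from 6 to 8 epochs is safe here: extra epochs can only help, since any late-epoch degradation is discarded by the per-document minimum.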