PR #611
closedRecord: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean)
by teddyowehView on GitHub
val_bpb
0.5601
Architecture
Transformer
Optimizer
—
Artifact Size
1498 lines
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":8,"k_projection_lora":true,"ttt_epochs":8}
Architecture
K projection LoRA
Adds LoRA adapters to key projections in attention, in addition to the usual Q/V LoRA, with a reduced learning-rate multiplier.
parameters: {"lr_multiplier":0.3}
Evaluation
min-NLL epoch selection
parameters: {"select_best_epoch_per_document":true}
LR Schedule
cosine decay
parameters: {"ttt_epochs":8}
Novel Contributions
- Adds LoRA adapters to K projections during test-time training, not just Q and V.
- Tracks the minimum average NLL per document across TTT epochs instead of using only the last epoch.
- Extends TTT from 6 to 8 epochs while avoiding late-epoch overfitting via min-NLL selection.
- Uses a conservative 0.3x learning-rate multiplier for K-projection LoRA.