PR #1767
Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)
by renqianluo
val_bpb: 1.0721
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.94 MB
Training Techniques
Quantization: GPTQ
parameters: {"bits":7,"scope":"embeddings and model weights"}
Architecture: LoRA TTT
BatchedLinearLoRA with rank-scaled output and warm-started A across batches for phased test-time training.
parameters: {"rank":128,"alpha":144}
Test-Time Training: LoRA TTT
parameters: {"rank":128,"alpha":144,"warm_start_A":true}
Regularization: weight decay
parameters: {"weight_decay":1}
Novel Contributions
- Rank-scaled LoRA output, scaling the update by alpha/rank to safely support higher ranks
- Warm-starting LoRA A across batches while resetting only B
- Increasing TTT weight decay from 0.5 to 1.0 to stabilize warm-start A
- Raising LoRA alpha from 96 to 144 for stronger adaptation
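Taken together, the warm-start and weight-decay contributions amount to a phased loop like the sketch below. The variable names and the plain-SGD update with decoupled weight decay are illustrative assumptions; the PR presumably uses AdamW, whose optimizer state is omitted here.

```python
import numpy as np

rank, d_in, d_out = 4, 8, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((rank, d_in)) * 0.02   # LoRA A: warm-started across batches
B = np.zeros((d_out, rank))                    # LoRA B: reset to zero each batch
lr, wd = 1e-2, 1.0                             # TTT weight decay raised from 0.5 to 1.0
A0 = A.copy()                                  # snapshot, to observe the decay on A

for batch in range(3):
    # ... compute LoRA gradients on this batch's test-time loss (omitted) ...
    grad_A, grad_B = np.zeros_like(A), np.zeros_like(B)
    # Decoupled weight decay pulls the warm-started A back toward zero on
    # every step; per the PR, this is what stabilizes carrying A forward.
    A = A - lr * grad_A - lr * wd * A
    B = B - lr * grad_B - lr * wd * B
    # Phase boundary: keep the adapted A, reset only B, so the next batch
    # starts from the base model's output but a warmed-up down-projection.
    B[:] = 0.0
```

Resetting only B means each new batch initially behaves like the base model (the LoRA update is zero), while the accumulated adaptation in A carries over.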