PR #1767

Status: open

Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)

by renqianluo
val_bpb: 1.0721
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.94 MB

Training Techniques

Quantization: GPTQ
  bits: 7
  scope: embeddings and model weights
Architecture: LoRA TTT
  BatchedLinearLoRA with rank-scaled output and warm-started A across batches for phased test-time training (see the sketch after this list).
  parameters: {"rank": 128, "alpha": 144}
Test-Time Training: LoRA TTT
  parameters: {"rank": 128, "alpha": 144, "warm_start_A": true}
Regularization: weight decay
  parameters: {"weight_decay": 1.0}

Novel Contributions

  • Rank-scaled LoRA output using an alpha/rank scaling factor to safely support a higher rank
  • Warm-starting LoRA A across batches while resetting only B (sketched after this list)
  • Increasing TTT weight decay from 0.5 to 1.0 to stabilize the warm-started A
  • Raising LoRA alpha from 96 to 144 for stronger adaptation
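
A hypothetical test-time-training step showing how these pieces combine: B is re-zeroed at each phase boundary while A carries over, and the LoRA parameters are optimized with AdamW at weight_decay=1.0. Function and variable names here are illustrative; only the rank, alpha, and weight-decay values come from this PR.

```python
import torch

def ttt_adapt(model, batch, steps: int = 4, lr: float = 1e-3):
    # Warm start: A keeps its state from the previous batch; only B is reset.
    for m in model.modules():
        if hasattr(m, "reset_B"):
            m.reset_B()
    lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
    # weight_decay=1.0 (raised from 0.5) gently pulls the carried-over LoRA
    # parameters back toward zero each step, keeping the warm start from drifting.
    opt = torch.optim.AdamW(lora_params, lr=lr, weight_decay=1.0)
    for _ in range(steps):
        loss = model(batch)          # assumes the model returns a scalar loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```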