PR #1765

open

Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07266 (3-seed mean)

by renqianluo
val_bpb: 1.0727
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,935,775 B

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"rank":128}
Architecture
LoRA
Adds alpha/rank output scaling to BatchedLinearLoRA so that the effective update magnitude is decoupled from rank.
parameters: {"rank":128,"alpha":96}
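The alpha/rank scaling can be sketched as below. This is an illustrative NumPy mock, not the PR's actual BatchedLinearLoRA (which is a PyTorch module); the dimensions and init scales are assumptions, while rank=128 and alpha=96 are the PR's settings.

```python
import numpy as np

# Hypothetical sketch of LoRA with alpha/rank output scaling.
rank, alpha = 128, 96          # PR settings; effective scale = 96/128 = 0.75
d_in, d_out = 256, 256         # illustrative shapes, not from the PR

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.02    # trainable down-projection
B = np.zeros((d_out, rank))                     # zero-init up-projection

def lora_forward(x):
    # y = W x + (alpha / rank) * B A x.  Dividing by rank decouples the
    # update magnitude from the rank, so rank can be raised for extra
    # capacity without retuning the learning rate.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zeroed, the adapter contributes nothing yet:
assert np.allclose(lora_forward(x), W @ x)
```

With rank 128 and alpha 96, the update is scaled by 0.75; without the division by rank, the same alpha at rank 128 would produce a far larger effective update than at low rank.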
LoRA
Warm-starts LoRA A across batches while still resetting B to zero.
parameters: null
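A minimal sketch of the warm-start policy (function and variable names here are hypothetical, not the PR's code): A is carried over from the previous batch while B is re-zeroed, so each batch inherits the subspace learned in A but still starts from an exact zero delta.

```python
import numpy as np

# Hypothetical sketch; the PR's actual reset logic lives in its PyTorch
# TTT loop, and the shapes below are illustrative.
rank, d = 8, 16
rng = np.random.default_rng(1)

def fresh_adapter():
    # Standard per-batch init: random A, zero B.
    return rng.standard_normal((rank, d)) * 0.02, np.zeros((d, rank))

def warm_start_reset(A, B):
    # Warm start: keep A unchanged across batches, re-zero only B so the
    # initial LoRA update (B @ A) for the new batch is exactly zero.
    return A, np.zeros_like(B)

A, B = fresh_adapter()
A = A + 0.1                    # pretend a TTT batch updated both factors
B = B + 0.1
A2, B2 = warm_start_reset(A, B)
assert A2 is A                 # learned directions in A survive the reset
assert not B2.any()            # ...but the delta B @ A starts at zero
```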
Regularization
weight decay
parameters: {"weight_decay":1}
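Since the optimizer listed above is SGD, weight_decay=1 corresponds to the usual L2 term added to the gradient before the step. A toy scalar sketch (the learning rate is an assumption, not from the PR; wd matches the PR's TTT setting):

```python
# Toy SGD-with-weight-decay step on a single scalar parameter.
lr, wd = 0.01, 1.0             # lr illustrative; wd = 1.0 per the PR
param, grad = 0.5, 0.2         # made-up values for the arithmetic

grad_wd = grad + wd * param    # coupled L2 decay: gradient gains wd * param
param = param - lr * grad_wd   # plain SGD update

# 0.5 - 0.01 * (0.2 + 1.0 * 0.5) = 0.493
assert abs(param - 0.493) < 1e-9
```

The stronger pull toward zero counteracts the accumulation in the warm-started A factor, which is the stated motivation for raising weight decay from 0.5 to 1.0.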
Quantization
GPTQ
bits: 7
scope: embeddings

Novel Contributions

  • Alpha/rank scaling for BatchedLinearLoRA to make higher LoRA rank stable
  • Warm-starting LoRA A across batches instead of re-randomizing it each batch
  • Increasing TTT weight decay from 0.5 to 1.0 to counter warm-start overfitting
  • Using rank 128 with alpha 96 to preserve effective magnitude while increasing capacity