PR #1765
Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07266 (3-seed mean)
by renqianluo
val_bpb: 1.0727
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,935,775 B
Training Techniques
Test-Time Training: LoRA TTT
parameters: {"rank": 128}
Architecture: LoRA
Adds alpha/rank output scaling to BatchedLinearLoRA so effective update magnitude is decoupled from rank.
parameters: {"rank": 128, "alpha": 96}
LoRA: Warm-starts LoRA A across batches while still resetting B to zero.
parameters: null
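A sketch of that per-batch reset policy, assuming the adapter exposes its A and B matrices as plain parameters; the attribute and function names are hypothetical:

```python
import torch

@torch.no_grad()
def reset_adapter_for_new_batch(adapter, warm_start_A: bool = True) -> None:
    """Reset a LoRA adapter between test-time batches.

    With warm_start_A=True (the PR's behavior), A keeps what it learned on
    earlier batches while B is zeroed, so the delta (alpha/rank) * B @ A
    starts each new batch at exactly zero.
    """
    adapter.B.zero_()
    if not warm_start_A:
        # Previous behavior: re-randomize A each batch as well.
        adapter.A.normal_(mean=0.0, std=0.01)
```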
Regularization: weight decay
parameters: {"weight_decay": 1}
Quantization: GPTQ
bits: 7, scope: embeddings
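The 7-bit GPTQ entry is scoped to the embeddings. GPTQ proper corrects rounding error column by column using second-order statistics, which is not reproduced here; as a deliberately simplified stand-in, the sketch below does per-row round-to-nearest quantization of an embedding table at the same bit width:

```python
import torch

def quantize_rows_rtn(weight: torch.Tensor, bits: int = 7):
    """Per-row symmetric round-to-nearest quantization (NOT GPTQ).

    Only illustrates what a 7-bit integer embedding table plus per-row
    scales looks like; GPTQ would additionally correct quantization error
    using Hessian information.
    """
    qmax = 2 ** (bits - 1) - 1                                # 63 at 7 bits
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.round(weight / scale).clamp_(-qmax - 1, qmax).to(torch.int8)
    return q, scale                                           # dequantize: q * scale

embeddings = torch.randn(50304, 768)                          # toy vocab x d_model
q, scale = quantize_rows_rtn(embeddings, bits=7)
```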
Novel Contributions
- Alpha/rank scaling for BatchedLinearLoRA to make higher LoRA ranks stable
- Warm-starting LoRA A across batches instead of re-randomizing it each batch
- Increasing TTT weight decay from 0.5 to 1.0 to counter warm-start overfitting
- Using rank 128 with alpha 96 to preserve effective magnitude while increasing capacity