PR #1765

open

Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07266 (3-seed mean)

by renqianluo
val_bpb: 1.0727
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,935,775 B

Training Techniques

Test-Time Training
LoRA TTT
parameters: {"rank":128}
Architecture
LoRA
Adds alpha/rank output scaling to BatchedLinearLoRA so that the effective update magnitude is decoupled from rank.
parameters: {"rank":128,"alpha":96}
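The alpha/rank scaling can be sketched as below. This is an illustrative NumPy mock, not the PR's actual BatchedLinearLoRA (which is a PyTorch module); the dimensions and init scales are assumptions, while rank=128 and alpha=96 are the PR's settings.

```python
import numpy as np

# Hypothetical sketch of LoRA with alpha/rank output scaling.
rank, alpha = 128, 96          # PR settings; effective scale = 96/128 = 0.75
d_in, d_out = 256, 256         # illustrative shapes, not from the PR

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.02    # trainable down-projection
B = np.zeros((d_out, rank))                     # zero-init up-projection

def lora_forward(x):
    # y = W x + (alpha / rank) * B A x.  Dividing by rank decouples the
    # update magnitude from the rank, so rank can be raised for extra
    # capacity without retuning the learning rate.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zeroed, the adapter contributes nothing yet:
assert np.allclose(lora_forward(x), W @ x)
```

With rank 128 and alpha 96, the update is scaled by 0.75; without the division by rank, the same alpha at rank 128 would produce a far larger effective update than at low rank.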
LoRA
Warm-starts LoRA A across batches while still resetting B to zero.
parameters: null
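A minimal sketch of the warm-start policy (function and variable names here are hypothetical, not the PR's code): A is carried over from the previous batch while B is re-zeroed, so each batch inherits the subspace learned in A but still starts from an exact zero delta.

```python
import numpy as np

# Hypothetical sketch; the PR's actual reset logic lives in its PyTorch
# TTT loop, and the shapes below are illustrative.
rank, d = 8, 16
rng = np.random.default_rng(1)

def fresh_adapter():
    # Standard per-batch init: random A, zero B.
    return rng.standard_normal((rank, d)) * 0.02, np.zeros((d, rank))

def warm_start_reset(A, B):
    # Warm start: keep A unchanged across batches, re-zero only B so the
    # initial LoRA update (B @ A) for the new batch is exactly zero.
    return A, np.zeros_like(B)

A, B = fresh_adapter()
A = A + 0.1                    # pretend a TTT batch updated both factors
B = B + 0.1
A2, B2 = warm_start_reset(A, B)
assert A2 is A                 # learned directions in A survive the reset
assert not B2.any()            # ...but the delta B @ A starts at zero
```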
Regularization
weight decay
parameters: {"weight_decay":1}
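Since the optimizer listed above is SGD, weight_decay=1 corresponds to the usual L2 term added to the gradient before the step. A toy scalar sketch (the learning rate is an assumption, not from the PR; wd matches the PR's TTT setting):

```python
# Toy SGD-with-weight-decay step on a single scalar parameter.
lr, wd = 0.01, 1.0             # lr illustrative; wd = 1.0 per the PR
param, grad = 0.5, 0.2         # made-up values for the arithmetic

grad_wd = grad + wd * param    # coupled L2 decay: gradient gains wd * param
param = param - lr * grad_wd   # plain SGD update

# 0.5 - 0.01 * (0.2 + 1.0 * 0.5) = 0.493
assert abs(param - 0.493) < 1e-9
```

The stronger pull toward zero counteracts the accumulation in the warm-started A factor, which is the stated motivation for raising weight decay from 0.5 to 1.0.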
Quantization
GPTQ
bits: 7
scope: embeddings

Novel Contributions

  • Alpha/rank scaling for BatchedLinearLoRA to make higher LoRA rank stable
  • Warm-starting LoRA A across batches instead of re-randomizing it each batch
  • Increasing TTT weight decay from 0.5 to 1.0 to counter warm-start overfitting
  • Using rank 128 with alpha 96 to preserve effective magnitude while increasing capacity