PR #1784

Status: open

Record: GatedAttn + Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07081 (3-seed mean)

by renqianluo
val_bpb: 1.0708
Architecture: Transformer
Optimizer:
Artifact Size: 15.98 MB

Training Techniques

Architecture
Gated Attention
Per-head sigmoid gate applied to SDPA output in attention blocks.
parameters: {"per_head":true,"gate_activation":"sigmoid"}
Test-Time Training
LoRA TTT
Rank-128 LoRA adapters with alpha 144 scaling and warm-started A matrices, trained at test time.
parameters: {"rank":128,"alpha":144,"warm_start_a":1}
Regularization
weight decay
parameters: {"weight_decay":1}
Quantization
int8
bits: 8
scope: attn_gate_w
Other
other
Mirrored the attention gate inside the LoRA-TTT forward path so training and scoring use the same gated attention computation.
parameters: null
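
A sketch of the mirroring idea: factor the gated SDPA core into a single helper that both the base model's forward and the LoRA-TTT inline forward call. Names and the gate parameterization follow the earlier sketch and are illustrative.

```python
import torch
import torch.nn.functional as F

def gated_sdpa(q, k, v, x, attn_gate_w):
    """Shared attention core: causal SDPA followed by the per-head sigmoid gate.

    Calling this from both the base forward and the LoRA-TTT inline forward
    ensures the inner-loop training loss and the final scoring pass see the
    identical gated computation, eliminating the train/eval mismatch.
    """
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, H, T, hd)
    gate = torch.sigmoid(x @ attn_gate_w.t())                    # (B, T, H)
    return y * gate.transpose(1, 2).unsqueeze(-1)                # gate each head
```
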
other
Per-row int8 quantization for attention gate weights with one fp16 scale per head to fit under the artifact size cap.
parameters: {"per_row":true}

Novel Contributions

  • Stacks GatedAttn on top of the LoRA-TTT recipe from PR #1767.
  • Mirrors the attention gate inside the LoRA-TTT inline forward path to avoid a train/eval mismatch.
  • Uses per-row int8 quantization for attn_gate_w to keep the artifact under the 16 MB cap with minimal accuracy loss.