PR #1784

Status: open

Record: GatedAttn + Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07081 (3-seed mean)

by renqianluo
val_bpb: 1.0708
Architecture: Transformer
Optimizer:
Artifact Size: 15.98 MB

Training Techniques

Architecture
Gated Attention
Per-head sigmoid gate applied to SDPA output in attention blocks.
parameters: {"per_head":true,"gate_activation":"sigmoid"}
Test-Time Training
LoRA TTT
Rank-128 LoRA adapters with alpha 144 scaling and warm-started A matrices, trained at test time.
parameters: {"rank":128,"alpha":144,"warm_start_a":1}
Regularization
weight decay
parameters: {"weight_decay":1}
Quantization
int8
bits: 8
scope: attn_gate_w
Other
other
Mirrored the attention gate inside the LoRA-TTT forward path so training and scoring use the same gated attention computation.
parameters: null
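
A sketch of the mirroring idea: factor the gated SDPA core into a single helper that both the base model's forward and the LoRA-TTT inline forward call. Names and the gate parameterization follow the earlier sketch and are illustrative.

```python
import torch
import torch.nn.functional as F

def gated_sdpa(q, k, v, x, attn_gate_w):
    """Shared attention core: causal SDPA followed by the per-head sigmoid gate.

    Calling this from both the base forward and the LoRA-TTT inline forward
    ensures the inner-loop training loss and the final scoring pass see the
    identical gated computation, eliminating the train/eval mismatch.
    """
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, H, T, hd)
    gate = torch.sigmoid(x @ attn_gate_w.t())                    # (B, T, H)
    return y * gate.transpose(1, 2).unsqueeze(-1)                # gate each head
```
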
other
Per-row int8 quantization for attention gate weights with one fp16 scale per head to fit under the artifact size cap.
parameters: {"per_row":true}

Novel Contributions

  • Stacks GatedAttn on top of the LoRA-TTT recipe from PR #1767.
  • Mirrors the attention gate inside the LoRA-TTT inline forward path to avoid a train/eval mismatch.
  • Uses per-row int8 quantization for attn_gate_w to keep the artifact under the 16 MB cap with minimal accuracy loss.