PR #1886

open

Record: Fused softcap CE + WD=2.0 (warm-start stability fix) — val_bpb 1.06957 (3-seed mean)

by renqianluo

val_bpb

1.0696

Architecture

Transformer

Optimizer

—

Artifact Size

15.98MB

Training Techniques

Test-Time Training

LoRA TTT

parameters: {"rank":128,"alpha":144,"warm_start_A":1}
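
To make the rank and alpha values concrete, here is a minimal pure-Python sketch of a LoRA forward pass. It assumes the conventional LoRA scaling of alpha / rank, which this summary does not confirm. The helper names (`matvec`, `init_adapter`, `lora_forward`) and the tiny layer width are hypothetical. The `warm_start_A` flag is not modeled, since the summary does not describe its mechanics.

```python
import random

RANK = 128            # "rank" from the PR summary
ALPHA = 144           # "alpha" from the PR summary
SCALE = ALPHA / RANK  # assumption: standard LoRA scaling alpha / rank


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def init_adapter(d_in, d_out, rank, rng):
    """Common LoRA init: small random A, zero B, so the initial update is zero."""
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_forward(w, a, b, x, scale=SCALE):
    """y = W x + scale * B (A x), with A: rank x d_in and B: d_out x rank."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


rng = random.Random(0)
d = 8  # hypothetical layer width, kept tiny for illustration
w = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
a, b = init_adapter(d, d, RANK, rng)
y = lora_forward(w, a, b, x)
```

Under that scaling assumption, the adapter update is multiplied by 144 / 128 = 1.125 before being added to the frozen layer's output.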

Regularization

weight decay

parameters: {"weight_decay":2}
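
As a rough illustration of what doubling the weight decay does per step, the sketch below applies a decoupled (AdamW-style) decay to a plain SGD update. Whether the submission uses decoupled decay is an assumption, and the learning rate and function name are hypothetical.

```python
def decoupled_weight_decay_step(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay: w <- w - lr*g - lr*wd*w."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


lr = 0.01                # hypothetical learning rate
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # isolate the decay term

after_wd1 = decoupled_weight_decay_step(weights, zero_grads, lr, weight_decay=1.0)
after_wd2 = decoupled_weight_decay_step(weights, zero_grads, lr, weight_decay=2.0)
```

With zero gradients, each step shrinks a weight by a factor of (1 - lr * weight_decay), so the larger setting pulls the adapter weights toward zero twice as fast per step.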

logit softcap

parameters: null

Architecture

Gated Attention

Per-head gated attention with per-row int8 gate quantization and gate mirror in the LoRA-TTT path.

parameters: null
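
Per-row int8 quantization can be sketched as follows. This assumes a symmetric scheme with one scale per row (max absolute value mapped to 127), which the summary does not specify. The function names and the example gate values are hypothetical.

```python
def quantize_rows_int8(matrix):
    """Quantize each row to integers in [-127, 127] with its own scale."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Avoid division by zero for an all-zero row.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate float values from int8 codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


gates = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]  # hypothetical gate weights
q_rows, scales = quantize_rows_int8(gates)
restored = dequantize_rows(q_rows, scales)
```

Giving each row its own scale keeps small-magnitude rows from being crushed by a large value elsewhere in the matrix; the reconstruction error per entry is at most half that row's scale.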

weight tying

Not stated explicitly in the PR body. The submission references a stacked Transformer-based parameter-golf model, but no architecture details beyond the listed modifications are confirmed.

parameters: null

Other

other

Fused softcap cross-entropy Triton kernel used during training; evaluation path keeps eager numerics.

parameters: null
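
For reference, an unfused softcap cross-entropy can be written in a few lines. The sketch assumes the common tanh form of the softcap, cap * tanh(logit / cap), and a hypothetical cap of 30.0; the summary gives neither. A fused kernel would compute the same quantity in a single pass without materializing the capped logits, and its accumulation order may differ slightly from this eager version.

```python
import math


def softcap(logits, cap):
    """Squash logits smoothly into (-cap, cap). Assumed tanh form."""
    return [cap * math.tanh(z / cap) for z in logits]


def softcap_cross_entropy(logits, target, cap=30.0):
    """Cross-entropy (in nats) of the target index after softcapping."""
    capped = softcap(logits, cap)
    m = max(capped)  # subtract the max for a numerically stable log-sum-exp
    lse = m + math.log(sum(math.exp(z - m) for z in capped))
    return lse - capped[target]


loss = softcap_cross_entropy([2.0, -1.0, 0.5, 100.0], target=0)
```

The cap bounds how confident any single logit can make the model, so an extreme raw value like 100.0 above contributes at most `cap` to the softmax.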

other

Polar Express NS coefficients.

parameters: null
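
Assuming NS here stands for Newton-Schulz, the iteration these coefficients belong to approximately orthogonalizes a matrix using only matrix multiplications. The summary lists no coefficient values, so the sketch below substitutes the classical cubic iteration (coefficients 1.5 and -0.5) rather than the Polar Express ones. Function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=12):
    """Classical cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X.

    The input is divided by its Frobenius norm first so that all singular
    values start in (0, 1], inside the iteration's convergence region.
    Assumes a nonzero, full-rank input.
    """
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


q = newton_schulz_orthogonalize([[3.0, 1.0], [1.0, 2.0]])
gram = matmul(q, transpose(q))  # close to the identity once converged
```

Tuned coefficient schedules aim to reach the same orthogonal factor in fewer iterations than this fixed cubic polynomial.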

other

Phased TTT with 3 phases.

parameters: {"phases":3}

Novel Contributions

Stacks the fused softcap CE Triton kernel on top of the prior warm-start LoRA submission.
Identifies a divergence on certain seeds that arises from the interaction between the fused CE kernel's fp32 accumulation differences and warm-start A.
Shows that increasing TTT_WEIGHT_DECAY from 1.0 to 2.0 stabilizes all three seeds while preserving the warm-start gain.
Achieves a new 3-seed mean val_bpb of 1.06957, improving over PR #1768.