PR #1886
open
Record: Fused softcap CE + WD=2.0 (warm-start stability fix) — val_bpb 1.06957 (3-seed mean)
by renqianluo
val_bpb
1.0696
Architecture
Transformer
Optimizer
—
Artifact Size
15.98MB
Training Techniques
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144,"warm_start_A":1}
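For orientation, the LoRA parameters above can be read through the standard LoRA formulation, y = W x + (alpha / rank) * B (A x), which gives a scaling factor of 144 / 128 = 1.125. The sketch below is illustrative only and is not taken from the PR: the helper names are hypothetical, the demo dimensions are tiny, and the handling of warm_start_A (reusing a previous A matrix while B restarts at zero) is an assumed reading of the flag, not something this summary confirms.

```python
import random

RANK, ALPHA = 128, 144      # values from the parameters line above
SCALE = ALPHA / RANK        # standard LoRA scaling factor (1.125)


def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=SCALE):
    """Standard LoRA forward pass: y = W x + scale * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def init_adapter(d_in, d_out, rank, prev_A=None, seed=0):
    """Hypothetical adapter init.

    ASSUMPTION: warm_start_A is read here as "reuse the previous A if one
    exists". B always starts at zero, so the adapter is initially a no-op.
    """
    rng = random.Random(seed)
    if prev_A is not None:
        A = prev_A
    else:
        A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B
```

Because B starts at zero in this sketch, the adapted output equals the base output W x until the first test-time update, whether or not A is warm-started.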
Regularization
weight decay
parameters: {"weight_decay":2}
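To make the effect of the weight decay value concrete, here is a minimal sketch assuming decoupled (AdamW-style) decay, where each step multiplies a parameter by (1 - lr * weight_decay). The optimizer is not stated in this summary, so the decoupled form, the function name, and the learning rate used below are all assumptions for illustration.

```python
def decay_step(params, lr, weight_decay):
    """Decoupled weight decay only (no gradient term): p <- p * (1 - lr * wd).

    ASSUMPTION: AdamW-style decoupled decay; the PR summary does not say
    which optimizer or decay form is used.
    """
    factor = 1.0 - lr * weight_decay
    return [p * factor for p in params]


# Hypothetical learning rate, chosen only to make the numbers easy to read.
shrunk_wd1 = decay_step([1.0], lr=0.01, weight_decay=1.0)
shrunk_wd2 = decay_step([1.0], lr=0.01, weight_decay=2.0)
```

Under this assumption, doubling the weight decay from 1.0 to 2.0 doubles the per-step pull of the adapted weights toward zero, which is one plausible way a larger value could damp a divergence.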
logit softcap
parameters: null
Architecture
Gated Attention
Per-head gated attention with per-row int8 gate quantization and gate mirror in the LoRA-TTT path.
parameters: null
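As a rough illustration of per-row int8 quantization, the sketch below uses a symmetric scheme with one scale per row (scale = max |row| / 127). The PR summary does not say which scheme the gate quantization uses, so the symmetric choice, the clipping range, and the function names are assumptions.

```python
def quantize_rows_int8(matrix):
    """Symmetric per-row int8 quantization (ASSUMED scheme).

    Returns integer rows with values in [-127, 127] plus one float scale
    per row.
    """
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map the int8 values back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

Keeping one scale per row means a row with small gate weights is not forced onto the grid of a row with large ones, at the cost of storing one extra float per row.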
weight tying
The PR body does not explicitly confirm weight tying. The submission builds on a stacked Transformer-based parameter-golf model, and no architecture details beyond the listed modifications are confirmed.
parameters: null
Other
other
A fused softcap cross-entropy Triton kernel is used during training; the evaluation path keeps the eager (unfused) implementation, so evaluation numerics are unchanged.
parameters: null
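For reference, an unfused softcap cross-entropy can be sketched as below, using the common tanh form of logit softcapping, cap * tanh(z / cap), followed by a numerically stable log-sum-exp. This is a plain-Python illustration and not the Triton kernel from the PR; the cap value of 30.0 is a hypothetical placeholder, since the summary lists no softcap parameters.

```python
import math


def softcap(logits, cap):
    """Squash logits smoothly into (-cap, cap) via cap * tanh(z / cap)."""
    return [cap * math.tanh(z / cap) for z in logits]


def softcap_cross_entropy(logits, target, cap=30.0):
    """Cross-entropy (in nats) of softcapped logits for one position.

    ASSUMPTION: cap=30.0 is a placeholder, not a value from the PR.
    """
    capped = softcap(logits, cap)
    m = max(capped)  # subtract the max for a stable log-sum-exp
    lse = m + math.log(sum(math.exp(z - m) for z in capped))
    return lse - capped[target]
```

A fused kernel computes the same quantity without materializing the capped logits as a separate tensor. Reductions performed in a different order or precision can then produce slightly different floating-point results than this eager form, which is the kind of discrepancy the contributions below refer to.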
other
Polar Express NS coefficients.
parameters: null
other
Phased TTT with 3 phases.
parameters: {"phases":3}
Novel Contributions
- Stacks the fused softcap CE Triton kernel on top of the prior warm-start LoRA submission.
- Identifies a divergence interaction between fused CE fp32 accumulation differences and warm-start A on certain seeds.
- Shows that increasing TTT_WEIGHT_DECAY from 1.0 to 2.0 stabilizes all three seeds while preserving the warm-start gain.
- Achieves a new 3-seed mean val_bpb of 1.06957, improving over PR #1768.