PR #1530

open

Record: Varlen attention + fused MLP + doc-independent TTT (1.07643)

by samacqua
val_bpb
1.0764
Architecture
Transformer
Optimizer
Artifact Size
~15.99 MB

Training Techniques

Architecture
attention modification
Replaced dense causal attention with Flash Attention 3 variable-length attention so tokens only attend within each packed document.
parameters: null
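To make the varlen semantics concrete, here is a plain-Python illustration (not the PR's kernel) of the mask that Flash Attention 3 variable-length attention enforces implicitly: `cu_seqlens` marks document boundaries in the packed token buffer, and token `i` may attend to token `j` only when both fall in the same document and `j <= i`.

```python
def varlen_causal_mask(cu_seqlens):
    """Return mask[i][j] = True where attention is allowed.

    cu_seqlens: cumulative sequence lengths, e.g. [0, 2, 5] for two
    packed documents of lengths 2 and 3. FA3 never materializes this
    mask; it skips the masked-out work entirely, which is where the
    FLOP savings come from.
    """
    total = cu_seqlens[-1]
    # Document id for every packed token position.
    doc = []
    for d in range(len(cu_seqlens) - 1):
        doc.extend([d] * (cu_seqlens[d + 1] - cu_seqlens[d]))
    return [[doc[i] == doc[j] and j <= i for j in range(total)]
            for i in range(total)]

# Two packed documents of lengths 2 and 3:
mask = varlen_causal_mask([0, 2, 5])
assert mask[1][0] is True    # same document, causal
assert mask[2][1] is False   # token 2 starts a new document
```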
LeakyReLU
The fused MLP applies LeakyReLU(slope=0.5) followed by squaring inside a custom Triton kernel.
parameters: {"slope":0.5}
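The pointwise math the Triton kernel fuses with the up-projection is simple enough to state directly; this is just the activation, in plain Python, for reference:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring, as fused in the kernel."""
    y = x if x >= 0 else slope * x
    return y * y

assert leaky_relu_sq(2.0) == 4.0    # positive branch: x^2
assert leaky_relu_sq(-2.0) == 1.0   # negative branch: (0.5 * x)^2
```

Note that squaring makes the output non-negative on both branches; the slope controls how strongly negative pre-activations contribute.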
Test-Time Training
LoRA TTT
parameters: {"scope":"per-document","independent_sequences":true}
score-first TTT
parameters: {"batched_loras":true}
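As a sketch of the LoRA mechanics involved (not the PR's implementation): per-document TTT keeps the base weight `W` frozen and adapts only low-rank factors `A` and `B` on each validation document independently; `batched_loras` means many documents' adapter pairs are trained and applied together. The effective weight for one document is:

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def effective_weight(W, A, B, scale):
    """W_eff = W + scale * (B @ A); only A and B are updated by TTT."""
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out=2, d_in=2)
A = [[1.0, 1.0]]               # down-projection (r=1, d_in=2)
B = [[1.0], [0.0]]             # up-projection (d_out=2, r=1)
W_eff = effective_weight(W, A, B, scale=0.5)
assert W_eff == [[1.5, 0.5], [0.0, 1.0]]
```

Because documents are independent, one document's adapted weights never leak into another's evaluation.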
Evaluation
sliding window eval
parameters: null
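Sliding-window evaluation typically gives every token near-maximal context by scoring only the trailing tokens of each overlapping window; the indexing below is a hedged sketch of that pattern (the exact window/stride values are not stated in the PR):

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) triples covering n_tokens.

    Each window feeds tokens [start, end) to the model but scores loss
    only on tokens [score_from, end), so scored tokens see up to
    `window` tokens of left context.
    """
    out = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        out.append((start, end, pos))
        pos = end
    return out

wins = sliding_windows(n_tokens=10, window=4, stride=2)
assert wins[0] == (0, 2, 0)   # first window has no extra context yet
assert wins[2] == (2, 6, 4)   # later windows carry 2 context tokens
```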

Novel Contributions

  • Variable-length attention with packed documents to avoid cross-document attention and reduce FLOPs
  • Fused MLP kernel combining up-projection, LeakyReLU(0.5), and squaring
  • Grouping many small parameters into a single all-reduce to reduce communication overhead
  • Document-independent LoRA test-time training applied separately to each validation sequence
  • Faster TTT implementation enabling smaller chunk sizes and better performance
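The grouped all-reduce contribution above can be sketched as follows (names and shapes are illustrative, not from the PR): rather than launching one collective per small gradient tensor, flatten them into a single contiguous buffer, reduce once, and scatter the result back.

```python
def flatten(tensors):
    """Pack a list of small 1-D tensors into one contiguous buffer."""
    buf, shapes = [], []
    for t in tensors:
        shapes.append(len(t))
        buf.extend(t)
    return buf, shapes

def unflatten(buf, shapes):
    """Split the fused buffer back into the original tensor shapes."""
    out, i = [], 0
    for n in shapes:
        out.append(buf[i:i + n])
        i += n
    return out

def all_reduce_sum(bufs):
    """Stand-in for a single collective over the fused buffer."""
    return [sum(vals) for vals in zip(*bufs)]

# Two "ranks", each holding two small gradient tensors:
rank0 = [[1.0, 2.0], [3.0]]
rank1 = [[0.5, 0.5], [1.0]]
b0, shapes = flatten(rank0)
b1, _ = flatten(rank1)
reduced = unflatten(all_reduce_sum([b0, b1]), shapes)
assert reduced == [[1.5, 2.5], [4.0]]
```

The saving comes from paying the per-collective launch and latency cost once instead of once per parameter.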