PR #1530
openRecord: Varlen attention + fused MLP + doc-independent TTT (1.07643)
by samacqua
val_bpb
1.0764
Architecture
Transformer
Optimizer
—
Artifact Size
~15.99 MB
Training Techniques
Architecture
attention modification
Replaced dense causal attention with Flash Attention 3 variable-length attention so tokens only attend within each packed document.
parameters: null
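The record itself only names FA3 varlen attention; as a minimal NumPy sketch of the *masking semantics* it implements (FA3 never materializes this mask), the `cu_seqlens` boundary format below mirrors the cumulative-length interface that Flash Attention's varlen entry points use:

```python
import numpy as np

def varlen_causal_mask(cu_seqlens):
    """Boolean mask equivalent to varlen attention over packed documents:
    token i may attend to token j only if j <= i AND both tokens fall in
    the same document, where documents are delimited by cumulative
    sequence lengths cu_seqlens = [0, len0, len0+len1, ...]."""
    total = cu_seqlens[-1]
    mask = np.zeros((total, total), dtype=bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        n = end - start
        # Causal (lower-triangular) attention inside each document block.
        mask[start:end, start:end] = np.tril(np.ones((n, n), dtype=bool))
    return mask

# Two packed documents of lengths 3 and 2.
m = varlen_causal_mask([0, 3, 5])
assert m[1, 0] and not m[0, 1]   # causal within document 0
assert not m[3, 2]               # no cross-document attention
```

Because attention is block-diagonal, the FLOP savings over dense causal attention grow with the number of documents packed per sequence.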
LeakyReLU
Fused MLP uses LeakyReLU(0.5) squared inside a custom Triton kernel.
parameters: {"slope":0.5}
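The activation fused into the Triton kernel is simple enough to state exactly; this NumPy reference shows the math (the real kernel fuses it with the up-projection, which is omitted here):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring: (max(x, slope*x))^2
    for slope in (0, 1). Note the square makes the output non-negative
    even on the leaky (negative-input) branch."""
    y = np.where(x >= 0, x, slope * x)
    return y * y

print(leaky_relu_sq(np.array([2.0, -2.0])))  # → [4. 1.]
```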
Test-Time Training
LoRA TTT
parameters: {"scope":"per-document","independent_sequences":true}
score-first TTT
parameters: {"batched_loras":true}
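The record states that LoRA adapters are trained independently per document and batched together. A hedged sketch of the batched-LoRA *forward pass* under assumed shapes (the shapes, einsum layout, and zero-init of `B` are illustrative, not taken from the PR):

```python
import numpy as np

# Hypothetical sizes: D documents, rank-r LoRA on a d_in x d_out linear layer.
D, T, d_in, d_out, r = 4, 5, 8, 8, 2
rng = np.random.default_rng(0)

W = rng.normal(size=(d_in, d_out))           # shared frozen weight
A = rng.normal(size=(D, d_in, r)) * 0.01     # per-document LoRA factor A
B = np.zeros((D, r, d_out))                  # B = 0 so each LoRA starts as a no-op
x = rng.normal(size=(D, T, d_in))            # one validation sequence per document

# Batched forward: every document applies its own adapter in one einsum,
# so D independent test-time-training problems run as a single batch.
y = x @ W + np.einsum('btd,bdr,bro->bto', x, A, B)

assert y.shape == (D, T, d_out)
assert np.allclose(y, x @ W)  # zero-initialized adapters leave outputs unchanged
```

During TTT each document's `(A, B)` pair would be updated only on that document's tokens, keeping the adaptation per-document and order-independent.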
Evaluation
sliding window eval
parameters: null
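The record does not spell out the sliding-window scheme, so the span layout below is an assumption: each window of `window` tokens is scored only on its final `stride` tokens, giving every scored token (after the first chunk) at least `window - stride` tokens of context.

```python
def sliding_window_spans(n_tokens, window, stride):
    """Return (ctx_start, end, score_from) triples: tokens in
    [score_from, end) are scored with context [ctx_start, end)."""
    spans, pos = [], 0
    while pos < n_tokens:
        ctx_start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((ctx_start, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(12, window=8, stride=4)
scored = [t for (s, e, p) in spans for t in range(p, e)]
assert scored == list(range(12))               # every token scored exactly once
assert all(e - s <= 8 for (s, e, p) in spans)  # each window fits the context size
```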
Novel Contributions
- Variable-length attention with packed documents to avoid cross-document attention and reduce FLOPs
- Fused Triton MLP kernel combining the up-projection, LeakyReLU(0.5), and squaring in a single pass
- Grouping many small parameters into a single all-reduce to reduce communication overhead
- Document-independent LoRA test-time training applied separately to each validation sequence
- Faster TTT implementation enabling smaller chunk sizes and better performance
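The gradient-grouping contribution above can be sketched framework-agnostically: flatten many small gradients into one contiguous buffer, issue one collective, then scatter back. The `allreduce` callable here is a stand-in for a real collective such as `torch.distributed.all_reduce`:

```python
import numpy as np

def grouped_allreduce(grads, allreduce):
    """Replace one all-reduce per small tensor with a single all-reduce
    over a flattened buffer, then unflatten back to the original shapes.
    `allreduce` is a placeholder for the framework's collective op."""
    flat = np.concatenate([g.ravel() for g in grads])
    flat = allreduce(flat)                       # one collective call total
    out, offset = [], 0
    for g in grads:
        out.append(flat[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return out

# Toy two-rank "all-reduce" that sums this rank's buffer with the peer's.
grads_rank0 = [np.ones((2, 2)), np.ones(3)]
grads_rank1 = [np.full((2, 2), 2.0), np.full(3, 2.0)]
peer = np.concatenate([g.ravel() for g in grads_rank1])
summed = grouped_allreduce(grads_rank0, lambda f: f + peer)

assert np.allclose(summed[0], 3.0) and np.allclose(summed[1], 3.0)
```

One large collective amortizes per-call launch and latency costs that dominate when the tensors are tiny, which is the overhead the record targets.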