PR #1354

open

add varlen+fused mlp+ttt record

by samacqua
val_bpb: 1.1092
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.9 MB

Training Techniques

Architecture
attention
Variable-length Flash Attention 3 using flash_attn_varlen_func with packed documents and cu_seqlens so attention stays within each document.
parameters: null
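
A minimal sketch of how the packed-document call could look, assuming the flash-attn varlen interface (the exact FA3 import path and signature may differ); doc_lens and the helper names below are illustrative, not the record's actual code:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func  # FA3 exposes a similar varlen entry point

def cu_seqlens_from_doc_lens(doc_lens: torch.Tensor) -> torch.Tensor:
    # cu_seqlens is the int32 prefix sum of per-document lengths, e.g. [3, 2] -> [0, 3, 5]
    return F.pad(doc_lens.cumsum(0), (1, 0)).to(torch.int32)

def packed_doc_attention(q, k, v, doc_lens):
    # q, k, v: (total_tokens, n_heads, head_dim), all documents packed along dim 0
    cu_seqlens = cu_seqlens_from_doc_lens(doc_lens).to(q.device)
    max_seqlen = int(doc_lens.max())
    # Causal attention that never crosses a document boundary
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )
```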
MLP
Fuses the up-projection, LeakyReLU activation with negative slope 0.5, and elementwise squaring into a single custom Triton kernel.
parameters: {"activation":"LeakyReLU","activation_power":2}
Test-Time Training
LoRA TTT
parameters: {"chunk_size":32,"optimizer":"RMSProp via Adam beta1=0"}
Quantization
int6
bits: 6
scope: model
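
A minimal sketch of symmetric 6-bit weight quantization for the checkpoint artifact; per-tensor scaling is an assumption here, and the packing of 6-bit values into bytes is omitted:

```python
import torch

def quantize_int6(w: torch.Tensor):
    # Symmetric 6-bit quantization: signed range [-31, 31], one scale per tensor.
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Restore approximate float weights at load time.
    return q.to(torch.float32) * scale
```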

Novel Contributions

  • Variable-length attention with Flash Attention 3 to avoid cross-document attention during training
  • Fused MLP Triton kernel combining the up-projection, LeakyReLU (negative slope 0.5), and squaring
  • Reintroduced and optimized LoRA-based test-time training
  • Smaller TTT chunk size and RMSProp-style optimization for better long-context gains