val_bpb: 1.1092
Architecture: Transformer
Optimizer: Adam
Artifact Size: ~15.9 MB
Training Techniques
Architecture
attention
Variable-length Flash Attention 3 using flash_attn_varlen_func with packed documents and cu_seqlens so attention stays within each document.
parameters: null
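A minimal sketch of the packed-document attention pattern described above, using the flash-attn package's `flash_attn_varlen_func` (FA3 exposes a similarly named varlen entry point). The document lengths, tensor shapes, and head counts here are illustrative assumptions, not values from the submission.

```python
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn package

# Illustrative: three packed documents of lengths 5, 3, and 8 tokens.
doc_lens = torch.tensor([5, 3, 8], dtype=torch.int32, device="cuda")
cu_seqlens = torch.nn.functional.pad(doc_lens.cumsum(0, dtype=torch.int32), (1, 0))
# cu_seqlens = [0, 5, 8, 16]: boundaries of each document in the packed sequence.

total_tokens, n_heads, head_dim = int(doc_lens.sum()), 8, 64
q = torch.randn(total_tokens, n_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention that never crosses a cu_seqlens boundary,
# so each token attends only within its own document.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(doc_lens.max()), max_seqlen_k=int(doc_lens.max()),
    causal=True,
)
```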
MLP
Up-projection, LeakyReLU activation (negative slope 0.5), and squaring fused into a single custom Triton kernel.
parameters: {"activation":"LeakyReLU","activation_power":2}
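The submission's kernel reportedly fuses the up-projection matmul as well; the sketch below covers only the LeakyReLU(0.5)-then-square activation as a standalone Triton kernel, with names and block size chosen purely for illustration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def leaky_relu_sq_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.where(x >= 0, x, 0.5 * x)            # LeakyReLU with negative slope 0.5
    tl.store(out_ptr + offs, y * y, mask=mask)  # then square: LeakyReLU(0.5)^2

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    """Elementwise LeakyReLU(0.5)^2; the actual kernel also fuses the up-projection."""
    x = x.contiguous()
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    leaky_relu_sq_kernel[grid](x, out, n, BLOCK_SIZE=1024)
    return out
```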
Test-Time Training
LoRA TTT
parameters: {"chunk_size":32,"optimizer":"RMSProp via Adam beta1=0"}
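A hedged sketch of the test-time-training loop implied by these parameters: only the LoRA parameters receive gradient updates, the sequence is processed in 32-token chunks, and Adam with beta1=0 acts as the RMSProp-style optimizer. The function name, learning rate, and frozen-base-model assumption are illustrative, not taken from the submission.

```python
import torch
import torch.nn.functional as F

def ttt_adapt(model, tokens, lora_params, chunk_size=32, lr=1e-3):
    """Online LoRA adaptation: after each 32-token chunk, take one
    RMSProp-style step (Adam with beta1=0) on that chunk's LM loss.
    Assumes all non-LoRA parameters have requires_grad=False."""
    opt = torch.optim.Adam(lora_params, lr=lr, betas=(0.0, 0.999))  # beta1=0 -> RMSProp-like
    for start in range(0, tokens.size(1) - 1, chunk_size):
        chunk = tokens[:, start : start + chunk_size + 1]
        logits = model(chunk[:, :-1])                    # predict next tokens in the chunk
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), chunk[:, 1:].reshape(-1)
        )
        opt.zero_grad(set_to_none=True)
        loss.backward()                                  # gradients flow only into LoRA params
        opt.step()
```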
Quantization
int6
bits: 6
scope: model
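A minimal sketch of 6-bit weight quantization consistent with bits: 6 and scope: model; the symmetric per-tensor scheme, clipping range, and int8 storage below are assumptions made for illustration.

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor quantization to 6 bits: signed levels in [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                                 # 31
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)  # 6 bits, stored in int8
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```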
Novel Contributions
- Variable-length attention with Flash Attention 3 to avoid cross-document attention during training
- Fused MLP Triton kernel combining the up-projection, LeakyReLU (negative slope 0.5), and squaring
- Reintroduced and optimized LoRA-based test-time training
- Smaller TTT chunk size and RMSProp-style optimization for better long-context gains