PR #1270

open

Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088)

by VirajDeshwal
val_bpb
1.1088
Architecture
Transformer
Optimizer
Artifact Size
~15.82 MB

Training Techniques

Architecture
Unified Attention
Replaces separate Q/K/V projections with a single unified attention projection matrix, reducing attention projection parameters.
parameters: null
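The PR does not include the projection code itself. One way to read "a single unified attention projection matrix" that actually reduces parameters (as the description claims) is sharing one matrix for Q, K, and V, rather than merely fusing three separate matrices into one matmul. A minimal single-head NumPy sketch under that assumption, with all names and shapes illustrative:

```python
import numpy as np

def unified_attention(x, w_u, w_o):
    """Causal self-attention with one shared projection for Q, K, and V.

    Instead of three d x d matrices, a single d x d matrix w_u produces
    one projection u used as query, key, and value, cutting attention
    projection parameters roughly to a third. Single-head for brevity.
    """
    T, d = x.shape
    u = x @ w_u                                  # shared Q/K/V projection
    scores = (u @ u.T) / np.sqrt(d)              # q == k == u
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e30, scores)       # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return (w @ u) @ w_o                         # v == u, then output proj
```

Note that a fused-but-separate QKV weight (one `d x 3d` matmul) would not reduce the parameter count, which is why the shared-projection reading is sketched here; the PR's actual factorization may differ.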
FA3
Uses FlashAttention 3 for attention computation.
parameters: null
Quantization
QAT
bits: null
scope: model
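The quantization entry lists QAT with `bits: null`, so the precision is unspecified. A common QAT building block is straight-through fake quantization: the forward pass sees quantized values while full-precision weights are kept for the update. A sketch assuming a symmetric per-tensor scheme (the 8-bit default below is illustrative, not from the PR):

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Fake-quantize a weight tensor for QAT.

    Values are rounded onto a symmetric integer grid and dequantized,
    so the forward pass sees quantization error while the stored
    weights stay full precision. bits=8 is an assumed placeholder.
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                 # dequantized values used in forward
```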
Weight Averaging
EMA
parameters: {"decay":0.997}
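The EMA entry does report its parameter (decay 0.997). The update itself is a one-liner applied per tensor each step, evaluated with the averaged weights at the end:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step with the decay the PR reports (0.997):
    ema <- decay * ema + (1 - decay) * params, elementwise per tensor."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```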
LR Schedule
warmdown
parameters: {"warmdown_steps":10000}
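The schedule names `warmdown_steps: 10000` but not the decay shape. A constant learning rate followed by a linear ramp to zero over the final warmdown steps is a common choice and is assumed in this sketch:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=10000):
    """Constant LR, then linear "warmdown" to zero over the final
    warmdown_steps (10000 here, per the PR). The decay shape is an
    assumption; the PR does not state it."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```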
Test-Time Training
full TTT
parameters: {"epochs":3}
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Unified Attention architecture
  • FlashAttention 3 integration
  • Longer 1-hour training schedule showing scaling gains
  • Demonstration that unified attention benefits from extended warmdown training
  • Improved val_bpb to 1.1088 under unlimited compute