PR #1270
Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088)
by VirajDeshwal
val_bpb: 1.1088
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.82 MB
Training Techniques
Architecture
Unified Attention
Replaces the separate Q/K/V projection matrices with a single shared attention projection, reducing the attention projection parameter count.
parameters: null
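A minimal numpy sketch of the idea, assuming "unified" means one shared projection serving as Q, K, and V (roughly 3x fewer projection parameters); the PR's actual parameterization may differ, e.g. it could still derive distinct q/k/v views from the single matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unified_attention(x, W_u):
    # Hypothetical sketch: one projection stands in for separate
    # W_q, W_k, W_v, so q, k, v are all the same projected tensor.
    T, d = x.shape
    h = x @ W_u                  # single projection instead of three
    q = k = v = h
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t attends only to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e30, scores)
    return softmax(scores) @ v

x = np.random.randn(8, 16)
W_u = np.random.randn(16, 16) / 4
out = unified_attention(x, W_u)
```

With the causal mask, the first token attends only to itself, so its output equals its own projection; later rows are convex combinations of earlier projected tokens.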
FA3
Uses FlashAttention 3 for attention computation.
parameters: null
Quantization
QAT
bits: null
scope: model
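The record leaves the bit width unspecified; a generic fake-quantization step of the kind QAT inserts into the forward pass might look like this (symmetric per-tensor quantization, `bits=8` as an illustrative default, not a value from the PR):

```python
import numpy as np

def fake_quantize(w, bits=8):
    # Quantize to a signed integer grid, then dequantize, so training
    # sees the quantization error ("fake quant" as used in QAT).
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.5, -1.0, 0.1234, 0.0])
wq = fake_quantize(w, bits=8)
```

In full QAT the backward pass typically treats the rounding as identity (straight-through estimator), which this sketch omits.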
Weight Averaging
EMA
parameters: {"decay":0.997}
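The EMA update with the PR's decay of 0.997 is one line per parameter; the averaged copy is the one evaluated, smoothing late-training noise:

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of weights: each step blends a small
    # fraction (1 - decay) of the live weights into the shadow copy.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: shadow copy chasing a constant weight of 1.0.
ema = [0.0]
for _ in range(5):
    ema = ema_update(ema, [1.0])
```

After n steps from zero toward a constant target, the shadow value is 1 - decay**n, so decay=0.997 averages over roughly the last few hundred steps.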
LR Schedule
warmdown
parameters: {"warmdown_steps":10000}
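The record gives only the schedule name and warmdown_steps=10000; a common warmdown shape, assumed here, is a constant LR followed by a linear decay to zero over the final warmdown_steps:

```python
def lr_warmdown(step, total_steps, base_lr, warmdown_steps=10000):
    # Constant base_lr until the warmdown window begins, then linear
    # decay to 0 over the last warmdown_steps (shape is an assumption;
    # only warmdown_steps=10000 comes from the PR record).
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

Longer training with this schedule stretches the constant phase while keeping the decay window fixed, which is one way the 1-hour run could benefit.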
Test-Time Training
full TTT
parameters: {"epochs":3}
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- Unified Attention architecture
- FlashAttention 3 integration
- Longer 1-hour training schedule showing scaling gains
- Demonstration that unified attention benefits from extended warmdown training
- Improved val_bpb to 1.1088 under unlimited compute