PR #1560
open
Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)
by dexhunter
val_bpb
1.0741
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
attention
VarLen attention with per-document cu_seqlens and strict causal masking
parameters: null
MLP
Triton fused MLP implementation
parameters: null
LoRA TTT
doc-independent LoRA-based test-time training stack
parameters: null
LeakyReLU
LeakyReLU activation used in the MLP
parameters: {"slope":0.5}
weight tying
Tied embeddings
parameters: null
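The VarLen attention entry above refers to per-document cu_seqlens. As a rough illustration of what that layout means, the sketch below builds cumulative sequence offsets for a packed batch and checks which query/key pairs are allowed. The helper names (`build_cu_seqlens`, `can_attend`) are hypothetical and not taken from the PR. It also assumes the usual causal convention, in which a token may attend to itself and to earlier tokens of the same document only.

```python
from bisect import bisect_right
from itertools import accumulate


def build_cu_seqlens(doc_lengths):
    """Cumulative offsets for documents packed back to back.

    For lengths [3, 2, 4] the packed buffer holds 9 tokens and the
    document boundaries sit at offsets 0, 3, 5 and 9.
    """
    return [0] + list(accumulate(doc_lengths))


def can_attend(cu_seqlens, q, k):
    """True if packed position q may attend to packed position k.

    Assumption: attention is confined to one document and is causal
    (k <= q). This mirrors the mask a varlen kernel applies implicitly.
    """
    doc_q = bisect_right(cu_seqlens, q) - 1  # index of the document holding q
    doc_k = bisect_right(cu_seqlens, k) - 1  # index of the document holding k
    return doc_q == doc_k and k <= q


cu = build_cu_seqlens([3, 2, 4])
print(cu)                     # offsets of the three packed documents
print(can_attend(cu, 4, 3))   # same document, earlier token
print(can_attend(cu, 3, 2))   # crosses a document boundary
```

In a real kernel the mask is never materialized. The offsets are passed to the attention call, which then skips cross-document pairs entirely.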
LR Schedule
warmdown
parameters: {"warmdown_frac":0.75}
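To make the warmdown_frac value concrete, here is a minimal sketch of a learning-rate multiplier that stays flat and then decays over the final fraction of training. The linear shape, the decay to zero and the function name `lr_multiplier` are assumptions for illustration. The card only states that the warmdown fraction is 0.75.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.75):
    """Scale factor applied to the base learning rate at a given step.

    Assumed shape: constant at 1.0, then a linear ramp to 0.0 over the
    last `warmdown_frac` of the run.
    """
    if warmdown_frac <= 0:
        return 1.0  # no warmdown phase configured
    warmdown_steps = total_steps * warmdown_frac
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


# With 1000 steps and warmdown_frac=0.75 the decay begins at step 250.
for step in (0, 250, 625, 1000):
    print(step, lr_multiplier(step, 1000))
```

Under these assumptions, three quarters of the run are spent decaying and only the first quarter runs at the full base rate.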
Test-Time Training
LoRA TTT
parameters: {"chunk_size":48}
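The chunk_size parameter can be read as the granularity at which a document is processed during test-time training. The sketch below shows one plausible loop: adapter state is reset per document, each chunk is scored before the adapter is updated on it, and chunks are 48 tokens long. The score-then-adapt ordering and the callbacks `score_fn`, `adapt_fn` and `reset_fn` are hypothetical placeholders, not the PR's actual interface.

```python
def split_into_chunks(tokens, chunk_size=48):
    """Split one document's tokens into consecutive chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def evaluate_document(tokens, score_fn, adapt_fn, reset_fn, chunk_size=48):
    """Accumulate loss over one document with per-chunk adaptation.

    Assumptions: reset_fn() restores fresh LoRA weights so documents stay
    independent; score_fn(chunk) returns the chunk's loss before any
    update on that chunk; adapt_fn(chunk) takes the update step.
    """
    reset_fn()
    total_loss = 0.0
    for chunk in split_into_chunks(tokens, chunk_size):
        total_loss += score_fn(chunk)  # score first, so no chunk is trained on before it is graded
        adapt_fn(chunk)
    return total_loss


# Toy usage with stand-in callbacks that only record the call order.
events = []
loss = evaluate_document(
    list(range(100)),
    score_fn=lambda c: events.append(("score", len(c))) or 1.0,
    adapt_fn=lambda c: events.append(("adapt", len(c))),
    reset_fn=lambda: events.append(("reset", 0)),
)
print(loss, events)
```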
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: null
Regularization
logit softcap
parameters: {"value":30}
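For readers unfamiliar with logit softcapping, a common formulation squashes logits through a scaled tanh so they can never exceed the cap in magnitude. The sketch below uses that form with the listed value of 30. Whether the PR uses exactly this tanh variant is an assumption, and `softcap` is an illustrative name.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly bound each logit to the open interval (-cap, cap).

    Small logits pass through almost unchanged because tanh(x) is close
    to x near zero. Large logits saturate toward +/- cap.
    """
    return [cap * math.tanh(x / cap) for x in logits]


print(softcap([0.0, 1.0, 30.0, 1000.0]))
```

Unlike hard clipping, this mapping is differentiable everywhere, so gradients still flow for large logits, although they shrink as the logit approaches the cap.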
Novel Contributions
- VarLen attention with per-document cu_seqlens and strict causal masking
- Triton fused MLP
- doc-TTT stack
- Warmdown fraction increased to 0.75
- TTT chunk size increased to 48
- Muon momentum tuned to 0.97