PR #1560 (open)

Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)

by dexhunter (View on GitHub)
val_bpb: 1.0741
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
  • attention: VarLen attention with per-document cu_seqlens and strict causal masking
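The varlen attention entry packs multiple documents into one sequence and passes cumulative sequence lengths (cu_seqlens) so attention never crosses document boundaries. A minimal pure-Python sketch of the cu_seqlens convention and the per-document causal rule it implies (function names are illustrative; in practice this indexing would be handled by a varlen attention kernel, not Python loops):

```python
def build_cu_seqlens(doc_lens):
    """Cumulative sequence lengths for packed documents:
    [0, len0, len0+len1, ...]; document i spans [cu[i], cu[i+1])."""
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

def allowed(q, k, cu_seqlens):
    """True iff query position q may attend to key position k:
    k must lie in the same document as q, at or before q (causal
    within the document, no cross-document attention)."""
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        if start <= q < end:
            return start <= k <= q
    return False
```

With two packed documents of lengths 3 and 2, cu_seqlens is [0, 3, 5]: position 3 (start of doc 2) cannot attend to position 2 (end of doc 1), even though 2 < 3.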
  • MLP: Triton fused MLP implementation
  • LoRA TTT: doc-independent LoRA-based test-time training stack
  • LeakyReLU: LeakyReLU activation used in the MLP (slope: 0.5)
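The record's MLP uses LeakyReLU with slope 0.5; the Triton kernel presumably fuses the two matmuls and the activation into one pass. A pure-Python reference for the computation such a fused kernel would produce (weight layout and names are illustrative, not taken from the PR):

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU with the record's slope of 0.5."""
    return x if x >= 0.0 else slope * x

def mlp(x, w_in, w_out, slope=0.5):
    """Reference MLP: y = LeakyReLU(x @ w_in) @ w_out.
    x: list of d_model floats; w_in: d_model x d_hidden;
    w_out: d_hidden x d_model (all plain nested lists)."""
    h = [leaky_relu(sum(xi * w_in[i][j] for i, xi in enumerate(x)), slope)
         for j in range(len(w_in[0]))]
    return [sum(hj * w_out[j][k] for j, hj in enumerate(h))
            for k in range(len(w_out[0]))]
```

A fused kernel computes the same function but keeps the hidden activations in registers/shared memory instead of materializing them to global memory between the two matmuls.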
  • weight tying: tied embeddings
LR Schedule
  • warmdown (warmdown_frac: 0.75)
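A warmdown_frac of 0.75 means the final 75% of training runs under the warmdown. The PR does not state the decay shape; a sketch assuming the common pattern of constant LR followed by a linear decay to zero:

```python
def lr_scale(step, total_steps, warmdown_frac=0.75):
    """LR multiplier: 1.0 for the first (1 - warmdown_frac) of
    training, then linear decay to 0 over the final warmdown_frac.
    (Linear shape is an assumption, not stated in the record.)"""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

With 1000 total steps, the scale stays at 1.0 through step 249, then decays linearly: 0.5 at step 625, 0.0 at step 1000.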
Test-Time Training
  • LoRA TTT (chunk_size: 48)
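A chunk_size of 48 means the test sequence is processed in 48-token chunks, with the LoRA adapters updated between chunks. A skeleton of that loop (the `predict` and `adapt` callables stand in for the model forward pass and the LoRA update, which the record does not detail):

```python
def chunk_spans(seq_len, chunk_size=48):
    """Split [0, seq_len) into consecutive spans of chunk_size
    tokens; the last span may be shorter."""
    return [(s, min(s + chunk_size, seq_len))
            for s in range(0, seq_len, chunk_size)]

def ttt_forward(tokens, predict, adapt, chunk_size=48):
    """Test-time-training loop skeleton: predict each chunk with the
    current adapters, then adapt the LoRA weights on that chunk before
    moving on. `predict` and `adapt` are illustrative placeholders."""
    outputs = []
    for s, e in chunk_spans(len(tokens), chunk_size):
        outputs.extend(predict(tokens[s:e]))  # forward with current adapters
        adapt(tokens[s:e])                    # LoRA update on this chunk
    return outputs
```

Because adaptation happens only after a chunk is predicted, each token's prediction still depends only on earlier tokens, preserving causality across chunks.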
Optimizer
  • Muon (momentum: 0.97)
Regularization
  • logit softcap (value: 30)
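Logit softcapping with value 30 smoothly bounds the output logits to (-30, 30), which is conventionally done with a scaled tanh (the exact formulation in this PR is assumed):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap, saturating at +/- cap."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged (softcap(1.0) differs from 1.0 by well under 0.1%), while arbitrarily large logits are compressed toward 30, preventing extreme values from dominating the loss.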

Novel Contributions

  • VarLen attention with per-document cu_seqlens and strict causal masking
  • Triton fused MLP
  • Doc-independent LoRA TTT stack
  • Warmdown fraction increased to 0.75
  • TTT chunk size increased to 48
  • Muon momentum tuned to 0.97