PR #1540 (open)

Record: SP8192 + VarLen Attention + LoRA TTT + Fused MLP — val_bpb 1.0777 (3-seed mean)

by aryanbhosale
val_bpb: 1.0777
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Sequence Length
  sequence_length
    train_length: 8192
    eval_length: 8192
Architecture
  attention: VarLen attention with within-document boundaries only
    parameters: null
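The key bookkeeping for varlen attention is turning per-token document ids in a packed batch into cumulative sequence lengths, so the kernel never attends across a document boundary. A minimal sketch of that step (the `cu_seqlens` name follows the flash-attn varlen convention; the helper itself is illustrative, not the PR's code):

```python
import numpy as np

def cu_seqlens_from_doc_ids(doc_ids):
    """Build cumulative sequence lengths for a packed sequence so that a
    varlen attention kernel (e.g. flash-attn's varlen interface) attends
    only within each document. doc_ids: per-token document id."""
    doc_ids = np.asarray(doc_ids)
    # A new segment starts wherever the document id changes.
    starts = np.flatnonzero(np.diff(doc_ids) != 0) + 1
    return np.concatenate(([0], starts, [len(doc_ids)])).astype(np.int32)
```

Each consecutive pair in the result delimits one document, which is exactly the boundary information a varlen kernel consumes.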
  LeakyReLU: fused MLP computes fc -> LeakyReLU(0.5) -> square in a single Triton kernel
    parameters: {"negative_slope": 0.5}
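An unfused numpy reference for the epilogue that the Triton kernel fuses into the matmul: LeakyReLU with slope 0.5 followed by an elementwise square (the squared negative branch makes this an asymmetric squared-ReLU). The `mlp` helper and weight names are illustrative:

```python
import numpy as np

def leaky_square(x, negative_slope=0.5):
    """Unfused reference: LeakyReLU(0.5) then square.
    Negative inputs map to (0.5 * x) ** 2, positives to x ** 2."""
    y = np.where(x > 0, x, negative_slope * x)
    return y * y

def mlp(x, w_fc, w_proj):
    # fc -> LeakyReLU(0.5) -> square; the PR does the first three ops in
    # one Triton kernel, shown here as separate numpy ops for clarity.
    return leaky_square(x @ w_fc) @ w_proj
```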
  depth recurrence: triple depth recurrence with parallel residuals
    parameters: {"layers": [3, 4, 5]}
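One plausible reading of "triple depth recurrence with parallel residuals": layers 3-5 are re-applied three times, and on each pass the shared blocks all read the same input, with their outputs summed onto the residual stream. The PR does not spell out the wiring, so this sketch is an assumption:

```python
def parallel_residual_recurrence(x, shared_blocks, n_loops=3):
    """Hypothetical wiring: on each of n_loops passes, every shared block
    reads the same residual stream x, and their outputs are added back in
    parallel (rather than block-by-block sequentially)."""
    for _ in range(n_loops):
        x = x + sum(block(x) for block in shared_blocks)
    return x
```

Sequential residuals would instead update `x` after each block; the parallel form lets the three recurred blocks run concurrently per pass.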
Test-Time Training
  LoRA TTT
    parameters: {"rank": 96}
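LoRA test-time training adapts a frozen weight through a trainable low-rank update, here at rank 96. A minimal sketch of the forward path (the width `d` and initialization scales are hypothetical; only the rank comes from the PR):

```python
import numpy as np

d, rank = 512, 96          # rank 96 from the PR; d is a hypothetical width

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen base weight
A = rng.standard_normal((rank, d)) * 0.01     # trainable low-rank factor
B = np.zeros((d, rank))                       # zero init: adapter starts as a no-op

def lora_forward(x):
    # Base projection plus the low-rank correction; at test time only A and
    # B would be updated on the incoming data, leaving W untouched.
    return x @ W.T + (x @ A.T) @ B.T
```

With `B` initialized to zero the adapted model starts out exactly equal to the base model, which is the standard LoRA convention.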
Optimizer
  Muon
    weight_decay: null
    momentum: null
    other_params: {"muon_scale": 0.97}
Regularization
  weight decay
    parameters: {"qk_gain": 5.25, "sdclip": true}
Compression
  Brotli
    level: null
Other
  other: an importlib-based code loader writes the decompressed Triton source to a temporary file and imports it under the name __main__, so that inspect.getsourcelines() succeeds during Triton JIT compilation
    parameters: null
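The loader trick works because `inspect.getsourcelines()` needs the code object's filename to point at a real file on disk; source that was only `exec`'d from a string has no retrievable source, which breaks JIT compilers such as Triton that re-read kernel source. A minimal sketch of the idea (not the PR's actual loader):

```python
import importlib.util
import inspect
import sys
import tempfile

def load_source_as_main(source):
    """Write (already decompressed) source to a real file and import it
    under the name '__main__'. Because the code now lives in an actual
    file, inspect.getsourcelines() works on its functions."""
    tmp = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
    tmp.write(source)
    tmp.close()
    spec = importlib.util.spec_from_file_location("__main__", tmp.name)
    module = importlib.util.module_from_spec(spec)
    sys.modules["__main__"] = module  # replaces the real __main__ entry
    spec.loader.exec_module(module)
    return module

mod = load_source_as_main("def f(x):\n    return x + 1\n")
lines, start = inspect.getsourcelines(mod.f)  # succeeds: source is on disk
```

Registering the module as `__main__` additionally satisfies tooling that resolves functions through the entry-point module, at the cost of shadowing the real `__main__` in `sys.modules`.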

Novel Contributions

  • Importlib-based wrapper that enables Triton JIT compilation from compressed submission code
  • Integration of fused Triton TMA MLP into the VarLen + LoRA TTT stack
  • Doc-independent score-first LoRA TTT
  • VarLen attention with within-document boundaries
  • Triple depth recurrence with parallel residuals