PR #1540 (open)

Record: SP8192 + VarLen Attention + LoRA TTT + Fused MLP — val_bpb 1.0777 (3-seed mean)

by aryanbhosale
val_bpb: 1.0777
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Sequence Length
  sequence_length
    train_length: 8192
    eval_length: 8192
Architecture
  attention: VarLen attention with within-document boundaries only
    parameters: null
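The key bookkeeping for varlen attention is turning per-token document ids in a packed batch into cumulative sequence lengths, so the kernel never attends across a document boundary. A minimal sketch of that step (the `cu_seqlens` name follows the flash-attn varlen convention; the helper itself is illustrative, not the PR's code):

```python
import numpy as np

def cu_seqlens_from_doc_ids(doc_ids):
    """Build cumulative sequence lengths for a packed sequence so that a
    varlen attention kernel (e.g. flash-attn's varlen interface) attends
    only within each document. doc_ids: per-token document id."""
    doc_ids = np.asarray(doc_ids)
    # A new segment starts wherever the document id changes.
    starts = np.flatnonzero(np.diff(doc_ids) != 0) + 1
    return np.concatenate(([0], starts, [len(doc_ids)])).astype(np.int32)
```

Each consecutive pair in the result delimits one document, which is exactly the boundary information a varlen kernel consumes.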
  LeakyReLU: fused MLP computes fc -> LeakyReLU(0.5) -> square in a single Triton kernel
    parameters: {"negative_slope": 0.5}
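An unfused numpy reference for the epilogue that the Triton kernel fuses into the matmul: LeakyReLU with slope 0.5 followed by an elementwise square (the squared negative branch makes this an asymmetric squared-ReLU). The `mlp` helper and weight names are illustrative:

```python
import numpy as np

def leaky_square(x, negative_slope=0.5):
    """Unfused reference: LeakyReLU(0.5) then square.
    Negative inputs map to (0.5 * x) ** 2, positives to x ** 2."""
    y = np.where(x > 0, x, negative_slope * x)
    return y * y

def mlp(x, w_fc, w_proj):
    # fc -> LeakyReLU(0.5) -> square; the PR does the first three ops in
    # one Triton kernel, shown here as separate numpy ops for clarity.
    return leaky_square(x @ w_fc) @ w_proj
```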
  depth recurrence: triple depth recurrence with parallel residuals
    parameters: {"layers": [3, 4, 5]}
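One plausible reading of "triple depth recurrence with parallel residuals": layers 3-5 are re-applied three times, and on each pass the shared blocks all read the same input, with their outputs summed onto the residual stream. The PR does not spell out the wiring, so this sketch is an assumption:

```python
def parallel_residual_recurrence(x, shared_blocks, n_loops=3):
    """Hypothetical wiring: on each of n_loops passes, every shared block
    reads the same residual stream x, and their outputs are added back in
    parallel (rather than block-by-block sequentially)."""
    for _ in range(n_loops):
        x = x + sum(block(x) for block in shared_blocks)
    return x
```

Sequential residuals would instead update `x` after each block; the parallel form lets the three recurred blocks run concurrently per pass.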
Test-Time Training
  LoRA TTT
    parameters: {"rank": 96}
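LoRA test-time training adapts a frozen weight through a trainable low-rank update, here at rank 96. A minimal sketch of the forward path (the width `d` and initialization scales are hypothetical; only the rank comes from the PR):

```python
import numpy as np

d, rank = 512, 96          # rank 96 from the PR; d is a hypothetical width

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen base weight
A = rng.standard_normal((rank, d)) * 0.01     # trainable low-rank factor
B = np.zeros((d, rank))                       # zero init: adapter starts as a no-op

def lora_forward(x):
    # Base projection plus the low-rank correction; at test time only A and
    # B would be updated on the incoming data, leaving W untouched.
    return x @ W.T + (x @ A.T) @ B.T
```

With `B` initialized to zero the adapted model starts out exactly equal to the base model, which is the standard LoRA convention.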
Optimizer
  Muon
    weight_decay: null
    momentum: null
    other_params: {"muon_scale": 0.97}
Regularization
  weight decay
    parameters: {"qk_gain": 5.25, "sdclip": true}
Compression
  Brotli
    level: null
Other
  other: an importlib-based code loader writes the decompressed Triton source to a temporary file and imports it under the name __main__, so that inspect.getsourcelines() succeeds during Triton JIT compilation
    parameters: null
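The loader trick works because `inspect.getsourcelines()` needs the code object's filename to point at a real file on disk; source that was only `exec`'d from a string has no retrievable source, which breaks JIT compilers such as Triton that re-read kernel source. A minimal sketch of the idea (not the PR's actual loader):

```python
import importlib.util
import inspect
import sys
import tempfile

def load_source_as_main(source):
    """Write (already decompressed) source to a real file and import it
    under the name '__main__'. Because the code now lives in an actual
    file, inspect.getsourcelines() works on its functions."""
    tmp = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
    tmp.write(source)
    tmp.close()
    spec = importlib.util.spec_from_file_location("__main__", tmp.name)
    module = importlib.util.module_from_spec(spec)
    sys.modules["__main__"] = module  # replaces the real __main__ entry
    spec.loader.exec_module(module)
    return module

mod = load_source_as_main("def f(x):\n    return x + 1\n")
lines, start = inspect.getsourcelines(mod.f)  # succeeds: source is on disk
```

Registering the module as `__main__` additionally satisfies tooling that resolves functions through the entry-point module, at the cost of shadowing the real `__main__` in `sys.modules`.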

Novel Contributions

  • Importlib-based wrapper that enables Triton JIT compilation from compressed submission code
  • Integration of fused Triton TMA MLP into the VarLen + LoRA TTT stack
  • Doc-independent score-first LoRA TTT
  • VarLen attention with within-document boundaries
  • Triple depth recurrence with parallel residuals