PR #1560 (open)

Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)

by dexhunter (View on GitHub)
val_bpb: 1.0741
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
  • attention: VarLen attention with per-document cu_seqlens and strict causal masking
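The varlen attention entry packs multiple documents into one sequence and passes cumulative sequence lengths (cu_seqlens) so attention never crosses document boundaries. A minimal pure-Python sketch of the cu_seqlens convention and the per-document causal rule it implies (function names are illustrative; in practice this indexing would be handled by a varlen attention kernel, not Python loops):

```python
def build_cu_seqlens(doc_lens):
    """Cumulative sequence lengths for packed documents:
    [0, len0, len0+len1, ...]; document i spans [cu[i], cu[i+1])."""
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

def allowed(q, k, cu_seqlens):
    """True iff query position q may attend to key position k:
    k must lie in the same document as q, at or before q (causal
    within the document, no cross-document attention)."""
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        if start <= q < end:
            return start <= k <= q
    return False
```

With two packed documents of lengths 3 and 2, cu_seqlens is [0, 3, 5]: position 3 (start of doc 2) cannot attend to position 2 (end of doc 1), even though 2 < 3.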
  • MLP: Triton fused MLP implementation
  • LoRA TTT: doc-independent LoRA-based test-time training stack
  • LeakyReLU: LeakyReLU activation used in the MLP (slope: 0.5)
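The record's MLP uses LeakyReLU with slope 0.5; the Triton kernel presumably fuses the two matmuls and the activation into one pass. A pure-Python reference for the computation such a fused kernel would produce (weight layout and names are illustrative, not taken from the PR):

```python
def leaky_relu(x, slope=0.5):
    """LeakyReLU with the record's slope of 0.5."""
    return x if x >= 0.0 else slope * x

def mlp(x, w_in, w_out, slope=0.5):
    """Reference MLP: y = LeakyReLU(x @ w_in) @ w_out.
    x: list of d_model floats; w_in: d_model x d_hidden;
    w_out: d_hidden x d_model (all plain nested lists)."""
    h = [leaky_relu(sum(xi * w_in[i][j] for i, xi in enumerate(x)), slope)
         for j in range(len(w_in[0]))]
    return [sum(hj * w_out[j][k] for j, hj in enumerate(h))
            for k in range(len(w_out[0]))]
```

A fused kernel computes the same function but keeps the hidden activations in registers/shared memory instead of materializing them to global memory between the two matmuls.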
  • weight tying: tied embeddings
LR Schedule
  • warmdown (warmdown_frac: 0.75)
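A warmdown_frac of 0.75 means the final 75% of training runs under the warmdown. The PR does not state the decay shape; a sketch assuming the common pattern of constant LR followed by a linear decay to zero:

```python
def lr_scale(step, total_steps, warmdown_frac=0.75):
    """LR multiplier: 1.0 for the first (1 - warmdown_frac) of
    training, then linear decay to 0 over the final warmdown_frac.
    (Linear shape is an assumption, not stated in the record.)"""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

With 1000 total steps, the scale stays at 1.0 through step 249, then decays linearly: 0.5 at step 625, 0.0 at step 1000.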
Test-Time Training
  • LoRA TTT (chunk_size: 48)
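A chunk_size of 48 means the test sequence is processed in 48-token chunks, with the LoRA adapters updated between chunks. A skeleton of that loop (the `predict` and `adapt` callables stand in for the model forward pass and the LoRA update, which the record does not detail):

```python
def chunk_spans(seq_len, chunk_size=48):
    """Split [0, seq_len) into consecutive spans of chunk_size
    tokens; the last span may be shorter."""
    return [(s, min(s + chunk_size, seq_len))
            for s in range(0, seq_len, chunk_size)]

def ttt_forward(tokens, predict, adapt, chunk_size=48):
    """Test-time-training loop skeleton: predict each chunk with the
    current adapters, then adapt the LoRA weights on that chunk before
    moving on. `predict` and `adapt` are illustrative placeholders."""
    outputs = []
    for s, e in chunk_spans(len(tokens), chunk_size):
        outputs.extend(predict(tokens[s:e]))  # forward with current adapters
        adapt(tokens[s:e])                    # LoRA update on this chunk
    return outputs
```

Because adaptation happens only after a chunk is predicted, each token's prediction still depends only on earlier tokens, preserving causality across chunks.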
Optimizer
  • Muon (momentum: 0.97)
Regularization
  • logit softcap (value: 30)
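Logit softcapping with value 30 smoothly bounds the output logits to (-30, 30), which is conventionally done with a scaled tanh (the exact formulation in this PR is assumed):

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap, saturating at +/- cap."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged (softcap(1.0) differs from 1.0 by well under 0.1%), while arbitrarily large logits are compressed toward 30, preventing extreme values from dominating the loss.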

Novel Contributions

  • VarLen attention with per-document cu_seqlens and strict causal masking
  • Triton fused MLP
  • Doc-independent LoRA TTT stack
  • Warmdown fraction increased to 0.75
  • TTT chunk size increased to 48
  • Muon momentum tuned to 0.97