PR #1530

open

Record: Varlen attention + fused MLP + doc-independent TTT (1.07643)

by samacqua
val_bpb
1.0764
Architecture
Transformer
Optimizer
Artifact Size
~15.99 MB

Training Techniques

Architecture
attention modification
Replaced dense causal attention with Flash Attention 3 variable-length attention so tokens only attend within each packed document.
parameters: null
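To make the varlen semantics concrete, here is a plain-Python illustration (not the PR's kernel) of the mask that Flash Attention 3 variable-length attention enforces implicitly: `cu_seqlens` marks document boundaries in the packed token buffer, and token `i` may attend to token `j` only when both fall in the same document and `j <= i`.

```python
def varlen_causal_mask(cu_seqlens):
    """Return mask[i][j] = True where attention is allowed.

    cu_seqlens: cumulative sequence lengths, e.g. [0, 2, 5] for two
    packed documents of lengths 2 and 3. FA3 never materializes this
    mask; it skips the masked-out work entirely, which is where the
    FLOP savings come from.
    """
    total = cu_seqlens[-1]
    # Document id for every packed token position.
    doc = []
    for d in range(len(cu_seqlens) - 1):
        doc.extend([d] * (cu_seqlens[d + 1] - cu_seqlens[d]))
    return [[doc[i] == doc[j] and j <= i for j in range(total)]
            for i in range(total)]

# Two packed documents of lengths 2 and 3:
mask = varlen_causal_mask([0, 2, 5])
assert mask[1][0] is True    # same document, causal
assert mask[2][1] is False   # token 2 starts a new document
```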
LeakyReLU
The fused MLP applies LeakyReLU(slope=0.5) followed by squaring inside a custom Triton kernel.
parameters: {"slope":0.5}
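The pointwise math the Triton kernel fuses with the up-projection is simple enough to state directly; this is just the activation, in plain Python, for reference:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring, as fused in the kernel."""
    y = x if x >= 0 else slope * x
    return y * y

assert leaky_relu_sq(2.0) == 4.0    # positive branch: x^2
assert leaky_relu_sq(-2.0) == 1.0   # negative branch: (0.5 * x)^2
```

Note that squaring makes the output non-negative on both branches; the slope controls how strongly negative pre-activations contribute.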
Test-Time Training
LoRA TTT
parameters: {"scope":"per-document","independent_sequences":true}
score-first TTT
parameters: {"batched_loras":true}
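As a sketch of the LoRA mechanics involved (not the PR's implementation): per-document TTT keeps the base weight `W` frozen and adapts only low-rank factors `A` and `B` on each validation document independently; `batched_loras` means many documents' adapter pairs are trained and applied together. The effective weight for one document is:

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def effective_weight(W, A, B, scale):
    """W_eff = W + scale * (B @ A); only A and B are updated by TTT."""
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out=2, d_in=2)
A = [[1.0, 1.0]]               # down-projection (r=1, d_in=2)
B = [[1.0], [0.0]]             # up-projection (d_out=2, r=1)
W_eff = effective_weight(W, A, B, scale=0.5)
assert W_eff == [[1.5, 0.5], [0.0, 1.0]]
```

Because documents are independent, one document's adapted weights never leak into another's evaluation.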
Evaluation
sliding window eval
parameters: null
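Sliding-window evaluation typically gives every token near-maximal context by scoring only the trailing tokens of each overlapping window; the indexing below is a hedged sketch of that pattern (the exact window/stride values are not stated in the PR):

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) triples covering n_tokens.

    Each window feeds tokens [start, end) to the model but scores loss
    only on tokens [score_from, end), so scored tokens see up to
    `window` tokens of left context.
    """
    out = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        out.append((start, end, pos))
        pos = end
    return out

wins = sliding_windows(n_tokens=10, window=4, stride=2)
assert wins[0] == (0, 2, 0)   # first window has no extra context yet
assert wins[2] == (2, 6, 4)   # later windows carry 2 context tokens
```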

Novel Contributions

  • Variable-length attention with packed documents to avoid cross-document attention and reduce FLOPs
  • Fused MLP kernel combining up-projection, LeakyReLU(0.5), and squaring
  • Grouping many small parameters into a single all-reduce to reduce communication overhead
  • Document-independent LoRA test-time training applied separately to each validation sequence
  • Faster TTT implementation enabling smaller chunk sizes and better performance
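The grouped all-reduce contribution above can be sketched as follows (names and shapes are illustrative, not from the PR): rather than launching one collective per small gradient tensor, flatten them into a single contiguous buffer, reduce once, and scatter the result back.

```python
def flatten(tensors):
    """Pack a list of small 1-D tensors into one contiguous buffer."""
    buf, shapes = [], []
    for t in tensors:
        shapes.append(len(t))
        buf.extend(t)
    return buf, shapes

def unflatten(buf, shapes):
    """Split the fused buffer back into the original tensor shapes."""
    out, i = [], 0
    for n in shapes:
        out.append(buf[i:i + n])
        i += n
    return out

def all_reduce_sum(bufs):
    """Stand-in for a single collective over the fused buffer."""
    return [sum(vals) for vals in zip(*bufs)]

# Two "ranks", each holding two small gradient tensors:
rank0 = [[1.0, 2.0], [3.0]]
rank1 = [[0.5, 0.5], [1.0]]
b0, shapes = flatten(rank0)
b1, _ = flatten(rank1)
reduced = unflatten(all_reduce_sum([b0, b1]), shapes)
assert reduced == [[1.5, 2.5], [4.0]]
```

The saving comes from paying the per-collective launch and latency cost once instead of once per parameter.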