PR #785
open
Applied Async Prefetching Boosts Performance of Any Approach
by SirSaltySalmon
val_bpb
1.5364
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
—
Training Techniques
Architecture
LeakyReLU² MLP
Uses LeakyReLU with negative slope 0.5 followed by squaring before the down projection; the square is written as h * h for compiler-fusion friendliness.
parameters: null
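A minimal scalar sketch of this activation (pure Python; the actual MLP applies it elementwise to tensors before the down projection):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with slope 0.5, then squared.

    The square is written as h * h rather than h ** 2: a plain
    multiply is easier for compilers to fuse than a generic power op.
    """
    h = x if x >= 0.0 else negative_slope * x
    return h * h
```

For example, `leaky_relu_sq(2.0)` gives 4.0, while `leaky_relu_sq(-2.0)` first maps to -1.0 and then squares to 1.0, so the output is nonnegative everywhere.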
XSA
Uses XSA with last-N token attention/history.
parameters: {"last_n":4}
BigramHash
Bigram vocabulary / hashing-based token component.
parameters: {"vocab_size":1536}
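The card does not spell out the hashing scheme, only the 1536-entry table size. A hypothetical sketch of mapping a token bigram into that table (the mixing constant is illustrative, not taken from the PR):

```python
BIGRAM_VOCAB_SIZE = 1536  # from the card's parameters

def bigram_bucket(prev_token: int, cur_token: int,
                  vocab_size: int = BIGRAM_VOCAB_SIZE) -> int:
    """Map a (prev, cur) token pair to a bucket in the bigram table.

    The multiplier is a hypothetical mixing constant; any large odd
    multiplier that spreads pairs across buckets behaves similarly.
    """
    mixed = prev_token * 1000003 + cur_token
    return mixed % vocab_size
```

The bucket index would then select a learned embedding that is combined with the regular token embedding.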
RoPE
Rotary positional embeddings.
parameters: {"dimensions":16}
weight tying
Tied embeddings are used.
parameters: null
Other
other
Pinned, asynchronous training-batch prefetch: background CPU batch preparation with pin_memory, a bounded queue, and an optional dedicated CUDA copy stream to overlap host-to-device transfers with compute.
parameters: {"prefetch":1,"prefetch_queue":2,"copy_stream":1}
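A framework-free sketch of the prefetch skeleton, assuming a producer thread and a bounded queue; in the PR the worker additionally pins host memory and the consumer issues the H2D copy on a dedicated CUDA stream, which this stdlib-only sketch only notes in comments:

```python
import queue
import threading

class AsyncPrefetcher:
    """Background-thread batch prefetcher with a bounded queue.

    The real version also calls pin_memory() on each batch and copies
    it to the GPU on a separate CUDA stream so transfers overlap with
    compute; here we keep only the thread/queue structure.
    """

    _DONE = object()  # sentinel marking the end of the batch stream

    def __init__(self, make_batch_iter, depth: int = 2):
        # Bounded queue (depth ~ prefetch_queue) caps host memory use.
        self._q = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._worker, args=(make_batch_iter,), daemon=True)
        self._thread.start()

    def _worker(self, make_batch_iter):
        for batch in make_batch_iter():
            self._q.put(batch)  # blocks while the queue is full
        self._q.put(self._DONE)

    def __iter__(self):
        while True:
            batch = self._q.get()
            if batch is self._DONE:
                return
            yield batch
```

The bounded queue is the key design choice: it lets batch preparation run ahead of the training step without letting prepared batches accumulate unboundedly in host RAM.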
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
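The warmup parameters suggest momentum ramping from 0.92 to the final 0.99 over 1500 steps. A sketch assuming a linear ramp (the exact schedule shape is not stated in the card):

```python
def muon_momentum(step: int,
                  start: float = 0.92,     # momentum_warmup_start
                  final: float = 0.99,     # optimizer momentum
                  warmup_steps: int = 1500) -> float:
    """Linearly ramp Muon momentum from `start` to `final`."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + (final - start) * frac
```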
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
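A minimal sketch of the EMA update with decay 0.997, applied per parameter after each optimizer step (SWA with every=50 would instead take a plain average of snapshots taken every 50 steps); parameters are modeled as a dict of floats for illustration:

```python
def ema_update(ema_params: dict, model_params: dict,
               decay: float = 0.997) -> dict:
    """In-place EMA: ema <- decay * ema + (1 - decay) * model."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

With decay 0.997 the averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps.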
Regularization
layerwise LN scale
parameters: {"enabled":1}
Quantization
QAT
bits: null
scope: late QAT
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
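A warmdown schedule holds the learning rate flat and then decays it to zero over the final 3500 steps. A sketch assuming linear decay (the decay shape is an assumption; only warmdown_steps is given):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Flat LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```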
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Evaluation
stride-based eval
parameters: {"stride":64}
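Stride-based eval scores a long sequence in overlapping windows, advancing by `stride` tokens at a time and counting loss only on the newly covered tokens, so each scored token sees up to a full window of left context. A sketch of the window planning (window size is a free parameter here; only stride=64 comes from the card):

```python
def eval_windows(n_tokens: int, window: int, stride: int = 64):
    """Plan overlapping windows for stride-based perplexity eval.

    Returns (start, end, n_scored) triples: each window covers
    [start, end) and only its last n_scored tokens contribute loss;
    the earlier tokens in the window serve purely as context.
    """
    plans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        plans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return plans
```

Every token is scored exactly once, so the summed losses still yield a valid per-token bpb; a smaller stride trades more compute for more context per scored token.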
Novel Contributions
- Pinned async batch prefetching to overlap CPU batch preparation and GPU compute
- Optional dedicated CUDA copy stream for non-blocking host-to-device transfers
- Compiler fusion-friendly rewrite of the LeakyReLU² MLP using h * h and explicit weight casting
- Demonstrated a modest step-count improvement within the 600 s budget and a slightly better val_bpb than the base run