PR #785
open
Applied Async Prefetching Boosts Performance of Any Approach
by SirSaltySalmon
val_bpb
1.5364
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
—
Training Techniques
Architecture
LeakyReLU² MLP
Uses LeakyReLU with negative slope 0.5 followed by squaring before the down projection; the square is written as h * h for compiler-fusion friendliness.
parameters: null
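A minimal scalar sketch of this activation (pure Python; the actual MLP applies it elementwise to tensors before the down projection):

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with slope 0.5, then squared.

    The square is written as h * h rather than h ** 2: a plain
    multiply is easier for compilers to fuse than a generic power op.
    """
    h = x if x >= 0.0 else negative_slope * x
    return h * h
```

For example, `leaky_relu_sq(2.0)` gives 4.0, while `leaky_relu_sq(-2.0)` first maps to -1.0 and then squares to 1.0, so the output is nonnegative everywhere.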
XSA
Uses XSA with last-N token attention/history.
parameters: {"last_n":4}
BigramHash
Bigram vocabulary / hashing-based token component.
parameters: {"vocab_size":1536}
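The card does not spell out the hashing scheme, only the 1536-entry table size. A hypothetical sketch of mapping a token bigram into that table (the mixing constant is illustrative, not taken from the PR):

```python
BIGRAM_VOCAB_SIZE = 1536  # from the card's parameters

def bigram_bucket(prev_token: int, cur_token: int,
                  vocab_size: int = BIGRAM_VOCAB_SIZE) -> int:
    """Map a (prev, cur) token pair to a bucket in the bigram table.

    The multiplier is a hypothetical mixing constant; any large odd
    multiplier that spreads pairs across buckets behaves similarly.
    """
    mixed = prev_token * 1000003 + cur_token
    return mixed % vocab_size
```

The bucket index would then select a learned embedding that is combined with the regular token embedding.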
RoPE
Rotary positional embeddings.
parameters: {"dimensions":16}
weight tying
Tied embeddings are used.
parameters: null
Other
other
Pinned, asynchronous training-batch prefetch: background CPU batch preparation with pin_memory, a bounded queue, and an optional dedicated CUDA copy stream to overlap host-to-device transfers with compute.
parameters: {"prefetch":1,"prefetch_queue":2,"copy_stream":1}
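A framework-free sketch of the prefetch skeleton, assuming a producer thread and a bounded queue; in the PR the worker additionally pins host memory and the consumer issues the H2D copy on a dedicated CUDA stream, which this stdlib-only sketch only notes in comments:

```python
import queue
import threading

class AsyncPrefetcher:
    """Background-thread batch prefetcher with a bounded queue.

    The real version also calls pin_memory() on each batch and copies
    it to the GPU on a separate CUDA stream so transfers overlap with
    compute; here we keep only the thread/queue structure.
    """

    _DONE = object()  # sentinel marking the end of the batch stream

    def __init__(self, make_batch_iter, depth: int = 2):
        # Bounded queue (depth ~ prefetch_queue) caps host memory use.
        self._q = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._worker, args=(make_batch_iter,), daemon=True)
        self._thread.start()

    def _worker(self, make_batch_iter):
        for batch in make_batch_iter():
            self._q.put(batch)  # blocks while the queue is full
        self._q.put(self._DONE)

    def __iter__(self):
        while True:
            batch = self._q.get()
            if batch is self._DONE:
                return
            yield batch
```

The bounded queue is the key design choice: it lets batch preparation run ahead of the training step without letting prepared batches accumulate unboundedly in host RAM.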
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
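The warmup parameters suggest momentum ramping from 0.92 to the final 0.99 over 1500 steps. A sketch assuming a linear ramp (the exact schedule shape is not stated in the card):

```python
def muon_momentum(step: int,
                  start: float = 0.92,     # momentum_warmup_start
                  final: float = 0.99,     # optimizer momentum
                  warmup_steps: int = 1500) -> float:
    """Linearly ramp Muon momentum from `start` to `final`."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + (final - start) * frac
```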
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
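A minimal sketch of the EMA update with decay 0.997, applied per parameter after each optimizer step (SWA with every=50 would instead take a plain average of snapshots taken every 50 steps); parameters are modeled as a dict of floats for illustration:

```python
def ema_update(ema_params: dict, model_params: dict,
               decay: float = 0.997) -> dict:
    """In-place EMA: ema <- decay * ema + (1 - decay) * model."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

With decay 0.997 the averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps.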
Regularization
layerwise LN scale
parameters: {"enabled":1}
Quantization
QAT
bits: null
scope: late QAT
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
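A warmdown schedule holds the learning rate flat and then decays it to zero over the final 3500 steps. A sketch assuming linear decay (the decay shape is an assumption; only warmdown_steps is given):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Flat LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```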
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Evaluation
stride-based eval
parameters: {"stride":64}
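Stride-based eval scores a long sequence in overlapping windows, advancing by `stride` tokens at a time and counting loss only on the newly covered tokens, so each scored token sees up to a full window of left context. A sketch of the window planning (window size is a free parameter here; only stride=64 comes from the card):

```python
def eval_windows(n_tokens: int, window: int, stride: int = 64):
    """Plan overlapping windows for stride-based perplexity eval.

    Returns (start, end, n_scored) triples: each window covers
    [start, end) and only its last n_scored tokens contribute loss;
    the earlier tokens in the window serve purely as context.
    """
    plans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        plans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return plans
```

Every token is scored exactly once, so the summed losses still yield a valid per-token bpb; a smaller stride trades more compute for more context per scored token.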
Novel Contributions
- Pinned async batch prefetching to overlap CPU batch preparation and GPU compute
- Optional dedicated CUDA copy stream for non-blocking host-to-device transfers
- Compiler fusion-friendly rewrite of the LeakyReLU² MLP using h * h and explicit weight casting
- Demonstrated a modest step-count improvement within the 600 s budget and a slightly better val_bpb than the base run