PR #726

open

Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon

by DeepReinforce
val_bpb
1.1147
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.23 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: model weights
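The PR does not show its quantizer, so here is a minimal sketch of the simplest piece a "GPTQ-lite" 6-bit scheme would need: symmetric round-to-nearest quantization of a weight group to signed 6-bit integers with a shared scale. The function names are hypothetical, and GPTQ proper additionally compensates rounding error column-by-column using second-order statistics, which is omitted here.

```python
def quantize_symmetric(weights, bits=6):
    # Symmetric round-to-nearest quantization to signed `bits`-bit ints.
    # (Full GPTQ also propagates rounding error across columns using
    # Hessian information; that step is not sketched here.)
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int codes.
    return [v * scale for v in q]
```

Round-to-nearest with a shared scale bounds the per-weight error by half a quantization step, which is why 6 bits on model weights costs little bpb.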
Architecture
XSA
Uses XSA on the last four layers as part of the custom PR #549-style architecture stack.
parameters: {"layers":4}
Partial RoPE
Applies rotary position embeddings to only a fraction of each head's dimensions (16 of 64); the remaining dimensions carry no positional rotation.
parameters: {"dimensions":"16/64"}
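The partial-RoPE idea can be sketched as rotating only the first 16 of a 64-dimension head vector in sine/cosine pairs and passing the rest through untouched. The pairing convention and frequency base below are assumptions (the PR only states the 16/64 split):

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims dimensions in (even, odd) pairs by a
    # position-dependent angle; leave dimensions >= rot_dims untouched.
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[i], q[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

At position 0 the rotation is the identity, and the upper 48 dimensions are position-agnostic at every position.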
BigramHash
Includes BigramHash in the model architecture.
parameters: null
LeakyReLU²
MLP nonlinearity using leaky ReLU with negative slope 0.5 followed by squaring before the down projection.
parameters: {"negative_slope":0.5}
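Read literally, the description gives the following elementwise activation; note that squaring after the leaky branch makes the output nonnegative, so whether the PR uses a sign-preserving variant is not stated and this is the plain reading:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # Leaky ReLU (slope 0.5 on negatives) followed by squaring,
    # applied elementwise before the MLP down projection.
    y = x if x > 0 else negative_slope * x
    return y * y
```

Compared with the common ReLU² activation, the 0.5-slope leak keeps gradient flowing through negative pre-activations instead of zeroing them.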
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50,"start_condition":"warmdown LR scale below 0.2"}
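Both averaging rules are standard; a minimal sketch over a parameter dict (names hypothetical) with the listed EMA decay of 0.997, and SWA as a running mean over checkpoints sampled at the stated interval once the warmdown condition triggers:

```python
def ema_update(avg_params, params, decay=0.997):
    # Exponential moving average: avg <- decay * avg + (1 - decay) * current.
    return {k: decay * v + (1 - decay) * params[k]
            for k, v in avg_params.items()}

def swa_update(avg_params, params, n_averaged):
    # Running (equal-weight) mean over checkpoints taken every `interval`
    # steps; n_averaged is how many checkpoints are already in the average.
    return {k: v + (params[k] - v) / (n_averaged + 1)
            for k, v in avg_params.items()}
```

With decay 0.997 the EMA has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.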
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
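Sliding-window evaluation with stride 64 scores each token exactly once while conditioning it on a longer left context. The PR only states the stride, so the context window size below is an assumed placeholder:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    # Each span (ctx_start, score_start, end) scores tokens
    # [score_start, end) conditioned on context [ctx_start, score_start).
    spans = []
    for s in range(0, n_tokens, stride):
        end = min(s + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, s, end))
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.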
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs":3,"frozen_blocks":2,"gradient_clip":1,"stride":64}
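"Score-first" TTT stays legal because each chunk is scored by weights that have never seen it, and only afterwards does the model adapt on that chunk. A toy sketch of the loop with the listed SGD settings (momentum 0.9, lr 0.002, 3 epochs, clip 1); the frozen-blocks detail and the real model are abstracted behind a `loss_and_grad` callback, which is an assumption of this sketch:

```python
def ttt_score_first(chunks, loss_and_grad, params,
                    lr=0.002, momentum=0.9, epochs=3, clip=1.0):
    # Score each chunk with the CURRENT weights before adapting on it,
    # so every token is evaluated by a model that never trained on it.
    scores, vel = [], {k: 0.0 for k in params}
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                 # score first (legal)
        for _ in range(epochs):             # then adapt on the same chunk
            _, grads = loss_and_grad(params, chunk)
            for k, g in grads.items():
                g = max(-clip, min(clip, g))       # gradient clipping
                vel[k] = momentum * vel[k] + g     # SGD with momentum
                params[k] -= lr * vel[k]
    return scores
```

The reported bpb is then computed from the pre-adaptation scores only.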
LR Schedule
cosine decay
parameters: {"across_chunks":true}
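Cosine decay "across chunks" can be read as annealing the TTT learning rate from its base value to zero over the sequence of chunks; the base rate below reuses the listed TTT learning rate 0.002, and the zero floor is an assumption:

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr=0.002, min_lr=0.0):
    # Cosine decay from base_lr (first chunk) to min_lr (last chunk).
    t = chunk_idx / max(1, n_chunks - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Early chunks adapt aggressively while late chunks make only small updates, which limits drift over long documents.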
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
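The stated rule is a fixed per-layer multiplier on the LayerNorm output:

```python
import math

def ln_scale(layer_idx):
    # Layerwise LN output scale 1/sqrt(layer + 1): deeper layers
    # contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer_idx + 1)
```

This damps the growth of residual-stream magnitude with depth, acting as a lightweight regularizer.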
Other
other
Memmap multi-shard data pipeline with global window sampling, coprime stride over shards, merged slab reads, and asynchronous CPU/GPU prefetch using a daemon thread plus CUDA streams/events.
parameters: {"memmap":true,"multi_shard":true,"gpu_prefetch":true}
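The coprime-stride component can be sketched in isolation: walking window indices with a stride coprime to the window count visits every window exactly once per cycle, giving a cheap deterministic shuffle across shards. The stride-bumping fallback below is an assumption of this sketch, not necessarily the PR's exact sampler:

```python
import math

def coprime_stride_order(n_windows, stride):
    # Choose a stride coprime with n_windows so the walk is a full
    # permutation of 0..n_windows-1 (a single cycle in Z/n).
    while math.gcd(stride, n_windows) != 1:
        stride += 1
    return [(i * stride) % n_windows for i in range(n_windows)]
```

Unlike a true random shuffle, this needs no per-epoch permutation buffer and keeps accesses stride-regular, which pairs well with merged slab reads from memmapped shards.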

Novel Contributions

  • Memmap-based multi-shard distributed token loader
  • Global training window sampling across shards with coprime stride and diversity-aware shard weighting
  • Merged slab reads to reduce mmap churn
  • Asynchronous CPU batch construction with GPU prefetch via CUDA streams and events
  • Legal score-first test-time training with chunk-wise adaptation
  • LeakyReLU² MLP nonlinearity
  • Parallel Muon optimizer usage
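The CPU half of the asynchronous prefetch above can be sketched with stdlib primitives: a daemon thread builds batches ahead of the training loop behind a bounded queue. The GPU half (overlapping host-to-device copies via CUDA streams and events) is not shown here, since it requires the CUDA runtime:

```python
import queue
import threading

def prefetch(batch_iter, depth=2):
    # Daemon thread constructs batches ahead of the consumer; the bounded
    # queue caps memory held by in-flight batches. In the full pipeline a
    # side CUDA stream would also overlap the H2D copy with compute.
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of the iterator

    def worker():
        for b in batch_iter:
            q.put(b)
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        b = q.get()
        if b is done:
            break
        yield b
```

Because the thread is a daemon, a crashed or early-exiting training loop does not hang the process waiting on the producer.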