PR #726

open

Memmap multi-shard data pipeline + GPU prefetch + LeakyReLU² + Legal TTT + Parallel Muon

by DeepReinforce
val_bpb
1.1147
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.23 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: model weights
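The PR does not show its quantizer, so here is a minimal sketch of the simplest piece a "GPTQ-lite" 6-bit scheme would need: symmetric round-to-nearest quantization of a weight group to signed 6-bit integers with a shared scale. The function names are hypothetical, and GPTQ proper additionally compensates rounding error column-by-column using second-order statistics, which is omitted here.

```python
def quantize_symmetric(weights, bits=6):
    # Symmetric round-to-nearest quantization to signed `bits`-bit ints.
    # (Full GPTQ also propagates rounding error across columns using
    # Hessian information; that step is not sketched here.)
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int codes.
    return [v * scale for v in q]
```

Round-to-nearest with a shared scale bounds the per-weight error by half a quantization step, which is why 6 bits on model weights costs little bpb.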
Architecture
XSA
Uses XSA on the last four layers as part of the custom PR #549-style architecture stack.
parameters: {"layers":4}
Partial RoPE
Applies rotary position embeddings to only a fraction of each head's dimensions (16 of 64); the remaining dimensions carry no positional rotation.
parameters: {"dimensions":"16/64"}
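The partial-RoPE idea can be sketched as rotating only the first 16 of a 64-dimension head vector in sine/cosine pairs and passing the rest through untouched. The pairing convention and frequency base below are assumptions (the PR only states the 16/64 split):

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    # Rotate the first rot_dims dimensions in (even, odd) pairs by a
    # position-dependent angle; leave dimensions >= rot_dims untouched.
    out = list(q)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = q[i], q[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

At position 0 the rotation is the identity, and the upper 48 dimensions are position-agnostic at every position.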
BigramHash
Includes BigramHash in the model architecture.
parameters: null
LeakyReLU²
MLP nonlinearity using leaky ReLU with negative slope 0.5 followed by squaring before the down projection.
parameters: {"negative_slope":0.5}
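Read literally, the description gives the following elementwise activation; note that squaring after the leaky branch makes the output nonnegative, so whether the PR uses a sign-preserving variant is not stated and this is the plain reading:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # Leaky ReLU (slope 0.5 on negatives) followed by squaring,
    # applied elementwise before the MLP down projection.
    y = x if x > 0 else negative_slope * x
    return y * y
```

Compared with the common ReLU² activation, the 0.5-slope leak keeps gradient flowing through negative pre-activations instead of zeroing them.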
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
AdamW
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50,"start_condition":"warmdown LR scale below 0.2"}
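Both averaging rules are standard; a minimal sketch over a parameter dict (names hypothetical) with the listed EMA decay of 0.997, and SWA as a running mean over checkpoints sampled at the stated interval once the warmdown condition triggers:

```python
def ema_update(avg_params, params, decay=0.997):
    # Exponential moving average: avg <- decay * avg + (1 - decay) * current.
    return {k: decay * v + (1 - decay) * params[k]
            for k, v in avg_params.items()}

def swa_update(avg_params, params, n_averaged):
    # Running (equal-weight) mean over checkpoints taken every `interval`
    # steps; n_averaged is how many checkpoints are already in the average.
    return {k: v + (params[k] - v) / (n_averaged + 1)
            for k, v in avg_params.items()}
```

With decay 0.997 the EMA has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.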
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
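Sliding-window evaluation with stride 64 scores each token exactly once while conditioning it on a longer left context. The PR only states the stride, so the context window size below is an assumed placeholder:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    # Each span (ctx_start, score_start, end) scores tokens
    # [score_start, end) conditioned on context [ctx_start, score_start).
    spans = []
    for s in range(0, n_tokens, stride):
        end = min(s + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, s, end))
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.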
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs":3,"frozen_blocks":2,"gradient_clip":1,"stride":64}
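"Score-first" TTT stays legal because each chunk is scored by weights that have never seen it, and only afterwards does the model adapt on that chunk. A toy sketch of the loop with the listed SGD settings (momentum 0.9, lr 0.002, 3 epochs, clip 1); the frozen-blocks detail and the real model are abstracted behind a `loss_and_grad` callback, which is an assumption of this sketch:

```python
def ttt_score_first(chunks, loss_and_grad, params,
                    lr=0.002, momentum=0.9, epochs=3, clip=1.0):
    # Score each chunk with the CURRENT weights before adapting on it,
    # so every token is evaluated by a model that never trained on it.
    scores, vel = [], {k: 0.0 for k in params}
    for chunk in chunks:
        loss, _ = loss_and_grad(params, chunk)
        scores.append(loss)                 # score first (legal)
        for _ in range(epochs):             # then adapt on the same chunk
            _, grads = loss_and_grad(params, chunk)
            for k, g in grads.items():
                g = max(-clip, min(clip, g))       # gradient clipping
                vel[k] = momentum * vel[k] + g     # SGD with momentum
                params[k] -= lr * vel[k]
    return scores
```

The reported bpb is then computed from the pre-adaptation scores only.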
LR Schedule
cosine decay
parameters: {"across_chunks":true}
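Cosine decay "across chunks" can be read as annealing the TTT learning rate from its base value to zero over the sequence of chunks; the base rate below reuses the listed TTT learning rate 0.002, and the zero floor is an assumption:

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr=0.002, min_lr=0.0):
    # Cosine decay from base_lr (first chunk) to min_lr (last chunk).
    t = chunk_idx / max(1, n_chunks - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Early chunks adapt aggressively while late chunks make only small updates, which limits drift over long documents.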
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
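The stated rule is a fixed per-layer multiplier on the LayerNorm output:

```python
import math

def ln_scale(layer_idx):
    # Layerwise LN output scale 1/sqrt(layer + 1): deeper layers
    # contribute progressively less to the residual stream.
    return 1.0 / math.sqrt(layer_idx + 1)
```

This damps the growth of residual-stream magnitude with depth, acting as a lightweight regularizer.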
Other
other
Memmap multi-shard data pipeline with global window sampling, coprime stride over shards, merged slab reads, and asynchronous CPU/GPU prefetch using a daemon thread plus CUDA streams/events.
parameters: {"memmap":true,"multi_shard":true,"gpu_prefetch":true}
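The coprime-stride component can be sketched in isolation: walking window indices with a stride coprime to the window count visits every window exactly once per cycle, giving a cheap deterministic shuffle across shards. The stride-bumping fallback below is an assumption of this sketch, not necessarily the PR's exact sampler:

```python
import math

def coprime_stride_order(n_windows, stride):
    # Choose a stride coprime with n_windows so the walk is a full
    # permutation of 0..n_windows-1 (a single cycle in Z/n).
    while math.gcd(stride, n_windows) != 1:
        stride += 1
    return [(i * stride) % n_windows for i in range(n_windows)]
```

Unlike a true random shuffle, this needs no per-epoch permutation buffer and keeps accesses stride-regular, which pairs well with merged slab reads from memmapped shards.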

Novel Contributions

  • Memmap-based multi-shard distributed token loader
  • Global training window sampling across shards with coprime stride and diversity-aware shard weighting
  • Merged slab reads to reduce mmap churn
  • Asynchronous CPU batch construction with GPU prefetch via CUDA streams and events
  • Legal score-first test-time training with chunk-wise adaptation
  • LeakyReLU² MLP nonlinearity
  • Parallel Muon optimizer usage
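The CPU half of the asynchronous prefetch above can be sketched with stdlib primitives: a daemon thread builds batches ahead of the training loop behind a bounded queue. The GPU half (overlapping host-to-device copies via CUDA streams and events) is not shown here, since it requires the CUDA runtime:

```python
import queue
import threading

def prefetch(batch_iter, depth=2):
    # Daemon thread constructs batches ahead of the consumer; the bounded
    # queue caps memory held by in-flight batches. In the full pipeline a
    # side CUDA stream would also overlap the H2D copy with compute.
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of the iterator

    def worker():
        for b in batch_iter:
            q.put(b)
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        b = q.get()
        if b is done:
            break
        yield b
```

Because the thread is a daemon, a crashed or early-exiting training loop does not hang the process waiting on the producer.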