PR #680
Status: open
Add non-record 10min/16MB submission: Wavelet-Lite PR549 Parallel Muon (1.1483)
by bro4all · View on GitHub
val_bpb
1.1483
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,859,711 bytes
Training Techniques
Architecture
wavelet-lite mixer
Adds a tiny causal Haar-style wavelet-lite mixer inside each residual block, splitting the first 16 post-attention channels into low/high bands using the current token and a one-token lagged copy, with a learned low-band drift scale.
parameters: {"dimensions":16}
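A minimal numpy sketch of the mixer described above, assuming the standard Haar average/difference split; the function name, the recombination `drift * low + high`, and the pass-through of the remaining channels are illustrative assumptions, with the drift vector initialized per the PR's WAVELET_INIT=0.25:

```python
import numpy as np

def wavelet_lite_mix(x: np.ndarray, drift: np.ndarray) -> np.ndarray:
    """Causal Haar-style wavelet-lite mixer (illustrative sketch).

    Splits the first len(drift) channels of x (tokens x channels) into a low
    band (average of the current token and a one-token-lagged copy) and a high
    band (their difference), scales the low band by the learned per-channel
    drift, and recombines. Remaining channels pass through unchanged.
    """
    d = drift.shape[0]                                 # e.g. dimensions=16
    band = x[:, :d]
    lagged = np.vstack([np.zeros((1, d)), band[:-1]])  # one-token lag; zero row at t=0 keeps it causal
    low = 0.5 * (band + lagged)                        # smooth / low-frequency band
    high = 0.5 * (band - lagged)                       # detail / high-frequency band
    out = x.copy()
    out[:, :d] = drift * low + high
    return out

drift = np.full(2, 0.25)  # learned parameter; WAVELET_INIT=0.25
```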
BigramHash
Trims the bigram table to fit the byte budget by using a reduced bigram vocabulary.
parameters: {"bigram_vocab_size":1024}
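One way the reduced bigram vocabulary could work is hashing each byte pair into a 1024-row table; the sketch below uses a multiplicative hash, but the specific hash function is an assumption, not taken from the PR:

```python
def bigram_bucket(prev_byte: int, cur_byte: int, bigram_vocab_size: int = 1024) -> int:
    """Map a byte bigram (each byte in 0..255) into a reduced table of
    `bigram_vocab_size` rows. The 65,536 possible pairs collide into 1,024
    buckets, shrinking the bigram embedding table ~64x to fit the byte budget.
    The multiplicative hash here is illustrative."""
    pair = prev_byte * 256 + cur_byte                   # unique id in [0, 65536)
    return (pair * 2654435761) % bigram_vocab_size      # Knuth-style multiplicative mix
```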
TTT disabled
Removes test-time training from the final budgeted run.
parameters: null
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
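The momentum warmup parameters above only give the endpoints (0.92 to 0.99 over 1500 steps); a linear ramp between them is assumed in this sketch:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon momentum from `start` to `end` over the first `warmup_steps`
    optimizer steps, then hold at `end`. Linear interpolation is an assumption;
    the PR only specifies momentum_warmup_start and momentum_warmup_steps."""
    frac = min(max(step, 0) / warmup_steps, 1.0)
    return start + frac * (end - start)
```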
Weight Averaging
EMA
parameters: null
SWA
parameters: {"enabled":true,"every":50}
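With `every=50`, SWA folds a parameter snapshot into a running equal-weight average every 50 steps; a minimal sketch of that bookkeeping, assuming an incremental mean over numpy parameter dicts (class and method names are illustrative):

```python
import numpy as np

class SWAAverager:
    """Running equal-weight average of parameter snapshots, updated every
    `every` training steps (sketch of the SWA used alongside EMA)."""
    def __init__(self, every: int = 50):
        self.every = every
        self.count = 0
        self.avg = None

    def maybe_update(self, step: int, params: dict) -> None:
        if step % self.every != 0:
            return                                    # only fold in every `every` steps
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.astype(float).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.count  # incremental mean
```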
Evaluation
stride-based eval
parameters: {"stride":64}
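Stride-based eval with `stride=64` typically means sliding the context window by 64 tokens and scoring only the tokens new to each window, so most predictions see near-full left context; a sketch of that span planning, with the function name and return layout as assumptions:

```python
def strided_eval_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Plan scoring spans for stride-based evaluation. Each span is
    (ctx_start, end, score_from): the model is run on tokens [ctx_start, end)
    but loss is accumulated only over [score_from, end), i.e. the `stride`
    tokens that are new relative to the previous window."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```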
Test-Time Training
score-first TTT
parameters: null
Quantization
QAT
bits: 6
scope: all
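The forward pass of 6-bit QAT can be sketched as symmetric per-tensor fake quantization: weights are rounded to an int6 grid and immediately dequantized (gradients would pass straight through the rounding in training). The per-tensor symmetric scheme below is a common choice, not confirmed by the PR:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Quantize-dequantize `w` to a signed `bits`-bit grid (int6: -32..31),
    as used in the QAT forward pass. Symmetric per-tensor scaling assumed."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max() / qmax if w.size else 1.0
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                                   # dequantized weights
```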
Initialization
wavelet init
Uses WAVELET_INIT=0.25 for the wavelet-lite mixer.
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
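A warmdown schedule holds the LR at its full value and then decays it over the final `warmdown_iters` steps; a linear decay to zero is assumed in this sketch (the PR gives only the warmdown length):

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 3500) -> float:
    """Multiplier applied to the base LR: 1.0 until the final `warmdown_iters`
    steps, then linear decay to 0 at `total_steps` (linear shape assumed)."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```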
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Uses gated attention, value residuals, late QAT thresholding, and local NVMe staging for data/tokenizer to meet the 10-minute training budget.
parameters: {"gated_attention":true,"value_residual":true,"late_qat_threshold":0.15,"max_wallclock_seconds":600}
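The PR does not define `late_qat_threshold`; one plausible reading, sketched below purely as an assumed interpretation, is that fake quantization stays off until only the final fraction (0.15) of the training budget remains:

```python
def qat_active(step: int, total_steps: int, late_qat_threshold: float = 0.15) -> bool:
    """Hypothetical reading of late_qat_threshold=0.15: train in full precision
    for the first 85% of steps and enable fake quantization only for the last
    15%, so the quantized weights settle just before the final artifact is cut."""
    return step >= (1.0 - late_qat_threshold) * total_steps
```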
Novel Contributions
- Adds a tiny causal wavelet-lite mixer inside each residual block
- Uses a PR #549-derived Parallel Muon stack with architectural changes rather than a pure retune
- Disables TTT in the final budgeted run to fit the 16MB cap
- Trims the bigram table to BIGRAM_VOCAB_SIZE=1024 to reduce artifact size
- Recovers the final int6 artifact and exact roundtrip evaluation from a persisted full-precision checkpoint