PR #680
Status: open
Add non-record 10min/16MB submission: Wavelet-Lite PR549 Parallel Muon (1.1483)
by bro4all · View on GitHub
val_bpb
1.1483
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,859,711 bytes
Training Techniques
Architecture
wavelet-lite mixer
Adds a tiny causal Haar-style wavelet-lite mixer inside each residual block, splitting the first 16 post-attention channels into low/high bands using the current token and a one-token lagged copy, with a learned low-band drift scale.
parameters: {"dimensions":16}
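A minimal numpy sketch of the mixer described above, assuming the standard Haar average/difference split; the function name, the recombination `drift * low + high`, and the pass-through of the remaining channels are illustrative assumptions, with the drift vector initialized per the PR's WAVELET_INIT=0.25:

```python
import numpy as np

def wavelet_lite_mix(x: np.ndarray, drift: np.ndarray) -> np.ndarray:
    """Causal Haar-style wavelet-lite mixer (illustrative sketch).

    Splits the first len(drift) channels of x (tokens x channels) into a low
    band (average of the current token and a one-token-lagged copy) and a high
    band (their difference), scales the low band by the learned per-channel
    drift, and recombines. Remaining channels pass through unchanged.
    """
    d = drift.shape[0]                                 # e.g. dimensions=16
    band = x[:, :d]
    lagged = np.vstack([np.zeros((1, d)), band[:-1]])  # one-token lag; zero row at t=0 keeps it causal
    low = 0.5 * (band + lagged)                        # smooth / low-frequency band
    high = 0.5 * (band - lagged)                       # detail / high-frequency band
    out = x.copy()
    out[:, :d] = drift * low + high
    return out

drift = np.full(2, 0.25)  # learned parameter; WAVELET_INIT=0.25
```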
BigramHash
Trims the bigram table to fit the byte budget by using a reduced bigram vocabulary.
parameters: {"bigram_vocab_size":1024}
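One way the reduced bigram vocabulary could work is hashing each byte pair into a 1024-row table; the sketch below uses a multiplicative hash, but the specific hash function is an assumption, not taken from the PR:

```python
def bigram_bucket(prev_byte: int, cur_byte: int, bigram_vocab_size: int = 1024) -> int:
    """Map a byte bigram (each byte in 0..255) into a reduced table of
    `bigram_vocab_size` rows. The 65,536 possible pairs collide into 1,024
    buckets, shrinking the bigram embedding table ~64x to fit the byte budget.
    The multiplicative hash here is illustrative."""
    pair = prev_byte * 256 + cur_byte                   # unique id in [0, 65536)
    return (pair * 2654435761) % bigram_vocab_size      # Knuth-style multiplicative mix
```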
TTT disabled
Removes test-time training from the final budgeted run.
parameters: null
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
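The momentum warmup parameters above only give the endpoints (0.92 to 0.99 over 1500 steps); a linear ramp between them is assumed in this sketch:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon momentum from `start` to `end` over the first `warmup_steps`
    optimizer steps, then hold at `end`. Linear interpolation is an assumption;
    the PR only specifies momentum_warmup_start and momentum_warmup_steps."""
    frac = min(max(step, 0) / warmup_steps, 1.0)
    return start + frac * (end - start)
```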
Weight Averaging
EMA
parameters: null
SWA
parameters: {"enabled":true,"every":50}
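With `every=50`, SWA folds a parameter snapshot into a running equal-weight average every 50 steps; a minimal sketch of that bookkeeping, assuming an incremental mean over numpy parameter dicts (class and method names are illustrative):

```python
import numpy as np

class SWAAverager:
    """Running equal-weight average of parameter snapshots, updated every
    `every` training steps (sketch of the SWA used alongside EMA)."""
    def __init__(self, every: int = 50):
        self.every = every
        self.count = 0
        self.avg = None

    def maybe_update(self, step: int, params: dict) -> None:
        if step % self.every != 0:
            return                                    # only fold in every `every` steps
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.astype(float).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.count  # incremental mean
```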
Evaluation
stride-based eval
parameters: {"stride":64}
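Stride-based eval with `stride=64` typically means sliding the context window by 64 tokens and scoring only the tokens new to each window, so most predictions see near-full left context; a sketch of that span planning, with the function name and return layout as assumptions:

```python
def strided_eval_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Plan scoring spans for stride-based evaluation. Each span is
    (ctx_start, end, score_from): the model is run on tokens [ctx_start, end)
    but loss is accumulated only over [score_from, end), i.e. the `stride`
    tokens that are new relative to the previous window."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```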
Test-Time Training
score-first TTT
parameters: null
Quantization
QAT
bits: 6
scope: all
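The forward pass of 6-bit QAT can be sketched as symmetric per-tensor fake quantization: weights are rounded to an int6 grid and immediately dequantized (gradients would pass straight through the rounding in training). The per-tensor symmetric scheme below is a common choice, not confirmed by the PR:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Quantize-dequantize `w` to a signed `bits`-bit grid (int6: -32..31),
    as used in the QAT forward pass. Symmetric per-tensor scaling assumed."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max() / qmax if w.size else 1.0
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                                   # dequantized weights
```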
Initialization
wavelet init
Uses WAVELET_INIT=0.25 for the wavelet-lite mixer.
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
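A warmdown schedule holds the LR at its full value and then decays it over the final `warmdown_iters` steps; a linear decay to zero is assumed in this sketch (the PR gives only the warmdown length):

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 3500) -> float:
    """Multiplier applied to the base LR: 1.0 until the final `warmdown_iters`
    steps, then linear decay to 0 at `total_steps` (linear shape assumed)."""
    if step < total_steps - warmdown_iters:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```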
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Uses gated attention, value residuals, late QAT thresholding, and local NVMe staging for data/tokenizer to meet the 10-minute training budget.
parameters: {"gated_attention":true,"value_residual":true,"late_qat_threshold":0.15,"max_wallclock_seconds":600}
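The PR does not define `late_qat_threshold`; one plausible reading, sketched below purely as an assumed interpretation, is that fake quantization stays off until only the final fraction (0.15) of the training budget remains:

```python
def qat_active(step: int, total_steps: int, late_qat_threshold: float = 0.15) -> bool:
    """Hypothetical reading of late_qat_threshold=0.15: train in full precision
    for the first 85% of steps and enable fake quantization only for the last
    15%, so the quantized weights settle just before the final artifact is cut."""
    return step >= (1.0 - late_qat_threshold) * total_steps
```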
Novel Contributions
- Adds a tiny causal wavelet-lite mixer inside each residual block
- Uses a PR #549-derived Parallel Muon stack with architectural changes rather than a pure retune
- Disables TTT in the final budgeted run to fit the 16MB cap
- Trims the bigram table to BIGRAM_VOCAB_SIZE=1024 to reduce artifact size
- Recovers the final int6 artifact and exact roundtrip evaluation from a persisted full-precision checkpoint