PR #1205

Status: open

Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431

by SergheiBrinza
val_bpb: 1.1431
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.36 MB

Training Techniques

Architecture
BigramHash
Added bigram hash embeddings to provide cheap access to previous-token information.
parameters: {"dimensions":128,"table_size":10240}
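A minimal sketch of what a bigram hash embedding lookup could look like, given the parameters above (table_size 10240, dimension 128). The hash mixer and function names are assumptions for illustration, not the PR's actual code:

```python
TABLE_SIZE = 10240  # from the PR's parameters
DIM = 128           # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int, table_size: int = TABLE_SIZE) -> int:
    # Mix the (previous, current) token pair with a cheap multiplicative hash;
    # the exact mixer is a placeholder, any well-distributed hash works.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % table_size

def bigram_embed(tokens, table):
    # Shift by one so position i sees the pair (tokens[i-1], tokens[i]);
    # a BOS-like 0 stands in for the missing first predecessor.
    prev = [0] + list(tokens[:-1])
    return [table[bigram_bucket(p, c)] for p, c in zip(prev, tokens)]

# Toy table of learned vectors (zeros here; trained jointly in practice).
table = [[0.0] * DIM for _ in range(TABLE_SIZE)]
vecs = bigram_embed([5, 17, 17, 3], table)
```

The looked-up vectors would then be added to the usual token embeddings, giving each position cheap access to previous-token information without any attention cost.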
ReLU²
Used ReLU squared MLP activation with 3x expansion.
parameters: {"hidden":1536}
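The ReLU² MLP can be sketched as follows; the shapes assume hidden 1536 is a 3x expansion of a model dimension of 512 (inferred, not stated), and the weights here are placeholders:

```python
def relu_squared(x: float) -> float:
    # ReLU^2: zero for negative inputs, squared value otherwise.
    return max(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # x: [d_model], w_in: [hidden][d_model], w_out: [d_model][hidden].
    # Expand, apply ReLU^2, then project back down.
    h = [relu_squared(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return [sum(wo * hi for wo, hi in zip(row, h)) for row in w_out]
```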
U-Net skip connections
Added U-Net style skip connections across layers.
parameters: {"layers":10}
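One common way to wire U-Net skips across a 10-layer stack is to save the outputs of the first half and add them back to the mirrored layers of the second half. A sketch under that assumption (the PR may pair layers differently):

```python
def forward(x, layers):
    # First half: run layers normally and stash each output.
    # Second half: add the mirrored early activation before each layer.
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            x = layer(x)
            saved.append(x)
        else:
            x = layer(x + saved.pop())  # skip from the mirrored early layer
    return x
```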
VE128
Applied value residual / token identity injection on selected layers.
parameters: {"layers":[8,9,10]}
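The injection can be sketched as blending the attention values at the selected layers with a per-token learned embedding; the blend weight and names are illustrative assumptions:

```python
VE_LAYERS = {8, 9, 10}  # from the PR's parameters

def inject_values(layer_idx, v, tok_id, value_embed, lam=0.5):
    # Outside the selected layers the values pass through unchanged.
    if layer_idx not in VE_LAYERS:
        return v
    # Blend each value vector with the token's identity embedding,
    # re-injecting raw token information deep in the network.
    e = value_embed[tok_id]
    return [lam * vi + (1 - lam) * ei for vi, ei in zip(v, e)]
```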
Initialization
OrthoInit
Orthogonal initialization for all weight matrices.
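In practice this is usually a one-liner with a library routine (e.g. `torch.nn.init.orthogonal_`); the sketch below shows the underlying idea via Gram-Schmidt on Gaussian rows, purely for illustration:

```python
import math
import random

def orthogonal_matrix(n, rng=random.Random(0)):
    # Draw Gaussian rows and orthonormalize each against the previous ones.
    rows = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for u in rows:
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

Q = orthogonal_matrix(4)
```

Orthogonal weight matrices preserve the norm of activations at initialization, which tends to stabilize early training.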
Weight Averaging
SWA
Stochastic weight averaging: checkpoints from the second half of training are averaged to produce the final weights.
parameters: {"start_fraction":0.5,"interval_steps":50}
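With start_fraction 0.5 and interval_steps 50, the schedule amounts to averaging every 50th checkpoint once training passes the halfway point. A sketch of that reading (the PR more likely keeps a running average rather than storing checkpoints):

```python
def swa_average(weights_by_step, total_steps, start_fraction=0.5, interval=50):
    # Keep only checkpoints at or after the start point, sampled every
    # `interval` steps, and average them element-wise.
    start = int(total_steps * start_fraction)
    picked = [w for step, w in weights_by_step
              if step >= start and step % interval == 0]
    n = len(picked)
    return [sum(col) / n for col in zip(*picked)]
```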
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3,"momentum_warmup_steps":1000,"momentum_start":0.85,"momentum_end":0.99}
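The momentum warmup in other_params reads as a ramp from 0.85 to 0.99 over the first 1000 steps; assuming a linear ramp held constant afterwards (the interpolation shape is not stated), it could look like:

```python
def muon_momentum(step, warmup=1000, start=0.85, end=0.99):
    # Linearly interpolate momentum over the warmup window, then hold.
    t = min(step / warmup, 1.0)
    return start + (end - start) * t
```

Lower momentum early on keeps the first noisy updates from dominating the accumulated direction.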
Quantization
mixed int6/int8
bits: null
scope: weights and embeddings
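A sketch of symmetric per-tensor quantization at an arbitrary bit width, which is one plausible reading of "mixed int6/int8" (e.g. int6 for most weights, int8 for embeddings; the split and rounding scheme are assumptions):

```python
def quantize(xs, bits):
    # Symmetric range: e.g. ±31 for int6, ±127 for int8.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0  # guard all-zero tensors
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the stored ints and scale.
    return [v * scale for v in q]
```

The artifact then stores the integer tensors plus one float scale each, which is where most of the size reduction comes from.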
Compression
zstd
level: 22
LR Schedule
warmdown
parameters: {"warmdown_steps":4500}
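A warmdown schedule with warmdown_steps 4500 is typically a flat learning rate followed by a linear decay to zero over the final 4500 steps; a sketch under that assumption:

```python
def lr_scale(step, total_steps, warmdown_steps=4500):
    # Multiplier on the base LR: 1.0 until the warmdown window begins,
    # then linear decay to 0 at the final step.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```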
Evaluation
sliding window eval
parameters: null
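Sliding-window evaluation usually means scoring a long sequence in overlapping windows so every scored token gets substantial left context. Since the PR gives no parameters, the window and stride below are purely illustrative; the function only plans the spans (context start, score start, score end):

```python
def eval_windows(n_tokens, window=1024, stride=512):
    # Each span scores `stride` tokens, conditioning on up to
    # `window - stride` preceding tokens of context.
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```

Summing per-token losses over the scored spans and dividing by total bytes gives the reported bits-per-byte figure.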

Novel Contributions

  • Wider EngramLite / BigramHash-style embedding table (10240) for more n-gram coverage
  • VE applied on layers 8, 9, and 10 for additional token identity injection
  • Higher learning rate for faster convergence
  • Longer warmdown schedule for smoother weight averaging
  • Muon momentum warmup adjustment
  • Mixed quantization and zstd compression to fit the artifact budget