PR #1205
Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431
by SergheiBrinza
val_bpb
1.1431
Architecture
Transformer
Optimizer
Muon
Artifact Size
16.36MB
Training Techniques
Architecture
BigramHash
Added bigram hash embeddings to provide cheap access to previous-token information.
parameters: {"dimensions":128,"table_size":10240}
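A minimal sketch of the bigram-hash lookup: the (previous, current) token pair is hashed into a fixed table of 10240 rows of dimension 128 (both values from the PR's parameters). The multiplicative mixing constant is an arbitrary choice for illustration, not something specified in this PR.

```python
TABLE_SIZE = 10240  # "table_size" from the PR
DIM = 128           # "dimensions" from the PR

def bigram_slot(prev_token: int, token: int, table_size: int = TABLE_SIZE) -> int:
    """Hash the (previous, current) token pair into a table slot.

    The odd multiplier is just a mixing constant (an assumption here);
    collisions are tolerated -- the table trades exactness for cheap
    access to previous-token (bigram) information.
    """
    return (prev_token * 1000003 + token) % table_size

# The embedding table itself would be a learned [TABLE_SIZE, DIM] matrix;
# the slot indexes a row that is added to the regular token embedding.
```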
ReLU²
Used ReLU squared MLP activation with 3x expansion.
parameters: {"hidden":1536}
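The activation itself is simple: ReLU followed by squaring. With a 3x expansion and hidden size 1536, the implied model width is 512. A scalar sketch:

```python
def relu2(x: float) -> float:
    """ReLU-squared activation: max(x, 0) ** 2.

    Used in place of GELU in the MLP; with 3x expansion the MLP maps
    512 -> 1536 -> 512 (1536 / 3 = 512 implied model width).
    """
    return max(x, 0.0) ** 2
```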
U-Net skip connections
Added U-Net style skip connections across layers.
parameters: {"layers":10}
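The skip wiring can be sketched independently of the transformer internals: outputs of the first half of the stack are pushed onto a stack and added back, in mirrored order, before each layer of the second half. This is a minimal sketch assuming an even layer count (10 here) and additive skips; the actual PR may use learned skip weights.

```python
def unet_forward(x, layers):
    """U-Net style wiring over an even number of layers.

    Outputs of the first n//2 layers are saved; each of the last n//2
    layers first adds the mirrored early activation, then applies itself.
    """
    n = len(layers)
    assert n % 2 == 0, "sketch assumes an even layer count"
    stack = []
    for i, layer in enumerate(layers):
        if i >= n // 2:
            x = x + stack.pop()      # skip from the mirrored early layer
        x = layer(x)
        if i < n // 2:
            stack.append(x)          # save for the mirrored late layer
    return x
```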
VE128
Applied value residual / token identity injection on selected layers.
parameters: {"layers":[8,9,10]}
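A minimal sketch of the value-residual idea on layers 8, 9, 10 (from the PR's parameters): on the selected layers, the attention value vector is blended with a learned per-token embedding, re-injecting raw token identity late in the network. The mixing weight `lam` is a hypothetical fixed scalar here; implementations often learn it per layer.

```python
VE_LAYERS = {8, 9, 10}  # "layers" from the PR

def mix_value(layer_idx, v, token_value_emb, lam=0.5):
    """Blend the attention value vector with a per-token value embedding
    on VE layers; other layers pass through unchanged.

    `lam` is an assumed fixed mixing weight for illustration.
    """
    if layer_idx not in VE_LAYERS:
        return v
    return [(1 - lam) * a + lam * b for a, b in zip(v, token_value_emb)]
```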
Initialization
OrthoInit
Orthogonal initialization for all weight matrices.
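Orthogonal init is conventionally built by QR-decomposing a Gaussian matrix and sign-correcting with the diagonal of R (the same construction behind `torch.nn.init.orthogonal_`). A numpy sketch:

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal initialization via QR of a Gaussian matrix.

    Sign-correcting by diag(R) makes the result uniformly distributed
    over orthogonal matrices (Haar measure), not just any Q from QR.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    flat = a if rows >= cols else a.T
    q, r = np.linalg.qr(flat)
    q = q * np.sign(np.diag(r))      # fix column signs
    if rows < cols:
        q = q.T
    return gain * q
```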
Weight Averaging
SWA
parameters: {"start_fraction":0.5,"interval_steps":50}
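With `start_fraction` 0.5 and `interval_steps` 50, SWA averages one checkpoint every 50 steps over the back half of training. A minimal sketch using a running mean over flattened weight vectors:

```python
def swa_average(weights_by_step, total_steps, start_fraction=0.5, interval=50):
    """Stochastic Weight Averaging over checkpoints.

    Includes a checkpoint if its step is past start_fraction of training
    and lands on the interval; maintains an exact running mean.
    """
    start = int(total_steps * start_fraction)
    avg, n = None, 0
    for step, w in weights_by_step:
        if step < start or step % interval:
            continue
        n += 1
        if avg is None:
            avg = list(w)
        else:
            avg = [a + (x - a) / n for a, x in zip(avg, w)]
    return avg
```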
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3,"momentum_warmup_steps":1000,"momentum_start":0.85,"momentum_end":0.99}
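The momentum warmup in `other_params` reads as a linear ramp from 0.85 to 0.99 over the first 1000 steps, then a hold; that is why the top-level `momentum` field is null. A sketch of that schedule:

```python
def muon_momentum(step, warmup_steps=1000, start=0.85, end=0.99):
    """Linear momentum warmup for Muon.

    Ramps from `start` to `end` over `warmup_steps`, then holds at `end`
    (values from the PR's other_params).
    """
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```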
Quantization
mixed int6/int8
bits: null
scope: weights and embeddings
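A minimal sketch of symmetric per-tensor quantization at either width. Which tensors get int6 versus int8 is not stated in the PR; a common split (an assumption here) is int6 for weight matrices and int8 for embeddings.

```python
def quantize(w, bits):
    """Symmetric per-tensor quantization of a float list.

    qmax is 31 for int6 and 127 for int8; values are scaled by the
    tensor's max magnitude, rounded, and clamped.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(x) for x in w) / qmax) or 1.0   # avoid scale 0 for all-zero w
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by scale / 2."""
    return [x * scale for x in q]
```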
Compression
zstd
level: 22
LR Schedule
warmdown
parameters: {"warmdown_steps":4500}
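The warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final 4500 steps (the PR's `warmdown_steps`). A sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=4500):
    """Constant-then-linear 'warmdown' LR schedule.

    Full base_lr until the final warmdown_steps, then a straight line
    down to zero at total_steps.
    """
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```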
Evaluation
sliding window eval
parameters: null
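The eval parameters are not recorded, but sliding-window evaluation generally strides overlapping windows over the token stream and scores each token exactly once, with as much left context as the window allows. A sketch of the span bookkeeping, with window and stride sizes chosen purely for illustration:

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Plan (window_begin, window_end, score_from) spans.

    Each window of `window` tokens is fed to the model, but loss is only
    counted on tokens not already scored by an earlier window, so every
    token is evaluated exactly once. window/stride are assumed values;
    the PR does not record them.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # score tokens [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```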
Novel Contributions
- Wider EngramLite / BigramHash-style embedding table (10240) for more n-gram coverage
- VE applied on layers 8, 9, and 10 for additional token identity injection
- Higher learning rate for faster convergence
- Longer warmdown schedule for smoother weight averaging
- Muon momentum warmup adjustment
- Mixed quantization and zstd compression to fit the artifact budget