PR #1205

Status: open

Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431

by SergheiBrinza
val_bpb: 1.1431
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.36 MB

Training Techniques

Architecture
BigramHash
Added bigram hash embeddings to provide cheap access to previous-token information.
parameters: {"dimensions":128,"table_size":10240}
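A minimal sketch of what a bigram hash embedding lookup could look like, given the parameters above (table_size 10240, dimension 128). The hash mixer and function names are assumptions for illustration, not the PR's actual code:

```python
TABLE_SIZE = 10240  # from the PR's parameters
DIM = 128           # from the PR's parameters

def bigram_bucket(prev_tok: int, cur_tok: int, table_size: int = TABLE_SIZE) -> int:
    # Mix the (previous, current) token pair with a cheap multiplicative hash;
    # the exact mixer is a placeholder, any well-distributed hash works.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % table_size

def bigram_embed(tokens, table):
    # Shift by one so position i sees the pair (tokens[i-1], tokens[i]);
    # a BOS-like 0 stands in for the missing first predecessor.
    prev = [0] + list(tokens[:-1])
    return [table[bigram_bucket(p, c)] for p, c in zip(prev, tokens)]

# Toy table of learned vectors (zeros here; trained jointly in practice).
table = [[0.0] * DIM for _ in range(TABLE_SIZE)]
vecs = bigram_embed([5, 17, 17, 3], table)
```

The looked-up vectors would then be added to the usual token embeddings, giving each position cheap access to previous-token information without any attention cost.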
ReLU²
Used ReLU squared MLP activation with 3x expansion.
parameters: {"hidden":1536}
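The ReLU² MLP can be sketched as follows; the shapes assume hidden 1536 is a 3x expansion of a model dimension of 512 (inferred, not stated), and the weights here are placeholders:

```python
def relu_squared(x: float) -> float:
    # ReLU^2: zero for negative inputs, squared value otherwise.
    return max(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    # x: [d_model], w_in: [hidden][d_model], w_out: [d_model][hidden].
    # Expand, apply ReLU^2, then project back down.
    h = [relu_squared(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return [sum(wo * hi for wo, hi in zip(row, h)) for row in w_out]
```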
U-Net skip connections
Added U-Net style skip connections across layers.
parameters: {"layers":10}
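One common way to wire U-Net skips across a 10-layer stack is to save the outputs of the first half and add them back to the mirrored layers of the second half. A sketch under that assumption (the PR may pair layers differently):

```python
def forward(x, layers):
    # First half: run layers normally and stash each output.
    # Second half: add the mirrored early activation before each layer.
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            x = layer(x)
            saved.append(x)
        else:
            x = layer(x + saved.pop())  # skip from the mirrored early layer
    return x
```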
VE128
Applied value residual / token identity injection on selected layers.
parameters: {"layers":[8,9,10]}
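The injection can be sketched as blending the attention values at the selected layers with a per-token learned embedding; the blend weight and names are illustrative assumptions:

```python
VE_LAYERS = {8, 9, 10}  # from the PR's parameters

def inject_values(layer_idx, v, tok_id, value_embed, lam=0.5):
    # Outside the selected layers the values pass through unchanged.
    if layer_idx not in VE_LAYERS:
        return v
    # Blend each value vector with the token's identity embedding,
    # re-injecting raw token information deep in the network.
    e = value_embed[tok_id]
    return [lam * vi + (1 - lam) * ei for vi, ei in zip(v, e)]
```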
Initialization
OrthoInit
Orthogonal initialization for all weight matrices.
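In practice this is usually a one-liner with a library routine (e.g. `torch.nn.init.orthogonal_`); the sketch below shows the underlying idea via Gram-Schmidt on Gaussian rows, purely for illustration:

```python
import math
import random

def orthogonal_matrix(n, rng=random.Random(0)):
    # Draw Gaussian rows and orthonormalize each against the previous ones.
    rows = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for u in rows:
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

Q = orthogonal_matrix(4)
```

Orthogonal weight matrices preserve the norm of activations at initialization, which tends to stabilize early training.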
Weight Averaging
SWA
Stochastic weight averaging: checkpoints from the second half of training are averaged to produce the final weights.
parameters: {"start_fraction":0.5,"interval_steps":50}
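With start_fraction 0.5 and interval_steps 50, the schedule amounts to averaging every 50th checkpoint once training passes the halfway point. A sketch of that reading (the PR more likely keeps a running average rather than storing checkpoints):

```python
def swa_average(weights_by_step, total_steps, start_fraction=0.5, interval=50):
    # Keep only checkpoints at or after the start point, sampled every
    # `interval` steps, and average them element-wise.
    start = int(total_steps * start_fraction)
    picked = [w for step, w in weights_by_step
              if step >= start and step % interval == 0]
    n = len(picked)
    return [sum(col) / n for col in zip(*picked)]
```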
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3,"momentum_warmup_steps":1000,"momentum_start":0.85,"momentum_end":0.99}
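The momentum warmup in other_params reads as a ramp from 0.85 to 0.99 over the first 1000 steps; assuming a linear ramp held constant afterwards (the interpolation shape is not stated), it could look like:

```python
def muon_momentum(step, warmup=1000, start=0.85, end=0.99):
    # Linearly interpolate momentum over the warmup window, then hold.
    t = min(step / warmup, 1.0)
    return start + (end - start) * t
```

Lower momentum early on keeps the first noisy updates from dominating the accumulated direction.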
Quantization
mixed int6/int8
bits: null
scope: weights and embeddings
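A sketch of symmetric per-tensor quantization at an arbitrary bit width, which is one plausible reading of "mixed int6/int8" (e.g. int6 for most weights, int8 for embeddings; the split and rounding scheme are assumptions):

```python
def quantize(xs, bits):
    # Symmetric range: e.g. ±31 for int6, ±127 for int8.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0  # guard all-zero tensors
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the stored ints and scale.
    return [v * scale for v in q]
```

The artifact then stores the integer tensors plus one float scale each, which is where most of the size reduction comes from.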
Compression
zstd
level: 22
LR Schedule
warmdown
parameters: {"warmdown_steps":4500}
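A warmdown schedule with warmdown_steps 4500 is typically a flat learning rate followed by a linear decay to zero over the final 4500 steps; a sketch under that assumption:

```python
def lr_scale(step, total_steps, warmdown_steps=4500):
    # Multiplier on the base LR: 1.0 until the warmdown window begins,
    # then linear decay to 0 at the final step.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```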
Evaluation
sliding window eval
parameters: null
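Sliding-window evaluation usually means scoring a long sequence in overlapping windows so every scored token gets substantial left context. Since the PR gives no parameters, the window and stride below are purely illustrative; the function only plans the spans (context start, score start, score end):

```python
def eval_windows(n_tokens, window=1024, stride=512):
    # Each span scores `stride` tokens, conditioning on up to
    # `window - stride` preceding tokens of context.
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```

Summing per-token losses over the scored spans and dividing by total bytes gives the reported bits-per-byte figure.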

Novel Contributions

  • Wider EngramLite / BigramHash-style embedding table (10240) for more n-gram coverage
  • VE applied on layers 8, 9, and 10 for additional token identity injection
  • Higher learning rate for faster convergence
  • Longer warmdown schedule for smoother weight averaging
  • Muon momentum warmup adjustment
  • Mixed quantization and zstd compression to fit the artifact budget