PR #1089 (open)

Record Submission: 1.1086 BPB - Turbo-Muon + EngramLite + ParamBanking (11L 512d)

by mikeapedia
val_bpb: 1.1086
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.3 MB

Training Techniques

Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"Turbo-Muon":true,"AOL_preconditioning":true,"Polar_Express_coefficients":true,"post_ns_normalization":"row_col"}
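
For readers unfamiliar with Muon, a minimal sketch of the Newton-Schulz orthogonalization at its core. The coefficient triple below is the stock Muon quintic, used as a placeholder: the per-iteration Polar Express coefficients, the AOL preconditioning, and the row_col post-normalization named above are not spelled out in this PR, so they are not reproduced here.

```python
import torch

# Classic Muon quintic triple, used as a placeholder: the submission
# substitutes per-iteration "Polar Express" coefficients, whose exact
# values are not listed in this PR.
NS_COEFFS = [(3.4445, -4.7750, 2.0315)] * 5

def orthogonalize(G: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz iteration toward the orthogonal polar factor of a 2D G."""
    X = G.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:               # iterate on the wide orientation
        X = X.mT
    X = X / (X.norm() + eps)     # pull the spectral norm below 1 so the iteration converges
    for a, b, c in NS_COEFFS:
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.mT if transposed else X).to(G.dtype)
```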
Architecture
BigramHash
Multi-head prime-based hash embeddings capturing bigram statistics.
parameters: {"heads":2,"buckets":8192}
TrigramHash
Multi-head prime-based hash embeddings capturing trigram statistics.
parameters: {"heads":2,"buckets":8192}
U-Net skip connections
Learned sigmoid-gated encoder/decoder skip paths.
parameters: null
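
A sketch of one such skip path, assuming a per-channel learned gate initialized mostly closed; the PR states only "learned sigmoid-gated" paths, so the gate's parameterization and initialization here are guesses.

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Sigmoid-gated skip from a saved encoder-side activation into a
    decoder-side layer (gate form assumed, not stated in the PR)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), -2.0))  # sigmoid(-2) ~ 0.12: starts mostly closed

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        return x + torch.sigmoid(self.gate) * skip
```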
ValueEmbedding
Reinjects token identity into attention values at deep layers.
parameters: {"layers":[9,10]}
SmearGate
Causal shift blending each token with its predecessor.
parameters: null
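
A sketch of the causal predecessor blend, assuming the gate is a learned sigmoid computed from the current position; only the shift-and-blend idea comes from the entry above.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blends each position with its predecessor through a learned sigmoid
    gate (gate parameterization assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, T, D)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)      # causal shift by one
        g = torch.sigmoid(self.proj(x))                     # (B, T, 1)
        return x + g * (prev - x)  # g=0 keeps the token, g=1 takes its predecessor
```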
LeakyReLU
Uses squared LeakyReLU with slope 0.3 as the MLP activation.
parameters: {"slope":0.3,"squared":true}
Partial RoPE
Applies RoPE to a subset of dimensions.
parameters: {"dimensions":16}
Quantization
GPTQ mixed int6/int7
bits: null
scope: block weights
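
For orientation, a sketch of the symmetric grid a 6- or 7-bit pass rounds onto; actual GPTQ additionally orders columns and compensates rounding error with second-order (Hessian) information, which is omitted here.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-output-channel quantize/dequantize onto a 2^bits grid.
    This shows only the rounding grid, not the GPTQ error-compensation loop."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale
```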
late QAT
bits: null
scope: quantized weights
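
A sketch of sigmoid-based soft rounding for the late QAT phase: a differentiable step between adjacent integers that is near-identity for small alpha and approaches hard round() as alpha grows, so ramping alpha up over the final stretch of training anneals weights onto the grid. This parameterization is an assumption; the PR says only "soft-round sigmoid ramp".

```python
import torch

def soft_round(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """Differentiable rounding via a renormalized sigmoid step.
    alpha would be ramped up over the late-QAT window (schedule assumed)."""
    lo = x.floor()
    r = x - lo - 0.5                                         # offset in [-0.5, 0.5)
    s0 = torch.sigmoid(torch.tensor(-0.5 * alpha))
    s1 = torch.sigmoid(torch.tensor(0.5 * alpha))
    return lo + (torch.sigmoid(alpha * r) - s0) / (s1 - s0)  # endpoints hit 0 and 1
```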
Weight Averaging
SWA
parameters: {"every":50,"start_after_fraction":0.2}
EMA
parameters: {"decay":0.997}
Compression
brotli
level: null
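
A sketch of the byte-shuffle-plus-Brotli step named in the novel-contributions list below: grouping the k-th byte of every element together before compression, in the manner of HDF5's shuffle filter, which typically helps on quantized weights. The level is null above, so quality=11 here is just the library maximum, an assumption.

```python
import brotli
import numpy as np

def shuffle_and_compress(arrays: list, quality: int = 11) -> bytes:
    """Byte-shuffle each array, concatenate, then Brotli-compress. Framing and
    metadata needed for decompression are omitted; this shows only the
    transform + compress step."""
    blob = bytearray()
    for a in arrays:
        flat = np.ascontiguousarray(a).reshape(-1)
        planes = flat.view(np.uint8).reshape(flat.size, flat.dtype.itemsize)
        blob += planes.T.tobytes()   # all first bytes, then all second bytes, ...
    return brotli.compress(bytes(blob), quality=quality)
```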
Sequence Length
train_length: 2048
eval_length: 2048

Novel Contributions

  • Turbo-Muon optimizer with AOL preconditioning, Polar Express coefficients, and row_col post-normalization
  • EngramLite multi-head prime-based hash embeddings for bigram and trigram context
  • Parameter banking with contiguous 3D tensors enabling batched Newton-Schulz via torch.bmm (see the sketch after this list)
  • U-Net style gated skip connections
  • ValueEmbedding at deep layers to reinject token identity
  • SmearGate causal predecessor blending
  • GPTQ mixed-precision compression with Hessian-based bit allocation
  • Late QAT with soft-round sigmoid ramp
  • Brotli plus byte-shuffle artifact compression
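
A sketch of the parameter-banking idea referenced above, assuming all banked matrices share one (rows <= cols) shape: same-shaped weights live as slices of one contiguous 3D tensor, so a single batched Newton-Schulz runs via torch.bmm instead of a Python loop over per-matrix matmuls. Coefficients are again the stock Muon placeholders, not the Polar Express values.

```python
import torch

def banked_newton_schulz(bank: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """bank: (N, r, c) contiguous stack of N same-shaped gradients, r <= c assumed."""
    X = bank.bfloat16()
    X = X / X.flatten(1).norm(dim=1).clamp(min=eps)[:, None, None]
    for a, b, c in [(3.4445, -4.7750, 2.0315)] * 5:   # placeholder triples
        A = torch.bmm(X, X.mT)                        # (N, r, r), one batched matmul
        X = a * X + torch.bmm(b * A + c * torch.bmm(A, A), X)
    return X.to(bank.dtype)
```

Each layer then holds a view into its slice (e.g. `w = bank[i]`), so the optimizer step touches one contiguous tensor instead of N separate ones.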