PR #1169

open

Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean)

by Bortlesboat
val_bpb: 1.1126
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Turbo-Muon":true,"AOL preconditioning":true,"Polar Express coefficients":true,"row_col normalization":true,"newton_schulz_iterations":4}
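
For context, the core Muon update is sketched below: per-matrix momentum followed by approximate orthogonalization via a fixed number of Newton-Schulz iterations (newton_schulz_iterations: 4 above). The quintic coefficients and hyperparameters here are the commonly used reference values, not the Turbo-Muon / AOL / Polar Express variants this submission actually uses.

```python
import torch

def newton_schulz(g: torch.Tensor, steps: int = 4) -> torch.Tensor:
    # Approximately map g toward the nearest semi-orthogonal matrix with a
    # quintic Newton-Schulz iteration. Coefficients are the common reference
    # values, not the Polar Express set used in this PR.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.float()
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    x = x / (x.norm() + 1e-7)  # scale so the iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95, ns_steps=4):
    # One Muon-style step for a single 2-D weight matrix: accumulate momentum,
    # orthogonalize the buffer, apply it as the update.
    momentum_buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz(momentum_buf, steps=ns_steps), alpha=-lr)
```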
Architecture
BigramHash
Hash embeddings for bigrams
parameters: {"heads":2,"buckets":8192}
TrigramHash
Hash embeddings for trigrams
parameters: {"heads":2,"buckets":8192}
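
A rough sketch of how hashed n-gram embeddings with 2 heads and 8192 buckets can be wired up; the hash function and the way the result is combined with the token stream are assumptions, not the PR's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramHashEmbedding(nn.Module):
    """Multi-head hash embedding for n-grams of token ids (n=2 for BigramHash, n=3 for TrigramHash)."""

    def __init__(self, n: int, dim: int, heads: int = 2, buckets: int = 8192):
        super().__init__()
        self.n, self.heads, self.buckets = n, heads, buckets
        self.tables = nn.ModuleList(nn.Embedding(buckets, dim) for _ in range(heads))
        # Random odd multipliers give each head an independent (cheap) hash of the n-gram.
        self.register_buffer("mults", torch.randint(1, 2**20, (heads, n)) * 2 + 1)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq). Build the n-gram ending at each position, padding the
        # left edge with token 0, so the output has shape (batch, seq, dim).
        seq = ids.size(1)
        grams = torch.stack(
            [F.pad(ids, (self.n - 1 - i, 0))[:, :seq] for i in range(self.n)], dim=-1
        )  # (batch, seq, n)
        out = 0
        for h in range(self.heads):
            bucket_ids = (grams * self.mults[h]).sum(dim=-1) % self.buckets
            out = out + self.tables[h](bucket_ids)
        return out
```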
GQA
Grouped query attention
parameters: {"heads":8,"kv_heads":4}
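
Grouped-query attention with 8 query heads sharing 4 KV heads, as a minimal self-contained module; the projection layout and the use of scaled_dot_product_attention are illustrative choices, not necessarily this PR's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, kv_heads: int = 4):
        super().__init__()
        assert heads % kv_heads == 0
        self.heads, self.kv_heads, self.head_dim = heads, kv_heads, dim // heads
        self.q_proj = nn.Linear(dim, heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(b, t, 2, self.kv_heads, self.head_dim).unbind(2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        # Each group of heads // kv_heads query heads shares one K/V head.
        rep = self.heads // self.kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))
```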
LeakyReLU
MLP uses LeakyReLU(0.3) squared activation
parameters: {"negative_slope":0.3}
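
The squared LeakyReLU(0.3) activation is simple enough to show inline; the 4x hidden width is an assumption, and NEG_SLOPE mirrors the shared negative-slope constant mentioned under Novel Contributions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NEG_SLOPE = 0.3  # shared negative-slope constant

class SquaredLeakyReLUMLP(nn.Module):
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_mult * dim, bias=False)
        self.fc2 = nn.Linear(hidden_mult * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LeakyReLU(0.3) squared: like ReLU^2 but keeps a small signal for negative inputs.
        h = F.leaky_relu(self.fc1(x), negative_slope=NEG_SLOPE)
        return self.fc2(h * h)
```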
U-Net skip connections
Sigmoid-gated skip connections in a U-Net style
parameters: null
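
One way to read "sigmoid-gated skip connections in a U-Net style": early-layer activations are stashed and mixed into mirrored late layers through a learned sigmoid gate. The pairing scheme and per-channel gate granularity below are assumptions.

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    # One learnable per-channel gate for a single (early layer -> late layer) pair,
    # e.g. first-half layer i feeding second-half layer (n_layers - 1 - i).
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5 at init

    def forward(self, x_late: torch.Tensor, x_early: torch.Tensor) -> torch.Tensor:
        return x_late + torch.sigmoid(self.gate) * x_early
```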
RoPE
Partial rotary positional embeddings
parameters: {"partial_dim":16}
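
Partial RoPE with partial_dim: 16 rotates only the first 16 dimensions of each head and leaves the rest untouched; a minimal sketch (the base frequency of 10000 is an assumption):

```python
import torch

def partial_rope(x: torch.Tensor, partial_dim: int = 16, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Rotate only the first `partial_dim` dims
    # of each head; the remaining dims pass through unchanged.
    rot, rest = x[..., :partial_dim], x[..., partial_dim:]
    half = partial_dim // 2
    seq = x.size(-2)
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(seq, dtype=torch.float32, device=x.device)[:, None] * inv_freq  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```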
SmearGate
SmearGate component
parameters: null
ValueEmbedding
ValueEmbedding used in later layers
parameters: {"layers":[9,10]}
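
"ValueEmbedding used in later layers" is read here as a token-indexed embedding added to the attention value path at layers 9 and 10, in the style of other speedrun entries; both that interpretation and the sketch below are assumptions.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    # Token-indexed embedding mixed into the attention values of selected layers.
    def __init__(self, vocab_size: int, dim: int, layers=(9, 10)):
        super().__init__()
        self.layers = set(layers)
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, v: torch.Tensor, token_ids: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # v: (batch, seq, dim) attention values before the head split.
        if layer_idx in self.layers:
            v = v + self.emb(token_ids)
        return v
```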
Weight Averaging
SWA
parameters: {"threshold":0.2,"every":50,"snapshots":14}
EMA
parameters: {"decay":0.997,"fallback":true}
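
A sketch of the two weight averages as listed: a uniform average over snapshots taken every 50 steps (SWA) plus an EMA with decay 0.997 kept as a fallback. The 0.2 threshold is not modeled here, and the bookkeeping is an assumption.

```python
import torch

class AveragedWeights:
    """Keeps a uniform SWA average over periodic snapshots and an EMA of the weights."""

    def __init__(self, model: torch.nn.Module, swa_every: int = 50, ema_decay: float = 0.997):
        self.swa_every, self.ema_decay, self.n_snapshots = swa_every, ema_decay, 0
        self.swa = {k: torch.zeros_like(p, dtype=torch.float32) for k, p in model.named_parameters()}
        self.ema = {k: p.detach().clone().float() for k, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module, step: int):
        take_snapshot = step % self.swa_every == 0
        if take_snapshot:
            self.n_snapshots += 1
        for k, p in model.named_parameters():
            w = p.detach().float()
            self.ema[k].mul_(self.ema_decay).add_(w, alpha=1.0 - self.ema_decay)
            if take_snapshot:
                # Running uniform mean over the snapshots taken so far.
                self.swa[k].add_((w - self.swa[k]) / self.n_snapshots)
```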
Quantization
GPTQ
bits: null
scope: mixed int5 base with selective int6/int7 promotion
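
GPTQ itself is not reproduced here, but "mixed int5 base with selective int6/int7 promotion" suggests a per-tensor bit assignment policy along these lines: rank tensors by the error they incur at int5 and promote the worst ones while an extra-bit budget lasts. The error metric, budget, and ordering below are all assumptions.

```python
import torch

def fake_quant_error(w: torch.Tensor, bits: int) -> float:
    # Symmetric per-tensor fake quantization, used only to rank tensors by sensitivity.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return (w - wq).pow(2).mean().item()

def assign_bits(weights: dict, base_bits: int = 5, budget_extra_bits: float = 0.0) -> dict:
    # Start everything at int5, then promote the most error-prone tensors to int6/int7
    # while the extra-bit budget (bits * parameters) lasts.
    bits = {name: base_bits for name in weights}
    by_error = sorted(weights, key=lambda n: fake_quant_error(weights[n], base_bits), reverse=True)
    remaining = budget_extra_bits
    for name in by_error:
        numel = weights[name].numel()
        for target in (7, 6):  # try the larger promotion first
            cost = (target - base_bits) * numel
            if cost <= remaining:
                bits[name], remaining = target, remaining - cost
                break
    return bits
```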
Evaluation
sliding window eval
parameters: {"stride":64}
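
Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and scores only the fresh tokens, so every token is predicted with close to a full window of left context. A sketch, assuming a causal LM that returns per-position logits:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, token_ids: torch.Tensor, ctx_len: int, stride: int = 64,
                       bytes_per_token: float = 1.0) -> float:
    """token_ids: (1, total_len). Each target token is scored exactly once, with up to
    ctx_len - 1 tokens of left context, advancing the window `stride` tokens at a time."""
    assert stride < ctx_len
    total = token_ids.size(1)
    total_nll, prev_end = 0.0, 1          # targets with index < prev_end are already scored
    end = min(ctx_len, total)
    while prev_end < total:
        window = token_ids[:, max(0, end - ctx_len):end]
        logits = model(window[:, :-1])    # (1, window_len - 1, vocab)
        targets = window[:, 1:]
        new = end - prev_end              # fresh targets in this window
        total_nll += F.cross_entropy(logits[0, -new:], targets[0, -new:], reduction="sum").item()
        prev_end = end
        end = min(end + stride, total)
    scored_tokens = total - 1
    # bits-per-byte: convert summed NLL from nats to bits, normalize by byte count.
    return total_nll / math.log(2) / (scored_tokens * bytes_per_token)
```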
Regularization
LN scale
parameters: null

Novel Contributions

  • GPTQ reserve optimization that reduced calibration reserve from 14s to 9s, recovering additional training steps
  • Experimental forward-only fused Triton MLP kernel registered via triton_op, with the backward pass left to standard PyTorch (see the sketch after this list)
  • Centralized activation parameter handling via a shared negative slope constant
  • Turbo-Muon + EngramLite + ParamBanking combined submission built on prior PR #1089
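
On the Triton bullet above: the pattern of a Triton forward registered through torch.library.triton_op, with the backward expressed in ordinary PyTorch ops, can be sketched on the squared-LeakyReLU activation. The PR's kernel fuses the MLP; the op name, block size, and choice of activation here are illustrative assumptions, not the PR's kernel.

```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

NEG_SLOPE = 0.3  # shared negative-slope constant

@triton.jit
def _lrelu_sq_kernel(x_ptr, y_ptr, n_elements, slope, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    lx = tl.where(x > 0, x, x * slope)          # LeakyReLU(slope)
    tl.store(y_ptr + offs, lx * lx, mask=mask)  # ... squared

@triton_op("speedrun::lrelu_sq", mutates_args={})
def lrelu_sq(x: torch.Tensor) -> torch.Tensor:
    x = x.contiguous()
    y = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    wrap_triton(_lrelu_sq_kernel)[grid](x, y, n, NEG_SLOPE, BLOCK=1024)
    return y

# Backward stays in standard PyTorch: d/dx (lrelu(x)^2) = 2 * lrelu(x) * lrelu'(x).
def _setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def _backward(ctx, grad_out):
    (x,) = ctx.saved_tensors
    lx = torch.nn.functional.leaky_relu(x, NEG_SLOPE)
    slope_grad = torch.where(x > 0, torch.ones_like(x), torch.full_like(x, NEG_SLOPE))
    return grad_out * 2.0 * lx * slope_grad

torch.library.register_autograd("speedrun::lrelu_sq", _backward, setup_context=_setup_context)
```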