PR #1169 (Open)
Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean)
by Bortlesboat
val_bpb: 1.1126
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Turbo-Muon":true,"AOL preconditioning":true,"Polar Express coefficients":true,"row_col normalization":true,"newton_schulz_iterations":4}
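The core of a Muon-style update is orthogonalizing each gradient matrix with a few Newton–Schulz iterations. A minimal numpy sketch follows, using the standard Muon quintic coefficients and the PR's 4 iterations; the PR's "Polar Express" coefficient schedule, AOL preconditioning, and row/col normalization are not reproduced here.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 4) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix, as in Muon's update step.

    Coefficients are the standard Muon quintic ones (an assumption; this PR
    swaps in its own "Polar Express" schedule).
    """
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x
```

After 4 steps the singular values land in a band around 1 rather than exactly at 1, which is good enough for the optimizer's purposes.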
Architecture
BigramHash
Hash embeddings for bigrams
parameters: {"heads":2,"buckets":8192}
TrigramHash
Hash embeddings for trigrams
parameters: {"heads":2,"buckets":8192}
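Hashed n-gram embeddings bucket each bigram/trigram into a fixed-size table via a hash, with multiple independent hash heads to soften collisions. A sketch of the bigram case with the PR's 2 heads and 8192 buckets; the embedding dimension, hash function, and padding convention are illustrative assumptions.

```python
import numpy as np

HEADS, BUCKETS, DIM = 2, 8192, 64  # heads/buckets from the PR; DIM is assumed

# one embedding table per hash head (hypothetical initialization)
rng = np.random.default_rng(0)
tables = rng.standard_normal((HEADS, BUCKETS, DIM)) * 0.02

def bigram_bucket(prev_tok: int, tok: int, head: int) -> int:
    # cheap multiplicative hash; the PR's exact hash is not specified
    return ((prev_tok * 1000003 + tok) * (2654435761 + head * 40503)) % BUCKETS

def bigram_embedding(tokens: list[int]) -> np.ndarray:
    # position 0 has no predecessor; use a pad token of 0 (assumption)
    out = np.zeros((len(tokens), DIM))
    prev = 0
    for i, tok in enumerate(tokens):
        for h in range(HEADS):
            out[i] += tables[h, bigram_bucket(prev, tok, h)]
        prev = tok
    return out
```

The trigram variant is identical except the hash key covers the two preceding tokens.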
GQA
Grouped query attention
parameters: {"heads":8,"kv_heads":4}
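With 8 query heads over 4 kv heads, each kv head is shared by a group of 2 query heads. A minimal single-sequence sketch (no causal mask, no batching):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q has n_heads, k/v have n_kv_heads.

    Shapes: q (T, n_heads, d), k/v (T, n_kv_heads, d). Each group of
    n_heads // n_kv_heads query heads shares one kv head.
    """
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)  # broadcast kv heads -> (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)
```

Halving the kv heads halves the KV-cache size while keeping the full set of query heads.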
LeakyReLU
MLP uses LeakyReLU(0.3) squared activation
parameters: {"negative_slope":0.3}
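Read literally, the activation is a LeakyReLU with slope 0.3 followed by squaring. A sketch of that literal reading; whether the actual kernel preserves the sign of the negative branch after squaring is not specified in the PR, so this is an assumption.

```python
import numpy as np

NEG_SLOPE = 0.3  # centralized constant, per the PR's contribution list

def leaky_relu_squared(x: np.ndarray) -> np.ndarray:
    """LeakyReLU(0.3) then elementwise square (literal reading; the negative
    branch therefore comes out non-negative)."""
    y = np.where(x > 0, x, NEG_SLOPE * x)
    return y * y
```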
U-Net skip connections
Sigmoid-gated skip connections in a U-Net style
parameters: null
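A sigmoid-gated U-Net skip blends an earlier layer's activation into a later one with a learned gate. A sketch assuming a scalar gate per connection (the PR does not state the gate's granularity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(later_x: np.ndarray, earlier_x: np.ndarray,
               gate_logit: float) -> np.ndarray:
    """U-Net-style skip: the earlier activation is mixed in with a learned
    weight sigmoid(gate_logit), so the gate starts near 0.5 at logit 0."""
    return later_x + sigmoid(gate_logit) * earlier_x
```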
RoPE
Partial rotary positional embeddings
parameters: {"partial_dim":16}
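Partial RoPE rotates only the first `partial_dim` channels of each head vector and leaves the rest position-independent. A per-head sketch with the PR's partial_dim of 16; the frequency base of 10000 is the conventional default, assumed here.

```python
import numpy as np

def partial_rope(x: np.ndarray, partial_dim: int = 16, base: float = 10000.0):
    """Apply rotary embeddings to the first partial_dim channels of x (T, d).

    Channel pairs (2i, 2i+1) inside the rotary slice are rotated by
    position-dependent angles; channels >= partial_dim pass through.
    """
    T, d = x.shape
    half = partial_dim // 2
    pos = np.arange(T)[:, None]                # (T, 1)
    freqs = base ** (-np.arange(half) / half)  # (half,)
    ang = pos * freqs                          # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0:partial_dim:2], x[:, 1:partial_dim:2]
    out = x.copy()
    out[:, 0:partial_dim:2] = x1 * cos - x2 * sin
    out[:, 1:partial_dim:2] = x1 * sin + x2 * cos
    return out
```

Since each pair is a pure rotation, the norm of the rotary slice is preserved at every position.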
SmearGate
SmearGate component
parameters: null
ValueEmbedding
ValueEmbedding used in later layers
parameters: {"layers":[9,10]}
Weight Averaging
SWA
parameters: {"threshold":0.2,"every":50,"snapshots":14}
EMA
parameters: {"decay":0.997,"fallback":true}
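The EMA track keeps a shadow copy of the weights updated as `ema <- decay*ema + (1-decay)*param` with decay 0.997. A minimal sketch; the PR's `fallback: true` flag (presumably reverting to the raw or SWA weights when the EMA track underperforms) is not modeled here.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """One exponential-moving-average step over a list of parameter arrays."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```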
Quantization
GPTQ
bits: null
scope: mixed int5 base with selective int6/int7 promotion
Evaluation
sliding window eval
parameters: {"stride":64}
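Sliding-window evaluation scores overlapping context windows but only counts the last `stride` positions of each window after the first, so most tokens are evaluated with near-full context. A sketch with the PR's stride of 64; the window length and the `nll_fn` interface are assumptions.

```python
import numpy as np

def sliding_window_nll(nll_fn, tokens, window=256, stride=64):
    """Mean per-token loss under a sliding context window.

    nll_fn(ctx) returns per-position negative log-likelihoods for ctx.
    Only the last `stride` positions of each window after the first are
    counted, to avoid double counting overlapping positions.
    """
    losses = []
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        nll = nll_fn(tokens[start:start + window])
        keep = nll if start == 0 else nll[-stride:]
        losses.extend(keep)
    return float(np.mean(losses))
```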
Regularization
LN scale
parameters: null
Novel Contributions
- GPTQ reserve optimization that reduced calibration reserve from 14s to 9s, recovering additional training steps
- Experimental fused Triton MLP forward kernel registered via triton_op, paired with the standard PyTorch backward pass
- Centralized activation parameter handling via a shared negative slope constant
- Turbo-Muon + EngramLite + ParamBanking combined submission built on prior PR #1089