PR #915

open

Non-record: Fused Softcap+CE Megakernel (1.94x vs torch.compile) + N-gram Backoff

by anthony-maio
val_bpb
0.9642
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.95 MB

Training Techniques

Architecture
LeakyReLU
Squared LeakyReLU activation (power 2) used in the model's MLP blocks.
parameters: {"power":2}
VRL
Value Residual Learning added to the architecture.
parameters: null
VE128
Value embedding / value expansion component with 128 dimensions.
parameters: {"dimensions":128}
SmearGate
SmearGate module included in the model.
parameters: null
BigramHash
Bigram hash feature using 2048 hashed buckets.
parameters: {"buckets":2048}
XSA
XSA attention variant used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
U-Net skip connections
U-Net style skip connections included in the network.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP3x
Three-layer MLP stack.
parameters: {"layers":3}
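The GQA entry lists 8 query heads sharing 4 KV heads, so each pair of query heads reads the same KV head. A minimal sketch of that head mapping (illustrative only, not the PR's code):

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Grouped-query attention head mapping: query head h uses KV head
    h // group_size, where group_size = n_heads // n_kv_heads (8 // 4 = 2 here)."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size
```

With these settings the KV cache holds half as many heads as a full multi-head layout, at the cost of pairs of query heads sharing keys and values.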
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
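The EMA entry's decay of 0.997 corresponds to the standard exponential-moving-average update over model weights; a minimal sketch (the "Tight SWA" variant's details are not given here, so only plain EMA is shown):

```python
def ema_update(avg: list[float], current: list[float], decay: float = 0.997) -> list[float]:
    """One EMA step over flattened weights: avg <- decay * avg + (1 - decay) * current.
    With decay 0.997, each step moves the average 0.3% toward the current weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, current)]
```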
Quantization
GPTQ-lite
bits: 6
scope: model weights
late QAT
bits: null
scope: model
STE QAT
bits: null
scope: model
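GPTQ-lite's internals are not spelled out in this listing; as a rough illustration of what 6-bit weight quantization means, here is a generic symmetric round-to-nearest quantize/dequantize sketch (not GPTQ itself, which additionally corrects rounding error using second-order information):

```python
def quantize_dequantize(weights: list[float], bits: int = 6) -> list[float]:
    """Symmetric round-to-nearest quantization to `bits` bits and back.
    Generic sketch only; GPTQ-style methods further reduce the error
    introduced by rounding each weight."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```

At 6 bits the worst-case per-weight error of this scheme is half a quantization step, i.e. about max|w| / 62.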
Regularization
LN scale
parameters: null
logit softcap
parameters: {"scale":30}
Evaluation
sliding window eval
parameters: null
Other
other
Entropy-adaptive multi-order n-gram backoff cache mixed with neural predictions during evaluation.
parameters: {"orders":"2-7","alpha_formula":"0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))"}
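The listed alpha formula can be sketched as follows, assuming H is the Shannon entropy (in nats) of the neural next-token distribution and that the mixing is linear in probability space as described under Novel Contributions; function names are illustrative:

```python
import math

def mix_weight(H: float) -> float:
    """Entropy-adaptive mixing weight from the listed formula:
    alpha = 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0)).
    High-entropy (uncertain) neural predictions lean more on the n-gram cache."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (H - 4.0)))
    return 0.05 + 0.55 * sigmoid

def mix_probs(neural: list[float], ngram: list[float]) -> list[float]:
    """Linear probability-space mixing of neural and n-gram predictions."""
    H = -sum(p * math.log(p) for p in neural if p > 0.0)  # entropy of neural dist
    a = mix_weight(H)
    return [(1.0 - a) * pn + a * pg for pn, pg in zip(neural, ngram)]
```

The weight stays near the 0.05 floor when the model is confident (low entropy) and saturates near 0.60 when entropy is well above 4 nats.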
other
Fused softcap plus cross-entropy CUDA megakernel for faster evaluation.
parameters: {"speedup_vs_torch_compile":1.94}
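The CUDA kernel itself is not reproduced in this listing, but the computation it fuses is straightforward: softcap the logits with scale * tanh(logit / scale) at the listed scale of 30, then compute cross-entropy against the target in the same pass. A plain-Python reference sketch of that fused math:

```python
import math

def softcap_cross_entropy(logits: list[float], target: int, cap: float = 30.0) -> float:
    """Reference for the fused computation: logit softcap (scale 30)
    followed by cross-entropy loss, in a single pass over the logits."""
    capped = [cap * math.tanh(x / cap) for x in logits]
    m = max(capped)                              # max subtraction for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in capped))
    return log_z - capped[target]                # -log softmax(capped)[target]
```

Fusing the two steps avoids materializing the capped logits in memory, which is where the reported 1.94x over torch.compile plausibly comes from.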
Compression
lzma
level: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit

Novel Contributions

  • Fused softcap + cross-entropy CUDA megakernel
  • Entropy-adaptive multi-order n-gram backoff cache
  • Score-first causal n-gram updating during evaluation
  • Linear probability-space mixing of neural and n-gram predictions
  • Integration of the fused kernel into sliding window evaluation