PR #915 (open)
Non-record: Fused Softcap+CE Megakernel (1.94x vs torch.compile) + N-gram Backoff
by anthony-maio
val_bpb
0.9642
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.95 MB
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the MLP blocks.
parameters: {"power":2}
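A minimal sketch of a squared-LeakyReLU activation with power 2, as the parameters list. The negative slope value and the choice to square the raw LeakyReLU output (rather than a sign-preserving variant) are assumptions not stated in the card:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Squared LeakyReLU: apply LeakyReLU, then raise the result to
    power 2 (per the card's {"power": 2}). negative_slope=0.01 is an
    assumed default; plain squaring of the negative branch is also an
    assumption (a sign-preserving square is another common variant)."""
    y = x if x > 0 else negative_slope * x
    return y * y
```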
VRL
Value Residual Learning added to the architecture.
parameters: null
VE128
Value embedding / value expansion component with dimension 128.
parameters: {"dimensions":128}
SmearGate
SmearGate module included in the model.
parameters: null
BigramHash
Bigram hash feature with hashed buckets.
parameters: {"buckets":2048}
XSA
XSA attention variant used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
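A sketch of partial rotary embeddings, rotating only the first 16 of 64 head dimensions as "16/64" suggests; the rotation pairing, base frequency, and which dimensions are rotated are assumptions:

```python
import math

def partial_rope(q, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` entries
    of a per-head vector q; remaining dims pass through unchanged.
    Pairs dimension i with i + rot_dims//2 (one common convention)."""
    out = list(q)
    half = rot_dims // 2
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = q[i], q[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out
```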
U-Net skip connections
U-Net style skip connections included in the network.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
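With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A sketch of the usual contiguous head-to-group mapping (the grouping convention itself is an assumption):

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: map a query head index to the KV head
    it shares. Heads are grouped contiguously, with group size
    n_heads // n_kv_heads (here 8 // 4 = 2)."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```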
MLP3x
Three-layer MLP stack.
parameters: {"layers":3}
Weight Averaging
EMA
parameters: {"decay":0.997}
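One EMA weight-averaging step with the listed decay of 0.997, as a minimal sketch (elementwise over flattened parameters; when and how often the average is updated is not stated in the card):

```python
def ema_update(avg, new, decay=0.997):
    """One exponential-moving-average step over parameter values:
    avg <- decay * avg + (1 - decay) * new, with decay = 0.997
    taken from the card's parameters."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]
```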
Tight SWA
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: model weights
late QAT
bits: null
scope: model
STE QAT
bits: null
scope: model
Regularization
LN scale
parameters: null
logit softcap
parameters: {"scale":30}
Evaluation
sliding window eval
parameters: null
Other
other
Entropy-adaptive multi-order n-gram backoff cache mixed with neural predictions during evaluation.
parameters: {"orders":"2-7","alpha_formula":"0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))"}
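The mixing weight follows directly from the listed alpha formula; the linear probability-space mix below assumes alpha scales the n-gram side (so higher neural entropy H leans harder on the n-gram cache), which the card does not state explicitly:

```python
import math

def mix_weight(H):
    """Entropy-adaptive weight from the PR's stated formula:
    alpha = 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0)).
    Ranges from ~0.05 (low entropy) toward 0.60 (high entropy)."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(p_neural, p_ngram, H):
    """Linear probability-space mix of neural and n-gram distributions.
    Weighting the n-gram side by alpha is an assumption."""
    a = mix_weight(H)
    return [(1.0 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
```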
other
Fused softcap plus cross-entropy CUDA megakernel for faster evaluation.
parameters: {"speedup_vs_torch_compile":1.94}
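A numerically stable reference for what the fused kernel computes in a single pass: tanh-softcap each logit, then take the cross-entropy of the target class. The CUDA fusion avoids materializing the capped-logit tensor between the two steps; this Python sketch only mirrors the math, not the kernel:

```python
import math

def softcap_cross_entropy(logits, target, cap=30.0):
    """Reference (unfused) computation: softcap logits, then
    cross-entropy = logsumexp(capped) - capped[target], using the
    max-subtraction trick for numerical stability."""
    capped = [cap * math.tanh(z / cap) for z in logits]
    m = max(capped)
    lse = m + math.log(sum(math.exp(c - m) for c in capped))
    return lse - capped[target]
```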
Compression
lzma
level: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit
Novel Contributions
- Fused softcap + cross-entropy CUDA megakernel
- Entropy-adaptive multi-order n-gram backoff cache
- Score-first causal n-gram updating during evaluation
- Linear probability-space mixing of neural and n-gram predictions
- Integration of the fused kernel into sliding window evaluation