PR #915

open

Non-record: Fused Softcap+CE Megakernel (1.94x vs torch.compile) + N-gram Backoff

by anthony-maio
val_bpb
0.9642
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.95 MB

Training Techniques

Architecture
LeakyReLU
Squared LeakyReLU activation (power 2) used in the model's MLP blocks.
parameters: {"power":2}
VRL
Value Residual Learning added to the architecture.
parameters: null
VE128
Value embedding / value expansion component with 128 dimensions.
parameters: {"dimensions":128}
SmearGate
SmearGate module included in the model.
parameters: null
BigramHash
Bigram hash feature using 2048 hashed buckets.
parameters: {"buckets":2048}
XSA
XSA attention variant used in the architecture.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
U-Net skip connections
U-Net style skip connections included in the network.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
MLP3x
Three-layer MLP stack.
parameters: {"layers":3}
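The GQA entry lists 8 query heads sharing 4 KV heads, so each pair of query heads reads the same KV head. A minimal sketch of that head mapping (illustrative only, not the PR's code):

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Grouped-query attention head mapping: query head h uses KV head
    h // group_size, where group_size = n_heads // n_kv_heads (8 // 4 = 2 here)."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size
```

With these settings the KV cache holds half as many heads as a full multi-head layout, at the cost of pairs of query heads sharing keys and values.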
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
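The EMA entry's decay of 0.997 corresponds to the standard exponential-moving-average update over model weights; a minimal sketch (the "Tight SWA" variant's details are not given here, so only plain EMA is shown):

```python
def ema_update(avg: list[float], current: list[float], decay: float = 0.997) -> list[float]:
    """One EMA step over flattened weights: avg <- decay * avg + (1 - decay) * current.
    With decay 0.997, each step moves the average 0.3% toward the current weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, current)]
```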
Quantization
GPTQ-lite
bits: 6
scope: model weights
late QAT
bits: null
scope: model
STE QAT
bits: null
scope: model
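GPTQ-lite's internals are not spelled out in this listing; as a rough illustration of what 6-bit weight quantization means, here is a generic symmetric round-to-nearest quantize/dequantize sketch (not GPTQ itself, which additionally corrects rounding error using second-order information):

```python
def quantize_dequantize(weights: list[float], bits: int = 6) -> list[float]:
    """Symmetric round-to-nearest quantization to `bits` bits and back.
    Generic sketch only; GPTQ-style methods further reduce the error
    introduced by rounding each weight."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```

At 6 bits the worst-case per-weight error of this scheme is half a quantization step, i.e. about max|w| / 62.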
Regularization
LN scale
parameters: null
logit softcap
parameters: {"scale":30}
Evaluation
sliding window eval
parameters: null
Other
other
Entropy-adaptive multi-order n-gram backoff cache mixed with neural predictions during evaluation.
parameters: {"orders":"2-7","alpha_formula":"0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))"}
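The listed alpha formula can be sketched as follows, assuming H is the Shannon entropy (in nats) of the neural next-token distribution and that the mixing is linear in probability space as described under Novel Contributions; function names are illustrative:

```python
import math

def mix_weight(H: float) -> float:
    """Entropy-adaptive mixing weight from the listed formula:
    alpha = 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0)).
    High-entropy (uncertain) neural predictions lean more on the n-gram cache."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (H - 4.0)))
    return 0.05 + 0.55 * sigmoid

def mix_probs(neural: list[float], ngram: list[float]) -> list[float]:
    """Linear probability-space mixing of neural and n-gram predictions."""
    H = -sum(p * math.log(p) for p in neural if p > 0.0)  # entropy of neural dist
    a = mix_weight(H)
    return [(1.0 - a) * pn + a * pg for pn, pg in zip(neural, ngram)]
```

The weight stays near the 0.05 floor when the model is confident (low entropy) and saturates near 0.60 when entropy is well above 4 nats.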
other
Fused softcap plus cross-entropy CUDA megakernel for faster evaluation.
parameters: {"speedup_vs_torch_compile":1.94}
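The CUDA kernel itself is not reproduced in this listing, but the computation it fuses is straightforward: softcap the logits with scale * tanh(logit / scale) at the listed scale of 30, then compute cross-entropy against the target in the same pass. A plain-Python reference sketch of that fused math:

```python
import math

def softcap_cross_entropy(logits: list[float], target: int, cap: float = 30.0) -> float:
    """Reference for the fused computation: logit softcap (scale 30)
    followed by cross-entropy loss, in a single pass over the logits."""
    capped = [cap * math.tanh(x / cap) for x in logits]
    m = max(capped)                              # max subtraction for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in capped))
    return log_z - capped[target]                # -log softmax(capped)[target]
```

Fusing the two steps avoids materializing the capped logits in memory, which is where the reported 1.94x over torch.compile plausibly comes from.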
Compression
lzma
level: null
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit

Novel Contributions

  • Fused softcap + cross-entropy CUDA megakernel
  • Entropy-adaptive multi-order n-gram backoff cache
  • Score-first causal n-gram updating during evaluation
  • Linear probability-space mixing of neural and n-gram predictions
  • Integration of the fused kernel into sliding window evaluation