PR #1135

open

Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)

by barneywohl
val_bpb
1.1116
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
16,000,000 bytes

Training Techniques

Architecture
LeakyReLU
Fused Triton kernel for leaky_relu(x, 0.5).square() in the MLP
parameters: {"activation":"leaky_relu(x, 0.5).square()"}
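As a reference for what the fused kernel computes (the Triton version fuses the elementwise activation and the square into one pass over the MLP hidden states), a plain-Python sketch of the scalar math:

```python
def leaky_relu_squared(x, slope=0.5):
    """Elementwise math of the fused activation: leaky_relu(x, slope) ** 2.
    The submission computes this in a single Triton kernel; this scalar
    reference just states what that kernel evaluates per element."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that squaring makes the negative branch positive: `leaky_relu_squared(-2.0)` is `(0.5 * -2.0) ** 2 = 1.0`.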
XSA
Exclusive self-attention applied to all layers
parameters: {"layers":11}
BigramHash
Enlarged bigram feature embedding/projection
parameters: {"vocab_size":2816,"dimensions":112}
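A minimal sketch of a hashed bigram feature table of this shape. Only the 2816×112 dimensions come from the record; the mixing constant in `bigram_hash`, the pad token, and the table initialization are illustrative assumptions:

```python
import numpy as np

VOCAB, DIM = 2816, 112  # from the record: {"vocab_size": 2816, "dimensions": 112}

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((VOCAB, DIM)).astype(np.float32)  # illustrative init

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing; the submission's actual hash function is not given.
    return ((prev_tok * 1000003) ^ tok) % VOCAB

def bigram_features(tokens):
    """One DIM-dim feature per position; position 0 pairs with a pad token 0."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]
```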
Quantization
GPTQ
bits: 6
scope: all
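A simplified sketch of full-Hessian GPTQ with Cholesky error compensation, assuming the standard formulation: build the layer Hessian from calibration inputs, quantize columns left to right, and push each column's quantization error onto the not-yet-quantized columns via the upper Cholesky factor of H⁻¹. The act-order permutation, blocked updates, and clip sweep from the contributions list are omitted, and the per-tensor symmetric scale is an assumption:

```python
import numpy as np

def gptq_quantize(W, X, bits=6, damp_frac=0.01):
    """Sketch of full-Hessian GPTQ with Cholesky error compensation.

    W: (rows, cols) layer weight; X: (cols, n_samples) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = 2.0 * X @ X.T                                     # full Hessian of the layer loss
    H += damp_frac * np.mean(np.diag(H)) * np.eye(cols)   # dampening for invertibility
    U = np.linalg.cholesky(np.linalg.inv(H)).T            # upper Cholesky factor of H^-1

    qmax = 2 ** (bits - 1) - 1                            # 6 bits -> grid [-32, 31]
    scale = np.abs(W).max() / qmax + 1e-12                # per-tensor symmetric (assumption)

    for i in range(cols):
        q = np.clip(np.round(W[:, i] / scale), -qmax - 1, qmax) * scale
        err = (W[:, i] - q) / U[i, i]                     # normalized quantization error
        W[:, i] = q
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])       # compensate remaining columns
    return W
```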
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"with_adamw_embeddings":true}
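Parallel Muon presumably follows the standard Muon recipe (heavy-ball momentum followed by Newton-Schulz orthogonalization of the update, with the work sharded across ranks). A single-process sketch using the reference Newton-Schulz coefficients; the cross-rank parallelism and the AdamW path for embeddings (`with_adamw_embeddings`) are omitted:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with the odd quintic Newton-Schulz
    iteration (coefficients from the Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # scale so singular values are <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One single-process Muon update for a 2-D weight matrix."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orth(buf), buf
```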
Weight Averaging
EMA
parameters: {"decay":0.997}
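The EMA update itself is one line; only the decay of 0.997 is given by the record, and evaluating/exporting the averaged copy (rather than the live weights) is the usual convention:

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of the weights ({"decay": 0.997})."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```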
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
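Only the warmdown length (3500 steps) is given; the flat-then-linear-decay shape below is the common convention for this schedule and is an assumption here:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the final
    warmdown_steps ({"warmdown_steps": 3500} in the record)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```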
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
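A common way to implement stride-64 sliding-window eval at seq_len 2048 is to score only the last `stride` tokens of each window, so every token after the first window is predicted with near-full context. A sketch of the window bookkeeping, assuming that convention:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, n_scored) per window: each window sees up to
    seq_len tokens of context but only its trailing tokens are newly
    scored (record: {"stride": 64, "seq_len": 2048})."""
    end = min(seq_len, n_tokens)
    windows = [(0, end, end)]          # first window scores everything it sees
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, end - prev_end))
    return windows
```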
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Compression
lzma
level: null
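Packing the artifact could look like the following: concatenate the raw weight bytes and lzma-compress them. The record leaves the level unspecified (`level: null`), so the default preset is used; the serialization layout is an assumption:

```python
import lzma
import numpy as np

def pack_artifact(arrays):
    """Concatenate raw weight bytes and lzma-compress them for the
    submitted artifact (Compression: lzma, level unspecified)."""
    raw = b"".join(np.ascontiguousarray(a).tobytes() for a in arrays)
    return lzma.compress(raw)
```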

Novel Contributions

  • Fused Triton MLP kernel for leaky_relu(x, 0.5).square()
  • Full-Hessian GPTQ with Cholesky error compensation, act-order, and clip sweep
  • Coprime-stride multi-shard data loader with memmap and diversity-weighted shard sampling
  • XSA applied to all 11 layers
  • Enlarged BigramHash(2816×112)
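The coprime-stride idea from the loader bullet can be sketched in a few lines: stepping through a shard with a stride coprime to its length visits every index exactly once before repeating, giving a cheap fixed-memory shuffle. The memmap-backed shards and the diversity-weighted shard sampling are not sketched here:

```python
from math import gcd

def coprime_order(n_samples, stride):
    """Index order for one epoch over a shard: a stride coprime to
    n_samples generates a full permutation of the indices."""
    assert gcd(n_samples, stride) == 1, "stride must be coprime to n_samples"
    return [(i * stride) % n_samples for i in range(n_samples)]
```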