PR #1135

open

Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)

by barneywohl
val_bpb
1.1116
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
16,000,000 bytes

Training Techniques

Architecture
LeakyReLU
Fused Triton kernel for leaky_relu(x, 0.5).square() in the MLP
parameters: {"activation":"leaky_relu(x, 0.5).square()"}
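As a reference for what the fused kernel computes (the Triton version fuses the elementwise activation and the square into one pass over the MLP hidden states), a plain-Python sketch of the scalar math:

```python
def leaky_relu_squared(x, slope=0.5):
    """Elementwise math of the fused activation: leaky_relu(x, slope) ** 2.
    The submission computes this in a single Triton kernel; this scalar
    reference just states what that kernel evaluates per element."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that squaring makes the negative branch positive: `leaky_relu_squared(-2.0)` is `(0.5 * -2.0) ** 2 = 1.0`.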
XSA
Exclusive self-attention applied to all layers
parameters: {"layers":11}
BigramHash
Enlarged bigram feature embedding/projection
parameters: {"vocab_size":2816,"dimensions":112}
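A minimal sketch of a hashed bigram feature table of this shape. Only the 2816×112 dimensions come from the record; the mixing constant in `bigram_hash`, the pad token, and the table initialization are illustrative assumptions:

```python
import numpy as np

VOCAB, DIM = 2816, 112  # from the record: {"vocab_size": 2816, "dimensions": 112}

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((VOCAB, DIM)).astype(np.float32)  # illustrative init

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Hypothetical mixing; the submission's actual hash function is not given.
    return ((prev_tok * 1000003) ^ tok) % VOCAB

def bigram_features(tokens):
    """One DIM-dim feature per position; position 0 pairs with a pad token 0."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]
```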
Quantization
GPTQ
bits: 6
scope: all
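A simplified sketch of full-Hessian GPTQ with Cholesky error compensation, assuming the standard formulation: build the layer Hessian from calibration inputs, quantize columns left to right, and push each column's quantization error onto the not-yet-quantized columns via the upper Cholesky factor of H⁻¹. The act-order permutation, blocked updates, and clip sweep from the contributions list are omitted, and the per-tensor symmetric scale is an assumption:

```python
import numpy as np

def gptq_quantize(W, X, bits=6, damp_frac=0.01):
    """Sketch of full-Hessian GPTQ with Cholesky error compensation.

    W: (rows, cols) layer weight; X: (cols, n_samples) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    H = 2.0 * X @ X.T                                     # full Hessian of the layer loss
    H += damp_frac * np.mean(np.diag(H)) * np.eye(cols)   # dampening for invertibility
    U = np.linalg.cholesky(np.linalg.inv(H)).T            # upper Cholesky factor of H^-1

    qmax = 2 ** (bits - 1) - 1                            # 6 bits -> grid [-32, 31]
    scale = np.abs(W).max() / qmax + 1e-12                # per-tensor symmetric (assumption)

    for i in range(cols):
        q = np.clip(np.round(W[:, i] / scale), -qmax - 1, qmax) * scale
        err = (W[:, i] - q) / U[i, i]                     # normalized quantization error
        W[:, i] = q
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])       # compensate remaining columns
    return W
```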
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"with_adamw_embeddings":true}
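Parallel Muon presumably follows the standard Muon recipe (heavy-ball momentum followed by Newton-Schulz orthogonalization of the update, with the work sharded across ranks). A single-process sketch using the reference Newton-Schulz coefficients; the cross-rank parallelism and the AdamW path for embeddings (`with_adamw_embeddings`) are omitted:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with the odd quintic Newton-Schulz
    iteration (coefficients from the Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # scale so singular values are <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One single-process Muon update for a 2-D weight matrix."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orth(buf), buf
```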
Weight Averaging
EMA
parameters: {"decay":0.997}
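The EMA update itself is one line; only the decay of 0.997 is given by the record, and evaluating/exporting the averaged copy (rather than the live weights) is the usual convention:

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of the weights ({"decay": 0.997})."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```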
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
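Only the warmdown length (3500 steps) is given; the flat-then-linear-decay shape below is the common convention for this schedule and is an assumption here:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR, then linear decay to zero over the final
    warmdown_steps ({"warmdown_steps": 3500} in the record)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```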
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
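A common way to implement stride-64 sliding-window eval at seq_len 2048 is to score only the last `stride` tokens of each window, so every token after the first window is predicted with near-full context. A sketch of the window bookkeeping, assuming that convention:

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, n_scored) per window: each window sees up to
    seq_len tokens of context but only its trailing tokens are newly
    scored (record: {"stride": 64, "seq_len": 2048})."""
    end = min(seq_len, n_tokens)
    windows = [(0, end, end)]          # first window scores everything it sees
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, end - prev_end))
    return windows
```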
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Compression
lzma
level: null
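Packing the artifact could look like the following: concatenate the raw weight bytes and lzma-compress them. The record leaves the level unspecified (`level: null`), so the default preset is used; the serialization layout is an assumption:

```python
import lzma
import numpy as np

def pack_artifact(arrays):
    """Concatenate raw weight bytes and lzma-compress them for the
    submitted artifact (Compression: lzma, level unspecified)."""
    raw = b"".join(np.ascontiguousarray(a).tobytes() for a in arrays)
    return lzma.compress(raw)
```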

Novel Contributions

  • Fused Triton MLP kernel for leaky_relu(x, 0.5).square()
  • Full-Hessian GPTQ with Cholesky error compensation, act-order, and clip sweep
  • Coprime-stride multi-shard data loader with memmap and diversity-weighted shard sampling
  • XSA applied to all 11 layers
  • Enlarged BigramHash(2816×112)
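The coprime-stride idea from the loader bullet can be sketched in a few lines: stepping through a shard with a stride coprime to its length visits every index exactly once before repeating, giving a cheap fixed-memory shuffle. The memmap-backed shards and the diversity-weighted shard sampling are not sketched here:

```python
from math import gcd

def coprime_order(n_samples, stride):
    """Index order for one epoch over a shard: a stride coprime to
    n_samples generates a full permutation of the indices."""
    assert gcd(n_samples, stride) == 1, "stride must be coprime to n_samples"
    return [(i * stride) % n_samples for i in range(n_samples)]
```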