PR #1122
openRecord: EngramLite + Gated Skips + Full GPTQ + FA3 — val_bpb 1.1146 (1-seed, 2 pending)
by icryo
val_bpb
1.1146
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.71 MB
Training Techniques
Architecture
EngramLite
Multi-head bigram+trigram hash embeddings with a learned sigmoid gate, replacing BigramHash.
parameters: {"buckets":8192,"heads":2,"orders":2,"dim_per_head":32}
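A minimal sketch of the EngramLite idea, using the listed hyperparameters (buckets=8192, heads=2, orders=2, dim_per_head=32). The multiplicative hash constants, initialization, and scalar gate parameterization are illustrative assumptions — the PR does not spell them out:

```python
import numpy as np

class EngramLite:
    """Multi-head bigram+trigram hash embeddings with a sigmoid gate (sketch)."""
    def __init__(self, buckets=8192, heads=2, dim_per_head=32, seed=0):
        rng = np.random.default_rng(seed)
        self.buckets = buckets
        # one table per (head, order) pair; the two orders are bigram and trigram
        self.tables = [rng.normal(0.0, 0.02, (buckets, dim_per_head))
                       for _ in range(heads * 2)]
        self.gate_logit = 0.0  # learned scalar gate; sigmoid(0) = 0.5 at init

    def _hash(self, ids, salt):
        # cheap multiplicative hash into [0, buckets) -- an assumption
        return (ids * (2654435761 + salt)) % self.buckets

    def __call__(self, tokens):  # tokens: (T,) integer array
        prev1 = np.roll(tokens, 1)
        prev2 = np.roll(tokens, 2)
        bigram = tokens * 31 + prev1
        trigram = bigram * 31 + prev2
        feats = []
        for i, table in enumerate(self.tables):
            ngram = bigram if i % 2 == 0 else trigram
            feats.append(table[self._hash(ngram, i)])
        gate = 1.0 / (1.0 + np.exp(-self.gate_logit))
        return gate * np.concatenate(feats, axis=-1)  # (T, heads*2*dim_per_head)
```

The gated output would be added to the regular token embedding, letting the model learn how much n-gram signal to mix in.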
U-Net skip connections
U-Net skip connections modulated by learned sigmoid gates.
parameters: null
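The gated skip is a one-line change to a plain additive U-Net skip; a sketch, assuming one gate logit per channel initialized at zero (so each gate starts at 0.5):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_skip(decoder_x, encoder_skip, gate_logits):
    """Add the saved encoder activation scaled by a learned per-channel
    sigmoid gate, instead of adding it unmodulated."""
    return decoder_x + sigmoid(gate_logits) * encoder_skip
```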
XSA
XSA applied to all layers.
parameters: {"layers":11}
LeakyReLU
Squared LeakyReLU activation with negative slope 0.3.
parameters: {"negative_slope":0.3,"squared":true}
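Taking the entry at face value as f(x) = LeakyReLU(x)² with slope 0.3 (note the square makes the output nonnegative on both branches):

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.3):
    # LeakyReLU, then square: x^2 for x >= 0, (0.3 x)^2 for x < 0
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```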
Quantization
GPTQ
bits: 6
scope: all weights
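A simplified sketch of the GPTQ column-by-column recipe: quantize each weight column, then push its rounding error onto the not-yet-quantized columns via the inverse Hessian H⁻¹ with H = XᵀX from calibration activations. This follows the published algorithm in spirit only — the PR's full-Hessian/Cholesky error compensation, packing, and 6-bit format details are not reproduced here:

```python
import numpy as np

def quantize_col(w, bits=6):
    # symmetric uniform quantization of one column to 2^bits levels
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def gptq_quantize(W, X, bits=6, damp=0.01):
    """W: (rows, cols) weights; X: (n, cols) calibration activations."""
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampening
    Hinv = np.linalg.inv(H)
    Q = W.copy()
    for i in range(W.shape[1]):
        q = quantize_col(Q[:, i], bits)
        err = (Q[:, i] - q) / Hinv[i, i]
        Q[:, i] = q
        if i + 1 < W.shape[1]:
            # compensate: spread this column's error over remaining columns
            Q[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

Production implementations factor H⁻¹ with a Cholesky decomposition instead of inverting it, for speed and numerical stability.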
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"ns_steps":4}
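Muon orthogonalizes each update matrix with a few Newton-Schulz iterations (ns_steps=4 above). A sketch using the classic cubic iteration; the actual optimizer uses tuned polynomial coefficients and, in the "parallel" variant, shards this work across devices:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=4):
    """Approximately replace G with an orthogonal matrix of the same
    'direction' (U V^T from G's SVD) without computing an SVD."""
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    for _ in range(steps):
        # cubic Newton-Schulz step: pushes every singular value toward 1
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```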
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency_steps":50,"scale_threshold":0.2}
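The two averaging schemes above can be sketched as follows — an exponential moving average with decay 0.997 and an equal-weight average taken every 50 steps. How the PR combines the two averages, and what its scale_threshold=0.2 gates, is not stated here, so only the updates are shown:

```python
class AveragedWeights:
    """Maintain EMA and SWA copies of a parameter dict (floats, for brevity)."""
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = dict(params)
        self.swa = dict(params)
        self.swa_n = 1          # counts the initial weights as the first sample
        self.decay = ema_decay
        self.swa_every = swa_every

    def update(self, step, params):
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if step % self.swa_every == 0:
            for k, v in params.items():
                self.swa[k] = (self.swa[k] * self.swa_n + v) / (self.swa_n + 1)
            self.swa_n += 1
```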
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
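Sliding-window evaluation scores each token with a long left context while only counting each token once: the window advances by the stride (64 in the PR) and only the newly uncovered tokens contribute to the loss. A sketch — the window length and `loss_fn` (a stand-in for the model's per-token loss) are assumptions:

```python
def sliding_window_eval(tokens, loss_fn, window=1024, stride=64):
    """Average per-token loss, counting only the last `stride` tokens of
    each window so every token is scored exactly once."""
    total, count = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        ctx = tokens[max(0, end - window):end]
        losses = loss_fn(ctx)           # one loss per token in ctx
        new = min(stride, len(ctx))     # tokens not covered by prior windows
        total += sum(losses[-new:])
        count += new
    return total / count
```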
LR Schedule
warmdown
parameters: {"lr_floor":0.05}
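A sketch of a warmdown schedule with the listed floor: hold the base LR, then decay linearly over the final stretch of training down to 5% of base. Only lr_floor=0.05 comes from the PR; the length of the warmdown phase is an assumption:

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=30, lr_floor=0.05):
    """Constant LR, then linear decay to base_lr * lr_floor over the last
    `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)
    return base_lr * (1.0 - progress * (1.0 - lr_floor))
```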
Regularization
LN scale
parameters: {"scale":"1/sqrt(l+1)"}
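The listed scale damps deeper layers' residual contributions by 1/sqrt(l+1) for layer index l; where exactly it is applied (block output vs. LN weight init) is not stated, so only the factor itself is shown:

```python
import math

def ln_scale(layer_index):
    # layer 0 -> 1.0, layer 3 -> 0.5, deeper layers contribute less
    return 1.0 / math.sqrt(layer_index + 1)
```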
Other
other
Coprime-stride multi-shard data loader for diverse batches across 80 shards.
parameters: {"shards":80}
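The coprime-stride trick: walk each shard with a stride coprime to its length, so index (i * stride) mod length visits every position exactly once while successive batches jump around the shard. The shard count (80) is from the PR; the stride-selection rule below is an illustrative assumption:

```python
import math

def coprime_stride(length, start=2):
    """Smallest stride >= start that is coprime to the shard length."""
    s = start
    while math.gcd(s, length) != 1:
        s += 1
    return s

def shard_order(length, stride):
    # a full permutation of [0, length) because gcd(stride, length) == 1
    return [(i * stride) % length for i in range(length)]
```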
other
FlashAttention 3 running natively on Hopper hardware.
parameters: null
Novel Contributions
- EngramLite multi-head bigram+trigram hash embeddings
- Sigmoid-gated skip connections on U-Net skips
- Full Hessian GPTQ with Cholesky error compensation
- Coprime-stride multi-shard loader across 80 shards
- XSA applied to all 11 layers
- FlashAttention 3 Hopper-native setup
- Combined stack from prior PR innovations