PR #827
openRecord: LeakyReLU² + XSA4 + LN Scale + Partial RoPE — val_bpb 1.3999
by Programmerryoki
val_bpb
1.3999
Architecture
Transformer
Optimizer
—
Artifact Size
~13.5 MB
Training Techniques
Quantization
GPTQ-lite
bits: 6
scope: all weights
Architecture
XSA
Exclusive self-attention applied to the last 4 layers; each token's own-value contribution is subtracted from the attention output so tokens attend more to context.
parameters: {"layers":4}
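A minimal sketch of the idea, assuming the simplest reading of "subtracts self-value": compute standard causal attention, then remove each token's own attention-weighted value. The exact formulation in the PR may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def xsa(q, k, v):
    """Exclusive self-attention: causal attention minus each token's
    own-value term a_ii * v_i, so the output is driven by context tokens.
    (Assumed form; the record only says "subtracts self-value".)"""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # causal mask: position i may not attend to j > i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    a = softmax(scores, axis=-1)
    out = a @ v
    return out - a.diagonal()[:, None] * v
```

Note that the first token attends only to itself under the causal mask, so its XSA output is exactly zero.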
Partial RoPE
Rotary position encoding applied only to part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
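A sketch of partial RoPE with the record's 16-of-64 split, assuming the rotated dimensions are the leading ones (the record does not say which dimensions are rotated):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Apply rotary position embedding to the first `rot_dims` of the
    head dimension, leaving the remaining dims untouched.
    x: (T, head_dim), pos: (T,) integer positions."""
    half = rot_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    theta = pos[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

The unrotated 48 dimensions pass through unchanged, which leaves the model position-agnostic channels to work with.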
BigramHash
Bigram hashing component used in the model.
parameters: {"buckets":1536}
SmearGate
SmearGate enabled in the architecture.
parameters: null
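The record gives no parameters or description for SmearGate, so this whole form is an assumption: a learned sigmoid gate that "smears" each position with the previous position's activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, w):
    """Blend each position with the previous one via a learned gate:
    out_t = x_t + sigmoid(x_t @ w) * x_{t-1}.
    (Entirely an assumed form; the record only names the component.)"""
    g = sigmoid(x @ w)[:, None]                      # scalar gate per position
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]]) # shift sequence right by one
    return x + g * prev
```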
U-Net Skips
U-Net style skip connections enabled.
parameters: null
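A sketch of U-Net skips over a layer stack, assuming the common additive form (outputs of the first half are saved and added back at the mirrored layers of the second half); the record gives no parameters.

```python
import numpy as np

def unet_forward(x, layers):
    """Run a layer stack with U-Net style skips: layer i < n/2 pushes its
    output, layer i >= n/2 adds back the mirrored skip before running.
    (Additive combination is an assumption.)"""
    n = len(layers)
    half = n // 2
    skips = []
    for i, layer in enumerate(layers):
        if i >= half and skips:
            x = x + skips.pop()   # last-pushed skip mirrors this layer
        x = layer(x)
        if i < half:
            skips.append(x)
    return x
```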
MLP3x
MLP widened to 2× with LeakyReLU(0.5)^2 activation.
parameters: {"multiplier":2}
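A sketch of the widened MLP with the squared-LeakyReLU activation. Squaring LeakyReLU(0.5) literally gives x² on the positive side and (0.5x)² on the negative side, so negative inputs keep a nonzero gradient, unlike relu(x)²; the layer widths here are illustrative.

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU(slope) followed by squaring: x**2 for x >= 0,
    (slope*x)**2 for x < 0, so the negative side keeps gradient
    2*slope**2*x instead of relu(x)**2's flat zero."""
    return np.square(np.where(x >= 0.0, x, slope * x))

def mlp(x, w_in, w_out):
    """2x-wide MLP block (hidden dim = 2 * model dim, per the record's
    multiplier) with the squared-LeakyReLU activation."""
    return sq_leaky_relu(x @ w_in) @ w_out
```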
Weight Averaging
EMA
parameters: {"decay":0.997}
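Weight EMA with the record's decay of 0.997 can be sketched as a shadow copy updated after each step (parameters shown as a dict of numpy arrays for illustration):

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights:
    shadow = decay * shadow + (1 - decay) * current."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

Evaluation then runs with `shadow` in place of the live weights.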
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
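A sketch of the layerwise LN scale: each layer's normalized output is multiplied by 1/sqrt(layer+1), damping deeper layers' contributions. Where exactly the scale is applied is an assumption; the record gives only the formula.

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx + 1),
    so layer 0 is unscaled and layer 3 is halved."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```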
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
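A warmdown schedule with the record's 3500 steps, assuming the usual form (constant LR, then linear decay to zero over the final warmdown window):

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Flat LR, then linear 'warmdown' to zero over the last
    warmdown_steps. (Linear shape is an assumption; the record gives
    only the step count.)"""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```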
Compression
zstd
level: 22
Other
other
LeakyReLU(0.5)^2 activation replacing relu(x)^2 to preserve negative gradient flow and reduce dead neurons.
parameters: null
other
GPTQ-lite clip search over multiple clip percentiles per weight row to minimize reconstruction MSE.
parameters: {"clip_percentiles":[0.9999,0.99995,0.99999,0.999995,1]}
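The clip search loop matches the record's percentile list; the symmetric int6 quantization details around it are assumptions. For each weight row, try each clip percentile, quantize, and keep the clip with the lowest reconstruction MSE:

```python
import numpy as np

def quantize_row(row, bits=6,
                 clip_percentiles=(0.9999, 0.99995, 0.99999, 0.999995, 1.0)):
    """Per-row quantization with a search over clip percentiles,
    keeping the clip that minimizes reconstruction MSE.
    (Symmetric int6 with levels ±31 is an assumed detail.)"""
    levels = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    best = None
    for p in clip_percentiles:
        clip = np.quantile(np.abs(row), p)
        scale = clip / levels if clip > 0 else 1.0
        q = np.clip(np.round(row / scale), -levels, levels)
        mse = np.mean((q * scale - row) ** 2)
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best[1], best[2]
```

Since p = 1 (no clipping) is in the candidate set, the search can never do worse than plain max-abs scaling.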
Novel Contributions
- LeakyReLU(0.5)^2 activation
- Exclusive self-attention (XSA) in the last 4 layers
- Layerwise LN scaling by 1/sqrt(layer+1)
- Partial RoPE using 16 of 64 head dimensions
- GPTQ-lite clip search for quantization
- Int6 QAT with zstd-22 compression