PR #792

open

11L LeakyReLU² + XSA-all + Full GPTQ + 5-gram Backoff (1.0340 BPB)

val_bpb
1.0340
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,903,061 bytes

Training Techniques

Architecture
LeakyReLU²
Uses LeakyReLU(0.5) squared in the MLP in place of ReLU², so negative pre-activations keep a nonzero gradient and gradient flow improves.
parameters: {"negative_slope":0.5}
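A minimal sketch of the activation, assuming the squared LeakyReLU is applied elementwise in the MLP; the function name is illustrative, not from the PR:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    # LeakyReLU keeps a scaled copy of negative inputs, so the squared
    # activation still passes gradient there (plain ReLU^2 zeroes it out).
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```

With negative_slope=0.5, an input of -2.0 maps to (-1.0)² = 1.0 rather than 0.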
XSA
Cross-sequence attention applied to all 11 transformer layers rather than only the last few.
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: all
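The contributions list full Hessian-based GPTQ with actorder and Cholesky error compensation; the sketch below shows only the core per-column error-feedback loop (no activation ordering, no grouping, a dense damped inverse Hessian instead of a Cholesky factor), with illustrative names:

```python
import numpy as np

def gptq_quantize(W: np.ndarray, H: np.ndarray, bits: int = 6) -> np.ndarray:
    """Quantize W (rows = output channels) column by column, folding each
    column's quantization error into the not-yet-quantized columns via the
    inverse Hessian. Simplified GPTQ sketch, not the PR's implementation."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    qmax = 2 ** (bits - 1) - 1                          # signed 6-bit grid: [-32, 31]
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / qmax, 1e-12)
    Hinv = np.linalg.inv(H + 1e-6 * np.eye(d))          # damping for stability
    Q = np.zeros_like(W)
    for j in range(d):
        w = W[:, j]
        q = np.clip(np.round(w / scale[:, 0]), -qmax - 1, qmax) * scale[:, 0]
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]                      # error scaled by local curvature
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate later columns
    return Q
```

With an identity Hessian the compensation term vanishes and this reduces to plain round-to-nearest on a per-row symmetric grid.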
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
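Both averaging schemes reduce to one-line updates. A sketch assuming standard EMA with the listed decay of 0.997 and an equal-weight running mean for SWA (the PR's "Tight SWA" variant is not specified here, so this is the generic form):

```python
def ema_update(avg, param, decay=0.997):
    # Exponential moving average: recent weights dominate, old ones decay.
    return decay * avg + (1.0 - decay) * param

def swa_update(avg, param, n):
    # Stochastic weight averaging: equal-weight running mean after n checkpoints.
    return avg + (param - avg) / (n + 1)
```

Both functions apply unchanged whether `avg`/`param` are scalars or numpy/torch tensors.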
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
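One way to realize a stride-64 sliding-window evaluation at seq_len 2048 is to score each token exactly once while giving it close to the full context window; the span generator below is an illustrative sketch, not the PR's code:

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (start, end, n_scored) spans: each window sees up to seq_len
    tokens of context, but only the final `stride` tokens are newly scored,
    so every token is evaluated once with near-maximal left context."""
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - seq_len)  # context begins here
        end = min(pos + stride, n_tokens)
        yield start, end, end - pos             # score tokens [pos, end)
        pos = end
```

A smaller stride gives more context per scored token at the cost of proportionally more forward passes.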
n-gram backoff
parameters: {"order":5,"backoff_orders":[5,4,3,2],"entropy_adaptive":true}
Test-Time Training
score-first TTT
parameters: {"cache_update_after_scoring":true}
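`cache_update_after_scoring` suggests the protocol scores each token against the cache as it stood *before* that token, then inserts it, so the cache never leaks the answer into its own prediction. A sketch with a toy Laplace-smoothed unigram cache (all names illustrative):

```python
from collections import Counter

class UnigramCache:
    def __init__(self, vocab_size=100):
        self.counts = Counter()
        self.total = 0
        self.vocab_size = vocab_size

    def prob(self, t):
        # Laplace-smoothed probability under the current cache state.
        return (self.counts[t] + 1) / (self.total + self.vocab_size)

    def update(self, t):
        self.counts[t] += 1
        self.total += 1

def score_first(tokens, cache):
    """Score-first protocol: score with the stale cache, then update it."""
    probs = []
    for t in tokens:
        probs.append(cache.prob(t))  # score first ...
        cache.update(t)              # ... fold the token in afterwards
    return probs
```

The alternative ordering (update, then score) would let every token condition on itself and inflate the measured likelihood.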
Other
other
Entropy-adaptive alpha blending between the model's predictions and the n-gram cache distribution.
parameters: {"alpha_low":0.05,"alpha_high":0.4,"entropy_threshold":4}
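A sketch of the blending rule under the simplest reading of the parameters: a hard switch on the model's predictive entropy at the threshold (the actual implementation may interpolate between the two alphas instead):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    # Shannon entropy in bits of a probability distribution.
    return float(-np.sum(p * np.log2(np.maximum(p, 1e-12))))

def adaptive_alpha(h: float, alpha_low=0.05, alpha_high=0.4, threshold=4.0) -> float:
    # Confident model (low entropy): trust the n-gram cache only a little.
    # Uncertain model (high entropy): lean on the cache more heavily.
    return alpha_low if h < threshold else alpha_high

def blend(p_model: np.ndarray, p_ngram: np.ndarray, alpha: float) -> np.ndarray:
    # Linear mixture; stays a valid distribution for alpha in [0, 1].
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

A uniform distribution over 16 tokens has entropy exactly 4 bits, i.e. it sits right at the listed threshold and gets alpha_high.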

Novel Contributions

  • LeakyReLU(0.5)² MLP activation
  • XSA applied to all 11 layers
  • Full Hessian-based GPTQ with actorder and Cholesky error compensation
  • 5-gram multi-order backoff with separate hash tables per order
  • Entropy-adaptive alpha for n-gram/model mixing
  • Score-first n-gram cache protocol