PR #763

open

Record: 11L XSA-all + backoff 7-gram (mean val_bpb=0.9917)

by hypery11
val_bpb
0.9917
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.99 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers
parameters: {"layers":11}
LeakyReLU^2 MLP
MLP uses LeakyReLU(0.5)^2 activation with 3x expansion
parameters: {"expansion":3}
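A minimal numpy sketch of the activation and expansion described above (the weight-init scheme and class names are illustrative, not the PR's actual code):

```python
import numpy as np

def sq_leaky_relu(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared, per the MLP description."""
    return np.where(x >= 0, x, slope * x) ** 2

class MLP:
    """Position-wise MLP with 3x hidden expansion (initialization is illustrative)."""
    def __init__(self, d_model, expansion=3, seed=0):
        rng = np.random.default_rng(seed)
        d_hidden = expansion * d_model
        self.w_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

    def __call__(self, x):
        return sq_leaky_relu(x @ self.w_in) @ self.w_out
```

Note the activation is nonnegative everywhere (it is a square), so the negative slope only controls how much signal from negative pre-activations survives.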
BigramHash
BigramHash feature module
parameters: {"dimensions":10240}
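The PR gives only the bucket count (10240); a plausible sketch of a hashed-bigram feature module, where the hash constants and the table initialization are assumptions:

```python
import numpy as np

def bigram_bucket(prev_tok, cur_tok, n_buckets=10240):
    """Hash a (prev, cur) token pair into one of n_buckets feature slots.
    The multiplicative mixing constants are illustrative, not the PR's hash."""
    h = (prev_tok * 0x9E3779B1 + cur_tok * 0x85EBCA77) & 0xFFFFFFFF
    return h % n_buckets

class BigramHash:
    """Table of hashed-bigram feature vectors, one row added per position."""
    def __init__(self, d_model, n_buckets=10240, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((n_buckets, d_model)) * 0.02
        self.n_buckets = n_buckets

    def __call__(self, tokens):
        # Feature at position t depends on the pair (t-1, t); position 0 reuses token 0.
        prev = np.concatenate(([tokens[0]], tokens[:-1]))
        idx = [bigram_bucket(int(p), int(c), self.n_buckets)
               for p, c in zip(prev, tokens)]
        return self.table[idx]
```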
SmearGate
SmearGate gating mechanism
parameters: null
Value Residual
Adds value residual connections
parameters: null
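Value residual connections typically mix each later layer's value projection with the first layer's values; since the PR gives no parameters, the mixing weight below is illustrative:

```python
import numpy as np

def mix_value_residual(v_layer, v_first, lam=0.5):
    """Blend the current layer's values with the first attention layer's values.
    lam is a per-layer (often learned) mixing weight; 0.5 here is an assumption."""
    return lam * v_layer + (1.0 - lam) * v_first
```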
Gated Attention
Uses gated attention mechanism
parameters: null
tied embeddings
Input and output embeddings are tied
parameters: null
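Tying the embeddings means one matrix serves as both input lookup and output head, which matters for a 13.99 MB artifact; a minimal sketch:

```python
import numpy as np

class TiedEmbedding:
    """One matrix E is both the input embedding and the output projection."""
    def __init__(self, vocab, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.standard_normal((vocab, d_model)) * 0.02

    def embed(self, tokens):
        return self.E[tokens]

    def logits(self, h):
        # The output head reuses E transposed, so no separate unembedding matrix.
        return h @ self.E.T
```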
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
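The parameters suggest each layer's LayerNorm output is damped by 1/sqrt(layer+1), so deeper layers contribute progressively less to the residual stream; a sketch:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scaled_layernorm(x, layer_idx):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1), per the PR's formula."""
    return layernorm(x) / np.sqrt(layer_idx + 1)
```

Layer 0 keeps scale 1; layer 3 emits half the magnitude; the deepest layer (index 10) emits 1/sqrt(11) of it.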
Quantization
GPTQ-lite
bits: 6
scope: all
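"GPTQ-lite" is not further specified; full GPTQ corrects rounding error with second-order information, so the sketch below shows only the int6 storage format via plain per-channel round-to-nearest, as a hedged stand-in:

```python
import numpy as np

def quantize_int6(w):
    """Per-output-channel symmetric round-to-nearest to 6 bits (range [-31, 31])."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```

Round-to-nearest bounds the per-element reconstruction error by scale/2, which is what makes the 6-bit weights compress well afterwards.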
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
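The two averaging schemes above are standard; a minimal sketch with the PR's decay of 0.997 ("Tight SWA" presumably averages checkpoints over a narrow window near the end of training; the uniform average below is the plain form):

```python
import numpy as np

class WeightEMA:
    """Exponential moving average of model weights, decay 0.997 per the PR."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

def swa_average(checkpoints):
    """Plain SWA: uniform average of a list of checkpoint dicts."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```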
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: null
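Muon applies its learning rate to an orthogonalized momentum update, computed by a Newton-Schulz iteration on each 2-D weight's gradient statistics. The sketch below uses the classic cubic iteration for clarity; Muon's reference implementation uses a tuned quintic variant:

```python
import numpy as np

def orthogonalize(g, steps=25):
    """Approximate the nearest (semi-)orthogonal matrix to g via the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X. Converges once the
    singular values are scaled into (0, 1], hence the spectral-norm division."""
    x = g / np.linalg.norm(g, 2)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

After enough steps, every singular value of the result is driven to 1, so the update direction is preserved while its per-direction magnitudes are equalized.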
Evaluation
multi-order backoff n-gram eval cache
parameters: {"orders":[2,3,4,5,6,7],"fallback":"highest-order-first","alpha":0.4,"buckets_per_order":"4M","score_first":true,"deterministic":true,"no_ttt":true}
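A hedged reading of these parameters: counts for orders 2-7 are hashed into fixed bucket tables (~4M per order in the PR; tiny here), the longest context with any mass wins ("highest-order-first"), the cache prediction is mixed into the model's with fixed alpha=0.4, and "score_first" is taken to mean each token is scored before the cache is updated with it, keeping evaluation deterministic with no test-time training:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Multi-order backoff n-gram evaluation cache (orders 2..7)."""
    def __init__(self, orders=(2, 3, 4, 5, 6, 7), n_buckets=1 << 12, alpha=0.4):
        self.orders = sorted(orders, reverse=True)   # highest-order-first fallback
        self.n_buckets = n_buckets
        self.alpha = alpha
        # per order: context-bucket -> {next_token: count}
        self.counts = {o: defaultdict(lambda: defaultdict(int)) for o in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def prob(self, context, token):
        """Back off from the longest context whose bucket has any counts."""
        for o in self.orders:
            if len(context) < o - 1:
                continue
            dist = self.counts[o].get(self._bucket(tuple(context[-(o - 1):])))
            if dist:
                return dist.get(token, 0) / sum(dist.values())
        return None

    def update(self, context, token):
        for o in self.orders:
            if len(context) >= o - 1:
                self.counts[o][self._bucket(tuple(context[-(o - 1):]))][token] += 1

    def blend(self, p_model, context, token):
        """Score first (cache not yet updated with this token), then the caller
        calls update(); fixed alpha=0.4 mixes cache and model probabilities."""
        p_cache = self.prob(context, token)
        if p_cache is None:
            return p_model
        return self.alpha * p_cache + (1 - self.alpha) * p_model
```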

Novel Contributions

  • 11-layer Transformer with XSA applied to all layers
  • Multi-order backoff n-gram evaluation cache from orders 2 through 7
  • Highest-order-first fallback with fixed alpha=0.40
  • Score-first deterministic evaluation with no test-time training
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • EMA plus Tight SWA plus Late QAT training recipe
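"Late QAT" is not detailed in the techniques list above; presumably fake-quantization is switched on for the last stretch of training so the network adapts to int6 rounding before the GPTQ-lite export. A minimal fake-quant sketch (the straight-through estimator used in the backward pass is omitted):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Forward pass sees int6-rounded weights while master weights stay float."""
    qmax = 2 ** (bits - 1) - 1
    m = np.abs(w).max()
    scale = m / qmax if m > 0 else 1.0
    return np.round(w / scale) * scale
```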