PR #758 (open)

Record: 11L XSA-all + 7-gram cache (mean val_bpb=1.0465)

by hypery11
val_bpb: 1.0465
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.99 MB

Training Techniques

Architecture
  • XSA: Exclusive Self-Attention applied on all 11 layers (parameters: {"layers":11})
  • LeakyReLU(0.5)^2 MLP: MLP uses LeakyReLU with slope 0.5, squared, with 3x expansion (parameters: {"expansion":3})
  • BigramHash: bigram hash feature module (parameters: {"dimensions":10240})
  • SmearGate: gating mechanism used in the architecture
  • Value Residual: adds value residual connections
  • Gated Attention: attention mechanism with gating
  • U-Net skip: U-Net style skip connections
  • Tied embeddings: input and output embeddings are tied
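The MLP entry above (LeakyReLU with slope 0.5, squared, 3x expansion) can be sketched as follows. This is a minimal stand-in, not the PR's code: the function and weight names are made up, and real layers would carry biases and learned parameters.

```python
import numpy as np

def leaky_relu_sq_mlp(x, w_in, w_out, slope=0.5):
    """MLP block: project up 3x, apply LeakyReLU(slope) squared, project down.
    Shapes and names are illustrative, not taken from the PR's code."""
    h = x @ w_in                         # (d,) -> (3d,), the 3x expansion
    h = np.where(h >= 0, h, slope * h)   # LeakyReLU with negative slope 0.5
    h = h * h                            # squared activation
    return h @ w_out                     # (3d,) -> (d,)

d = 4
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 3 * d)) * 0.1
w_out = rng.standard_normal((3 * d, d)) * 0.1
y = leaky_relu_sq_mlp(np.ones(d), w_in, w_out)
```

Squaring a leaky activation keeps the block smooth at zero while remaining sign-blind only in the output, not in the gating, since negative pre-activations are damped by 0.5 before the square.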
Regularization
  • LN scaling (parameters: {"scale":"1/sqrt(layer+1)"})
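The LN scaling entry lists a per-layer factor of 1/sqrt(layer+1). A minimal sketch of what that could look like, assuming the factor is applied to the LayerNorm output (the PR does not say where exactly it is applied, and the function name is made up):

```python
import numpy as np

def scaled_layer_norm(x, layer, eps=1e-5):
    """Parameter-free LayerNorm whose output is damped by 1/sqrt(layer+1),
    matching the Regularization entry. Placement is an assumption."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    return normed / np.sqrt(layer + 1)

x = np.array([1.0, 2.0, 3.0, 4.0])
out0 = scaled_layer_norm(x, layer=0)   # factor 1 at the first layer
out3 = scaled_layer_norm(x, layer=3)   # factor 1/2 at layer index 3
```

Deeper layers thus contribute progressively smaller normalized activations, which acts as a depth-dependent regularizer.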
Quantization
  • GPTQ-lite (bits: 6, scope: all)
Compression
  • zstd (level: 22)
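The PR does not spell out the GPTQ-lite algorithm, so the sketch below shows only the simplest int6 scheme it could reduce to: symmetric per-tensor round-to-nearest quantization. The packed 6-bit values would then be compressed with zstd at level 22 (e.g. via the `zstandard` package's `ZstdCompressor(level=22)`), which is presumably how the 13.99 MB artifact size is reached.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor round-to-nearest int6 quantization: a
    simplification of GPTQ-lite, whose exact algorithm the PR omits.
    int6 covers [-32, 31]; a shared scale maps the tensor into it."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by half a step
```

Real GPTQ-style methods additionally use second-order (Hessian) information to reorder and compensate columns; this sketch skips that entirely.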
Weight Averaging
  • EMA (parameters: {"decay":0.997})
  • SWA (parameters: {"type":"Tight SWA"})
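The EMA entry fixes decay at 0.997; one update step is just an exponential blend of the running average toward the current weights. A dict-of-floats stand-in for the real tensors:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over model parameters (decay from the PR's settings)."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}

ema = {"w": 0.0}
for step in range(3):
    ema = ema_update(ema, {"w": 1.0})   # after n steps: 1 - 0.997**n
```

"Tight SWA" is not defined in the PR; by analogy with stochastic weight averaging it presumably means a plain average of checkpoints over a short final window, applied on top of (or instead of) the EMA weights.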
Evaluation
  • backward-looking eval cache (parameters: {"order":7,"alpha":0.4,"buckets":4000000,"min_count":2,"deterministic":true,"score_first":true})
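The evaluation cache parameters suggest the following mechanism: counts of 7-grams are accumulated over tokens already scored, hashed into 4M buckets, and blended into the model's probability with weight alpha=0.4 once a context has been seen at least min_count times; "score_first" means each token is scored before being inserted, keeping the evaluation causal and deterministic. The class below is a sketch under those assumptions, since only the hyperparameters are given:

```python
from collections import defaultdict

class NgramEvalCache:
    """Backward-looking hashed n-gram cache. Score each token against
    counts gathered from earlier tokens only ("score first"), then update.
    The hashing and blending details are assumptions, not the PR's code."""

    def __init__(self, order=7, alpha=0.4, buckets=4_000_000, min_count=2):
        self.order, self.alpha = order, alpha
        self.buckets, self.min_count = buckets, min_count
        self.ctx_counts = defaultdict(int)    # context bucket -> total count
        self.pair_counts = defaultdict(int)   # (context bucket, token) -> count

    def _ctx(self, history):
        return hash(tuple(history[-(self.order - 1):])) % self.buckets

    def blended_prob(self, model_prob, history, token):
        c = self._ctx(history)
        hits = self.pair_counts[(c, token)]
        if hits >= self.min_count:
            cache_prob = hits / self.ctx_counts[c]
            return (1 - self.alpha) * model_prob + self.alpha * cache_prob
        return model_prob

    def update(self, history, token):
        c = self._ctx(history)
        self.ctx_counts[c] += 1
        self.pair_counts[(c, token)] += 1

cache = NgramEvalCache(order=3, alpha=0.4, buckets=1000, min_count=2)
history = [5, 9]
p_cold = cache.blended_prob(0.5, history, token=7)  # cache empty: model prob only
cache.update(history, 7)
cache.update(history, 7)
p_warm = cache.blended_prob(0.5, history, token=7)  # 2 hits: blend toward 1.0
```

In a scoring loop, `blended_prob` is always called before `update` for the same position, so the cache never sees the token it is currently scoring.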
Test-Time Training
  • score-first TTT (parameters: {"deterministic":true,"enabled":false})
Optimizer
  • Muon (lr: 0.025, momentum: 0.99, weight_decay: 0.04)
  • AdamW (no parameters listed)
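Muon applies momentum SGD and then approximately orthogonalizes each 2D weight update with a Newton-Schulz iteration. The sketch below is illustrative only: the quintic coefficients follow the openly published Muon reference implementation, and the step function simply wires in this PR's listed hyperparameters (lr 0.025, momentum 0.99, weight decay 0.04).

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push the singular values of G toward 1 with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference; this is a re-sketch, not the PR's code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize spectral norm <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(w, grad, buf, lr=0.025, momentum=0.99, wd=0.04):
    """One Muon update with the PR's hyperparameters: momentum buffer,
    orthogonalized update, decoupled weight decay."""
    buf = momentum * buf + grad
    w = w * (1 - lr * wd) - lr * newton_schulz_orth(buf)
    return w, buf

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
X = newton_schulz_orth(G)
w, buf = muon_step(G.copy(), rng.standard_normal((4, 4)), np.zeros((4, 4)))
```

The AdamW entry carries no parameters here; in Muon-based setups AdamW typically handles the parameters Muon does not cover (embeddings, norms, scalars), but this PR does not state that split.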

Novel Contributions

  • 11-layer Transformer with XSA applied to all layers
  • 7-gram backward-looking evaluation cache with fixed alpha and hash buckets
  • GPTQ-lite int6 quantization combined with zstd-22 compression
  • EMA, Tight SWA, and Late QAT training pipeline
  • Use of BigramHash and SmearGate architectural components
  • Score-first deterministic evaluation without TTT