PR #740

open

Record: 9L XSA-all + LeakyReLU² + 5-gram eval cache — val_bpb 1.0909 (3-seed mean)

by resouer
val_bpb: 1.0909
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers
parameters: {"layers":9}
SmearGate
Additional gating mechanism in the transformer
parameters: null
BigramHash
Hashed bigram feature with 4096 buckets
parameters: {"dimensions":4096}
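A minimal sketch of how a hashed bigram feature of this kind can map token pairs to buckets; only the 4096-bucket count comes from the PR, and the mixing constant and hash form below are illustrative:

```python
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 4096) -> int:
    """Map a (previous, current) token bigram to one of n_buckets hash buckets.

    The resulting index would typically select a row of a learned embedding
    table that is added to the token representation. The multiplicative
    constant is an illustrative choice, not taken from the PR.
    """
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % n_buckets
```

The bucket index is deterministic and order-sensitive, so (a, b) and (b, a) generally land in different buckets.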
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"percentage":25}
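A sketch of partial RoPE under the assumption that the rotated 25% is the leading slice of each head dimension, with the standard (2i, 2i+1) pairing and base 10000; the PR specifies only the percentage:

```python
import math

def partial_rope(q: list, pos: int, rope_frac: float = 0.25,
                 base: float = 10000.0) -> list:
    """Apply rotary position embeddings to the first rope_frac of the dims.

    Sketch under assumptions: which dims are rotated and the base are not
    stated in the PR. Remaining dims pass through unrotated, so they carry
    position-independent content.
    """
    d = len(q)
    d_rope = int(d * rope_frac)  # number of rotated dims (assumed even)
    out = list(q)
    for i in range(0, d_rope, 2):
        theta = pos / (base ** (i / d_rope))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = q[i] * c - q[i + 1] * s
        out[i + 1] = q[i] * s + q[i + 1] * c
    return out
```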
LeakyReLU²
Uses squared LeakyReLU activation with slope 0.5
parameters: {"slope":0.5}
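Assuming the activation is the square of the LeakyReLU output (the PR gives only the name and the slope of 0.5):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: y = LeakyReLU(x)^2 with negative slope 0.5.

    Assumed formulation; the PR does not spell out whether the sign of the
    negative branch is preserved after squaring (it is not here).
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```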
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int8
bits: 8
scope: per-row weights
Compression
zstd
level: 22
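A sketch of the per-row int8 scheme named above, assuming symmetric absmax scaling (one scale per weight row); the zstd level-22 pass would then compress the quantized bytes, which is omitted here since zstd bindings are a third-party dependency:

```python
def quantize_row_int8(row: list) -> tuple:
    """Per-row symmetric int8 quantization: one absmax scale per row.

    Symmetric absmax scaling is an assumption; the PR states only
    "int8, per-row weights".
    """
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # guard all-zero rows
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q: list, scale: float) -> list:
    """Recover approximate float weights from int8 values and the row scale."""
    return [v * scale for v in q]
```

Each dequantized weight is within one scale step of the original, which bounds the round-trip error per row.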
Initialization
OrthoInit
Orthogonal initialization
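A minimal pure-Python sketch of orthogonal initialization via Gram-Schmidt on a random Gaussian matrix; practical implementations (e.g. `torch.nn.init.orthogonal_`) use a QR decomposition with sign correction instead:

```python
import random

def orthogonal_init(n: int, seed: int = 0) -> list:
    """Return an n x n matrix with orthonormal rows.

    Classical Gram-Schmidt on random Gaussian rows; fine as a sketch,
    though QR-based routines are more numerically robust at scale.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for row in a:
        # Subtract projections onto previously accepted basis vectors.
        for b in basis:
            dot = sum(x * y for x, y in zip(row, b))
            row = [x - dot * y for x, y in zip(row, b)]
        norm = sum(x * x for x in row) ** 0.5
        basis.append([x / norm for x in row])
    return basis
```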
Regularization
LN Scale
parameters: null
Sequence Length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
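The warmdown schedule, assuming the common shape of holding the base learning rate and then decaying linearly to zero; only the 3500-step warmdown length comes from the PR:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Hold base_lr, then decay linearly to 0 over the final warmdown_steps.

    The linear shape is an assumption; the PR specifies only the length.
    """
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```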
Evaluation
online n-gram cache
parameters: {"order":5,"buckets":4000000,"mixing":{"model":0.8,"ngram":0.2},"score_first":true,"backward_looking":true,"target_aware_gating":false}
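A sketch of the online 5-gram evaluation cache using the parameters above (order 5, 4M buckets, fixed 0.8/0.2 model/n-gram mixing, score-before-update); the hashing scheme and the fallback when a context has no counts yet are assumptions:

```python
from collections import defaultdict

class OnlineNGramCache:
    """Evaluation-time n-gram cache mixed with model probabilities.

    Backward-looking: counts come only from tokens already scored. With
    score_first, each position is scored before its own update, so the
    cache never sees the token it is currently predicting.
    """
    def __init__(self, order=5, buckets=4_000_000, w_model=0.8, w_ngram=0.2):
        self.order, self.buckets = order, buckets
        self.w_model, self.w_ngram = w_model, w_ngram
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _key(self, context):
        # Hash the last (order - 1) tokens into a bucket (scheme assumed).
        return hash(tuple(context[-(self.order - 1):])) % self.buckets

    def mix(self, context, model_probs):
        """Return fixed-weight mix of model and cache distributions."""
        k = self._key(context)
        if self.totals[k] == 0:
            return dict(model_probs)  # no cache mass yet: pure model
        return {t: self.w_model * p
                   + self.w_ngram * self.counts[k][t] / self.totals[k]
                for t, p in model_probs.items()}

    def update(self, context, next_token):
        """Record the observed next token for this context bucket."""
        k = self._key(context)
        self.counts[k][next_token] += 1
        self.totals[k] += 1
```

Per the eval loop implied by score_first, each position calls mix() before update().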

Novel Contributions

  • XSA applied to all 9 layers
  • LeakyReLU squared activation
  • Online 5-gram evaluation cache with fixed-weight mixing
  • Hashed 5-gram frequency table with 4M buckets
  • Int8 per-row quantization with zstd-22 compression