PR #740

open

Record: 9L XSA-all + LeakyReLU² + 5-gram eval cache — val_bpb 1.0909 (3-seed mean)

by resouer
val_bpb: 1.0909
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers
parameters: {"layers":9}
SmearGate
Additional gating mechanism in the transformer
parameters: null
BigramHash
Hashed bigram feature with 4096 buckets
parameters: {"dimensions":4096}
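A minimal sketch of how a hashed bigram feature of this kind can map token pairs to buckets; only the 4096-bucket count comes from the PR, and the mixing constant and hash form below are illustrative:

```python
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 4096) -> int:
    """Map a (previous, current) token bigram to one of n_buckets hash buckets.

    The resulting index would typically select a row of a learned embedding
    table that is added to the token representation. The multiplicative
    constant is an illustrative choice, not taken from the PR.
    """
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # cheap 32-bit mix
    return h % n_buckets
```

The bucket index is deterministic and order-sensitive, so (a, b) and (b, a) generally land in different buckets.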
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"percentage":25}
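A sketch of partial RoPE under the assumption that the rotated 25% is the leading slice of each head dimension, with the standard (2i, 2i+1) pairing and base 10000; the PR specifies only the percentage:

```python
import math

def partial_rope(q: list, pos: int, rope_frac: float = 0.25,
                 base: float = 10000.0) -> list:
    """Apply rotary position embeddings to the first rope_frac of the dims.

    Sketch under assumptions: which dims are rotated and the base are not
    stated in the PR. Remaining dims pass through unrotated, so they carry
    position-independent content.
    """
    d = len(q)
    d_rope = int(d * rope_frac)  # number of rotated dims (assumed even)
    out = list(q)
    for i in range(0, d_rope, 2):
        theta = pos / (base ** (i / d_rope))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = q[i] * c - q[i + 1] * s
        out[i + 1] = q[i] * s + q[i + 1] * c
    return out
```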
LeakyReLU²
Uses squared LeakyReLU activation with slope 0.5
parameters: {"slope":0.5}
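Assuming the activation is the square of the LeakyReLU output (the PR gives only the name and the slope of 0.5):

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Squared LeakyReLU: y = LeakyReLU(x)^2 with negative slope 0.5.

    Assumed formulation; the PR does not spell out whether the sign of the
    negative branch is preserved after squaring (it is not here).
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```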
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int8
bits: 8
scope: per-row weights
Compression
zstd
level: 22
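A sketch of the per-row int8 scheme named above, assuming symmetric absmax scaling (one scale per weight row); the zstd level-22 pass would then compress the quantized bytes, which is omitted here since zstd bindings are a third-party dependency:

```python
def quantize_row_int8(row: list) -> tuple:
    """Per-row symmetric int8 quantization: one absmax scale per row.

    Symmetric absmax scaling is an assumption; the PR states only
    "int8, per-row weights".
    """
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # guard all-zero rows
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q: list, scale: float) -> list:
    """Recover approximate float weights from int8 values and the row scale."""
    return [v * scale for v in q]
```

Each dequantized weight is within one scale step of the original, which bounds the round-trip error per row.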
Initialization
OrthoInit
Orthogonal initialization
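A minimal pure-Python sketch of orthogonal initialization via Gram-Schmidt on a random Gaussian matrix; practical implementations (e.g. `torch.nn.init.orthogonal_`) use a QR decomposition with sign correction instead:

```python
import random

def orthogonal_init(n: int, seed: int = 0) -> list:
    """Return an n x n matrix with orthonormal rows.

    Classical Gram-Schmidt on random Gaussian rows; fine as a sketch,
    though QR-based routines are more numerically robust at scale.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for row in a:
        # Subtract projections onto previously accepted basis vectors.
        for b in basis:
            dot = sum(x * y for x, y in zip(row, b))
            row = [x - dot * y for x, y in zip(row, b)]
        norm = sum(x * x for x in row) ** 0.5
        basis.append([x / norm for x in row])
    return basis
```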
Regularization
LN Scale
parameters: null
Sequence Length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
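The warmdown schedule, assuming the common shape of holding the base learning rate and then decaying linearly to zero; only the 3500-step warmdown length comes from the PR:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Hold base_lr, then decay linearly to 0 over the final warmdown_steps.

    The linear shape is an assumption; the PR specifies only the length.
    """
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```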
Evaluation
online n-gram cache
parameters: {"order":5,"buckets":4000000,"mixing":{"model":0.8,"ngram":0.2},"score_first":true,"backward_looking":true,"target_aware_gating":false}
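A sketch of the online 5-gram evaluation cache using the parameters above (order 5, 4M buckets, fixed 0.8/0.2 model/n-gram mixing, score-before-update); the hashing scheme and the fallback when a context has no counts yet are assumptions:

```python
from collections import defaultdict

class OnlineNGramCache:
    """Evaluation-time n-gram cache mixed with model probabilities.

    Backward-looking: counts come only from tokens already scored. With
    score_first, each position is scored before its own update, so the
    cache never sees the token it is currently predicting.
    """
    def __init__(self, order=5, buckets=4_000_000, w_model=0.8, w_ngram=0.2):
        self.order, self.buckets = order, buckets
        self.w_model, self.w_ngram = w_model, w_ngram
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _key(self, context):
        # Hash the last (order - 1) tokens into a bucket (scheme assumed).
        return hash(tuple(context[-(self.order - 1):])) % self.buckets

    def mix(self, context, model_probs):
        """Return fixed-weight mix of model and cache distributions."""
        k = self._key(context)
        if self.totals[k] == 0:
            return dict(model_probs)  # no cache mass yet: pure model
        return {t: self.w_model * p
                   + self.w_ngram * self.counts[k][t] / self.totals[k]
                for t, p in model_probs.items()}

    def update(self, context, next_token):
        """Record the observed next token for this context bucket."""
        k = self._key(context)
        self.counts[k][next_token] += 1
        self.totals[k] += 1
```

Per the eval loop implied by score_first, each position calls mix() before update().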

Novel Contributions

  • XSA applied to all 9 layers
  • LeakyReLU squared activation
  • Online 5-gram evaluation cache with fixed-weight mixing
  • Hashed 5-gram frequency table with 4M buckets
  • Int8 per-row quantization with zstd-22 compression