PR #740
Record: 9L XSA-all + LeakyReLU² + 5-gram eval cache — val_bpb 1.0909 (3-seed mean)
by resouer
val_bpb: 1.0909
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB
Training Techniques
Architecture
XSA
Exclusive Self-Attention applied to all layers
parameters: {"layers":9}
SmearGate
Additional gating mechanism in the transformer
parameters: null
BigramHash
Hashed bigram feature with 4096 buckets
parameters: {"dimensions":4096}
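A minimal sketch of what a hashed bigram feature with 4096 buckets could look like. The hash constant, the embedding width (64), and the convention for the first position are illustrative assumptions; the PR only records the bucket count.

```python
import torch

class BigramHash(torch.nn.Module):
    # Hashed bigram feature: map each (prev, cur) token pair to one of
    # `buckets` embedding rows via a cheap multiplicative hash.
    def __init__(self, buckets=4096, dim=64):
        super().__init__()
        self.buckets = buckets
        self.emb = torch.nn.Embedding(buckets, dim)

    def forward(self, tokens):  # tokens: (B, T) int64
        prev = torch.roll(tokens, 1, dims=-1)
        prev[..., 0] = 0  # assumed convention: no predecessor at position 0
        h = (prev * 1000003 + tokens) % self.buckets
        return self.emb(h)  # (B, T, dim), added to the token stream elsewhere
```

The hashed lookup keeps the table size fixed regardless of vocabulary size, at the cost of collisions between unrelated bigrams.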
Partial RoPE
Rotary positional embeddings applied partially
parameters: {"percentage":25}
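A sketch of partial RoPE under one common convention: rotate only the first 25% of each head's dimensions and pass the rest through unchanged. The interleaved pairing and the frequency base are assumptions; the PR records only the percentage.

```python
import torch

def partial_rope(x, rot_frac=0.25, base=10000.0):
    # x: (B, H, T, D). Apply rotary embeddings to the first rot_frac of
    # the head dims; the remaining dims carry no positional rotation.
    B, H, T, D = x.shape
    d = int(D * rot_frac)
    d -= d % 2  # rotary dim must be even
    x_rot, x_pass = x[..., :d], x[..., d:]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    angles = torch.arange(T, dtype=x.dtype)[:, None] * inv_freq  # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```

At position 0 the rotation is the identity, so the function leaves the first timestep untouched.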
LeakyReLU²
Squared LeakyReLU activation with negative slope 0.5
parameters: {"slope":0.5}
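A sketch of the activation. "Squared LeakyReLU" is ambiguous between a plain square (which discards sign) and a sign-preserving square; this sketch assumes the sign-preserving reading, which keeps the function monotonic.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU followed by a sign-preserving square: y * |y|.
    # A plain y ** 2 is the other possible reading of the PR.
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y.abs()
```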
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
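The PR leaves all Muon hyperparameters unrecorded. For orientation, a minimal sketch of a Muon-style update: momentum-average the gradient, then approximately orthogonalize the update matrix with an odd quintic Newton-Schulz iteration (coefficients as in the public reference implementation; learning rate, momentum, and step count below are placeholder defaults, not the PR's values).

```python
import torch

def newton_schulz_orth(G, steps=5):
    # Odd quintic Newton-Schulz iteration that pushes the singular values
    # of G toward 1, i.e. approximately orthogonalizes it.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the short-fat orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # One Muon-style step on a 2-D weight: momentum buffer, then
    # orthogonalized update. Hyperparameters here are illustrative.
    buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz_orth(buf), alpha=-lr)
    return param
```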
Quantization
int8
bits: 8
scope: per-row weights
Compression
zstd
level: 22
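A sketch of the artifact pipeline: symmetric int8 quantization with one scale per weight row, then byte-level compression of the quantized tensor. The absmax/127 scale rule is an assumption, and `zlib` stands in for zstd level 22 here only because zstd bindings are not in the standard library.

```python
import numpy as np
import zlib

def quantize_per_row(w):
    # Symmetric per-row int8 quantization: one float scale per row,
    # values rounded into [-127, 127]. Scale rule is an assumption.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(8, 16).astype(np.float32)
q, scale = quantize_per_row(w)
blob = zlib.compress(q.tobytes(), 9)  # stand-in for zstd -22 in the PR
err = np.abs(dequantize(q, scale) - w).max()
```

Per-row scales bound the round-trip error at half a quantization step of the largest-magnitude entry in each row.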
Initialization
OrthoInit
Orthogonal initialization
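A minimal sketch of orthogonal initialization applied model-wide; the gain and the rule of touching only 2-D weight matrices are assumptions, since the PR records no parameters.

```python
import torch

def ortho_init_(model, gain=1.0):
    # Re-initialize every 2-D weight matrix with an orthogonal matrix
    # (semi-orthogonal for rectangular shapes); biases and 1-D params
    # are left as-is.
    for p in model.parameters():
        if p.dim() == 2:
            torch.nn.init.orthogonal_(p, gain=gain)
    return model
```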
Regularization
LN Scale
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
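A sketch of the schedule. The PR records only `warmdown_steps=3500`; the trapezoidal shape (hold, then decay linearly to zero over the final steps) is an assumption based on common warmdown usage.

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Hold base_lr for most of training, then decay linearly to zero
    # over the last `warmdown_steps` steps.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```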
Evaluation
online n-gram cache
parameters: {"order":5,"buckets":4000000,"mixing":{"model":0.8,"ngram":0.2},"score_first":true,"backward_looking":true,"target_aware_gating":false}
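A sketch of the evaluation-time cache implied by these parameters: a hashed 5-gram count table built online over the evaluation stream, with each token scored before its counts are updated (`score_first`) against a backward-looking 4-token context, and the model and n-gram probabilities mixed with fixed weights 0.8/0.2. The hash function and the fallback for unseen contexts are assumptions.

```python
from collections import defaultdict

class OnlineNgramCache:
    # Online hashed n-gram cache for eval-time probability mixing.
    # order=5 and buckets=4_000_000 match the PR's parameters.
    def __init__(self, order=5, buckets=4_000_000, ngram_weight=0.2):
        self.order = order
        self.buckets = buckets
        self.w = ngram_weight
        self.context_counts = defaultdict(int)  # times a context bucket was seen
        self.pair_counts = defaultdict(int)     # times (context, token) was seen

    def _ctx_hash(self, context):
        h = 0
        for t in context:
            h = (h * 1000003 + t) % self.buckets
        return h

    def mix(self, model_prob, context, token):
        # Score first with the current counts, then record the observation.
        c = self._ctx_hash(context[-(self.order - 1):])
        seen = self.context_counts[c]
        if seen == 0:
            ngram_prob = model_prob  # assumed fallback: defer to the model
        else:
            ngram_prob = self.pair_counts[(c, token)] / seen
        mixed = (1 - self.w) * model_prob + self.w * ngram_prob
        self.context_counts[c] += 1
        self.pair_counts[(c, token)] += 1
        return mixed
```

Because the cache is filled during evaluation itself, it rewards repeated local structure in the eval stream without touching training.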
Novel Contributions
- XSA applied to all 9 layers
- LeakyReLU squared activation
- Online 5-gram evaluation cache with fixed-weight mixing
- Hashed 5-gram frequency table with 4M buckets
- Int8 per-row quantization with zstd-22 compression