PR #333 (open)
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)
by mahsumaktas
val_bpb
1.1565
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.9 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 transformer layers with GQA-compatible value expansion.
parameters: {"layers":4}
SmearGate
SmearGate added together with BigramHash to provide bigram-aware embedding/context handling.
parameters: {"bigram_vocab_size":2048}
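A minimal sketch of the BigramHash half of this technique, assuming it hashes each (previous token, current token) pair into one of `bigram_vocab_size=2048` buckets so a small bigram embedding table can be gate-added to the unigram embedding. The hash function and the first-position fallback are illustrative assumptions; the PR's actual SmearGate/BigramHash code may differ.

```python
BIGRAM_VOCAB_SIZE = 2048  # matches parameters: {"bigram_vocab_size": 2048}

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = BIGRAM_VOCAB_SIZE) -> int:
    # Simple multiplicative hash; the real hash function is an assumption.
    return (prev_tok * 1000003 + tok) % n_buckets

def bigram_ids(tokens: list[int]) -> list[int]:
    # Position 0 has no predecessor; fall back to (tok, tok) there.
    out = []
    for i, tok in enumerate(tokens):
        p = tokens[i - 1] if i > 0 else tok
        out.append(bigram_bucket(p, tok))
    return out
```

These bucket ids would index a second embedding table whose output is mixed into the token embedding through the SmearGate.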
tied embeddings
Uses FP16 tied embedding weights.
parameters: null
Late-K FP16
Keeps the last K layers in FP16 for improved quantization behavior.
parameters: {"layers":2}
RoPE
Uses a larger RoPE base for longer-context modeling.
parameters: {"base":50000}
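A sketch of how the larger base changes the RoPE frequency ladder: with base 50000 instead of the common 10000, rotation wavelengths stretch, so positional phase advances more slowly and longer contexts remain distinguishable. `head_dim=64` is an assumed value for illustration.

```python
import math

def rope_inv_freq(head_dim: int = 64, base: float = 50000.0) -> list[float]:
    # Standard RoPE inverse frequencies: base^(-2i/d) for each rotated pair.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

The lowest frequency (last entry) is smaller than with base 10000, i.e. the slowest-rotating dimension completes far fewer cycles over the same context.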
phase-transition residual mixing
Residual mixing strategy used during initialization/training.
parameters: null
MLP3x
Expanded MLP width to 2.75x (hidden size 1408), just under the 3x regime, to stay within the 16 MB artifact limit.
parameters: {"multiplier":2.75,"hidden_size":1408}
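A back-of-the-envelope check of the sizing: 1408 / 2.75 = 512, so the model width is presumably 512. The arithmetic below assumes a plain two-matrix MLP; a gated (SwiGLU-style) MLP would cost three matrices instead of two.

```python
# d_model is inferred from hidden_size / multiplier; the real config may differ.
d_model = 1408 / 2.75   # -> 512.0
hidden = 1408

# A plain 2-layer MLP costs roughly 2 * d_model * hidden weights per block;
# a gated variant would be 3 * d_model * hidden.
plain_params = 2 * int(d_model) * hidden
```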
Quantization
int6
bits: 6
scope: per-row weights
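A sketch of symmetric per-row int6 quantization as listed (6 bits, per-row scale, signed range -32..31), assuming round-to-nearest; the PR's exact rounding and clipping scheme may differ.

```python
def quantize_row_int6(row: list[float]) -> tuple[list[int], float]:
    # One scale per weight row: map the largest magnitude to +/-31.
    amax = max(abs(x) for x in row) or 1.0
    scale = amax / 31.0
    q = [max(-32, min(31, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```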
Compression
zstd
level: 22
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.4}
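A sketch of the averaging schedule as configured: snapshots every 50 steps starting at 40% of training, accumulated as an incremental mean (the PR notes fp32 accumulation; plain Python floats stand in for that here).

```python
def swa_steps(total_steps: int, every: int = 50, start_frac: float = 0.4) -> list[int]:
    # Steps at which a weight snapshot is folded into the average.
    start = int(total_steps * start_frac)
    return [s for s in range(total_steps) if s >= start and s % every == 0]

def running_mean(avg: list[float], new: list[float], n_snapshots: int) -> list[float]:
    # avg <- avg + (new - avg) / n : the usual incremental mean update.
    return [a + (w - a) / n_snapshots for a, w in zip(avg, new)]
```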
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
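A sketch of Muon's core update for reference: the momentum buffer is orthogonalized with a Newton-Schulz iteration before being applied. The quintic coefficients are the widely used ones from the public Muon implementation; the learning rate here is a placeholder (the PR only lists momentum=0.99 and weight_decay=0.04).

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Approximately orthogonalize G: push its singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # lr=0.02 is an assumed placeholder; momentum/weight_decay match the PR.
    buf = momentum * buf + g
    update = newton_schulz5(buf)
    return w * (1 - lr * weight_decay) - lr * update, buf
```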
Initialization
OrthoInit
Orthogonal initialization used with SmearGate/BigramHash.
Overtone SVD init
Spectral embedding initialization based on SVD.
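The PR describes this only as a spectral embedding initialization based on SVD. One plausible reading, sketched below as a heavily hedged guess, is to draw a Gaussian matrix and keep only its orthogonal factor, giving an embedding with a flat singular-value spectrum; the "Overtone" name suggests additional spectrum shaping not reproduced here.

```python
import numpy as np

def svd_orthogonal_init(vocab: int, dim: int, scale: float = 1.0, seed: int = 0):
    # Hypothetical reconstruction, not the PR's actual code: replace a random
    # Gaussian matrix by its closest semi-orthogonal matrix U @ Vt.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((vocab, dim))
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return scale * (U @ Vt)
```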
Regularization
magnitude pruning
parameters: {"sparsity":0.02}
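A sketch of magnitude pruning at the listed 2% sparsity, applied before quantization so that near-zero weights quantize to exactly zero. Whether the PR prunes globally or per-tensor is not stated; this sketch prunes within one flat weight list.

```python
def magnitude_prune(weights: list[float], sparsity: float = 0.02) -> list[float]:
    # Zero out the k smallest-magnitude weights, k = sparsity * len(weights).
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    cut = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
    pruned, dropped = [], 0
    for w in weights:
        if abs(w) <= cut and dropped < k:
            pruned.append(0.0)
            dropped += 1
        else:
            pruned.append(w)
    return pruned
```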
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"norm":0.3}
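A sketch of global-norm gradient clipping at the listed max norm of 0.3 (an unusually tight bound), shown over a flat gradient list for simplicity.

```python
import math

def clip_grad_norm(grads: list[float], max_norm: float = 0.3) -> list[float]:
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / (total + 1e-6)
    return [g * scale for g in grads]
```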
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
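A sketch of the warmdown schedule as configured: hold the learning rate flat, then decay linearly to zero over the final `warmdown_iters=3000` steps. `base_lr` and `total_iters` are placeholders, since the PR does not state them.

```python
def warmdown_lr(step: int, total_iters: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    # Constant LR until the warmdown window, then linear decay to 0.
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```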
Evaluation
sliding window eval
parameters: {"stride":64}
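A sketch of sliding-window evaluation with stride 64: each window is scored, but loss is counted only on its final `stride` tokens (the first window scores everything), so every evaluated token gets near-full left context. The window length of 2048 is assumed from the training sequence length.

```python
def eval_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    # Returns (window_start, window_end, score_from) triples covering the text.
    spans = []
    for start in range(0, max(n_tokens - window, 0) + 1, stride):
        end = start + window
        score_from = end - stride if start > 0 else start
        spans.append((start, end, score_from))
    return spans
```

Each token is scored exactly once, which is what makes the reported bpb comparable across entries.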
Novel Contributions
- 11-layer Transformer with XSA on the last 4 layers
- SmearGate combined with BigramHash(2048) and OrthoInit
- INT6 per-row quantization with zstd-22 compression
- SWA every 50 steps with fp32 accumulation
- Muon optimizer tuning with RoPE base 50K
- Overtone SVD initialization and phase-transition residual mixing
- MLP expansion set to 2.75x to stay under the 16MB artifact limit
- Magnitude pruning before quantization
- Empirical finding that EMA performs much worse than SWA for this stack