PR #186

closed

11L XSA + SmearGate + BigramHash + SWA (mean val_bpb=1.1565, 3 seeds)

by mahsumaktas
val_bpb: 1.1565
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention applied to the last 4 transformer layers to remove self-value bias in a GQA-compatible way.
parameters: {"layers":4}
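The PR does not spell out the XSA masking rule; a minimal sketch, assuming "exclusive" means each position is masked out of its own attention (a strictly causal mask, removing the self-value path):

```python
def xsa_mask(T):
    """Strictly causal mask: position i may attend to j only when j < i.
    Excluding j == i removes the self-value path ("self-value bias").
    Position 0 then has no allowed keys; a real kernel would need an
    attention sink (e.g. a learned null key/value) for it. Sketch only;
    the PR does not specify these details."""
    return [[j < i for j in range(T)] for i in range(T)]

def causal_mask(T):
    """Standard causal mask for comparison: j <= i keeps the diagonal."""
    return [[j <= i for j in range(T)] for i in range(T)]
```

The mask shape is per-position and independent of heads, which is what makes the scheme GQA-compatible: grouped key/value heads see the same mask as full multi-head attention.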
SmearGate
Bigram-aware gating mechanism used together with BigramHash.
parameters: null
BigramHash
Bigram-aware embedding/hash mechanism with vocabulary size 2048.
parameters: {"vocab_size":2048}
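A minimal sketch of hashing token bigrams into 2048 buckets for an auxiliary embedding table; the multiply/xor mix and the BOS handling are illustrative assumptions, since the PR only gives the bucket count:

```python
VOCAB_HASH = 2048  # bucket count from the PR's parameters

def bigram_bucket(prev_id, cur_id, n_buckets=VOCAB_HASH):
    """Hash a (previous, current) token-id pair into one of n_buckets.
    The constant 1000003 and the xor mix are illustrative; the PR does
    not specify the exact hash function."""
    return ((prev_id * 1000003) ^ cur_id) % n_buckets

def bigram_buckets(token_ids):
    """One bucket id per position; position 0 pairs with a BOS id of 0
    (an assumption). Each bucket indexes a small learned embedding that
    SmearGate can mix into the token representation."""
    prev = [0] + token_ids[:-1]
    return [bigram_bucket(p, c) for p, c in zip(prev, token_ids)]
```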
RoPE
Rotary positional embedding with increased base for longer-context modeling.
parameters: {"base":50000}
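The effect of the raised base is visible in the frequency table: base 50000 (versus the common 10000) lengthens the slowest rotary wavelengths, which helps positions stay distinguishable at longer contexts. A sketch of the standard RoPE frequency computation:

```python
import math

def rope_inv_freq(head_dim, base=50000.0):
    """Inverse frequencies for rotary embedding: base**(-2i/d) for each
    channel pair. A larger base stretches the low-frequency end."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rope_angles(pos, head_dim, base=50000.0):
    """Rotation angle applied to each channel pair at a given position."""
    return [pos * f for f in rope_inv_freq(head_dim, base)]
```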
MLP2.75x
Expanded MLP width to 2.75x with hidden size 1408 to fit within the artifact budget.
parameters: {"multiplier":2.75,"hidden_size":1408}
Quantization
int6
bits: 6
scope: per-row weights
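A minimal sketch of symmetric per-row int6 quantization, assuming one floating-point scale per weight row and a symmetric integer grid (the PR gives only "6 bits, per-row"):

```python
def quantize_row_int6(row):
    """Symmetric per-row int6: map each weight to an integer in
    [-31, 31] with one scale per row. 6 bits cover [-32, 31]; using
    +/-31 keeps the grid symmetric around zero (an assumption)."""
    amax = max(abs(w) for w in row) or 1.0  # guard all-zero rows
    scale = amax / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]
```

With this scheme the per-element reconstruction error is bounded by half a scale step, which is why a tight per-row scale matters.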
fp16
bits: 16
scope: tied embedding and late-K layers
Compression
zstd
level: 22
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.4,"accumulation":"fp32"}
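The SWA parameters above translate to: snapshot every 50 optimizer steps, starting after 40% of training, accumulating the running mean in fp32. A minimal sketch with plain float lists standing in for fp32 parameter copies:

```python
class SWA:
    """Stochastic weight averaging with fp32 accumulation: snapshot
    every `every_steps` steps, starting after `start_frac` of training."""
    def __init__(self, n_params, every_steps=50, start_frac=0.4,
                 total_steps=10000):
        self.avg = [0.0] * n_params
        self.count = 0
        self.every = every_steps
        self.start = int(start_frac * total_steps)

    def maybe_update(self, step, weights):
        if step >= self.start and step % self.every == 0:
            self.count += 1
            for i, w in enumerate(weights):
                # running mean in fp32: avg += (w - avg) / count
                self.avg[i] += (w - self.avg[i]) / self.count
```

Unlike an EMA, the uniform mean gives every late snapshot equal weight, which the PR argues behaves better under quantization.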
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
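The momentum warmup above ramps Muon's momentum from 0.92 to the final 0.99 over the first 1500 steps. The linear shape is an assumption; the PR only gives the endpoints and the step count:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`
    steps (linear interpolation assumed), then hold it at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```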
Initialization
OrthoInit
Orthogonal initialization used with SmearGate and BigramHash.
spectral init
Overtone SVD initialization with phase-transition residual mixing.
Regularization
grad clip
parameters: {"norm":0.3}
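Gradient clipping at norm 0.3 is presumably the standard global-norm scheme: if the L2 norm over all gradients exceeds the threshold, scale everything down uniformly. A sketch over a flat gradient vector:

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Global-norm clipping: if the L2 norm of all gradients exceeds
    max_norm, rescale every gradient so the global norm equals max_norm;
    otherwise leave them untouched."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```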
weight decay
parameters: {"value":0.04}
Evaluation
sliding window eval
parameters: {"stride":64}
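A sketch of the window layout for stride-64 sliding evaluation, assuming the common convention of scoring only the final `stride` tokens of each window so every scored token gets near-maximal context (the PR gives only the stride):

```python
def eval_windows(n_tokens, window=2048, stride=64):
    """Slide a `window`-token context by `stride` and score only the
    last `stride` positions of each window. Returns
    (window_start, score_start, score_end) spans. A full implementation
    would also score the first window's earlier tokens and any tail;
    window=2048 matches the training length (an assumption)."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        spans.append((start, start + window - stride, start + window))
        start += stride
    return spans
```

The small stride makes eval roughly `window / stride` times more expensive than a single disjoint pass, in exchange for lower (more accurate) bpb.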
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
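The warmdown schedule holds the learning rate flat, then decays it over the final 3000 iterations. The linear decay shape is an assumption; the PR only gives `warmdown_iters`:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the final
    `warmdown_iters` steps (trapezoidal-style schedule; the linear
    shape is assumed)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```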
Other
other
Magnitude pruning before quantization.
parameters: {"sparsity":0.02}
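A minimal sketch of the pruning step: zero the smallest 2% of weights by magnitude before quantization, so the pruned positions quantize exactly to zero and compress well under zstd. Global (rather than per-layer) thresholding is an assumption:

```python
def magnitude_prune(weights, sparsity=0.02):
    """Zero the smallest `sparsity` fraction of weights by absolute
    value. Ties at the threshold may prune slightly more than the
    target fraction."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```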

Novel Contributions

  • 11 transformer layers with XSA on the last 4 layers
  • SmearGate combined with BigramHash(2048) and OrthoInit
  • INT6 per-row quantization with zstd-22 compression
  • SWA with fp32 accumulation instead of EMA for better quantization behavior
  • Muon optimizer tuning with specific weight decay and momentum warmup
  • RoPE base increased to 50K
  • Overtone SVD initialization with phase-transition residual mixing
  • MLP expansion tuned to 2.75x to fit under the 16MB limit
  • Magnitude pruning before quantization