PR #333 (open)
11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)
by mahsumaktas
val_bpb
1.1565
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.9 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 transformer layers with GQA-compatible value expansion.
parameters: {"layers":4}
SmearGate
SmearGate added together with BigramHash to provide bigram-aware embedding/context handling.
parameters: {"bigram_vocab_size":2048}
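A minimal sketch of the BigramHash half of this technique, assuming it hashes each (previous token, current token) pair into one of `bigram_vocab_size=2048` buckets so a small bigram embedding table can be gate-added to the unigram embedding. The hash function and the first-position fallback are illustrative assumptions; the PR's actual SmearGate/BigramHash code may differ.

```python
BIGRAM_VOCAB_SIZE = 2048  # matches parameters: {"bigram_vocab_size": 2048}

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = BIGRAM_VOCAB_SIZE) -> int:
    # Simple multiplicative hash; the real hash function is an assumption.
    return (prev_tok * 1000003 + tok) % n_buckets

def bigram_ids(tokens: list[int]) -> list[int]:
    # Position 0 has no predecessor; fall back to (tok, tok) there.
    out = []
    for i, tok in enumerate(tokens):
        p = tokens[i - 1] if i > 0 else tok
        out.append(bigram_bucket(p, tok))
    return out
```

These bucket ids would index a second embedding table whose output is mixed into the token embedding through the SmearGate.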
tied embeddings
Uses FP16 tied embedding weights.
parameters: null
Late-K FP16
Keeps the last K layers in FP16 for improved quantization behavior.
parameters: {"layers":2}
RoPE
Uses a larger RoPE base for longer-context modeling.
parameters: {"base":50000}
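A sketch of how the larger base changes the RoPE frequency ladder: with base 50000 instead of the common 10000, rotation wavelengths stretch, so positional phase advances more slowly and longer contexts remain distinguishable. `head_dim=64` is an assumed value for illustration.

```python
import math

def rope_inv_freq(head_dim: int = 64, base: float = 50000.0) -> list[float]:
    # Standard RoPE inverse frequencies: base^(-2i/d) for each rotated pair.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

The lowest frequency (last entry) is smaller than with base 10000, i.e. the slowest-rotating dimension completes far fewer cycles over the same context.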
phase-transition residual mixing
Residual mixing strategy used during initialization/training.
parameters: null
MLP3x
Expanded MLP width to 2.75x (hidden size 1408), just under the 3x regime, to stay within the 16 MB artifact limit.
parameters: {"multiplier":2.75,"hidden_size":1408}
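A back-of-the-envelope check of the sizing: 1408 / 2.75 = 512, so the model width is presumably 512. The arithmetic below assumes a plain two-matrix MLP; a gated (SwiGLU-style) MLP would cost three matrices instead of two.

```python
# d_model is inferred from hidden_size / multiplier; the real config may differ.
d_model = 1408 / 2.75   # -> 512.0
hidden = 1408

# A plain 2-layer MLP costs roughly 2 * d_model * hidden weights per block;
# a gated variant would be 3 * d_model * hidden.
plain_params = 2 * int(d_model) * hidden
```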
Quantization
int6
bits: 6
scope: per-row weights
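A sketch of symmetric per-row int6 quantization as listed (6 bits, per-row scale, signed range -32..31), assuming round-to-nearest; the PR's exact rounding and clipping scheme may differ.

```python
def quantize_row_int6(row: list[float]) -> tuple[list[int], float]:
    # One scale per weight row: map the largest magnitude to +/-31.
    amax = max(abs(x) for x in row) or 1.0
    scale = amax / 31.0
    q = [max(-32, min(31, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```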
Compression
zstd
level: 22
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.4}
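A sketch of the averaging schedule as configured: snapshots every 50 steps starting at 40% of training, accumulated as an incremental mean (the PR notes fp32 accumulation; plain Python floats stand in for that here).

```python
def swa_steps(total_steps: int, every: int = 50, start_frac: float = 0.4) -> list[int]:
    # Steps at which a weight snapshot is folded into the average.
    start = int(total_steps * start_frac)
    return [s for s in range(total_steps) if s >= start and s % every == 0]

def running_mean(avg: list[float], new: list[float], n_snapshots: int) -> list[float]:
    # avg <- avg + (new - avg) / n : the usual incremental mean update.
    return [a + (w - a) / n_snapshots for a, w in zip(avg, new)]
```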
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
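A sketch of Muon's core update for reference: the momentum buffer is orthogonalized with a Newton-Schulz iteration before being applied. The quintic coefficients are the widely used ones from the public Muon implementation; the learning rate here is a placeholder (the PR only lists momentum=0.99 and weight_decay=0.04).

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Approximately orthogonalize G: push its singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

def muon_step(w, g, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # lr=0.02 is an assumed placeholder; momentum/weight_decay match the PR.
    buf = momentum * buf + g
    update = newton_schulz5(buf)
    return w * (1 - lr * weight_decay) - lr * update, buf
```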
Initialization
OrthoInit
Orthogonal initialization used with SmearGate/BigramHash.
Overtone SVD init
Spectral embedding initialization based on SVD.
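The PR describes this only as a spectral embedding initialization based on SVD. One plausible reading, sketched below as a heavily hedged guess, is to draw a Gaussian matrix and keep only its orthogonal factor, giving an embedding with a flat singular-value spectrum; the "Overtone" name suggests additional spectrum shaping not reproduced here.

```python
import numpy as np

def svd_orthogonal_init(vocab: int, dim: int, scale: float = 1.0, seed: int = 0):
    # Hypothetical reconstruction, not the PR's actual code: replace a random
    # Gaussian matrix by its closest semi-orthogonal matrix U @ Vt.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((vocab, dim))
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return scale * (U @ Vt)
```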
Regularization
magnitude pruning
parameters: {"sparsity":0.02}
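A sketch of magnitude pruning at the listed 2% sparsity, applied before quantization so that near-zero weights quantize to exactly zero. Whether the PR prunes globally or per-tensor is not stated; this sketch prunes within one flat weight list.

```python
def magnitude_prune(weights: list[float], sparsity: float = 0.02) -> list[float]:
    # Zero out the k smallest-magnitude weights, k = sparsity * len(weights).
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    cut = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
    pruned, dropped = [], 0
    for w in weights:
        if abs(w) <= cut and dropped < k:
            pruned.append(0.0)
            dropped += 1
        else:
            pruned.append(w)
    return pruned
```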
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"norm":0.3}
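A sketch of global-norm gradient clipping at the listed max norm of 0.3 (an unusually tight bound), shown over a flat gradient list for simplicity.

```python
import math

def clip_grad_norm(grads: list[float], max_norm: float = 0.3) -> list[float]:
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / (total + 1e-6)
    return [g * scale for g in grads]
```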
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
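A sketch of the warmdown schedule as configured: hold the learning rate flat, then decay linearly to zero over the final `warmdown_iters=3000` steps. `base_lr` and `total_iters` are placeholders, since the PR does not state them.

```python
def warmdown_lr(step: int, total_iters: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    # Constant LR until the warmdown window, then linear decay to 0.
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```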
Evaluation
sliding window eval
parameters: {"stride":64}
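A sketch of sliding-window evaluation with stride 64: each window is scored, but loss is counted only on its final `stride` tokens (the first window scores everything), so every evaluated token gets near-full left context. The window length of 2048 is assumed from the training sequence length.

```python
def eval_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    # Returns (window_start, window_end, score_from) triples covering the text.
    spans = []
    for start in range(0, max(n_tokens - window, 0) + 1, stride):
        end = start + window
        score_from = end - stride if start > 0 else start
        spans.append((start, end, score_from))
    return spans
```

Each token is scored exactly once, which is what makes the reported bpb comparable across entries.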
Novel Contributions
- 11-layer Transformer with XSA on the last 4 layers
- SmearGate combined with BigramHash(2048) and OrthoInit
- INT6 per-row quantization with zstd-22 compression
- SWA every 50 steps with fp32 accumulation
- Muon optimizer tuning with RoPE base 50K
- Overtone SVD initialization and phase-transition residual mixing
- MLP expansion set to 2.75x to stay under the 16MB artifact limit
- Magnitude pruning before quantization
- Empirical finding that EMA performs much worse than SWA for this stack