PR #186

closed

11L XSA + SmearGate + BigramHash + SWA (mean val_bpb=1.1565, 3 seeds)

by mahsumaktas
val_bpb: 1.1565
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention applied to the last 4 transformer layers to remove self-value bias in a GQA-compatible way.
parameters: {"layers":4}
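The PR does not spell out the XSA masking rule; a minimal sketch, assuming "exclusive" means each position is masked out of its own attention (a strictly causal mask, removing the self-value path):

```python
def xsa_mask(T):
    """Strictly causal mask: position i may attend to j only when j < i.
    Excluding j == i removes the self-value path ("self-value bias").
    Position 0 then has no allowed keys; a real kernel would need an
    attention sink (e.g. a learned null key/value) for it. Sketch only;
    the PR does not specify these details."""
    return [[j < i for j in range(T)] for i in range(T)]

def causal_mask(T):
    """Standard causal mask for comparison: j <= i keeps the diagonal."""
    return [[j <= i for j in range(T)] for i in range(T)]
```

The mask shape is per-position and independent of heads, which is what makes the scheme GQA-compatible: grouped key/value heads see the same mask as full multi-head attention.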
SmearGate
Bigram-aware gating mechanism used together with BigramHash.
parameters: null
BigramHash
Bigram-aware embedding/hash mechanism with vocabulary size 2048.
parameters: {"vocab_size":2048}
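A minimal sketch of hashing token bigrams into 2048 buckets for an auxiliary embedding table; the multiply/xor mix and the BOS handling are illustrative assumptions, since the PR only gives the bucket count:

```python
VOCAB_HASH = 2048  # bucket count from the PR's parameters

def bigram_bucket(prev_id, cur_id, n_buckets=VOCAB_HASH):
    """Hash a (previous, current) token-id pair into one of n_buckets.
    The constant 1000003 and the xor mix are illustrative; the PR does
    not specify the exact hash function."""
    return ((prev_id * 1000003) ^ cur_id) % n_buckets

def bigram_buckets(token_ids):
    """One bucket id per position; position 0 pairs with a BOS id of 0
    (an assumption). Each bucket indexes a small learned embedding that
    SmearGate can mix into the token representation."""
    prev = [0] + token_ids[:-1]
    return [bigram_bucket(p, c) for p, c in zip(prev, token_ids)]
```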
RoPE
Rotary positional embedding with increased base for longer-context modeling.
parameters: {"base":50000}
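The effect of the raised base is visible in the frequency table: base 50000 (versus the common 10000) lengthens the slowest rotary wavelengths, which helps positions stay distinguishable at longer contexts. A sketch of the standard RoPE frequency computation:

```python
import math

def rope_inv_freq(head_dim, base=50000.0):
    """Inverse frequencies for rotary embedding: base**(-2i/d) for each
    channel pair. A larger base stretches the low-frequency end."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rope_angles(pos, head_dim, base=50000.0):
    """Rotation angle applied to each channel pair at a given position."""
    return [pos * f for f in rope_inv_freq(head_dim, base)]
```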
MLP2.75x
Expanded MLP width to 2.75x with hidden size 1408 to fit within the artifact budget.
parameters: {"multiplier":2.75,"hidden_size":1408}
Quantization
int6
bits: 6
scope: per-row weights
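A minimal sketch of symmetric per-row int6 quantization, assuming one floating-point scale per weight row and a symmetric integer grid (the PR gives only "6 bits, per-row"):

```python
def quantize_row_int6(row):
    """Symmetric per-row int6: map each weight to an integer in
    [-31, 31] with one scale per row. 6 bits cover [-32, 31]; using
    +/-31 keeps the grid symmetric around zero (an assumption)."""
    amax = max(abs(w) for w in row) or 1.0  # guard all-zero rows
    scale = amax / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]
```

With this scheme the per-element reconstruction error is bounded by half a scale step, which is why a tight per-row scale matters.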
fp16
bits: 16
scope: tied embedding and late-K layers
Compression
zstd
level: 22
Weight Averaging
SWA
parameters: {"every_steps":50,"start_frac":0.4,"accumulation":"fp32"}
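The SWA parameters above translate to: snapshot every 50 optimizer steps, starting after 40% of training, accumulating the running mean in fp32. A minimal sketch with plain float lists standing in for fp32 parameter copies:

```python
class SWA:
    """Stochastic weight averaging with fp32 accumulation: snapshot
    every `every_steps` steps, starting after `start_frac` of training."""
    def __init__(self, n_params, every_steps=50, start_frac=0.4,
                 total_steps=10000):
        self.avg = [0.0] * n_params
        self.count = 0
        self.every = every_steps
        self.start = int(start_frac * total_steps)

    def maybe_update(self, step, weights):
        if step >= self.start and step % self.every == 0:
            self.count += 1
            for i, w in enumerate(weights):
                # running mean in fp32: avg += (w - avg) / count
                self.avg[i] += (w - self.avg[i]) / self.count
```

Unlike an EMA, the uniform mean gives every late snapshot equal weight, which the PR argues behaves better under quantization.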
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
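The momentum warmup above ramps Muon's momentum from 0.92 to the final 0.99 over the first 1500 steps. The linear shape is an assumption; the PR only gives the endpoints and the step count:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`
    steps (linear interpolation assumed), then hold it at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```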
Initialization
OrthoInit
Orthogonal initialization used with SmearGate and BigramHash.
spectral init
Overtone SVD initialization with phase-transition residual mixing.
Regularization
grad clip
parameters: {"norm":0.3}
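Gradient clipping at norm 0.3 is presumably the standard global-norm scheme: if the L2 norm over all gradients exceeds the threshold, scale everything down uniformly. A sketch over a flat gradient vector:

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Global-norm clipping: if the L2 norm of all gradients exceeds
    max_norm, rescale every gradient so the global norm equals max_norm;
    otherwise leave them untouched."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```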
weight decay
parameters: {"value":0.04}
Evaluation
sliding window eval
parameters: {"stride":64}
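A sketch of the window layout for stride-64 sliding evaluation, assuming the common convention of scoring only the final `stride` tokens of each window so every scored token gets near-maximal context (the PR gives only the stride):

```python
def eval_windows(n_tokens, window=2048, stride=64):
    """Slide a `window`-token context by `stride` and score only the
    last `stride` positions of each window. Returns
    (window_start, score_start, score_end) spans. A full implementation
    would also score the first window's earlier tokens and any tail;
    window=2048 matches the training length (an assumption)."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        spans.append((start, start + window - stride, start + window))
        start += stride
    return spans
```

The small stride makes eval roughly `window / stride` times more expensive than a single disjoint pass, in exchange for lower (more accurate) bpb.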
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
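The warmdown schedule holds the learning rate flat, then decays it over the final 3000 iterations. The linear decay shape is an assumption; the PR only gives `warmdown_iters`:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the final
    `warmdown_iters` steps (trapezoidal-style schedule; the linear
    shape is assumed)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```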
Other
other
Magnitude pruning before quantization.
parameters: {"sparsity":0.02}
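A minimal sketch of the pruning step: zero the smallest 2% of weights by magnitude before quantization, so the pruned positions quantize exactly to zero and compress well under zstd. Global (rather than per-layer) thresholding is an assumption:

```python
def magnitude_prune(weights, sparsity=0.02):
    """Zero the smallest `sparsity` fraction of weights by absolute
    value. Ties at the threshold may prune slightly more than the
    target fraction."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```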

Novel Contributions

  • 11 transformer layers with XSA on the last 4 layers
  • SmearGate combined with BigramHash(2048) and OrthoInit
  • INT6 per-row quantization with zstd-22 compression
  • SWA with fp32 accumulation instead of EMA for better quantization behavior
  • Muon optimizer tuning with specific weight decay and momentum warmup
  • RoPE base increased to 50K
  • Overtone SVD initialization with phase-transition residual mixing
  • MLP expansion tuned to 2.75x to fit under the 16MB limit
  • Magnitude pruning before quantization