PR #1630

open

12L XSA-all + Partial RoPE + Batch 786K (1.1412 BPB, 13.5 MB)

by KevinChunye
val_bpb: 1.1412
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB

Training Techniques

Architecture
XSA: Exclusive Self Attention applied to all 12 layers (parameters: {"layers":12})
Partial RoPE: Rotary positional embeddings applied to a subset of head dimensions (parameters: {"dimensions":16,"total_dimensions":64})
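A minimal sketch of the partial-RoPE idea: rotate only 16 of the 64 head dimensions and pass the rest through unchanged. Which slice is rotated, the half-split rotation layout, and the base frequency of 10000 are assumptions here; the PR only fixes the 16/64 split.

```python
import numpy as np

def partial_rope(x, rope_dims=16):
    """Apply RoPE to the first `rope_dims` of each head dimension.
    x: (batch, seq, n_heads, head_dim). The rotated slice and base
    frequency are assumptions, not taken from the PR."""
    seq = x.shape[1]
    rot, keep = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.arange(seq)[:, None] * freqs[None, :]        # (seq, half)
    cos = np.cos(angles)[:, None, :]                         # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # The remaining 48 dimensions carry no positional rotation.
    return np.concatenate([rotated, keep], axis=-1)
```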
BigramHash: Bigram hashing embedding component (parameters: {"buckets":2048,"dim":128})
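The bigram-hash component can be sketched as follows: each (previous token, current token) pair is hashed into one of 2048 buckets, and a learned 128-dim embedding per bucket is added to the token representation. The hash function and the padding at position 0 are assumptions; only the bucket count and dimension come from the PR.

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # from the PR's parameters

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Hypothetical mixing hash; the PR does not specify the hash function.
    h = (int(prev_tok) * 1000003 + int(tok)) & 0xFFFFFFFF
    h ^= h >> 13
    return h % buckets

def bigram_features(tokens, table):
    """tokens: (seq,) token ids; table: (BUCKETS, DIM) learned embeddings.
    Position 0 has no predecessor, so it is paired with token 0 (an assumption)."""
    prev = np.concatenate([[0], tokens[:-1]])
    idx = np.array([bigram_bucket(p, t) for p, t in zip(prev, tokens)])
    return table[idx]  # (seq, DIM), added to the ordinary token embeddings
```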
SmearGate: enabled in the model (parameters: null)
MLP3x: Three-times-wider MLP with LeakyReLU activation (parameters: {"multiplier":3})
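The MLP3x block is a feed-forward layer with hidden width 3x the model dimension (versus the conventional 4x) and LeakyReLU. A minimal sketch, omitting biases, residuals, and normalization, which the real block may include:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Standard LeakyReLU; the PR does not state the negative slope.
    return np.where(x >= 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    """Feed-forward block with hidden width 3*d_model.
    w_in: (d, 3d), w_out: (3d, d)."""
    return leaky_relu(x @ w_in) @ w_out
```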
Weight Averaging
EMA (parameters: {"decay":0.997})
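EMA weight averaging keeps a shadow copy of the parameters updated as avg = decay * avg + (1 - decay) * current after each step, with decay 0.997 from the PR; evaluation then uses the averaged weights. A minimal sketch over a parameter dict:

```python
def ema_update(avg_params, params, decay=0.997):
    """One EMA step over a dict of parameter tensors/scalars.
    decay=0.997 comes from the PR's parameters."""
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in avg_params}
```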
Quantization
GPTQ-lite (bits: 6, scope: all)
QAT (bits: null, scope: all)
Compression
zstd (level: 22)
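GPTQ-lite is not fully specified in the PR; the sketch below shows plain symmetric per-tensor int6 quantization (levels -31..31) as a simplified stand-in, without GPTQ's error-compensating weight updates. The resulting 6-bit codes would then be bit-packed and zstd-compressed at level 22 to produce the 13.5 MB artifact.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: map weights onto
    integer levels in [-31, 31]. A simplified stand-in for GPTQ-lite."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```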
Optimizer
Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"adam_weight_decay":0.04})
LR Schedule
warmdown (parameters: {"warmdown_steps":3500})
Sequence Length
train_length: 2048, eval_length: 2048
Regularization
layerwise LN scale (parameters: {"scale":"1/sqrt(layer+1)"})
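The layerwise LN scale gives each layer a fixed multiplier of 1/sqrt(layer+1), so deeper layers contribute progressively less. Zero-based layer indexing and applying the factor to each block's (e.g. LayerNorm'd) output are assumptions; only the formula comes from the PR.

```python
import math

def layerwise_ln_scale(layer_idx):
    """Per-layer scale 1/sqrt(layer+1), with 0-based layer_idx (an assumption)."""
    return 1.0 / math.sqrt(layer_idx + 1)

def scaled_residual(x, block_out, layer_idx):
    # Hypothetical usage: damp each block's contribution by its layer scale.
    return x + layerwise_ln_scale(layer_idx) * block_out
```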
Evaluation
stride-based eval (parameters: {"stride":64})
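Stride-based evaluation slides a full-length window forward 64 tokens at a time and scores only the tokens not covered by the previous window, so almost every token is predicted with near-full left context. This is the standard strided scheme; the PR's exact boundary handling is an assumption.

```python
def stride_eval_windows(n_tokens, window=2048, stride=64):
    """Yield (begin, end, n_scored) windows: each window scores only the
    tokens beyond the previous window's end (the first scores everything).
    window=2048 matches eval_length; stride=64 comes from the PR."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```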

Novel Contributions

  • 12-layer architecture that fits under the 16 MB limit
  • XSA applied to all 12 layers
  • Partial RoPE using 16/64 head dimensions
  • Large-batch training with 786K tokens per batch
  • Systematic ablation study across 11 experiments
  • Combination of GPTQ-lite int6 quantization with zstd-22 compression
  • Late QAT to improve artifact size-performance tradeoff
  • EMA-based weight averaging