PR #1630

open

12L XSA-all + Partial RoPE + Batch 786K (1.1412 BPB, 13.5 MB)

by KevinChunye
val_bpb: 1.1412
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB

Training Techniques

Architecture
XSA: Exclusive Self Attention applied to all 12 layers (parameters: {"layers":12})
Partial RoPE: Rotary positional embeddings applied to a subset of head dimensions (parameters: {"dimensions":16,"total_dimensions":64})
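A minimal sketch of the partial-RoPE idea: rotate only 16 of the 64 head dimensions and pass the rest through unchanged. Which slice is rotated, the half-split rotation layout, and the base frequency of 10000 are assumptions here; the PR only fixes the 16/64 split.

```python
import numpy as np

def partial_rope(x, rope_dims=16):
    """Apply RoPE to the first `rope_dims` of each head dimension.
    x: (batch, seq, n_heads, head_dim). The rotated slice and base
    frequency are assumptions, not taken from the PR."""
    seq = x.shape[1]
    rot, keep = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.arange(seq)[:, None] * freqs[None, :]        # (seq, half)
    cos = np.cos(angles)[:, None, :]                         # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # The remaining 48 dimensions carry no positional rotation.
    return np.concatenate([rotated, keep], axis=-1)
```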
BigramHash: Bigram hashing embedding component (parameters: {"buckets":2048,"dim":128})
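The bigram-hash component can be sketched as follows: each (previous token, current token) pair is hashed into one of 2048 buckets, and a learned 128-dim embedding per bucket is added to the token representation. The hash function and the padding at position 0 are assumptions; only the bucket count and dimension come from the PR.

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # from the PR's parameters

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Hypothetical mixing hash; the PR does not specify the hash function.
    h = (int(prev_tok) * 1000003 + int(tok)) & 0xFFFFFFFF
    h ^= h >> 13
    return h % buckets

def bigram_features(tokens, table):
    """tokens: (seq,) token ids; table: (BUCKETS, DIM) learned embeddings.
    Position 0 has no predecessor, so it is paired with token 0 (an assumption)."""
    prev = np.concatenate([[0], tokens[:-1]])
    idx = np.array([bigram_bucket(p, t) for p, t in zip(prev, tokens)])
    return table[idx]  # (seq, DIM), added to the ordinary token embeddings
```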
SmearGate: enabled in the model (parameters: null)
MLP3x: Three-times-wider MLP with LeakyReLU activation (parameters: {"multiplier":3})
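The MLP3x block is a feed-forward layer with hidden width 3x the model dimension (versus the conventional 4x) and LeakyReLU. A minimal sketch, omitting biases, residuals, and normalization, which the real block may include:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Standard LeakyReLU; the PR does not state the negative slope.
    return np.where(x >= 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    """Feed-forward block with hidden width 3*d_model.
    w_in: (d, 3d), w_out: (3d, d)."""
    return leaky_relu(x @ w_in) @ w_out
```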
Weight Averaging
EMA (parameters: {"decay":0.997})
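EMA weight averaging keeps a shadow copy of the parameters updated as avg = decay * avg + (1 - decay) * current after each step, with decay 0.997 from the PR; evaluation then uses the averaged weights. A minimal sketch over a parameter dict:

```python
def ema_update(avg_params, params, decay=0.997):
    """One EMA step over a dict of parameter tensors/scalars.
    decay=0.997 comes from the PR's parameters."""
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in avg_params}
```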
Quantization
GPTQ-lite (bits: 6, scope: all)
QAT (bits: null, scope: all)
Compression
zstd (level: 22)
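GPTQ-lite is not fully specified in the PR; the sketch below shows plain symmetric per-tensor int6 quantization (levels -31..31) as a simplified stand-in, without GPTQ's error-compensating weight updates. The resulting 6-bit codes would then be bit-packed and zstd-compressed at level 22 to produce the 13.5 MB artifact.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization: map weights onto
    integer levels in [-31, 31]. A simplified stand-in for GPTQ-lite."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```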
Optimizer
Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"adam_weight_decay":0.04})
LR Schedule
warmdown (parameters: {"warmdown_steps":3500})
Sequence Length
train_length: 2048, eval_length: 2048
Regularization
layerwise LN scale (parameters: {"scale":"1/sqrt(layer+1)"})
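The layerwise LN scale gives each layer a fixed multiplier of 1/sqrt(layer+1), so deeper layers contribute progressively less. Zero-based layer indexing and applying the factor to each block's (e.g. LayerNorm'd) output are assumptions; only the formula comes from the PR.

```python
import math

def layerwise_ln_scale(layer_idx):
    """Per-layer scale 1/sqrt(layer+1), with 0-based layer_idx (an assumption)."""
    return 1.0 / math.sqrt(layer_idx + 1)

def scaled_residual(x, block_out, layer_idx):
    # Hypothetical usage: damp each block's contribution by its layer scale.
    return x + layerwise_ln_scale(layer_idx) * block_out
```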
Evaluation
stride-based eval (parameters: {"stride":64})
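Stride-based evaluation slides a full-length window forward 64 tokens at a time and scores only the tokens not covered by the previous window, so almost every token is predicted with near-full left context. This is the standard strided scheme; the PR's exact boundary handling is an assumption.

```python
def stride_eval_windows(n_tokens, window=2048, stride=64):
    """Yield (begin, end, n_scored) windows: each window scores only the
    tokens beyond the previous window's end (the first scores everything).
    window=2048 matches eval_length; stride=64 comes from the PR."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```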

Novel Contributions

  • 12-layer architecture that fits under the 16 MB limit
  • XSA applied to all 12 layers
  • Partial RoPE using 16/64 head dimensions
  • Large-batch training with 786K tokens per batch
  • Systematic ablation study across 11 experiments
  • Combination of GPTQ-lite int6 quantization with zstd-22 compression
  • Late QAT to improve artifact size-performance tradeoff
  • EMA-based weight averaging