PR #634 (open)
Record: 11L XSA-all + Full GPTQ + Parallel Muon + Selective Pruning (val_bpb: 1.1171)
by raahilshah
val_bpb
1.1171
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.92MB
Training Techniques
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers to force cross-position mixing from layer 0
parameters: {"layers":11}
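The record ships no code, so the exact XSA masking is an assumption; reading "exclusive" as each position being masked out of its own attention (so all probability mass must go to other positions), a minimal mask sketch:

```python
def xsa_mask(seq_len):
    # Strictly lower-triangular mask (assumption): position i may attend
    # to j only when j < i -- the diagonal is excluded, so every head is
    # forced to mix information from other positions starting at layer 0.
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]
```

Position 0 attends to nothing under this mask, so a real implementation would need a fallback there (e.g. an attention sink or a zero output).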
LeakyReLU(0.5)^2
Activation function to prevent dead neurons and double effective MLP capacity
parameters: null
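Taking the name literally (leaky ReLU with negative slope 0.5, then squared), a sketch:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU keeps a nonzero response for x < 0 (no dead neurons),
    # which is the sense in which effective MLP capacity is "doubled";
    # the result is then squared, relu^2-style.
    y = x if x >= 0 else slope * x
    return y * y
```

Note the square discards the sign of the negative branch; whether the submission restores it is not stated.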
Partial RoPE
Partial Rotary Positional Embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
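A sketch of the "16/64" split: only the first 16 of 64 head dims get rotated, and the NTK-aware variant stretches the frequency base. The `ntk_alpha` knob and its exponent follow the common NTK-aware formula, not necessarily this submission's:

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0, ntk_alpha=1.0):
    # Rotate only the first `rope_dims` entries of the 64-dim head vector;
    # the rest pass through position-agnostic. ntk_alpha > 1 stretches the
    # base for longer contexts (1.0 = plain RoPE).
    d = rope_dims
    scaled_base = base * ntk_alpha ** (d / (d - 2))
    out = list(x)
    for i in range(0, d, 2):
        theta = pos / scaled_base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```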
SmearGate
Temporal gating mechanism
parameters: null
BigramHash
Bigram hashing with 2048 buckets and 128-dim embedding
parameters: {"buckets":2048,"embedding_dim":128}
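The bucketing step can be sketched as a hash of the (previous, current) token pair; the mixing constant below is mine, not the submission's:

```python
def bigram_bucket(prev_tok, tok, buckets=2048):
    # Hash the bigram into one of 2048 buckets; each bucket selects a
    # learned 128-dim embedding added to the model's input stream,
    # giving cheap local n-gram features.
    return (prev_tok * 1000003 + tok) % buckets
```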
U-Net skips
Skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_skips":5,"decoder_skips":6}
KV head count
8 query heads sharing 4 KV heads (grouped-query attention)
parameters: {"heads":8,"kv_heads":4}
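The grouping is just integer division; with 8 query heads over 4 KV heads, consecutive pairs of query heads share one KV head, halving KV-cache and KV-projection size:

```python
def kv_head_for_query(q_head, heads=8, kv_heads=4):
    # Grouped-query attention mapping: query heads (0,1) -> KV head 0,
    # (2,3) -> 1, and so on.
    return q_head // (heads // kv_heads)
```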
tied embeddings
Weight tying of embeddings
parameters: null
Quantization
Full Hessian GPTQ with amax-aligned QAT
bits: 6
scope: all block weights
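Full-Hessian GPTQ itself is involved; the amax-aligned QAT half is simpler to illustrate. A sketch of row-max symmetric fake quantization at 6 bits, so training clips on the same grid the export quantizer uses (my formulation, not the submission's code):

```python
def fake_quant_row(row, bits=6):
    # Symmetric per-row quantization with the scale set by the row's
    # absolute maximum (amax). QAT fake-quantizes with this exact scale,
    # so the network trains against the grid GPTQ will round to at export.
    qmax = 2 ** (bits - 1) - 1              # int6 -> levels in [-31, 31]
    amax = max(abs(w) for w in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(w / scale) * scale for w in row]
```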
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr_matrices":0.025,"lr_embeddings":0.035,"Newton-Schulz_steps":5,"gradient_clip":0.3,"batch_tokens":786432,"seq_len":2048}
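Muon's core step orthogonalizes each 2-D gradient with a Newton-Schulz iteration (5 steps here); the "parallel" part shards that work across ranks with overlapped communication. A cubic-iteration sketch in plain Python; Muon proper uses a tuned quintic, so this is illustrative only:

```python
def newton_schulz(G, steps=5):
    # Iterate X <- 1.5 X - 0.5 X X^T X, which drives the matrix toward
    # its nearest orthogonal factor (all singular values -> 1).
    def matmul(A, B):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
                for row in A]
    def transpose(A):
        return [list(r) for r in zip(*A)]
    # Normalize by the Frobenius norm so all singular values start in
    # (0, 1], inside the iteration's convergence region.
    fro = sum(x * x for row in G for x in row) ** 0.5 or 1.0
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```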
Weight Averaging
EMA + Tight SWA
parameters: {"EMA_decay":0.997,"SWA_frequency_steps":50,"SWA_scale_threshold":0.2}
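The EMA half is a one-liner per step; my reading of the SWA parameters (snapshot every 50 steps, admitted only when close enough to the running average per the 0.2 scale threshold) is an assumption:

```python
def ema_update(ema, params, decay=0.997):
    # Per-step exponential moving average of the weights; the "tight SWA"
    # pass additionally averages periodic snapshots (not shown).
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```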
Compression
lzma
level: 6
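Packing the exported weights is standard-library territory; a minimal sketch:

```python
import lzma

def pack_artifact(weight_bytes: bytes) -> bytes:
    # LZMA at preset 6 trades compression speed for ratio, which pays off
    # on the low-entropy int6 weight stream (many repeated byte patterns).
    return lzma.compress(weight_bytes, preset=6)
```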
Evaluation
sliding window eval
parameters: {"stride":64}
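A sketch of the window bookkeeping, assuming the common scheme: slide a full-length context forward 64 tokens at a time and score only each window's final 64 tokens, so almost every evaluated token sees near-maximal left context:

```python
def eval_windows(n_tokens, window=2048, stride=64):
    # Returns (context_start, score_from, score_to) spans. The first
    # `window` tokens would need separate handling (omitted here).
    return [(end - window, end - stride, end)
            for end in range(window, n_tokens + 1, stride)]
```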
Regularization
weight decay
parameters: {"value":0.04}
layerwise LN scale
parameters: {"scale_factor":"1/sqrt(layer_idx+1)"}
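The scale factor itself is straightforward; a sketch of the stated formula:

```python
import math

def ln_gain(layer_idx):
    # Scale each layer's norm gain by 1/sqrt(layer_idx + 1), progressively
    # damping the residual contribution of deeper layers.
    return 1.0 / math.sqrt(layer_idx + 1)
```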
Other
other
Selective ±1 magnitude pruning post-GPTQ to zero least impactful ±1 quantized values until target artifact size
parameters: {"target_size_MB":15.9}
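A sketch of the pruning step, assuming a per-weight saliency score is available (e.g. Hessian-based, as in GPTQ); the ±1 values are the smallest nonzero magnitudes, so zeroing the least important of them costs little reconstruction error while the extra zeros compress better under LZMA:

```python
def selective_prune(q, importance, n_zero):
    # Among weights quantized to +/-1, zero the n least-important ones.
    # `q` holds integer quantized values; `importance` is assumed given.
    idx = sorted((i for i, v in enumerate(q) if v in (1, -1)),
                 key=lambda i: importance[i])
    out = list(q)
    for i in idx[:n_zero]:
        out[i] = 0
    return out
```

In the submission this repeats until the compressed artifact reaches the 15.9 MB target.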
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
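The schedule reduces to holding the peak learning rate and then decaying linearly to zero over the final 3500 steps; a sketch:

```python
def lr_at(step, total_steps, peak_lr, warmdown_steps=3500):
    # "Warmdown": constant LR until the final warmdown_steps, then a
    # linear ramp down to zero at the last step.
    if step <= total_steps - warmdown_steps:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps
```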
Initialization
Orthogonal initialization
Novel Contributions
- Applying Exclusive Self-Attention (XSA) to all 11 layers instead of only the last 4 to improve cross-position mixing
- Full Hessian GPTQ with 256-sample calibration and Cholesky error compensation for int6 quantization
- amax-aligned QAT with row-maximum clipping matching export quantizer
- Parallel Muon optimizer with parameter banking and 3-phase overlapped optimizer step to eliminate DDP overhead and speed training
- Selective ±1 magnitude pruning post-GPTQ to reduce artifact size with minimal reconstruction error
- Use of LZMA compression (preset 6) for better compression ratio on int6 weights
- LeakyReLU(0.5)^2 activation to prevent dead neurons and double effective MLP capacity
- Combination of EMA and Tight SWA for weight averaging
- Partial RoPE with NTK-aware scaling and other architectural tweaks like SmearGate, BigramHash, U-Net skips