PR #609 (open)
Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)
by saml212
val_bpb: 1.1154
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.94 MB
Training Techniques
Architecture
XSA
Cross-Position Self-Attention applied to all 11 layers instead of only the last 4, forcing cross-position information mixing from layer 0
parameters: {"layers":11}
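The layer selection above can be sketched as a config change. The variable names and the baseline's layer indices are illustrative assumptions, not the PR's code:

```python
# Which layers get Cross-Position Self-Attention (XSA).
n_layers = 11

# Baseline (assumed): XSA only on the last 4 layers.
baseline_xsa_layers = list(range(n_layers - 4, n_layers))  # layers 7..10

# This PR: XSA on every layer, so cross-position mixing starts at layer 0.
xsa_layers = list(range(n_layers))                         # layers 0..10
```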
Selective ±1 magnitude pruning
Post-GPTQ pruning of ±1 quantized values sorted by reconstruction error (scale²), zeroing the least-impactful values first until the artifact fits
parameters: null
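The pruning pass can be sketched as follows. A minimal NumPy sketch under assumptions: weights are int-quantized with per-row scales, and zeroing a ±1 entry incurs squared reconstruction error scale²; the function name and fixed-count interface are illustrative (the PR prunes until a size budget is met):

```python
import numpy as np

def selective_prune_pm1(q, scale, n_zero):
    """Zero the n_zero quantized +/-1 entries with the smallest error.

    q     : int array of quantized weights
    scale : per-row dequantization scales, shape (rows, 1)

    Zeroing a +/-1 entry changes the dequantized weight by exactly its
    row's scale, so the squared reconstruction error is scale**2; the
    cheapest entries are zeroed first.
    """
    q = q.copy()
    rows, cols = np.nonzero(np.abs(q) == 1)
    errors = scale[rows, 0] ** 2
    cheapest = np.argsort(errors)[:n_zero]
    q[rows[cheapest], cols[cheapest]] = 0
    return q
```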
LeakyReLU(0.5)² MLP 3x
MLP with squared LeakyReLU(0.5) activation, repeated 3 times
parameters: null
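The activation can be written elementwise; a NumPy sketch (note the square makes the output non-negative):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then an elementwise square.
    y = np.where(x > 0, x, slope * x)
    return y * y
```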
BigramHash
Bigram hashing with 2048 buckets
parameters: {"buckets":2048}
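A bigram hash of this kind maps each pair of adjacent token ids into one of the 2048 buckets. The multiplier and mixing scheme below are assumptions for illustration, not the PR's actual hash:

```python
def bigram_bucket(prev_token, token, n_buckets=2048):
    # Hash a (previous token, current token) pair into a bucket index;
    # 1000003 is an arbitrary large prime chosen here for mixing.
    return (prev_token * 1000003 + token) % n_buckets
```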
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"partial_rope":"16/64"}
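Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unchanged. A minimal NumPy sketch, assuming the rotated dimensions are the first 16 of 64 and the standard base of 10000 (both assumptions; the PR records only "16/64"):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims of each position's vector.

    x : array of shape (seq_len, head_dim); dims rot_dims.. pass through.
    """
    seq_len, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq_len)[:, None] * inv_freq      # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```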
LN Scale
LayerNorm scaling
parameters: null
VE128
Value Embedding with dimension 128
parameters: {"dimension":128}
SmearGate
SmearGate mechanism
parameters: null
U-Net skips
Skip connections inspired by U-Net architecture
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
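The EMA weight average updates a shadow copy of the parameters each step. A sketch using the decay of 0.997 from the parameters above, with a list of floats standing in for real parameter tensors:

```python
def ema_update(avg_params, new_params, decay=0.997):
    # Exponential moving average: shadow <- decay * shadow + (1 - decay) * new.
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```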
Tight SWA
parameters: null
Quantization
Full Hessian GPTQ
bits: 6
scope: int6
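Full-Hessian GPTQ quantizes columns in sequence while compensating the remaining weights for the rounding error; that loop is omitted here. The sketch below shows only the symmetric int6 grid being quantized onto, with per-row scales and a [-31, 31] level range (an assumption about the grid layout):

```python
import numpy as np

def int6_grid(w):
    # Symmetric per-row int6 quantization grid (round-to-nearest shown;
    # GPTQ replaces plain rounding with Hessian-aware error compensation).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale
```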
Compression
lzma
level: null
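The final artifact is LZMA-compressed; since no level is recorded, the preset below is an assumption:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # LZMA-compress the serialized artifact; preset 9 (maximum) is an
    # assumption, as the PR does not record a compression level.
    return lzma.compress(blob, preset=9)
```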
Novel Contributions
- Applying Cross-Position Self-Attention (XSA) to all 11 layers instead of the standard last 4, improving cross-position information mixing from layer 0
- Selective ±1 magnitude pruning after GPTQ: ±1 quantized values are sorted by reconstruction error (scale²) and the least impactful are zeroed first until the artifact fits