PR #634 (open)
Record: 11L XSA-all + Full GPTQ + Parallel Muon + Selective Pruning (val_bpb: 1.1171)
by raahilshah
val_bpb
1.1171
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.92MB
Training Techniques
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers to force cross-position mixing from layer 0
parameters: {"layers":11}
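The record ships no code, so the exact XSA masking is an assumption; reading "exclusive" as each position being masked out of its own attention (so all probability mass must go to other positions), a minimal mask sketch:

```python
def xsa_mask(seq_len):
    # Strictly lower-triangular mask (assumption): position i may attend
    # to j only when j < i -- the diagonal is excluded, so every head is
    # forced to mix information from other positions starting at layer 0.
    return [[j < i for j in range(seq_len)] for i in range(seq_len)]
```

Position 0 attends to nothing under this mask, so a real implementation would need a fallback there (e.g. an attention sink or a zero output).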
LeakyReLU(0.5)^2
Activation function to prevent dead neurons and double effective MLP capacity
parameters: null
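Taking the name literally (leaky ReLU with negative slope 0.5, then squared), a sketch:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU keeps a nonzero response for x < 0 (no dead neurons),
    # which is the sense in which effective MLP capacity is "doubled";
    # the result is then squared, relu^2-style.
    y = x if x >= 0 else slope * x
    return y * y
```

Note the square discards the sign of the negative branch; whether the submission restores it is not stated.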
Partial RoPE
Partial Rotary Positional Embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
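A sketch of the "16/64" split: only the first 16 of 64 head dims get rotated, and the NTK-aware variant stretches the frequency base. The `ntk_alpha` knob and its exponent follow the common NTK-aware formula, not necessarily this submission's:

```python
import math

def partial_rope(x, pos, rope_dims=16, base=10000.0, ntk_alpha=1.0):
    # Rotate only the first `rope_dims` entries of the 64-dim head vector;
    # the rest pass through position-agnostic. ntk_alpha > 1 stretches the
    # base for longer contexts (1.0 = plain RoPE).
    d = rope_dims
    scaled_base = base * ntk_alpha ** (d / (d - 2))
    out = list(x)
    for i in range(0, d, 2):
        theta = pos / scaled_base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```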
SmearGate
Temporal gating mechanism
parameters: null
BigramHash
Bigram hashing with 2048 buckets and 128-dim embedding
parameters: {"buckets":2048,"embedding_dim":128}
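The bucketing step can be sketched as a hash of the (previous, current) token pair; the mixing constant below is mine, not the submission's:

```python
def bigram_bucket(prev_tok, tok, buckets=2048):
    # Hash the bigram into one of 2048 buckets; each bucket selects a
    # learned 128-dim embedding added to the model's input stream,
    # giving cheap local n-gram features.
    return (prev_tok * 1000003 + tok) % buckets
```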
U-Net skips
Skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_skips":5,"decoder_skips":6}
KV head count
8 query heads sharing 4 KV heads (grouped-query attention)
parameters: {"heads":8,"kv_heads":4}
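The grouping is just integer division; with 8 query heads over 4 KV heads, consecutive pairs of query heads share one KV head, halving KV-cache and KV-projection size:

```python
def kv_head_for_query(q_head, heads=8, kv_heads=4):
    # Grouped-query attention mapping: query heads (0,1) -> KV head 0,
    # (2,3) -> 1, and so on.
    return q_head // (heads // kv_heads)
```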
tied embeddings
Weight tying of embeddings
parameters: null
Quantization
Full Hessian GPTQ with amax-aligned QAT
bits: 6
scope: all block weights
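Full-Hessian GPTQ itself is involved; the amax-aligned QAT half is simpler to illustrate. A sketch of row-max symmetric fake quantization at 6 bits, so training clips on the same grid the export quantizer uses (my formulation, not the submission's code):

```python
def fake_quant_row(row, bits=6):
    # Symmetric per-row quantization with the scale set by the row's
    # absolute maximum (amax). QAT fake-quantizes with this exact scale,
    # so the network trains against the grid GPTQ will round to at export.
    qmax = 2 ** (bits - 1) - 1              # int6 -> levels in [-31, 31]
    amax = max(abs(w) for w in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(w / scale) * scale for w in row]
```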
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr_matrices":0.025,"lr_embeddings":0.035,"Newton-Schulz_steps":5,"gradient_clip":0.3,"batch_tokens":786432,"seq_len":2048}
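Muon's core step orthogonalizes each 2-D gradient with a Newton-Schulz iteration (5 steps here); the "parallel" part shards that work across ranks with overlapped communication. A cubic-iteration sketch in plain Python; Muon proper uses a tuned quintic, so this is illustrative only:

```python
def newton_schulz(G, steps=5):
    # Iterate X <- 1.5 X - 0.5 X X^T X, which drives the matrix toward
    # its nearest orthogonal factor (all singular values -> 1).
    def matmul(A, B):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
                for row in A]
    def transpose(A):
        return [list(r) for r in zip(*A)]
    # Normalize by the Frobenius norm so all singular values start in
    # (0, 1], inside the iteration's convergence region.
    fro = sum(x * x for row in G for x in row) ** 0.5 or 1.0
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```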
Weight Averaging
EMA + Tight SWA
parameters: {"EMA_decay":0.997,"SWA_frequency_steps":50,"SWA_scale_threshold":0.2}
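The EMA half is a one-liner per step; my reading of the SWA parameters (snapshot every 50 steps, admitted only when close enough to the running average per the 0.2 scale threshold) is an assumption:

```python
def ema_update(ema, params, decay=0.997):
    # Per-step exponential moving average of the weights; the "tight SWA"
    # pass additionally averages periodic snapshots (not shown).
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]
```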
Compression
lzma
level: 6
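Packing the exported weights is standard-library territory; a minimal sketch:

```python
import lzma

def pack_artifact(weight_bytes: bytes) -> bytes:
    # LZMA at preset 6 trades compression speed for ratio, which pays off
    # on the low-entropy int6 weight stream (many repeated byte patterns).
    return lzma.compress(weight_bytes, preset=6)
```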
Evaluation
sliding window eval
parameters: {"stride":64}
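A sketch of the window bookkeeping, assuming the common scheme: slide a full-length context forward 64 tokens at a time and score only each window's final 64 tokens, so almost every evaluated token sees near-maximal left context:

```python
def eval_windows(n_tokens, window=2048, stride=64):
    # Returns (context_start, score_from, score_to) spans. The first
    # `window` tokens would need separate handling (omitted here).
    return [(end - window, end - stride, end)
            for end in range(window, n_tokens + 1, stride)]
```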
Regularization
weight decay
parameters: {"value":0.04}
layerwise LN scale
parameters: {"scale_factor":"1/sqrt(layer_idx+1)"}
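The scale factor itself is straightforward; a sketch of the stated formula:

```python
import math

def ln_gain(layer_idx):
    # Scale each layer's norm gain by 1/sqrt(layer_idx + 1), progressively
    # damping the residual contribution of deeper layers.
    return 1.0 / math.sqrt(layer_idx + 1)
```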
Other
other
Selective ±1 magnitude pruning post-GPTQ to zero least impactful ±1 quantized values until target artifact size
parameters: {"target_size_MB":15.9}
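A sketch of the pruning step, assuming a per-weight saliency score is available (e.g. Hessian-based, as in GPTQ); the ±1 values are the smallest nonzero magnitudes, so zeroing the least important of them costs little reconstruction error while the extra zeros compress better under LZMA:

```python
def selective_prune(q, importance, n_zero):
    # Among weights quantized to +/-1, zero the n least-important ones.
    # `q` holds integer quantized values; `importance` is assumed given.
    idx = sorted((i for i, v in enumerate(q) if v in (1, -1)),
                 key=lambda i: importance[i])
    out = list(q)
    for i in idx[:n_zero]:
        out[i] = 0
    return out
```

In the submission this repeats until the compressed artifact reaches the 15.9 MB target.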
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
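The schedule reduces to holding the peak learning rate and then decaying linearly to zero over the final 3500 steps; a sketch:

```python
def lr_at(step, total_steps, peak_lr, warmdown_steps=3500):
    # "Warmdown": constant LR until the final warmdown_steps, then a
    # linear ramp down to zero at the last step.
    if step <= total_steps - warmdown_steps:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps
```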
Initialization
Orthogonal initialization
Novel Contributions
- Applying Exclusive Self-Attention (XSA) to all 11 layers instead of only the last 4 to improve cross-position mixing
- Full Hessian GPTQ with 256-sample calibration and Cholesky error compensation for int6 quantization
- amax-aligned QAT with row-maximum clipping matching export quantizer
- Parallel Muon optimizer with parameter banking and 3-phase overlapped optimizer step to eliminate DDP overhead and speed training
- Selective ±1 magnitude pruning post-GPTQ to reduce artifact size with minimal reconstruction error
- Use of LZMA compression (preset 6) for better compression ratio on int6 weights
- LeakyReLU(0.5)^2 activation to prevent dead neurons and double effective MLP capacity
- Combination of EMA and Tight SWA for weight averaging
- Partial RoPE with NTK-aware scaling and other architectural tweaks like SmearGate, BigramHash, U-Net skips