PR #332

open

Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320)

by saml212
val_bpb
1.1320
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7 MB

Training Techniques

Quantization
mixed int5/int6/int7
bits: 5/6/7 (mixed, per-tensor)
scope: all weights with gradient-guided per-tensor allocation
Architecture
Partial RoPE
Applies rotary embeddings to only part of the head dimensions; remaining dimensions use position-free attention.
parameters: {"dimensions":16}
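Per the contributions list, rotary embeddings are applied to only 16 of the 64 head dimensions; the rest pass through untouched. A minimal numpy sketch under that assumption (the pairing of rotated dimensions is a standard choice, not stated in the record):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head's
    dimensions; the remaining dimensions are position-free (pass-through).
    x: (seq, head_dim) array. Hypothetical sketch of the record's
    Partial RoPE with dimensions=16."""
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # dims beyond rope_dims are returned unrotated
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

Dimensions 16 through 63 carry no positional signal, which is the "position-free attention" part of the description.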
XSA
Exclusive Self Attention removes self-value bias from attention output via orthogonal projection.
parameters: {"layers":4}
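One plausible reading of the one-line XSA description is that each token's attention output is projected orthogonal to that token's own value vector, removing the self-value component. This is a hypothetical sketch; the record does not give the exact formulation:

```python
import numpy as np

def xsa_output(attn_out, v):
    """Sketch of Exclusive Self Attention (assumed interpretation):
    for each token i, subtract the projection of its attention output
    onto its own value vector v_i, leaving the output orthogonal to v_i.
    attn_out, v: (seq, head_dim) arrays."""
    coef = (attn_out * v).sum(-1, keepdims=True) / \
           ((v * v).sum(-1, keepdims=True) + 1e-9)  # projection coefficient
    return attn_out - coef * v
```

Applied on the last 4 layers per the `layers: 4` parameter.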
Regularization
LN scale
parameters: {"scale_rule":"1/sqrt(layer_idx+1)"}
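The `scale_rule` can be read as multiplying each layer's RMSNorm output by 1/sqrt(layer_idx+1), damping deeper layers. A sketch under that assumption (the rule could instead initialize the norm gain; the record does not say):

```python
import numpy as np

def scaled_rmsnorm(x, layer_idx, eps=1e-8):
    """RMSNorm whose output is scaled by 1/sqrt(layer_idx+1), matching the
    record's scale_rule. Output-side scaling is an assumption."""
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return x / rms / np.sqrt(layer_idx + 1)
```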
Weight Averaging
EMA
parameters: {"decay":0.997}
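EMA with decay 0.997 keeps a shadow copy of the weights that is updated every step and used at evaluation time in place of the raw weights. A minimal sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over model parameters (decay=0.997 from the record).
    ema_params and params: dicts mapping name -> scalar/array weight."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```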
Evaluation
sliding window eval
parameters: {"stride":64}
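Sliding-window eval with stride 64 gives almost every token close to a full window of left context while scoring each token exactly once. A sketch of the window/scoring spans, assuming the stride pairs with the 2048 eval length (the record lists them separately):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: the first window scores all
    its tokens; each later window scores only the tokens past the previous
    window's end, so every token is scored once with up to window-1 tokens
    of context. window/stride values are from the record."""
    spans, scored_to = [], 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        spans.append((max(0, end - window), end, scored_to))
        scored_to = end
    if scored_to < n_tokens:  # tail window for tokens past the last full stride
        spans.append((max(0, n_tokens - window), n_tokens, scored_to))
    return spans
```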
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3000}
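The two parameters imply a trapezoidal schedule: linear warmup over 1500 steps, then a linear warmdown to zero over the final 3000 steps. The flat phase in between is an assumption; the record only names the two ramps:

```python
def lr_multiplier(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """LR multiplier for the record's warmup/warmdown schedule (assumed
    trapezoid): ramp up, hold at 1.0, ramp down over the last warmdown_iters."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters
    return 1.0
```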
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"grad_clip_norm":0.3}
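`momentum_warmup_start` and `momentum_warmup_steps` suggest the Muon momentum is ramped from 0.92 to its final 0.99 over the first 1500 steps. A sketch assuming linear interpolation (the interpolation shape is not stated):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup implied by the record's optimizer params:
    linearly interpolate from `start` to `end` over `warmup_steps`,
    then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```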
Compression
zstandard
level: null
Other
other
Gradient-guided adaptive quantization ranks tensors by squared gradient magnitude during warmdown and assigns precision based on sensitivity.
parameters: {"top_10_percent":"int7","middle_70_percent":"int6","bottom_20_percent":"int5"}
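The allocation step above can be sketched as a ranking over accumulated squared gradient magnitudes, with the stated 10/70/20 split mapped to int7/int6/int5. Tie-breaking and rounding of the bucket sizes are assumptions:

```python
def assign_precision(grad_sq_by_tensor):
    """Gradient-guided bit allocation per the record: rank tensors by
    squared gradient magnitude (accumulated during warmdown); top 10%
    get int7, bottom 20% int5, the middle 70% int6.
    grad_sq_by_tensor: dict name -> scalar sensitivity score."""
    names = sorted(grad_sq_by_tensor, key=grad_sq_by_tensor.get, reverse=True)
    n = len(names)
    n_top = max(1, round(0.10 * n))  # most sensitive tensors
    n_bot = max(1, round(0.20 * n))  # least sensitive tensors
    bits = {}
    for i, name in enumerate(names):
        if i < n_top:
            bits[name] = 7
        elif i >= n - n_bot:
            bits[name] = 5
        else:
            bits[name] = 6
    return bits
```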

Novel Contributions

  • Gradient-guided adaptive quantization with per-tensor sensitivity ranking
  • Mixed-precision allocation across tensors (int7/int6/int5) to save artifact size
  • 12-layer model enabled by quantization savings
  • Reduced batch size to increase optimization steps within the wallclock budget
  • Partial RoPE applied to only 16 of 64 dimensions
  • Layer-wise RMSNorm scaling (LN scale)
  • Exclusive Self Attention on the last 4 layers
  • EMA replacing SWA
  • Negative finding: late QAT hurts at 12 layers because of its throughput cost