PR #332

open

Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320)

by saml212
val_bpb
1.1320
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7 MB

Training Techniques

Quantization
mixed int5/int6/int7
bits: 5/6/7 (mixed, per-tensor)
scope: all weights with gradient-guided per-tensor allocation
Architecture
Partial RoPE
Applies rotary embeddings to only part of the head dimensions; remaining dimensions use position-free attention.
parameters: {"dimensions":16}
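Per the contributions list, rotary embeddings are applied to only 16 of the 64 head dimensions; the rest pass through untouched. A minimal numpy sketch under that assumption (the pairing of rotated dimensions is a standard choice, not stated in the record):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head's
    dimensions; the remaining dimensions are position-free (pass-through).
    x: (seq, head_dim) array. Hypothetical sketch of the record's
    Partial RoPE with dimensions=16."""
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # dims beyond rope_dims are returned unrotated
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

Dimensions 16 through 63 carry no positional signal, which is the "position-free attention" part of the description.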
XSA
Exclusive Self Attention removes self-value bias from attention output via orthogonal projection.
parameters: {"layers":4}
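One plausible reading of the one-line XSA description is that each token's attention output is projected orthogonal to that token's own value vector, removing the self-value component. This is a hypothetical sketch; the record does not give the exact formulation:

```python
import numpy as np

def xsa_output(attn_out, v):
    """Sketch of Exclusive Self Attention (assumed interpretation):
    for each token i, subtract the projection of its attention output
    onto its own value vector v_i, leaving the output orthogonal to v_i.
    attn_out, v: (seq, head_dim) arrays."""
    coef = (attn_out * v).sum(-1, keepdims=True) / \
           ((v * v).sum(-1, keepdims=True) + 1e-9)  # projection coefficient
    return attn_out - coef * v
```

Applied on the last 4 layers per the `layers: 4` parameter.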
Regularization
LN scale
parameters: {"scale_rule":"1/sqrt(layer_idx+1)"}
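The `scale_rule` can be read as multiplying each layer's RMSNorm output by 1/sqrt(layer_idx+1), damping deeper layers. A sketch under that assumption (the rule could instead initialize the norm gain; the record does not say):

```python
import numpy as np

def scaled_rmsnorm(x, layer_idx, eps=1e-8):
    """RMSNorm whose output is scaled by 1/sqrt(layer_idx+1), matching the
    record's scale_rule. Output-side scaling is an assumption."""
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return x / rms / np.sqrt(layer_idx + 1)
```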
Weight Averaging
EMA
parameters: {"decay":0.997}
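EMA with decay 0.997 keeps a shadow copy of the weights that is updated every step and used at evaluation time in place of the raw weights. A minimal sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over model parameters (decay=0.997 from the record).
    ema_params and params: dicts mapping name -> scalar/array weight."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```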
Evaluation
sliding window eval
parameters: {"stride":64}
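Sliding-window eval with stride 64 gives almost every token close to a full window of left context while scoring each token exactly once. A sketch of the window/scoring spans, assuming the stride pairs with the 2048 eval length (the record lists them separately):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_from) spans: the first window scores all
    its tokens; each later window scores only the tokens past the previous
    window's end, so every token is scored once with up to window-1 tokens
    of context. window/stride values are from the record."""
    spans, scored_to = [], 0
    for end in range(min(window, n_tokens), n_tokens + 1, stride):
        spans.append((max(0, end - window), end, scored_to))
        scored_to = end
    if scored_to < n_tokens:  # tail window for tokens past the last full stride
        spans.append((max(0, n_tokens - window), n_tokens, scored_to))
    return spans
```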
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3000}
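The two parameters imply a trapezoidal schedule: linear warmup over 1500 steps, then a linear warmdown to zero over the final 3000 steps. The flat phase in between is an assumption; the record only names the two ramps:

```python
def lr_multiplier(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """LR multiplier for the record's warmup/warmdown schedule (assumed
    trapezoid): ramp up, hold at 1.0, ramp down over the last warmdown_iters."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters
    return 1.0
```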
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"grad_clip_norm":0.3}
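`momentum_warmup_start` and `momentum_warmup_steps` suggest the Muon momentum is ramped from 0.92 to its final 0.99 over the first 1500 steps. A sketch assuming linear interpolation (the interpolation shape is not stated):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup implied by the record's optimizer params:
    linearly interpolate from `start` to `end` over `warmup_steps`,
    then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```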
Compression
zstandard
level: null
Other
other
Gradient-guided adaptive quantization ranks tensors by squared gradient magnitude during warmdown and assigns precision based on sensitivity.
parameters: {"top_10_percent":"int7","middle_70_percent":"int6","bottom_20_percent":"int5"}
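The allocation step above can be sketched as a ranking over accumulated squared gradient magnitudes, with the stated 10/70/20 split mapped to int7/int6/int5. Tie-breaking and rounding of the bucket sizes are assumptions:

```python
def assign_precision(grad_sq_by_tensor):
    """Gradient-guided bit allocation per the record: rank tensors by
    squared gradient magnitude (accumulated during warmdown); top 10%
    get int7, bottom 20% int5, the middle 70% int6.
    grad_sq_by_tensor: dict name -> scalar sensitivity score."""
    names = sorted(grad_sq_by_tensor, key=grad_sq_by_tensor.get, reverse=True)
    n = len(names)
    n_top = max(1, round(0.10 * n))  # most sensitive tensors
    n_bot = max(1, round(0.20 * n))  # least sensitive tensors
    bits = {}
    for i, name in enumerate(names):
        if i < n_top:
            bits[name] = 7
        elif i >= n - n_bot:
            bits[name] = 5
        else:
            bits[name] = 6
    return bits
```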

Novel Contributions

  • Gradient-guided adaptive quantization with per-tensor sensitivity ranking
  • Mixed-precision allocation across tensors (int7/int6/int5) to save artifact size
  • 12-layer model enabled by quantization savings
  • Reduced batch size to increase optimization steps within the wallclock budget
  • Partial RoPE applied to only 16 of 64 dimensions
  • Layer-wise RMSNorm scaling (LN scale)
  • Exclusive Self Attention on the last 4 layers
  • EMA replacing SWA
  • Negative finding: late QAT hurts at 12 layers because of its throughput cost