PR #422

open

Record: 11L Gradient-Guided Adaptive Quant + EMA + Sliding Eval (val_bpb=1.1396)

by albertorkive
val_bpb: 1.1396
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Quantization
mixed int5/int6/int7 (scope: all tensors)
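
The quantizer itself isn't shown above, so the following is a minimal sketch, assuming symmetric per-tensor round-to-nearest quantization; the sub-byte int5/int6/int7 values sit in int8 containers here, whereas the shipped 15.9 MB artifact presumably bit-packs them. Function names are illustrative.

```python
import torch

def quantize_int_k(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization of w onto a signed `bits`-wide grid."""
    qmax = 2 ** (bits - 1) - 1                       # 63 for int7, 31 for int6, 15 for int5
    scale = w.abs().max().clamp(min=1e-12) / qmax    # one scale per tensor
    q = torch.round(w / scale).clamp(-qmax - 1, qmax)
    return q.to(torch.int8), scale                   # sub-byte values held in an int8 container

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```
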
Architecture
SmearGate: residual mixing with a learnable gate
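
SmearGate isn't defined beyond the one-line description, so the sketch below is one plausible reading: each position's residual state is blended with the previous position's ("smeared") through a learned per-channel sigmoid gate. The previous-token interpretation and the gate initialization are assumptions.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical SmearGate: gated mixing of each token with its predecessor."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), -4.0))  # sigmoid(-4) ~ 0.018: start near identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, seq, dim)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)      # shift right; first token kept as-is
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```
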
RoPE: NTK-aware rotary position encoding with interpolation
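
NTK-aware interpolation rescales the rotary base instead of linearly compressing positions, which keeps the high-frequency (local) components intact while stretching the low frequencies. A sketch of the frequency table; the extension factor `scale` is illustrative, since none is stated above.

```python
import torch

def ntk_rope_freqs(head_dim: int, max_pos: int, base: float = 10000.0, scale: float = 2.0):
    """NTK-aware RoPE frequencies: stretch the base by scale ** (d / (d - 2))."""
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.outer(torch.arange(max_pos).float(), inv_freq)  # (max_pos, head_dim // 2)
    return angles.cos(), angles.sin()
```
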
XSA: cross-sequence attention applied to the final 4 transformer layers (layers: 4)
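
XSA isn't defined above; one plausible mechanism, sketched under that assumption, is to let the final four layers attend across all sequences in the batch by flattening them into a single attention span.

```python
import torch
import torch.nn.functional as F

def cross_sequence_attention(q, k, v):
    """Assumed XSA: attention over the tokens of every sequence in the batch.
    q, k, v: (batch, heads, seq, head_dim)."""
    b, h, t, d = q.shape
    # Flatten the batch into one long span so queries can attend across sequences.
    q, k, v = (x.transpose(0, 1).reshape(1, h, b * t, d) for x in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)
    return out.reshape(h, b, t, d).transpose(0, 1)
```
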
MLP3x: 3x MLP expansion with hidden size 1536 and relu^2 activation (hidden: 1536)
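
The relu^2 MLP is standard; with hidden size 1536, a 3x expansion implies a model width of 512 (an inference, not stated above):

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1536):  # dim inferred from hidden / 3
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)) ** 2)        # relu^2 activation
```
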
Weight tying: tied input and output embeddings
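
Weight tying shares a single parameter between the token embedding and the output projection, which also shrinks the quantized artifact. A minimal sketch (sizes illustrative):

```python
import torch.nn as nn

vocab_size, dim = 50304, 512                  # illustrative sizes, not from the record
wte = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = wte.weight                   # a single Parameter serves both roles
```
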
KV head count: grouped-query attention with 4 KV heads (kv_heads: 4)
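
Grouped-query attention keeps a full set of query heads while keys and values use only 4 heads, broadcast across query groups. A sketch of the projections; only kv_heads=4 comes from the record, the other sizes are illustrative:

```python
import torch
import torch.nn as nn

class GQAProjections(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, kv_heads: int = 4):
        super().__init__()
        self.head_dim = dim // n_heads
        self.n_heads, self.kv_heads = n_heads, kv_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.kv = nn.Linear(dim, 2 * kv_heads * self.head_dim, bias=False)

    def forward(self, x: torch.Tensor):                       # x: (batch, seq, dim)
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).view(b, t, 2, self.kv_heads, self.head_dim).unbind(dim=2)
        k, v = (z.transpose(1, 2).repeat_interleave(self.n_heads // self.kv_heads, dim=1)
                for z in (k, v))                              # broadcast 4 KV heads to all query heads
        return q, k, v
```
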
Optimizer
Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.025, scalar_lr 0.025, grad_clip 0.3
AdamW: weight_decay 0.04, used for embeddings/scalars
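
The two-optimizer split follows the usual Muon recipe: Muon updates the 2-D hidden weight matrices, AdamW handles embeddings and scalar/1-D parameters. A sketch of the grouping; the `Muon` class is assumed to come from the training codebase (it is not a torch built-in), and the name-based split is an approximation:

```python
import torch

def build_optimizers(model: torch.nn.Module):
    matrix, other = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix.append(p)     # hidden weight matrices -> Muon
        else:
            other.append(p)      # embeddings, gates, norms, scalars -> AdamW
    muon = Muon(matrix, lr=0.025, momentum=0.99, weight_decay=0.04)  # assumed signature
    adamw = torch.optim.AdamW(other, lr=0.025, weight_decay=0.04)    # scalar_lr from the record
    return muon, adamw
```

The grad_clip of 0.3 would be applied each step, e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.3)`.
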
Weight Averaging
EMA: decay 0.997, tracked from initialization (start: init)
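
EMA is tracked from step 0 (start: init) and the averaged weights are the ones evaluated. A minimal sketch:

```python
import copy
import torch

class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()      # tracked from initialization
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:  # call once per optimizer step
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)               # s = 0.997 * s + 0.003 * p
```

At evaluation time the shadow module's state dict is loaded in place of the raw training weights.
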
Compression
zstd: level 22
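
Level 22 is zstd's maximum standard level, trading compression time for the smallest artifact. A sketch using the `zstandard` package:

```python
import io
import torch
import zstandard  # pip install zstandard

def save_compressed(model: torch.nn.Module, path: str) -> None:
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)                 # quantized tensors serialize here
    data = zstandard.ZstdCompressor(level=22).compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(data)
```
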
Evaluation
sliding window eval: stride 64, full coverage, scoring the last 64 tokens of each window
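
Each window scores only its final 64 targets, the tokens with the longest available context, and a stride of 64 makes those spans tile the validation stream. A sketch that reports bits per token; the bpb figure additionally divides by the byte count of the scored text, omitted here:

```python
import math
import torch

@torch.no_grad()
def sliding_window_eval(model, tokens: torch.Tensor, ctx=2048, stride=64, score_last=64):
    """tokens: 1-D LongTensor validation stream. model is assumed to return
    logits of shape (batch, seq, vocab). Truly full coverage also needs a
    shorter-context pass over the first ctx tokens, omitted for brevity."""
    total_nll, total = 0.0, 0
    for end in range(ctx, len(tokens), stride):
        window = tokens[end - ctx:end + 1]               # ctx inputs plus one extra target
        logits = model(window[:-1].unsqueeze(0))         # (1, ctx, vocab)
        logp = torch.log_softmax(logits[0, -score_last:], dim=-1)
        targets = window[-score_last:].unsqueeze(1)
        total_nll += -logp.gather(1, targets).sum().item()
        total += score_last
    return total_nll / total / math.log(2)               # nats -> bits per token
```
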
Initialization
OrthoInit: orthogonal weight initialization
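
A sketch of applying it to the linear layers; the gain of 1.0 is an assumption:

```python
import torch.nn as nn

def ortho_init(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight)   # gain 1.0 assumed; not specified above
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(ortho_init)
```
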
Sequence Length
train_length 2048, eval_length 2048

LR Schedule
warmdown: warmup_steps 20, warmdown_iters 3000, auto_cap_fraction 0.55, momentum warmup from 0.92 to 0.99 over 1500 steps
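
One reading of auto_cap_fraction, consistent with the "auto-caps based on estimated total steps" bullet below, is that the warmdown length is clamped to 55% of the estimated run length so short runs aren't swallowed whole by the decay. A sketch of the full schedule:

```python
def lr_and_momentum(step: int, total_steps: int, base_lr: float = 0.025,
                    warmup_steps: int = 20, warmdown_iters: int = 3000,
                    auto_cap_fraction: float = 0.55,
                    mom_start: float = 0.92, mom_end: float = 0.99,
                    mom_warmup_steps: int = 1500):
    warmdown = min(warmdown_iters, int(auto_cap_fraction * total_steps))  # adaptive cap
    if step < warmup_steps:
        lr = base_lr * (step + 1) / warmup_steps          # linear warmup
    elif step >= total_steps - warmdown:
        lr = base_lr * (total_steps - step) / warmdown    # linear warmdown to zero
    else:
        lr = base_lr                                      # flat middle
    momentum = mom_start + (mom_end - mom_start) * min(step / mom_warmup_steps, 1.0)
    return lr, momentum
```
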
Regularization
weight decay: 0.04

Other
Gradient-guided adaptive quantization: per-tensor bitwidth assigned by gradient sensitivity (top 45% of tensors: int7, middle 40%: int6, bottom 15%: int5)
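
A sketch of the assignment step: accumulate a per-tensor sensitivity score during training (mean absolute gradient is an assumption, the exact metric isn't named above), rank the tensors, and bucket them by the stated fractions.

```python
import torch

def gradient_sensitivity(model: torch.nn.Module) -> dict[str, float]:
    """Call after backward passes; mean |grad| per tensor (assumed metric)."""
    return {name: p.grad.abs().mean().item()
            for name, p in model.named_parameters() if p.grad is not None}

def assign_bitwidths(sensitivity: dict[str, float]) -> dict[str, int]:
    """Top 45% of tensors -> int7, middle 40% -> int6, bottom 15% -> int5."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    bits = {}
    for rank, name in enumerate(ranked):
        frac = rank / len(ranked)
        bits[name] = 7 if frac < 0.45 else 6 if frac < 0.85 else 5
    return bits
```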

Novel Contributions

  • Gradient-guided adaptive quantization with per-tensor int5/int6/int7 assignment based on gradient sensitivity
  • EMA weights tracked from initialization and loaded at evaluation time
  • Adaptive warmdown that auto-caps based on estimated total steps for hardware robustness
  • Sliding window evaluation with stride 64 and full validation coverage
  • SmearGate residual mixing with a learnable gate
  • NTK-aware RoPE interpolation
  • XSA cross-sequence attention on the last 4 layers