PR #389

Status: open

Record: 11L Int5-All + XSA5 + EMA + 10% Pruning (val_bpb=1.1466)

by trasnake87
val_bpb: 1.1466
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.8 MB

Training Techniques

Quantization
int5
bits: 5
scope: all weights (MLP and attention)
STE QAT
bits: 5
scope: final ~5% of training
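A minimal sketch of the listed int5 scheme. The symmetric per-tensor scale is an assumption; the PR does not specify per-tensor vs per-channel scaling.

```python
import numpy as np

def int5_quantize(w, qmin=-16, qmax=15):
    """Symmetric per-tensor int5 quantization (assumed scheme).
    Returns integer codes and the scale needed to dequantize."""
    scale = max(np.abs(w).max(), 1e-12) / qmax
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

def int5_dequantize(q, scale):
    return q.astype(np.float32) * scale

# During the final ~5% of training, STE QAT would run the forward pass on
# the dequantized weights while letting gradients flow to the
# full-precision weights (e.g. w + (deq - w).detach() in PyTorch).
```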
Architecture
XSA
Exclusive Self Attention applied to the last 5 layers
parameters: {"layers":5}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimensions
parameters: {"dimensions":16,"total_head_dims":64}
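A sketch of partial RoPE with the listed parameters: only 16 of the 64 head dimensions are rotated, the rest pass through unrotated. The pairing layout (dim i with dim i+8) and frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions, leaving the remaining dims untouched.
    x: (seq_len, head_dim) for a single head (layout assumed)."""
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired dims
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)
```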
SmearGate
Additional gating mechanism used in the model
parameters: null
BigramHash
Bigram hashing module used as part of the architecture
parameters: {"hash_size":4096,"dim":128}
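One plausible reading of the BigramHash module, given only hash_size=4096 and dim=128: hash each (previous, current) token pair into a small embedding table. The mixing constants, the padding at position 0, and how the output feeds the model (e.g. added to the token embedding) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
HASH_SIZE, DIM = 4096, 128  # from the listed parameters
bigram_table = rng.standard_normal((HASH_SIZE, DIM)).astype(np.float32) * 0.02

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair into a row of a learned table
    (hypothetical construction; only the table shape comes from the PR)."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate([[0], tokens[:-1]])   # assumed pad for position 0
    idx = ((prev * 1000003) ^ tokens) % HASH_SIZE
    return bigram_table[idx]                    # (seq_len, DIM)
```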
MLP3x
Expanded MLP width to 3x
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
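The EMA update with the listed decay, shown for a plain dict of arrays; how it hooks into the training loop and when the EMA weights replace the live weights are framework details not given in the PR.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """Exponential moving average of weights: ema = decay*ema + (1-decay)*w.
    decay=0.997 is the value listed in the PR."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
```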
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
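A sketch of sliding-window evaluation planning with the listed stride. The usual convention, assumed here, is that the first window scores all its tokens and each later window advances by `stride` but scores only its final `stride` tokens, so scored tokens always see close to a full window of context.

```python
def sliding_window_positions(n_tokens, window=2048, stride=64):
    """Plan overlapping eval windows as (start, end, tokens_scored) triples.
    Scoring convention is an assumption; the PR lists only stride=64."""
    first_end = min(window, n_tokens)
    spans = [(0, first_end, first_end)]
    pos = first_end
    while pos < n_tokens:
        new_pos = min(pos + stride, n_tokens)
        spans.append((max(0, new_pos - window), new_pos, new_pos - pos))
        pos = new_pos
    return spans
```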
Initialization
OrthoInit
Orthogonal initialization with muP output scaling
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
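The listed per-layer scale in code form; whether it multiplies the LN gain or the residual-branch output is not stated in the PR and is left open here.

```python
import math

def layer_scale(layer_idx: int) -> float:
    """1/sqrt(layer_idx+1), the layerwise LN scale from the PR
    (application point assumed)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```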
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":3000}
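A sketch of the warmdown schedule with the listed parameters. The trapezoidal shape (linear warmup, flat middle, linear decay to zero over the final steps) is an assumption consistent with common "warmdown" schedules.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_steps=3000):
    """Trapezoidal LR multiplier: linear warmup over `warmup_steps`, flat
    middle, linear decay to zero over the final `warmdown_steps`
    (shape assumed; step counts from the PR)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```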
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
10% magnitude pruning after EMA averaging and before quantization
parameters: {"pruning_fraction":0.1}
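A sketch of the 10% magnitude-pruning step. A per-tensor threshold is assumed; the PR does not say whether the fraction is per tensor or global across the model, only that pruning runs after EMA averaging and before quantization.

```python
import numpy as np

def magnitude_prune(w, fraction=0.1):
    """Zero out the smallest-magnitude `fraction` of entries in w.
    Ties at the threshold may prune slightly more than `fraction`."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

The zeros survive quantization exactly (code 0), which is what lets the zstd-compressed artifact shrink.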

Novel Contributions

  • Uniform int5 quantization for both MLP and attention weights
  • 10% magnitude pruning after EMA averaging and before quantization
  • Reduced artifact size from about 15.6 MB to 14.8 MB with minimal quality impact
  • Late int5 STE fake-quantization during the final portion of training