PR #389

Status: open

Record: 11L Int5-All + XSA5 + EMA + 10% Pruning (val_bpb=1.1466)

by trasnake87
val_bpb: 1.1466
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.8 MB

Training Techniques

Quantization
int5
bits: 5
scope: all weights (MLP and attention)
STE QAT
bits: 5
scope: final ~5% of training
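A minimal sketch of the listed int5 scheme. The symmetric per-tensor scale is an assumption; the PR does not specify per-tensor vs per-channel scaling.

```python
import numpy as np

def int5_quantize(w, qmin=-16, qmax=15):
    """Symmetric per-tensor int5 quantization (assumed scheme).
    Returns integer codes and the scale needed to dequantize."""
    scale = max(np.abs(w).max(), 1e-12) / qmax
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

def int5_dequantize(q, scale):
    return q.astype(np.float32) * scale

# During the final ~5% of training, STE QAT would run the forward pass on
# the dequantized weights while letting gradients flow to the
# full-precision weights (e.g. w + (deq - w).detach() in PyTorch).
```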
Architecture
XSA
Exclusive Self Attention applied to the last 5 layers
parameters: {"layers":5}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimensions
parameters: {"dimensions":16,"total_head_dims":64}
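A sketch of partial RoPE with the listed parameters: only 16 of the 64 head dimensions are rotated, the rest pass through unrotated. The pairing layout (dim i with dim i+8) and frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head's dimensions, leaving the remaining dims untouched.
    x: (seq_len, head_dim) for a single head (layout assumed)."""
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired dims
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)
```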
SmearGate
Additional gating mechanism used in the model
parameters: null
BigramHash
Bigram hashing module used as part of the architecture
parameters: {"hash_size":4096,"dim":128}
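One plausible reading of the BigramHash module, given only hash_size=4096 and dim=128: hash each (previous, current) token pair into a small embedding table. The mixing constants, the padding at position 0, and how the output feeds the model (e.g. added to the token embedding) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
HASH_SIZE, DIM = 4096, 128  # from the listed parameters
bigram_table = rng.standard_normal((HASH_SIZE, DIM)).astype(np.float32) * 0.02

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair into a row of a learned table
    (hypothetical construction; only the table shape comes from the PR)."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate([[0], tokens[:-1]])   # assumed pad for position 0
    idx = ((prev * 1000003) ^ tokens) % HASH_SIZE
    return bigram_table[idx]                    # (seq_len, DIM)
```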
MLP3x
Expanded MLP width to 3x
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
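The EMA update with the listed decay, shown for a plain dict of arrays; how it hooks into the training loop and when the EMA weights replace the live weights are framework details not given in the PR.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.997):
    """Exponential moving average of weights: ema = decay*ema + (1-decay)*w.
    decay=0.997 is the value listed in the PR."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
```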
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
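A sketch of sliding-window evaluation planning with the listed stride. The usual convention, assumed here, is that the first window scores all its tokens and each later window advances by `stride` but scores only its final `stride` tokens, so scored tokens always see close to a full window of context.

```python
def sliding_window_positions(n_tokens, window=2048, stride=64):
    """Plan overlapping eval windows as (start, end, tokens_scored) triples.
    Scoring convention is an assumption; the PR lists only stride=64."""
    first_end = min(window, n_tokens)
    spans = [(0, first_end, first_end)]
    pos = first_end
    while pos < n_tokens:
        new_pos = min(pos + stride, n_tokens)
        spans.append((max(0, new_pos - window), new_pos, new_pos - pos))
        pos = new_pos
    return spans
```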
Initialization
OrthoInit
Orthogonal initialization with muP output scaling
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
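The listed per-layer scale in code form; whether it multiplies the LN gain or the residual-branch output is not stated in the PR and is left open here.

```python
import math

def layer_scale(layer_idx: int) -> float:
    """1/sqrt(layer_idx+1), the layerwise LN scale from the PR
    (application point assumed)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```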
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":3000}
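A sketch of the warmdown schedule with the listed parameters. The trapezoidal shape (linear warmup, flat middle, linear decay to zero over the final steps) is an assumption consistent with common "warmdown" schedules.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_steps=3000):
    """Trapezoidal LR multiplier: linear warmup over `warmup_steps`, flat
    middle, linear decay to zero over the final `warmdown_steps`
    (shape assumed; step counts from the PR)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```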
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
10% magnitude pruning after EMA averaging and before quantization
parameters: {"pruning_fraction":0.1}
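A sketch of the 10% magnitude-pruning step. A per-tensor threshold is assumed; the PR does not say whether the fraction is per tensor or global across the model, only that pruning runs after EMA averaging and before quantization.

```python
import numpy as np

def magnitude_prune(w, fraction=0.1):
    """Zero out the smallest-magnitude `fraction` of entries in w.
    Ties at the threshold may prune slightly more than `fraction`."""
    k = int(w.size * fraction)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

The zeros survive quantization exactly (code 0), which is what lets the zstd-compressed artifact shrink.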

Novel Contributions

  • Uniform int5 quantization for both MLP and attention weights
  • 10% magnitude pruning after EMA averaging and before quantization
  • Reduced artifact size from about 15.6 MB to 14.8 MB with minimal quality impact
  • Late int5 STE fake-quantization during the final portion of training