PR #623

open

[10min/16MB] AWQ + Cyclic Momentum + ReLU² + 11L Shared — 1.1507 bpb

by SPTholeView on GitHub
val_bpb
1.1507
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4 MB

Training Techniques

Quantization
mixed int5/int6/int8
bits: null
scope: MLP weights (int5), Attention weights (int6), Bigram embeddings (int6), Token embeddings (int8)
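A minimal sketch of the activation-aware quantization described in Novel Contributions below: weight columns are scaled by per-channel activation importance before symmetric per-row int quantization, and both scalings are undone at dequantization time. The function name, the scaling exponent `alpha`, and the calibration source of `act_scale` are assumptions, not the PR's exact code.

```python
import torch

def awq_quantize(weight: torch.Tensor, act_scale: torch.Tensor, bits: int = 5, alpha: float = 0.5):
    """Activation-aware symmetric quantization sketch (details assumed).

    weight:    (out_features, in_features) linear weight
    act_scale: (in_features,) mean |activation| per input channel from calibration data
    alpha:     strength of the activation-aware column scaling (assumption)
    """
    # Scale up columns that see large activations so they lose less precision,
    # then fold the inverse scale back in after dequantization.
    s = act_scale.clamp(min=1e-8) ** alpha                    # (in_features,)
    w_scaled = weight * s                                     # emphasize high-activation columns

    qmax = 2 ** (bits - 1) - 1                                # e.g. 15 for int5
    scale = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w_scaled / scale), -qmax - 1, qmax).to(torch.int8)

    def dequantize():
        return (q.float() * scale) / s                        # undo both scalings
    return q, scale, s, dequantize
```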
Architecture
11L Shared
10 unique weight sets; the last block's weights are reused for the 11th layer, adding depth without extra parameter cost
parameters: {"num_layers":11,"unique_layers":10,"shared_last_layer":true}
ReLU²
Squared ReLU activation in the MLP for sparser hidden activations
parameters: null
skip_connections
U-Net style skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"momentum_schedule":"cyclic 0.85–0.95","learning_rate":0.025,"momentum_warmup":"0.92 to cyclic over 1500 steps"}
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate_embeds":0.035,"learning_rate_scalars":0.025,"scope":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
Compression
zstd
level: null
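Compressing the serialized (quantized) checkpoint with zstd could look like the sketch below, using the `zstandard` package; the compression level is not given in the PR, so the value here is an assumption.

```python
import io
import torch
import zstandard as zstd

def save_compressed(state_dict, path: str, level: int = 19):  # level is an assumption
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    compressed = zstd.ZstdCompressor(level=level).compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(compressed)
```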
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":64}
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • Activation-aware weight quantization (AWQ) that scales weight columns by activation importance before int5/int6 quantization, reducing quantization error on high-activation channels
  • Cyclic momentum for the Muon optimizer, following a triangle-wave schedule (0.85–0.95) to help escape sharp minima
  • Squared ReLU (ReLU²) activation for sparser MLP activations, which benefits small models
  • 11-layer architecture with 10 unique layers and last block weight sharing to save depth cost
  • U-Net style skip connections with 5 encoder and 6 decoder layers