PR #623

open

[10min/16MB] AWQ + Cyclic Momentum + ReLU² + 11L Shared — 1.1507 bpb

by SPTholeView on GitHub
val_bpb
1.1507
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4 MB

Training Techniques

Quantization
mixed int5/int6/int8
bits: null
scope: MLP weights (int5), Attention weights (int6), Bigram embeddings (int6), Token embeddings (int8)
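A minimal sketch of the activation-aware quantization described in Novel Contributions below: weight columns are scaled by per-channel activation importance before symmetric per-row int quantization, and both scalings are undone at dequantization time. The function name, the scaling exponent `alpha`, and the calibration source of `act_scale` are assumptions, not the PR's exact code.

```python
import torch

def awq_quantize(weight: torch.Tensor, act_scale: torch.Tensor, bits: int = 5, alpha: float = 0.5):
    """Activation-aware symmetric quantization sketch (details assumed).

    weight:    (out_features, in_features) linear weight
    act_scale: (in_features,) mean |activation| per input channel from calibration data
    alpha:     strength of the activation-aware column scaling (assumption)
    """
    # Scale up columns that see large activations so they lose less precision,
    # then fold the inverse scale back in after dequantization.
    s = act_scale.clamp(min=1e-8) ** alpha                    # (in_features,)
    w_scaled = weight * s                                     # emphasize high-activation columns

    qmax = 2 ** (bits - 1) - 1                                # e.g. 15 for int5
    scale = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w_scaled / scale), -qmax - 1, qmax).to(torch.int8)

    def dequantize():
        return (q.float() * scale) / s                        # undo both scalings
    return q, scale, s, dequantize
```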
Architecture
11L Shared
10 unique weight sets; the last block's weights are reused for the 11th layer, adding depth without extra parameter cost
parameters: {"num_layers":11,"unique_layers":10,"shared_last_layer":true}
ReLU²
Squared ReLU activation in the MLP for sparser hidden activations
parameters: null
skip_connections
U-Net style skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"momentum_schedule":"cyclic 0.85–0.95","learning_rate":0.025,"momentum_warmup":"0.92 to cyclic over 1500 steps"}
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate_embeds":0.035,"learning_rate_scalars":0.025,"scope":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.2,"every":50}
Compression
zstd
level: null
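Compressing the serialized (quantized) checkpoint with zstd could look like the sketch below, using the `zstandard` package; the compression level is not given in the PR, so the value here is an assumption.

```python
import io
import torch
import zstandard as zstd

def save_compressed(state_dict, path: str, level: int = 19):  # level is an assumption
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    compressed = zstd.ZstdCompressor(level=level).compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(compressed)
```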
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":64}
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • Activation-aware weight quantization (AWQ) that scales weight columns by activation importance before int5/int6 quantization, reducing quantization error on high-activation channels
  • Cyclic momentum for the Muon optimizer, following a triangle-wave schedule (0.85–0.95) to help escape sharp minima
  • Squared ReLU (ReLU²) activation for sparser MLP activations, which benefits small models
  • 11-layer architecture with 10 unique layers and last block weight sharing to save depth cost
  • U-Net style skip connections with 5 encoder and 6 decoder layers