PR #370

open

Add submission: Mixed Quantization + BigramHash + SWA (val_bpb 1.2421)

by SergheiBrinza
val_bpb: 1.2421
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.28 MB

Training Techniques

Architecture
BigramHash
Added hash-based bigram embeddings to give the model cheap access to previous-token information.
parameters: {"table_size":10240,"embedding_dim":128}
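As a rough PyTorch sketch (the module name and hash constant below are assumptions; only the table size and embedding dimension come from the parameters above), such an embedding hashes each (previous token, current token) pair into a fixed-size table:

import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (previous token, current token) pair into a fixed-size embedding table."""
    def __init__(self, table_size=10240, embedding_dim=128):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Embedding(table_size, embedding_dim)

    def forward(self, idx):                       # idx: (batch, seq) token ids
        prev = torch.roll(idx, shifts=1, dims=1)  # previous token at each position
        prev[:, 0] = 0                            # first position has no predecessor
        h = (prev * 1000003 + idx) % self.table_size   # cheap multiplicative hash
        return self.table(h)                      # (batch, seq, embedding_dim)

The looked-up vectors would then be added to, or projected into, the residual stream alongside the ordinary token embeddings.
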
MLP3x
ReLU² MLP with 3x expansion for faster feedforward computation.
parameters: {"hidden":1536}
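A minimal sketch of the ReLU² feed-forward block; hidden=1536 is listed above, while d_model=512 is only an assumption for illustration:

import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Feed-forward block with 3x expansion and a squared-ReLU activation."""
    def __init__(self, d_model=512, hidden=1536):
        super().__init__()
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)).square())
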
U-Net skip connections
Added U-Net-style skip connections between early and late layers of the 10-layer Transformer.
parameters: {"layers":10}
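The exact wiring isn't spelled out; a common U-Net-style pattern for a 10-layer stack (a sketch, not necessarily this submission's layout) mirrors the first half of the layers onto the second half:

import torch.nn as nn

class UNetStack(nn.Module):
    """Transformer stack where each second-half layer adds the activation
    saved after its mirrored first-half layer."""
    def __init__(self, blocks):                  # blocks: e.g. 10 Transformer blocks
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        n, saved = len(self.blocks), []
        for i, block in enumerate(self.blocks):
            if i >= n // 2:
                x = x + saved[n - 1 - i]         # skip connection from layer n-1-i
            x = block(x)
            if i < n // 2:
                saved.append(x)
        return x
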
Quantization
mixed int6/int8 with STE
bits: 6 (weights) / 8 (embeddings)
scope: all weight matrices and embeddings
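A minimal sketch of symmetric fake-quantization with a straight-through estimator; per-tensor scaling and the function name are assumptions, while the 6-bit/8-bit split follows the description above:

import torch

def fake_quantize(w, bits):
    """Symmetric per-tensor quantization; the STE makes the rounding step
    act as the identity in the backward pass."""
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()                # straight-through estimator

# weight matrices at 6 bits, embeddings at 8 bits (hypothetical tensor names)
# attn_proj_q = fake_quantize(attn_proj, bits=6)
# tok_emb_q   = fake_quantize(tok_emb, bits=8)
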
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3}
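Muon's constructor is not shown in the submission; the two listed settings would typically be applied as in this hedged training-step sketch, where Muon stands in for whatever implementation is actually used:

import torch

# optimizer = Muon(model.parameters(), weight_decay=0.04)   # hypothetical constructor
def training_step(model, optimizer, batch):
    loss = model(batch)                          # assumes the model returns a loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)  # gradient_clipping: 0.3
    optimizer.step()
    return loss.item()
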
Weight Averaging
SWA
parameters: {"start_fraction":0.5,"interval_steps":50}
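A sketch of the schedule implied by the SWA parameters (averaging starts halfway through training and updates every 50 steps), using PyTorch's AveragedModel; total_steps and train_step are assumed names:

from torch.optim.swa_utils import AveragedModel

# model, optimizer, total_steps and train_step(...) are assumed to exist
swa_model = AveragedModel(model)
swa_start = int(0.5 * total_steps)               # start_fraction: 0.5

for step in range(total_steps):
    train_step(model, optimizer, step)
    if step >= swa_start and step % 50 == 0:     # interval_steps: 50
        swa_model.update_parameters(model)
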
Compression
zstd
level: 22
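The 13.28 MB artifact presumably comes from compressing the serialized weights; a minimal example with the zstandard Python bindings at level 22 (file name and tooling are assumptions):

import io
import torch
import zstandard as zstd

buf = io.BytesIO()
torch.save(model.state_dict(), buf)              # model assumed to exist
compressed = zstd.ZstdCompressor(level=22).compress(buf.getvalue())
with open("artifact.pt.zst", "wb") as f:
    f.write(compressed)
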
Initialization
OrthoInit
Orthogonal initialization for all weight matrices, with SVD-based initialization for embeddings.
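A sketch of the initialization described above (the exact SVD recipe for the embedding table is an assumption):

import torch
import torch.nn as nn

def ortho_init(model):
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight)   # orthogonal weight matrices
        elif isinstance(module, nn.Embedding):
            w = torch.randn_like(module.weight)
            u, _, vh = torch.linalg.svd(w, full_matrices=False)
            module.weight.data.copy_(u @ vh)     # SVD-based orthonormal embedding init
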
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • 10-layer Transformer with U-Net skip connections
  • ReLU² MLP with 3x expansion
  • BigramHash embeddings using a 10240-entry hash table
  • Mixed INT6 quantization for weights and INT8 for embeddings
  • Straight-Through Estimator training for quantization robustness
  • Stochastic Weight Averaging over the last half of training
  • Orthogonal and SVD-based initialization
  • zstd level 22 compression