PR #370
Add submission: Mixed Quantization + BigramHash + SWA (val_bpb 1.2421)
by SergheiBrinza
val_bpb: 1.2421
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.28 MB
Training Techniques
Architecture
BigramHash
Added hash-based bigram embeddings to give the model cheap access to previous-token information.
parameters: {"table_size":10240,"embedding_dim":128}
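A minimal sketch of how hash-based bigram embeddings can work, using the submission's `table_size` and `embedding_dim`; the specific hash function (multiplicative mixing) is an assumption, not taken from the PR:

```python
import numpy as np

TABLE_SIZE = 10240   # from the submission's parameters
EMBED_DIM = 128      # from the submission's parameters

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (TABLE_SIZE, EMBED_DIM))

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical mixing hash: map the (prev, cur) token pair into the table.
    return (prev_tok * 1000003 + cur_tok) % TABLE_SIZE

def bigram_embeddings(tokens: np.ndarray) -> np.ndarray:
    """Look up one hashed-bigram embedding per position; position 0 pairs
    with a sentinel previous token (0)."""
    prev = np.concatenate(([0], tokens[:-1]))
    idx = [bigram_hash(int(p), int(t)) for p, t in zip(prev, tokens)]
    return bigram_table[idx]

emb = bigram_embeddings(np.array([5, 17, 5, 17]))
```

The resulting vectors are typically added to the regular token embeddings, giving the model cheap previous-token context without a full `vocab x vocab` bigram table.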
MLP3x
ReLU² MLP with 3x expansion for faster feedforward computation.
parameters: {"hidden":1536}
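A sketch of the squared-ReLU feedforward block; `hidden=1536` is from the submission, and the model width of 512 (so that hidden = 3x) is an inference, not stated in the PR:

```python
import numpy as np

D_MODEL, HIDDEN = 512, 1536   # hidden = 3 * d_model; d_model is an assumption

rng = np.random.default_rng(0)
w_in = rng.normal(0.0, 0.02, (D_MODEL, HIDDEN))
w_out = rng.normal(0.0, 0.02, (HIDDEN, D_MODEL))

def relu2_mlp(x: np.ndarray) -> np.ndarray:
    """Feedforward block with squared-ReLU activation: max(0, x @ W1)^2 @ W2."""
    h = np.maximum(x @ w_in, 0.0) ** 2
    return h @ w_out

y = relu2_mlp(np.zeros((2, D_MODEL)))
```

Compared with the conventional 4x expansion, the 3x hidden width cuts feedforward parameters and FLOPs by a quarter.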
U-Net skip connections
Added skip connections across layers in the Transformer.
parameters: {"layers":10}
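One common way to wire U-Net skips in a layer stack, assuming the usual pairing of early layers with late layers (the PR does not specify the pairing):

```python
def forward_with_unet_skips(x, layers):
    """Run the first half of the stack while pushing activations onto a stack;
    each second-half layer pops one and adds it back before its own transform."""
    saved = []
    half = len(layers) // 2
    for i, layer in enumerate(layers):
        if i < half:
            saved.append(x)       # encoder half: remember the activation
        else:
            x = x + saved.pop()   # decoder half: re-inject the paired activation
        x = layer(x)
    return x

out = forward_with_unet_skips(0.0, [lambda x: x + 1.0] * 4)
```

With the submission's 10 layers, layer 0 pairs with layer 9, layer 1 with layer 8, and so on, giving late layers a direct path to early representations.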
Quantization
mixed int6/int8 with straight-through estimator (STE)
bits: 6 (weight matrices), 8 (embeddings)
scope: all weight matrices and embeddings
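A sketch of symmetric per-tensor fake quantization of the kind an STE setup trains against; the scale scheme (max-abs, per tensor) is an assumption:

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize to a symmetric integer grid and dequantize back.
    In an autograd framework the STE would pass gradients through round()
    as if it were the identity."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = np.max(np.abs(w)) / qmax
    if scale == 0.0:
        return w
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.array([0.5, -0.31, 0.02])
wq6 = fake_quant(w, 6)   # weight matrices: int6
wq8 = fake_quant(w, 8)   # embeddings: int8
```

Training with fake-quantized forward passes (and STE backward passes) is what makes the weights robust to the low-bit export that keeps the artifact at 13.28 MB.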
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"gradient_clipping":0.3}
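For context, a hedged sketch of the Newton-Schulz orthogonalization at the core of Muon, plus one update step using the submission's weight decay and gradient clipping; momentum is omitted to match the `null` above, and the details are a sketch of the technique, not this PR's training code:

```python
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz
    iteration used by Muon (coefficients from the reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # bound singular values by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x

def muon_step(w, g, lr=0.02, weight_decay=0.04, clip=0.3):
    """One Muon-style update: clip the gradient norm, orthogonalize,
    then apply decoupled weight decay. lr is a hypothetical value."""
    gnorm = np.linalg.norm(g)
    if gnorm > clip:
        g = g * (clip / gnorm)
    return w - lr * (newton_schulz_orth(g) + weight_decay * w)

rng = np.random.default_rng(0)
orth = newton_schulz_orth(rng.normal(size=(16, 16)))
w_next = muon_step(np.eye(4), np.zeros((4, 4)))
```

The orthogonalization equalizes the singular values of the gradient, which is why Muon pairs well with matrix-shaped parameters.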
Weight Averaging
SWA
parameters: {"start_fraction":0.5,"interval_steps":50}
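The averaging schedule above can be sketched as a running equal-weight mean of snapshots; the total step count is hypothetical, while `start_fraction` and `interval_steps` are the submission's values:

```python
import numpy as np

TOTAL_STEPS = 1000                     # hypothetical training length
START_STEP = int(0.5 * TOTAL_STEPS)    # start_fraction: 0.5
INTERVAL = 50                          # interval_steps: 50

class SWA:
    """Incremental equal-weight average of weight snapshots taken every
    INTERVAL steps once START_STEP is reached."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_update(self, step: int, weights: np.ndarray) -> None:
        if step < START_STEP or step % INTERVAL != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            self.avg += (weights - self.avg) / self.n

swa = SWA()
for step in range(TOTAL_STEPS + 1):
    swa.maybe_update(step, np.array([float(step)]))  # stand-in for real weights
```

At the end of training the averaged weights (`swa.avg`) are what get quantized and shipped.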
Compression
zstd
level: 22
Initialization
OrthoInit
Orthogonal initialization for all weight matrices, with SVD-based initialization for embeddings.
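A sketch of both initializers. The QR-based orthogonal init is the standard construction; the SVD-based embedding init shown here (projecting a gaussian matrix onto the nearest matrix with unit singular values) is one plausible reading of the description, not confirmed by the PR:

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(shape):
    """Orthogonal init via QR of a gaussian matrix; the sign correction
    makes the result uniformly distributed over orthogonal matrices."""
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

def svd_embedding_init(vocab: int, dim: int):
    """Assumed SVD-based embedding init: replace a gaussian matrix's
    singular values with 1, keeping its singular directions."""
    a = rng.normal(size=(vocab, dim))
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u @ vt

w = orthogonal_init((64, 64))
emb = svd_embedding_init(100, 32)
```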
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- 10-layer Transformer with U-Net skip connections
- ReLU² MLP with 3x expansion
- BigramHash embeddings using a 10240-entry hash table
- Mixed INT6 quantization for weights and INT8 for embeddings
- Straight-Through Estimator training for quantization robustness
- Stochastic Weight Averaging over the last half of training
- Orthogonal and SVD-based initialization
- zstd level 22 compression