PR #1071

open

Non-record: Reproduction of SOTA #1 (SmearGate+BigramHash+Int6+SWA) on RunPod 8xH100

by AbhayAnandUCSD
val_bpb: 1.1455
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Architecture
SmearGate
Learned bigram blending at the embedding layer.
parameters: null
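The SmearGate entry above describes blending each token's embedding with its predecessor's via a learned gate. A minimal sketch of one plausible form, assuming a sigmoid gate computed from the current token's embedding (the gate parameterization and the name `smear_gate` are illustrative, not the submission's code):

```python
import numpy as np

def smear_gate(tok_emb, gate_w, gate_b):
    """Blend each token's embedding with the previous token's embedding
    using a learned gate (hypothetical parameterization).
    tok_emb: (seq_len, d_model); gate_w: (d_model, 1); gate_b: (1,)."""
    prev = np.roll(tok_emb, 1, axis=0)
    prev[0] = 0.0                                            # no predecessor at position 0
    g = 1.0 / (1.0 + np.exp(-(tok_emb @ gate_w + gate_b)))   # per-position gate in (0, 1)
    return (1.0 - g) * tok_emb + g * prev                    # "smear" previous token in
```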
BigramHash
Bigram hash embedding with 4096 buckets projected to model dimension.
parameters: {"buckets":4096,"dimension":128,"projected_dim":512}
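The listed parameters (4096 buckets, 128-d embedding, 512-d projection) can be sketched as follows; the hash mix constant and initialization are illustrative assumptions, not the submission's choices:

```python
import numpy as np

BUCKETS, EMB_DIM, MODEL_DIM = 4096, 128, 512      # from the listed parameters

rng = np.random.default_rng(0)
bucket_emb = rng.standard_normal((BUCKETS, EMB_DIM)) * 0.02   # hash-bucket table
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)) * 0.02       # 128 -> 512 projection

def bigram_hash_features(token_ids):
    """Hash each (prev, cur) token pair into one of 4096 buckets, look up a
    128-d embedding, and project to the 512-d model dimension."""
    ids = np.asarray(token_ids, dtype=np.int64)
    prev = np.concatenate(([0], ids[:-1]))        # pad the first position
    h = (prev * 1000003 + ids) % BUCKETS          # simple multiplicative pair hash
    return bucket_emb[h] @ proj                   # (seq_len, 512)
```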
MLP3x
3x MLP expansion in the feedforward blocks.
parameters: {"hidden_dim":1536}
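With a 512-d model, hidden_dim 1536 is the stated 3x expansion. A minimal sketch of such a feedforward block (the GELU activation is an assumption; the submission may use another nonlinearity):

```python
import numpy as np

D, H = 512, 1536   # 3x expansion, per the listed hidden_dim

def mlp3x(x, w1, w2):
    """Feedforward block with 3x expansion: 512 -> 1536 -> 512."""
    h = x @ w1                                                            # (..., 1536)
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))  # tanh-approx GELU
    return h @ w2                                                         # (..., 512)
```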
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
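With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, roughly halving the KV cache. A minimal numpy sketch of that sharing (shapes and the broadcast-by-repeat trick are illustrative):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv=4):
    """Grouped query attention: n_heads query heads share n_kv KV heads
    (here each KV head serves n_heads // n_kv = 2 query heads).
    q: (n_heads, seq, head_dim); k, v: (n_kv, seq, head_dim)."""
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=0)              # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))   # row-wise softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v                                 # (n_heads, seq, head_dim)
```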
Quantization
int6
bits: 6
scope: per-row weights
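Per-row int6 means each weight row gets its own scale, with values mapped into the 6-bit signed range. A sketch of a symmetric variant clipping to [-31, 31] (the exact rounding and range handling are assumptions about the submission's scheme):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: each row gets its own scale so
    values land in [-31, 31] (6-bit signed, symmetric)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```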
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02,"warmup_momentum":0.92,"warmup_steps":1500}
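Muon's defining step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying the update scaled by the lr above. A sketch of the quintic iteration from the public Muon reference implementation (coefficients and step count as commonly published; treat details as assumptions about this run):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward ~1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize so the iteration converges
    if G.shape[0] > G.shape[1]:
        X = X.T                           # iterate in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.shape[0] > G.shape[1] else X
```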
Weight Averaging
SWA
parameters: {"checkpoints":30}
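SWA over 30 checkpoints amounts to a uniform mean of the saved parameter tensors. A minimal sketch, assuming checkpoints are dicts of numpy arrays (the helper name and storage format are hypothetical):

```python
import numpy as np

def swa_average(checkpoints):
    """Stochastic weight averaging: uniform mean of the parameter dicts
    from the last N checkpoints (the submission averages 30)."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: v / len(checkpoints) for k, v in avg.items()}
```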
Evaluation
sliding window eval
parameters: {"stride":64}
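Sliding-window eval with stride 64 typically advances the context window 64 tokens at a time and scores only the newly uncovered tokens, so each scored token keeps near-full left context. A sketch of that span logic, assuming a 256-token window (the window size is not given in the submission):

```python
def sliding_window_spans(seq_len, window, stride=64):
    """Yield (start, end, score_from) spans: windows advance by `stride`,
    and only tokens at positions >= score_from are scored, so every token
    is scored exactly once with up to `window` tokens of context."""
    spans = []
    pos = 0                                    # first not-yet-scored token
    while pos < seq_len:
        start = max(0, pos + stride - window)  # left edge of the context window
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))
        pos = end
    return spans
```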

Novel Contributions

  • Reproduction of the March 20 SOTA #1 submission on RunPod 8xH100 SXM
  • Confirmed reproducibility: val_bpb 1.1455 against the published 1.1458 (within 0.0003)
  • Achieved training within the 600s wallclock limit while keeping the artifact under 16MB