PR #1071

open

Non-record: Reproduction of SOTA #1 (SmearGate+BigramHash+Int6+SWA) on RunPod 8xH100

by AbhayAnandUCSD
val_bpb: 1.1455
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Architecture
SmearGate
Learned bigram blending at the embedding layer.
parameters: null
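The SmearGate entry above describes blending each token's embedding with its predecessor's via a learned gate. A minimal sketch of one plausible form, assuming a sigmoid gate computed from the current token's embedding (the gate parameterization and the name `smear_gate` are illustrative, not the submission's code):

```python
import numpy as np

def smear_gate(tok_emb, gate_w, gate_b):
    """Blend each token's embedding with the previous token's embedding
    using a learned gate (hypothetical parameterization).
    tok_emb: (seq_len, d_model); gate_w: (d_model, 1); gate_b: (1,)."""
    prev = np.roll(tok_emb, 1, axis=0)
    prev[0] = 0.0                                            # no predecessor at position 0
    g = 1.0 / (1.0 + np.exp(-(tok_emb @ gate_w + gate_b)))   # per-position gate in (0, 1)
    return (1.0 - g) * tok_emb + g * prev                    # "smear" previous token in
```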
BigramHash
Bigram hash embedding with 4096 buckets projected to model dimension.
parameters: {"buckets":4096,"dimension":128,"projected_dim":512}
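The listed parameters (4096 buckets, 128-d embedding, 512-d projection) can be sketched as follows; the hash mix constant and initialization are illustrative assumptions, not the submission's choices:

```python
import numpy as np

BUCKETS, EMB_DIM, MODEL_DIM = 4096, 128, 512      # from the listed parameters

rng = np.random.default_rng(0)
bucket_emb = rng.standard_normal((BUCKETS, EMB_DIM)) * 0.02   # hash-bucket table
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)) * 0.02       # 128 -> 512 projection

def bigram_hash_features(token_ids):
    """Hash each (prev, cur) token pair into one of 4096 buckets, look up a
    128-d embedding, and project to the 512-d model dimension."""
    ids = np.asarray(token_ids, dtype=np.int64)
    prev = np.concatenate(([0], ids[:-1]))        # pad the first position
    h = (prev * 1000003 + ids) % BUCKETS          # simple multiplicative pair hash
    return bucket_emb[h] @ proj                   # (seq_len, 512)
```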
MLP3x
3x MLP expansion in the feedforward blocks.
parameters: {"hidden_dim":1536}
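With a 512-d model, hidden_dim 1536 is the stated 3x expansion. A minimal sketch of such a feedforward block (the GELU activation is an assumption; the submission may use another nonlinearity):

```python
import numpy as np

D, H = 512, 1536   # 3x expansion, per the listed hidden_dim

def mlp3x(x, w1, w2):
    """Feedforward block with 3x expansion: 512 -> 1536 -> 512."""
    h = x @ w1                                                            # (..., 1536)
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))  # tanh-approx GELU
    return h @ w2                                                         # (..., 512)
```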
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
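With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, roughly halving the KV cache. A minimal numpy sketch of that sharing (shapes and the broadcast-by-repeat trick are illustrative):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv=4):
    """Grouped query attention: n_heads query heads share n_kv KV heads
    (here each KV head serves n_heads // n_kv = 2 query heads).
    q: (n_heads, seq, head_dim); k, v: (n_kv, seq, head_dim)."""
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=0)              # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))   # row-wise softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v                                 # (n_heads, seq, head_dim)
```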
Quantization
int6
bits: 6
scope: per-row weights
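Per-row int6 means each weight row gets its own scale, with values mapped into the 6-bit signed range. A sketch of a symmetric variant clipping to [-31, 31] (the exact rounding and range handling are assumptions about the submission's scheme):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: each row gets its own scale so
    values land in [-31, 31] (6-bit signed, symmetric)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```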
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.02,"warmup_momentum":0.92,"warmup_steps":1500}
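Muon's defining step orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying the update scaled by the lr above. A sketch of the quintic iteration from the public Muon reference implementation (coefficients and step count as commonly published; treat details as assumptions about this run):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push singular values toward ~1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize so the iteration converges
    if G.shape[0] > G.shape[1]:
        X = X.T                           # iterate in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.shape[0] > G.shape[1] else X
```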
Weight Averaging
SWA
parameters: {"checkpoints":30}
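SWA over 30 checkpoints amounts to a uniform mean of the saved parameter tensors. A minimal sketch, assuming checkpoints are dicts of numpy arrays (the helper name and storage format are hypothetical):

```python
import numpy as np

def swa_average(checkpoints):
    """Stochastic weight averaging: uniform mean of the parameter dicts
    from the last N checkpoints (the submission averages 30)."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: v / len(checkpoints) for k, v in avg.items()}
```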
Evaluation
sliding window eval
parameters: {"stride":64}
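Sliding-window eval with stride 64 typically advances the context window 64 tokens at a time and scores only the newly uncovered tokens, so each scored token keeps near-full left context. A sketch of that span logic, assuming a 256-token window (the window size is not given in the submission):

```python
def sliding_window_spans(seq_len, window, stride=64):
    """Yield (start, end, score_from) spans: windows advance by `stride`,
    and only tokens at positions >= score_from are scored, so every token
    is scored exactly once with up to `window` tokens of context."""
    spans = []
    pos = 0                                    # first not-yet-scored token
    while pos < seq_len:
        start = max(0, pos + stride - window)  # left edge of the context window
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))
        pos = end
    return spans
```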

Novel Contributions

  • Reproduction of the March 20 SOTA #1 submission on RunPod 8xH100 SXM
  • Confirmed reproducibility: val_bpb 1.1455 against the published 1.1458 (within 0.0003)
  • Achieved training within the 600s wallclock limit while keeping the artifact under 16MB