PR #1463
open
Non-record: 1xH100 Budget Run — SmearGate + BigramHash + MLP3x (1.2774 BPB)
by tsubasagit
View on GitHub
val_bpb
1.2774
Architecture
Transformer
Optimizer
—
Artifact Size
16,374,104 bytes
Training Techniques
Architecture
SmearGate
Learned gate blending each token embedding with the previous token embedding to add lightweight bigram context.
parameters: null
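The SmearGate idea above can be sketched as follows. This is a minimal NumPy sketch, not the PR's implementation: the sigmoid gate parameterization (`w_gate`, `b_gate`) and the zero-padding at position 0 are assumptions.

```python
import numpy as np

def smear_gate(x, w_gate, b_gate):
    """Blend each token embedding with the previous token's embedding.

    x: (seq_len, dim) token embeddings.
    w_gate: (dim, 1), b_gate: scalar -- a hypothetical per-token sigmoid gate.
    Position 0 has no predecessor, so it is blended with zeros (assumed).
    """
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])     # shift right by one token
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))          # (seq_len, 1) gate in (0, 1)
    return g * x + (1.0 - g) * prev
```

When the gate saturates at 1 the layer reduces to the identity, so it can learn to ignore the bigram signal where it does not help.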
BigramHash
Hashes consecutive token pairs into a 4096-bucket embedding table projected to model dimension.
parameters: {"buckets":4096,"dim":128}
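A rough sketch of the BigramHash lookup, using the stated 4096 buckets and 128-dim table. The multiply-and-mix hash, the sentinel previous token at position 0, and the 768 model dimension are all assumptions; the tables are random here but learned in practice.

```python
import numpy as np

BUCKETS, EMB_DIM, MODEL_DIM = 4096, 128, 768   # buckets/dim from the PR; MODEL_DIM assumed

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, EMB_DIM)) * 0.02   # learned in practice
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)) * 0.02         # learned projection

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair to a bucket and look up its embedding.

    The hash below is a hypothetical stand-in; the PR's actual hash function
    is not specified. Position 0 uses a sentinel previous token of 0.
    """
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])
    buckets = (prev * 1000003 + toks) % BUCKETS   # assumed mixing; collisions are tolerated
    return bigram_table[buckets] @ proj           # (seq_len, MODEL_DIM)
```

The resulting features would be added to the token embeddings, giving the model cheap bigram statistics without a full vocab-squared table.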
MLP3x
Uses a 3x MLP expansion in the Transformer block.
parameters: {"hidden":1536}
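A sketch of the 3x MLP. `hidden=1536` from the PR implies a model dimension of 512 under a 3x expansion; the GELU activation and the absence of bias terms are assumptions.

```python
import numpy as np

DIM = 512            # implied by hidden=1536 at 3x expansion (assumption)
HIDDEN = 3 * DIM     # 1536, vs. the conventional 4x expansion

rng = np.random.default_rng(0)
w1 = rng.standard_normal((DIM, HIDDEN)) * 0.02
w2 = rng.standard_normal((HIDDEN, DIM)) * 0.02

def mlp3x(x):
    """Transformer MLP with a 3x (rather than 4x) expansion; tanh-approx GELU assumed."""
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2
```

Shrinking the expansion from 4x to 3x cuts MLP parameters by a quarter, which matters for an artifact-size-constrained ("Parameter Golf") run.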
U-Net skip connections
Adds skip connections from first-half layer outputs to second-half layers with learned scaling.
parameters: null
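The U-Net skip wiring can be sketched as below. The LIFO mirror pairing (layer i in the first half feeds layer n-1-i in the second half) and per-skip scalar scales are assumptions consistent with the description; the scales are learned in practice.

```python
import numpy as np

def unet_forward(x, layers, skip_scales):
    """Run layers with U-Net-style skips: the output of layer i in the first half
    is added, scaled by a learned scalar, to the input of its mirror layer in the
    second half (LIFO pairing assumed).

    layers: list of callables; skip_scales: one scalar per skip pair.
    """
    n = len(layers)
    half = n // 2
    stack = []
    for i, layer in enumerate(layers):
        if i < half:
            x = layer(x)
            stack.append(x)                                  # save first-half output
        else:
            x = x + skip_scales[i - half] * stack.pop()      # mirror pairing
            x = layer(x)
    return x
```

With all scales at zero this reduces to a plain residual stack, so the skips can be learned in or out per pair.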
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
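A minimal sketch of GQA with the PR's 8 query heads sharing 4 KV heads. The head dimension of 64 is an assumption, and the causal mask is omitted for brevity.

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64   # heads/kv_heads from the PR; HEAD_DIM assumed

def gqa(q, k, v):
    """Grouped-query attention: each group of HEADS // KV_HEADS query heads
    shares one key/value head, halving the KV parameters and cache here.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM).
    Causal masking omitted for brevity.
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)                      # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                         # (HEADS, seq, HEAD_DIM)
```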
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"start_frac":0.7,"every":100}
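The SWA schedule above (`start_frac: 0.7`, `every: 100`) amounts to a uniform average of snapshots from the last 30% of training. A sketch, with a hypothetical total iteration count (the PR does not list one) and a stub in place of checkpoint loading:

```python
import numpy as np

TOTAL_ITERS = 10_000                  # hypothetical; the PR's iteration count is not listed
START = int(0.7 * TOTAL_ITERS)        # start_frac: 0.7
EVERY = 100                           # every: 100 iters

def swa_average(weights_at):
    """Average weight snapshots taken every EVERY iters from START onward.

    weights_at: callable iter -> weight array (stands in for loading a checkpoint).
    """
    snaps = [weights_at(it) for it in range(START, TOTAL_ITERS + 1, EVERY)]
    return np.mean(snaps, axis=0)
```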
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
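The sliding-window evaluation can be sketched as below: slide by the PR's stride of 64 and score only the newest 64 tokens of each window, so every scored token gets close to a full window of context. The window length of 1024 and the model stub `nll_fn` are assumptions.

```python
import numpy as np

STRIDE, WINDOW = 64, 1024   # stride from the PR; WINDOW (context length) is assumed

def sliding_window_nll(tokens, nll_fn):
    """Evaluate a long sequence with overlapping windows.

    nll_fn(window, n_scored) -> summed NLL (nats) over the last n_scored tokens;
    it stands in for a forward pass of the model.
    Returns mean bits per scored token.
    """
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens), STRIDE):
        end = min(start + STRIDE, len(tokens))
        ctx_start = max(0, end - WINDOW)        # up to WINDOW - STRIDE tokens of context
        total_nll += nll_fn(tokens[ctx_start:end], end - start)
        total_tokens += end - start
    return total_nll / total_tokens / np.log(2)  # nats -> bits
```

Every token is scored exactly once, so the stride trades evaluation cost against how much context the earliest tokens in each window receive.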
Quantization
int6
bits: 6
scope: weights
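The int6 weights plus zlib packaging listed above can be sketched as follows. The symmetric per-tensor scale and the one-byte-per-value storage before zlib are assumptions; true 6-bit bit-packing would be tighter pre-compression, and the PR leaves the zlib level unspecified.

```python
import numpy as np
import zlib

def pack_int6_zlib(w):
    """Symmetric per-tensor int6 quantization, then zlib compression.

    int6 range is [-32, 31]; each value is stored in one int8 byte here,
    relying on zlib to squeeze out the unused bits (packing scheme assumed).
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return zlib.compress(q.tobytes()), scale     # level: zlib default (PR leaves it null)

def unpack_int6_zlib(blob, scale, shape):
    """Invert pack_int6_zlib: decompress, reinterpret as int8, rescale."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

Quantization error is bounded by half a scale step per weight, and the compressed blob is what counts toward the 16,374,104-byte artifact size.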
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
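A sketch of the orthogonal initialization via QR decomposition. The 1/fan_in muP-style scale on output projections is an assumption; muP prescribes several related scalings and the PR does not say which it uses.

```python
import numpy as np

def ortho_init(fan_in, fan_out, mup_output=False, rng=None):
    """Orthogonal weight init via QR; rows (or columns) are orthonormal.

    mup_output: apply a muP-style 1/fan_in scale for output projections
    (the exact muP scaling used in the PR is assumed here).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal((max(fan_in, fan_out), min(fan_in, fan_out)))
    q, _ = np.linalg.qr(a)                       # q has orthonormal columns
    w = q if q.shape == (fan_out, fan_in) else q.T
    if mup_output:
        w = w / fan_in                           # assumed muP output scaling
    return w
```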
LR Schedule
warmdown
parameters: {"warmdown_iters":800}
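The warmdown schedule can be sketched as a constant learning rate that decays over the final `warmdown_iters=800` steps. Linear decay to zero is an assumption; the PR only gives the iteration count.

```python
def warmdown_lr(it, total_iters, base_lr, warmdown_iters=800):
    """Hold base_lr, then decay linearly to 0 over the final warmdown_iters
    (linear-to-zero decay assumed; the PR only specifies warmdown_iters=800)."""
    if it < total_iters - warmdown_iters:
        return base_lr
    return base_lr * (total_iters - it) / warmdown_iters
```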
Novel Contributions
- Single-GPU 1xH100 budget run targeting the Parameter Golf challenge
- Retuning PR #162 techniques for a much smaller training budget
- Demonstration that increasing training shards from 1 to 20 substantially improved BPB
- Use of SmearGate, BigramHash, MLP3x, U-Net skip connections, and SWA in a budget-constrained setup
- Sliding-window evaluation with stride 64 and post-training int6+zlib artifact packaging