PR #1463
open
Non-record: 1xH100 Budget Run — SmearGate + BigramHash + MLP3x (1.2774 BPB)
by tsubasagit
View on GitHub
val_bpb
1.2774
Architecture
Transformer
Optimizer
—
Artifact Size
16,374,104 bytes
Training Techniques
Architecture
SmearGate
Learned gate blending each token embedding with the previous token embedding to add lightweight bigram context.
parameters: null
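The SmearGate idea above can be sketched as follows. This is a minimal NumPy sketch, not the PR's implementation: the sigmoid gate parameterization (`w_gate`, `b_gate`) and the zero-padding at position 0 are assumptions.

```python
import numpy as np

def smear_gate(x, w_gate, b_gate):
    """Blend each token embedding with the previous token's embedding.

    x: (seq_len, dim) token embeddings.
    w_gate: (dim, 1), b_gate: scalar -- a hypothetical per-token sigmoid gate.
    Position 0 has no predecessor, so it is blended with zeros (assumed).
    """
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])     # shift right by one token
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))          # (seq_len, 1) gate in (0, 1)
    return g * x + (1.0 - g) * prev
```

When the gate saturates at 1 the layer reduces to the identity, so it can learn to ignore the bigram signal where it does not help.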
BigramHash
Hashes consecutive token pairs into a 4096-bucket embedding table projected to model dimension.
parameters: {"buckets":4096,"dim":128}
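A rough sketch of the BigramHash lookup, using the stated 4096 buckets and 128-dim table. The multiply-and-mix hash, the sentinel previous token at position 0, and the 768 model dimension are all assumptions; the tables are random here but learned in practice.

```python
import numpy as np

BUCKETS, EMB_DIM, MODEL_DIM = 4096, 128, 768   # buckets/dim from the PR; MODEL_DIM assumed

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, EMB_DIM)) * 0.02   # learned in practice
proj = rng.standard_normal((EMB_DIM, MODEL_DIM)) * 0.02         # learned projection

def bigram_hash_features(tokens):
    """Hash each (prev, cur) token pair to a bucket and look up its embedding.

    The hash below is a hypothetical stand-in; the PR's actual hash function
    is not specified. Position 0 uses a sentinel previous token of 0.
    """
    toks = np.asarray(tokens)
    prev = np.concatenate([[0], toks[:-1]])
    buckets = (prev * 1000003 + toks) % BUCKETS   # assumed mixing; collisions are tolerated
    return bigram_table[buckets] @ proj           # (seq_len, MODEL_DIM)
```

The resulting features would be added to the token embeddings, giving the model cheap bigram statistics without a full vocab-squared table.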
MLP3x
Uses a 3x MLP expansion in the Transformer block.
parameters: {"hidden":1536}
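A sketch of the 3x MLP. `hidden=1536` from the PR implies a model dimension of 512 under a 3x expansion; the GELU activation and the absence of bias terms are assumptions.

```python
import numpy as np

DIM = 512            # implied by hidden=1536 at 3x expansion (assumption)
HIDDEN = 3 * DIM     # 1536, vs. the conventional 4x expansion

rng = np.random.default_rng(0)
w1 = rng.standard_normal((DIM, HIDDEN)) * 0.02
w2 = rng.standard_normal((HIDDEN, DIM)) * 0.02

def mlp3x(x):
    """Transformer MLP with a 3x (rather than 4x) expansion; tanh-approx GELU assumed."""
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2
```

Shrinking the expansion from 4x to 3x cuts MLP parameters by a quarter, which matters for an artifact-size-constrained ("Parameter Golf") run.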
U-Net skip connections
Adds skip connections from first-half layer outputs to second-half layers with learned scaling.
parameters: null
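The U-Net skip wiring can be sketched as below. The LIFO mirror pairing (layer i in the first half feeds layer n-1-i in the second half) and per-skip scalar scales are assumptions consistent with the description; the scales are learned in practice.

```python
import numpy as np

def unet_forward(x, layers, skip_scales):
    """Run layers with U-Net-style skips: the output of layer i in the first half
    is added, scaled by a learned scalar, to the input of its mirror layer in the
    second half (LIFO pairing assumed).

    layers: list of callables; skip_scales: one scalar per skip pair.
    """
    n = len(layers)
    half = n // 2
    stack = []
    for i, layer in enumerate(layers):
        if i < half:
            x = layer(x)
            stack.append(x)                                  # save first-half output
        else:
            x = x + skip_scales[i - half] * stack.pop()      # mirror pairing
            x = layer(x)
    return x
```

With all scales at zero this reduces to a plain residual stack, so the skips can be learned in or out per pair.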
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
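A minimal sketch of GQA with the PR's 8 query heads sharing 4 KV heads. The head dimension of 64 is an assumption, and the causal mask is omitted for brevity.

```python
import numpy as np

HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64   # heads/kv_heads from the PR; HEAD_DIM assumed

def gqa(q, k, v):
    """Grouped-query attention: each group of HEADS // KV_HEADS query heads
    shares one key/value head, halving the KV parameters and cache here.

    q: (HEADS, seq, HEAD_DIM); k, v: (KV_HEADS, seq, HEAD_DIM).
    Causal masking omitted for brevity.
    """
    group = HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)                      # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                         # (HEADS, seq, HEAD_DIM)
```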
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"start_frac":0.7,"every":100}
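The SWA schedule above (`start_frac: 0.7`, `every: 100`) amounts to a uniform average of snapshots from the last 30% of training. A sketch, with a hypothetical total iteration count (the PR does not list one) and a stub in place of checkpoint loading:

```python
import numpy as np

TOTAL_ITERS = 10_000                  # hypothetical; the PR's iteration count is not listed
START = int(0.7 * TOTAL_ITERS)        # start_frac: 0.7
EVERY = 100                           # every: 100 iters

def swa_average(weights_at):
    """Average weight snapshots taken every EVERY iters from START onward.

    weights_at: callable iter -> weight array (stands in for loading a checkpoint).
    """
    snaps = [weights_at(it) for it in range(START, TOTAL_ITERS + 1, EVERY)]
    return np.mean(snaps, axis=0)
```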
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
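The sliding-window evaluation can be sketched as below: slide by the PR's stride of 64 and score only the newest 64 tokens of each window, so every scored token gets close to a full window of context. The window length of 1024 and the model stub `nll_fn` are assumptions.

```python
import numpy as np

STRIDE, WINDOW = 64, 1024   # stride from the PR; WINDOW (context length) is assumed

def sliding_window_nll(tokens, nll_fn):
    """Evaluate a long sequence with overlapping windows.

    nll_fn(window, n_scored) -> summed NLL (nats) over the last n_scored tokens;
    it stands in for a forward pass of the model.
    Returns mean bits per scored token.
    """
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens), STRIDE):
        end = min(start + STRIDE, len(tokens))
        ctx_start = max(0, end - WINDOW)        # up to WINDOW - STRIDE tokens of context
        total_nll += nll_fn(tokens[ctx_start:end], end - start)
        total_tokens += end - start
    return total_nll / total_tokens / np.log(2)  # nats -> bits
```

Every token is scored exactly once, so the stride trades evaluation cost against how much context the earliest tokens in each window receive.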
Quantization
int6
bits: 6
scope: weights
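The int6 weights plus zlib packaging listed above can be sketched as follows. The symmetric per-tensor scale and the one-byte-per-value storage before zlib are assumptions; true 6-bit bit-packing would be tighter pre-compression, and the PR leaves the zlib level unspecified.

```python
import numpy as np
import zlib

def pack_int6_zlib(w):
    """Symmetric per-tensor int6 quantization, then zlib compression.

    int6 range is [-32, 31]; each value is stored in one int8 byte here,
    relying on zlib to squeeze out the unused bits (packing scheme assumed).
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return zlib.compress(q.tobytes()), scale     # level: zlib default (PR leaves it null)

def unpack_int6_zlib(blob, scale, shape):
    """Invert pack_int6_zlib: decompress, reinterpret as int8, rescale."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

Quantization error is bounded by half a scale step per weight, and the compressed blob is what counts toward the 16,374,104-byte artifact size.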
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
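A sketch of the orthogonal initialization via QR decomposition. The 1/fan_in muP-style scale on output projections is an assumption; muP prescribes several related scalings and the PR does not say which it uses.

```python
import numpy as np

def ortho_init(fan_in, fan_out, mup_output=False, rng=None):
    """Orthogonal weight init via QR; rows (or columns) are orthonormal.

    mup_output: apply a muP-style 1/fan_in scale for output projections
    (the exact muP scaling used in the PR is assumed here).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal((max(fan_in, fan_out), min(fan_in, fan_out)))
    q, _ = np.linalg.qr(a)                       # q has orthonormal columns
    w = q if q.shape == (fan_out, fan_in) else q.T
    if mup_output:
        w = w / fan_in                           # assumed muP output scaling
    return w
```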
LR Schedule
warmdown
parameters: {"warmdown_iters":800}
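The warmdown schedule can be sketched as a constant learning rate that decays over the final `warmdown_iters=800` steps. Linear decay to zero is an assumption; the PR only gives the iteration count.

```python
def warmdown_lr(it, total_iters, base_lr, warmdown_iters=800):
    """Hold base_lr, then decay linearly to 0 over the final warmdown_iters
    (linear-to-zero decay assumed; the PR only specifies warmdown_iters=800)."""
    if it < total_iters - warmdown_iters:
        return base_lr
    return base_lr * (total_iters - it) / warmdown_iters
```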
Novel Contributions
- Single-GPU 1xH100 budget run targeting the Parameter Golf challenge
- Retuning PR #162 techniques for a much smaller training budget
- Demonstration that increasing training shards from 1 to 20 substantially improved BPB
- Use of SmearGate, BigramHash, MLP3x, U-Net skip connections, and SWA in a budget-constrained setup
- Sliding-window evaluation with stride 64 and post-training int6+zlib artifact packaging