Learn how to fit a language model into 16MB. This site breaks down every technique used in OpenAI's Parameter Golf competition — from quantization and architecture tricks to test-time training — with interactive deep dives, real submission data, and code.
PRs Processed: 1342
Best Record BPB: 1.0810
Techniques: 1040
Deep Dives: 10
val_bpb: 1.0810
Architecture: Transformer
Optimizer: SGD
Size: ~15.99 MB
Quantization: GPTQ, int8
Architecture: depth recurrence, parallel residuals, weight tying, LeakyReLU, partial RoPE, U-Net skip connections
Regularization: logit softcap
Optimizer: SGD
Weight Averaging: EMA
Compression: lzma
Evaluation: sliding window eval
Test-Time Training: score-first TTT
LR Schedule: cosine decay, warmdown
Deep Dives
quantization
Quantization Fundamentals
Reducing model size while preserving performance
5 sections
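As a rough illustration of what the quantization deep dive covers (a minimal sketch, not taken from any submission), here is symmetric per-tensor int8 quantization in plain Python:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [qi * scale for qi in q]

w = [0.5, -1.0, 0.25, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Per-weight reconstruction error is at most half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

Storing one byte per weight instead of four is where most of the size win comes from; methods like GPTQ refine the rounding to limit the quality loss.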
architecture modification
Architecture Tricks
U-Net skips, BigramHash, SmearGate, and more
9 sections
optimizer technique
The Muon Optimizer
Why Parameter Golf's best players abandoned Adam
11 sections
weight averaging
Weight Averaging
SWA, EMA, and ensemble-like approaches that cost almost nothing
12 sections
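The core of EMA weight averaging, which this deep dive covers, fits in a few lines (an illustrative sketch with toy numbers, not any submission's code):

```python
def ema_update(avg, new, decay=0.99):
    """Keep a running exponential moving average of the weights:
    avg <- decay * avg + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]

weights = [1.0, 2.0]
avg = list(weights)
for _ in range(3):
    weights = [w + 0.1 for w in weights]  # stand-in for an optimizer step
    avg = ema_update(avg, weights, decay=0.9)
# The average lags the raw weights, smoothing out step-to-step noise.
assert avg[0] < weights[0]
```

The averaged copy costs one extra set of weights in memory and no extra training, which is why it is close to free.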
compression
Compression
zstd, pruning, and artifact size optimization
6 sections
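The record submission's tag list mentions lzma. A minimal sketch of squeezing a serialized checkpoint with Python's stdlib `lzma` (the checkpoint contents here are a toy example):

```python
import lzma
import pickle

# Toy checkpoint dict; the key names are hypothetical.
state = {"embed.weight": [0.0] * 4096, "lm_head.bias": [0.1] * 256}
raw = pickle.dumps(state)
packed = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
restored = pickle.loads(lzma.decompress(packed))
assert restored == state
assert len(packed) < len(raw)  # redundant weights compress well
```

Since the 16MB budget is measured on the artifact, compressing the stored weights (especially after quantization makes them more redundant) buys extra parameters for free at load time.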
test time training
Test-Time Training
Adapting models at inference time with LoRA, score-first TTT, and per-document fine-tuning
7 sections
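The "score-first" part of score-first TTT is a control-flow idea: score each document with the current weights before training on it, so no document influences its own score. A minimal sketch with placeholder functions (`score_fn` and `update_fn` stand in for a real model):

```python
def score_first_ttt(model, documents, score_fn, update_fn):
    """Score-first test-time training loop: evaluate each document first,
    then adapt the model on that same text before moving on."""
    total = 0.0
    for doc in documents:
        total += score_fn(model, doc)  # evaluate with current weights...
        update_fn(model, doc)          # ...then fine-tune on the document
    return total

# Toy stand-ins: the "model" is a dict and "training" just bumps a counter.
def score_fn(m, d):
    return m["bias"]

def update_fn(m, d):
    m["bias"] += 1

model = {"bias": 0}
total = score_first_ttt(model, ["doc_a", "doc_b", "doc_c"], score_fn, update_fn)
# Scores seen: 0, 1, 2 -> each doc was scored before it updated the model.
assert total == 3
```

Per-document fine-tuning and LoRA adapters slot into `update_fn` in a real setup; the loop above only shows the evaluation-before-adaptation ordering.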
lr schedule
Learning Rate Schedules
Warmdown, cosine, and schedule optimization
7 sections
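Cosine decay and warmdown, the two schedules named above, can be sketched as follows (hyperparameters are illustrative, not from any submission):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    t = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def warmdown_lr(step, total_steps, base_lr=1e-3, warmdown_frac=0.2):
    """Constant LR, then a linear 'warmdown' ramp to zero over the final
    warmdown_frac of training."""
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / max(1, total_steps - start)
```

Both end training at a tiny learning rate; the warmdown variant holds the LR flat for most of the run and only decays at the end.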
initialization
Initialization
OrthoInit and weight initialization strategies
6 sections
regularization
Regularization
Weight decay, pruning, and overfitting prevention
7 sections
evaluation technique
Evaluation Strategies
Sliding window eval, N-gram mixing, and scoring techniques
8 sections
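Sliding window eval, listed above, amounts to planning overlapping windows so every token is scored exactly once while most tokens get extra left context. A minimal sketch of the span planning (window sizes are illustrative):

```python
def sliding_windows(n_tokens, max_len=8, stride=4):
    """Plan evaluation spans so each token is scored exactly once, while
    later windows carry up to max_len - stride tokens of extra left context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        # (window start, window end, number of trailing tokens scored)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

# Each window re-reads some context, but only the new trailing tokens count.
assert sum(scored for _, _, scored in sliding_windows(10)) == 10
```

Smaller strides cost more forward passes but give each scored token more context, which typically lowers measured BPB.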
[Chart: Record Progression — record BPB (is_record) over all submissions; baseline 1.9]
[Chart: Technique Category Frequency]
Top Submissions
View all 1342 PRs →

#2  0.0000  PR #959   himanalot
#3  0.0109  PR #1076  sofiabod
#4  0.0165  PR #943   aamodbhatt
#5  0.0165  PR #944   aamodbhatt
#6  0.0180  PR #1056  sofiabod
#7  0.0214  PR #962   AnirudhRahul