Learn how to fit a language model into 16 MB. This site breaks down every technique used in OpenAI's Parameter Golf competition, from quantization and architecture tricks to test-time training, with interactive deep dives, real submission data, and code.
PRs Processed: 1614
Best Record BPB: 1.0576
Techniques: 1151
Deep Dives: 10
Current record: val_bpb 1.0576 (bits per byte; see the sketch below)
Architecture: Transformer
Optimizer: Muon
Size: 15.98 MB
Technique tags, by category:
Sequence Length: sequence_length
LR Schedule: warmdown
Test-Time Training: LoRA TTT
Architecture: SmearGate, XSA, Partial RoPE, depth recurrence, GQA, parallel decoder, SparseAttnGate
Optimizer: Muon
Weight Averaging: EMA
Quantization: GPTQ, LQER, AWQ-lite
Compression: pergroup
Evaluation: stride-based eval
Regularization: weight decay
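All scores here are bits per byte (BPB): the model's total cross-entropy on the eval text, converted from nats to bits and divided by the byte count, so lower is better. A minimal sketch of the metric itself; the leaderboard's exact harness is competition-specific.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy in nats over the eval text, converted to
    bits (divide by ln 2) and normalized by the UTF-8 byte count."""
    return total_nll_nats / math.log(2) / total_bytes

# A model averaging ~0.7331 nats of loss per byte lands at the record:
print(bits_per_byte(total_nll_nats=0.7331e6, total_bytes=1_000_000))  # ~1.0576
```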
Deep Dives
Quantization Fundamentals (5 sections): reducing model size while preserving performance.
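A minimal sketch of the shared building block: symmetric per-group quantization, where each small group of weights gets its own scale so one outlier cannot blow up the whole tensor. GPTQ, LQER, and AWQ-lite refine this with error-compensating rounding or activation-aware scales; the group size and bit width below are illustrative.

```python
import numpy as np

def quantize_pergroup(w: np.ndarray, group_size: int = 64, bits: int = 4):
    """Symmetric per-group quantization: each contiguous run of
    `group_size` weights shares one scale, so an outlier only
    distorts its own small group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    flat = w.reshape(-1, group_size)                # needs size % group_size == 0
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_pergroup(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).astype(np.float32).reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_pergroup(w)
print("max abs error:", np.abs(dequantize_pergroup(q, s, w.shape) - w).max())
```

One fp16 scale per 64 weights adds only 0.25 bits per weight of overhead, a small price for isolating outliers.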
Architecture Tricks (9 sections): U-Net skips, BigramHash, SmearGate, and more.
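Of these, U-Net skips are the easiest to isolate: the first half of the blocks stash their activations and the second half adds them back through learned gates. A sketch of just that wiring, assuming a modded-nanogpt-style layout (the class name and scalar gates are illustrative; SmearGate, BigramHash, and XSA are submission-specific and not reproduced here).

```python
import torch
import torch.nn as nn

class UNetTransformer(nn.Module):
    """U-Net-style skips over a transformer stack: the "encoder" half
    pushes its outputs onto a stack, and each "decoder" block adds the
    mirrored early activation back through a learned scalar gate."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        assert len(blocks) % 2 == 0
        self.blocks = blocks
        self.half = len(blocks) // 2
        self.skip_gates = nn.Parameter(torch.ones(self.half))

    def forward(self, x):
        skips = []
        for block in self.blocks[: self.half]:                # encoder half
            x = block(x)
            skips.append(x)
        for i, block in enumerate(self.blocks[self.half :]):  # decoder half
            x = x + self.skip_gates[i] * skips.pop()          # mirrored pairing
            x = block(x)
        return x

blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(8)]
)
out = UNetTransformer(blocks)(torch.randn(2, 16, 64))
```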
The Muon Optimizer (11 sections): why Parameter Golf's best players abandoned Adam.
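The heart of Muon on a single weight matrix, following the published update rule: momentum first, then push the 2-D update toward the nearest semi-orthogonal matrix with a quintic Newton-Schulz iteration. The hyperparameters here are placeholders; the released optimizer also runs the iteration in bfloat16 and rescales by matrix shape.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Map a 2-D update toward the nearest semi-orthogonal matrix with a
    quintic Newton-Schulz iteration (coefficients from the Muon release)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transpose = x.size(0) > x.size(1)   # iterate on the wide orientation
    if transpose:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transpose else x

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, beta=0.95):
    """One Muon update: Nesterov-style momentum, then orthogonalize
    the update before applying it."""
    buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(grad.add(buf, alpha=beta))
    param.add_(update, alpha=-lr)
```

Because the trick only applies to matrices, Muon is typically paired with a plain Adam for embeddings, scalars, and the output head.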
Weight Averaging (12 sections): SWA, EMA, and ensemble-like approaches that cost almost nothing.
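EMA is the cheapest of these: keep one shadow copy of the weights, blend after every optimizer step, then evaluate (and ship) the shadow. A minimal sketch; the decay constant is illustrative.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.999):
    """Exponential moving average of weights: after each optimizer step,
    blend the live weights into the shadow copy."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)   # ema = decay*ema + (1-decay)*p

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)
# ... after every training step:
ema_update(ema_model, model)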
Compression (6 sections): zstd, pruning, and artifact size optimization.
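Artifact-size tricks exploit the fact that quantized integer weights are far more compressible than floats. A sketch with the third-party zstandard package (the data is synthetic; real ratios depend on the weight distribution).

```python
import numpy as np
import zstandard as zstd  # third-party: pip install zstandard

# Quantized weights cluster near zero, so an entropy coder shaves real
# bytes off the artifact that counts against the size budget.
weights = np.clip(np.round(np.random.randn(1_000_000) * 20),
                  -128, 127).astype(np.int8)

raw = weights.tobytes()
packed = zstd.ZstdCompressor(level=19).compress(raw)
print(f"{len(raw)} -> {len(packed)} bytes "
      f"({len(packed) / len(raw):.1%} of original)")

restored = np.frombuffer(zstd.ZstdDecompressor().decompress(packed),
                         dtype=np.int8)
assert np.array_equal(restored, weights)
```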
Test-Time Training (7 sections): adapting models at inference time with LoRA, score-first TTT, and per-document fine-tuning.
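A minimal sketch of the LoRA mechanics underneath per-document TTT: freeze the base weights, train only a low-rank delta on the document being scored, then reset. "Score-first" refers to scoring each chunk before training on it, so the model never sees a byte before predicting it; `adapt_on_document`, `loss_fn`, the rank, and the step count are all illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a tiny trainable low-rank delta. At test
    time only A and B are updated, so adaptation is cheap and reversible."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def adapt_on_document(model, doc_tokens, loss_fn, steps=4, lr=1e-3):
    """Take a few gradient steps on the document itself; only the
    adapter parameters (requires_grad=True) move."""
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                          lr=lr)
    for _ in range(steps):
        loss = loss_fn(model, doc_tokens)
        opt.zero_grad(); loss.backward(); opt.step()
```

Because B starts at zero, the adapter contributes nothing until the first test-time step, and re-zeroing B exactly restores the base model between documents.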
Learning Rate Schedules (7 sections): warmdown, cosine, and schedule optimization.
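"Warmdown" is the linear-decay tail of a trapezoidal schedule: hold the peak learning rate flat for most of training, then ramp linearly to zero. A sketch; the fractions are illustrative.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmup_frac: float = 0.0, warmdown_frac: float = 0.3) -> float:
    """Trapezoidal schedule: optional linear warmup, a long flat plateau
    at base_lr, then a linear 'warmdown' to zero over the final steps."""
    warmup_steps = int(total_steps * warmup_frac)
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps
    return base_lr
```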
Initialization (6 sections): OrthoInit and weight initialization strategies.
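One plausible reading of the OrthoInit tag, not the submissions' exact code: initialize each weight matrix (semi-)orthogonally so the layer starts well-conditioned, using torch's built-in initializer.

```python
import torch
import torch.nn as nn

def ortho_init_(model: nn.Module, gain: float = 1.0) -> None:
    """Orthogonal init for every Linear weight, zeros for biases, so
    activations neither explode nor collapse at depth."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=gain)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

ortho_init_(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
```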
Regularization (7 sections): weight decay, pruning, and overfitting prevention.
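Decoupled (AdamW-style) weight decay in isolation: shrink the weights multiplicatively each step, separately from the gradient update. A side benefit worth noting under a 16 MB cap, as a general observation rather than a specific submission's claim, is that a tighter weight distribution tends to quantize with less error.

```python
import torch

@torch.no_grad()
def decoupled_weight_decay(params, lr: float, wd: float) -> None:
    """Applied after (or alongside) the gradient step: pull every
    weight toward zero by a factor of lr * wd."""
    for p in params:
        p.mul_(1.0 - lr * wd)
```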
Evaluation Strategies (8 sections): sliding window eval, N-gram mixing, and scoring techniques.
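Sliding-window (stride-based) eval gives every token real left context instead of truncating it at chunk boundaries. A sketch that returns bits per token, assuming `model(tokens)` yields per-position logits; dividing by bytes instead of tokens would give BPB. Window and stride are illustrative.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def strided_bits_per_token(model, tokens: torch.Tensor,
                           window: int = 1024, stride: int = 512) -> float:
    """Slide a window along the sequence and score only the targets the
    previous window has not already covered, so most tokens get at
    least `window - stride` tokens of left context."""
    total_nll, scored = 0.0, 0
    prev_end = 1  # target positions < prev_end are already scored
    for start in range(0, tokens.size(0) - 1, stride):
        end = min(start + window, tokens.size(0))
        chunk = tokens[start:end]
        logits = model(chunk.unsqueeze(0))[0]      # (T, vocab), assumed API
        n_new = end - max(prev_end, start + 1)     # fresh targets this window
        logp = F.log_softmax(logits[:-1].float(), dim=-1)
        nll = -logp.gather(1, chunk[1:, None]).squeeze(1)[-n_new:]
        total_nll += float(nll.sum())
        scored += n_new
        prev_end = end
        if end == tokens.size(0):
            break
    return total_nll / scored / math.log(2)        # nats -> bits
```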
Record Progression: chart of record BPB across all submissions (is_record points highlighted) against the 1.9 baseline.
Technique Category Frequency: chart of technique counts per category.
Top Submissions (view all 1614 PRs)

Rank  ΔBPB vs. record  PR        Author
#2    +0.0000          PR #959   himanalot
#3    +0.0109          PR #1076  sofiabod
#4    +0.0165          PR #943   aamodbhatt
#5    +0.0165          PR #944   aamodbhatt
#6    +0.0180          PR #1056  sofiabod
#7    +0.0214          PR #962   AnirudhRahul