PR #593

closed

Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean)

by abaybektursun
val_bpb
1.1163
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.90 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
BigramHash
Expanded bigram hash table with narrower embeddings to fit the artifact budget while reducing collisions.
parameters: {"buckets":3072,"dim":80}
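A minimal sketch of the bucketed bigram lookup, assuming a simple multiplicative hash (the multiplier and BOS handling are illustrative; only the 3072-bucket, 80-dim sizes come from the record):

```python
import numpy as np

NUM_BUCKETS = 3072  # expanded from 1536 to cut collisions
EMB_DIM = 80        # narrowed from 128 to stay inside the artifact budget

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash a (prev, cur) token pair into a fixed bucket (hash choice is illustrative)."""
    return ((prev_tok * 1000003) ^ cur_tok) % NUM_BUCKETS

# One learned embedding row per bucket.
table = np.zeros((NUM_BUCKETS, EMB_DIM), dtype=np.float32)

def bigram_embed(tokens: list[int]) -> np.ndarray:
    """Look up one bigram embedding per position; position 0 pairs with a BOS id of 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]
```

Widening the table trades narrower embeddings for fewer hash collisions at roughly the same artifact cost.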
MLP3x
Three-layer MLP variant using LeakyReLU squared activation.
parameters: {"layers":3}
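One plausible reading of the activation is a sign-preserving square of LeakyReLU, y·|y| (plain squaring of the LeakyReLU output is another reading; the slope and layer widths below are assumptions):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    """LeakyReLU followed by a sign-preserving square: y * |y|."""
    y = np.where(x >= 0, x, slope * x)
    return y * np.abs(y)

def mlp3x(x: np.ndarray, weights: list[np.ndarray]) -> np.ndarray:
    """Three matmul layers with LeakyReLU-squared between them (sizes are illustrative)."""
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:  # no activation after the final projection
            h = leaky_relu_sq(h)
    return h
```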
Partial RoPE
Rotary positional embeddings applied to only 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
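A sketch of partial RoPE, assuming the rotated dims are the leading 16 of the 64-dim head, a base of 10000, and the rotate-half pairing convention (all three are assumptions, not stated in the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of the head dimension; pass the rest through.
    x: (seq_len, head_dim)."""
    seq, dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq          # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # rotate-half pairing
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

The remaining 48 dimensions carry no positional rotation, which leaves part of each head position-agnostic.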
XSA
XSA enabled in the last 4 layers of the model.
parameters: {"last_n_layers":4}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"parameter_banking":true,"async_reduce_scatter_all_gather":true}
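The core of Muon is momentum followed by Newton-Schulz orthogonalization of the update; a minimal single-process sketch (quintic coefficients from the public Muon implementation; the learning rate is an assumption, and the record's parallel features, parameter banking and async reduce-scatter/all-gather, are omitted here):

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum buffer, orthogonalized direction, decoupled weight decay.
    momentum=0.99 and weight_decay=0.04 are the record's values; lr is illustrative."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    W = W * (1 - lr * weight_decay) - lr * update
    return W, buf
```

Parameter banking groups same-shaped weight matrices so the Newton-Schulz step and the collectives run over fewer, larger tensors.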
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
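A sketch of maintaining both averages, using the record's decay and snapshot interval (the dict-of-tensors interface is illustrative):

```python
def update_averages(step, params, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Keep an exponential moving average (updated every step) and a stochastic
    weight average (equal-weight snapshot every `swa_every` steps)."""
    ema = {k: ema_decay * ema[k] + (1 - ema_decay) * v for k, v in params.items()}
    if step % swa_every == 0:
        swa_count += 1
        # Running mean over snapshots: swa += (param - swa) / n
        swa = {k: swa[k] + (v - swa[k]) / swa_count for k, v in params.items()}
    return ema, swa, swa_count
```

The same arithmetic works elementwise on NumPy or torch tensors; initializing the EMA from the first parameter snapshot is a common choice.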
Compression
lzma
level: 9
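The artifact packing step is just stdlib LZMA at the maximum preset (the function name is illustrative):

```python
import lzma

def pack_artifact(blob: bytes) -> bytes:
    """Compress the serialized weights with LZMA at preset 9, as in the record."""
    return lzma.compress(blob, preset=9)
```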
Evaluation
sliding window eval
parameters: {"stride":64}
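A sketch of stride-64 sliding-window evaluation: each chunk reuses the trailing context and scores only the new tokens. The window size and the `nll_fn(context, n_new)` interface (summed natural-log loss over the last `n_new` tokens) are assumptions:

```python
import math

def sliding_window_bpb(nll_fn, tokens, window=256, stride=64):
    """Bits per token via a sliding window; with a byte-level vocabulary this is bits per byte."""
    total_nll, scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        ctx_start = max(0, start - (window - stride))   # reuse window - stride tokens of context
        chunk = tokens[ctx_start:start + stride]
        n_new = min(stride, len(tokens) - start)        # only score the fresh tokens
        total_nll += nll_fn(chunk, n_new)
        scored += n_new
    return total_nll / scored / math.log(2)             # nats -> bits
```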
Test-Time Training
test_time_training
parameters: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
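One reading of the layerwise scale is a LayerNorm whose output is damped by 1/sqrt(layer+1), so deeper layers make progressively smaller residual contributions (a sketch under that assumption):

```python
import numpy as np

def scaled_layernorm(x: np.ndarray, layer_idx: int, eps: float = 1e-5) -> np.ndarray:
    """LayerNorm over the last axis, scaled by 1/sqrt(layer_idx + 1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```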
Other
other
Full GPTQ with full-Hessian collection over calibration batches, actorder column reordering, and Cholesky error compensation.
parameters: {"calibration_batches":256}
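A rough single-block sketch of the quantization loop described here: build the Hessian from calibration activations, reorder columns by diagonal magnitude (actorder), then quantize column by column with Cholesky-based error compensation. Only the 6-bit width and the use of 256 calibration batches come from the record; shapes, damping, and the per-row symmetric scale are illustrative:

```python
import numpy as np

def gptq_quantize(W: np.ndarray, X: np.ndarray, bits: int = 6, damp: float = 0.01):
    """Minimal full-Hessian GPTQ sketch. W: (rows, d) weight; X: (d, n) calibration activations."""
    rows, d = W.shape
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)        # dampening for numerical stability
    perm = np.argsort(-np.diag(H))                     # actorder: high-impact columns first
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T      # upper Cholesky factor of H^-1
    maxq = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / maxq + 1e-12       # per-row symmetric scale
    Q = np.zeros_like(W)
    for i in range(d):
        col = W[:, i]
        q = np.clip(np.round(col / scale), -maxq, maxq) * scale
        Q[:, i] = q
        err = (col - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # push error onto later columns
    return Q[:, np.argsort(perm)], scale               # undo the actorder permutation
```

The memory fix noted below simply releases the training model before `H` is accumulated, since the Hessian pass only needs the frozen weights and activations.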

Novel Contributions

  • Full Hessian GPTQ with actorder and Cholesky error compensation
  • Parallel Muon with parameter banking and communication overlap
  • BigramHash reallocation from 1536x128 to 3072x80 to reduce collisions under the artifact budget
  • LeakyReLU² MLP variant
  • GPTQ memory fix by freeing the training model before Hessian collection