PR #593

closed

Record: Full GPTQ + LeakyReLU² + Parallel Muon + BigramHash 3072 (val_bpb 1.1163, 3-seed mean)

by abaybektursun
val_bpb
1.1163
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.90 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
BigramHash
Expanded bigram hash table with narrower embeddings to fit the artifact budget while reducing collisions.
parameters: {"buckets":3072,"dim":80}
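A minimal sketch of the bucketed bigram lookup, assuming a simple multiplicative hash (the multiplier and BOS handling are illustrative; only the 3072-bucket, 80-dim sizes come from the record):

```python
import numpy as np

NUM_BUCKETS = 3072  # expanded from 1536 to cut collisions
EMB_DIM = 80        # narrowed from 128 to stay inside the artifact budget

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Hash a (prev, cur) token pair into a fixed bucket (hash choice is illustrative)."""
    return ((prev_tok * 1000003) ^ cur_tok) % NUM_BUCKETS

# One learned embedding row per bucket.
table = np.zeros((NUM_BUCKETS, EMB_DIM), dtype=np.float32)

def bigram_embed(tokens: list[int]) -> np.ndarray:
    """Look up one bigram embedding per position; position 0 pairs with a BOS id of 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]
```

Widening the table trades narrower embeddings for fewer hash collisions at roughly the same artifact cost.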
MLP3x
Three-layer MLP variant using LeakyReLU squared activation.
parameters: {"layers":3}
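One plausible reading of the activation is a sign-preserving square of LeakyReLU, y·|y| (plain squaring of the LeakyReLU output is another reading; the slope and layer widths below are assumptions):

```python
import numpy as np

def leaky_relu_sq(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    """LeakyReLU followed by a sign-preserving square: y * |y|."""
    y = np.where(x >= 0, x, slope * x)
    return y * np.abs(y)

def mlp3x(x: np.ndarray, weights: list[np.ndarray]) -> np.ndarray:
    """Three matmul layers with LeakyReLU-squared between them (sizes are illustrative)."""
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:  # no activation after the final projection
            h = leaky_relu_sq(h)
    return h
```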
Partial RoPE
Rotary positional embeddings applied to only 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
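A sketch of partial RoPE, assuming the rotated dims are the leading 16 of the 64-dim head, a base of 10000, and the rotate-half pairing convention (all three are assumptions, not stated in the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rot_dims` of the head dimension; pass the rest through.
    x: (seq_len, head_dim)."""
    seq, dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq          # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # rotate-half pairing
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```

The remaining 48 dimensions carry no positional rotation, which leaves part of each head position-agnostic.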
XSA
XSA enabled in the last 4 layers of the model.
parameters: {"last_n_layers":4}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"parameter_banking":true,"async_reduce_scatter_all_gather":true}
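The core of Muon is momentum followed by Newton-Schulz orthogonalization of the update; a minimal single-process sketch (quintic coefficients from the public Muon implementation; the learning rate is an assumption, and the record's parallel features, parameter banking and async reduce-scatter/all-gather, are omitted here):

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum buffer, orthogonalized direction, decoupled weight decay.
    momentum=0.99 and weight_decay=0.04 are the record's values; lr is illustrative."""
    buf = momentum * buf + grad
    update = newton_schulz_orth(buf)
    W = W * (1 - lr * weight_decay) - lr * update
    return W, buf
```

Parameter banking groups same-shaped weight matrices so the Newton-Schulz step and the collectives run over fewer, larger tensors.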
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
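A sketch of maintaining both averages, using the record's decay and snapshot interval (the dict-of-tensors interface is illustrative):

```python
def update_averages(step, params, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Keep an exponential moving average (updated every step) and a stochastic
    weight average (equal-weight snapshot every `swa_every` steps)."""
    ema = {k: ema_decay * ema[k] + (1 - ema_decay) * v for k, v in params.items()}
    if step % swa_every == 0:
        swa_count += 1
        # Running mean over snapshots: swa += (param - swa) / n
        swa = {k: swa[k] + (v - swa[k]) / swa_count for k, v in params.items()}
    return ema, swa, swa_count
```

The same arithmetic works elementwise on NumPy or torch tensors; initializing the EMA from the first parameter snapshot is a common choice.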
Compression
lzma
level: 9
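The artifact packing step is just stdlib LZMA at the maximum preset (the function name is illustrative):

```python
import lzma

def pack_artifact(blob: bytes) -> bytes:
    """Compress the serialized weights with LZMA at preset 9, as in the record."""
    return lzma.compress(blob, preset=9)
```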
Evaluation
sliding window eval
parameters: {"stride":64}
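A sketch of stride-64 sliding-window evaluation: each chunk reuses the trailing context and scores only the new tokens. The window size and the `nll_fn(context, n_new)` interface (summed natural-log loss over the last `n_new` tokens) are assumptions:

```python
import math

def sliding_window_bpb(nll_fn, tokens, window=256, stride=64):
    """Bits per token via a sliding window; with a byte-level vocabulary this is bits per byte."""
    total_nll, scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        ctx_start = max(0, start - (window - stride))   # reuse window - stride tokens of context
        chunk = tokens[ctx_start:start + stride]
        n_new = min(stride, len(tokens) - start)        # only score the fresh tokens
        total_nll += nll_fn(chunk, n_new)
        scored += n_new
    return total_nll / scored / math.log(2)             # nats -> bits
```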
Test-Time Training
test_time_training
parameters: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
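One reading of the layerwise scale is a LayerNorm whose output is damped by 1/sqrt(layer+1), so deeper layers make progressively smaller residual contributions (a sketch under that assumption):

```python
import numpy as np

def scaled_layernorm(x: np.ndarray, layer_idx: int, eps: float = 1e-5) -> np.ndarray:
    """LayerNorm over the last axis, scaled by 1/sqrt(layer_idx + 1)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```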
Other
other
Full GPTQ with full-Hessian collection over calibration batches, actorder column reordering, and Cholesky error compensation.
parameters: {"calibration_batches":256}
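A rough single-block sketch of the quantization loop described here: build the Hessian from calibration activations, reorder columns by diagonal magnitude (actorder), then quantize column by column with Cholesky-based error compensation. Only the 6-bit width and the use of 256 calibration batches come from the record; shapes, damping, and the per-row symmetric scale are illustrative:

```python
import numpy as np

def gptq_quantize(W: np.ndarray, X: np.ndarray, bits: int = 6, damp: float = 0.01):
    """Minimal full-Hessian GPTQ sketch. W: (rows, d) weight; X: (d, n) calibration activations."""
    rows, d = W.shape
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)        # dampening for numerical stability
    perm = np.argsort(-np.diag(H))                     # actorder: high-impact columns first
    W = W[:, perm].copy()
    H = H[np.ix_(perm, perm)]
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T      # upper Cholesky factor of H^-1
    maxq = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / maxq + 1e-12       # per-row symmetric scale
    Q = np.zeros_like(W)
    for i in range(d):
        col = W[:, i]
        q = np.clip(np.round(col / scale), -maxq, maxq) * scale
        Q[:, i] = q
        err = (col - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # push error onto later columns
    return Q[:, np.argsort(perm)], scale               # undo the actorder permutation
```

The memory fix noted below simply releases the training model before `H` is accumulated, since the Hessian pass only needs the frozen weights and activations.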

Novel Contributions

  • Full Hessian GPTQ with actorder and Cholesky error compensation
  • Parallel Muon with parameter banking and communication overlap
  • BigramHash reallocation from 1536x128 to 3072x80 to reduce collisions under the artifact budget
  • LeakyReLU² MLP variant
  • GPTQ memory fix by freeing the training model before Hessian collection