PR #1172

closed

Record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean)

by dexhunter
val_bpb
1.1015
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.65 MB

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
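A minimal sketch of what stride-64 sliding-window evaluation looks like: overlapping windows give every token full left context, but only the tokens not yet scored count toward the loss. `score_tokens` is a hypothetical stand-in for the model's per-token negative log-likelihoods; window size and the bpb conversion are illustrative.

```python
import math

def sliding_window_bpb(tokens, score_tokens, window=256, stride=64):
    """Slide a context window over the sequence with the given stride;
    each window re-scores its full context but only the new tokens
    (at most `stride` of them) are counted, so every token is scored
    exactly once with as much left context as the window allows."""
    nll_sum, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        n_new = end - prev_end  # tokens not scored by an earlier window
        if n_new <= 0:
            break
        nlls = score_tokens(tokens[begin:end])  # per-token NLLs in nats
        nll_sum += sum(nlls[-n_new:])
        scored += n_new
        prev_end = end
        if end == len(tokens):
            break
    return nll_sum / (scored * math.log(2))  # bits per token (bpb for bytes)
```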
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"steps":8}
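The SLOT-style step above (lr 0.005, 8 steps) can be sketched on a toy linear head: the hidden state and output weights stay frozen, and only an additive delta vector is optimized against the observed token's cross-entropy. The analytic gradient and the 2-D toy shapes are illustrative, not the record's model.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def slot_delta(h, W, target, lr=0.005, steps=8):
    """SLOT-style test-time optimization (sketch): with hidden state h
    and output matrix W frozen, learn an additive delta so that the
    logits W @ (h + delta) better predict the observed target token."""
    d = len(h)
    delta = [0.0] * d
    for _ in range(steps):
        hd = [h[i] + delta[i] for i in range(d)]
        logits = [sum(w[i] * hd[i] for i in range(d)) for w in W]
        p = softmax(logits)
        # cross-entropy gradient wrt delta: W^T (p - onehot(target))
        g = [sum(W[k][i] * (p[k] - (1.0 if k == target else 0.0))
                 for k in range(len(W))) for i in range(d)]
        delta = [delta[i] - lr * g[i] for i in range(d)]
    return delta
```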
Architecture
XSA
Applied XSA to all layers of the model.
parameters: {"layers":11}
BigramHash
Expanded bigram embedding representation.
parameters: {"buckets":2816,"dimensions":160}
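A sketch of the BigramHash idea with the record's parameters (2816 buckets, 160 dims): each (previous, current) token pair hashes into a bucket that indexes a learned embedding. The mixing constant and table initialization are illustrative assumptions.

```python
def bigram_bucket(prev_tok, tok, buckets=2816):
    """Hash a (previous, current) token pair into one of `buckets`
    slots (sketch; the multiplier is illustrative, not the record's)."""
    return (prev_tok * 1000003 + tok) % buckets

# Embedding table: buckets x dimensions, matching the record's parameters.
EMB = [[0.0] * 160 for _ in range(2816)]

def bigram_embedding(prev_tok, tok):
    """Look up the bucketed bigram embedding; in the model this vector
    would be combined with the regular token embedding."""
    return EMB[bigram_bucket(prev_tok, tok)]
```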
U-Net skip connections
Used sigmoid-gated lerp skip connections instead of simple addition.
parameters: null
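The sigmoid-gated lerp skip replaces the plain addition `x + y` of a U-Net skip with a learned interpolation. A minimal sketch, assuming a single learned gate logit per connection:

```python
import math

def gated_skip(x, y, gate_logit):
    """Sigmoid-gated lerp skip connection (sketch): out = g*x + (1-g)*y
    with g = sigmoid(gate_logit), a learned scalar, instead of the
    plain x + y used by standard U-Net skips."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * xi + (1.0 - g) * yi for xi, yi in zip(x, y)]
```

At `gate_logit = 0` the gate is 0.5 and the connection averages the two paths; large positive or negative logits let the model learn to favor either path.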
LeakyReLU
Used LeakyReLU^2 MLP activation.
parameters: {"slope":0.5}
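One plausible reading of the LeakyReLU^2 activation with slope 0.5, sketched below. Whether the record squares with or without sign preservation is an assumption; this version keeps the sign so negative pre-activations stay negative.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU^2 MLP activation (sketch): apply LeakyReLU with the
    given slope, then square while preserving the sign. The
    sign-preserving square is an assumption, not confirmed by the record."""
    y = x if x > 0 else slope * x
    return y * y if y >= 0 else -(y * y)
```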
Quantization
GPTQ
bits: 6
scope: all
QAT
bits: null
scope: late
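The soft-round QAT with alpha ramp listed under Novel Contributions can be sketched with the standard soft-rounding surrogate (Agustsson & Theis style): a differentiable stand-in for `round()` whose sharpness is ramped up over training. The ramp endpoints and linear schedule below are assumptions.

```python
import math

def soft_round(x, alpha):
    """Differentiable soft rounding: alpha -> 0 approaches the identity,
    alpha -> infinity approaches hard round(), so quantization can be
    introduced gradually during training."""
    f = math.floor(x)
    r = x - f
    return f + 0.5 + 0.5 * math.tanh(alpha * (r - 0.5)) / math.tanh(alpha / 2)

def alpha_ramp(step, total_steps, lo=1.0, hi=12.0):
    """Linearly ramp alpha over training so rounding hardens gradually
    (sketch; lo/hi and the linear shape are assumptions)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return lo + t * (hi - lo)
```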
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"split_lr":true,"early_layers_lr":0.025,"late_layers_lr":0.03}
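The split-LR setting above amounts to building two optimizer parameter groups, early layers at 0.025 and late layers at 0.03. A sketch; `split_at` is illustrative, since the record does not say where the split falls.

```python
def muon_param_groups(layer_params, split_at, early_lr=0.025, late_lr=0.03):
    """Split early/late learning rates for Muon (sketch): layers before
    `split_at` get the early LR, the rest the late LR. Group dicts follow
    the common optimizer param-group convention."""
    groups = []
    for i, params in enumerate(layer_params):
        lr = early_lr if i < split_at else late_lr
        groups.append({"params": params, "lr": lr})
    return groups
```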
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
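The warmdown schedule with 4000 warmdown steps: hold the base LR, then decay linearly over the final steps. Decaying all the way to zero is an assumption; only the warmdown length is given by the record.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Warmdown LR schedule (sketch): constant base LR until the last
    `warmdown_steps`, then linear decay to zero at the final step."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```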
Compression
brotli
level: 11
lzma
level: 2
Other
other
Code minification with pyminify and a self-extracting wrapper to reduce code size.
parameters: null

Novel Contributions

  • SLOT test-time optimization on frozen hidden states with an additive delta vector
  • Split early/late Muon learning rates
  • Sigmoid-gated skip connections
  • Soft-round QAT with alpha ramp
  • BigramHash dimension expansion to 160
  • Brotli-11 compression with byte-shuffle
  • Reduced GPTQ calibration reserve time
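The byte-shuffle paired with Brotli-11 above can be sketched as follows: regroup same-significance bytes of fixed-width elements before compression, as in the HDF5/Blosc shuffle filter. Since brotli is not in the standard library, stdlib lzma (also in this record's compression list, at level 2) stands in so the sketch is self-contained.

```python
import lzma

def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte i of every itemsize-wide element together; numeric
    weight blobs often compress better after this transposition."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i] for i in range(itemsize) for j in range(n))

def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle, restoring the original element layout."""
    n = len(data) // itemsize
    return bytes(data[i * n + j] for j in range(n) for i in range(itemsize))

def pack(data: bytes, itemsize: int = 2) -> bytes:
    # The record uses brotli at level 11 here; lzma preset 2 is a
    # stdlib stand-in (the record also ships an lzma level-2 stage).
    return lzma.compress(byte_shuffle(data, itemsize), preset=2)
```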