PR #1179 (open)

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)

by dexhunter
val_bpb: 1.1105
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.81 MB

Training Techniques

Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Architecture
  • BigramHash: bigram hash projection with fewer buckets and a wider projection dimension. parameters: {"buckets":2816,"dimensions":160}
  • U-Net skip connections: sigmoid-gated encoder-decoder skip connections. parameters: null
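A minimal sketch of the BigramHash idea as listed above: hash each (previous, current) byte pair into one of 2816 buckets and look up a 160-dim projection vector. The hash function, initialization, and NumPy framing are assumptions for illustration, not the PR's actual code.

```python
import numpy as np

N_BUCKETS, PROJ_DIM = 2816, 160  # parameters from this record


def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    # Simple multiplicative hash of the byte pair (hypothetical; the
    # record does not specify the hash used).
    return ((prev_byte * 257 + cur_byte) * 2654435761) % N_BUCKETS


rng = np.random.default_rng(0)
# Learned embedding table in the real model; random here for the sketch.
table = rng.standard_normal((N_BUCKETS, PROJ_DIM)) * 0.02


def bigram_features(tokens: bytes) -> np.ndarray:
    # One 160-dim feature vector per position, keyed by the
    # (previous, current) byte pair; position 0 pairs with a zero byte.
    cur = np.frombuffer(tokens, dtype=np.uint8)
    prev = np.concatenate(([0], cur[:-1]))
    idx = [bigram_bucket(int(p), int(c)) for p, c in zip(prev, cur)]
    return table[idx]


feats = bigram_features(b"hello")
```

In the full model these features would be added into (or concatenated with) the token embeddings before the transformer stack.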
Quantization
  • GPTQ (bits: 6, scope: all)
  • QAT (bits: null, scope: all)
Compression
  • brotli (level: 11)
  • byte-shuffle (level: null)
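The byte-shuffle step groups the i-th byte of every weight together before entropy coding, so slowly varying high-order bytes form long compressible runs. A self-contained sketch, assuming a transpose-style shuffle and using zlib as a stand-in coder (the record uses Brotli at level 11, e.g. via the `brotli` bindings' `compress(..., quality=11)`):

```python
import zlib

import numpy as np


def byte_shuffle(arr: np.ndarray) -> bytes:
    # Emit byte-plane 0 of every element, then plane 1, etc.
    return arr.view(np.uint8).reshape(arr.size, arr.itemsize).T.tobytes()


def byte_unshuffle(buf: bytes, dtype, count: int) -> np.ndarray:
    # Invert the shuffle: regroup the planes back into whole elements.
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, count)
    return np.ascontiguousarray(planes.T).view(dtype).ravel()


w = np.linspace(-1, 1, 1000, dtype=np.float16)  # stand-in weight tensor
packed = zlib.compress(byte_shuffle(w), 9)  # brotli quality 11 in the record
restored = byte_unshuffle(zlib.decompress(packed), np.float16, w.size)
```

The shuffle is lossless; only the downstream entropy coder determines the final artifact size.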
Regularization
  • LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
  • EMA + SWA: parameters: {"ema_decay":0.997,"swa_interval":50}
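The two averaging schemes listed can be sketched as follows, using flat lists of floats in place of parameter tensors; the combination rule (how EMA and SWA interact) is not specified in the record, so each is shown independently with the listed `ema_decay` and `swa_interval`.

```python
def ema_update(ema, weights, decay=0.997):
    # Exponential moving average; ema_decay 0.997 from this record.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Equal-weight running average of snapshots taken every `interval` steps."""

    def __init__(self, interval=50):  # swa_interval from this record
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, weights):
        if step == 0 or step % self.interval:
            return  # only average on snapshot steps
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean over the n snapshots seen so far.
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

At evaluation time the averaged weights replace the raw training weights.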
Evaluation
  • sliding window eval: parameters: {"stride":64}
LR Schedule
  • warmdown: parameters: {"warmdown_steps":3500}
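A common reading of a "warmdown" schedule is a constant learning rate followed by a linear decay to zero over the final `warmdown_steps`; that shape is an assumption here, since the record only lists the step count.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```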

Novel Contributions

  • Split learning rates for early and late transformer layers
  • BigramHash with 2816 buckets and 160-dimensional projection
  • Sigmoid-gated U-Net skip connections
  • Soft-round QAT with alpha ramp from 1 to 16
  • Brotli-11 plus byte-shuffle artifact compression
  • Code minification to reduce artifact size
  • Reduced GPTQ calibration reserve, leaving more of the run budget for training
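The soft-round QAT contribution above can be illustrated with the standard soft-rounding relaxation (a differentiable surrogate for `round()` whose sharpness is controlled by alpha); the linear shape of the 1 to 16 ramp is an assumption, as the record only states the range.

```python
import math


def soft_round(x: float, alpha: float) -> float:
    # Differentiable relaxation of round(): near-identity for small alpha,
    # approaching hard rounding as alpha grows.
    m = math.floor(x) + 0.5
    r = x - m  # offset in (-0.5, 0.5)
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0)


def alpha_at(step: int, total_steps: int, lo: float = 1.0, hi: float = 16.0) -> float:
    # Ramp of the sharpness from 1 to 16 over training (linear shape assumed).
    return lo + (hi - lo) * step / total_steps
```

During QAT the forward pass quantizes with `soft_round`, so gradients flow through the relaxation while the effective rounding hardens as training progresses.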