PR #1179 (open)

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)

by dexhunter
val_bpb: 1.1105
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.81 MB

Training Techniques

Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Architecture
  • BigramHash: bigram hash projection with fewer buckets and a wider projection dimension. parameters: {"buckets":2816,"dimensions":160}
  • U-Net skip connections: sigmoid-gated encoder-decoder skip connections. parameters: null
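A minimal sketch of the BigramHash idea as listed above: hash each (previous, current) byte pair into one of 2816 buckets and look up a 160-dim projection vector. The hash function, initialization, and NumPy framing are assumptions for illustration, not the PR's actual code.

```python
import numpy as np

N_BUCKETS, PROJ_DIM = 2816, 160  # parameters from this record


def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    # Simple multiplicative hash of the byte pair (hypothetical; the
    # record does not specify the hash used).
    return ((prev_byte * 257 + cur_byte) * 2654435761) % N_BUCKETS


rng = np.random.default_rng(0)
# Learned embedding table in the real model; random here for the sketch.
table = rng.standard_normal((N_BUCKETS, PROJ_DIM)) * 0.02


def bigram_features(tokens: bytes) -> np.ndarray:
    # One 160-dim feature vector per position, keyed by the
    # (previous, current) byte pair; position 0 pairs with a zero byte.
    cur = np.frombuffer(tokens, dtype=np.uint8)
    prev = np.concatenate(([0], cur[:-1]))
    idx = [bigram_bucket(int(p), int(c)) for p, c in zip(prev, cur)]
    return table[idx]


feats = bigram_features(b"hello")
```

In the full model these features would be added into (or concatenated with) the token embeddings before the transformer stack.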
Quantization
  • GPTQ (bits: 6, scope: all)
  • QAT (bits: null, scope: all)
Compression
  • brotli (level: 11)
  • byte-shuffle (level: null)
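The byte-shuffle step groups the i-th byte of every weight together before entropy coding, so slowly varying high-order bytes form long compressible runs. A self-contained sketch, assuming a transpose-style shuffle and using zlib as a stand-in coder (the record uses Brotli at level 11, e.g. via the `brotli` bindings' `compress(..., quality=11)`):

```python
import zlib

import numpy as np


def byte_shuffle(arr: np.ndarray) -> bytes:
    # Emit byte-plane 0 of every element, then plane 1, etc.
    return arr.view(np.uint8).reshape(arr.size, arr.itemsize).T.tobytes()


def byte_unshuffle(buf: bytes, dtype, count: int) -> np.ndarray:
    # Invert the shuffle: regroup the planes back into whole elements.
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, count)
    return np.ascontiguousarray(planes.T).view(dtype).ravel()


w = np.linspace(-1, 1, 1000, dtype=np.float16)  # stand-in weight tensor
packed = zlib.compress(byte_shuffle(w), 9)  # brotli quality 11 in the record
restored = byte_unshuffle(zlib.decompress(packed), np.float16, w.size)
```

The shuffle is lossless; only the downstream entropy coder determines the final artifact size.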
Regularization
  • LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
Weight Averaging
  • EMA + SWA: parameters: {"ema_decay":0.997,"swa_interval":50}
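The two averaging schemes listed can be sketched as follows, using flat lists of floats in place of parameter tensors; the combination rule (how EMA and SWA interact) is not specified in the record, so each is shown independently with the listed `ema_decay` and `swa_interval`.

```python
def ema_update(ema, weights, decay=0.997):
    # Exponential moving average; ema_decay 0.997 from this record.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Equal-weight running average of snapshots taken every `interval` steps."""

    def __init__(self, interval=50):  # swa_interval from this record
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, weights):
        if step == 0 or step % self.interval:
            return  # only average on snapshot steps
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean over the n snapshots seen so far.
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

At evaluation time the averaged weights replace the raw training weights.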
Evaluation
  • sliding window eval: parameters: {"stride":64}
LR Schedule
  • warmdown: parameters: {"warmdown_steps":3500}
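A common reading of a "warmdown" schedule is a constant learning rate followed by a linear decay to zero over the final `warmdown_steps`; that shape is an assumption here, since the record only lists the step count.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```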

Novel Contributions

  • Split learning rates for early and late transformer layers
  • BigramHash with 2816 buckets and 160-dimensional projection
  • Sigmoid-gated U-Net skip connections
  • Soft-round QAT with alpha ramp from 1 to 16
  • Brotli-11 plus byte-shuffle artifact compression
  • Code minification to reduce artifact size
  • Reduced GPTQ calibration reserve, leaving more of the run budget for training
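The soft-round QAT contribution above can be illustrated with the standard soft-rounding relaxation (a differentiable surrogate for `round()` whose sharpness is controlled by alpha); the linear shape of the 1 to 16 ramp is an assumption, as the record only states the range.

```python
import math


def soft_round(x: float, alpha: float) -> float:
    # Differentiable relaxation of round(): near-identity for small alpha,
    # approaching hard rounding as alpha grows.
    m = math.floor(x) + 0.5
    r = x - m  # offset in (-0.5, 0.5)
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0)


def alpha_at(step: int, total_steps: int, lo: float = 1.0, hi: float = 16.0) -> float:
    # Ramp of the sharpness from 1 to 16 over training (linear shape assumed).
    return lo + (hi - lo) * step / total_steps
```

During QAT the forward pass quantizes with `soft_round`, so gradients flow through the relaxation while the effective rounding hardens as training progresses.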