PR #1302 (open)

Record: Split-LR + N-gram Agreement + Full GPTQ — val_bpb 1.1079 (3-seed mean)

by vlivashkin
val_bpb: 1.1078
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.86 MB

Training Techniques

Architecture
BigramHash
Widened bigram-hash projection.
parameters: {"buckets":2816,"dimensions":160}
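A minimal sketch of how a bigram-hash embedding of this shape could work; the bucket and dimension counts match the record, but the hash mixer and lookup scheme below are assumptions, not the PR's actual code:

```python
# Hypothetical bigram-hash embedding: each (prev, cur) byte pair is
# hashed into one of BUCKETS rows of a learned projection table.
BUCKETS, DIM = 2816, 160  # from the record

def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    # Cheap multiplicative mixing of the two bytes into a bucket index
    # (the actual hash function is not given in the record).
    h = (prev_byte * 0x9E3779B1 + cur_byte * 0x85EBCA6B) & 0xFFFFFFFF
    return h % BUCKETS

# Embedding table (would be learned; zeros here just fix the shapes).
table = [[0.0] * DIM for _ in range(BUCKETS)]

def embed_bigrams(byte_seq: list[int]) -> list[list[float]]:
    # Position 0 has no previous byte; pad with 0.
    out, prev = [], 0
    for b in byte_seq:
        out.append(table[bigram_bucket(prev, b)])
        prev = b
    return out
```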
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
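A sigmoid-gated skip can be sketched as adding a stored early-layer activation into the matching late layer, scaled by a learned gate in (0, 1). A single scalar gate per skip is an assumption here; per-channel gates are equally plausible:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(late, early, gate_logit):
    # Blend the encoder-side activation into the decoder-side one,
    # U-Net style; `gate_logit` would be a trained parameter.
    g = sigmoid(gate_logit)
    return [x + g * e for x, e in zip(late, early)]
```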
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
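One reading of "LeakyReLU squared" with slope 0.5, sketched below. The sign-preserving square (y·|y|) keeps the activation monotonic; a plain square is also a plausible reading, so treat the exact form as an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with the record's slope of 0.5, then a sign-preserving
    # square so negative inputs stay negative.
    y = x if x >= 0.0 else slope * x
    return y * abs(y)
```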
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
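With 8 query heads and 4 KV heads, each KV head serves a group of 8 / 4 = 2 query heads. A shape-level sketch of the head mapping (contiguous grouping is the usual convention, assumed here):

```python
HEADS, KV_HEADS = 8, 4  # from the record

def kv_head_for(q_head: int) -> int:
    # Each KV head serves heads // kv_heads consecutive query heads.
    return q_head // (HEADS // KV_HEADS)

def expand_kv(kv):
    # Repeat each KV head so shapes line up with the query heads,
    # e.g. [k0, k1, k2, k3] -> [k0, k0, k1, k1, k2, k2, k3, k3].
    return [kv[kv_head_for(h)] for h in range(HEADS)]
```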
XSA
XSA attention used across all layers.
parameters: {"layers":11}
VE128
VE128 enabled in later layers.
parameters: {"layers":[9,10]}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
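The "16/64" can be read as rotating only the first 16 dimensions of each 64-dim head and passing the rest through unrotated. A sketch under that assumption (base 10000 is the conventional RoPE default, not stated in the record):

```python
import math

ROT_DIMS, HEAD_DIM = 16, 64  # "16/64" from the record

def partial_rope(vec, pos, base=10000.0):
    # Rotate pairs within the first ROT_DIMS dims; leave the tail as-is.
    out = list(vec)
    for i in range(0, ROT_DIMS, 2):
        theta = pos / (base ** (i / ROT_DIMS))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```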
Quantization
QAT
bits: 6
scope: all
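The contributions list names soft-round QAT with an alpha ramp from 1 to 16. The standard soft-round surrogate (Agustsson & Theis) sharpens toward hard rounding as alpha grows; the linear ramp shape below is an assumption:

```python
import math

def soft_round(x: float, alpha: float) -> float:
    # Differentiable surrogate for round(): exact at integers, and
    # approaching a hard step between them as alpha increases.
    m = math.floor(x)
    r = x - m - 0.5
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5

def alpha_at(progress: float) -> float:
    # Ramp from 1 to 16 over training (progress in [0, 1]); whether
    # the ramp is linear is an assumption.
    return 1.0 + 15.0 * progress
```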
GPTQ
bits: 6
scope: all
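Full-Hessian GPTQ is too involved to reproduce here, but the int6 grid it rounds onto can be sketched. Real GPTQ additionally compensates each column's rounding error through the inverse Hessian; this sketch shows only the symmetric quantization grid:

```python
def quantize_int6(weights):
    # Per-tensor symmetric int6: levels in [-32, 31].
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```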
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
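A sketch of maintaining both averages with the record's hyperparameters (decay 0.997, SWA snapshot every 50 steps); scalar weights keep it minimal, and how the two averages are finally combined is not stated in the record:

```python
class AveragedWeights:
    # EMA updated every step plus an SWA running mean snapshotted
    # every `swa_every` steps.
    def __init__(self, w0: float, ema_decay: float = 0.997, swa_every: int = 50):
        self.ema = w0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.swa_sum = 0.0
        self.swa_n = 0
        self.step = 0

    def update(self, w: float):
        self.step += 1
        self.ema = self.decay * self.ema + (1.0 - self.decay) * w
        if self.step % self.swa_every == 0:
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self) -> float:
        return self.swa_sum / max(self.swa_n, 1)
```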
Compression
Brotli
level: 11
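The contributions list pairs Brotli-11 with a byte-shuffle. A Blosc-style shuffle groups same-significance bytes of fixed-width values together so the entropy coder sees longer runs; the sketch below shows the shuffle only (the Brotli call itself, e.g. at quality 11, is omitted):

```python
def byte_shuffle(data: bytes, width: int) -> bytes:
    # Transpose bytes of `width`-byte elements: all byte-0s first,
    # then all byte-1s, etc. len(data) must be a multiple of width.
    n = len(data) // width
    return bytes(data[i * width + b] for b in range(width) for i in range(n))

def byte_unshuffle(data: bytes, width: int) -> bytes:
    # Exact inverse of byte_shuffle.
    n = len(data) // width
    return bytes(data[b * n + i] for i in range(n) for b in range(width))
```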
Evaluation
online n-gram agreement
parameters: {"experts":3,"causal":true,"score_first":true,"normalized":true}
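One way to read this entry: three causal n-gram "experts" (n = 1, 2, 3) each predict the next token from counts seen so far, scoring begins at the first position where any expert can predict, and the score is normalized over the experts that are live at each position. The scoring scheme below is an interpretation, not the PR's code:

```python
from collections import defaultdict

class NGramExpert:
    # Causal n-gram counter: predicts the most frequent continuation
    # of the trailing (n-1)-token context seen so far.
    def __init__(self, n: int):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def _ctx(self, history):
        return tuple(history[-(self.n - 1):]) if self.n > 1 else ()

    def predict(self, history):
        nxt = self.counts.get(self._ctx(history))
        return max(nxt, key=nxt.get) if nxt else None

    def observe(self, history, token):
        self.counts[self._ctx(history)][token] += 1

def agreement(tokens, experts):
    # Online eval: predict before observing (causal), score a position
    # by the fraction of live experts matching the true token.
    hits = total = 0
    hist = []
    for t in tokens:
        live = [p for p in (e.predict(hist) for e in experts) if p is not None]
        if live:
            hits += sum(p == t for p in live) / len(live)
            total += 1
        for e in experts:
            e.observe(hist, t)
        hist.append(t)
    return hits / total if total else 0.0
```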
sliding window eval
parameters: {"stride":64}
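Sliding-window eval with stride 64 can be sketched as overlapping windows where only the trailing `stride` tokens of each window are scored (so every token is predicted with near-full left context); the scored-suffix convention is the common one but is an assumption here:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    # Returns (window_start, window_end, scored_from) triples; the
    # first window scores all its tokens, later ones only the new tail.
    spans = []
    start = 0
    while start + window <= n_tokens:
        lo = start + window - stride if start > 0 else start
        spans.append((start, start + window, lo))
        start += stride
    return spans
```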
LR Schedule
split-LR
parameters: {"early":0.025,"late":0.03,"bank_split":5}
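Reading `bank_split: 5` as the boundary layer index, split-LR just assigns one rate to layers below the split and another above; a real run would hand these two groups to the optimizer (Parallel Muon here) as parameter groups:

```python
EARLY_LR, LATE_LR, BANK_SPLIT = 0.025, 0.03, 5  # from the record

def lr_for_layer(layer_idx: int) -> float:
    # Layers [0, BANK_SPLIT) use the early rate, the rest the late rate
    # (which side the boundary layer falls on is an assumption).
    return EARLY_LR if layer_idx < BANK_SPLIT else LATE_LR
```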
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
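The 1/sqrt(layer+1) scale, applied per layer so deeper layers start with smaller residual contributions (whether it is an init or a fixed multiplier is not stated; an init is assumed here):

```python
import math

def ln_scale_init(layer_idx: int) -> float:
    # LayerNorm gain scaled by 1/sqrt(layer+1), per the record.
    return 1.0 / math.sqrt(layer_idx + 1)
```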

Novel Contributions

  • Split-LR training with different early and late layer learning rates
  • BigramHash widening to 2816 x 160
  • Sigmoid-gated U-Net skip connections
  • Soft-round QAT with alpha ramp from 1 to 16
  • Brotli-11 plus byte-shuffle artifact compression
  • Coprime-stride data loader
  • Online n-gram agreement evaluation with three causal experts
  • Properly normalized exponential tilting for probability adjustment
  • Full Hessian GPTQ int6 quantization
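The coprime-stride loader above relies on a basic fact: if the stride is coprime with the dataset length, repeatedly stepping by it visits every starting offset exactly once before the cycle repeats. A sketch (the record does not say how the stride is chosen; searching upward from a desired value is an assumption):

```python
import math

def coprime_stride(dataset_len: int, desired: int) -> int:
    # Smallest stride >= `desired` that is coprime with the dataset
    # length, guaranteeing a full permutation of start offsets.
    s = desired
    while math.gcd(s, dataset_len) != 1:
        s += 1
    return s
```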