PR #1302 (open)

Record: Split-LR + N-gram Agreement + Full GPTQ — val_bpb 1.1079 (3-seed mean)

by vlivashkin
val_bpb: 1.1078
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.86 MB

Training Techniques

Architecture
BigramHash
Widened bigram-hash projection.
parameters: {"buckets":2816,"dimensions":160}
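A minimal sketch of how a bigram-hash embedding of this shape could work; the bucket and dimension counts match the record, but the hash mixer and lookup scheme below are assumptions, not the PR's actual code:

```python
# Hypothetical bigram-hash embedding: each (prev, cur) byte pair is
# hashed into one of BUCKETS rows of a learned projection table.
BUCKETS, DIM = 2816, 160  # from the record

def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    # Cheap multiplicative mixing of the two bytes into a bucket index
    # (the actual hash function is not given in the record).
    h = (prev_byte * 0x9E3779B1 + cur_byte * 0x85EBCA6B) & 0xFFFFFFFF
    return h % BUCKETS

# Embedding table (would be learned; zeros here just fix the shapes).
table = [[0.0] * DIM for _ in range(BUCKETS)]

def embed_bigrams(byte_seq: list[int]) -> list[list[float]]:
    # Position 0 has no previous byte; pad with 0.
    out, prev = [], 0
    for b in byte_seq:
        out.append(table[bigram_bucket(prev, b)])
        prev = b
    return out
```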
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
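A sigmoid-gated skip can be sketched as adding a stored early-layer activation into the matching late layer, scaled by a learned gate in (0, 1). A single scalar gate per skip is an assumption here; per-channel gates are equally plausible:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(late, early, gate_logit):
    # Blend the encoder-side activation into the decoder-side one,
    # U-Net style; `gate_logit` would be a trained parameter.
    g = sigmoid(gate_logit)
    return [x + g * e for x, e in zip(late, early)]
```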
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
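One reading of "LeakyReLU squared" with slope 0.5, sketched below. The sign-preserving square (y·|y|) keeps the activation monotonic; a plain square is also a plausible reading, so treat the exact form as an assumption:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with the record's slope of 0.5, then a sign-preserving
    # square so negative inputs stay negative.
    y = x if x >= 0.0 else slope * x
    return y * abs(y)
```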
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
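With 8 query heads and 4 KV heads, each KV head serves a group of 8 / 4 = 2 query heads. A shape-level sketch of the head mapping (contiguous grouping is the usual convention, assumed here):

```python
HEADS, KV_HEADS = 8, 4  # from the record

def kv_head_for(q_head: int) -> int:
    # Each KV head serves heads // kv_heads consecutive query heads.
    return q_head // (HEADS // KV_HEADS)

def expand_kv(kv):
    # Repeat each KV head so shapes line up with the query heads,
    # e.g. [k0, k1, k2, k3] -> [k0, k0, k1, k1, k2, k2, k3, k3].
    return [kv[kv_head_for(h)] for h in range(HEADS)]
```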
XSA
XSA attention used across all layers.
parameters: {"layers":11}
VE128
VE128 enabled in later layers.
parameters: {"layers":[9,10]}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":"16/64"}
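The "16/64" can be read as rotating only the first 16 dimensions of each 64-dim head and passing the rest through unrotated. A sketch under that assumption (base 10000 is the conventional RoPE default, not stated in the record):

```python
import math

ROT_DIMS, HEAD_DIM = 16, 64  # "16/64" from the record

def partial_rope(vec, pos, base=10000.0):
    # Rotate pairs within the first ROT_DIMS dims; leave the tail as-is.
    out = list(vec)
    for i in range(0, ROT_DIMS, 2):
        theta = pos / (base ** (i / ROT_DIMS))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```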
Quantization
QAT
bits: 6
scope: all
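The contributions list names soft-round QAT with an alpha ramp from 1 to 16. The standard soft-round surrogate (Agustsson & Theis) sharpens toward hard rounding as alpha grows; the linear ramp shape below is an assumption:

```python
import math

def soft_round(x: float, alpha: float) -> float:
    # Differentiable surrogate for round(): exact at integers, and
    # approaching a hard step between them as alpha increases.
    m = math.floor(x)
    r = x - m - 0.5
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5

def alpha_at(progress: float) -> float:
    # Ramp from 1 to 16 over training (progress in [0, 1]); whether
    # the ramp is linear is an assumption.
    return 1.0 + 15.0 * progress
```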
GPTQ
bits: 6
scope: all
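Full-Hessian GPTQ is too involved to reproduce here, but the int6 grid it rounds onto can be sketched. Real GPTQ additionally compensates each column's rounding error through the inverse Hessian; this sketch shows only the symmetric quantization grid:

```python
def quantize_int6(weights):
    # Per-tensor symmetric int6: levels in [-32, 31].
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```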
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
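A sketch of maintaining both averages with the record's hyperparameters (decay 0.997, SWA snapshot every 50 steps); scalar weights keep it minimal, and how the two averages are finally combined is not stated in the record:

```python
class AveragedWeights:
    # EMA updated every step plus an SWA running mean snapshotted
    # every `swa_every` steps.
    def __init__(self, w0: float, ema_decay: float = 0.997, swa_every: int = 50):
        self.ema = w0
        self.decay = ema_decay
        self.swa_every = swa_every
        self.swa_sum = 0.0
        self.swa_n = 0
        self.step = 0

    def update(self, w: float):
        self.step += 1
        self.ema = self.decay * self.ema + (1.0 - self.decay) * w
        if self.step % self.swa_every == 0:
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self) -> float:
        return self.swa_sum / max(self.swa_n, 1)
```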
Compression
Brotli
level: 11
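The contributions list pairs Brotli-11 with a byte-shuffle. A Blosc-style shuffle groups same-significance bytes of fixed-width values together so the entropy coder sees longer runs; the sketch below shows the shuffle only (the Brotli call itself, e.g. at quality 11, is omitted):

```python
def byte_shuffle(data: bytes, width: int) -> bytes:
    # Transpose bytes of `width`-byte elements: all byte-0s first,
    # then all byte-1s, etc. len(data) must be a multiple of width.
    n = len(data) // width
    return bytes(data[i * width + b] for b in range(width) for i in range(n))

def byte_unshuffle(data: bytes, width: int) -> bytes:
    # Exact inverse of byte_shuffle.
    n = len(data) // width
    return bytes(data[b * n + i] for i in range(n) for b in range(width))
```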
Evaluation
online n-gram agreement
parameters: {"experts":3,"causal":true,"score_first":true,"normalized":true}
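One way to read this entry: three causal n-gram "experts" (n = 1, 2, 3) each predict the next token from counts seen so far, scoring begins at the first position where any expert can predict, and the score is normalized over the experts that are live at each position. The scoring scheme below is an interpretation, not the PR's code:

```python
from collections import defaultdict

class NGramExpert:
    # Causal n-gram counter: predicts the most frequent continuation
    # of the trailing (n-1)-token context seen so far.
    def __init__(self, n: int):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def _ctx(self, history):
        return tuple(history[-(self.n - 1):]) if self.n > 1 else ()

    def predict(self, history):
        nxt = self.counts.get(self._ctx(history))
        return max(nxt, key=nxt.get) if nxt else None

    def observe(self, history, token):
        self.counts[self._ctx(history)][token] += 1

def agreement(tokens, experts):
    # Online eval: predict before observing (causal), score a position
    # by the fraction of live experts matching the true token.
    hits = total = 0
    hist = []
    for t in tokens:
        live = [p for p in (e.predict(hist) for e in experts) if p is not None]
        if live:
            hits += sum(p == t for p in live) / len(live)
            total += 1
        for e in experts:
            e.observe(hist, t)
        hist.append(t)
    return hits / total if total else 0.0
```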
sliding window eval
parameters: {"stride":64}
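Sliding-window eval with stride 64 can be sketched as overlapping windows where only the trailing `stride` tokens of each window are scored (so every token is predicted with near-full left context); the scored-suffix convention is the common one but is an assumption here:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    # Returns (window_start, window_end, scored_from) triples; the
    # first window scores all its tokens, later ones only the new tail.
    spans = []
    start = 0
    while start + window <= n_tokens:
        lo = start + window - stride if start > 0 else start
        spans.append((start, start + window, lo))
        start += stride
    return spans
```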
LR Schedule
split-LR
parameters: {"early":0.025,"late":0.03,"bank_split":5}
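Reading `bank_split: 5` as the boundary layer index, split-LR just assigns one rate to layers below the split and another above; a real run would hand these two groups to the optimizer (Parallel Muon here) as parameter groups:

```python
EARLY_LR, LATE_LR, BANK_SPLIT = 0.025, 0.03, 5  # from the record

def lr_for_layer(layer_idx: int) -> float:
    # Layers [0, BANK_SPLIT) use the early rate, the rest the late rate
    # (which side the boundary layer falls on is an assumption).
    return EARLY_LR if layer_idx < BANK_SPLIT else LATE_LR
```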
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
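The 1/sqrt(layer+1) scale, applied per layer so deeper layers start with smaller residual contributions (whether it is an init or a fixed multiplier is not stated; an init is assumed here):

```python
import math

def ln_scale_init(layer_idx: int) -> float:
    # LayerNorm gain scaled by 1/sqrt(layer+1), per the record.
    return 1.0 / math.sqrt(layer_idx + 1)
```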

Novel Contributions

  • Split-LR training with different early and late layer learning rates
  • BigramHash widening to 2816 x 160
  • Sigmoid-gated U-Net skip connections
  • Soft-round QAT with alpha ramp from 1 to 16
  • Brotli-11 plus byte-shuffle artifact compression
  • Coprime-stride data loader
  • Online n-gram agreement evaluation with three causal experts
  • Properly normalized exponential tilting for probability adjustment
  • Full Hessian GPTQ int6 quantization
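The coprime-stride loader above relies on a basic fact: if the stride is coprime with the dataset length, repeatedly stepping by it visits every starting offset exactly once before the cycle repeats. A sketch (the record does not say how the stride is chosen; searching upward from a desired value is an assumption):

```python
import math

def coprime_stride(dataset_len: int, desired: int) -> int:
    # Smallest stride >= `desired` that is coprime with the dataset
    # length, guaranteeing a full permutation of start offsets.
    s = desired
    while math.gcd(s, dataset_len) != 1:
        s += 1
    return s
```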