PR #1086
openTrack A: 11L U-Net + BigramHash + SmearGate + Partial RoPE + QAT (1.1349 bpb)
by Omrigotlieb
val_bpb
1.1349
Architecture
Transformer
Optimizer
Muon
Artifact Size
16.33MB
Training Techniques
Architecture
U-Net skip connections
11-layer U-Net transformer with encoder-decoder skip connections
parameters: {"layers":11,"encoder_layers":5,"decoder_layers":6}
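A minimal sketch of the 5-encoder / 6-decoder skip wiring. The LIFO pairing (encoder layer i feeds decoder layer last-i), the learned skip scalars, and the model width are assumptions; the card only gives the layer counts.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # model width is illustrative; the card does not state it

def block(x, w):
    # stand-in for a full transformer block: a simple residual nonlinearity
    return x + np.tanh(x @ w)

enc_w = [rng.normal(0.0, 0.02, (D, D)) for _ in range(5)]   # 5 encoder layers
dec_w = [rng.normal(0.0, 0.02, (D, D)) for _ in range(6)]   # 6 decoder layers
skip_gain = [1.0] * 5  # learned scalars in the real model (init assumed)

def unet_forward(x):
    skips = []
    for w in enc_w:                    # encoder: run and remember activations
        x = block(x, w)
        skips.append(x)
    x = block(x, dec_w[0])             # extra decoder layer with no skip partner
    for g, w in zip(skip_gain, dec_w[1:]):
        x = x + g * skips.pop()        # LIFO pairing: enc i <-> dec (last-i)
        x = block(x, w)
    return x

out = unet_forward(rng.normal(size=(4, D)))
```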
GQA
Grouped query attention with fewer KV heads than query heads
parameters: {"query_heads":8,"kv_heads":4,"head_dim":64}
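With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A NumPy sketch of the grouping (batch dimension omitted):

```python
import numpy as np

T, Hq, Hkv, Dh = 6, 8, 4, 64   # seq len illustrative; head counts from the card
rng = np.random.default_rng(1)
q = rng.normal(size=(Hq, T, Dh))
k = rng.normal(size=(Hkv, T, Dh))
v = rng.normal(size=(Hkv, T, Dh))

def gqa(q, k, v):
    group = Hq // Hkv                       # 2 query heads share each KV head
    k = np.repeat(k, group, axis=0)         # (Hq, T, Dh)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(Dh)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # softmax over keys
    w /= w.sum(-1, keepdims=True)
    return w @ v

out = gqa(q, k, v)
```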
MLP3x
MLP with 3x expansion
parameters: {"expansion":3}
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"slope":0.5}
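A sketch of the MLP with 3x expansion and the LeakyReLU-squared activation. Whether the square preserves the sign on the negative branch is an assumption (the sign-preserving form keeps the activation monotone); the width is illustrative.

```python
import numpy as np

D = 64  # illustrative width
rng = np.random.default_rng(2)
W_in = rng.normal(0.0, 0.02, (D, 3 * D))    # 3x expansion
W_out = rng.normal(0.0, 0.02, (3 * D, D))

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then a sign-preserving square
    # (sign preservation is an assumption; the card only names the activation)
    y = np.where(x > 0.0, x, slope * x)
    return np.sign(y) * y * y

def mlp(x):
    return leaky_relu_sq(x @ W_in) @ W_out

out = mlp(rng.normal(size=(4, D)))
```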
BigramHash
BigramHash embeddings with projection
parameters: {"buckets":8192,"projection_dim":128}
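A sketch of hashed bigram embeddings: the (previous, current) token pair is hashed into one of 8192 buckets, looked up in a 128-dim table, and projected to model width. The exact hash function, the position-0 padding, and the model width are assumptions.

```python
import numpy as np

BUCKETS, PROJ, D = 8192, 128, 64   # D is illustrative; buckets/projection from the card
rng = np.random.default_rng(3)
bigram_table = rng.normal(0.0, 0.02, (BUCKETS, PROJ))
proj = rng.normal(0.0, 0.02, (PROJ, D))

def bigram_bucket(prev_tok, tok):
    # hash the (previous, current) token pair into a bucket;
    # the multiplier and this exact hash are assumptions
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embed(tokens):
    prev = np.concatenate([[0], tokens[:-1]])    # padding at position 0 (assumed)
    idx = bigram_bucket(prev, tokens)
    return bigram_table[idx] @ proj              # (T, D), added to token embeddings

emb = bigram_embed(np.array([5, 17, 5, 17]))
```

Identical bigrams collide into the same bucket by construction, so repeated pairs share an embedding row.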
SmearGate
Learned previous-token blend after embedding normalization
parameters: null
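A sketch of the smear gate: after embedding normalization, each position blends in its predecessor through a learned sigmoid gate. The scalar gate and the additive form are assumptions; the card only says "learned previous-token blend".

```python
import numpy as np

def smear_gate(x, gate_logit=-2.0):
    # x_t <- x_t + sigmoid(g) * x_{t-1}, with g learned
    # (scalar gate and additive blend are assumptions)
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                    # position 0 has no previous token
    return x + g * prev

x = np.eye(3, 4)
y = smear_gate(x)
```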
XSA
Exclusive self-attention applied in the last 4 layers
parameters: {"layers":4}
Partial RoPE
Only part of the head dimension uses rotary position embeddings
parameters: {"rotated_dims":16,"total_dims":64}
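Only the first 16 of the 64 head dimensions get the rotary transform; the remaining 48 pass through untouched. A sketch for a single head (the split-half rotation layout and frequency base are assumptions):

```python
import numpy as np

ROT, DH = 16, 64   # rotate only the first 16 of 64 head dims

def partial_rope(x, base=10000.0):
    # x: (T, DH); rotary embedding on dims [0, ROT), identity on the rest
    T = x.shape[0]
    half = ROT // 2
    inv_freq = base ** (-np.arange(half) / half)   # frequency spacing assumed
    ang = np.outer(np.arange(T), inv_freq)         # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, ROT:]], axis=1)

rng = np.random.default_rng(6)
x = rng.normal(size=(5, DH))
out = partial_rope(x)
```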
vocab_bias
Learned per-token logit prior
parameters: null
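The per-token logit prior is just a learned bias added to the output logits, letting the model absorb unigram frequency without spending capacity in the unembedding matrix. A sketch (sizes illustrative, zero init assumed):

```python
import numpy as np

V, D = 1000, 64  # illustrative sizes
rng = np.random.default_rng(4)
W_unembed = rng.normal(0.0, 0.02, (D, V))
vocab_bias = np.zeros(V)   # learned per-token prior; zero init is an assumption

def logits(x):
    # bias shifts every position's distribution toward the corpus unigram prior
    return x @ W_unembed + vocab_bias

out = logits(rng.normal(size=(3, D)))
```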
Regularization
z-loss
parameters: {"weight":0.0001}
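z-loss penalizes the squared log-partition function of the logits, keeping them well-scaled (which also helps later quantization). A minimal sketch with the card's weight of 1e-4:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    # z = logsumexp over the vocab; penalizing z^2 discourages logit drift
    m = logits.max(axis=-1, keepdims=True)
    z = np.log(np.exp(logits - m).sum(axis=-1)) + m.squeeze(-1)
    return weight * np.mean(z ** 2)
```

For uniform logits over 4 classes, z = log 4 at every position, so the penalty is `1e-4 * (log 4)^2`.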
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_momentum_end":0.99,"warmup_steps":1500,"adam_for":"embeddings/scalars"}
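The momentum warmup ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. Linear interpolation is an assumption; the card gives only the endpoints and the step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # ramp momentum over the first warmup_steps, then hold at the end value
    # (linear shape is an assumption)
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```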
Weight Averaging
EMA
parameters: {"decay":0.997}
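The EMA keeps a running average of the weights with decay 0.997; the averaged copy is what gets evaluated and shipped. A one-step sketch:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    # shadow average of all weights, updated once per optimizer step
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

avg = {"w": np.zeros(2)}
avg = ema_update(avg, {"w": np.ones(2)})
```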
Quantization
late QAT
bits: 6
scope: last 15% of warmdown
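During the last 15% of warmdown, the forward pass sees quantize-dequantized weights so the network adapts to 6-bit rounding before the real quantization. A sketch of the fake-quant forward (the straight-through backward and the per-tensor scale choice are assumptions):

```python
import numpy as np

def fake_quant(w, bits=6):
    # forward: round to the int6 grid and map back to floats;
    # backward would pass gradients straight through (not shown)
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.round(w / scale) * scale

w = np.array([0.5, -1.0, 0.25])
wq = fake_quant(w)
```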
mixed int6/int8
bits: null
scope: embeddings, MLP, attention
GPTQ-lite
bits: null
scope: per-row
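A sketch of the per-row quantization step: one symmetric scale per output row, with int6 values stored in int8 containers. Full GPTQ additionally compensates rounding error with second-order weight updates, which this omits.

```python
import numpy as np

def quantize_rows(W, bits):
    # symmetric per-row quantization: one scale per output row
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / qmax, 1e-12)
    q = np.round(W / scale).astype(np.int8)   # int6 values fit in int8 storage
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(5)
W = rng.normal(size=(4, 16))
q6, s6 = quantize_rows(W, 6)
```

Mixing bit widths then amounts to calling `quantize_rows` with 6 or 8 depending on the tensor's sensitivity.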
Compression
zstd
level: 22
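Level 22 is zstd's maximum ratio setting and requires the `--ultra` flag on the command line. A hypothetical invocation (the real artifact filename is not stated):

```shell
# levels above 19 need --ultra; filename is hypothetical
zstd --ultra -22 checkpoint.bin -o checkpoint.bin.zst
```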
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"steps":3500,"wallclock_adaptive":true}
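The warmdown holds the learning rate constant, then decays it linearly to zero over the final 3500 steps; the wallclock-adaptive variant would rescale that window at runtime. Linear decay is an assumption; the card gives only the step count.

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    # constant LR, then linear decay to 0 over the last warmdown_steps
    # (decay shape assumed; wallclock-adaptive rescaling not shown)
    into = step - (total_steps - warmdown_steps)
    if into <= 0:
        return base_lr
    return base_lr * (1.0 - into / warmdown_steps)
```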
Novel Contributions
- 11-layer U-Net transformer with skip connections
- BigramHash embeddings
- SmearGate token blending
- Partial RoPE
- Exclusive self-attention in the last 4 layers
- Mixed int6/int8 GPTQ-lite quantization
- Late QAT during warmdown
- Muon optimizer with EMA