PR #376

closed

Record: 11L Next-Gen Stack + Custom Kernels, val_bpb=1.1399

by anthony-maio
val_bpb
1.1399
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79MB

Training Techniques

Architecture
MLP3x
3x expansion MLP with ReLU² activation
parameters: {"hidden":1536}
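The MLP3x entry above can be sketched as follows. This is a minimal NumPy illustration of a 3x-expansion MLP with squared-ReLU activation, assuming d_model = 512 so that hidden = 1536 matches the listed parameters; the weight-scaling choices are illustrative, not the PR's.

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # 3x expansion MLP: d_model -> 3*d_model -> d_model
    return relu2(x @ w_in) @ w_out

d_model, hidden = 512, 1536  # hidden = 3 * d_model, matching {"hidden":1536}
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
w_out = rng.standard_normal((hidden, d_model)) * hidden ** -0.5
y = mlp3x(x, w_in, w_out)  # (4, 512)
```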
XSA
Exclusive Self Attention applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimension with NTK-aware scaling
parameters: {"rope_dims":16,"total_dims":64,"base":50000}
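A sketch of partial RoPE with the listed parameters: only 16 of the 64 head dimensions are rotated, the rest pass through unchanged. Treating base=50000 as the (NTK-scaled) rotary base is an assumption about how the PR applies the scaling.

```python
import numpy as np

def partial_rope(q, pos, rope_dims=16, base=50000.0):
    # Rotate only the first `rope_dims` channels of the head dim;
    # the remaining channels are passed through untouched.
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = pos[:, None] * inv_freq[None, :]        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[..., :half], q[..., half:rope_dims]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           q[..., rope_dims:]], axis=-1)

T, head_dim = 8, 64
q = np.ones((T, head_dim))
out = partial_rope(q, np.arange(T))  # dims 16..63 stay equal to 1.0
```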
SmearGate
Learned sigmoid token blending gate
parameters: {"parameters":512}
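One plausible reading of SmearGate, sketched below: a learned per-channel sigmoid gate that blends each token with the previous one (512 gate parameters would match one per model channel at d_model = 512). The convex-blend form is a guess at the exact gating equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # Blend each token with its predecessor via a learned sigmoid gate.
    # Gate parameterization is an assumption; the PR lists 512 gate params.
    g = sigmoid(gate_logits)                         # (d_model,)
    prev = np.concatenate([x[:1], x[:-1]], axis=0)   # shift right; first token kept
    return (1.0 - g) * x + g * prev

x = np.arange(12.0).reshape(4, 3)
y = smear_gate(x, np.full(3, -100.0))  # gate ~ 0: output ~ input
```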
BigramHash
Hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
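The BigramHash entry can be sketched as a hashed embedding table: each (previous, current) token pair is hashed into one of 2048 buckets and looked up in a 128-dim table. The hash constants and the zero-padding of the first position are illustrative, not taken from the PR.

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # matches {"buckets":2048,"dim":128}
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM))

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Cheap multiplicative hash of the (prev, current) token pair.
    # Constants are illustrative, not the PR's.
    return (prev_tok * 1000003 + tok * 8191) % buckets

def bigram_features(tokens):
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])  # pad first position with token 0
    return table[bigram_bucket(prev, tokens)]  # (T, 128) pair features

feats = bigram_features([5, 7, 5, 7])
```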
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
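With 8 attention heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the KV-head sharing (here implemented by repeating KV heads to full width, one common realization of GQA):

```python
import numpy as np

def repeat_kv(kv, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: each KV head is shared by
    # n_heads // n_kv_heads query heads.
    reps = n_heads // n_kv_heads
    return np.repeat(kv, reps, axis=0)  # (n_kv, T, d) -> (n_heads, T, d)

k = np.random.default_rng(0).standard_normal((4, 16, 64))
k_full = repeat_kv(k)  # (8, 16, 64)
```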
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_end":0.99,"warmup_steps":1500}
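The warmup parameters above read as a Muon momentum warmup from 0.92 to 0.99 over 1500 steps. A sketch, assuming linear interpolation (the interpolation shape is a guess):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Warm momentum from `start` to `end` over `warmup_steps`, then hold.
    # Linear shape is an assumption; the PR only lists the endpoints.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```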
Weight Averaging
SWA
parameters: {"checkpoint_average":7,"scale_threshold":0.2}
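The SWA entry averages 7 checkpoints per the listed parameters. A minimal sketch of uniform checkpoint averaging; how scale_threshold (0.2) gates which tensors are averaged is not specified here, so it is omitted:

```python
import numpy as np

def swa_average(checkpoints):
    # Uniform average of a list of state dicts (7 checkpoints in the PR).
    # scale_threshold handling is unspecified and left out of this sketch.
    return {name: np.mean([c[name] for c in checkpoints], axis=0)
            for name in checkpoints[0]}

c0 = {"w": np.zeros((2, 2))}
c1 = {"w": np.full((2, 2), 2.0)}
avg = swa_average([c0, c1])  # elementwise mean
```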
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
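Sliding-window evaluation with stride 64 can be sketched as follows: each 2048-token window scores only its last 64 tokens (the first window scores everything), so every token is evaluated with near-full left context. The exact windowing is an assumption consistent with the listed stride.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Return (start, end, score_from) spans covering all tokens exactly once.
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(2176)
```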
Initialization
OrthoInit
Orthogonal initialization with muP scaling
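A sketch of orthogonal initialization via QR decomposition; the `gain` hook stands in for the muP scaling, whose exact rule (e.g. gain ∝ 1/√fan_in) is an assumption here:

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    # Orthogonal init via QR of a Gaussian matrix.
    # `gain` is where a muP-style scale would go; the exact rule is assumed.
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)               # orthonormal columns
    w = q[:rows, :cols] if rows >= cols else q.T
    return gain * w

W = ortho_init((8, 8))   # W.T @ W == I
Wr = ortho_init((4, 8))  # orthonormal rows
```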
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmup_start":0.92,"warmup_end":0.99}
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
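The layerwise LN scale formula above, 1/sqrt(layer_idx+1), damps the residual contribution of deeper layers. Directly:

```python
def ln_scale(layer_idx):
    # Per-layer LayerNorm output scale: 1/sqrt(layer_idx + 1).
    # Layer 0 -> 1.0, layer 3 -> 0.5, layer 10 -> ~0.302.
    return (layer_idx + 1) ** -0.5
```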
Quantization
int5
bits: 5
scope: mixed precision weights
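A minimal sketch of symmetric int5 weight quantization (31 levels in [-15, 15]). Per-tensor scaling is an assumption; the PR's mixed-precision scoping, QAT/STE pass, and GPTQ-lite clip search are not reproduced here.

```python
import numpy as np

def int5_quantize(w):
    # Symmetric per-tensor int5: q in [-15, 15]; stored in int8 for convenience.
    # Scale granularity is an assumption, not the PR's exact scheme.
    qmax = 2 ** (5 - 1) - 1                              # 15
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 11)
q, s = int5_quantize(w)  # round-trip error bounded by scale/2
```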

Novel Contributions

  • 11-layer transformer with a competitive stack achieving 1.1399 val_bpb
  • Exclusive Self Attention on the last 4 layers
  • Partial RoPE with NTK-aware base scaling
  • SmearGate learned token blending
  • BigramHash token-pair feature embedding
  • Int5 mixed precision with late QAT STE
  • GPTQ-lite clip search during compression
  • Muon optimizer with custom warmup schedule
  • Tight SWA checkpoint averaging
  • Custom Triton/CUDA kernel pipeline for future speedups