PR #376

closed

Record: 11L Next-Gen Stack + Custom Kernels, val_bpb=1.1399

by anthony-maio
val_bpb
1.1399
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79MB

Training Techniques

Architecture
MLP3x
3x expansion MLP with ReLU² activation
parameters: {"hidden":1536}
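The MLP3x entry above can be sketched as follows. This is a minimal NumPy illustration of a 3x-expansion MLP with squared-ReLU activation, assuming d_model = 512 so that hidden = 1536 matches the listed parameters; the weight-scaling choices are illustrative, not the PR's.

```python
import numpy as np

def relu2(x):
    # Squared ReLU: max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # 3x expansion MLP: d_model -> 3*d_model -> d_model
    return relu2(x @ w_in) @ w_out

d_model, hidden = 512, 1536  # hidden = 3 * d_model, matching {"hidden":1536}
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w_in = rng.standard_normal((d_model, hidden)) * d_model ** -0.5
w_out = rng.standard_normal((hidden, d_model)) * hidden ** -0.5
y = mlp3x(x, w_in, w_out)  # (4, 512)
```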
XSA
Exclusive Self Attention applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimension with NTK-aware scaling
parameters: {"rope_dims":16,"total_dims":64,"base":50000}
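A sketch of partial RoPE with the listed parameters: only 16 of the 64 head dimensions are rotated, the rest pass through unchanged. Treating base=50000 as the (NTK-scaled) rotary base is an assumption about how the PR applies the scaling.

```python
import numpy as np

def partial_rope(q, pos, rope_dims=16, base=50000.0):
    # Rotate only the first `rope_dims` channels of the head dim;
    # the remaining channels are passed through untouched.
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = pos[:, None] * inv_freq[None, :]        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[..., :half], q[..., half:rope_dims]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           q[..., rope_dims:]], axis=-1)

T, head_dim = 8, 64
q = np.ones((T, head_dim))
out = partial_rope(q, np.arange(T))  # dims 16..63 stay equal to 1.0
```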
SmearGate
Learned sigmoid token blending gate
parameters: {"parameters":512}
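One plausible reading of SmearGate, sketched below: a learned per-channel sigmoid gate that blends each token with the previous one (512 gate parameters would match one per model channel at d_model = 512). The convex-blend form is a guess at the exact gating equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    # Blend each token with its predecessor via a learned sigmoid gate.
    # Gate parameterization is an assumption; the PR lists 512 gate params.
    g = sigmoid(gate_logits)                         # (d_model,)
    prev = np.concatenate([x[:1], x[:-1]], axis=0)   # shift right; first token kept
    return (1.0 - g) * x + g * prev

x = np.arange(12.0).reshape(4, 3)
y = smear_gate(x, np.full(3, -100.0))  # gate ~ 0: output ~ input
```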
BigramHash
Hash embedding for token-pair features
parameters: {"buckets":2048,"dim":128}
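The BigramHash entry can be sketched as a hashed embedding table: each (previous, current) token pair is hashed into one of 2048 buckets and looked up in a 128-dim table. The hash constants and the zero-padding of the first position are illustrative, not taken from the PR.

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # matches {"buckets":2048,"dim":128}
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM))

def bigram_bucket(prev_tok, tok, buckets=BUCKETS):
    # Cheap multiplicative hash of the (prev, current) token pair.
    # Constants are illustrative, not the PR's.
    return (prev_tok * 1000003 + tok * 8191) % buckets

def bigram_features(tokens):
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])  # pad first position with token 0
    return table[bigram_bucket(prev, tokens)]  # (T, 128) pair features

feats = bigram_features([5, 7, 5, 7])
```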
KV head count
Grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
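With 8 attention heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the KV-head sharing (here implemented by repeating KV heads to full width, one common realization of GQA):

```python
import numpy as np

def repeat_kv(kv, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: each KV head is shared by
    # n_heads // n_kv_heads query heads.
    reps = n_heads // n_kv_heads
    return np.repeat(kv, reps, axis=0)  # (n_kv, T, d) -> (n_heads, T, d)

k = np.random.default_rng(0).standard_normal((4, 16, 64))
k_full = repeat_kv(k)  # (8, 16, 64)
```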
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_end":0.99,"warmup_steps":1500}
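The warmup parameters above read as a Muon momentum warmup from 0.92 to 0.99 over 1500 steps. A sketch, assuming linear interpolation (the interpolation shape is a guess):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Warm momentum from `start` to `end` over `warmup_steps`, then hold.
    # Linear shape is an assumption; the PR only lists the endpoints.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```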
Weight Averaging
SWA
parameters: {"checkpoint_average":7,"scale_threshold":0.2}
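The SWA entry averages 7 checkpoints per the listed parameters. A minimal sketch of uniform checkpoint averaging; how scale_threshold (0.2) gates which tensors are averaged is not specified here, so it is omitted:

```python
import numpy as np

def swa_average(checkpoints):
    # Uniform average of a list of state dicts (7 checkpoints in the PR).
    # scale_threshold handling is unspecified and left out of this sketch.
    return {name: np.mean([c[name] for c in checkpoints], axis=0)
            for name in checkpoints[0]}

c0 = {"w": np.zeros((2, 2))}
c1 = {"w": np.full((2, 2), 2.0)}
avg = swa_average([c0, c1])  # elementwise mean
```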
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
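Sliding-window evaluation with stride 64 can be sketched as follows: each 2048-token window scores only its last 64 tokens (the first window scores everything), so every token is evaluated with near-full left context. The exact windowing is an assumption consistent with the listed stride.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Return (start, end, score_from) spans covering all tokens exactly once.
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(2176)
```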
Initialization
OrthoInit
Orthogonal initialization with muP scaling
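A sketch of orthogonal initialization via QR decomposition; the `gain` hook stands in for the muP scaling, whose exact rule (e.g. gain ∝ 1/√fan_in) is an assumption here:

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    # Orthogonal init via QR of a Gaussian matrix.
    # `gain` is where a muP-style scale would go; the exact rule is assumed.
    rng = rng or np.random.default_rng(0)
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)               # orthonormal columns
    w = q[:rows, :cols] if rows >= cols else q.T
    return gain * w

W = ortho_init((8, 8))   # W.T @ W == I
Wr = ortho_init((4, 8))  # orthonormal rows
```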
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmup_start":0.92,"warmup_end":0.99}
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer_idx+1)"}
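The layerwise LN scale formula above, 1/sqrt(layer_idx+1), damps the residual contribution of deeper layers. Directly:

```python
def ln_scale(layer_idx):
    # Per-layer LayerNorm output scale: 1/sqrt(layer_idx + 1).
    # Layer 0 -> 1.0, layer 3 -> 0.5, layer 10 -> ~0.302.
    return (layer_idx + 1) ** -0.5
```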
Quantization
int5
bits: 5
scope: mixed precision weights
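A minimal sketch of symmetric int5 weight quantization (31 levels in [-15, 15]). Per-tensor scaling is an assumption; the PR's mixed-precision scoping, QAT/STE pass, and GPTQ-lite clip search are not reproduced here.

```python
import numpy as np

def int5_quantize(w):
    # Symmetric per-tensor int5: q in [-15, 15]; stored in int8 for convenience.
    # Scale granularity is an assumption, not the PR's exact scheme.
    qmax = 2 ** (5 - 1) - 1                              # 15
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 11)
q, s = int5_quantize(w)  # round-trip error bounded by scale/2
```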

Novel Contributions

  • 11-layer transformer with a competitive stack achieving 1.1399 val_bpb
  • Exclusive Self Attention on the last 4 layers
  • Partial RoPE with NTK-aware base scaling
  • SmearGate learned token blending
  • BigramHash token-pair feature embedding
  • Int5 mixed precision with late QAT STE
  • GPTQ-lite clip search during compression
  • Muon optimizer with custom warmup schedule
  • Tight SWA checkpoint averaging
  • Custom Triton/CUDA kernel pipeline for future speedups