PR #251

open

Add SP4096 11L432 MLP3x Int6+Zstd Momentum99 record (val_bpb=1.1596)

by kshitizz36

val_bpb

1.1596

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.3MB

Training Techniques

Architecture

MLP3x

Increased MLP expansion from 2x to 3x to add model capacity.

parameters: {"mlp_mult":3}
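As a rough illustration of what the 2x to 3x change costs in weights, the sketch below counts MLP weights per layer at the dim of 432 listed under Novel Contributions, and defines the relu^2 activation named there. It assumes a plain two-matrix MLP (up-projection, then down-projection) without biases; the function names are hypothetical and not taken from the PR.

```python
def relu_squared(x: float) -> float:
    """relu^2 activation: clamp negatives to zero, then square."""
    return max(x, 0.0) ** 2


def mlp_weight_count(dim: int, mlp_mult: int) -> int:
    """Weights in one MLP block, assuming two bias-free matrices:
    dim -> mlp_mult*dim (up) and mlp_mult*dim -> dim (down)."""
    hidden = mlp_mult * dim
    return dim * hidden + hidden * dim


DIM = 432  # model width from the record description

per_layer_2x = mlp_weight_count(DIM, 2)  # 746,496 weights
per_layer_3x = mlp_weight_count(DIM, 3)  # 1,119,744 weights

print(per_layer_2x, per_layer_3x, per_layer_3x / per_layer_2x)
```

Under these assumptions the MLP weights per layer grow by exactly 1.5x, which is the extra capacity that the int6 quantization and zstd compression then have to fit into the artifact budget.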

tied embeddings

Ties the input embedding and output projection weights. The embedding matrix is kept in fp16 and passed through unquantized during int6 quantization.

parameters: null

KV head count

Uses grouped-query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
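To make the 8-head / 4-KV-head layout concrete, here is a minimal sketch of how query heads could share KV heads. It assumes contiguous grouping (adjacent query heads share one KV head) and a head dimension of dim / heads with dim = 432; both are common conventions, not details confirmed by the PR, and the function name is hypothetical.

```python
def kv_head_for_query_head(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Index of the KV head that a given query head attends with,
    assuming contiguous groups of heads // kv_heads query heads."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


DIM = 432                       # model width from the record description
HEADS, KV_HEADS = 8, 4
head_dim = DIM // HEADS         # assumed: 54 channels per head
kv_width = KV_HEADS * head_dim  # width of each of the K and V projections

mapping = [kv_head_for_query_head(h, HEADS, KV_HEADS) for h in range(HEADS)]
print(mapping, head_dim, kv_width)
```

With these assumptions each KV head serves two query heads, and the K and V projections are half as wide as the query projection (216 versus 432 output channels).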

Optimizer

Muon

weight_decay: 0.02

momentum: 0.99

other_params: null
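One way to read the momentum value: assuming a conventional momentum buffer of the form buf = momentum * buf + grad (an assumption about the update rule, not something the PR states), a gradient from k steps ago is weighted by momentum^k. The sketch below computes how many steps it takes for that weight to fall to one half.

```python
import math


def momentum_half_life(momentum: float) -> float:
    """Steps until a past gradient's weight in the buffer decays to 0.5,
    assuming buf = momentum * buf + grad."""
    if not 0.0 < momentum < 1.0:
        raise ValueError("momentum must be strictly between 0 and 1")
    return math.log(0.5) / math.log(momentum)


half_life = momentum_half_life(0.99)
print(round(half_life, 1))
```

At momentum 0.99 the half-life is roughly 69 steps, so the update direction averages over a long window of recent gradients.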

Regularization

weight decay

parameters: {"muon_wd":0.02,"adam_wd":0.02}

Quantization

int6

bits: 6

scope: all except fp16 embeddings

fp16

bits: 16

scope: embeddings
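The int6 scheme above can be sketched as symmetric round-to-nearest quantization with one scale per weight row. The per-row granularity, the [-31, 31] integer range, and the function names are illustrative assumptions; the PR summary only states 6 bits for everything except the fp16 embeddings.

```python
def quantize_int6(row):
    """Quantize one row of floats to integers in [-31, 31] plus a float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    # Round to nearest integer, then clamp to guard against float overshoot.
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int6(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

The reconstruction error per weight is bounded by half the scale, which is why the widest-ranging tensors lose the most precision; keeping the embeddings in fp16 avoids that loss for the tied embedding matrix.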

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64}
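A minimal sketch of how sliding-window evaluation with stride 64 can be laid out, assuming the eval length of 1024 listed under Sequence Length: the first window scores all of its tokens, and each later window advances by the stride and scores only the newly added tokens. The generator name and tuple layout are hypothetical.

```python
def sliding_windows(num_tokens: int, window: int = 1024, stride: int = 64):
    """Yield (start, end, score_from) triples.

    The model sees tokens[start:end]; loss is counted only on
    tokens[score_from:end], so every token is scored exactly once.
    """
    if num_tokens <= 0:
        return
    end = min(window, num_tokens)
    yield 0, end, 0  # first window: score everything it contains
    while end < num_tokens:
        new_end = min(end + stride, num_tokens)
        start = max(0, new_end - window)
        yield start, new_end, end  # score only the new tokens
        end = new_end


windows = list(sliding_windows(2000))
print(len(windows), windows[0], windows[1], windows[-1])
```

In this layout every token scored after the first window has at least 960 tokens of preceding context, at the cost of roughly 16 forward passes per 1024 tokens of text.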

Sequence Length

sequence_length

train_length: 1024

eval_length: 1024

LR Schedule

warmdown

parameters: {"warmdown_steps":3000}
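A warmdown schedule of this kind can be sketched as a learning-rate multiplier that stays at 1.0 and then decays over the final 3000 steps. The linear shape, the decay to zero, and the `total_steps` argument are assumptions made for illustration; the PR summary gives only the warmdown length.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Constant at 1.0 until the last warmdown_steps, then linear decay to 0.
    """
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    remaining = max(total_steps - step, 0)
    return remaining / warmdown_steps


TOTAL_STEPS = 10_000  # hypothetical run length, not from the PR
for step in (0, 7000, 8500, 10_000):
    print(step, lr_multiplier(step, TOTAL_STEPS))
```

If a run is shorter than the warmdown length, this version starts below 1.0 and decays from there, which is one reasonable way to handle that edge case.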

Initialization

spectral init

Tied embeddings use overtone spectral initialization.

Novel Contributions

New SOTA validation score of 1.1596 bpb
11-layer SP-4096 Transformer with dim 432
3x MLP expansion with relu^2 activation
Muon optimizer momentum increased to 0.99
Int6 post-training quantization with zstd-22 compression
fp16 embedding passthrough to preserve embedding quality
Sliding-window evaluation with stride 64
Tied embeddings with overtone spectral initialization