PR #362

closed

Record: 11L Int6+Zstd MLP3x SmearGate BigramHash OrthoInit MuonWD EMA (mean val_bpb=1.1497)

by mkenney2
val_bpb: 1.1497
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.8MB

Training Techniques

Quantization
int6
bits: 6
scope: all
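The record does not describe the quantization scheme beyond "int6, all weights"; a minimal sketch, assuming symmetric per-tensor quantization (the function names and the choice of the symmetric range [-31, 31], which keeps zero exactly representable, are my assumptions):

```python
# Hypothetical sketch of symmetric per-tensor int6 quantization.
# A signed 6-bit integer spans [-32, 31]; clamping to the symmetric
# range [-31, 31] is an assumed design choice, not from the record.

def quantize_int6(weights):
    """Map a list of floats to 6-bit integer codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from codes and scale."""
    return [x * scale for x in q]
```

Packing the 6-bit codes (e.g. four codes into three bytes) before compression is what would realize the size savings; that step is omitted here.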
Architecture
MLP3x
Uses a 3x MLP expansion with a 1536-dimensional hidden layer.
parameters: {"mlp_multiplier":3,"hidden_dim":1536}
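Given the recorded parameters, the hidden dimension of 1536 implies a model dimension of 512 (3 x 512); a minimal sketch of such a block, assuming a ReLU activation (the actual activation is not recorded):

```python
import numpy as np

# Sketch of an MLP block with a 3x expansion. Model dim 512 is inferred
# from hidden_dim = 1536 and mlp_multiplier = 3 in the record.
D_MODEL, MLP_MULT = 512, 3
HIDDEN = D_MODEL * MLP_MULT  # 1536

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) * 0.02
w_out = rng.standard_normal((HIDDEN, D_MODEL)) * 0.02

def mlp3x(x):
    """x: (seq, d_model) -> (seq, d_model) through a 3x-wide hidden layer."""
    h = np.maximum(x @ w_in, 0.0)  # ReLU here; the real activation is unknown
    return h @ w_out
```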
SmearGate
Learned per-dimension gate blending each token with its predecessor.
parameters: null
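The description suggests something like the following sketch: a learned per-dimension gate (sigmoid of a parameter vector) mixes each token's activation with the previous token's. The parameter shape, the sigmoid, and the zero-padding of the first token are all assumptions:

```python
import numpy as np

# Hypothetical sketch of a "smear gate": per-dimension gate g in (0, 1)
# blends each token with its predecessor.

def smear_gate(x, gate_logits):
    """x: (seq, dim); gate_logits: (dim,) learned parameters."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid, per dimension
    # Predecessor of the first token is taken as zero (an assumption).
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return (1.0 - g) * x + g * prev
```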
BigramHash
Adds a 4096-bucket hash embedding for bigram context.
parameters: {"buckets":4096}
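A sketch of how a 4096-bucket bigram hash feature could be computed; the bucket index would then select a row of a learned (4096, d_model) embedding table added to the token embedding. The hash mixing constants and the BOS padding are my choices, not from the record:

```python
# Hypothetical sketch of a hashed bigram feature with 4096 buckets.
BUCKETS = 4096

def bigram_bucket(prev_tok, tok):
    """Hash a (previous token, token) pair into a bucket index."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF  # arbitrary mixing constant
    h ^= h >> 13
    return h % BUCKETS

def bigram_buckets(tokens, bos=0):
    """Bucket index per position, padding the first bigram with a BOS id."""
    prev = [bos] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```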
tied embeddings
Input and output embeddings are tied, with FP16 embeddings to avoid quantization degradation.
parameters: null
KV head count
Uses fewer KV heads than attention heads (grouped-query attention).
parameters: {"num_heads":8,"num_kv_heads":4}
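With 8 query heads and 4 KV heads, each KV head is shared by 8 / 4 = 2 query heads. A minimal sketch of the head expansion at attention time:

```python
import numpy as np

# Grouped-query attention head mapping: 8 query heads share 4 KV heads,
# so each KV head serves 8 // 4 = 2 query heads.
NUM_HEADS, NUM_KV_HEADS = 8, 4
GROUP = NUM_HEADS // NUM_KV_HEADS  # 2

def expand_kv(kv):
    """kv: (num_kv_heads, seq, head_dim) -> (num_heads, seq, head_dim)."""
    return np.repeat(kv, GROUP, axis=0)
```

Storing only 4 KV heads roughly halves the KV-projection parameters and KV cache relative to full multi-head attention.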
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
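The standard EMA update with the recorded decay of 0.997, sketched on flat parameter lists (the real implementation would update tensors in place):

```python
# Exponential moving average of weights with the recorded decay.
DECAY = 0.997

def ema_update(ema_params, params, decay=DECAY):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```

At eval time the EMA copy, not the raw training weights, would be evaluated and exported.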
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256}
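A sketch of sliding-window evaluation with stride 256: each token is scored exactly once, inside a context window that extends up to 2048 tokens back. How the record handles the window boundaries is not stated, so the span layout below is an assumption:

```python
# Hypothetical sketch of sliding-window evaluation spans.

def sliding_windows(n_tokens, window=2048, stride=256):
    """Yield (ctx_start, ctx_end, score_start) spans: score tokens in
    [score_start, ctx_end) given context [ctx_start, ctx_end)."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_end, score_start))
        score_start = score_end
    return spans
```

Compared with scoring disjoint 2048-token chunks, this gives most scored tokens far more left context at the cost of roughly window / stride = 8x more forward-pass compute.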
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
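A "warmdown" schedule typically holds the learning rate constant and then decays it linearly to zero over the final iterations; a sketch with the recorded 1200-iteration warmdown (the total iteration count is not in the record, so it is a parameter here):

```python
# Constant LR, then linear decay to zero over the last warmdown_iters steps.

def lr_at(it, total_iters, base_lr, warmdown_iters=1200):
    """Learning rate at iteration `it` of `total_iters`."""
    remaining = total_iters - it
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```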
Regularization
weight decay
parameters: {"weight_decay":0.02}
Initialization
OrthoInit
Orthogonal weight initialization with projection scaling.
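Orthogonal initialization usually takes the Q factor of a QR decomposition of a Gaussian matrix; a sketch, where the 1/sqrt(fan_in) projection scaling is my assumption since the record does not give the factor:

```python
import numpy as np

# Sketch of orthogonal init with projection scaling (scale factor assumed).

def ortho_init(fan_in, fan_out, rng=None):
    """Return a (fan_in, fan_out) matrix with orthonormal, scaled columns."""
    assert fan_in >= fan_out, "sketch assumes tall matrices"
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((fan_in, fan_out))
    q, r = np.linalg.qr(a)          # reduced QR: q is (fan_in, fan_out)
    q *= np.sign(np.diag(r))        # fix column signs for a uniform draw
    return q * (1.0 / np.sqrt(fan_in))
```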

Novel Contributions

  • 11-layer Transformer with 3x MLP expansion
  • Int6 quantization combined with zstd-22 compression to fit a larger model under the artifact limit
  • SmearGate token-to-predecessor blending mechanism
  • BigramHash 4096-bucket hash embedding for bigram context
  • OrthoInit orthogonal initialization
  • Muon optimizer with weight decay 0.02
  • EMA with decay 0.997
  • FP16 tied embeddings
  • Sliding-window evaluation with stride 256
  • Extensive ablation of AttnRes, depth recurrence, sequence-length curriculum, and TTT