PR #86

RECORD (closed)

Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)

by aruniyer
val_bpb: 1.1502
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB

Training Techniques

Architecture
MLP3x
Widened MLP expansion to 3x for more capacity per layer.
parameters: {"mlp_mult":3,"hidden":1536}
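A minimal sketch of the per-layer parameter cost of the 3x MLP expansion, assuming a standard two-projection MLP (up-projection then down-projection, biases omitted; the exact block layout is not given in this record):

```python
def mlp_params(hidden, mlp_mult):
    # Two projections: hidden -> mlp_mult*hidden, then back to hidden.
    inner = mlp_mult * hidden
    return hidden * inner + inner * hidden

# With hidden=1536 and mlp_mult=3 as in this record:
per_layer = mlp_params(1536, 3)  # 14,155,776 weights per MLP block
```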
tied embeddings
Uses tied input/output embeddings with FP16 export for the embedding/head.
parameters: {"vocab_size":1024}
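A rough sketch of what tying saves in the exported artifact, assuming the embedding dimension equals the hidden size of 1536 from the MLP entry (not stated explicitly here): tying shares one (vocab, d_model) matrix between input embedding and output head, so one full copy drops out of the FP16 export.

```python
def tied_embedding_savings(vocab_size, d_model, bytes_per_param=2):
    # One shared (vocab, d_model) matrix instead of two separate ones;
    # bytes_per_param=2 matches the FP16 export mentioned in the record.
    return vocab_size * d_model * bytes_per_param

saved = tied_embedding_savings(1024, 1536)  # 3,145,728 bytes (~3 MiB)
```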
KV head count
Uses grouped-query attention with 4 KV heads.
parameters: {"attention_heads":8,"kv_heads":4}
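With 8 query heads and 4 KV heads, grouped-query attention maps each contiguous pair of query heads onto one shared KV head. A minimal sketch of that mapping:

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    # Each KV head serves a contiguous group of query heads
    # (here 8 // 4 = 2 query heads per KV head).
    group_size = n_heads // n_kv_heads
    return q_head // group_size

mapping = [kv_head_for_query(h) for h in range(8)]  # [0,0,1,1,2,2,3,3]
```

Halving the KV heads halves the KV projection weights and the KV cache relative to full multi-head attention.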
RoPE
Uses rotary positional embeddings in attention.
parameters: null
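RoPE rotates each (even, odd) feature pair of the query/key vectors by a position-dependent angle. A minimal sketch for one pair, with head dimension 64 and base 10000 as assumed defaults (neither appears in this record):

```python
import math

def rope_rotate(pair, pos, pair_idx, d_head=64, base=10000.0):
    # Rotate one (even, odd) feature pair by an angle that grows with
    # token position and shrinks with pair index. d_head and base are
    # assumptions; the record only states that RoPE is used.
    x, y = pair
    theta = pos * base ** (-2.0 * pair_idx / d_head)
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)
```

The rotation is norm-preserving, and relative position falls out of the dot product between rotated queries and keys.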
Quantization
STE QAT int6
bits: 6
scope: all block weights
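A minimal sketch of the fake-quantization forward pass, assuming symmetric per-tensor scaling (the record does not specify the scaling granularity). Under the straight-through estimator (STE), the backward pass treats the round-trip as identity, so gradients keep updating the full-precision weights:

```python
def fake_quant_int6(weights, bits=6):
    # Snap each weight to the int6 grid, then dequantize back to float.
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]
```

At export time the int6 codes themselves are stored, which is what makes the 15.4 MB artifact possible before zstd even runs.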
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
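A sketch of the momentum warmup implied by other_params, assuming linear interpolation (the record gives only the endpoints 0.92 → 0.99 and the 1500-step count):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp momentum from `start` to `end`, then hold.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```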
AdamW
weight_decay: 0.04
momentum: null
other_params: null
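Both optimizers apply weight_decay=0.04 in decoupled (AdamW-style) form: the weight is shrunk directly rather than having the decay folded into the gradient. A one-line sketch of a single update, where `lr` and `grad_update` are illustrative placeholders:

```python
def decoupled_wd_step(w, grad_update, lr=0.02, weight_decay=0.04):
    # Decoupled weight decay: multiplicative shrink applied to the weight
    # itself, separate from the optimizer's gradient-based update.
    # lr=0.02 is a placeholder; the record does not state learning rates.
    return w * (1.0 - lr * weight_decay) - lr * grad_update
```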
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
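Sliding-window evaluation scores tokens with more left context by advancing the context window a small stride at a time instead of jumping a full window. A sketch of the window placement, assuming a window of 1024 tokens (the record specifies only stride=64):

```python
def sliding_window_starts(n_tokens, window=1024, stride=64):
    # Window start offsets: advance by `stride` until the window reaches
    # the end of the token stream. Typically only the last `stride`
    # positions of each window are scored, so every token gets close to
    # `window` tokens of context.
    return list(range(0, max(n_tokens - window, 0) + 1, stride))
```

The cost is roughly window/stride forward passes per window of text, traded for a lower (better) val_bpb.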
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
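A sketch of the implied trapezoidal schedule: linear warmup over 1500 steps, a flat plateau, then a linear warmdown to zero over the final 3000 iterations. The total step count and the trapezoid shape are assumptions; the record gives only warmup_steps and warmdown_iters:

```python
def lr_scale(step, total_steps=20000, warmup_steps=1500, warmdown_iters=3000):
    # Multiplier on the base learning rate at a given step.
    # total_steps=20000 is a placeholder; the record does not state it.
    if step < warmup_steps:
        return step / warmup_steps                          # linear warmup
    if step > total_steps - warmdown_iters:
        return max((total_steps - step) / warmdown_iters, 0.0)  # warmdown
    return 1.0                                              # plateau
```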

Novel Contributions

  • 11-layer transformer with 3x MLP expansion
  • Int6 quantization-aware training with STE fake quantization
  • Decoupled weight decay of 0.04 on both Muon and AdamW
  • FP16 tied embedding export to preserve embedding/head quality
  • zstd-22 compression to fit the larger model under the 16 MB limit
  • Sliding window evaluation with stride 64 for improved val_bpb
  • Higher Muon momentum with warmup from 0.92 to 0.99