PR #70

open

Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659

by jfprincz
val_bpb
1.1659
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,855,508 bytes

Training Techniques

Architecture
MLP3x
Widened the MLP expansion from 2x to 3x (hidden size 1536) to improve performance.
parameters: {"mlp_mult":3,"hidden_size":1536}
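A minimal NumPy sketch of the widened feed-forward block, assuming a 512-dim model (implied by hidden_size = 1536 = 3 × 512) and a ReLU activation; the submission does not state which activation is actually used:

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """Feed-forward block with a 3x expansion: d_model -> 3*d_model -> d_model.
    ReLU is an assumption here; the submission only specifies the widths."""
    h = np.maximum(x @ w_in, 0.0)   # (batch, 3*d_model)
    return h @ w_out                # (batch, d_model)

d_model, hidden = 512, 1536        # hidden_size = mlp_mult * d_model = 3 * 512
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d_model, hidden)) * 0.02
w_out = rng.standard_normal((hidden, d_model)) * 0.02
x = rng.standard_normal((4, d_model))
y = mlp_3x(x, w_in, w_out)         # (4, 512)
```

Going from 2x to 3x raises MLP parameter count by 50%, which is what makes the aggressive quantization and compression below necessary to stay under the artifact limit.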
tied embeddings
Uses tied input/output embeddings.
parameters: null
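Weight tying means one matrix serves as both the input embedding table and the output projection, roughly as in this sketch (vocabulary and width here are illustrative, not the submission's values):

```python
import numpy as np

# Tied embeddings: a single matrix maps token ids to vectors on the way in
# and hidden states to vocabulary logits on the way out, so the parameter
# (and artifact) cost is paid once.
vocab, d_model = 1000, 64          # illustrative sizes only
rng = np.random.default_rng(1)
embed = rng.standard_normal((vocab, d_model)) * 0.02

tokens = np.array([3, 17, 42])
x = embed[tokens]                  # input: embedding lookup, (3, d_model)
logits = x @ embed.T               # output: same weights reused, (3, vocab)
```

This is why the optimizer settings list a separate `tied_embed_lr`: the shared matrix gets its own learning rate.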
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
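With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads (grouped-query attention), halving KV storage. A NumPy sketch under assumed head dimensions:

```python
import numpy as np

# Grouped-query attention: num_heads // num_kv_heads = 2 query heads
# share each KV head. head_dim and sequence length are illustrative.
num_heads, num_kv_heads, head_dim, T = 8, 4, 16, 5
group = num_heads // num_kv_heads
rng = np.random.default_rng(2)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Expand each KV head across its query-head group before attention.
k_full = np.repeat(k, group, axis=0)   # (8, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)   # (8, T, T)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                    # softmax
out = weights @ v_full                                        # (8, T, head_dim)
```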
Quantization
mixed int6/int8
bits: 6
scope: int6 per-row on MLP and attention projection weights; int8 per-row on embeddings and other tensors
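A sketch of symmetric per-row quantization at both bit widths, assuming a symmetric scheme with one float scale per row (the submission states bits and granularity but not the exact codec):

```python
import numpy as np

def quantize_per_row(w, bits):
    """Symmetric per-row quantization: one fp32 scale per row, integer codes.
    int6 codes span [-31, 31], int8 codes span [-127, 127] (assumed)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(3)
w = (rng.standard_normal((1536, 512)) * 0.02).astype(np.float32)
q6, s6 = quantize_per_row(w, bits=6)   # MLP / attention projections
q8, s8 = quantize_per_row(w, bits=8)   # embeddings and other tensors
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

int6 on the large projection matrices buys most of the size savings; the more error-sensitive embeddings keep the finer int8 grid.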
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256}
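One plausible reading of stride-256 sliding-window evaluation, sketched below: each window sees up to 1024 tokens of context (the eval length), but only the final 256 tokens are scored (all of the first window is scored), so every token is scored exactly once with near-maximal left context.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, n_scored) spans for sliding-window evaluation.
    Only the last `stride` tokens of each window are scored, except the
    first window, where every token is scored. Assumed layout; the
    submission specifies only the stride."""
    spans = []
    pos = 0                         # first not-yet-scored token
    while pos < n_tokens:
        if pos == 0:
            start, end = 0, min(window, n_tokens)
        else:
            end = min(pos + stride, n_tokens)
            start = max(0, end - window)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(2500)       # e.g. a 2500-token validation shard
```

The cost is roughly `window / stride` (here 4x) more forward passes than non-overlapping evaluation, traded for a better val_bpb.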
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92,"warmdown_iters":3000}
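The momentum warmup listed above (0.92 → 0.99 over 1500 steps) could be scheduled as in this sketch; a linear ramp is an assumption, since the submission lists only the endpoints and step count:

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, final=0.99):
    """Momentum warmup for Muon: ramp from `start` to `final` over
    `warmup_steps`, then hold. Linear interpolation is assumed."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```

Starting with lower momentum keeps early, noisy updates from being amplified before the loss landscape statistics settle.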
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Wider 3x MLP expansion to increase model capacity while staying under the artifact limit
  • Mixed-precision quantization: int6 per-row for MLP and attention projection weights, int8 per-row for embeddings and remaining tensors
  • Sliding window evaluation with stride 256 to improve validation score using more context per scored token
  • Use of zstd level 22 compression to fit the larger model within the 16MB submission limit
  • Optimizer tuning for Muon with per-group learning rates, momentum warmup, and an LR warmdown schedule