PR #173

open

Record submission: Int6 + MLP 3x + FlashAttention 3 + NorMuon, val_bpb = 1.1532

by tamoghnokandar
val_bpb: 1.1532
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.96 MB

Training Techniques

Quantization
int6
bits: 6
scope: weight matrices, with per-row scaling; the tied embedding and the last two layers' c_k.weight are kept in fp16
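The per-row int6 scheme above can be sketched as follows. This is a minimal illustration assuming symmetric quantization (each row's max magnitude mapped to the int6 range); the submission's actual quantizer may differ in rounding or range conventions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization sketch (an assumption, not the
    submission's exact code). Each row gets its own scale; the symmetric
    int6 range used here is [-31, 31]."""
    # Per-row scale maps the row's max magnitude onto the int6 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
max_err = np.abs(w - dequantize(q, s)).max()  # bounded by half a scale step
```

Per-row scaling keeps the quantization error proportional to each row's own magnitude, which is why sensitive tensors (the tied embedding, the last layers' c_k.weight) are left in fp16 instead.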
Architecture
MLP3x
Increased MLP hidden size from 1024 to 1536 (3x expansion).
parameters: {"hidden_size":1536,"base_hidden_size":1024}
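Back-of-the-envelope arithmetic for what the 3x expansion costs, assuming a plain two-projection MLP (up + down), dim=512, and 9 layers as listed under "KV head count". The actual block may use gating, so these are illustrative figures only.

```python
# Illustrative parameter-count arithmetic (assumes a two-matrix MLP;
# the submission's block structure may differ).
dim, layers = 512, 9

def mlp_params(hidden: int) -> int:
    return 2 * dim * hidden * layers  # up-projection + down-projection

base = mlp_params(1024)       # baseline hidden size
expanded = mlp_params(1536)   # 1.5x wider hidden
extra = expanded - base       # parameters added by the expansion
int6_bytes = extra * 6 / 8    # int6 storage cost of the added weights
```

Under these assumptions the expansion adds roughly 4.7M parameters, but at 6 bits each that is only about 3.5 MB of artifact, which is how the wider MLP stays inside the size budget.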
tied embeddings
Kept tied token embedding in fp16 for sensitivity reasons.
parameters: null
KV head count
The model uses 4 KV heads with 8 attention heads (grouped-query attention).
parameters: {"heads":8,"kv_heads":4,"layers":9,"dim":512}
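With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the head-sharing step, assuming the common repeat-KV implementation (the submission's kernel path via FlashAttention may handle this natively):

```python
import numpy as np

# Grouped-query attention head sharing: 8 query heads read from 4 KV
# heads, so each KV head is repeated to serve heads // kv_heads queries.
heads, kv_heads, dim = 8, 4, 512
head_dim = dim // heads  # 64

T = 16  # illustrative sequence length
k = np.random.randn(kv_heads, T, head_dim)

def repeat_kv(k: np.ndarray, n_rep: int) -> np.ndarray:
    # (kv_heads, T, head_dim) -> (kv_heads * n_rep, T, head_dim)
    return np.repeat(k, n_rep, axis=0)

k_full = repeat_kv(k, heads // kv_heads)
```

Halving the KV head count halves the K/V projection parameters and KV-cache size while keeping the full set of query heads.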
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
Evaluation
sliding window eval
parameters: {"stride":256,"eval_seq_len":2048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
Other
other
FlashAttention 3 used for the attention kernel to improve training/runtime on H100s.
parameters: null

Novel Contributions

  • Replaced Muon with NorMuon for optimizer updates.
  • Switched the attention path to FlashAttention 3.
  • Used int6 post-training quantization with per-row scaling to fit a larger MLP.
  • Expanded the MLP hidden size from 1024 to 1536 while staying within the artifact budget.
  • Validated the submission across three seeds (7, 42, 1337) with sliding-window evaluation.