PR #102 (open)

Int6 MLP3x + Tuned LR + SmearGate + SlidingWindow (val_bpb: 1.1618)

val_bpb: 1.1618
Architecture: GPT
Optimizer: Muon
Artifact Size: 15,144,136 bytes

Training Techniques

Quantization: int6
  bits: 6
  scope: MLP and attention weight matrices
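The PR does not include the quantizer itself; a minimal sketch of per-row symmetric int6 quantization (function names, the [-31, 31] code range, and the per-row absmax scale are assumptions, not taken from the PR):

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row quantization to 6-bit codes in [-31, 31].

    Returns int8-stored codes plus one float scale per row.
    (The symmetric range and per-row absmax scaling are assumptions.)
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)            # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6_per_row(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6_per_row(q, s)
```

Per the contribution list, this would apply only to the MLP and attention weight matrices, with tied embeddings passed through in fp16.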
Architecture: MLP3x
  Increased the MLP hidden dimension from 1024 to 1536 (3x model_dim) to increase capacity.
  parameters: {"mlp_mult":3,"hidden_dim":1536}
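The listed widths imply model_dim = 512 (1536 = 3 × 512, up from the previous 2x multiplier). A sketch of the widened MLP block, assuming a squared-ReLU activation (the PR only specifies the dimensions):

```python
import numpy as np

MODEL_DIM = 512                      # implied by hidden_dim = 1536 = 3 * model_dim
MLP_MULT = 3
HIDDEN_DIM = MODEL_DIM * MLP_MULT    # 1536, up from the previous 1024 (2x)

rng = np.random.default_rng(0)
w_in = rng.standard_normal((MODEL_DIM, HIDDEN_DIM)).astype(np.float32) * 0.02
w_out = rng.standard_normal((HIDDEN_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def mlp(x: np.ndarray) -> np.ndarray:
    """Transformer MLP block with a 3x expansion.

    The squared-ReLU activation is an assumption, not from the PR.
    """
    h = x @ w_in
    h = np.maximum(h, 0.0) ** 2
    return h @ w_out

x = rng.standard_normal((4, MODEL_DIM)).astype(np.float32)
y = mlp(x)
```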
SmearGate
  A learned gate blends each token embedding with the previous token's embedding before the first transformer layer.
  parameters: {"gate_type":"sigmoid","cost_params":512}
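With cost_params = 512, one plausible reading is a single sigmoid gate logit per channel. A sketch under that assumption (the exact blend formula and first-token handling are guesses, not from the PR):

```python
import numpy as np

MODEL_DIM = 512   # one gate logit per channel would match cost_params = 512 (assumed)

gate_logit = np.zeros(MODEL_DIM, dtype=np.float32)   # learned; zeros give gate = 0.5

def smear_gate(x: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding.

    x: (seq_len, model_dim). Assumed form: out[t] = (1-g)*x[t] + g*x[t-1]
    with a per-channel sigmoid gate g; position 0 is passed through unchanged.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))            # per-channel sigmoid gate
    prev = np.vstack([x[:1], x[:-1]])                # prev[0] = x[0]: no smear at position 0
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, MODEL_DIM)).astype(np.float32)
out = smear_gate(x)
```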
Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"warmdown_iters":3000,"grad_clip_norm":1}
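For context on what Muon applies these hyperparameters to: its defining step orthogonalizes the momentum-averaged gradient of each weight matrix with a Newton-Schulz iteration. A sketch using the commonly published quintic coefficients (the coefficients and step count are from the public Muon recipe, not from this PR):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient/momentum matrix.

    Quintic Newton-Schulz iteration as used by the Muon optimizer;
    coefficients are the commonly published ones (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius norm upper-bounds the spectral norm, so this puts all
    # singular values in (0, 1] before iterating.
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
u = newton_schulz_orthogonalize(rng.standard_normal((16, 32)))
```

After five iterations the singular values of the update all sit near 1, which is the point of the method: every direction of the update gets a comparable step size.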
Compression: zstd
  level: 22
Evaluation: sliding-window eval
  parameters: {"stride":64,"context_length":1024}
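Sliding-window evaluation re-scores the sequence in overlapping 1024-token windows that advance by 64 tokens, counting only the newly exposed tokens in each window, so every token after the first window is conditioned on near-full context. A sketch of the window bookkeeping (the model call is abstracted away; function name is hypothetical):

```python
def sliding_windows(n_tokens: int, context_length: int = 1024, stride: int = 64):
    """Return (start, end, score_from) index triples.

    In each window, tokens [score_from, end) are scored and tokens
    [start, score_from) serve only as context. Every token is scored
    exactly once.
    """
    windows = []
    pos = 0                                   # next token to be scored
    while pos < n_tokens:
        # First window scores a full context_length; later ones score `stride` new tokens.
        end = min(pos + (stride if pos else context_length), n_tokens)
        start = max(0, end - context_length)
        windows.append((start, end, pos))
        pos = end
    return windows

ws = sliding_windows(2048)
```

This trades compute (many overlapping forward passes) for a lower, more honest bits-per-byte number than scoring disjoint 1024-token chunks.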
Sequence Length
  train_length: 1024
  eval_length: 1024
LR Schedule: warmdown
  parameters: {"warmdown_steps":3000,"momentum_warmup_steps":1500}
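The listed parameters suggest a constant LR with a linear decay ("warmdown") over the final 3000 steps, alongside a linear momentum warmup from 0.92 to 0.99 over the first 1500 steps. A sketch under those assumptions (`num_iters` is a hypothetical placeholder for the total step count, which the PR does not state):

```python
def lr_scale(step: int, num_iters: int, warmdown_steps: int = 3000) -> float:
    """Constant LR multiplier, then linear warmdown to zero.

    The constant-then-linear shape is an assumption; the PR only names
    the warmdown and its length.
    """
    decay_start = num_iters - warmdown_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (num_iters - step) / warmdown_steps)

def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Linearly warm Muon's momentum from 0.92 to its final 0.99."""
    t = min(1.0, step / warmup_steps)
    return start + t * (end - start)
```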

Novel Contributions

  • Per-row int6 quantization of MLP and attention weights with fp16 passthrough for tied embeddings
  • Using freed compression budget to expand the MLP to 3x width
  • Tuned Muon optimizer hyperparameters including lower learning rates, momentum warmup, warmdown, and gradient clipping
  • SmearGate pre-attention module that mixes current and previous token embeddings
  • Sliding-window evaluation with stride 64 to score tokens with near-full context