PR #123
openRecord: Vocab 4096 + MLP 3x + Sliding Window Eval (mean val_bpb=1.1642, 3 seeds)
by saikrishnarallabandi
val_bpb: 1.1642
Architecture: GPT
Optimizer: Muon
Artifact Size: ~15.85 MB
Training Techniques
Architecture
MLP3x
Expands the MLP hidden size to 3x the baseline, using the parameter-memory savings from int6 weight quantization.
parameters: {"multiplier":3,"hidden_size":1536}
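A minimal NumPy sketch of the widened MLP. The hidden width follows the recorded parameters (multiplier 3, hidden_size 1536), which implies d_model = 512; that model width, the GELU nonlinearity, and the init scale are assumptions, not taken from the PR.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(d_model=512, multiplier=3, seed=0):
    # hidden = multiplier * d_model -> 3 * 512 = 1536, the PR's hidden_size
    rng = np.random.default_rng(seed)
    hidden = multiplier * d_model
    w_in = rng.normal(0.0, 0.02, (d_model, hidden))
    w_out = rng.normal(0.0, 0.02, (hidden, d_model))
    return w_in, w_out

def mlp_forward(x, w_in, w_out):
    # (batch, 512) -> (batch, 1536) -> (batch, 512)
    return gelu(x @ w_in) @ w_out

w_in, w_out = init_mlp()
y = mlp_forward(np.ones((4, 512)), w_in, w_out)
print(y.shape)  # (4, 512)
```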
tied embeddings
Uses tied input/output embeddings.
parameters: null
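Weight tying reuses the token-embedding matrix as the output projection, saving one vocab_size x d_model matrix. A sketch using the PR's vocab size of 4096 (d_model = 512 is an assumption):

```python
import numpy as np

V, d = 4096, 512                         # vocab from the PR; d_model assumed
rng = np.random.default_rng(0)
W_embed = rng.normal(0.0, 0.02, (V, d))  # the single shared matrix

def embed(token_ids):
    return W_embed[token_ids]            # input side: row lookup

def logits(hidden):
    return hidden @ W_embed.T            # output side: reuse the same matrix

h = embed(np.array([1, 2, 3]))
print(logits(h).shape)  # (3, 4096)
```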
Quantization
STE QAT
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
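The two quantization entries (int6 STE for weights, int8 for embeddings) can be illustrated with the usual symmetric per-tensor fake-quantize step; the per-tensor scaling choice is an assumption, and since NumPy has no autograd, the STE part is only described in the comment:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric fake quantization: quantize, then immediately dequantize.

    In STE QAT the forward pass uses these snapped values while the
    backward pass treats round() as identity, so gradients still flow
    to the underlying fp32 weights.
    """
    qmax = 2 ** (bits - 1) - 1                   # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.linspace(-1.0, 1.0, 7)
w6 = fake_quantize(w, bits=6)   # weights path in the PR
w8 = fake_quantize(w, bits=8)   # embeddings path in the PR
```

The worst-case rounding error is half a quantization step (scale / 2), which is the "small quantization gap" the contributions list refers to.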
Weight Averaging
SWA
parameters: {"checkpoints":7}
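SWA here averages the weights of 7 saved checkpoints into one model. A toy sketch with dict-of-array checkpoints standing in for real state dicts (uniform averaging assumed):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform parameter-wise average of a list of state dicts (SWA)."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Toy stand-ins for the 7 checkpoints saved near the end of training.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = average_checkpoints(ckpts)
print(avg["w"][0, 0])  # 3.0 — the mean of 0..6
```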
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":4096}
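Sliding-window evaluation runs overlapping 4096-token windows advanced by the 256-token stride; in the standard form of this scheme (assumed here), loss is accumulated only on the tokens not covered by a previous window, so each token is scored once with nearly full left context:

```python
def sliding_windows(n_tokens, context_length=4096, stride=256):
    """Plan eval windows: each (begin, end, score_from) triple means run the
    model on tokens [begin, end) but accumulate loss only on [score_from, end).
    After the first window, every token is scored exactly once with up to
    context_length - stride tokens of left context."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_length, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

wins = sliding_windows(5000)
print(wins[0], wins[1])  # (0, 4096, 0) (256, 4352, 4096)
```

The small stride trades compute (each window recomputes most of its context) for a tighter bpb estimate than chopping the stream into disjoint 4096-token blocks.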
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
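A warmdown schedule holds the learning rate flat, then decays it over the final warmdown_steps. The linear-to-zero shape below is an assumption (it is the common choice in nanoGPT-style runs; the PR only records warmdown_steps = 3000):

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    """1.0 until the warmdown begins, then linear decay to 0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps

print(lr_multiplier(0, 10_000))      # 1.0
print(lr_multiplier(8_500, 10_000))  # 0.5
print(lr_multiplier(10_000, 10_000)) # 0.0
```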
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
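The momentum warmup parameters (start 0.92, final 0.99, 1500 steps) suggest a per-step ramp for Muon's momentum; linear interpolation is assumed here, since the record does not state the interpolation shape:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Ramp momentum from `start` to `end` over warmup_steps, then hold."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

print(muon_momentum(0))     # 0.92
print(muon_momentum(750))   # ~0.955, halfway through the warmup
print(muon_momentum(1500))  # ~0.99, held for the rest of training
```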
Other
other
Custom SentencePiece BPE tokenizer with vocab size 4096 trained on FineWeb.
parameters: {"vocab_size":4096}
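With a custom 4096-entry vocabulary, per-token loss is not comparable to runs using other tokenizers, which is why the headline metric is bits per byte. The standard conversion normalizes summed cross-entropy by raw bytes (the toy numbers below are illustrative, not the PR's):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Summed cross-entropy in nats -> bits per byte of raw validation text.
    Normalizing by bytes keeps scores comparable across tokenizers, since a
    smaller vocab simply spreads the same byte budget over more tokens."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Illustrative only: 1,000,000 validation bytes, 807,000 nats of total loss.
print(round(bits_per_byte(807_000, 1_000_000), 2))  # ~1.16
```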
Novel Contributions
- Custom SentencePiece BPE tokenizer with vocab size 4096
- 3x MLP expansion enabled by int6 quantization savings
- Int6 STE fake quantization with small quantization gap
- Training with 4096-token sequences
- Stochastic Weight Averaging over 7 checkpoints
- Sliding window evaluation with stride 256