PR #135
openRecord: OrthoInit + Int6 MLP3x + BigramHash + SmearGate (val_bpb: 1.1539)
by unnirView on GitHub
val_bpb
1.1539
Architecture
GPT
Optimizer
Muon
Artifact Size
15,162,375 bytes
Training Techniques
Initialization
OrthoInit
Orthogonal initialization with gain=1.0, plus muP-scaled output projections.
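A minimal pure-Python sketch of what this combination could look like: an orthogonal matrix built by Gram–Schmidt on a Gaussian matrix (the same construction `torch.nn.init.orthogonal_` performs via QR), plus a muP-style output-projection multiplier. The `base_width` reference and the exact muP scaling rule are assumptions; the PR only names the technique.

```python
import math
import random

def orthogonal_matrix(n, gain=1.0, seed=0):
    """Build an n x n orthogonal matrix (times `gain`) by Gram-Schmidt on a
    random Gaussian matrix -- a pure-Python stand-in for QR-based init."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        # Remove the components along previously accepted basis vectors.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return [[gain * x for x in row] for row in basis]

def mup_output_scale(width, base_width=256):
    """muP-style output-projection multiplier, shrinking as 1/width relative
    to an assumed reference width (base_width is not from the PR)."""
    return base_width / width
```

The rows of the result are orthonormal, so activations keep their scale at initialization regardless of depth.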
Quantization
mixed int6
bits: 6
scope: MLP and attention weight matrices; FP16 passthrough for tied embeddings and last 2 layers' Key projections
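A sketch of symmetric per-tensor 6-bit quantization, the simplest scheme consistent with the description above; per-channel scales or a different rounding rule are possible and unspecified in the PR. The FP16-passthrough tensors (tied embeddings, last two layers' K projections) would simply skip this step.

```python
def quantize_int6(weights):
    """Symmetric per-tensor 6-bit quantization sketch: map floats onto the
    integer grid -31..31 with a single scale factor."""
    qmax = 2 ** 5 - 1  # 31: largest magnitude in a symmetric 6-bit range
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [x * scale for x in q]
```

Round-trip error per weight is bounded by half the scale, which is what makes the 3x MLP expansion below roughly free in artifact bytes.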
Architecture
MLP3x
Expanded MLP hidden dimension from 1024 to 1536 (3x model_dim).
parameters: {"hidden_dimension":1536}
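The parameter cost of the expansion is easy to check. Assuming a model_dim of 512 (implied by "3x model_dim") and a standard two-matrix MLP without biases:

```python
def mlp_params(model_dim, hidden_dim):
    """Parameter count of a standard two-matrix MLP block (no biases):
    up-projection (model_dim x hidden_dim) + down-projection (hidden_dim x model_dim)."""
    return 2 * model_dim * hidden_dim

baseline = mlp_params(512, 1024)  # previous 2x expansion
expanded = mlp_params(512, 1536)  # new 3x expansion
```

The 3x block carries 1.5x the MLP parameters of the 2x baseline, paid for (per the contributions list) by the int6 quantization savings.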
SmearGate
Learned gate blending each token embedding with the previous token embedding.
parameters: {"parameters":512}
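A sketch of one plausible reading: a per-channel sigmoid gate (one learned logit per channel, matching the 512 parameters listed) that mixes each token embedding with its predecessor's. Whether the gate is per-channel, scalar, or input-dependent is not specified in the PR.

```python
import math

def smear_gate(embeddings, gate_logits):
    """SmearGate sketch: a per-channel sigmoid gate g blends each token
    embedding with the previous token's embedding:
        out[t] = g * emb[t] + (1 - g) * emb[t-1]
    The first token has no predecessor and passes through unchanged."""
    g = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        out.append([gi * cur + (1.0 - gi) * prev
                    for gi, cur, prev in zip(g, embeddings[t], embeddings[t - 1])])
    return out
```

Initializing the logits high (gate near 1) would make the block start as an identity and learn how much previous-token signal to smear in.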
BigramHash
4096-bucket hash table injecting token-pair information, projected to model dimension.
parameters: {"buckets":4096,"dimension":128,"projection_dimension":512}
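A sketch of the lookup path: hash each (previous, current) token pair into one of 4096 buckets, fetch that bucket's embedding (128-dim in the PR), and project it up to the model dimension (512 in the PR). The multiply-and-mod hash mixing below is an assumption; the PR does not specify the hash function.

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=4096):
    """Hash a (previous, current) token pair into one of n_buckets.
    The prime multiplier is an assumed mixing choice, not from the PR."""
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def bigram_features(tokens, table, projection):
    """For each position t >= 1, look up the bucket embedding for the pair
    (tokens[t-1], tokens[t]) and project it via `projection`, given as a
    list of output columns, each the length of a bucket embedding."""
    feats = []
    for t in range(1, len(tokens)):
        emb = table[bigram_bucket(tokens[t - 1], tokens[t])]
        feats.append([sum(e * w for e, w in zip(emb, col)) for col in projection])
    return feats
```

Hash collisions are tolerated by design: with 4096 buckets the table stays tiny while still giving the model direct access to local pair statistics.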
Optimizer
Muon
weight_decay: 0.01
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"grad_clip_norm":0.3,"beta1":0.9,"beta2":0.95,"adamw_for_embedding_and_scalar_params":true}
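The momentum warmup implied by `muon_momentum_warmup_start` and `muon_momentum_warmup_steps` could be sketched as below; linear interpolation between the endpoints is an assumption, since the hyperparameters only give the start value (0.92), the target (`momentum: 0.99`), and the step count (1500).

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`
    optimizer steps, then hold it at `end`. Linear interpolation is an
    assumed schedule shape."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients before the running average stabilizes.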
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
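A sketch of the window bookkeeping this evaluation implies: each window spans `context_length` tokens but only newly covered positions are scored, so after the first window every scored token sees close to the full 2048-token context. The exact scoring policy for the first window and for a non-aligned tail is an assumption.

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (window_start, score_from, score_to) spans for sliding-window
    evaluation: each window covers up to `context` tokens, and only the
    positions not scored by an earlier window are scored, so scored tokens
    partition the sequence exactly once."""
    spans = []
    start, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, scored_to, end))  # score tokens [scored_to, end)
        scored_to = end
        start += stride
    return spans
```

The small stride makes evaluation roughly `context / stride` (32x) more expensive than non-overlapping chunks, in exchange for a tighter bpb estimate.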
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
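A sketch of a warmdown schedule consistent with the parameters above: hold the learning rate constant, then decay to zero over the final 3000 iterations. Linear decay and the use of `matrix_lr` (0.02) as the base rate are assumptions; the PR only names the schedule and its length.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_iters=3000):
    """Hold the learning rate at base_lr, then decay it linearly to 0 over
    the final warmdown_iters steps. Linear decay is an assumed shape."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters
```

Each parameter group (matrix, scalar, tied embedding) would apply the same shape with its own base rate.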
Regularization
weight decay
parameters: {"weight_decay":0.01}
Novel Contributions
- Orthogonal initialization with muP-scaled output projections
- Mixed int6 quantization with FP16 passthrough for sensitive tensors
- 3x MLP expansion enabled by quantization savings
- Tuned Muon/AdamW optimizer hyperparameters
- SmearGate token blending mechanism
- BigramHash token-pair embedding
- Sliding-window evaluation with stride 64