PR #102 (open)

Int6 MLP3x + Tuned LR + SmearGate + SlidingWindow (val_bpb: 1.1618)

val_bpb: 1.1618
Architecture: GPT
Optimizer: Muon
Artifact Size: 15,144,136 bytes

Training Techniques

Quantization: int6
  bits: 6
  scope: MLP and attention weight matrices
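The PR does not include the quantizer itself; a minimal sketch of per-row symmetric int6 quantization (function names, the [-31, 31] code range, and the per-row absmax scale are assumptions, not taken from the PR):

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row quantization to 6-bit codes in [-31, 31].

    Returns int8-stored codes plus one float scale per row.
    (The symmetric range and per-row absmax scaling are assumptions.)
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)            # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6_per_row(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6_per_row(q, s)
```

Per the contribution list, this would apply only to the MLP and attention weight matrices, with tied embeddings passed through in fp16.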
Architecture: MLP3x
  Increased the MLP hidden dimension from 1024 to 1536 (3x model_dim) to increase capacity.
  parameters: {"mlp_mult":3,"hidden_dim":1536}
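The listed widths imply model_dim = 512 (1536 = 3 × 512, up from the previous 2x multiplier). A sketch of the widened MLP block, assuming a squared-ReLU activation (the PR only specifies the dimensions):

```python
import numpy as np

MODEL_DIM = 512                      # implied by hidden_dim = 1536 = 3 * model_dim
MLP_MULT = 3
HIDDEN_DIM = MODEL_DIM * MLP_MULT    # 1536, up from the previous 1024 (2x)

rng = np.random.default_rng(0)
w_in = rng.standard_normal((MODEL_DIM, HIDDEN_DIM)).astype(np.float32) * 0.02
w_out = rng.standard_normal((HIDDEN_DIM, MODEL_DIM)).astype(np.float32) * 0.02

def mlp(x: np.ndarray) -> np.ndarray:
    """Transformer MLP block with a 3x expansion.

    The squared-ReLU activation is an assumption, not from the PR.
    """
    h = x @ w_in
    h = np.maximum(h, 0.0) ** 2
    return h @ w_out

x = rng.standard_normal((4, MODEL_DIM)).astype(np.float32)
y = mlp(x)
```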
SmearGate
  A learned gate blends each token embedding with the previous token's embedding before the first transformer layer.
  parameters: {"gate_type":"sigmoid","cost_params":512}
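With cost_params = 512, one plausible reading is a single sigmoid gate logit per channel. A sketch under that assumption (the exact blend formula and first-token handling are guesses, not from the PR):

```python
import numpy as np

MODEL_DIM = 512   # one gate logit per channel would match cost_params = 512 (assumed)

gate_logit = np.zeros(MODEL_DIM, dtype=np.float32)   # learned; zeros give gate = 0.5

def smear_gate(x: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding.

    x: (seq_len, model_dim). Assumed form: out[t] = (1-g)*x[t] + g*x[t-1]
    with a per-channel sigmoid gate g; position 0 is passed through unchanged.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))            # per-channel sigmoid gate
    prev = np.vstack([x[:1], x[:-1]])                # prev[0] = x[0]: no smear at position 0
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, MODEL_DIM)).astype(np.float32)
out = smear_gate(x)
```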
Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"warmdown_iters":3000,"grad_clip_norm":1}
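For context on what Muon applies these hyperparameters to: its defining step orthogonalizes the momentum-averaged gradient of each weight matrix with a Newton-Schulz iteration. A sketch using the commonly published quintic coefficients (the coefficients and step count are from the public Muon recipe, not from this PR):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient/momentum matrix.

    Quintic Newton-Schulz iteration as used by the Muon optimizer;
    coefficients are the commonly published ones (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius norm upper-bounds the spectral norm, so this puts all
    # singular values in (0, 1] before iterating.
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
u = newton_schulz_orthogonalize(rng.standard_normal((16, 32)))
```

After five iterations the singular values of the update all sit near 1, which is the point of the method: every direction of the update gets a comparable step size.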
Compression: zstd
  level: 22
Evaluation: sliding-window eval
  parameters: {"stride":64,"context_length":1024}
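Sliding-window evaluation re-scores the sequence in overlapping 1024-token windows that advance by 64 tokens, counting only the newly exposed tokens in each window, so every token after the first window is conditioned on near-full context. A sketch of the window bookkeeping (the model call is abstracted away; function name is hypothetical):

```python
def sliding_windows(n_tokens: int, context_length: int = 1024, stride: int = 64):
    """Return (start, end, score_from) index triples.

    In each window, tokens [score_from, end) are scored and tokens
    [start, score_from) serve only as context. Every token is scored
    exactly once.
    """
    windows = []
    pos = 0                                   # next token to be scored
    while pos < n_tokens:
        # First window scores a full context_length; later ones score `stride` new tokens.
        end = min(pos + (stride if pos else context_length), n_tokens)
        start = max(0, end - context_length)
        windows.append((start, end, pos))
        pos = end
    return windows

ws = sliding_windows(2048)
```

This trades compute (many overlapping forward passes) for a lower, more honest bits-per-byte number than scoring disjoint 1024-token chunks.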
Sequence Length
  train_length: 1024
  eval_length: 1024
LR Schedule: warmdown
  parameters: {"warmdown_steps":3000,"momentum_warmup_steps":1500}
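The listed parameters suggest a constant LR with a linear decay ("warmdown") over the final 3000 steps, alongside a linear momentum warmup from 0.92 to 0.99 over the first 1500 steps. A sketch under those assumptions (`num_iters` is a hypothetical placeholder for the total step count, which the PR does not state):

```python
def lr_scale(step: int, num_iters: int, warmdown_steps: int = 3000) -> float:
    """Constant LR multiplier, then linear warmdown to zero.

    The constant-then-linear shape is an assumption; the PR only names
    the warmdown and its length.
    """
    decay_start = num_iters - warmdown_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (num_iters - step) / warmdown_steps)

def muon_momentum(step: int, warmup_steps: int = 1500,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Linearly warm Muon's momentum from 0.92 to its final 0.99."""
    t = min(1.0, step / warmup_steps)
    return start + t * (end - start)
```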

Novel Contributions

  • Per-row int6 quantization of MLP and attention weights with fp16 passthrough for tied embeddings
  • Using freed compression budget to expand the MLP to 3x width
  • Tuned Muon optimizer hyperparameters including lower learning rates, momentum warmup, warmdown, and gradient clipping
  • SmearGate pre-attention module that mixes current and previous token embeddings
  • Sliding-window evaluation with stride 64 to score tokens with near-full context