PR #114
openRecord: val_bpb=1.1574 — Int6 + MLP 3x + selective precision + optimized long-context training
by saml212
val_bpb
1.1574
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.98MB
Training Techniques
Quantization
int6
bits: 6
scope: weight matrices
fp16
bits: 16
scope: tied embedding and last 2 layers' key projections
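As a sketch of the int6 scheme above, assuming symmetric per-tensor quantization with round-to-nearest (the PR does not specify the exact scaling or rounding):

```python
def quantize_int6(weights):
    """Symmetric per-tensor int6 quantization to the range [-31, 31].
    `weights` is a flat list of floats; returns (int codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31 if max_abs else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes."""
    return [v * scale for v in q]
```

With this scheme the worst-case reconstruction error per weight is half the scale, which is what makes sensitive tensors (handled separately at fp16 below) worth excluding.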
Architecture
MLP3x
MLP hidden dimension raised to 3x the model width (512 → 1536, versus the default 1024), fitting within the artifact budget freed by int6 compression.
parameters: {"mlp_hidden":1536,"default_mlp_hidden":1024}
tied embeddings
Input embedding and output projection share the same weight matrix.
parameters: null
KV head count
Model uses 4 KV heads with 8 attention heads and 9 layers.
parameters: {"layers":9,"dim":512,"heads":8,"kv_heads":4}
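Given the listed dimensions, the per-layer attention parameter count under grouped-query attention (4 KV heads shared across 8 query heads) can be checked with a small sketch; the projection layout (separate Q/K/V/output matrices, no biases) is assumed, not taken from the PR:

```python
def attn_param_count(dim, heads, kv_heads):
    """Parameters in one grouped-query attention layer:
    full-width Q and output projections, narrower shared K/V projections."""
    head_dim = dim // heads
    q_proj = dim * dim                       # queries: one head_dim slice per head
    kv_proj = 2 * dim * (kv_heads * head_dim)  # K and V: only kv_heads slices each
    out_proj = dim * dim
    return q_proj + kv_proj + out_proj
```

Halving the KV heads (8 → 4) shrinks the K/V projections to half width, which is part of how the model fits the artifact budget.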
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
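A minimal sketch of the momentum warmup implied by `muon_momentum_warmup_start` and `muon_momentum_warmup_steps`, assuming linear interpolation from 0.92 to the final 0.99 (the interpolation shape is an assumption):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly ramp Muon momentum from `start` to `end` over `warmup_steps`,
    then hold it constant at `end`."""
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + frac * (end - start)
```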
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
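A warmdown schedule of this kind is typically a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps; a sketch under that assumption (`base_lr` and `total_steps` are placeholders, not values from the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    """Hold `base_lr` until the last `warmdown_iters` steps, then decay
    linearly to zero at `total_steps`."""
    if step < total_steps - warmdown_iters:
        return base_lr
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac
```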
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
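Global-norm gradient clipping at 0.3 can be sketched as follows (pure-Python illustration over a flat gradient list; real training code would operate on tensors):

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Scale `grads` so their global L2 norm does not exceed `max_norm`.
    Returns the (possibly rescaled) gradients and the pre-clip norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```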
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":2048}
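With stride 256 and context 2048, sliding-window evaluation scores each token exactly once while giving all but the earliest tokens at least 2048 − 256 = 1792 tokens of context. A sketch of the window layout (the exact windowing in the PR is assumed; requires stride ≤ context_length):

```python
def sliding_windows(n_tokens, context_length=2048, stride=256):
    """Return (context_start, score_start, end) triples: each window feeds
    [context_start, end) to the model but only scores [score_start, end),
    so every token is scored exactly once."""
    windows = []
    scored = 0
    start = 0
    while scored < n_tokens:
        end = min(start + context_length, n_tokens)
        windows.append((start, scored, end))
        scored = end
        start += stride
    return windows
```

Larger strides mean fewer forward passes, which is consistent with the reduced eval time reported versus smaller strides.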
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Selective precision preservation for sensitive tensors, including fp16 tied embedding and fp16 passthrough for late-layer key projections.
parameters: {"fp16_tied_embedding":true,"fp16_late_k_passthrough_layers":2}
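The selective fp16 passthrough can be expressed as a predicate over tensor names that the quantizer consults before converting to int6; the name scheme below (`embed.weight`, `layers.{i}.attn.k_proj.weight`) is hypothetical, not taken from the PR:

```python
def keep_fp16(name, n_layers=9, fp16_late_k_layers=2):
    """Return True for tensors kept at fp16 instead of int6: the tied
    embedding and the key projections of the last `fp16_late_k_layers` layers."""
    if name == "embed.weight":  # tied input embedding / output projection
        return True
    parts = name.split(".")
    if len(parts) >= 4 and parts[0] == "layers" and parts[3] == "k_proj":
        layer = int(parts[1])
        return layer >= n_layers - fp16_late_k_layers
    return False
```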
Novel Contributions
- Int6 post-training quantization to reduce artifact size and free space for a 3x larger MLP.
- Selective precision preservation for the tied embedding and last two layers' key projections.
- Training at sequence length 2048 instead of 4096 while retaining performance under sliding-window evaluation.
- Gradient clipping at 0.3 to stabilize long-sequence training.
- Batch size of 786,432 tokens found to be optimal for train@2048.
- Sliding-window evaluation with stride 256, which improved val_bpb and reduced eval time versus smaller strides.