PR #236 (open)

Record: 11L Int6 + SmearGate + Batch Optimization (val_bpb=1.1400)

by saml212
val_bpb
1.1400
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
15.7 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: attention + MLP weights; int8 tok_emb
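A minimal sketch of what mixed int6/int8 symmetric quantization could look like. The PR only states the bit widths and scope; the per-tensor absmax scaling and rounding scheme below are assumptions, not the record's actual implementation:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to `bits` bits.

    bits=6 (range [-31, 31]) would cover attention/MLP weights here;
    bits=8 (range [-127, 127]) would cover tok_emb. Hypothetical
    sketch: absmax scaling is one common choice, not confirmed by the PR.
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With absmax scaling the round-trip error per weight is bounded by half the scale, which is the property that makes the int6 budget workable for the main matrices.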
Architecture
SmearGate
Per-dimension gating module that blends adjacent token embeddings
parameters: {"params":512}
BigramHash
Adds consecutive token pair features via hashed bigram buckets
parameters: {"buckets":2048,"dim":128}
MLP3x
Widened MLP to 3x hidden size
parameters: {"hidden":1536}
KV head count
Uses fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
weight tying
Tied input embedding / output projection, inferred from the dedicated tied-embedding learning rate
parameters: null
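The SmearGate description ("per-dimension gating module that blends adjacent token embeddings", 512 parameters) is consistent with one sigmoid gate per model dimension. A hypothetical reconstruction under that reading; the exact blend formula is an assumption:

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor, per dimension.

    x: (seq, dim) embeddings; gate_logits: (dim,) learned parameters
    (512 of them would match the PR's stated param count if dim=512).
    Hypothetical sketch reconstructed from the one-line description.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                            # token 0 has no predecessor
    return g * prev + (1.0 - g) * x
```

At gate ~0 this is the identity, so the module can be initialized near a no-op and learn how much local "smearing" each dimension benefits from.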
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"grad_clip_norm":0.3}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"grad_clip_norm":0.3}
Weight Averaging
SWA
parameters: {"checkpoints":7,"every_steps":200}
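The SWA config (7 checkpoints, every 200 steps) amounts to a uniform average over snapshots. A minimal sketch, assuming plain unweighted averaging of parameter dicts (the PR does not state a weighting):

```python
def swa_average(checkpoints):
    """Uniformly average a list of parameter dicts (SWA).

    `checkpoints` would hold e.g. 7 snapshots taken every 200 steps,
    each a dict mapping parameter name -> float value or array.
    Hypothetical sketch; key names are illustrative.
    """
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```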
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
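Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context; batching 32 such windows (batch_seqs) keeps it fast. A sketch of the span bookkeeping only (no model), under the common convention that each window scores just its new tokens:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (begin, end, score_from) spans for sliding-window eval.

    Tokens [score_from, end) are scored using context [begin, end);
    windows advance by `stride`, so every token is scored once.
    Hypothetical sketch of one standard convention; batch the spans
    `batch_seqs` at a time as in the PR's config.
    """
    spans, prev_end, begin = [], 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

A small stride trades more forward passes for more context per scored token, which is why batching the windows matters under a time limit.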
Initialization
OrthoInit
Orthogonal initialization with muP output scaling
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Other
other
Reduced batch size to improve step count under a fixed 600s training budget
parameters: {"from_tokens":786000,"to_tokens":524288}
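The batch-size change is a throughput trade: under a fixed 600 s wall clock, fewer tokens per step means more optimizer steps. A back-of-envelope sketch, assuming tokens/sec is roughly independent of batch size (an assumption, and the throughput figure below is illustrative):

```python
def steps_in_budget(tokens_per_sec: float, batch_tokens: int,
                    budget_s: float = 600.0) -> int:
    """Optimizer steps that fit in a fixed wall-clock budget,
    assuming throughput does not change with batch size."""
    return int(tokens_per_sec * budget_s // batch_tokens)

# Illustrative throughput; the PR does not report one.
before = steps_in_budget(1_000_000, 786_000)   # old batch
after = steps_in_budget(1_000_000, 524_288)    # new batch
```

Shrinking the batch from 786K to 524K tokens yields roughly 786000/524288 ≈ 1.5x the steps for the same budget, at the cost of noisier gradients per step.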

Novel Contributions

  • Reduced batch size from 786K to 524K tokens to maximize optimization steps within a fixed training time budget
  • Used int6 quantization for all main weights instead of int5 MLP quantization
  • Switched tok_emb from fp16 to int8 to free artifact space for a wider MLP
  • Added SmearGate as a per-dimension embedding blending mechanism
  • Added BigramHash features with 2048 buckets and 128-dimensional embeddings
  • Applied batched sliding-window evaluation with stride 64 to make long-context eval feasible within time limits
  • Used SWA with periodic checkpoint averaging
  • Combined Muon and AdamW with dual weight decay to improve compression and quantization behavior
  • Used OrthoInit with muP output scaling
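The BigramHash contribution above (2048 buckets, 128-dim embeddings) can be sketched as a hashed lookup over consecutive token pairs. The hash mix below is a hypothetical choice; the PR does not specify one:

```python
import numpy as np

def bigram_hash_features(tokens, table: np.ndarray, buckets: int = 2048):
    """Hashed bigram features: each pair (tokens[i-1], tokens[i]) is
    hashed into one of `buckets` buckets and looks up a learned row of
    `table` (shape (buckets, 128) in the PR's config). The multiplier
    1000003 is an illustrative mixing constant, not from the PR.
    """
    feats = np.zeros((len(tokens), table.shape[1]), dtype=table.dtype)
    for i in range(1, len(tokens)):
        h = (tokens[i - 1] * 1000003 + tokens[i]) % buckets
        feats[i] = table[h]
    return feats
```

The features would typically be added to (or concatenated with) the token embeddings, giving the model cheap access to pair identity without a full vocab-squared bigram table.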