PR #236 (open)

Record: 11L Int6 + SmearGate + Batch Optimization (val_bpb=1.1400)

by saml212
val_bpb
1.1400
Architecture
Transformer
Optimizer
Muon + AdamW
Artifact Size
15.7 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: attention + MLP weights; int8 tok_emb
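A minimal sketch of what mixed int6/int8 symmetric quantization could look like. The PR only states the bit widths and scope; the per-tensor absmax scaling and rounding scheme below are assumptions, not the record's actual implementation:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to `bits` bits.

    bits=6 (range [-31, 31]) would cover attention/MLP weights here;
    bits=8 (range [-127, 127]) would cover tok_emb. Hypothetical
    sketch: absmax scaling is one common choice, not confirmed by the PR.
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

With absmax scaling the round-trip error per weight is bounded by half the scale, which is the property that makes the int6 budget workable for the main matrices.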
Architecture
SmearGate
Per-dimension gating module that blends adjacent token embeddings
parameters: {"params":512}
BigramHash
Adds consecutive token pair features via hashed bigram buckets
parameters: {"buckets":2048,"dim":128}
MLP3x
Widened MLP to 3x hidden size
parameters: {"hidden":1536}
KV head count
Uses fewer KV heads than attention heads
parameters: {"heads":8,"kv_heads":4}
weight tying
Tied input embedding / output projection, inferred from the dedicated tied-embedding learning rate
parameters: null
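The SmearGate description ("per-dimension gating module that blends adjacent token embeddings", 512 parameters) is consistent with one sigmoid gate per model dimension. A hypothetical reconstruction under that reading; the exact blend formula is an assumption:

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor, per dimension.

    x: (seq, dim) embeddings; gate_logits: (dim,) learned parameters
    (512 of them would match the PR's stated param count if dim=512).
    Hypothetical sketch reconstructed from the one-line description.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # sigmoid gate in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                            # token 0 has no predecessor
    return g * prev + (1.0 - g) * x
```

At gate ~0 this is the identity, so the module can be initialized near a no-op and learn how much local "smearing" each dimension benefits from.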
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"grad_clip_norm":0.3}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"grad_clip_norm":0.3}
Weight Averaging
SWA
parameters: {"checkpoints":7,"every_steps":200}
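The SWA config (7 checkpoints, every 200 steps) amounts to a uniform average over snapshots. A minimal sketch, assuming plain unweighted averaging of parameter dicts (the PR does not state a weighting):

```python
def swa_average(checkpoints):
    """Uniformly average a list of parameter dicts (SWA).

    `checkpoints` would hold e.g. 7 snapshots taken every 200 steps,
    each a dict mapping parameter name -> float value or array.
    Hypothetical sketch; key names are illustrative.
    """
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```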
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"batch_seqs":32}
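Sliding-window evaluation with stride 64 scores each token exactly once while giving it up to a full window of left context; batching 32 such windows (batch_seqs) keeps it fast. A sketch of the span bookkeeping only (no model), under the common convention that each window scores just its new tokens:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (begin, end, score_from) spans for sliding-window eval.

    Tokens [score_from, end) are scored using context [begin, end);
    windows advance by `stride`, so every token is scored once.
    Hypothetical sketch of one standard convention; batch the spans
    `batch_seqs` at a time as in the PR's config.
    """
    spans, prev_end, begin = [], 0, 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

A small stride trades more forward passes for more context per scored token, which is why batching the windows matters under a time limit.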
Initialization
OrthoInit
Orthogonal initialization with muP output scaling
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Other
other
Reduced batch size to improve step count under a fixed 600s training budget
parameters: {"from_tokens":786000,"to_tokens":524288}
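The batch-size change is a throughput trade: under a fixed 600 s wall clock, fewer tokens per step means more optimizer steps. A back-of-envelope sketch, assuming tokens/sec is roughly independent of batch size (an assumption, and the throughput figure below is illustrative):

```python
def steps_in_budget(tokens_per_sec: float, batch_tokens: int,
                    budget_s: float = 600.0) -> int:
    """Optimizer steps that fit in a fixed wall-clock budget,
    assuming throughput does not change with batch size."""
    return int(tokens_per_sec * budget_s // batch_tokens)

# Illustrative throughput; the PR does not report one.
before = steps_in_budget(1_000_000, 786_000)   # old batch
after = steps_in_budget(1_000_000, 524_288)    # new batch
```

Shrinking the batch from 786K to 524K tokens yields roughly 786000/524288 ≈ 1.5x the steps for the same budget, at the cost of noisier gradients per step.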

Novel Contributions

  • Reduced batch size from 786K to 524K tokens to maximize optimization steps within a fixed training time budget
  • Used int6 quantization for all main weights instead of int5 MLP quantization
  • Switched tok_emb from fp16 to int8 to free artifact space for a wider MLP
  • Added SmearGate as a per-dimension embedding blending mechanism
  • Added BigramHash features with 2048 buckets and 128-dimensional embeddings
  • Applied batched sliding-window evaluation with stride 64 to make long-context eval feasible within time limits
  • Used SWA with periodic checkpoint averaging
  • Combined Muon and AdamW with dual weight decay to improve compression and quantization behavior
  • Used OrthoInit with muP output scaling
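The BigramHash contribution above (2048 buckets, 128-dim embeddings) can be sketched as a hashed lookup over consecutive token pairs. The hash mix below is a hypothetical choice; the PR does not specify one:

```python
import numpy as np

def bigram_hash_features(tokens, table: np.ndarray, buckets: int = 2048):
    """Hashed bigram features: each pair (tokens[i-1], tokens[i]) is
    hashed into one of `buckets` buckets and looks up a learned row of
    `table` (shape (buckets, 128) in the PR's config). The multiplier
    1000003 is an illustrative mixing constant, not from the PR.
    """
    feats = np.zeros((len(tokens), table.shape[1]), dtype=table.dtype)
    for i in range(1, len(tokens)):
        h = (tokens[i - 1] * 1000003 + tokens[i]) % buckets
        feats[i] = table[h]
    return feats
```

The features would typically be added to (or concatenated with) the token embeddings, giving the model cheap access to pair identity without a full vocab-squared bigram table.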