PR #230

open

Record: Int6 + MLP 3x + NorMuon + SmearGate + BigramHash + OrthoInit + Sliding Window, val_bpb=1.1541

by MatthewHRockwell
val_bpb: 1.1541
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,992,610 bytes

Training Techniques

Quantization
int6
bits: 6
scope: per-row weights; tied embeddings kept in fp16
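A minimal sketch of the per-row int6 scheme described above: one fp16 scale per row, signed 6-bit values in [-32, 31] stored in an int8 container. The exact packing and rounding in the PR may differ.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Quantize a 2-D weight matrix to signed int6 with one fp16 scale
    per row. Values live in [-32, 31] inside an int8 container."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 31.0   # per-row max magnitude
    scales = np.maximum(scales, 1e-8).astype(np.float16)   # fp16 scales, avoid div-by-zero
    q = np.round(w / scales.astype(np.float32))
    q = np.clip(q, -32, 31).astype(np.int8)                # signed 6-bit range
    return q, scales

def dequantize_int6(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct fp32 weights from int6 values and fp16 row scales."""
    return q.astype(np.float32) * scales.astype(np.float32)
```

Tied embeddings would simply be excluded from this pass and kept in fp16, per the scope note above.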
Architecture
MLP3x
Expanded MLP hidden size to 3x model dimension to increase capacity.
parameters: {"hidden_dim":1536,"multiplier":3}
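Given hidden_dim 1536 = 3 × 512, the model dimension is presumably 512. A minimal sketch of the expanded MLP (ReLU is an assumption; the PR's actual activation and layout are not stated):

```python
import numpy as np

D_MODEL = 512          # assumed model dimension (1536 = 3 * 512 per the params)
HIDDEN = 3 * D_MODEL   # expanded hidden size

class MLP3x:
    """Two-layer MLP with a 3x-expanded hidden layer (sketch)."""
    def __init__(self, rng):
        self.w1 = rng.standard_normal((D_MODEL, HIDDEN)) * (D_MODEL ** -0.5)
        self.w2 = rng.standard_normal((HIDDEN, D_MODEL)) * (HIDDEN ** -0.5)

    def __call__(self, x):
        h = np.maximum(x @ self.w1, 0.0)   # ReLU assumed; actual activation unknown
        return h @ self.w2
```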
SmearGate
Learned gate blending each token embedding with the previous token embedding.
parameters: {"params":512}
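The 512 gate parameters suggest one learned gate per embedding dimension. A hypothetical sketch with a sigmoid gate and an additive blend (the PR's exact blend formula is an assumption):

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_w: np.ndarray) -> np.ndarray:
    """Blend each token embedding with the previous token's embedding via a
    learned per-dimension gate (sketch; blend form is an assumption).
    x: (T, D) token embeddings; gate_w: (D,) learned gate logits."""
    g = 1.0 / (1.0 + np.exp(-gate_w))                   # sigmoid gate in (0, 1)
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])    # shift right; zero-pad t=0
    return x + g * prev                                  # additive previous-token blend
```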
BigramHash
Hash-based embedding for token pairs to inject explicit bigram context.
parameters: {"buckets":4096,"dimension":64,"projected_dim":512}
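A sketch using the listed bucket/dimension/projection sizes; the hash function itself is a placeholder (the PR's actual hash is not stated):

```python
import numpy as np

BUCKETS, DIM, PROJ = 4096, 64, 512   # from the listed parameters

class BigramHash:
    """Hash each (prev_token, token) pair into one of BUCKETS buckets, look up
    a 64-d embedding, and project it to model width (sketch)."""
    def __init__(self, rng):
        self.table = rng.standard_normal((BUCKETS, DIM)) * 0.02
        self.proj = rng.standard_normal((DIM, PROJ)) * (DIM ** -0.5)

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        prev = np.concatenate([[0], tokens[:-1]])       # previous token, 0-padded
        # Simple multiplicative pair hash; the real hash may differ.
        idx = (prev * 1000003 + tokens) % BUCKETS
        return self.table[idx] @ self.proj              # (T, PROJ)
```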
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: {"decoupled_weight_decay":true,"normalized_newton_schulz":true}
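Muon's core step orthogonalizes the momentum matrix via a quintic Newton-Schulz iteration. The coefficients below follow the public Muon reference; treating "normalized" as operating on the Frobenius-normalized matrix is an assumption about the NorMuon variant:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient/momentum matrix with a quintic
    Newton-Schulz iteration, as in Muon (sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic coefficients from Muon
    x = g / (np.linalg.norm(g) + 1e-7)          # Frobenius-normalize first
    transposed = x.shape[0] > x.shape[1]
    if transposed:                               # work with the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x    # drives singular values toward 1
    return x.T if transposed else x
```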
Initialization
OrthoInit
Orthogonal initialization with muP-style output projection scaling by 1/sqrt(2L).
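A QR-based sketch of orthogonal initialization with the 1/sqrt(2L) output-projection scaling described above (the PR's actual init routine may differ):

```python
import numpy as np

def ortho_init(shape, n_layers: int, is_output_proj: bool, rng) -> np.ndarray:
    """Orthogonal init via QR; output projections additionally scaled by
    1/sqrt(2L) in muP style (sketch)."""
    a = rng.standard_normal(shape)
    # QR of the tall orientation yields orthonormal columns; sign-correct
    # with diag(R) so the distribution is uniform over orthogonal matrices.
    q, r = np.linalg.qr(a if shape[0] >= shape[1] else a.T)
    q = q * np.sign(np.diag(r))
    w = q if shape[0] >= shape[1] else q.T
    if is_output_proj:
        w = w / np.sqrt(2 * n_layers)   # muP-style depth scaling
    return w
```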
Evaluation
sliding window eval
parameters: {"stride":256,"context_length":2048}
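Strided sliding-window evaluation typically scores only the new tokens of each window, so every token is evaluated exactly once with up to a full context of left history. A sketch with the listed stride and context length:

```python
def sliding_window_eval(tokens, context_length=2048, stride=256):
    """Return (window, n_new) pairs for strided eval: each window holds up to
    `context_length` tokens, and loss is scored only on its final `n_new`
    tokens (sketch)."""
    windows = []
    end = min(context_length, len(tokens))
    windows.append((tokens[:end], end))            # first window scores everything
    while end < len(tokens):
        n_new = min(stride, len(tokens) - end)     # tokens to score this step
        end += n_new
        windows.append((tokens[end - context_length:end], n_new))
    return windows
```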
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
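A "warmdown" schedule is commonly a constant learning rate followed by a linear decay to zero over the final warmdown_iters steps; the constant-then-linear shape here is an assumption:

```python
def lr_at(step: int, total_iters: int, warmdown_iters: int = 3000,
          base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to zero over the last
    `warmdown_iters` steps (sketch)."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```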
Regularization
weight decay
parameters: {"value":0.02,"decoupled":true}
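Decoupled weight decay (AdamW-style) shrinks the parameter directly rather than folding the decay into the gradient, so it interacts cleanly with adaptive or orthogonalized updates. A one-line sketch:

```python
import numpy as np

def decoupled_wd_step(param: np.ndarray, update: np.ndarray,
                      lr: float, wd: float = 0.02) -> np.ndarray:
    """One optimizer step with decoupled weight decay: the decay term
    multiplies the parameter directly, independent of the gradient (sketch)."""
    return param - lr * update - lr * wd * param
```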

Novel Contributions

  • Int6 per-row quantization with fp16 scales and fp16 tied embeddings
  • MLP hidden expansion to 3x model dimension enabled by quantization savings
  • NorMuon / normalized Newton-Schulz optimization with decoupled weight decay
  • SmearGate token blending with previous-token context
  • BigramHash embedding for token pairs
  • Orthogonal initialization with muP-scaled output projections
  • Sliding window evaluation with stride 256 over 2048-token windows