PR #312 (open)

Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)

by chanwoo-park-official

val_bpb: 1.1668
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,267,347 bytes

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: int6 for MLP and attention; int8 for other large tensors
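The mixed int6/int8 scheme can be sketched as plain symmetric per-tensor quantization, where only the bit width differs between tensor groups. This is a minimal illustration, not the PR's actual packing code; function names and the use of a per-tensor (rather than per-channel) scale are my assumptions.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization to signed `bits`-bit integers.
    Returns (codes, scale) such that w ~= codes * scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

# int6 for MLP/attention weights, int8 for the other large tensors
w = np.random.randn(64, 64).astype(np.float32)
codes6, s6 = quantize_symmetric(w, bits=6)
codes8, s8 = quantize_symmetric(w, bits=8)
```

The int6 codes span only [-31, 31], so the artifact stores six usable bits per weight; int8 keeps a finer grid for the tensors that are more sensitive to rounding.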
Architecture
Canon ACD
Canon layers placed before attention (Canon-A), before the MLP (Canon-C), and in the widened MLP hidden stream (Canon-D), avoiding the expensive QKV placement (Canon-B).
parameters: {"set":"ACD","kernel":3}
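As I understand them, Canon layers are causal depthwise 1-D convolutions over the sequence axis, added residually to the activation stream; the A/C/D placements apply the same operation at three points in the block. A minimal NumPy sketch with kernel size 3 (the function name and residual form are my assumptions):

```python
import numpy as np

def canon_layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Causal depthwise conv added residually.
    x: (T, d) activations; w: (K, d) per-channel kernel, K=3 here.
    out[t] = x[t] + sum_k w[k] * x[t - (K-1) + k], with left zero-padding."""
    T, d = x.shape
    K = w.shape[0]
    pad = np.concatenate([np.zeros((K - 1, d), x.dtype), x], axis=0)
    conv = sum(w[k] * pad[k : k + T] for k in range(K))
    return x + conv

x = np.random.randn(10, 8).astype(np.float32)
w = np.random.randn(3, 8).astype(np.float32) * 0.1
y = canon_layer(x, w)
```

Because the padding is strictly on the left, position t only mixes in positions t-1 and t-2, preserving causality; skipping the QKV placement (Canon-B) avoids running this conv three extra times per attention block.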
BigramHash
Bigram hash embedding added as a context extra.
parameters: {"bigram_vocab_size":2048,"bigram_dim":128}
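A bigram hash embedding hashes each (previous, current) token pair into a small bucket table (2048 buckets, 128-dim here, matching the PR's parameters) and looks up a learned vector to add to the context. This sketch uses an illustrative multiplicative hash and a BOS id of 0; both are my choices, not the PR's.

```python
import numpy as np

def bigram_hash_embed(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Hash each (prev, cur) token pair into len(table) buckets and look up
    a learned embedding. Position 0 is paired with an assumed BOS id of 0."""
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % table.shape[0]   # cheap illustrative hash
    return table[idx]                                   # (T, bigram_dim)

rng = np.random.default_rng(0)
table = rng.normal(size=(2048, 128)).astype(np.float32)  # bigram_vocab_size x bigram_dim
tokens = rng.integers(0, 50257, size=16)
e = bigram_hash_embed(tokens, table)
```

Hash collisions are accepted by design: 2048 buckets is far smaller than vocab², but the table is cheap and the model learns around collisions.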
SmearGate
SmearGate context component, used alongside the bigram hash embeddings.
parameters: null
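The PR leaves SmearGate's parameters null, so the exact form is unclear; the common "smear" trick is to blend a fraction of the previous token's embedding into the current one via a learned sigmoid gate. The following is a guess at that shape, with all names and the gating form being my assumptions:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x: np.ndarray, gate_w: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor:
    out[t] = x[t] + sigmoid(x[t] @ gate_w) * x[t-1]; position 0 is unchanged.
    The gating form is assumed, not taken from the PR."""
    g = sigmoid(x @ gate_w)          # (T, 1) per-token gate in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g[1:] * x[:-1]
    return out

x = np.random.randn(8, 16).astype(np.float32)
gate_w = np.random.randn(16, 1).astype(np.float32) * 0.1
y = smear_gate(x, gate_w)
```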
KV head count
Grouped-query attention: uses fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
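With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving the KV cache. The sharing reduces to repeating KV tensors along the head axis before standard attention; a minimal sketch (function name is mine):

```python
import numpy as np

def gqa_expand_kv(kv: np.ndarray, num_heads: int) -> np.ndarray:
    """Share each KV head across num_heads // num_kv_heads query heads by
    repeating along the head axis. kv: (num_kv_heads, T, head_dim)."""
    num_kv = kv.shape[0]
    assert num_heads % num_kv == 0
    return np.repeat(kv, num_heads // num_kv, axis=0)  # (num_heads, T, head_dim)

k = np.random.randn(4, 10, 32).astype(np.float32)   # num_kv_heads=4
k_full = gqa_expand_kv(k, num_heads=8)              # expanded for 8 query heads
```

`np.repeat` duplicates each KV head consecutively, so query heads 0 and 1 attend over KV head 0, heads 2 and 3 over KV head 1, and so on.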
MLP3x
Transformer MLP widened with multiplier 3.0.
parameters: {"mlp_mult":3}
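The 3x multiplier sets the MLP hidden width to 3 * d_model rather than the conventional 4x. A minimal sketch; the squared-ReLU activation is an illustrative choice, not confirmed by the PR:

```python
import numpy as np

def mlp3x(x: np.ndarray, d_model: int, mlp_mult: int = 3) -> np.ndarray:
    """Two-layer transformer MLP with hidden width mlp_mult * d_model.
    Weights are randomly initialized here purely for illustration."""
    rng = np.random.default_rng(0)
    w_in = rng.normal(size=(d_model, mlp_mult * d_model)).astype(np.float32) * 0.02
    w_out = rng.normal(size=(mlp_mult * d_model, d_model)).astype(np.float32) * 0.02
    h = np.maximum(x @ w_in, 0.0) ** 2   # squared ReLU (assumed activation)
    return h @ w_out

x = np.random.randn(4, 64).astype(np.float32)
y = mlp3x(x, d_model=64)
```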
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"adam_weight_decay":0.04}
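The momentum warmup ramps Muon's momentum from 0.92 to its final 0.99 over the first 1500 steps. The endpoints and step count are from the PR's parameters; the linear ramp shape is my assumption.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Linearly ramp momentum from `start` to `end` over `warmup_steps`,
    then hold at `end` for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients; once statistics stabilize, the higher momentum smooths the trajectory.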
Weight Averaging
SWA
parameters: {"enabled":1,"every":200,"start_lrmul":0.5,"averaged_checkpoints":8}
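The SWA config (every=200, averaged_checkpoints=8) reads as: snapshot the weights every 200 steps and average the last 8 snapshots. The rolling-window interpretation and the class below are my assumptions about that config, not the PR's code.

```python
from collections import deque
import numpy as np

class SWA:
    """Average the last `window` weight snapshots, taken every `every` steps."""
    def __init__(self, every: int = 200, window: int = 8):
        self.every = every
        self.snaps = deque(maxlen=window)   # oldest snapshots are evicted

    def maybe_snapshot(self, step: int, params):
        if step % self.every == 0:
            self.snaps.append([p.copy() for p in params])

    def averaged(self):
        n = len(self.snaps)
        return [sum(s[i] for s in self.snaps) / n
                for i in range(len(self.snaps[0]))]

# toy run: one scalar-ish parameter that increments each "checkpoint"
swa = SWA(every=200, window=8)
p = [np.zeros(3)]
for step in range(0, 2001, 200):
    p = [p[0] + 1.0]
    swa.maybe_snapshot(step, p)
avg = swa.averaged()
```

The `start_lrmul: 0.5` parameter suggests the averaging window runs at a reduced learning-rate multiplier, consistent with applying SWA near the end of training.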
Evaluation
sliding window eval
parameters: {"stride":64}
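Sliding-window eval scores each token with (near-)full left context: the window advances by `stride` tokens at a time, and loss is only counted on the newly uncovered tokens. A sketch of the span bookkeeping with toy sizes (the PR uses stride 64 with eval_length 2048; helper name is mine):

```python
def sliding_eval_positions(seq_len: int, window: int, stride: int):
    """Return (chunk_start, score_start, score_end) triples such that the
    score regions tile [0, seq_len) and each scored chunk fits in `window`."""
    spans, pos = [], 0
    while pos < seq_len:
        start = max(0, pos + stride - window)   # left context for this chunk
        spans.append((start, pos, min(pos + stride, seq_len)))
        pos += stride
    return spans

spans = sliding_eval_positions(seq_len=256, window=128, stride=64)
```

A smaller stride means more forward passes but less context truncation per scored token, which generally lowers the measured bpb relative to naive non-overlapping chunking; that is why the stride belongs in the reported eval configuration.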
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
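The schedule is trapezoidal: a 20-step linear warmup, a constant plateau, then a linear warmdown to zero over the final 3000 iterations. The step counts are from the PR's parameters; the linear shapes are my assumption.

```python
def lr_mult(step: int, total_steps: int, warmup_steps: int = 20,
            warmdown_iters: int = 3000) -> float:
    """Learning-rate multiplier: linear warmup, constant plateau,
    then linear warmdown to 0 over the last `warmdown_iters` steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```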
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}

Novel Contributions

  • Mixed int6 quantization for MLP and attention with int8 for other large tensors
  • Canon ACD placement with kernel size 3 to retain Canon benefits while avoiding QKV cost
  • Bigram hash embedding and SmearGate context extras
  • Muon + Adam mixed optimization with momentum warmup and LR warmdown
  • SWA near the end of training
  • Sliding-window evaluation with stride 64 as the main comparison metric