PR #208

closed

Staging: Int6 MLP3x 11L + SmearGate + BigramHash4096x128 + MuonWD038 + SWA50 + DocSliding (single-run val_bpb=1.1568)

by ajkpersonal
val_bpb: 1.1568
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,704,854 bytes

Training Techniques

Quantization
int6
bits: 6
scope: artifact/model weights
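The int6 setting can be sketched as symmetric per-tensor quantization onto the signed 6-bit grid [-31, 31]. The helper names and the per-tensor scaling scheme are assumptions; the PR records only the bit width and scope.

```python
import numpy as np

def quantize_int6(w):
    """Map float weights onto the signed 6-bit grid [-31, 31] (symmetric, per-tensor)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6 bits of payload
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the int6 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.03, 0.31], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

With a symmetric grid the round-trip error is bounded by half a quantization step per weight.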
Architecture
MLP3x
Expanded MLP width by 3x in an 11-layer dense-lexical KV4 model.
parameters: {"layers":11}
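A minimal sketch of the MLP3x block, assuming a plain two-matrix position-wise MLP with hidden width 3x the model width; the activation choice (ReLU here) is an assumption, since the PR records only the 3x expansion and the 11-layer depth.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """Position-wise MLP with hidden width 3x the model width.

    x: (seq, d); w_in: (d, 3*d); w_out: (3*d, d).
    ReLU is an assumption; the PR does not state the activation.
    """
    h = np.maximum(x @ w_in, 0.0)   # (seq, 3*d) hidden activations
    return h @ w_out                # project back to (seq, d)

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d))
y = mlp3x(x, rng.standard_normal((d, 3 * d)), rng.standard_normal((3 * d, d)))
```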
SmearGate
Added SmearGate to the model.
parameters: null
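The exact SmearGate formulation is not recorded here (parameters: null). One common "smear" variant blends each token's activations with its predecessor through a learned sigmoid gate; the additive form below is an assumption, not the PR's definition.

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Blend each position with the previous one via a learned sigmoid gate.

    x: (seq, dim) activations; gate_logit: (dim,) learned parameter.
    Position 0 has no predecessor and passes through unchanged.
    The additive mixing form is an assumption.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # per-channel mix weight in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]           # smear the previous token forward
    return out

x = np.arange(12, dtype=np.float32).reshape(4, 3)
y = smear_gate(x, np.zeros(3))             # gate = sigmoid(0) = 0.5
```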
BigramHash
Added bigram hash features to the model.
parameters: {"dimensions":"4096x128"}
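The BigramHash(4096x128) feature can be sketched as hashing each (previous, current) token pair into one of 4096 buckets and looking up a 128-dim embedding for that bucket. The hash function and the padding token at position 0 are assumptions; only the 4096x128 table shape comes from the PR.

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM = 4096, 128  # matches the "4096x128" table shape

def bigram_hash_features(tokens, table, mult=1000003):
    """Hash each (prev, cur) token pair into one of N_BUCKETS embedding rows."""
    toks = np.asarray(tokens)
    prev = np.concatenate(([0], toks[:-1]))   # pad position 0 (assumption)
    idx = (prev * mult + toks) % N_BUCKETS    # cheap multiplicative hash
    return table[idx]                         # (seq, BIGRAM_DIM) features

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, BIGRAM_DIM)).astype(np.float32)
feats = bigram_hash_features([3, 7, 3, 7], table)
```

The returned features would typically be added to (or concatenated with) the token embeddings, giving the model a cheap lexical n-gram signal.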
Optimizer
Muon
weight_decay: 0.038
momentum: null
other_params: {"adam_weight_decay":0.01}
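Muon orthogonalizes the momentum-accumulated gradient of each 2-D weight matrix with a few Newton-Schulz iterations. The quintic coefficients below follow the published reference implementation; the learning rate and the decoupled form of the 0.038 weight decay are assumptions, and Adam (with its own weight decay of 0.01) would handle the non-matrix parameters.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, wd=0.038):
    """One Muon update with decoupled weight decay 0.038 (the value in this run)."""
    buf = momentum * buf + grad
    update = newton_schulz(grad + momentum * buf)  # Nesterov-style lookahead
    w = (1 - lr * wd) * w - lr * update
    return w, buf

rng = np.random.default_rng(0)
O = newton_schulz(rng.standard_normal((4, 6)))
```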
Weight Averaging
SWA
parameters: {"every":50,"start_frac":0.5}
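The SWA schedule above (every=50, start_frac=0.5) averages weight snapshots taken every 50 steps once training passes the halfway point. The inner SGD step below is a stand-in for the run's real optimizer; only the snapshot bookkeeping reflects the recorded parameters.

```python
import numpy as np

def train_with_swa(w, grad_fn, total_steps, every=50, start_frac=0.5, lr=0.1):
    """Average snapshots taken every `every` steps from `start_frac` of training on."""
    swa_sum, swa_n = np.zeros_like(w), 0
    for step in range(1, total_steps + 1):
        w = w - lr * grad_fn(w, step)          # stand-in for the real update
        if step >= start_frac * total_steps and step % every == 0:
            swa_sum += w                       # accumulate a snapshot
            swa_n += 1
    return swa_sum / max(swa_n, 1), swa_n      # averaged weights used at eval

# Toy run: gradient pulls w toward 0; 200 steps -> snapshots at 100, 150, 200.
w_avg, n_snapshots = train_with_swa(np.ones(3), lambda w, t: w, 200)
```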
Evaluation
sliding window eval
parameters: {"context_length":2048,"stride":256}
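Sliding-window evaluation with context_length=2048 and stride=256 rescores the document in overlapping windows, so every token after the first window is predicted with at least 2048 - 256 = 1792 tokens of left context. Only the span bookkeeping is sketched here; the model forward pass is omitted.

```python
def sliding_windows(n_tokens, context=2048, stride=256):
    """Plan (window_start, window_end, score_from) spans for sliding-window eval.

    Each window covers up to `context` tokens; only positions from
    `score_from` onward are newly scored, so every token is counted once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))   # score tokens [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(3000)
```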
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Compression
zstd
level: null
Regularization
weight decay
parameters: {"muon_weight_decay":0.038,"adam_weight_decay":0.01}

Novel Contributions

  • 11-layer dense-lexical KV4 model with MLP3x
  • SmearGate architecture addition
  • BigramHash(4096x128) feature augmentation
  • Muon optimizer with weight decay 0.038 plus Adam weight decay 0.01
  • SWA every 50 steps starting at 50% of training
  • Legal re-export path using int6_zstd_core with doc_sliding 2048/256 to fit the artifact cap