PR #78

open

Record: 8192 Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb

by mtybadger
val_bpb: 1.1858
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14,796,836 bytes

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: weights int6, embeddings int8
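The selective scheme (int6 weight matrices, int8 embeddings) can be illustrated with plain symmetric per-tensor quantization. This is a generic sketch, not the PR's actual code; the scaling and rounding details are assumptions:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization to signed `bits`-bit integers,
    stored in an int8 container (int6 values occupy [-32, 31])."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0                       # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Selective assignment per the PR: int6 for weights, int8 for embeddings.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)
embeddings = rng.standard_normal((8192, 256)).astype(np.float32)

qw, sw = quantize_symmetric(weights, bits=6)
qe, se = quantize_symmetric(embeddings, bits=8)
```

Keeping embeddings at int8 is the usual motivation for a split like this: embedding rows are looked up directly rather than averaged over many multiply-accumulates, so they tolerate less quantization noise than weight matrices.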
Optimizer
NorMuon
weight_decay: null
momentum: null
other_params: null
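NorMuon builds on Muon's orthogonalized update. Below is a minimal sketch of the Newton-Schulz orthogonalization at Muon's core (coefficients from the widely used modded-nanogpt implementation), plus a simple row-wise RMS rescaling as an illustrative stand-in for NorMuon's neuron-wise normalization. The PR lists all optimizer hyperparameters as null, so everything here is an assumption, not the submission's configuration:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix via the quintic
    Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius norm 1 bounds the spectrum
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x

def per_neuron_normalize(update: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Illustrative stand-in for NorMuon's neuron-wise step-size control:
    rescale each output row to unit RMS (an assumption, not the published
    algorithm)."""
    rms = np.sqrt((update ** 2).mean(axis=1, keepdims=True)) + eps
    return update / rms
```

The orthogonalization pushes all singular values of the update toward 1, so every direction in the layer gets a similar step size regardless of the raw gradient's conditioning.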
Architecture
vocab size
Increased the tokenizer/model vocabulary from 1024 to 8192, which required reducing the layer count to 8 to stay within the size constraints.
parameters: {"vocab_size":8192,"layers":8}
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Compression
zlib
level: null
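The artifact is zlib-compressed at an unspecified level. Int6 values stored in int8 containers use only 64 of 256 byte values, so even a generic byte-level compressor recovers some of the wasted container bits. A sketch of the round trip (`Z_BEST_COMPRESSION` is an example choice, not the PR's setting):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Simulated int6 payload: values in [-32, 31] held in int8 bytes.
q = rng.integers(-32, 32, size=100_000, dtype=np.int8)

raw = q.tobytes()
packed = zlib.compress(raw, level=zlib.Z_BEST_COMPRESSION)
restored = zlib.decompress(packed)
```

Even on this worst-case uniform payload, zlib approaches the 6-bits-in-8 entropy bound (~75% of the raw size); real weight distributions are far from uniform, so actual savings should be larger.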

Novel Contributions

  • Expanded vocabulary size from 1024 to 8192 using a newly trained tokenizer
  • Replaced Muon with NorMuon optimizer
  • Applied selective quantization with int6 weights and int8 embeddings
  • Reduced model depth to accommodate the larger vocabulary