PR #78
Record: 8192 Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb
by mtybadger
val_bpb: 1.1858
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14,796,836 bytes
Training Techniques

Quantization: mixed int6/int8
- bits: 6
- scope: weights int6, embeddings int8
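As a sketch, selective symmetric quantization of the kind listed above might look like the following; the per-tensor scaling and rounding scheme is an assumption, since the PR's exact method isn't shown here:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 127 for int8
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Selective policy from the table: int6 for weight matrices, int8 for embeddings.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)
embeddings = rng.normal(size=(8192, 64)).astype(np.float32)

qw, sw = quantize_symmetric(weights, bits=6)
qe, se = quantize_symmetric(embeddings, bits=8)
```

The int6 values still occupy int8 containers here; the narrow value range is what the zlib compression stage (below in this record) can exploit.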
Optimizer: NorMuon
- weight_decay: null
- momentum: null
- other_params: null
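NorMuon is broadly described as Muon's orthogonalized momentum update combined with per-neuron second-moment normalization. A rough NumPy sketch under that reading follows; the normalization placement, beta values, and rescaling are assumptions for illustration, not the PR's implementation:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximate orthogonalization via Newton-Schulz iteration, as used in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def normuon_step(W, grad, state, lr=0.02, beta=0.95, beta2=0.95, eps=1e-8):
    """One NorMuon-style step (sketch): orthogonalize the momentum, then
    normalize by a per-row (per-neuron) running second moment."""
    state["m"] = beta * state["m"] + (1 - beta) * grad
    O = newton_schulz_orth(state["m"])
    row_sq = (O ** 2).mean(axis=1, keepdims=True)       # per-neuron second moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * row_sq
    update = O / (np.sqrt(state["v"]) + eps)
    # Rescale so the overall update norm matches the unnormalized one (assumed).
    update *= np.linalg.norm(O) / (np.linalg.norm(update) + eps)
    return W - lr * update
```

Since the record lists momentum and weight decay as null, the defaults above are placeholders only.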
Architecture

vocab size: Increased the tokenizer/model vocabulary from 1024 to 8192, requiring a reduction in layer count to fit constraints.
- parameters: {"vocab_size": 8192, "layers": 8}
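The depth-for-vocabulary trade can be illustrated with rough parameter arithmetic; the d_model and the parameter-count formula below are hypothetical, not taken from the PR:

```python
def transformer_params(vocab_size: int, layers: int, d_model: int, ffn_mult: int = 4) -> int:
    """Rough count: tied embedding plus attention (4*d^2) and MLP (2*mult*d^2) per layer."""
    return vocab_size * d_model + layers * (4 + 2 * ffn_mult) * d_model ** 2

# With a hypothetical d_model = 256, growing the vocab from 1024 to 8192
# adds 7168 * 256 embedding parameters -- comparable to a couple of layers.
extra_embed = transformer_params(8192, 8, 256) - transformer_params(1024, 8, 256)
one_layer = (4 + 2 * 4) * 256 ** 2
print(extra_embed, one_layer)
```

Under these toy numbers the embedding growth costs roughly two layers' worth of parameters, which is consistent with the stated layer reduction.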
Sequence Length
- train_length: 4096
- eval_length: null
LR Schedule: warmdown
- parameters: {"warmdown_iters": 3000}
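A warmdown schedule of this shape is typically a constant learning rate followed by a linear decay to zero over the final iterations. A minimal sketch using the warmdown_iters=3000 value above (base_lr and total_steps are placeholders, not from the PR):

```python
def lr_at(step: int, total_steps: int, base_lr: float, warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear warmdown to zero over the last warmdown_iters steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```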
Compression: zlib
- level: null
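Since int6 values stored in byte containers use only 64 of 256 possible codes, zlib can reclaim much of that slack when producing the artifact. A toy measurement (the level and layout are assumptions; the record lists level as null):

```python
import zlib
import numpy as np

# Random int6-range values in int8 containers: ~6 bits of entropy per 8-bit byte.
rng = np.random.default_rng(0)
int6_vals = rng.integers(-32, 32, size=100_000, dtype=np.int8)
raw = int6_vals.tobytes()
packed = zlib.compress(raw, level=9)
print(len(raw), len(packed))   # compressed size should come in well under raw
```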
Novel Contributions
- Expanded vocabulary size from 1024 to 8192 using a newly trained tokenizer
- Replaced Muon with NorMuon optimizer
- Applied selective quantization with int6 weights and int8 embeddings
- Reduced model depth to accommodate the larger vocabulary