PR #92

Status: open

Record: 8192 Vocab, Sliding Window Eval, Selective Quantization; 1.194 val_bpb

by saikrishnarallabandi
val_bpb: 1.1938
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14.7 MB

Training Techniques

Quantization: mixed int6/int8
  bits: 6
  scope: weights and embeddings
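The selective scheme the record describes (int6 for weights, int8 for embeddings) can be sketched with symmetric per-tensor scaling. The function names and the per-tensor round-to-nearest scheme are assumptions for illustration, not the PR's actual implementation:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int):
    """Quantize to signed `bits`-bit integers with one per-tensor scale.

    Assumed scheme (not necessarily the PR's): symmetric, round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:                      # all-zero tensor edge case
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Per the record's scope: weights at 6 bits, embeddings at 8 bits.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)
embeddings = rng.standard_normal((8, 4)).astype(np.float32)
qw, sw = quantize_symmetric(weights, bits=6)
qe, se = quantize_symmetric(embeddings, bits=8)
```

Since the scale maps the tensor's max magnitude exactly onto qmax, the round-trip error is bounded by half a quantization step.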
Evaluation: sliding window eval
  parameters: {"stride": 256}
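Sliding-window evaluation with a stride of 256 typically scores each token exactly once while giving later tokens up to a full context of left history. A minimal sketch of the window planning; `sliding_window_spans` is a hypothetical helper, not the PR's code, and the 4096 context comes from the record's train_length:

```python
def sliding_window_spans(seq_len: int, context_len: int = 4096, stride: int = 256):
    """Plan evaluation windows so every token is scored exactly once.

    Each window sees up to `context_len` tokens of input; only the tokens
    no earlier window covered (at most `stride` after the first window)
    contribute to the loss, so they are scored with long left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_len, seq_len)
        n_scored = end - prev_end         # tokens not yet scored
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == seq_len:
            break
    return spans

# e.g. a 10,000-token validation stream with the record's stride of 256
spans = sliding_window_spans(10_000)
```

A smaller stride gives each scored token more context at the cost of more forward passes per token.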
Optimizer: Muon
  weight_decay: null
  momentum: null
  other_params: {"variant": "NorMuon"}
Sequence Length
  train_length: 4096
  eval_length: null
LR Schedule: warmdown
  parameters: {"warmdown_iters": 3000}
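A warmdown schedule usually holds the learning rate constant and then decays it linearly to zero over the final warmdown_iters steps. A minimal sketch; the constant-then-linear shape, the total iteration count, and the base LR are assumptions, with only warmdown_iters=3000 taken from the record:

```python
def warmdown_lr(step: int, total_iters: int,
                warmdown_iters: int = 3000, base_lr: float = 1.0) -> float:
    """Constant base_lr, then linear decay to 0 over the last warmdown_iters."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```

The multiplier would typically be applied per optimizer step, e.g. `lr = warmdown_lr(step, total_iters) * peak_lr`.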
Other: SP-8192 tokenizer for improved token compression
  parameters: {"vocab_size": 8192}

Novel Contributions

  • SP-8192 tokenizer for better token compression
  • NorMuon optimizer for improved convergence
  • Sliding window evaluation with stride 256
  • Selective quantization using INT6 weights and INT8 embeddings
  • 8-layer model with TRAIN_SEQ_LEN=4096
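For context on the headline metric: val_bpb normalizes validation loss by raw byte count, which keeps the number comparable across tokenizers; that matters here because the SP-8192 vocabulary changes how many tokens each byte of text becomes. A minimal conversion sketch, assuming the loss is a mean per-token negative log-likelihood in nats:

```python
import math

def bits_per_byte(mean_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token NLL (nats) into bits per byte of raw text."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)   # nats -> bits
    return total_bits / n_bytes

# A tokenizer with better compression (fewer tokens per byte) lowers bpb
# for the same per-token loss: 1 bit/token over 4 bytes/token gives 0.25 bpb.
bpb = bits_per_byte(math.log(2), n_tokens=1000, n_bytes=4000)
```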