PR #92
Record: 8192 Vocab, Sliding Window Eval, Selective Quantization; 1.194 val_bpb
by saikrishnarallabandi
val_bpb
1.1938
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
14.7 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: weights (INT6) and embeddings (INT8)
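The record does not spell out the quantization scheme beyond the bit widths, so the following is a minimal sketch of one common approach that fits the description: per-row symmetric quantization, with 6 bits for weight matrices and 8 bits for embeddings. The function names are illustrative, not taken from the PR.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Per-row symmetric quantization to signed integers.

    Maps each row of `w` onto [-(2**(bits-1) - 1), 2**(bits-1) - 1]
    using one float scale per row. INT6 values are stored in an int8
    container, since there is no native 6-bit dtype.
    """
    qmax = 2 ** (bits - 1) - 1               # 31 for INT6, 127 for INT8
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from ints and per-row scales."""
    return q.astype(np.float32) * scale
```

Under this scheme, the "selective" part is simply calling `quantize_symmetric(w, bits=6)` on weight matrices and `bits=8` on the embedding table, trading a little reconstruction error on weights for a smaller artifact (consistent with the 14.7 MB size above).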
Evaluation
sliding window eval
parameters: {"stride":256}
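The record gives only the stride, so this is a sketch of the standard sliding-window evaluation pattern it implies: windows advance by `stride` tokens, each token is scored exactly once, and every scored token (after the first window) sees up to a full window of left context. The helper name and the window default of 4096 (the train length listed below) are assumptions.

```python
def sliding_windows(n_tokens, window=4096, stride=256):
    """Plan sliding-window evaluation spans over a token stream.

    Returns (begin, end, first_scored) triples: the model is run on
    tokens [begin, end), but loss is taken only over [first_scored, end),
    i.e. the tokens not already scored by an earlier window. This scores
    every token exactly once while giving each one long left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A val_bpb computed this way is typically lower than a disjoint-chunk evaluation, since only the first window's early tokens are scored with short context.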
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"NorMuon"}
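The core step shared by Muon and its NorMuon variant is orthogonalizing each weight-matrix gradient via an odd quintic Newton-Schulz iteration; NorMuon additionally applies per-neuron normalization to the resulting update, which is not shown here. A minimal NumPy sketch of the orthogonalization step, using the quintic coefficients from the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix (Muon's core step).

    Frobenius-normalizes g so its singular values lie in (0, 1], then runs
    a quintic Newton-Schulz iteration that pushes them toward 1, yielding
    roughly U @ V^T from the SVD g = U @ S @ V^T.
    """
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from Muon
    x = g / (np.linalg.norm(g) + 1e-7)  # ensure singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x                # f(x) = a*x + b*x^3 + c*x^5 per singular value
    return x.T if transposed else x
```

After five iterations the singular values of the output cluster near 1 rather than hitting it exactly; Muon tolerates this looseness in exchange for a cheap, matmul-only iteration.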
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
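"Warmdown" here is the schedule popularized by speedrun-style training runs: hold the learning rate constant, then decay it linearly to zero over the final `warmdown_iters` steps. A sketch of the multiplier as a function of step (the function name and `total_iters` argument are illustrative):

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Warmdown LR multiplier: 1.0 until the final warmdown_iters
    iterations, then a linear ramp down to 0.0 at total_iters."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```

The base learning rate is multiplied by this scale each step, so with `warmdown_iters=3000` the decay occupies only the tail of training.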
Other
other
SP-8192 tokenizer for improved token compression
parameters: {"vocab_size":8192}
Novel Contributions
- SP-8192 tokenizer for better token compression
- NorMuon optimizer for improved convergence
- Sliding window evaluation with stride 256
- Selective quantization using INT6 weights and INT8 embeddings
- 8-layer model with TRAIN_SEQ_LEN=4096