PR #217
open
Record: SP4096 int6+zstd 10L496 overtone+phase sliding (val_bpb=1.1753)
by kshitizz36
val_bpb
1.1753
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,672,752 bytes
Training Techniques
Quantization
int6
bits: 6
scope: all
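The int6 setting above can be illustrated with a minimal sketch. It assumes symmetric per-row scaling and a clamp range of [-31, 31]; the PR summary does not state the scaling scheme or exact range, and the helper names `quantize_int6`, `dequantize`, and `to_int8_container` are hypothetical.

```python
from array import array


def quantize_int6(row):
    """Symmetric quantization of a list of floats to integers in [-31, 31].

    Assumption: one scale per row, chosen so the largest magnitude maps to 31.
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


def to_int8_container(q):
    """Store each 6-bit-range value in one signed byte (no bit packing)."""
    return array("b", q).tobytes()


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int6(weights)
payload = to_int8_container(q)
restored = dequantize(q, scale)
```

Each value still occupies a full byte here; the size reduction is left to the downstream compressor rather than to bit packing.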
Architecture
tied embeddings
Token embedding weights are tied with the output projection and stored unquantized (passthrough) in fp16.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
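As a rough illustration of the head configuration above, the sketch below maps each of the 8 query heads to one of the 4 shared KV heads. The contiguous grouping is an assumption (it is the common convention, but the PR summary does not specify it), and `kv_head_for` is a hypothetical helper.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the index of the KV head that a given query head attends with.

    Assumption: query heads are grouped contiguously, so heads 0 and 1
    share KV head 0, heads 2 and 3 share KV head 1, and so on.
    """
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
```

With 8 query heads and 4 KV heads, each KV head serves two query heads, which halves the key and value projection parameters relative to full multi-head attention.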
Optimizer
Muon
weight_decay: 0.02
momentum: 0.95
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"tied_embed_lr":0.1}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"eval_batch_seqs":256}
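One way to read the stride parameter above: each evaluation window advances by `stride` tokens, and only the tokens not already scored by an earlier window contribute to the loss, so later tokens are scored with close to a full window of context. The sketch below is an assumption about the windowing, not the PR's actual evaluation code; `sliding_windows` and the window length used in the example are hypothetical.

```python
def sliding_windows(n_tokens, seq_len, stride):
    """Return (start, end, score_from) triples covering n_tokens.

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    serve only as context. Every token is scored exactly once.
    """
    if not 0 < stride <= seq_len:
        raise ValueError("stride must be in (0, seq_len]")
    windows = []
    start = 0
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


# Hypothetical sizes: a 1024-token window over 4096 tokens, stride 64.
windows = sliding_windows(n_tokens=4096, seq_len=1024, stride=64)
```

After the first window, each subsequent window scores only its final 64 tokens, at the cost of roughly `seq_len / stride` times more forward-pass compute than non-overlapping evaluation.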
Initialization
spectral init
Overtone spectral initialization with phase-transition residual mixing.
resid mix
Phase-transition residual mixing initialization.
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
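A warmdown schedule of this kind typically holds the learning rate constant and then decays it linearly to zero over the final iterations. The sketch below assumes an iteration-based linear decay; the PR summary gives only `warmdown_iters`, so the total iteration count in the example and the function name `lr_scale` are hypothetical.

```python
def lr_scale(step, total_iters, warmdown_iters=2500):
    """Return the multiplier applied to the base learning rate at `step`.

    Assumption: constant at 1.0 until the last `warmdown_iters` steps,
    then linear decay to 0.0 at `total_iters`.
    """
    warmdown_start = max(total_iters - warmdown_iters, 0)
    if step < warmdown_start:
        return 1.0
    return max(total_iters - step, 0) / warmdown_iters


# Hypothetical run length of 10,000 iterations.
scales = [lr_scale(s, total_iters=10_000) for s in (0, 7_500, 8_750, 10_000)]
```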
Regularization
weight decay
parameters: {"weight_decay":0.02}
Other
other
SentencePiece tokenizer with 4096 vocabulary size.
parameters: {"vocab_size":4096}
Novel Contributions
- 4096-vocab SentencePiece tokenizer
- int6-range quantization in an int8 container to improve zstd compressibility
- zstd level 22 compression
- fp16 passthrough for token embeddings
- sliding-window evaluation with stride 64 and long context coverage
- overtone spectral initialization
- phase-transition residual mixing
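The claim that int6-range values in an int8 container compress better can be checked with a small experiment. The sketch below uses synthetic uniformly random values, not real model weights, and substitutes the standard-library `zlib` for zstd so that it runs without extra dependencies; absolute sizes will differ from zstd level 22, but the direction of the effect should be the same. The helper `random_container` is hypothetical.

```python
import random
import zlib


def random_container(n, lo, hi, seed=0):
    """Return n random integers in [lo, hi], one per byte (two's complement)."""
    rng = random.Random(seed)
    return bytes((rng.randint(lo, hi) & 0xFF) for _ in range(n))


# Same container width (one byte per value), different value ranges.
int6_bytes = random_container(20_000, -32, 31)    # 64 distinct symbols
int8_bytes = random_container(20_000, -128, 127)  # 256 distinct symbols

# zlib stands in for zstd here; level 9 is zlib's maximum.
int6_size = len(zlib.compress(int6_bytes, 9))
int8_size = len(zlib.compress(int8_bytes, 9))
```

Restricting the values to 64 symbols lowers the entropy per byte from about 8 bits to about 6 bits, which an entropy coder can exploit even though the container itself is unchanged.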