PR #217

open

Record: SP4096 int6+zstd 10L496 overtone+phase sliding (val_bpb=1.1753)

by kshitizz36
val_bpb
1.1753
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,672,752 bytes

Training Techniques

Quantization
int6
bits: 6
scope: all
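As a rough illustration of the int6 entry above, the sketch below rounds floats to the 6-bit range [-31, 31] with a single shared scale. The symmetric range, the per-tensor scale, and the function names `quantize_int6` and `dequantize_int6` are assumptions made for illustration; the PR summary does not say how scales are chosen or grouped.

```python
def quantize_int6(weights):
    """Round floats to integers in [-31, 31] using one shared scale (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int6(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]


weights = [0.31, -0.15, 0.0, 0.07]
q, scale = quantize_int6(weights)
restored = dequantize_int6(q, scale)
```

Each quantized value fits in 6 bits, and the round-trip error per weight is bounded by half the scale.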
Architecture
tied embeddings
Token embedding weights are tied and kept as an fp16 passthrough.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
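With 8 query heads and 4 KV heads, each key/value head is shared by two query heads. A minimal sketch of that mapping follows, assuming the common convention that consecutive query heads share a KV head; the helper name `kv_head_for_query_head` is hypothetical and the PR's actual layout is not stated here.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Return the KV head index used by a given query head (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head for 8/4
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```

Under this grouping the key and value projections carry half as many heads as the query projection.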
Optimizer
Muon
weight_decay: 0.02
momentum: 0.95
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"tied_embed_lr":0.1}
Compression
zstd
level: 22
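The sketch below illustrates why restricting values to a 6-bit range inside an 8-bit container can help a general-purpose compressor. It is only a stand-in: it uses `zlib` from the Python standard library rather than zstd, and uniformly random values rather than real model weights, so the absolute sizes say nothing about this PR's artifact.

```python
import random
import zlib


def compressed_size(values):
    """Store each signed value in one byte (two's complement) and compress with zlib."""
    payload = bytes(v & 0xFF for v in values)
    return len(zlib.compress(payload, 9))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 20000
int6_vals = [rng.randint(-31, 31) for _ in range(n)]    # 6-bit range, 8-bit container
int8_vals = [rng.randint(-128, 127) for _ in range(n)]  # full 8-bit range

size_int6 = compressed_size(int6_vals)
size_int8 = compressed_size(int8_vals)
```

The int6-range bytes use only 63 of the 256 possible byte values, so the entropy coder can spend fewer bits per byte; full-range random bytes are essentially incompressible.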
Evaluation
sliding window eval
parameters: {"stride":64,"eval_batch_seqs":256}
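One way to picture sliding-window evaluation with stride 64: windows advance 64 tokens at a time, and each window scores only the tokens not already scored, so later tokens are predicted with more preceding context. The sketch below is a hedged illustration; the function name `sliding_windows` and the context length in the example are hypothetical, since the summary gives only the stride and the batch size.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (start, end, score_from) triples.

    Each window covers tokens[start:end]; only tokens[score_from:end]
    contribute to the loss, so every token is scored exactly once.
    """
    if stride > context_len:
        raise ValueError("stride must not exceed context_len")
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


# Example with a hypothetical 128-token context over 256 tokens.
windows = sliding_windows(num_tokens=256, context_len=128, stride=64)
scored = sum(end - score_from for _, end, score_from in windows)
```

After the first window, each step scores 64 new tokens that see at least 64 tokens of prior context.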
Initialization
spectral init
Overtone spectral initialization with phase-transition residual mixing.
resid mix
Phase-transition residual mixing initialization.
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
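A warmdown schedule holds the learning rate at its base value and then decays it over the final iterations. The sketch below assumes a linear decay to zero and a hypothetical total of 10,000 iterations; only `warmdown_iters = 2500` comes from the summary, and the name `lr_scale` is illustrative.

```python
def lr_scale(step, total_iters, warmdown_iters=2500):
    """Multiplier applied to the base learning rate (linear warmdown assumed)."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    # Decay linearly from 1.0 at warmdown_start to 0.0 at total_iters.
    return max(0.0, (total_iters - step) / warmdown_iters)


total_iters = 10_000  # hypothetical run length, not taken from the PR
scales = [lr_scale(s, total_iters) for s in (0, 7500, 8750, 10_000)]
```

Under these assumptions, a base rate such as the listed `matrix_lr` of 0.04 would be halved at step 8750.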
Regularization
weight decay
parameters: {"weight_decay":0.02}
Other
other
SentencePiece tokenizer with 4096 vocabulary size.
parameters: {"vocab_size":4096}

Novel Contributions

  • 4096-vocab SentencePiece tokenizer
  • int6-range quantization in an int8 container to improve zstd compressibility
  • zstd level 22 compression
  • fp16 passthrough for token embeddings
  • sliding-window evaluation with stride 64 and long context coverage
  • overtone spectral initialization
  • phase-transition residual mixing