PR #1707

open

Add SP10240 + FreqGPTQ + lowercase tokenization: 1.07399 BPB

by nothingLivaView on GitHub
val_bpb
1.0740
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Other
other
Lowercase tokenization: the training data is casefolded before building a custom SP10240 tokenizer, removing case-variant duplicates and improving tokenization efficiency.
parameters: {"tokenizer":"SP10240","casefold":true}
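A minimal sketch of the casefold-then-train preprocessing step. `str.casefold()` is Python's aggressive lowercasing (it also folds characters like the German eszett), so case variants collapse to one surface form before the tokenizer ever sees them; the file names and the commented SentencePiece call are illustrative, not taken from the PR.

```python
def casefold_corpus(lines):
    """Casefold every line so case variants map to one token sequence."""
    return [line.casefold() for line in lines]

corpus = ["The Cat", "THE CAT", "the cat", "Straße"]
folded = casefold_corpus(corpus)
# All three case variants now produce the identical string "the cat",
# so the tokenizer stores one entry instead of several case duplicates.

# The folded corpus would then feed a SentencePiece training run, e.g.:
# spm.SentencePieceTrainer.train(input="folded.txt",
#                                model_prefix="sp10240",
#                                vocab_size=10240)
```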
Quantization
GPTQ
bits: 6
scope: block weights
mixed int6/int7
bits: null
scope: all
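A toy sketch of what frequency-weighted GPTQ calibration could look like. Standard GPTQ accumulates a Hessian proxy H = Σᵢ xᵢxᵢᵀ over calibration activations; here each sample is scaled by the frequency of the token that produced it, so frequent tokens dominate the quantization-error objective. The weighting scheme and names are assumptions for illustration, with toy dimensions and no performance tricks.

```python
def accumulate_hessian(samples, weights, dim):
    """Accumulate H += w * x x^T for each (activation, frequency-weight) pair."""
    H = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(samples, weights):
        for i in range(dim):
            for j in range(dim):
                H[i][j] += w * x[i] * x[j]
    return H

samples = [[1.0, 0.0], [0.0, 2.0]]
weights = [3.0, 0.5]          # e.g. relative token frequencies
H = accumulate_hessian(samples, weights, 2)
# The frequent token's direction (weight 3.0) now outweighs the rare one.
```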
Architecture
depth recurrence
Depth recurrence architecture used as the base model structure.
parameters: {"layers":10}
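The depth-recurrence idea can be sketched as one shared block applied repeatedly, matching the `{"layers":10}` parameter as ten recurrences of tied weights rather than ten distinct layers. The scalar "block" below is a stand-in, not a real attention + MLP layer.

```python
def shared_block(h):
    """Placeholder for a transformer block with tied weights."""
    return 0.5 * h + 1.0

def depth_recurrent_forward(h, recurrences=10):
    # The same block (same weights) is applied at every depth step.
    for _ in range(recurrences):
        h = shared_block(h)
    return h

out = depth_recurrent_forward(0.0, recurrences=10)
```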
Weight Averaging
EMA
parameters: {"decay":0.9965}
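A minimal sketch of EMA weight averaging with the reported decay of 0.9965: after each optimizer step a shadow copy of the weights is updated, and the shadow (not the raw weights) is what gets evaluated.

```python
DECAY = 0.9965  # decay value from the PR's EMA parameters

def ema_update(shadow, weights, decay=DECAY):
    """shadow <- decay * shadow + (1 - decay) * weights, element-wise."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for step in range(3):
    weights = [1.0]             # pretend training holds the weight at 1.0
    shadow = ema_update(shadow, weights)
# After n steps toward a constant weight, shadow = 1 - DECAY**n.
```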
Optimizer
Muon
weight_decay: 0.6
momentum: null
other_params: {"matrix_lr":0.028,"scalar_lr":0.028,"tied_embed_lr":0.042}
LR Schedule
warmdown
parameters: {"warmdown":0.6}
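A sketch of the warmdown schedule under one plausible reading of `warmdown=0.6`: the learning rate stays constant for the first 40% of training, then decays linearly to zero over the final 60% of steps. This interpretation is an assumption, not confirmed by the PR.

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.6):
    """Constant LR, then linear decay to zero over the last warmdown_frac of steps."""
    decay_start = int(total_steps * (1.0 - warmdown_frac))
    if step < decay_start:
        return base_lr
    # Linear decay from base_lr down to 0 across the warmdown window.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)

total = 1000
lr_at(0, total, 0.028)      # constant phase: full matrix_lr
lr_at(1000, total, 0.028)   # end of training: fully decayed
```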
Evaluation
sliding window eval
parameters: {"stride":64}
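A sketch of how sliding-window evaluation with a stride carves up a sequence: each window is scored only on its last `stride` tokens, so every token is predicted with (near) full left context while no token is scored twice. The window size and span layout are illustrative assumptions; only the stride of 64 comes from the PR.

```python
def eval_spans(seq_len, window=256, stride=64):
    """Yield (window_start, score_start, score_end) spans covering the sequence."""
    spans = []
    pos = 0
    while pos < seq_len:
        # The window ends where scoring ends; earlier tokens are context only.
        start = max(0, pos + stride - window)
        spans.append((start, pos, min(pos + stride, seq_len)))
        pos += stride
    return spans

spans = eval_spans(200, window=128, stride=64)
# Scored regions tile [0, 200) exactly once: 0-64, 64-128, 128-192, 192-200.
```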
Regularization
weight decay
parameters: {"value":0.6}

Novel Contributions

  • Lowercase tokenization using casefolding to reduce case-variant duplication in the tokenizer.
  • Frequency-weighted GPTQ calibration that boosts high-frequency tokens during Hessian accumulation.
  • Frequency-weighted embedding quantization with higher precision for the most frequent tokens.
  • Mixed-precision quantization scheme using INT6 for most embeddings and INT7 for the most frequent tokens.
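The mixed int6/int7 assignment above can be sketched as a simple frequency ranking: the top slice of tokens by frequency gets 7 bits, everything else 6. The cutoff fraction here is an illustrative assumption; the PR only states that the most frequent tokens get the higher precision.

```python
def assign_bits(freqs, hi_frac=0.1, hi_bits=7, lo_bits=6):
    """Per-token bit widths: the top hi_frac of tokens by frequency get hi_bits."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    n_hi = max(1, int(len(freqs) * hi_frac))
    bits = [lo_bits] * len(freqs)
    for i in order[:n_hi]:
        bits[i] = hi_bits
    return bits

freqs = [500, 3, 42, 7, 900, 1, 18, 260, 5, 64]
bits = assign_bits(freqs)   # 10 tokens -> only the most frequent gets int7
```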