PR #1707

open

Add SP10240 + FreqGPTQ + lowercase tokenization: 1.07399 BPB

by nothingLivaView on GitHub
val_bpb
1.0740
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Other
other
Lowercase tokenization: the training data is casefolded before building a custom SP10240 tokenizer, removing case-variant duplicates and improving tokenization efficiency.
parameters: {"tokenizer":"SP10240","casefold":true}
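A minimal sketch of the casefold-then-train preprocessing step. `str.casefold()` is Python's aggressive lowercasing (it also folds characters like the German eszett), so case variants collapse to one surface form before the tokenizer ever sees them; the file names and the commented SentencePiece call are illustrative, not taken from the PR.

```python
def casefold_corpus(lines):
    """Casefold every line so case variants map to one token sequence."""
    return [line.casefold() for line in lines]

corpus = ["The Cat", "THE CAT", "the cat", "Straße"]
folded = casefold_corpus(corpus)
# All three case variants now produce the identical string "the cat",
# so the tokenizer stores one entry instead of several case duplicates.

# The folded corpus would then feed a SentencePiece training run, e.g.:
# spm.SentencePieceTrainer.train(input="folded.txt",
#                                model_prefix="sp10240",
#                                vocab_size=10240)
```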
Quantization
GPTQ
bits: 6
scope: block weights
mixed int6/int7
bits: null
scope: all
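A toy sketch of what frequency-weighted GPTQ calibration could look like. Standard GPTQ accumulates a Hessian proxy H = Σᵢ xᵢxᵢᵀ over calibration activations; here each sample is scaled by the frequency of the token that produced it, so frequent tokens dominate the quantization-error objective. The weighting scheme and names are assumptions for illustration, with toy dimensions and no performance tricks.

```python
def accumulate_hessian(samples, weights, dim):
    """Accumulate H += w * x x^T for each (activation, frequency-weight) pair."""
    H = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(samples, weights):
        for i in range(dim):
            for j in range(dim):
                H[i][j] += w * x[i] * x[j]
    return H

samples = [[1.0, 0.0], [0.0, 2.0]]
weights = [3.0, 0.5]          # e.g. relative token frequencies
H = accumulate_hessian(samples, weights, 2)
# The frequent token's direction (weight 3.0) now outweighs the rare one.
```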
Architecture
depth recurrence
Depth recurrence architecture used as the base model structure.
parameters: {"layers":10}
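The depth-recurrence idea can be sketched as one shared block applied repeatedly, matching the `{"layers":10}` parameter as ten recurrences of tied weights rather than ten distinct layers. The scalar "block" below is a stand-in, not a real attention + MLP layer.

```python
def shared_block(h):
    """Placeholder for a transformer block with tied weights."""
    return 0.5 * h + 1.0

def depth_recurrent_forward(h, recurrences=10):
    # The same block (same weights) is applied at every depth step.
    for _ in range(recurrences):
        h = shared_block(h)
    return h

out = depth_recurrent_forward(0.0, recurrences=10)
```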
Weight Averaging
EMA
parameters: {"decay":0.9965}
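A minimal sketch of EMA weight averaging with the reported decay of 0.9965: after each optimizer step a shadow copy of the weights is updated, and the shadow (not the raw weights) is what gets evaluated.

```python
DECAY = 0.9965  # decay value from the PR's EMA parameters

def ema_update(shadow, weights, decay=DECAY):
    """shadow <- decay * shadow + (1 - decay) * weights, element-wise."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for step in range(3):
    weights = [1.0]             # pretend training holds the weight at 1.0
    shadow = ema_update(shadow, weights)
# After n steps toward a constant weight, shadow = 1 - DECAY**n.
```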
Optimizer
Muon
weight_decay: 0.6
momentum: null
other_params: {"matrix_lr":0.028,"scalar_lr":0.028,"tied_embed_lr":0.042}
LR Schedule
warmdown
parameters: {"warmdown":0.6}
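A sketch of the warmdown schedule under one plausible reading of `warmdown=0.6`: the learning rate stays constant for the first 40% of training, then decays linearly to zero over the final 60% of steps. This interpretation is an assumption, not confirmed by the PR.

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.6):
    """Constant LR, then linear decay to zero over the last warmdown_frac of steps."""
    decay_start = int(total_steps * (1.0 - warmdown_frac))
    if step < decay_start:
        return base_lr
    # Linear decay from base_lr down to 0 across the warmdown window.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)

total = 1000
lr_at(0, total, 0.028)      # constant phase: full matrix_lr
lr_at(1000, total, 0.028)   # end of training: fully decayed
```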
Evaluation
sliding window eval
parameters: {"stride":64}
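A sketch of how sliding-window evaluation with a stride carves up a sequence: each window is scored only on its last `stride` tokens, so every token is predicted with (near) full left context while no token is scored twice. The window size and span layout are illustrative assumptions; only the stride of 64 comes from the PR.

```python
def eval_spans(seq_len, window=256, stride=64):
    """Yield (window_start, score_start, score_end) spans covering the sequence."""
    spans = []
    pos = 0
    while pos < seq_len:
        # The window ends where scoring ends; earlier tokens are context only.
        start = max(0, pos + stride - window)
        spans.append((start, pos, min(pos + stride, seq_len)))
        pos += stride
    return spans

spans = eval_spans(200, window=128, stride=64)
# Scored regions tile [0, 200) exactly once: 0-64, 64-128, 128-192, 192-200.
```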
Regularization
weight decay
parameters: {"value":0.6}

Novel Contributions

  • Lowercase tokenization using casefolding to reduce case-variant duplication in the tokenizer.
  • Frequency-weighted GPTQ calibration that boosts high-frequency tokens during Hessian accumulation.
  • Frequency-weighted embedding quantization with higher precision for the most frequent tokens.
  • Mixed-precision quantization scheme using INT6 for most embeddings and INT7 for the most frequent tokens.
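The mixed int6/int7 assignment above can be sketched as a simple frequency ranking: the top slice of tokens by frequency gets 7 bits, everything else 6. The cutoff fraction here is an illustrative assumption; the PR only states that the most frequent tokens get the higher precision.

```python
def assign_bits(freqs, hi_frac=0.1, hi_bits=7, lo_bits=6):
    """Per-token bit widths: the top hi_frac of tokens by frequency get hi_bits."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    n_hi = max(1, int(len(freqs) * hi_frac))
    bits = [lo_bits] * len(freqs)
    for i in order[:n_hi]:
        bits[i] = hi_bits
    return bits

freqs = [500, 3, 42, 7, 900, 1, 18, 260, 5, 64]
bits = assign_bits(freqs)   # 10 tokens -> only the most frequent gets int7
```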