PR #1666
openRecord: BESE 288-vocab Novel Tokenizer — 1.1531 BPB (3-seed mean)
by mrbese
val_bpb
1.1531
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
12.72 MB
Training Techniques
Architecture
depth recurrence
Applies depth recurrence over layers 3-5, running that block of layers three times per forward pass during training.
parameters: {"layers":[3,5],"loops":3}
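The recurrence can be sketched as a plain layer loop; this is a minimal illustration using the record's layer range and loop count, with the block structure and function name assumed rather than taken from the submission:

```python
def forward_with_depth_recurrence(x, layers, recur_range=(3, 5), loops=3):
    """Run a layer stack, looping a recurrent block of layers.

    `layers` is a list of callables (transformer blocks). Layers 3-5
    (inclusive) are applied `loops` times in sequence, so the model
    reuses those weights for extra effective depth at no parameter cost.
    """
    lo, hi = recur_range
    for layer in layers[:lo]:          # layers before the recurrent block
        x = layer(x)
    for _ in range(loops):             # recurrent block, repeated
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # remaining layers
        x = layer(x)
    return x
```

With 8 layers and loops=3, a forward pass applies 14 layer calls while storing only 8 layers' worth of weights.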
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
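Grouped query attention with these head counts means each KV head serves two query heads, halving the KV cache. A minimal single-sequence sketch (shapes and function name are illustrative, not from the submission):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q has n_heads, k/v have n_kv_heads.

    q: (T, n_heads, d); k, v: (T, n_kv_heads, d). Each group of
    n_heads // n_kv_heads query heads attends to one shared KV head.
    """
    T, _, d = q.shape
    group = n_heads // n_kv_heads
    # Repeating each KV head across its query group is equivalent to sharing.
    k = np.repeat(k, group, axis=1)               # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)
```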
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
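Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A sketch under standard RoPE conventions (the base frequency and pairing layout are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    of each head dimension; remaining channels stay position-agnostic.

    x: (T, n_heads, head_dim).
    """
    T, H, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    theta = np.arange(T)[:, None] * inv_freq[None, :]   # (T, half)
    cos = np.cos(theta)[:, None, :]                     # broadcast over heads
    sin = np.sin(theta)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```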
BigramHash
Adds a hashed bigram feature embedding for evaluation and modeling.
parameters: {"vocab":2048,"dim":128}
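The bigram feature can be sketched as hashing each (previous, current) token pair into a small learned table; the hash function and padding choice here are assumptions, with the table size and dimension taken from the parameters:

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=2048, mult=1000003):
    """Look up a hashed-bigram embedding for each position.

    Each (previous, current) token pair is hashed into a table of
    `vocab` rows; the resulting vectors can be added to the regular
    token embeddings. `mult` is an arbitrary odd mixing constant.
    """
    prev = [0] + list(tokens[:-1])                 # pad the first position
    idx = [(p * mult + t) % vocab for p, t in zip(prev, tokens)]
    return table[np.array(idx)]                    # (len(tokens), dim)
```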
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
MLP3x
Uses a 3x MLP expansion ratio.
parameters: {"multiplier":3}
LeakyReLU
Uses LeakyReLU squared as the activation function.
Value Residual
Enables value embeddings in later layers.
parameters: {"dim":128,"layers":[9,10]}
Quantization
late QAT
bits: 6
scope: all
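Quantization-aware training at 6 bits is typically done with fake quantization: weights are rounded to the quantized grid in the forward pass while gradients flow through unchanged. A symmetric per-tensor sketch (the grid layout is an assumption; "late" refers to enabling this only near the end of training):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization for QAT: map weights to a
    2**bits-level integer grid and back. During training, the rounded
    values are used forward while gradients pass straight through."""
    levels = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = max(np.abs(w).max(), 1e-12) / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale
```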
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"interval":50,"condition":"lr_scale < 0.2"}
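The two averaging schemes above can be maintained side by side: an exponential moving average updated every step, and a plain running mean sampled every 50 steps once the LR has decayed below 20% of peak. A scalar-weight sketch (class and method names are illustrative):

```python
class AveragedWeights:
    """Maintain EMA and interval-gated SWA copies of model weights.

    EMA:  w_ema <- decay * w_ema + (1 - decay) * w, every step.
    SWA:  running mean, updated every `interval` steps, but only once
          the condition lr_scale < 0.2 holds.
    """
    def __init__(self, weights, decay=0.9965, interval=50):
        self.decay, self.interval = decay, interval
        self.ema = list(weights)
        self.swa, self.swa_n = [0.0] * len(weights), 0

    def update(self, weights, step, lr_scale):
        self.ema = [self.decay * e + (1 - self.decay) * w
                    for e, w in zip(self.ema, weights)]
        if step % self.interval == 0 and lr_scale < 0.2:
            self.swa_n += 1
            self.swa = [s + (w - s) / self.swa_n   # incremental mean
                        for s, w in zip(self.swa, weights)]
```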
Evaluation
sliding window eval
parameters: {"stride":64}
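Sliding-window evaluation scores each chunk of new tokens with as much left context as fits in the model's window, advancing by the stride. A chunking sketch (the window size is an assumed value; the record specifies only stride=64):

```python
def sliding_window_positions(n_tokens, window=128, stride=64):
    """Yield (ctx_start, score_start, score_end) chunks for eval.

    Each forward pass sees up to `window` tokens of context, but only
    the final `stride` positions of each chunk are scored, so every
    token is evaluated exactly once with near-maximal context.
    """
    chunks, start = [], 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - window)
        chunks.append((ctx_start, start, end))
        start = end
    return chunks
```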
Regularization
LN scale
parameters: {"enabled":true}
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
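Two pieces of this optimizer entry can be sketched: the momentum warmup (interpreting `warmup_from`/`warmup_steps` as a linear ramp to the peak momentum, which is an assumption) and the Newton-Schulz orthogonalization step that characterizes Muon, using the coefficients from the public Muon implementation:

```python
import numpy as np

def muon_momentum(step, warmup_from=0.92, peak=0.99, warmup_steps=1500):
    """Linearly warm momentum from 0.92 to 0.99 over 1500 steps
    (one plausible reading of the record's warmup fields)."""
    t = min(step / warmup_steps, 1.0)
    return warmup_from + t * (peak - warmup_from)

def orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize the momentum buffer before applying the update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # normalize so iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x
```

"Parallel" here refers to sharding this work across devices; that distribution logic is omitted.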
Adam
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":5000}
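This is the common trapezoidal schedule: a short linear warmup, a flat plateau, then a linear warmdown over the final steps. A sketch (the total step count is an assumed value; the record gives only the warmup/warmdown lengths):

```python
def lr_scale(step, warmup_steps=20, warmdown_steps=5000, total_steps=6000):
    """Trapezoidal LR multiplier in [0, 1]."""
    if step < warmup_steps:
        return step / warmup_steps                          # linear warmup
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)  # warmdown
    return 1.0                                              # plateau
```

Note this multiplier is also what the SWA condition `lr_scale < 0.2` above gates on.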
Compression
lzma
level: null
Other
other
Custom two-layer BESE tokenizer with 40 structured base tokens and 248 BPE merges, replacing SentencePiece.
parameters: {"vocab_size":288}
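The two-layer structure (a fixed base alphabet plus learned merges) follows the usual BPE pattern: text is first mapped to the 40 base tokens, then the 248 merges are applied greedily in learned order to reach the 288-token vocabulary. The BESE base layer itself is not specified in this card; a generic merge-application sketch:

```python
def apply_merges(ids, merges):
    """Greedily apply learned BPE merges to a base-token sequence.

    `merges` maps (left, right) token-id pairs to new ids, iterated in
    learned order. A 288-token vocab here is 40 base tokens + 248 merges.
    """
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)     # replace the matched pair
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```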
Novel Contributions
- First custom tokenizer submission on the record track
- Two-layer BESE tokenizer with 40 base tokens and 248 BPE merges
- Byte-count invariant tokenizer design for exact BPB accounting
- Custom tokenizer reduces embedding table size to fund deeper recurrence and other model capacity
- Depth recurrence, parallel residuals, and n-gram eval-time logit tilt enabled by saved artifact budget
- Three-seed mean result with all runs under the wallclock limit
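The byte-count-invariant design makes BPB accounting exact: summed token cross-entropy converts to bits and divides by the UTF-8 byte length of the evaluated text, independent of how the tokenizer segments it. As a formula sketch:

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Exact BPB: total cross-entropy over all tokens (in nats),
    converted to bits and divided by the byte count of the text.
    Because the tokenizer preserves byte counts, this comparison is
    tokenizer-neutral across submissions."""
    return total_nll_nats / (math.log(2) * n_bytes)
```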