val_bpb: 1.1531
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 12.72 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 are repeated 3 times during training and evaluation, once the recurrence activates at 35% of training progress.
parameters: {"layers":[3,4,5],"loops":3,"activation_frac":0.35}
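A minimal sketch of the depth-recurrence forward pass described above, assuming the model's layers are a flat list of callables and the looped block is contiguous; names are illustrative, not from the submission.

```python
def forward_with_recurrence(x, layers, loop_layers=(3, 4, 5), loops=3, active=True):
    """Run layers in order; when `active`, the contiguous loop_layers block
    is executed `loops` times instead of once."""
    i = 0
    while i < len(layers):
        if active and i == loop_layers[0]:
            block = [layers[j] for j in loop_layers]
            for _ in range(loops):
                for layer in block:
                    x = layer(x)
            i = loop_layers[-1] + 1  # skip past the looped block
        else:
            x = layers[i](x)
            i += 1
    return x
```

Before the 35% activation point, `active=False` gives the plain (non-recurrent) forward pass.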
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
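A NumPy sketch of grouped-query attention with the head counts above (8 query heads, 4 KV heads, so each KV head serves 2 query heads); shapes and the causal mask are assumptions about the implementation.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q carries n_heads, k/v carry n_kv_heads;
    each KV head is broadcast across n_heads // n_kv_heads query heads.
    q: (T, n_heads*hd), k/v: (T, n_kv_heads*hd)."""
    T, D = q.shape
    hd = D // n_heads
    group = n_heads // n_kv_heads
    q = q.reshape(T, n_heads, hd)
    k = np.repeat(k.reshape(T, n_kv_heads, hd), group, axis=1)
    v = np.repeat(v.reshape(T, n_kv_heads, hd), group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.einsum('hts,shd->thd', w, v)
    return out.reshape(T, D)
```

The KV projections are half the size of the query projection, which is the parameter saving GQA buys.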
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
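One plausible reading of "LeakyReLU squared" with `negative_slope` 0.5, sketched below: square the LeakyReLU output while preserving its sign, by analogy with the squared-ReLU activation. The sign-preserving choice is an assumption; a plain square would discard the sign of negative inputs.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by a signed square, so negative inputs
    remain negative after squaring (an assumed convention)."""
    y = np.where(x >= 0, x, negative_slope * x)
    return np.sign(y) * y * y
```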
Parallel residuals
GPT-J style parallel residual connections starting from layer 7.
parameters: {"start_layer":7}
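A sketch of the difference between a sequential pre-LN block and the GPT-J parallel-residual block used from layer 7 onward; `attn`, `mlp`, and `norm` stand in for the real submodules.

```python
def block_forward(x, attn, mlp, norm, parallel=False):
    """One transformer block. In the GPT-J parallel form, attention and
    MLP both read the same normalized input and their outputs are summed
    into the residual; in the sequential form the MLP sees the
    post-attention residual."""
    if parallel:
        h = norm(x)
        return x + attn(h) + mlp(h)
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```

The parallel form lets the two submodules run concurrently and removes one normalization from the critical path.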
Partial RoPE
Applies rotary position embeddings to only part of the hidden dimensions.
parameters: {"dimensions":16}
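A NumPy sketch of partial RoPE: rotary embeddings are applied to the first 16 dimensions of each per-position vector and the remaining dimensions pass through unchanged. The half-split rotation layout and base frequency are conventional assumptions.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims dimensions of x by position-dependent
    angles; leave the rest untouched. x: (seq_len, head_dim)."""
    T, D = x.shape
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

Leaving most dimensions unrotated gives the model position-free channels while keeping relative-position information in the rotated ones.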
Value Residual
Learned value embeddings (dimension 128) enabled for the later layers (9 and 10).
parameters: {"dimension":128,"layers":[9,10]}
BigramHash
Adds a precomputed bigram hash embedding/bias feature.
parameters: {"vocab":2048,"dimension":128}
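A sketch of the hashed-bigram embedding idea: each (previous, current) token pair is hashed into one of 2048 buckets, and a 128-dim embedding for that bucket is added to the token embedding. The hash mixing constant and the bucket-0 convention for position 0 are illustrative assumptions.

```python
import numpy as np

def bigram_hash_ids(tokens, n_buckets=2048):
    """Hash each (prev, cur) token pair into a bucket id; the first
    position has no predecessor and is assigned bucket 0 here."""
    ids = [0]
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        ids.append(((prev * 1000003) ^ cur) % n_buckets)  # illustrative hash
    return np.array(ids)

def embed_with_bigram(tokens, tok_emb, bigram_emb):
    """Token embedding plus hashed-bigram embedding (same dimension)."""
    return tok_emb[np.array(tokens)] + bigram_emb[bigram_hash_ids(tokens, len(bigram_emb))]
```

Hashing keeps the table at 2048 x 128 regardless of the number of distinct bigrams, at the cost of collisions.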
XSA
XSA module used in the last 4 layers.
parameters: {"layers":4}
Quantization
INT6
bits: 6
scope: all
STE QAT (straight-through-estimator quantization-aware training)
bits: 6
scope: all
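A sketch of the fake-quantization forward pass used in STE QAT: weights are rounded to 6-bit levels and dequantized, while the backward pass (not shown) treats the operation as identity so gradients flow to the float master weights. Symmetric per-tensor scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric per-tensor fake quantization: snap w to 2**bits integer
    levels, then map back to float. In QAT the straight-through estimator
    makes the gradient of this op the identity."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Training against the quantized forward pass is what lets the final 6-bit artifact match the float model's loss closely.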
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
Adam
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95}
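The Muon entry above specifies a momentum warmup from 0.92 to the final 0.99 over 1,500 steps. The exact ramp shape isn't stated; a linear ramp, sketched below, is one common choice.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup (assumed linear): ramp from `start` to `end`
    over warmup_steps, then hold at `end`."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```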
Weight Averaging
EMA
parameters: {"decay":0.9965}
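A minimal sketch of the parameter EMA with the decay above; parameters are modeled as a flat list of floats for clarity.

```python
class EMA:
    """Exponential moving average of model parameters (decay 0.9965):
    shadow <- decay * shadow + (1 - decay) * param after each step."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial parameters
    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```

Evaluation then runs on the shadow weights rather than the raw training weights.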
SWA
parameters: {"interval_steps":50,"lr_scale_threshold":0.2}
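A sketch of how the SWA parameters above plausibly interact: once the LR schedule's scale has decayed below 0.2, snapshot the weights every 50 steps and keep a running mean. The gating logic is an assumption inferred from the parameter names.

```python
class SWA:
    """Stochastic weight averaging: average snapshots taken every
    interval_steps, but only after lr_scale drops below the threshold."""
    def __init__(self, interval_steps=50, lr_scale_threshold=0.2):
        self.interval = interval_steps
        self.threshold = lr_scale_threshold
        self.n = 0
        self.mean = None
    def maybe_update(self, step, lr_scale, params):
        if lr_scale > self.threshold or step % self.interval != 0:
            return
        self.n += 1
        if self.mean is None:
            self.mean = list(params)
        else:  # incremental running mean
            self.mean = [m + (p - m) / self.n for m, p in zip(self.mean, params)]
```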
Compression
LZMA
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
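A sketch of sliding-window evaluation spans with the stride above: the sequence is scored in overlapping windows, and after the first window only the final 64 positions of each window (those with the most left context) contribute to the loss. The window length of 256 is an assumption; only the stride is given.

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    """Return (start, end, n_scored) spans covering n_tokens: the first
    window scores everything it sees, later windows slide by `stride`
    and score only their newly revealed tail."""
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        scored = end - start if start == 0 else end - (start + window - stride)
        spans.append((start, end, scored))
        if end == n_tokens:
            break
        start += stride
    return spans
```

Every token is scored exactly once, so the resulting bits-per-byte is comparable to a single-pass evaluation, just with longer effective context.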
Regularization
LN scale
parameters: {"enabled":true}
LR Schedule
warmdown
parameters: {"warmdown_steps":5000}
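A sketch of the warmdown schedule: a constant learning rate followed by a linear decay to zero over the final 5,000 steps. The constant-then-linear shape is the usual convention for "warmdown" but is assumed here.

```python
def lr_scale(step, total_steps, warmdown_steps=5000):
    """LR multiplier: 1.0 until the last warmdown_steps, then a linear
    ramp down to 0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```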
Novel Contributions
- Novel two-layer BESE tokenizer with a 288-token vocabulary
- Structured 40-token base alphabet plus 248 BPE merges
- Byte-count-correct tokenizer design with proof of BPB invariance
- Reduced embedding table size versus SentencePiece to free budget for more model capacity
- Eval-time n-gram logit tilt using a precomputed bigram/trigram table
- Depth recurrence with parallel residuals under a 16MB artifact budget
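The eval-time n-gram logit tilt listed above can be sketched as follows: add weighted n-gram log-probabilities from the precomputed table to the model's logits, preferring a trigram match over a bigram fallback. The table layout, fallback order, and tilt weight are assumptions, not taken from the submission.

```python
import numpy as np

def tilt_logits(logits, context, trigram_table, bigram_table, weight=0.3):
    """Blend model logits with precomputed n-gram log-probabilities for
    the current context; trigram entries take priority over bigrams."""
    key3 = tuple(context[-2:])
    key2 = tuple(context[-1:])
    if key3 in trigram_table:
        return logits + weight * trigram_table[key3]
    if key2 in bigram_table:
        return logits + weight * bigram_table[key2]
    return logits  # no table entry: leave logits unchanged
```

Because the tilt is applied only at evaluation time, it costs nothing during training and only a table lookup per token at inference.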