PR #1666
openRecord: BESE 288-vocab Novel Tokenizer — 1.1531 BPB (3-seed mean)
by mrbese
val_bpb
1.1531
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
12.72 MB
Training Techniques
Architecture
depth recurrence
Applies depth recurrence over layers 3-5, running that block of layers three times per forward pass during training.
parameters: {"layers":[3,5],"loops":3}
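The recurrence can be sketched as a plain layer loop; this is a minimal illustration using the record's layer range and loop count, with the block structure and function name assumed rather than taken from the submission:

```python
def forward_with_depth_recurrence(x, layers, recur_range=(3, 5), loops=3):
    """Run a layer stack, looping a recurrent block of layers.

    `layers` is a list of callables (transformer blocks). Layers 3-5
    (inclusive) are applied `loops` times in sequence, so the model
    reuses those weights for extra effective depth at no parameter cost.
    """
    lo, hi = recur_range
    for layer in layers[:lo]:          # layers before the recurrent block
        x = layer(x)
    for _ in range(loops):             # recurrent block, repeated
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:      # remaining layers
        x = layer(x)
    return x
```

With 8 layers and loops=3, a forward pass applies 14 layer calls while storing only 8 layers' worth of weights.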
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
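Grouped query attention with these head counts means each KV head serves two query heads, halving the KV cache. A minimal single-sequence sketch (shapes and function name are illustrative, not from the submission):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q has n_heads, k/v have n_kv_heads.

    q: (T, n_heads, d); k, v: (T, n_kv_heads, d). Each group of
    n_heads // n_kv_heads query heads attends to one shared KV head.
    """
    T, _, d = q.shape
    group = n_heads // n_kv_heads
    # Repeating each KV head across its query group is equivalent to sharing.
    k = np.repeat(k, group, axis=1)               # (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)
```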
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":16}
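Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A sketch under standard RoPE conventions (the base frequency and pairing layout are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to only the first `rot_dims`
    of each head dimension; remaining channels stay position-agnostic.

    x: (T, n_heads, head_dim).
    """
    T, H, D = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    theta = np.arange(T)[:, None] * inv_freq[None, :]   # (T, half)
    cos = np.cos(theta)[:, None, :]                     # broadcast over heads
    sin = np.sin(theta)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```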
BigramHash
Adds a hashed bigram feature embedding for evaluation and modeling.
parameters: {"vocab":2048,"dim":128}
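The bigram feature can be sketched as hashing each (previous, current) token pair into a small learned table; the hash function and padding choice here are assumptions, with the table size and dimension taken from the parameters:

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=2048, mult=1000003):
    """Look up a hashed-bigram embedding for each position.

    Each (previous, current) token pair is hashed into a table of
    `vocab` rows; the resulting vectors can be added to the regular
    token embeddings. `mult` is an arbitrary odd mixing constant.
    """
    prev = [0] + list(tokens[:-1])                 # pad the first position
    idx = [(p * mult + t) % vocab for p, t in zip(prev, tokens)]
    return table[np.array(idx)]                    # (len(tokens), dim)
```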
XSA
Uses XSA in the last 4 layers.
parameters: {"layers":4}
MLP3x
Uses a 3x MLP expansion ratio.
parameters: {"multiplier":3}
LeakyReLU
Uses LeakyReLU squared as the activation function.
Value Residual
Enables value embeddings in later layers.
parameters: {"dim":128,"layers":[9,10]}
Quantization
late QAT
bits: 6
scope: all
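Quantization-aware training at 6 bits is typically done with fake quantization: weights are rounded to the quantized grid in the forward pass while gradients flow through unchanged. A symmetric per-tensor sketch (the grid layout is an assumption; "late" refers to enabling this only near the end of training):

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization for QAT: map weights to a
    2**bits-level integer grid and back. During training, the rounded
    values are used forward while gradients pass straight through."""
    levels = 2 ** (bits - 1) - 1                     # 31 for 6 bits
    scale = max(np.abs(w).max(), 1e-12) / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return q * scale
```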
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"interval":50,"condition":"lr_scale < 0.2"}
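The two averaging schemes above can be maintained side by side: an exponential moving average updated every step, and a plain running mean sampled every 50 steps once the LR has decayed below 20% of peak. A scalar-weight sketch (class and method names are illustrative):

```python
class AveragedWeights:
    """Maintain EMA and interval-gated SWA copies of model weights.

    EMA:  w_ema <- decay * w_ema + (1 - decay) * w, every step.
    SWA:  running mean, updated every `interval` steps, but only once
          the condition lr_scale < 0.2 holds.
    """
    def __init__(self, weights, decay=0.9965, interval=50):
        self.decay, self.interval = decay, interval
        self.ema = list(weights)
        self.swa, self.swa_n = [0.0] * len(weights), 0

    def update(self, weights, step, lr_scale):
        self.ema = [self.decay * e + (1 - self.decay) * w
                    for e, w in zip(self.ema, weights)]
        if step % self.interval == 0 and lr_scale < 0.2:
            self.swa_n += 1
            self.swa = [s + (w - s) / self.swa_n   # incremental mean
                        for s, w in zip(self.swa, weights)]
```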
Evaluation
sliding window eval
parameters: {"stride":64}
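Sliding-window evaluation scores each chunk of new tokens with as much left context as fits in the model's window, advancing by the stride. A chunking sketch (the window size is an assumed value; the record specifies only stride=64):

```python
def sliding_window_positions(n_tokens, window=128, stride=64):
    """Yield (ctx_start, score_start, score_end) chunks for eval.

    Each forward pass sees up to `window` tokens of context, but only
    the final `stride` positions of each chunk are scored, so every
    token is evaluated exactly once with near-maximal context.
    """
    chunks, start = [], 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        ctx_start = max(0, end - window)
        chunks.append((ctx_start, start, end))
        start = end
    return chunks
```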
Regularization
LN scale
parameters: {"enabled":true}
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
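Two pieces of this optimizer entry can be sketched: the momentum warmup (interpreting `warmup_from`/`warmup_steps` as a linear ramp to the peak momentum, which is an assumption) and the Newton-Schulz orthogonalization step that characterizes Muon, using the coefficients from the public Muon implementation:

```python
import numpy as np

def muon_momentum(step, warmup_from=0.92, peak=0.99, warmup_steps=1500):
    """Linearly warm momentum from 0.92 to 0.99 over 1500 steps
    (one plausible reading of the record's warmup fields)."""
    t = min(step / warmup_steps, 1.0)
    return warmup_from + t * (peak - warmup_from)

def orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize the momentum buffer before applying the update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # normalize so iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x
```

"Parallel" here refers to sharding this work across devices; that distribution logic is omitted.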
Adam
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":5000}
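This is the common trapezoidal schedule: a short linear warmup, a flat plateau, then a linear warmdown over the final steps. A sketch (the total step count is an assumed value; the record gives only the warmup/warmdown lengths):

```python
def lr_scale(step, warmup_steps=20, warmdown_steps=5000, total_steps=6000):
    """Trapezoidal LR multiplier in [0, 1]."""
    if step < warmup_steps:
        return step / warmup_steps                          # linear warmup
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)  # warmdown
    return 1.0                                              # plateau
```

Note this multiplier is also what the SWA condition `lr_scale < 0.2` above gates on.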
Compression
lzma
level: null
Other
other
Custom two-layer BESE tokenizer with 40 structured base tokens and 248 BPE merges, replacing SentencePiece.
parameters: {"vocab_size":288}
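The two-layer structure (a fixed base alphabet plus learned merges) follows the usual BPE pattern: text is first mapped to the 40 base tokens, then the 248 merges are applied greedily in learned order to reach the 288-token vocabulary. The BESE base layer itself is not specified in this card; a generic merge-application sketch:

```python
def apply_merges(ids, merges):
    """Greedily apply learned BPE merges to a base-token sequence.

    `merges` maps (left, right) token-id pairs to new ids, iterated in
    learned order. A 288-token vocab here is 40 base tokens + 248 merges.
    """
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)     # replace the matched pair
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```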
Novel Contributions
- First custom tokenizer submission on the record track
- Two-layer BESE tokenizer with 40 base tokens and 248 BPE merges
- Byte-count invariant tokenizer design for exact BPB accounting
- Custom tokenizer reduces embedding table size to fund deeper recurrence and other model capacity
- Depth recurrence, parallel residuals, and n-gram eval-time logit tilt enabled by saved artifact budget
- Three-seed mean result with all runs under the wallclock limit
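The byte-count-invariant design makes BPB accounting exact: summed token cross-entropy converts to bits and divides by the UTF-8 byte length of the evaluated text, independent of how the tokenizer segments it. As a formula sketch:

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Exact BPB: total cross-entropy over all tokens (in nats),
    converted to bits and divided by the byte count of the text.
    Because the tokenizer preserves byte counts, this comparison is
    tokenizer-neutral across submissions."""
    return total_nll_nats / (math.log(2) * n_bytes)
```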