PR #705
Byte-Level Tokenizer-Free Transformer: 1.2151 BPB (beats baseline 1.2244)
by seanward
val_bpb
1.2151
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.795055 MB
Training Techniques
Architecture
tied embeddings
Shares the byte embedding table with the output projection.
parameters: null
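Tying the embeddings means a single (vocab, d_model) table serves both as the input lookup and the output projection, which matters at this artifact size. A minimal sketch (dimensions here are illustrative, not from the PR):

```python
import numpy as np

# Hypothetical sketch of tied embeddings: one shared table is used both to
# embed input bytes and as the output projection (d_model is an assumption).
vocab, d_model = 256, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d_model)) * 0.02  # the single shared table

byte_ids = np.array([72, 105])   # input bytes for "Hi"
x = E[byte_ids]                  # embedding lookup: (2, d_model)
logits = x @ E.T                 # output projection reuses E: (2, vocab)
print(logits.shape)              # (2, 256)
```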
SmearGate
Adds SmearGate feature processing in the byte-level model.
parameters: null
BigramHash
Uses hashed byte-bigram embeddings to capture local byte-pair statistics.
parameters: {"buckets":4096,"dim":32}
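With only 256 byte embeddings, each position carries little context on its own; hashing the (previous, current) byte pair into a fixed number of buckets recovers cheap local pair statistics. A sketch under the stated `buckets=4096, dim=32` config (the hash function and how the feature is combined with the byte embedding are assumptions):

```python
import numpy as np

# Sketch of hashed byte-bigram embeddings. The multiplicative hash below is
# an assumption; the PR only specifies buckets=4096 and dim=32.
BUCKETS, DIM = 4096, 32
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)) * 0.02

def bigram_ids(byte_ids):
    # Hash each (prev, cur) byte pair into one of BUCKETS buckets.
    prev = np.concatenate(([0], byte_ids[:-1]))   # pad position 0
    return (prev * 257 + byte_ids) % BUCKETS

ids = np.frombuffer(b"hello", dtype=np.uint8).astype(np.int64)
feats = bigram_table[bigram_ids(ids)]   # (5, 32) per-position bigram features
print(feats.shape)                      # (5, 32)
```

These features would then be concatenated with or added to the per-byte embeddings before the transformer stack.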
MLP3x
Uses a 3x hidden-size MLP with LeakyReLU² activation.
parameters: {"hidden_multiplier":3,"hidden_dim":1536}
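The MLP widens to 3x the model width (hidden_dim 1536 implies d_model 512) with a squared LeakyReLU activation. A sketch, where reading "LeakyReLU²" as the elementwise square of LeakyReLU is an assumption:

```python
import numpy as np

# Sketch of the 3x MLP. Interpreting LeakyReLU-squared as
# (leaky_relu(x))**2 is an assumption, as is the 0.01 slope.
d_model, hidden = 512, 1536   # hidden = 3 * d_model

def leaky_relu_sq(x, slope=0.01):
    return np.where(x > 0, x, slope * x) ** 2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, hidden)) * 0.02   # up-projection
W2 = rng.standard_normal((hidden, d_model)) * 0.02   # down-projection

x = rng.standard_normal((4, d_model))
out = leaky_relu_sq(x @ W1) @ W2   # (4, d_model)
print(out.shape)
```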
U-Net style skip connections
Adds learned encoder-decoder skip connections across transformer layers.
parameters: null
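The skip pattern pairs each layer in the first half of the stack with a layer in the second half, so late layers can reuse early activations directly. A sketch of one plausible wiring (the layer pairing, scalar gating, and 6-layer depth are all assumptions; the PR only states the skips are learned):

```python
import numpy as np

# Sketch of U-Net style skips across transformer layers: save activations
# from the "encoder" half, add a learned-weighted copy in the "decoder" half.
n_layers, d_model = 6, 64
half = n_layers // 2
rng = np.random.default_rng(0)

def layer(x):
    return x + 0.1 * np.tanh(x)   # stand-in for a full transformer block

skip_w = np.full(half, 0.5)       # one learned scalar per skip (assumption)

x = rng.standard_normal((8, d_model))
saved = []
for i in range(half):             # first half: save activations
    x = layer(x)
    saved.append(x)
for i in range(half):             # second half: add mirrored skip
    x = layer(x) + skip_w[i] * saved[half - 1 - i]
print(x.shape)                    # (8, 64)
```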
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
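The submitted artifact stores weights as 6-bit integers and then zstd-compresses the result at level 22. A sketch of the quantization half only, assuming symmetric per-tensor scaling over the range [-31, 31] (the actual bit packing and the zstd step are omitted):

```python
import numpy as np

# Sketch of symmetric per-tensor int6 quantization (range [-31, 31] and
# per-tensor scaling are assumptions). The real artifact additionally packs
# the 6-bit codes and compresses them with zstd at level 22.
def quantize_int6(w):
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
print(int(q.min()), int(q.max()), err <= 0.5 * s + 1e-9)
```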
Evaluation
sliding window eval
parameters: {"stride":512,"context_length":4096}
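Sliding-window evaluation advances a 4096-byte context in steps of 512 and scores only the newly revealed bytes, so most bytes are predicted with long left context. A runnable sketch with a stand-in scoring function (the real evaluator would call the model here):

```python
import numpy as np

# Sketch of sliding-window BPB eval: slide the context by `stride` bytes and
# score only the new bytes each step. `score_fn` is a stand-in that returns
# the total negative log-likelihood (in nats) of the last n_new bytes.
def sliding_window_bpb(data, score_fn, context=4096, stride=512):
    total_nll, total_bytes, pos = 0.0, 0, 0
    while pos < len(data):
        start = max(0, pos + stride - context)
        window = data[start:pos + stride]       # at most `context` bytes
        n_new = min(stride, len(data) - pos)
        total_nll += score_fn(window, n_new)
        total_bytes += n_new
        pos += stride
    return total_nll / total_bytes / np.log(2)  # bits per byte

# Sanity check: a uniform model over 256 byte values scores exactly 8 BPB.
uniform = lambda window, n_new: n_new * np.log(256.0)
print(round(sliding_window_bpb(np.zeros(10000), uniform), 6))  # 8.0
```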
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
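A warmdown schedule holds the learning rate flat and then decays it over the final 3500 steps. A sketch, where the constant-then-linear shape and the example totals are assumptions (the PR only specifies `warmdown_steps`):

```python
# Sketch of a warmdown LR schedule: constant base LR, then linear decay to
# zero over the last `warmdown_steps` steps (linear shape is an assumption).
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 10000   # illustrative run length, not from the PR
print(lr_at(0, total, 0.02), lr_at(8250, total, 0.02), lr_at(total, total, 0.02))
# 0.02 0.01 0.0
```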
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":2500}
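The momentum parameters describe a warmup from 0.92 up to the final 0.99 over the first 2500 steps. A sketch, assuming a linear ramp (the interpolation shape is not stated in the PR):

```python
# Sketch of Muon momentum warmup: ramp momentum from 0.92 to 0.99 over the
# first 2500 steps (linear interpolation is an assumption).
def momentum_at(step, start=0.92, end=0.99, warmup_steps=2500):
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

print(momentum_at(0), momentum_at(1250), momentum_at(5000))
# 0.92 0.955 0.99
```

The returned value would be passed to the optimizer as its momentum coefficient each step.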
Weight Averaging
EMA
parameters: {"decay":0.997}
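EMA weight averaging keeps a shadow copy of the parameters, updated each training step with decay 0.997, and evaluates with the shadow copy rather than the raw weights. A minimal sketch:

```python
# Sketch of EMA weight averaging with decay 0.997: update a shadow copy of
# the weights each step; the shadow copy is what gets evaluated.
def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(1000):            # weights held fixed at 1.0 for illustration
    shadow = ema_update(shadow, [1.0])
print(round(shadow[0], 2))       # 0.95 — converging toward the fixed weights
```

At decay 0.997 the average has an effective horizon of roughly 1/(1 − 0.997) ≈ 333 steps.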
Regularization
gradient clipping
parameters: {"max_norm":0.3}
Other
other
Raw UTF-8 byte-level modeling: the model consumes bytes directly, with no tokenizer, BPE, or SentencePiece.
parameters: {"vocab_size":256}
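With no tokenizer, "encoding" is simply the raw UTF-8 byte stream and the vocabulary is the 256 possible byte values, so multi-byte characters are handled with no special casing:

```python
# Sketch of tokenizer-free input handling: the vocabulary is the 256 byte
# values, encoding is str -> UTF-8 bytes, decoding is bytes -> str.
text = "héllo"                      # multi-byte UTF-8 works with no extra logic
ids = list(text.encode("utf-8"))    # [104, 195, 169, 108, 108, 111]
assert all(0 <= i < 256 for i in ids)
print(bytes(ids).decode("utf-8"))   # round-trips to "héllo"
```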
Novel Contributions
- First tokenizer-free byte-level model to beat the sp1024 baseline in Parameter Golf
- Raw UTF-8 byte modeling with vocab size 256 and no tokenizer/BPE/SentencePiece
- Hashed byte-bigram embeddings to capture local byte-pair statistics
- SmearGate and U-Net style skip connections in a pure self-attention transformer
- LeakyReLU² activation in the MLP
- Sliding-window evaluation at stride 512 over 4096-byte contexts
- Int6 quantization combined with zstd-22 compression
- 4-seed significance test showing consistent improvement over baseline