PR #705
Byte-Level Tokenizer-Free Transformer: 1.2151 BPB (beats baseline 1.2244)
by seanward
val_bpb
1.2151
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.795055 MB
Training Techniques
Architecture
tied embeddings
Shares the byte embedding table with the output projection.
parameters: null
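Tying the embeddings means a single (vocab, d_model) table serves both as the input lookup and the output projection, which matters at this artifact size. A minimal sketch (dimensions here are illustrative, not from the PR):

```python
import numpy as np

# Hypothetical sketch of tied embeddings: one shared table is used both to
# embed input bytes and as the output projection (d_model is an assumption).
vocab, d_model = 256, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d_model)) * 0.02  # the single shared table

byte_ids = np.array([72, 105])   # input bytes for "Hi"
x = E[byte_ids]                  # embedding lookup: (2, d_model)
logits = x @ E.T                 # output projection reuses E: (2, vocab)
print(logits.shape)              # (2, 256)
```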
SmearGate
Adds SmearGate feature processing in the byte-level model.
parameters: null
BigramHash
Uses hashed byte-bigram embeddings to capture local byte-pair statistics.
parameters: {"buckets":4096,"dim":32}
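With only 256 byte embeddings, each position carries little context on its own; hashing the (previous, current) byte pair into a fixed number of buckets recovers cheap local pair statistics. A sketch under the stated `buckets=4096, dim=32` config (the hash function and how the feature is combined with the byte embedding are assumptions):

```python
import numpy as np

# Sketch of hashed byte-bigram embeddings. The multiplicative hash below is
# an assumption; the PR only specifies buckets=4096 and dim=32.
BUCKETS, DIM = 4096, 32
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)) * 0.02

def bigram_ids(byte_ids):
    # Hash each (prev, cur) byte pair into one of BUCKETS buckets.
    prev = np.concatenate(([0], byte_ids[:-1]))   # pad position 0
    return (prev * 257 + byte_ids) % BUCKETS

ids = np.frombuffer(b"hello", dtype=np.uint8).astype(np.int64)
feats = bigram_table[bigram_ids(ids)]   # (5, 32) per-position bigram features
print(feats.shape)                      # (5, 32)
```

These features would then be concatenated with or added to the per-byte embeddings before the transformer stack.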
MLP3x
Uses a 3x hidden-size MLP with LeakyReLU² activation.
parameters: {"hidden_multiplier":3,"hidden_dim":1536}
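The MLP widens to 3x the model width (hidden_dim 1536 implies d_model 512) with a squared LeakyReLU activation. A sketch, where reading "LeakyReLU²" as the elementwise square of LeakyReLU is an assumption:

```python
import numpy as np

# Sketch of the 3x MLP. Interpreting LeakyReLU-squared as
# (leaky_relu(x))**2 is an assumption, as is the 0.01 slope.
d_model, hidden = 512, 1536   # hidden = 3 * d_model

def leaky_relu_sq(x, slope=0.01):
    return np.where(x > 0, x, slope * x) ** 2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, hidden)) * 0.02   # up-projection
W2 = rng.standard_normal((hidden, d_model)) * 0.02   # down-projection

x = rng.standard_normal((4, d_model))
out = leaky_relu_sq(x @ W1) @ W2   # (4, d_model)
print(out.shape)
```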
U-Net style skip connections
Adds learned encoder-decoder skip connections across transformer layers.
parameters: null
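The skip pattern pairs each layer in the first half of the stack with a layer in the second half, so late layers can reuse early activations directly. A sketch of one plausible wiring (the layer pairing, scalar gating, and 6-layer depth are all assumptions; the PR only states the skips are learned):

```python
import numpy as np

# Sketch of U-Net style skips across transformer layers: save activations
# from the "encoder" half, add a learned-weighted copy in the "decoder" half.
n_layers, d_model = 6, 64
half = n_layers // 2
rng = np.random.default_rng(0)

def layer(x):
    return x + 0.1 * np.tanh(x)   # stand-in for a full transformer block

skip_w = np.full(half, 0.5)       # one learned scalar per skip (assumption)

x = rng.standard_normal((8, d_model))
saved = []
for i in range(half):             # first half: save activations
    x = layer(x)
    saved.append(x)
for i in range(half):             # second half: add mirrored skip
    x = layer(x) + skip_w[i] * saved[half - 1 - i]
print(x.shape)                    # (8, 64)
```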
Quantization
int6
bits: 6
scope: all
Compression
zstd
level: 22
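The submitted artifact stores weights as 6-bit integers and then zstd-compresses the result at level 22. A sketch of the quantization half only, assuming symmetric per-tensor scaling over the range [-31, 31] (the actual bit packing and the zstd step are omitted):

```python
import numpy as np

# Sketch of symmetric per-tensor int6 quantization (range [-31, 31] and
# per-tensor scaling are assumptions). The real artifact additionally packs
# the 6-bit codes and compresses them with zstd at level 22.
def quantize_int6(w):
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int6(w)
err = np.abs(dequantize(q, s) - w).max()
print(int(q.min()), int(q.max()), err <= 0.5 * s + 1e-9)
```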
Evaluation
sliding window eval
parameters: {"stride":512,"context_length":4096}
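Sliding-window evaluation advances a 4096-byte context in steps of 512 and scores only the newly revealed bytes, so most bytes are predicted with long left context. A runnable sketch with a stand-in scoring function (the real evaluator would call the model here):

```python
import numpy as np

# Sketch of sliding-window BPB eval: slide the context by `stride` bytes and
# score only the new bytes each step. `score_fn` is a stand-in that returns
# the total negative log-likelihood (in nats) of the last n_new bytes.
def sliding_window_bpb(data, score_fn, context=4096, stride=512):
    total_nll, total_bytes, pos = 0.0, 0, 0
    while pos < len(data):
        start = max(0, pos + stride - context)
        window = data[start:pos + stride]       # at most `context` bytes
        n_new = min(stride, len(data) - pos)
        total_nll += score_fn(window, n_new)
        total_bytes += n_new
        pos += stride
    return total_nll / total_bytes / np.log(2)  # bits per byte

# Sanity check: a uniform model over 256 byte values scores exactly 8 BPB.
uniform = lambda window, n_new: n_new * np.log(256.0)
print(round(sliding_window_bpb(np.zeros(10000), uniform), 6))  # 8.0
```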
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
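A warmdown schedule holds the learning rate flat and then decays it over the final 3500 steps. A sketch, where the constant-then-linear shape and the example totals are assumptions (the PR only specifies `warmdown_steps`):

```python
# Sketch of a warmdown LR schedule: constant base LR, then linear decay to
# zero over the last `warmdown_steps` steps (linear shape is an assumption).
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

total = 10000   # illustrative run length, not from the PR
print(lr_at(0, total, 0.02), lr_at(8250, total, 0.02), lr_at(total, total, 0.02))
# 0.02 0.01 0.0
```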
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":2500}
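The momentum parameters describe a warmup from 0.92 up to the final 0.99 over the first 2500 steps. A sketch, assuming a linear ramp (the interpolation shape is not stated in the PR):

```python
# Sketch of Muon momentum warmup: ramp momentum from 0.92 to 0.99 over the
# first 2500 steps (linear interpolation is an assumption).
def momentum_at(step, start=0.92, end=0.99, warmup_steps=2500):
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

print(momentum_at(0), momentum_at(1250), momentum_at(5000))
# 0.92 0.955 0.99
```

The returned value would be passed to the optimizer as its momentum coefficient each step.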
Weight Averaging
EMA
parameters: {"decay":0.997}
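EMA weight averaging keeps a shadow copy of the parameters, updated each training step with decay 0.997, and evaluates with the shadow copy rather than the raw weights. A minimal sketch:

```python
# Sketch of EMA weight averaging with decay 0.997: update a shadow copy of
# the weights each step; the shadow copy is what gets evaluated.
def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(1000):            # weights held fixed at 1.0 for illustration
    shadow = ema_update(shadow, [1.0])
print(round(shadow[0], 2))       # 0.95 — converging toward the fixed weights
```

At decay 0.997 the average has an effective horizon of roughly 1/(1 − 0.997) ≈ 333 steps.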
Regularization
gradient clipping
parameters: {"max_norm":0.3}
Other
other
Raw UTF-8 byte-level modeling: the model consumes bytes directly, with no tokenizer, BPE, or SentencePiece.
parameters: {"vocab_size":256}
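With no tokenizer, "encoding" is simply the raw UTF-8 byte stream and the vocabulary is the 256 possible byte values, so multi-byte characters are handled with no special casing:

```python
# Sketch of tokenizer-free input handling: the vocabulary is the 256 byte
# values, encoding is str -> UTF-8 bytes, decoding is bytes -> str.
text = "héllo"                      # multi-byte UTF-8 works with no extra logic
ids = list(text.encode("utf-8"))    # [104, 195, 169, 108, 108, 111]
assert all(0 <= i < 256 for i in ids)
print(bytes(ids).decode("utf-8"))   # round-trips to "héllo"
```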
Novel Contributions
- First tokenizer-free byte-level model to beat the sp1024 baseline in Parameter Golf
- Raw UTF-8 byte modeling with vocab size 256 and no tokenizer/BPE/SentencePiece
- Hashed byte-bigram embeddings to capture local byte-pair statistics
- SmearGate and U-Net style skip connections in a pure self-attention transformer
- LeakyReLU² activation in the MLP
- Sliding-window evaluation at stride 512 over 4096-byte contexts
- Int6 quantization combined with zstd-22 compression
- 4-seed significance test showing consistent improvement over baseline