PR #1044
openH-Net: First Learned Byte-Level Tokenization (README Wishlist) -- 1.90 BPB, 22M params
by greqone
val_bpb: 1.8989
Architecture: Hybrid
Optimizer: —
Artifact size: 15.4 MB
Training Techniques
Architecture: Hybrid
H-Net-style learned byte-level tokenization with a dynamic chunking gate, chunk/dechunk layers, and a transformer operating on compressed chunks; replaces Mamba-2 with a causal depthwise Conv1d encoder/decoder.
parameters: {"layers": 9, "d_model": 512, "heads": 8, "kv_heads": 4, "chunk_ratio": 0.25}
LeakyReLU
Uses a LeakyReLU activation (negative slope 0.5) in the MLP.
parameters: {"slope":0.5}
weight tying
The output head shares its weight matrix with the input embeddings.
parameters: null
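
In PyTorch, tying is a one-line aliasing of the two matrices; a sketch assuming the 256-entry vocabulary implied by byte-level modeling:

```python
import torch.nn as nn

vocab, d_model = 256, 512            # byte vocabulary assumed from the byte-level setup
emb = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab, bias=False)
head.weight = emb.weight             # one shared parameter serves both roles
```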
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
EMA
Uses EMA-based dechunking to expand chunk representations back to the full byte sequence.
parameters: null
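
The EMA dechunk can be read as a gated recurrence: each byte position blends the chunk vector it maps to with a running average, weighted by the boundary probability. A loop-form sketch assuming the H-Net-style update y_t = p_t * z_t + (1 - p_t) * y_{t-1}; the vectorized closed form appears under Other below:

```python
import torch

def ema_dechunk(z, seg, p):
    """z: (B, K, D) chunk vectors; seg: (B, T) long chunk index per byte;
    p: (B, T) boundary probability. Returns (B, T, D) byte-level features."""
    zt = torch.gather(z, 1, seg.unsqueeze(-1).expand(-1, -1, z.size(-1)))
    prev = torch.zeros(z.size(0), z.size(-1), device=z.device)
    out = []
    for t in range(zt.size(1)):
        prev = p[:, t:t + 1] * zt[:, t] + (1 - p[:, t:t + 1]) * prev
        out.append(prev)
    return torch.stack(out, dim=1)
```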
Quantization
int6 (bits: 6, scope: all)
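
Symmetric per-tensor rounding to 6 bits (codes in [-32, 31]) is the simplest reading of "int6, scope: all"; grouping and bit-packing details are not stated, so this is a sketch:

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor 6-bit quantization; codes fit in [-32, 31]."""
    scale = w.abs().max() / 31.0
    q = torch.clamp(torch.round(w / scale), -32, 31).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale
```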
Compression
zstd (level: 22)
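
Level 22 is zstd's maximum compression setting; a sketch using the `zstandard` Python package (the artifact filenames here are hypothetical):

```python
import zstandard

raw = open("openhnet_int6.bin", "rb").read()    # hypothetical quantized-weights path
packed = zstandard.ZstdCompressor(level=22).compress(raw)
open("openhnet_int6.bin.zst", "wb").write(packed)
```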
Regularization
weight decay
parameters: null
Other
Differentiable chunking gate using cosine similarity with straight-through estimation to learn byte-level boundaries dynamically.
parameters: {"target_ratio": 0.25}
Auxiliary chunk-ratio loss used to steer boundary density during training.
parameters: {"weight": 1}
Vectorized ChunkLayer and DeChunkLayer implemented with cumsum-based segment IDs, scatter operations, and broadcasted exponential decay.
parameters: null
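
A sketch of the cumsum/scatter pattern: cumsum over hard boundaries yields per-position segment IDs, scatter_add pools boundary vectors into a padded chunk tensor, and the dechunk EMA unrolls into a closed form with prefix products of (1 - p). Shapes and the exact formulation are assumptions; the closed form can be numerically delicate for long sequences:

```python
import torch

def chunk(x, b):
    """x: (B, T, D); b: (B, T) hard 0/1 boundaries with b[:, 0] == 1.
    Returns z: (B, K_max, D) padded chunk vectors and seg: (B, T) segment IDs."""
    B, T, D = x.shape
    seg = b.cumsum(dim=1).long() - 1                 # position t -> index of latest boundary
    k_max = int(b.sum(dim=1).max().item())
    idx = seg.unsqueeze(-1).expand(-1, -1, D)
    z = x.new_zeros(B, k_max, D)
    # non-boundary positions contribute zeros, so scatter_add keeps exactly
    # the one boundary vector per chunk
    z.scatter_add_(1, idx, x * b.unsqueeze(-1))
    return z, seg

def dechunk(z, seg, p, eps=1e-6):
    """Closed-form EMA y_t = p_t*z_t + (1-p_t)*y_{t-1} via broadcast decay products."""
    zt = torch.gather(z, 1, seg.unsqueeze(-1).expand(-1, -1, z.size(-1)))
    c = (1.0 - p).clamp(min=eps).cumprod(dim=1).unsqueeze(-1)   # prefix prod of (1 - p)
    return c * torch.cumsum(p.unsqueeze(-1) * zt / c, dim=1)
```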
Novel Contributions
- First tiny-scale H-Net implementation for Parameter Golf
- First learned byte-level tokenization submission using dynamic chunking
- Vectorized ChunkLayer/DeChunkLayer without Python batch loops
- Pure-PyTorch depthwise causal Conv1d replacement for Mamba-2 SSM layers
- Demonstrated end-to-end training of a chunking gate with auxiliary ratio loss
- Produced a sub-16 MB artifact at 15.4 MB