PR #1044 (open)

H-Net: First Learned Byte-Level Tokenization (README Wishlist) -- 1.90 BPB, 22M params

by greqone
val_bpb: 1.8989
Architecture: Hybrid
Artifact Size: 15.4MB

Training Techniques

Architecture: Hybrid
H-Net-style learned byte-level tokenization with a dynamic chunking gate, chunk/dechunk layers, and a transformer operating on the compressed chunks; the Mamba-2 layers are replaced with a causal depthwise Conv1d encoder/decoder.
parameters: {"layers":9,"d_model":512,"heads":8,"kv_heads":4,"chunk_ratio":0.25}
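The Conv1d replacement can be sketched as follows (a minimal illustration assuming left padding for causality; the class name and kernel size are illustrative, not taken from the submission):

```python
import torch
import torch.nn as nn


class CausalDepthwiseConv1d(nn.Module):
    """Depthwise Conv1d where position t only sees inputs <= t.

    A sketch of the Mamba-2 replacement described above; kernel_size
    is an assumed value."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.pad = kernel_size - 1
        # groups=d_model makes the convolution depthwise: one filter per channel.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        x = x.transpose(1, 2)                    # (B, D, T)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only on the left => causal
        return self.conv(x).transpose(1, 2)      # back to (B, T, D)
```

Left-only padding guarantees that changing a future byte cannot affect earlier outputs, which is what lets the layer stand in for a sequential SSM scan.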
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Ties the output head weights to the input embedding matrix.
parameters: null
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
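A minimal sketch of the KV-head sharing (function name and head dimension are illustrative): with 8 query heads and 4 KV heads, each K/V head is reused by 2 query heads, halving the KV projection and cache size.

```python
import torch


def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: each of the n_kv_heads K/V heads
    is shared by n_heads // n_kv_heads query heads."""
    B, T, _ = q.shape
    hd = q.shape[-1] // n_heads
    q = q.view(B, T, n_heads, hd).transpose(1, 2)     # (B, H, T, hd)
    k = k.view(B, T, n_kv_heads, hd).transpose(1, 2)  # (B, Hkv, T, hd)
    v = v.view(B, T, n_kv_heads, hd).transpose(1, 2)
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)               # share each KV head
    v = v.repeat_interleave(rep, dim=1)
    att = (q @ k.transpose(-2, -1)) / hd ** 0.5
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)  # causal mask
    att = att.masked_fill(mask, float('-inf')).softmax(-1)
    return (att @ v).transpose(1, 2).reshape(B, T, n_heads * hd)
```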
EMA
Uses EMA-based dechunking to expand chunk representations back to the full byte sequence.
parameters: null
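A sketch of the idea, assuming a precomputed byte-to-chunk index map and a scalar decay (both illustrative; the submission's exact EMA formulation may differ):

```python
import torch


def ema_dechunk(chunk_vecs, seg_id, decay=0.9):
    """EMA dechunk sketch: expand each chunk vector to its byte span,
    then smooth across time with an exponential moving average.

    seg_id[t] maps byte position t to its chunk index (assumed
    precomputed by the chunking gate)."""
    expanded = chunk_vecs[seg_id]       # (T, D): nearest-chunk expansion
    out = torch.empty_like(expanded)
    state = torch.zeros(expanded.shape[-1])
    for t in range(expanded.shape[0]):  # sequential EMA for clarity
        state = decay * state + (1 - decay) * expanded[t]
        out[t] = state
    return out
```

The sequential loop is for clarity only; the "broadcasted exponential decay" noted below under Other vectorizes this recurrence.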
Quantization
int6
bits: 6
scope: all
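A sketch of symmetric per-tensor int6 quantization (the symmetric [-31, 31] range and helper names are assumptions; packing 6-bit values into bytes is omitted for brevity):

```python
import numpy as np


def quantize_int6(w):
    """Symmetric int6 quantization sketch: map floats to [-31, 31]
    with a single per-tensor scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale


def dequantize_int6(q, scale):
    """Reconstruct floats from int6 codes and the stored scale."""
    return q.astype(np.float32) * scale
```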
Compression
zstd
level: 22
Regularization
weight decay
parameters: null
Other
other
Differentiable chunking gate using cosine similarity with straight-through estimation to learn byte-level boundaries dynamically.
parameters: {"target_ratio":0.25}
other
Auxiliary chunk ratio loss used to steer boundary density during training.
parameters: {"weight":1}
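One plausible form of this loss, assuming a squared penalty on the mean boundary probability (the submission's exact formulation may differ): with target_ratio 0.25, the gate is steered toward roughly one boundary every 4 bytes.

```python
import torch


def chunk_ratio_loss(boundary_probs, target_ratio=0.25):
    """Auxiliary ratio loss sketch: penalize deviation of the mean
    boundary probability from the target chunk ratio."""
    return (boundary_probs.mean() - target_ratio) ** 2
```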
other
Vectorized ChunkLayer and DeChunkLayer implemented with cumsum-based segment IDs, scatter operations, and broadcasted exponential decay.
parameters: null
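The cumsum/scatter pattern can be sketched for a single sequence (mean pooling and the function name are illustrative; the submission operates on batches): a cumulative sum over the boundary mask assigns every byte a chunk ID, and a scatter-add pools each chunk without Python loops.

```python
import torch


def chunk_mean_pool(x, boundaries):
    """Vectorized ChunkLayer sketch for one sequence.

    x: (T, D) byte representations; boundaries: (T,) 0/1 mask with
    boundaries[0] == 1. Returns (n_chunks, D) pooled chunks and the
    per-byte segment IDs."""
    seg_id = boundaries.long().cumsum(0) - 1  # (T,), 0-based chunk index per byte
    n_chunks = int(seg_id[-1]) + 1
    sums = torch.zeros(n_chunks, x.shape[-1])
    sums.index_add_(0, seg_id, x)             # scatter-add bytes into their chunks
    counts = torch.zeros(n_chunks).index_add_(0, seg_id, torch.ones(len(seg_id)))
    return sums / counts.unsqueeze(-1), seg_id
```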

Novel Contributions

  • First tiny-scale H-Net implementation for Parameter Golf
  • First learned byte-level tokenization submission using dynamic chunking
  • Vectorized ChunkLayer/DeChunkLayer without Python batch loops
  • Pure-PyTorch depthwise causal Conv1d replacement for Mamba-2 SSM layers
  • Demonstrated end-to-end training of a chunking gate with auxiliary ratio loss
  • Produced a sub-16 MB artifact (15.4 MB)