PR #1104
open
[Non-record] 1-Stage Byte-level H-Net at 17.5M: Dynamic Chunking Learns Word Boundaries (39x-91x fewer params than the H-Net paper)
by DariusFeher
val_bpb
1.3595
Architecture
Transformer
Optimizer
—
Artifact Size
15.78 MB
Training Techniques
Architecture
KV head count
Uses a different key-value (KV) head count for each H-Net variant: 4 for the byte-level (byte260) model and 2 for the subword-level (sp1024) model.
parameters: {"byte260":4,"sp1024":2}
U-Net skip connections
Adds a residual skip from the encoder output to the dechunked representation before the decoder.
parameters: null
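The skip amounts to a plain residual add at byte resolution, so fine-grained byte information bypasses the chunking bottleneck. A minimal sketch (shapes and tensor names are illustrative, not taken from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16                               # byte positions, model width (illustrative)
encoder_out = rng.standard_normal((T, D))  # byte-level encoder states
dechunked = rng.standard_normal((T, D))    # chunk states upsampled back to length T

# U-Net-style residual skip: the decoder consumes the sum, so byte-level
# detail reaches the decoder even when the chunk path compresses it away.
decoder_input = dechunked + encoder_out
```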
EMA
Applies EMA-based smoothing in the DeChunkLayer, driven by the routing (boundary) probabilities, when expanding chunk states back to the input resolution.
parameters: null
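In the H-Net formulation this EMA interpolates between the current upsampled chunk vector and the previous smoothed state, weighted by the routing probability p_t. A reference-style sketch under those assumptions (the PR's DeChunkLayer may vectorize this differently):

```python
import numpy as np

def ema_dechunk(z, p):
    """EMA smoothing over upsampled chunk states.

    z: (T, D) chunk vectors already broadcast to byte positions
    p: (T,) routing probabilities in [0, 1]

    At each position t: state = p[t] * z[t] + (1 - p[t]) * state,
    so high-confidence boundaries overwrite the state and low-confidence
    positions carry the previous chunk's representation forward.
    """
    out = np.zeros_like(z)
    state = np.zeros(z.shape[1])
    for t in range(z.shape[0]):
        state = p[t] * z[t] + (1.0 - p[t]) * state
        out[t] = state
    return out
```

With p = 1 everywhere the layer is an identity; with p = 0 after the first position it holds the first state, which matches the intended "carry the chunk forward" behavior.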
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
ratio loss
parameters: {"weight":0.05}
ratio loss
parameters: {"weight":0.03}
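The ratio loss is a load-balancing-style auxiliary loss, in the spirit of the H-Net paper, that steers the router toward a target compression ratio; the listed weights (0.05 and 0.03) scale it against the language-modeling loss. A sketch of one common form (the exact formulation in this PR may differ):

```python
def ratio_loss(F, G, N):
    """Load-balancing-style ratio loss (H-Net-paper-style form; assumed here).

    F: realized fraction of positions selected as chunk boundaries
    G: mean predicted boundary probability
    N: target compression ratio (e.g. bytes per chunk)

    Minimized (value 1.0) when F = G = 1/N, i.e. one boundary
    every N positions on average; selecting every position (F = G = 1)
    is penalized.
    """
    return N / (N - 1) * ((N - 1) * F * G + (1 - F) * (1 - G))
```

For example, with target ratio N = 4, hitting the target (F = G = 0.25) gives the minimum value 1.0, while boundary-everywhere routing (F = G = 1) gives a larger penalty.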
Compression
zlib
level: null
Novel Contributions
- Matched byte-level (byte260) vs subword-level (sp1024) H-Net ablation study
- Demonstrates that dynamic chunking on raw bytes learns whitespace-aligned, word-like boundaries
- Reports quantitative boundary metrics such as whitespace agreement and chunk-size coefficient of variation
- Provides qualitative boundary visualizations for trained and untrained models
- Implements a working multi-GPU DDP training path for padded chunk sequences
- Shows that a 4-hour extended byte-level run improves BPB beyond the best 10-minute subword run
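The two reported boundary metrics are straightforward to compute from predicted chunk starts. A hypothetical reimplementation, assuming chunk starts are given as sorted character indices beginning at 0 (the PR's exact conventions may differ):

```python
import statistics

def boundary_metrics(text, starts):
    """Whitespace agreement and chunk-size coefficient of variation.

    text:   the raw input string
    starts: sorted indices where each predicted chunk begins (assumed convention)
    """
    spaces = {i for i, ch in enumerate(text) if ch.isspace()}
    # whitespace agreement: fraction of chunk starts that directly follow a space
    agreement = sum(1 for s in starts if s - 1 in spaces) / len(starts)
    # chunk-size CV: population std of chunk lengths divided by their mean;
    # lower values mean more uniform chunk sizes
    sizes = [e - s for s, e in zip(starts, starts[1:] + [len(text)])]
    cv = statistics.pstdev(sizes) / statistics.mean(sizes)
    return agreement, cv
```

For "ab cd ef" with starts [0, 3, 6] (chunks "ab ", "cd ", "ef"), two of the three starts directly follow whitespace, so agreement is 2/3.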