PR #1104

open

[Non-record] 1-Stage Byte-level H-Net at 17.5M: Dynamic Chunking Learns Word Boundaries (39x-91x fewer params than H-Net paper)

by DariusFeher · View on GitHub
val_bpb: 1.3595
Architecture: Transformer
Optimizer:
Artifact Size: 15.78 MB

Training Techniques

Architecture
KV head count
Uses different KV head counts for byte-level and subword-level H-Net variants.
parameters: {"byte260":4,"sp1024":2}
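A minimal sketch of what the per-variant KV head counts above imply for grouped-query attention projection shapes. The names (`VARIANT_KV_HEADS`, `gqa_shapes`) and default dimensions are illustrative assumptions, not taken from the PR's code:

```python
# Hypothetical sketch: per-variant KV head counts from the metadata above.
VARIANT_KV_HEADS = {"byte260": 4, "sp1024": 2}

def gqa_shapes(variant: str, d_model: int = 256, n_heads: int = 8) -> dict:
    """Return Q/KV projection shapes for grouped-query attention,
    where each K/V head serves a group of query heads."""
    n_kv = VARIANT_KV_HEADS[variant]
    assert n_heads % n_kv == 0, "query heads must split evenly into KV groups"
    head_dim = d_model // n_heads
    return {
        "q_proj": (d_model, n_heads * head_dim),    # full set of query heads
        "kv_proj": (d_model, 2 * n_kv * head_dim),  # shared K and V heads
        "groups": n_heads // n_kv,                  # queries per K/V head
    }
```

With 8 query heads (an assumed default), the byte-level variant shares each K/V head across 2 query heads, the subword variant across 4, trading a small quality cost for a smaller KV cache.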
U-Net skip connections
Adds a residual skip from encoder output to the dechunked representation before the decoder.
parameters: null
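The skip described above can be sketched in a few lines; toy per-position scalars stand in for feature vectors, and the function name is illustrative rather than from the repo:

```python
def add_unet_skip(encoder_out, dechunked):
    """Residual skip: add the byte-resolution encoder output to the
    dechunked (upsampled) representation before the decoder sees it,
    so fine-grained byte detail bypasses the chunking bottleneck."""
    assert len(encoder_out) == len(dechunked)  # both at byte resolution
    return [e + d for e, d in zip(encoder_out, dechunked)]
```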
EMA
Uses EMA-based smoothing in the DeChunkLayer driven by routing probabilities.
parameters: null
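A scalar sketch of the EMA-style dechunk smoothing, under the assumption that the routing probability at each byte position gates between the upsampled chunk state and the running average (the actual DeChunkLayer operates on vectors and batches):

```python
def dechunk_ema(upsampled, probs):
    """EMA smoothing driven by routing probabilities:
        out[t] = p[t] * z[t] + (1 - p[t]) * out[t-1]
    Positions with low routing probability mostly carry forward the
    previous smoothed state instead of their own upsampled value."""
    out, prev = [], 0.0
    for z, p in zip(upsampled, probs):
        prev = p * z + (1.0 - p) * prev
        out.append(prev)
    return out
```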
Sequence Length
train_length: null
eval_length: null
Regularization
ratio loss
parameters: {"weight":0.05}
ratio loss
parameters: {"weight":0.03}
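A sketch of a ratio loss of the load-balancing form used by the H-Net paper, with the `weight` values listed above; the exact formulation in this PR may differ, and the function signature here is an assumption:

```python
def ratio_loss(boundary_probs, boundary_mask, target_ratio=4.0, weight=0.05):
    """Pushes both the realized boundary fraction F and the mean boundary
    probability G toward 1/target_ratio, so chunks average ~target_ratio
    bytes. Minimized (value `weight`) when F == G == 1/target_ratio."""
    n = len(boundary_probs)
    F = sum(boundary_mask) / n   # fraction of positions selected as boundaries
    G = sum(boundary_probs) / n  # mean predicted boundary probability
    N = target_ratio
    loss = N / (N - 1.0) * ((N - 1.0) * F * G + (1.0 - F) * (1.0 - G))
    return weight * loss
```

The second entry with `weight: 0.03` would correspond to the same term with a smaller coefficient.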
Compression
zlib
level: null

Novel Contributions

  • Matched byte-level (byte260) vs subword-level (sp1024) H-Net ablation study
  • Demonstrates that dynamic chunking on raw bytes learns whitespace-aligned, word-like boundaries
  • Reports quantitative boundary metrics such as whitespace agreement and chunk-size coefficient of variation
  • Provides qualitative boundary visualizations for trained and untrained models
  • Implements a working multi-GPU DDP training path for padded chunk sequences
  • Shows a 4-hour extended byte-level run improving BPB beyond the best 10-minute subword run
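The boundary metrics named in the third bullet can be sketched as follows; the exact definitions used by the PR may differ, and this assumes boundaries are reported as byte indices where a new chunk starts:

```python
def boundary_metrics(text, boundaries):
    """Compute (whitespace agreement, chunk-size coefficient of variation)
    for a set of predicted chunk-start indices (excluding position 0)."""
    # Whitespace agreement: fraction of boundaries landing right after a space.
    after_space = {i + 1 for i, c in enumerate(text) if c == " "}
    agreement = len(boundaries & after_space) / max(len(boundaries), 1)
    # Chunk-size coefficient of variation: std / mean of chunk lengths.
    cuts = sorted(boundaries | {0, len(text)})
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    mean = sum(sizes) / len(sizes)
    std = (sum((s - mean) ** 2 for s in sizes) / len(sizes)) ** 0.5
    return agreement, std / mean
```

High whitespace agreement with moderate size variation is what "word-like boundaries" means quantitatively: chunks start where words do, and chunk lengths track word lengths.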