PR #1104
open
[Non-record] 1-Stage Byte-level H-Net at 17.5M: Dynamic Chunking Learns Word Boundaries (39x-91x fewer params than the H-Net paper)
by DariusFeher
val_bpb
1.3595
Architecture
Transformer
Optimizer
—
Artifact Size
15.78 MB
Training Techniques
Architecture
KV head count
Uses a different key-value (KV) head count for each H-Net variant: 4 for the byte-level (byte260) model and 2 for the subword-level (sp1024) model.
parameters: {"byte260":4,"sp1024":2}
U-Net skip connections
Adds a residual skip from the encoder output to the dechunked representation before the decoder.
parameters: null
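The skip amounts to a plain residual add at byte resolution, so fine-grained byte information bypasses the chunking bottleneck. A minimal sketch (shapes and tensor names are illustrative, not taken from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16                               # byte positions, model width (illustrative)
encoder_out = rng.standard_normal((T, D))  # byte-level encoder states
dechunked = rng.standard_normal((T, D))    # chunk states upsampled back to length T

# U-Net-style residual skip: the decoder consumes the sum, so byte-level
# detail reaches the decoder even when the chunk path compresses it away.
decoder_input = dechunked + encoder_out
```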
EMA
Applies EMA-based smoothing in the DeChunkLayer, driven by the routing (boundary) probabilities, when expanding chunk states back to the input resolution.
parameters: null
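In the H-Net formulation this EMA interpolates between the current upsampled chunk vector and the previous smoothed state, weighted by the routing probability p_t. A reference-style sketch under those assumptions (the PR's DeChunkLayer may vectorize this differently):

```python
import numpy as np

def ema_dechunk(z, p):
    """EMA smoothing over upsampled chunk states.

    z: (T, D) chunk vectors already broadcast to byte positions
    p: (T,) routing probabilities in [0, 1]

    At each position t: state = p[t] * z[t] + (1 - p[t]) * state,
    so high-confidence boundaries overwrite the state and low-confidence
    positions carry the previous chunk's representation forward.
    """
    out = np.zeros_like(z)
    state = np.zeros(z.shape[1])
    for t in range(z.shape[0]):
        state = p[t] * z[t] + (1.0 - p[t]) * state
        out[t] = state
    return out
```

With p = 1 everywhere the layer is an identity; with p = 0 after the first position it holds the first state, which matches the intended "carry the chunk forward" behavior.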
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
ratio loss
parameters: {"weight":0.05}
ratio loss
parameters: {"weight":0.03}
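The ratio loss is a load-balancing-style auxiliary loss, in the spirit of the H-Net paper, that steers the router toward a target compression ratio; the listed weights (0.05 and 0.03) scale it against the language-modeling loss. A sketch of one common form (the exact formulation in this PR may differ):

```python
def ratio_loss(F, G, N):
    """Load-balancing-style ratio loss (H-Net-paper-style form; assumed here).

    F: realized fraction of positions selected as chunk boundaries
    G: mean predicted boundary probability
    N: target compression ratio (e.g. bytes per chunk)

    Minimized (value 1.0) when F = G = 1/N, i.e. one boundary
    every N positions on average; selecting every position (F = G = 1)
    is penalized.
    """
    return N / (N - 1) * ((N - 1) * F * G + (1 - F) * (1 - G))
```

For example, with target ratio N = 4, hitting the target (F = G = 0.25) gives the minimum value 1.0, while boundary-everywhere routing (F = G = 1) gives a larger penalty.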
Compression
zlib
level: null
Novel Contributions
- Matched byte-level (byte260) vs subword-level (sp1024) H-Net ablation study
- Demonstrates that dynamic chunking on raw bytes learns whitespace-aligned, word-like boundaries
- Reports quantitative boundary metrics such as whitespace agreement and chunk-size coefficient of variation
- Provides qualitative boundary visualizations for trained and untrained models
- Implements a working multi-GPU DDP training path for padded chunk sequences
- Shows that a 4-hour extended byte-level run improves BPB beyond the best 10-minute subword run
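The two reported boundary metrics are straightforward to compute from predicted chunk starts. A hypothetical reimplementation, assuming chunk starts are given as sorted character indices beginning at 0 (the PR's exact conventions may differ):

```python
import statistics

def boundary_metrics(text, starts):
    """Whitespace agreement and chunk-size coefficient of variation.

    text:   the raw input string
    starts: sorted indices where each predicted chunk begins (assumed convention)
    """
    spaces = {i for i, ch in enumerate(text) if ch.isspace()}
    # whitespace agreement: fraction of chunk starts that directly follow a space
    agreement = sum(1 for s in starts if s - 1 in spaces) / len(starts)
    # chunk-size CV: population std of chunk lengths divided by their mean;
    # lower values mean more uniform chunk sizes
    sizes = [e - s for s, e in zip(starts, starts[1:] + [len(text)])]
    cv = statistics.pstdev(sizes) / statistics.mean(sizes)
    return agreement, cv
```

For "ab cd ef" with starts [0, 3, 6] (chunks "ab ", "cd ", "ef"), two of the three starts directly follow whitespace, so agreement is 2/3.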