PR #1842 (open)

[Non-record] H-Net dynamic chunking — Path B (attention-only, byte-level)

by Azura-Whiston

val_bpb: 1.8838
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 9.49 MB

Training Techniques

Architecture
weight tying
Tied embedding and output head weights.
parameters: null
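
A minimal sketch of the tie, assuming a standard embedding-plus-linear-head layout (module names and `d_model` are illustrative; only the 260-entry byte vocabulary comes from this PR):

```python
import torch.nn as nn

class ByteLM(nn.Module):
    # Hypothetical layout: only vocab_size=260 is taken from the PR.
    def __init__(self, vocab_size=260, d_model=384):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: head and embedding share one matrix, so the
        # vocab_size x d_model block is stored (and counted) once.
        self.lm_head.weight = self.tok_emb.weight
```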
attention
Replaced Mamba-2 layers with pre-norm Transformer blocks (Path B, attention-only).
parameters: {"layers":12}
DynamicChunking
Single-stage dynamic chunking module using cosine similarity routing and chunk compression.
parameters: {"target_ratio":0.16667}
EMA
Chunk-level EMA dechunking/upsampling over the compressed sequence.
parameters: null
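
A sketch of the smoothing-then-upsample step; the PR states its version matches the reference repo's math, so treat the recurrence below as illustrative rather than verified:

```python
import torch

def ema_dechunk(chunks, p_chunk, idx):
    # chunks:  (C, D) compressed chunk vectors
    # p_chunk: (C,)   boundary probabilities at each chunk start
    # idx:     (T,)   chunk id for each byte position (from compress())
    smoothed = torch.empty_like(chunks)
    smoothed[0] = chunks[0]
    for c in range(1, chunks.shape[0]):
        # EMA across chunks: a confident boundary keeps the new chunk,
        # an uncertain one blends in the previous chunk's state.
        smoothed[c] = p_chunk[c] * chunks[c] + (1 - p_chunk[c]) * smoothed[c - 1]
    return smoothed[idx]  # upsample: repeat each chunk vector over its bytes
```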
Initialization
identity init
Identity initialization for W_q and W_k so the router starts as raw cosine similarity of adjacent encoder states.
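
Concretely, this amounts to starting both projections at the identity (sketch; the width is a placeholder):

```python
import torch
import torch.nn as nn

d = 384  # illustrative router width
w_q = nn.Parameter(torch.eye(d))
w_k = nn.Parameter(torch.eye(d))
# With W_q = W_k = I, q_t = k_t = h_t, so cos(q_{t-1}, k_t) in the router
# reduces to the raw cosine similarity of adjacent encoder states; the
# projections only diverge from identity as training proceeds.
```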
Quantization
int8
bits: 8
scope: all
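
A sketch of a symmetric int8 scheme; the card gives only bits=8 and scope=all, so the scale granularity (per-tensor vs per-channel) is an assumption:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: one floating-point scale per tensor.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale
```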
Compression
zlib
level: null
lzma
level: null
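
Both codecs are listed with level: null, i.e. library defaults; a sketch of how the serialized int8 state dict would be sized under each:

```python
import lzma
import zlib

def artifact_sizes(blob: bytes) -> dict:
    # Default compression levels, matching the card's level: null.
    return {
        "raw": len(blob),
        "zlib": len(zlib.compress(blob)),
        "lzma": len(lzma.compress(blob)),
    }
```

Shipping whichever stream is smaller is the natural choice for staying under the 16 MB cap; the 9.49 MB artifact size above presumably refers to the compressed stream.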
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
linear warmup
parameters: {"warmup_steps":500}
cosine decay
parameters: null
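
A sketch of the combined schedule (only warmup_steps=500 is from the PR; max_steps, base_lr, and min_lr are placeholders):

```python
import math

def lr_at(step, max_steps, base_lr, warmup_steps=500, min_lr=0.0):
    # Linear warmup for the first 500 steps, then cosine decay to min_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```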
Regularization
weight decay
parameters: {"weight_decay":0.1}

Novel Contributions

  • Attention-only Path B implementation of H-Net dynamic chunking on parameter-golf
  • Byte-level submission (260-token vocabulary) replacing the prior BPE tokenizer
  • Single-stage dynamic chunking with target ratio 1/6
  • Chunk-level EMA dechunking implementation matching the reference repo's math
  • End-to-end-verified int8-compressed artifact under the 16 MB cap
  • Quantization ablation across int8/int6/int4/int3/int2/int1 with zlib and lzma