PR #1842
H-Net dynamic chunking — Path B (attention-only, byte-level)
Status: open · Label: Non-record
by Azura-Whiston
val_bpb: 1.8838
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 9.49 MB
Training Techniques
Architecture
- weight tying: Tied embedding and output head weights. parameters: null
- attention: Replaced Mamba-2 layers with pre-norm Transformer blocks (Path B, attention-only); see the backbone sketch after this list. parameters: {"layers":12}
- DynamicChunking: Single-stage dynamic chunking module using cosine-similarity routing and chunk compression; see the chunking sketch after this list. parameters: {"target_ratio":0.16667}
- EMA: Chunk-level EMA dechunking/upsampling over the compressed sequence. parameters: null
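A minimal PyTorch sketch of the Path B backbone described above. Only the 12-layer depth and the 260-entry byte vocabulary come from this card; d_model=256, n_heads=8, and the class name PreNormBlock are placeholder choices for illustration, not the PR's actual code.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm -> attention -> residual,
    then LayerNorm -> MLP -> residual."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # A causal attn_mask would be supplied during training.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

# Byte-level vocabulary (260 entries per the card) with tied weights:
# the output head reuses the embedding matrix as its projection.
vocab_size, d_model, n_layers = 260, 256, 12   # d_model/n_heads are placeholders
embed = nn.Embedding(vocab_size, d_model)
blocks = nn.ModuleList(PreNormBlock(d_model, n_heads=8) for _ in range(n_layers))
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = embed.weight                     # weight tying
```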
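The dynamic chunking and EMA dechunking items can likewise be sketched. The names DynamicChunker and ema_dechunk are hypothetical, and the top-k boundary selection at the 1/6 target ratio plus gather-based chunk compression are one plausible reading of the description, not necessarily the PR's exact math.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicChunker(nn.Module):
    """Single-stage dynamic chunking (sketch). The routing score for each
    position is the cosine similarity between projected adjacent encoder
    states; dissimilar neighbours are treated as chunk boundaries, and each
    chunk is compressed to the encoder state at its boundary."""

    def __init__(self, d_model: int, target_ratio: float = 1 / 6):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.target_ratio = target_ratio

    def forward(self, h):
        # h: (batch, seq_len, d_model) encoder states over raw bytes
        cos = F.cosine_similarity(self.w_q(h[:, 1:]), self.w_k(h[:, :-1]), dim=-1)
        p = 0.5 * (1.0 - cos)                                  # boundary probability
        p = torch.cat([torch.ones_like(p[:, :1]), p], dim=1)   # position 0 is a boundary
        num_chunks = max(1, round(h.size(1) * self.target_ratio))
        # Keep the most boundary-like positions, in sequence order.
        idx = p.topk(num_chunks, dim=1).indices.sort(dim=1).values  # (B, C)
        # Chunk compression: gather the encoder state at every boundary.
        compressed = torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return compressed, idx, p

def ema_dechunk(chunk_out, idx, p, seq_len):
    """Chunk-level EMA dechunking (sketch): broadcast each chunk's output to
    the byte positions it covers, then smooth along the sequence with an EMA
    whose coefficient is the boundary probability at each position."""
    B, C, D = chunk_out.shape
    pos = torch.arange(seq_len, device=idx.device).view(1, seq_len, 1)
    owner = (pos >= idx.unsqueeze(1)).sum(-1) - 1          # owning chunk per byte
    expanded = torch.gather(
        chunk_out, 1, owner.clamp(min=0).unsqueeze(-1).expand(-1, -1, D)
    )
    out = torch.empty_like(expanded)
    state = expanded[:, 0]
    out[:, 0] = state
    for t in range(1, seq_len):                            # sequential EMA for clarity
        a = p[:, t].unsqueeze(-1)
        state = a * expanded[:, t] + (1.0 - a) * state
        out[:, t] = state
    return out
```

In the forward pass, the compressed sequence would run through the 12 attention blocks before being upsampled back to byte resolution by ema_dechunk; the EMA loop is written sequentially here, whereas a production implementation would typically use a parallel formulation.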
Initialization
- identity init: Identity initialization for W_q and W_k so the router starts as the raw cosine similarity of adjacent encoder states.
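A sketch of that initialization, assuming the DynamicChunker layout from the snippet above: with W_q = W_k = I, the routing score reduces to the raw cosine similarity of adjacent encoder states.

```python
import torch.nn as nn

def identity_init_router(chunker):
    """Start the router's projections as identity maps (hypothetical helper)."""
    nn.init.eye_(chunker.w_q.weight)
    nn.init.eye_(chunker.w_k.weight)
```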
Quantization
- int8: bits: 8, scope: all
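A hedged sketch of what an int8 pass over all weights could look like; symmetric per-tensor scaling and the helper names are assumptions, and the PR's actual scheme (e.g. per-channel scales) may differ.

```python
import torch

def quantize_int8(state_dict):
    """Symmetric per-tensor int8 quantization of every floating-point tensor
    (scope: all). Non-float buffers are stored unchanged."""
    packed = {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):
            packed[name] = {"q": w, "scale": None}
            continue
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
        packed[name] = {"q": q, "scale": scale}
    return packed

def dequantize_int8(packed):
    """Recover float weights from the packed int8 representation."""
    return {
        name: d["q"] if d["scale"] is None else d["q"].float() * d["scale"]
        for name, d in packed.items()
    }
```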
Compression
- zlib: level: null
- lzma: level: null
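A small sketch of how the quantized artifact's compressed size could be checked against the 16 MB cap; the helper name and torch.save serialization are assumptions, and both compressors run at their library defaults since the card lists level: null.

```python
import io
import lzma
import zlib

import torch

def compressed_sizes(packed_state):
    """Serialize the quantized artifact and report raw, zlib, and lzma sizes in MB."""
    buf = io.BytesIO()
    torch.save(packed_state, buf)
    raw = buf.getvalue()
    return {
        "raw_mb": len(raw) / 2**20,
        "zlib_mb": len(zlib.compress(raw)) / 2**20,
        "lzma_mb": len(lzma.compress(raw)) / 2**20,
    }
```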
Sequence Length
- sequence_length: train_length 2048, eval_length 2048
LR Schedule
- linear warmup: parameters: {"warmup_steps":500}
- cosine decay: parameters: null
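A sketch of the warmup-plus-cosine schedule wired to AdamW. Only warmup_steps=500, the optimizer choice, and weight_decay=0.1 come from this card; total_steps, the base learning rate, and the decay-to-zero floor are placeholders.

```python
import math

import torch

def warmup_cosine(step, warmup_steps=500, total_steps=10_000):
    """LR multiplier: linear warmup over warmup_steps, then cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```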
Regularization
- weight decay: parameters: {"weight_decay":0.1}
Novel Contributions
- Attention-only Path B implementation of H-Net dynamic chunking on parameter-golf
- Byte-level (260 vocab) submission replacing the prior BPE tokenizer
- Single-stage dynamic chunking with target ratio 1/6
- Chunk-level EMA dechunking implementation matching the reference repo's math
- End-to-end verified int8-compressed artifact under the 16 MB cap
- Quantization ablation across int8/int6/int4/int3/int2/int1 with zlib and lzma