PR #1842
H-Net dynamic chunking — Path B (attention-only, byte-level)
Status: open · Label: Non-record
by Azura-Whiston
val_bpb: 1.8838
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 9.49 MB
Training Techniques
Architecture
- weight tying: Tied embedding and output head weights. parameters: null
- attention: Replaced Mamba-2 layers with pre-norm Transformer blocks (Path B, attention-only); see the backbone sketch after this list. parameters: {"layers":12}
- DynamicChunking: Single-stage dynamic chunking module using cosine-similarity routing and chunk compression; see the chunking sketch after this list. parameters: {"target_ratio":0.16667}
- EMA: Chunk-level EMA dechunking/upsampling over the compressed sequence. parameters: null
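A minimal PyTorch sketch of the Path B backbone described above. Only the 12-layer depth and the 260-entry byte vocabulary come from this card; d_model=256, n_heads=8, and the class name PreNormBlock are placeholder choices for illustration, not the PR's actual code.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm -> attention -> residual,
    then LayerNorm -> MLP -> residual."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # A causal attn_mask would be supplied during training.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

# Byte-level vocabulary (260 entries per the card) with tied weights:
# the output head reuses the embedding matrix as its projection.
vocab_size, d_model, n_layers = 260, 256, 12   # d_model/n_heads are placeholders
embed = nn.Embedding(vocab_size, d_model)
blocks = nn.ModuleList(PreNormBlock(d_model, n_heads=8) for _ in range(n_layers))
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = embed.weight                     # weight tying
```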
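The dynamic chunking and EMA dechunking items can likewise be sketched. The names DynamicChunker and ema_dechunk are hypothetical, and the top-k boundary selection at the 1/6 target ratio plus gather-based chunk compression are one plausible reading of the description, not necessarily the PR's exact math.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicChunker(nn.Module):
    """Single-stage dynamic chunking (sketch). The routing score for each
    position is the cosine similarity between projected adjacent encoder
    states; dissimilar neighbours are treated as chunk boundaries, and each
    chunk is compressed to the encoder state at its boundary."""

    def __init__(self, d_model: int, target_ratio: float = 1 / 6):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.target_ratio = target_ratio

    def forward(self, h):
        # h: (batch, seq_len, d_model) encoder states over raw bytes
        cos = F.cosine_similarity(self.w_q(h[:, 1:]), self.w_k(h[:, :-1]), dim=-1)
        p = 0.5 * (1.0 - cos)                                  # boundary probability
        p = torch.cat([torch.ones_like(p[:, :1]), p], dim=1)   # position 0 is a boundary
        num_chunks = max(1, round(h.size(1) * self.target_ratio))
        # Keep the most boundary-like positions, in sequence order.
        idx = p.topk(num_chunks, dim=1).indices.sort(dim=1).values  # (B, C)
        # Chunk compression: gather the encoder state at every boundary.
        compressed = torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return compressed, idx, p

def ema_dechunk(chunk_out, idx, p, seq_len):
    """Chunk-level EMA dechunking (sketch): broadcast each chunk's output to
    the byte positions it covers, then smooth along the sequence with an EMA
    whose coefficient is the boundary probability at each position."""
    B, C, D = chunk_out.shape
    pos = torch.arange(seq_len, device=idx.device).view(1, seq_len, 1)
    owner = (pos >= idx.unsqueeze(1)).sum(-1) - 1          # owning chunk per byte
    expanded = torch.gather(
        chunk_out, 1, owner.clamp(min=0).unsqueeze(-1).expand(-1, -1, D)
    )
    out = torch.empty_like(expanded)
    state = expanded[:, 0]
    out[:, 0] = state
    for t in range(1, seq_len):                            # sequential EMA for clarity
        a = p[:, t].unsqueeze(-1)
        state = a * expanded[:, t] + (1.0 - a) * state
        out[:, t] = state
    return out
```

In the forward pass, the compressed sequence would run through the 12 attention blocks before being upsampled back to byte resolution by ema_dechunk; the EMA loop is written sequentially here, whereas a production implementation would typically use a parallel formulation.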
Initialization
- identity init: Identity initialization for W_q and W_k so the router starts as the raw cosine similarity of adjacent encoder states.
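A sketch of that initialization, assuming the DynamicChunker layout from the snippet above: with W_q = W_k = I, the routing score reduces to the raw cosine similarity of adjacent encoder states.

```python
import torch.nn as nn

def identity_init_router(chunker):
    """Start the router's projections as identity maps (hypothetical helper)."""
    nn.init.eye_(chunker.w_q.weight)
    nn.init.eye_(chunker.w_k.weight)
```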
Quantization
- int8: bits: 8, scope: all
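A hedged sketch of what an int8 pass over all weights could look like; symmetric per-tensor scaling and the helper names are assumptions, and the PR's actual scheme (e.g. per-channel scales) may differ.

```python
import torch

def quantize_int8(state_dict):
    """Symmetric per-tensor int8 quantization of every floating-point tensor
    (scope: all). Non-float buffers are stored unchanged."""
    packed = {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):
            packed[name] = {"q": w, "scale": None}
            continue
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
        packed[name] = {"q": q, "scale": scale}
    return packed

def dequantize_int8(packed):
    """Recover float weights from the packed int8 representation."""
    return {
        name: d["q"] if d["scale"] is None else d["q"].float() * d["scale"]
        for name, d in packed.items()
    }
```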
Compression
- zlib: level: null
- lzma: level: null
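A small sketch of how the quantized artifact's compressed size could be checked against the 16 MB cap; the helper name and torch.save serialization are assumptions, and both compressors run at their library defaults since the card lists level: null.

```python
import io
import lzma
import zlib

import torch

def compressed_sizes(packed_state):
    """Serialize the quantized artifact and report raw, zlib, and lzma sizes in MB."""
    buf = io.BytesIO()
    torch.save(packed_state, buf)
    raw = buf.getvalue()
    return {
        "raw_mb": len(raw) / 2**20,
        "zlib_mb": len(zlib.compress(raw)) / 2**20,
        "lzma_mb": len(lzma.compress(raw)) / 2**20,
    }
```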
Sequence Length
- sequence_length: train_length 2048, eval_length 2048
LR Schedule
- linear warmup: parameters: {"warmup_steps":500}
- cosine decay: parameters: null
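A sketch of the warmup-plus-cosine schedule wired to AdamW. Only warmup_steps=500, the optimizer choice, and weight_decay=0.1 come from this card; total_steps, the base learning rate, and the decay-to-zero floor are placeholders.

```python
import math

import torch

def warmup_cosine(step, warmup_steps=500, total_steps=10_000):
    """LR multiplier: linear warmup over warmup_steps, then cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
```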
Regularization
- weight decay: parameters: {"weight_decay":0.1}
Novel Contributions
- Attention-only Path B implementation of H-Net dynamic chunking on parameter-golf
- Byte-level (260 vocab) submission replacing the prior BPE tokenizer
- Single-stage dynamic chunking with target ratio 1/6
- Chunk-level EMA dechunking implementation matching the reference repo's math
- End-to-end verified int8-compressed artifact under the 16 MB cap
- Quantization ablation across int8/int6/int4/int3/int2/int1 with zlib and lzma