PR #992

open

[Non-Record] H-Net with Dynamic Sequence Chunking

val_bpb

1.4054

Architecture

Hybrid

Optimizer

Muon

Artifact Size

11.9 MB

Training Techniques

Architecture

Hybrid

H-Net hierarchical architecture with five stages: encoder -> dynamic sequence chunker -> inner model over the compressed sequence -> upsample -> decoder.

parameters: null
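A minimal sketch of the chunk -> upsample round trip described above. It assumes a boundary score of 0.5 * (1 - cosine similarity) between adjacent encoder states and a fixed threshold. The PR's actual router is learned and may differ. The names `boundary_probs`, `chunk`, `upsample`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids divide-by-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def boundary_probs(hidden):
    """Score each position as a chunk start; dissimilar neighbours score high.

    Assumption: raw cosine similarity stands in for the learned router.
    """
    probs = [1.0]  # the first position always starts a chunk
    for prev, cur in zip(hidden, hidden[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs


def chunk(hidden, threshold=0.5):
    """Keep only the states at chunk starts, giving a shorter sequence."""
    probs = boundary_probs(hidden)
    starts = [i for i, p in enumerate(probs) if p >= threshold]
    return [hidden[i] for i in starts], starts


def upsample(compressed, starts, length):
    """Repeat each chunk vector over the positions its chunk covers."""
    out = []
    chunk_idx = -1
    for t in range(length):
        if chunk_idx + 1 < len(starts) and starts[chunk_idx + 1] == t:
            chunk_idx += 1
        out.append(compressed[chunk_idx])
    return out
```

In this sketch the inner model would run on `compressed`, which is shorter than `hidden` whenever adjacent states are similar.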

GQA

Uses grouped query attention in the transformer stack.

parameters: null
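As an illustration of grouped query attention, the sketch below lets several query heads share one key/value head. The head counts and the helper names `kv_head_for` and `gqa` are assumptions. The PR does not state its head configuration.

```python
import math


def kv_head_for(query_head, num_q_heads, num_kv_heads):
    """Map a query head to the key/value head shared by its group."""
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return query_head // group_size


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa(q, k, v):
    """Causal grouped query attention.

    q: [num_q_heads][T][d]; k and v: [num_kv_heads][T][d].
    """
    num_q_heads, num_kv_heads = len(q), len(k)
    d = len(q[0][0])
    out = []
    for h in range(num_q_heads):
        g = kv_head_for(h, num_q_heads, num_kv_heads)
        head_out = []
        for t, q_t in enumerate(q[h]):
            # Attend only to positions 0..t (causal mask).
            scores = [
                sum(a * b for a, b in zip(q_t, k[g][s])) / math.sqrt(d)
                for s in range(t + 1)
            ]
            w = softmax(scores)
            head_out.append(
                [sum(w[s] * v[g][s][j] for s in range(t + 1)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

Sharing key/value heads reduces the number of key/value projection parameters, which is relevant under an artifact size limit.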

RoPE

Uses rotary positional embeddings.

parameters: null
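A small sketch of rotary positional embeddings applied to a single vector. The interleaved pairing and `base=10000.0` are common conventions assumed here, not values taken from the PR.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by a position-dependent angle."""
    d = len(x)
    assert d % 2 == 0
    out = list(x)
    for i in range(0, d, 2):
        # Pair i // 2 rotates at frequency base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because each pair is rotated, the dot product between a rotated query and a rotated key depends only on the difference of their positions.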

ReLU²

Uses squared ReLU activation in the transformer stack.

parameters: null
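For reference, squared ReLU is max(x, 0)^2. The sketch below applies it inside a two-matrix feed-forward block. The bias-free layout and the names `relu2_mlp`, `w_in`, and `w_out` are assumptions, not details from the PR.

```python
def relu_squared(x):
    """Squared ReLU: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu2_mlp(x, w_in, w_out):
    """Feed-forward block: project up, apply ReLU squared, project down."""
    hidden = [relu_squared(h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)
```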

weight tying

Input and output embeddings are tied. This is implied by the competition baseline stack rather than stated explicitly.

parameters: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Quantization

int8

bits: 8

scope: all

Compression

zlib

level: null
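The quantization and compression entries above can be sketched as one pipeline: round weights to int8, then compress the bytes with zlib. Symmetric per-tensor scaling and the byte layout are assumptions. The PR states only int8 over all weights and zlib with no level given, so the default level is used here. The function names are hypothetical.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric int8 quantization with a single scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]


def pack(q, scale):
    """Serialize the scale (float64) plus int8 values, then zlib-compress."""
    payload = struct.pack("<d", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(payload)  # default compression level


def unpack(blob):
    """Inverse of pack: decompress and split into int8 values and scale."""
    payload = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<d", payload, 0)
    q = list(struct.unpack_from(f"<{len(payload) - 8}b", payload, 8))
    return q, scale
```

Each weight is recovered to within half a quantization step, and the packed blob is what would count toward the artifact size in this sketch.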

Sequence Length

sequence_length

train_length: 512

eval_length: null

First H-Net architecture submission in Parameter Golf
Dynamic sequence chunking with learned content-dependent boundaries
Hierarchical compression into a shorter latent chunk sequence for modeling
Upsampling back to full resolution for autoregressive prediction
Empirical finding that layer layout and depth allocation matter more than width
Demonstration that more aggressive compression can improve results on the stronger layout