PR #992

open

[Non-Record] H-Net with Dynamic Sequence Chunking

by TimS-ml
val_bpb: 1.4054
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 11.9 MB

Training Techniques

Architecture
Hybrid
H-Net hierarchical architecture: encoder -> dynamic sequence chunker -> inner model over the compressed sequence -> upsample -> decoder.
parameters: null
GQA
Uses grouped query attention in the transformer stack.
parameters: null
RoPE
Uses rotary positional embeddings.
parameters: null
ReLU²
Uses squared ReLU activation in the transformer stack.
parameters: null
weight tying
Tied input/output embeddings (weight tying), implied by the competition baseline stack.
parameters: null
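
The three transformer-stack components listed above (GQA, RoPE, ReLU²) can be illustrated with a minimal pure-Python sketch. It is not taken from the PR: the function names, the head counts in the usage note, the RoPE base of 10000, and the interleaved pairing of dimensions are all assumptions chosen for illustration, since the entry lists `parameters: null` for each technique.

```python
import math


def relu_squared(x: float) -> float:
    """ReLU squared: clamp negatives to zero, then square."""
    return max(x, 0.0) ** 2


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """GQA head sharing: consecutive query heads map to one key/value head."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def rope_rotate(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """RoPE: rotate each (even, odd) pair by an angle that grows with position.

    Assumes interleaved pairing; some implementations pair the two halves
    of the vector instead.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out: list[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

With a hypothetical 8 query heads and 2 key/value heads, `kv_head_for_query_head(5, 8, 2)` returns `1`, meaning query heads 4 to 7 share the second key/value head. Because RoPE is a pure rotation, `rope_rotate` preserves the vector norm at every position.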
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
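
As a rough illustration of how int8 quantization followed by zlib compression shrinks an artifact, the sketch below quantizes a flat list of weights and compresses the packed bytes. It assumes symmetric per-tensor scaling and a simple "float32 scale + int8 payload" byte layout, neither of which is confirmed by the entry; all function names are hypothetical. The zlib level is left at the library default because the entry records `level: null`.

```python
import struct
import zlib


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map the largest magnitude to 127."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]


def pack_and_compress(q, scale, level=-1):
    """Serialize the scale as float32 plus the int8 payload, then zlib-compress."""
    raw = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, level)


def decompress_and_unpack(blob):
    """Inverse of pack_and_compress: returns (int8 values, scale)."""
    raw = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<f", raw, 0)
    q = list(struct.unpack_from(f"<{len(raw) - 4}b", raw, 4))
    return q, scale
```

A round trip through `pack_and_compress` and `decompress_and_unpack` returns the same int8 values; the only loss is the rounding in `quantize_int8`, which is at most half a quantization step per weight, plus the float32 rounding of the stored scale.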
Sequence Length
sequence_length
train_length: 512
eval_length: null

Novel Contributions

  • First H-Net architecture submission in Parameter Golf
  • Dynamic sequence chunking with learned content-dependent boundaries
  • Hierarchical compression into a shorter latent chunk sequence for modeling
  • Upsampling back to full resolution for autoregressive prediction
  • Empirical finding that layout/depth allocation matters more than width
  • Demonstration that more aggressive compression can improve results on the stronger layout
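
The chunking and upsampling steps in the list above can be sketched as follows. This is a toy version under stated assumptions, not the PR's implementation: it scores a boundary at position t as 0.5 * (1 - cos(h_t, h_{t-1})) on raw hidden vectors, whereas a learned chunker would apply trained projections before comparing; the 0.5 threshold, the repeat-style upsampling, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def boundary_probs(hidden):
    """Boundary score per position; position 0 always starts a chunk."""
    probs = [1.0]
    for prev, cur in zip(hidden, hidden[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def chunk(hidden, threshold=0.5):
    """Keep only the states at chunk starts, giving a shorter sequence."""
    probs = boundary_probs(hidden)
    starts = [t for t, p in enumerate(probs) if p >= threshold]
    return [hidden[t] for t in starts], starts


def upsample(chunk_states, starts, length):
    """Expand back to full length by repeating the latest chunk state.

    Each position only sees a chunk that started at or before it,
    so the expansion stays causal.
    """
    out = []
    chunk_idx = -1
    for t in range(length):
        if chunk_idx + 1 < len(starts) and starts[chunk_idx + 1] == t:
            chunk_idx += 1
        out.append(chunk_states[chunk_idx])
    return out
```

For example, five hidden states that change direction at positions 2 and 4 compress to three chunk states, and `upsample` restores a length-5 sequence in which positions 0 to 1, 2 to 3, and 4 each reuse their chunk's state. Raising `threshold` produces fewer, longer chunks, which corresponds to more aggressive compression.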