val_bpb: 1.4054
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 11.9 MB
Training Techniques
Architecture (Hybrid)
H-Net hierarchical architecture with encoder -> dynamic sequence chunker -> inner compressed sequence -> upsample -> decoder.
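A minimal, purely illustrative sketch of this data flow. The byte-class boundary rule and mean-pooling below are hypothetical stand-ins for the learned chunker and the encoder/decoder networks, and the inner model is an identity here:

```python
def chunk_boundaries(tokens):
    """Toy content-dependent rule: start a new chunk when the byte class flips."""
    bounds = [0]
    for i in range(1, len(tokens)):
        if (tokens[i] < 128) != (tokens[i - 1] < 128):
            bounds.append(i)
    return bounds

def hnet_forward(tokens):
    bounds = chunk_boundaries(tokens)
    spans = list(zip(bounds, bounds[1:] + [len(tokens)]))
    # "Encoder" + downsample: pool each chunk to a single scalar (mean).
    inner = [sum(tokens[a:b]) / (b - a) for a, b in spans]
    # The inner model would transform `inner` here; identity in this sketch.
    # Upsample: broadcast each chunk representation back to its positions.
    out = []
    for (a, b), v in zip(spans, inner):
        out.extend([v] * (b - a))
    return inner, out

inner, out = hnet_forward([65, 66, 200, 201, 67])
```

The point of the structure is that the expensive inner model runs on the short `inner` sequence, while the output is still produced at full byte resolution.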
GQA
Uses grouped-query attention in the transformer stack.
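GQA's defining feature is that several query heads share one key/value head, shrinking the KV cache. A sketch of the head-sharing pattern (the head counts here are made up; the submission's are not recorded):

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Map a query head to the KV head it shares in grouped-query attention."""
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# With 8 query heads sharing 2 KV heads, heads 0-3 read KV head 0 and
# heads 4-7 read KV head 1, a 4x reduction in KV cache size.
mapping = [kv_head_for(h, 8, 2) for h in range(8)]
```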
RoPE
Uses rotary positional embeddings.
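RoPE rotates each (even, odd) pair of vector components by a position-dependent angle, so that the dot product between a rotated query and key depends only on their relative offset. A minimal sketch (base 10000 is the conventional default, not a recorded setting):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-scaled angles."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and shifting query and key positions by the same amount leaves their dot product unchanged.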
ReLU²
Uses squared-ReLU activation in the transformer stack.
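Squared ReLU is a one-line change from standard ReLU:

```python
def relu_squared(x):
    """Squared ReLU: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2
```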
Weight tying
Input and output embeddings are tied, as implied by the competition baseline stack.
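A toy sketch of the idea (all values made up): one matrix serves both as the token embedding and, transposed, as the output projection, so the parameters are counted once:

```python
# vocab=3, dim=2; the weights are arbitrary illustrations.
EMB = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

def embed(token_id):
    return EMB[token_id]

def logits(hidden):
    # Output projection reuses EMB transposed: logit[v] = hidden . EMB[v].
    return [sum(h * w for h, w in zip(hidden, row)) for row in EMB]
```

For a parameter-count competition this matters: tying removes an entire vocab-by-dim matrix from the artifact.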
Optimizer
Muon (weight decay, momentum, and other hyperparameters not recorded).
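The core of Muon is orthogonalizing each weight matrix's momentum-averaged gradient with a Newton-Schulz iteration before applying it. The sketch below shows only that step, using the quintic coefficients from the public Muon reference implementation; momentum, learning rate, and weight decay are omitted, and the submission's actual settings are not recorded:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (drive its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / (norm + 1e-7) for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, [list(col) for col in zip(*X)])  # X @ X^T
        # B = b*A + c*A@A, then X = a*X + B@X  (quintic polynomial in X)
        B = [[b * v + c * w for v, w in zip(r1, r2)] for r1, r2 in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(r1, r2)] for r1, r2 in zip(X, matmul(B, X))]
    return X

# A diagonal matrix with singular values 2 and 1 is pushed toward orthogonality.
X = newton_schulz([[2.0, 0.0], [0.0, 1.0]])
```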
Quantization
int8 (8 bits, scope: all).
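A common way to do this is symmetric per-tensor int8 quantization, sketched below; the submission's exact quantization recipe is not recorded:

```python
def quantize_int8(weights):
    """Map floats to [-127, 127] ints with one shared scale per tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize_int8(q, scale)
```

Each weight then costs one byte instead of four, at the price of a rounding error bounded by the scale.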
Compression
zlib (level not recorded).
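The quantized artifact is then zlib-compressed. A sketch of the round trip; the payload and `level=9` here are assumptions, since the actual compression level was not recorded:

```python
import zlib

# Hypothetical stand-in for the serialized int8 weight bytes.
payload = bytes(range(256)) * 64

compressed = zlib.compress(payload, level=9)
restored = zlib.decompress(compressed)
```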
Sequence Length
Training length 512; eval length not recorded.
Novel Contributions
- First H-Net architecture submission in Parameter Golf
- Dynamic sequence chunking with learned content-dependent boundaries
- Hierarchical compression into a shorter latent chunk sequence for modeling
- Upsampling back to full resolution for autoregressive prediction
- Empirical finding that layout/depth allocation matters more than width
- Demonstration that more aggressive sequence compression (a shorter inner chunk sequence) can improve results on the stronger layout