PR #1305

open

[Non-record] Scaled Byte-level H-Net matches 4-hour subword-level baseline (H-Net val_bpb = 1.2070)

by DariusFeher
val_bpb: 1.2070
Architecture: Hybrid
Optimizer
Artifact Size: 15.8 MB

Training Techniques

Architecture
  • H-Net: scaled byte-level hierarchical network with encoder, main transformer, and decoder stages
    parameters: {"layers":12,"encoder_layers":3,"main_layers":6,"decoder_layers":3,"model_dim":512,"heads":8,"kv_heads":4,"chunk_target_size":6,"vocab_size":260}
  • GQA: grouped query attention with 4 KV heads
    parameters: {"kv_heads":4}
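As context for the GQA entry above, here is a minimal numpy sketch of grouped query attention, where 8 query heads share 4 K/V heads so that each pair of query heads reads the same keys and values. The shapes mirror the listed config (model_dim=512, heads=8, kv_heads=4); masking and the output projection are omitted, and this is an illustration, not the record's actual code.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads share n_kv_heads K/V heads.

    Illustrative sketch only (no causal mask, no output projection).
    x: (T, D); wq: (D, D); wk, wv: (D, D * n_kv_heads // n_heads).
    """
    T, D = x.shape
    hd = D // n_heads                 # per-head dim (64 for D=512, 8 heads)
    group = n_heads // n_kv_heads     # query heads per shared KV head (2 here)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group               # map each query head to its shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, D)
```

The payoff is the KV cache: with 4 KV heads instead of 8, the cached K/V tensors are half the size at the same query width.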
Quantization
  • GPTQ: bits: 6, scope: all
  • QAT: bits: null, scope: all
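To make the 6-bit setting concrete, here is a sketch of symmetric per-row fake quantization to a 6-bit grid, the rounding that both QAT and GPTQ ultimately target. This shows only the round-and-clip step; GPTQ additionally compensates remaining weights using second-order information, which is omitted here, and the per-row scaling is an assumption rather than the record's stated scheme.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-row fake quantization: round weights to a 6-bit grid.

    Returns (dequantized weights, integer codes). With 6 bits the codes
    span [-32, 31]; the per-row scale maps the largest-magnitude weight
    to +/-31 so the grid covers the full range.
    """
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)
```

Packed at 6 bits per weight (plus scales), the quantized model is what makes the 15.8 MB artifact size plausible for a ~12-layer, 512-dim network.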
Compression
  • zstd: level 22
Evaluation
  • sliding window eval
    parameters: {"stride":64}
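A stride-64 sliding window evaluation scores the text in overlapping windows so that each token is predicted with nearly a full window of left context, instead of the truncated context it would get at the start of a disjoint block. The sketch below computes the window spans and how many trailing tokens each window scores; window=2048 matches the listed train_length, and the exact windowing used in the run is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, n_scored) spans for sliding-window evaluation.

    Windows advance by `stride`; each window scores only the tokens not
    covered by the previous window (its last `stride` tokens, except the
    first window, which scores everything). Every token is scored exactly
    once, with at least window - stride tokens of left context after the
    first window.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The tradeoff is cost: a small stride like 64 means roughly window/stride = 32 forward passes per window's worth of text, in exchange for a tighter val_bpb estimate.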
Sequence Length
  • sequence_length: train_length: 2048, eval_length: null
Regularization
  • magnitude pruning
    parameters: {"prune_pct":0.03}
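Magnitude pruning with prune_pct=0.03 zeroes the 3% of weights with the smallest absolute value. The record does not say whether the threshold was global or per-layer; the sketch below uses one plausible reading, a global threshold over a single weight array.

```python
import numpy as np

def magnitude_prune(w, prune_pct=0.03):
    """Zero out the prune_pct smallest-magnitude entries of w (global threshold).

    Uses np.partition to find the k-th smallest |w| without a full sort,
    then zeroes everything at or below that threshold.
    """
    k = int(w.size * prune_pct)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Zeroed weights also compress much better, which dovetails with the zstd level-22 artifact compression listed above.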
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":25000}
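A "warmdown" schedule typically holds the learning rate constant and then decays it linearly over the final steps. The record only specifies warmdown_steps=25000, so the constant phase and the final LR of 0 in this sketch are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=25000):
    """Constant LR, then linear decay to 0 over the final warmdown_steps.

    One common reading of a warmdown schedule; only warmdown_steps=25000
    comes from the record, the rest is an assumed shape.
    """
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * max(frac, 0.0)
```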
Other
  • torch.compile-compatible fixed chunking with a boundary cap per batch to enable compilation
    parameters: {"chunk_divisor":4}
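H-Net's dynamic chunking produces a variable number of chunk boundaries per batch, which defeats torch.compile's static-shape tracing. One way to restore static shapes (assumed here; the record only lists {"chunk_divisor": 4}) is to always select a fixed boundary budget of seq_len // chunk_divisor positions, taking the top-scoring candidates so every batch yields tensors of identical shape:

```python
import numpy as np

def capped_boundaries(scores, chunk_divisor=4):
    """Select a fixed number of chunk boundaries so tensor shapes stay static.

    scores: per-position boundary scores of shape (seq_len,). Instead of
    thresholding (variable count), always keep the top seq_len // chunk_divisor
    positions, returned in sequence order. The fixed count is what makes the
    downstream chunked computation torch.compile-friendly; the exact capping
    rule used in the run is an assumption.
    """
    seq_len = scores.shape[-1]
    cap = seq_len // chunk_divisor                  # fixed boundary budget
    idx = np.sort(np.argsort(-scores)[:cap])        # top-cap scores, in order
    return idx
```

With chunk_divisor=4, a 2048-token batch always yields exactly 512 boundaries, regardless of how many positions the boundary predictor would have fired on; the "fixed-cap chunk truncation" analysis in the contributions list presumably studies what this cap costs relative to unconstrained dynamic chunking.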

Novel Contributions

  • Scaled the byte-level H-Net from 9 to 12 layers
  • Matched the 4-hour subword-level baseline with byte-level modeling
  • Used INT6 GPTQ plus QAT to fit within the artifact budget
  • Introduced torch.compile-compatible fixed chunking for faster training
  • Applied sliding window evaluation with stride 64
  • Analyzed fixed-cap chunk truncation and dynamic chunking at inference
  • Compared byte260 and sp1024 H-Net variants