PR #1305

open

[Non-record] Scaled Byte-level H-Net matches 4-hour subword-level baseline (H-Net val_bpb = 1.2070)

by DariusFeher
val_bpb: 1.2070
Architecture: Hybrid
Optimizer
Artifact Size: 15.8 MB

Training Techniques

Architecture
  • H-Net: scaled byte-level hierarchical network with encoder, main transformer, and decoder stages
    parameters: {"layers":12,"encoder_layers":3,"main_layers":6,"decoder_layers":3,"model_dim":512,"heads":8,"kv_heads":4,"chunk_target_size":6,"vocab_size":260}
  • GQA: grouped query attention with 4 KV heads
    parameters: {"kv_heads":4}
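As context for the GQA entry above, here is a minimal numpy sketch of grouped query attention, where 8 query heads share 4 K/V heads so that each pair of query heads reads the same keys and values. The shapes mirror the listed config (model_dim=512, heads=8, kv_heads=4); masking and the output projection are omitted, and this is an illustration, not the record's actual code.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads share n_kv_heads K/V heads.

    Illustrative sketch only (no causal mask, no output projection).
    x: (T, D); wq: (D, D); wk, wv: (D, D * n_kv_heads // n_heads).
    """
    T, D = x.shape
    hd = D // n_heads                 # per-head dim (64 for D=512, 8 heads)
    group = n_heads // n_kv_heads     # query heads per shared KV head (2 here)
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group               # map each query head to its shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, D)
```

The payoff is the KV cache: with 4 KV heads instead of 8, the cached K/V tensors are half the size at the same query width.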
Quantization
  • GPTQ: bits: 6, scope: all
  • QAT: bits: null, scope: all
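To make the 6-bit setting concrete, here is a sketch of symmetric per-row fake quantization to a 6-bit grid, the rounding that both QAT and GPTQ ultimately target. This shows only the round-and-clip step; GPTQ additionally compensates remaining weights using second-order information, which is omitted here, and the per-row scaling is an assumption rather than the record's stated scheme.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-row fake quantization: round weights to a 6-bit grid.

    Returns (dequantized weights, integer codes). With 6 bits the codes
    span [-32, 31]; the per-row scale maps the largest-magnitude weight
    to +/-31 so the grid covers the full range.
    """
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)
```

Packed at 6 bits per weight (plus scales), the quantized model is what makes the 15.8 MB artifact size plausible for a ~12-layer, 512-dim network.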
Compression
  • zstd: level 22
Evaluation
  • sliding window eval
    parameters: {"stride":64}
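A stride-64 sliding window evaluation scores the text in overlapping windows so that each token is predicted with nearly a full window of left context, instead of the truncated context it would get at the start of a disjoint block. The sketch below computes the window spans and how many trailing tokens each window scores; window=2048 matches the listed train_length, and the exact windowing used in the run is an assumption.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Return (start, end, n_scored) spans for sliding-window evaluation.

    Windows advance by `stride`; each window scores only the tokens not
    covered by the previous window (its last `stride` tokens, except the
    first window, which scores everything). Every token is scored exactly
    once, with at least window - stride tokens of left context after the
    first window.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The tradeoff is cost: a small stride like 64 means roughly window/stride = 32 forward passes per window's worth of text, in exchange for a tighter val_bpb estimate.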
Sequence Length
  • sequence_length: train_length: 2048, eval_length: null
Regularization
  • magnitude pruning
    parameters: {"prune_pct":0.03}
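Magnitude pruning with prune_pct=0.03 zeroes the 3% of weights with the smallest absolute value. The record does not say whether the threshold was global or per-layer; the sketch below uses one plausible reading, a global threshold over a single weight array.

```python
import numpy as np

def magnitude_prune(w, prune_pct=0.03):
    """Zero out the prune_pct smallest-magnitude entries of w (global threshold).

    Uses np.partition to find the k-th smallest |w| without a full sort,
    then zeroes everything at or below that threshold.
    """
    k = int(w.size * prune_pct)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Zeroed weights also compress much better, which dovetails with the zstd level-22 artifact compression listed above.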
LR Schedule
  • warmdown
    parameters: {"warmdown_steps":25000}
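A "warmdown" schedule typically holds the learning rate constant and then decays it linearly over the final steps. The record only specifies warmdown_steps=25000, so the constant phase and the final LR of 0 in this sketch are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=25000):
    """Constant LR, then linear decay to 0 over the final warmdown_steps.

    One common reading of a warmdown schedule; only warmdown_steps=25000
    comes from the record, the rest is an assumed shape.
    """
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * max(frac, 0.0)
```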
Other
  • torch.compile-compatible fixed chunking with a boundary cap per batch to enable compilation
    parameters: {"chunk_divisor":4}
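H-Net's dynamic chunking produces a variable number of chunk boundaries per batch, which defeats torch.compile's static-shape tracing. One way to restore static shapes (assumed here; the record only lists {"chunk_divisor": 4}) is to always select a fixed boundary budget of seq_len // chunk_divisor positions, taking the top-scoring candidates so every batch yields tensors of identical shape:

```python
import numpy as np

def capped_boundaries(scores, chunk_divisor=4):
    """Select a fixed number of chunk boundaries so tensor shapes stay static.

    scores: per-position boundary scores of shape (seq_len,). Instead of
    thresholding (variable count), always keep the top seq_len // chunk_divisor
    positions, returned in sequence order. The fixed count is what makes the
    downstream chunked computation torch.compile-friendly; the exact capping
    rule used in the run is an assumption.
    """
    seq_len = scores.shape[-1]
    cap = seq_len // chunk_divisor                  # fixed boundary budget
    idx = np.sort(np.argsort(-scores)[:cap])        # top-cap scores, in order
    return idx
```

With chunk_divisor=4, a 2048-token batch always yields exactly 512 boundaries, regardless of how many positions the boundary predictor would have fired on; the "fixed-cap chunk truncation" analysis in the contributions list presumably studies what this cap costs relative to unconstrained dynamic chunking.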

Novel Contributions

  • Scaled the byte-level H-Net from 9 to 12 layers
  • Matched the 4-hour subword-level baseline with byte-level modeling
  • Used INT6 GPTQ plus QAT to fit within the artifact budget
  • Introduced torch.compile-compatible fixed chunking for faster training
  • Applied sliding window evaluation with stride 64
  • Analyzed fixed-cap chunk truncation and dynamic chunking at inference
  • Compared byte260 and sp1024 H-Net variants