PR #852

open

Hymba-11L: SOTA High-Density Takeover (1.1189 BPB)

by Prush69
val_bpb
1.1189
Architecture
Hymba-11L hybrid architecture combining Selective Scan (Mamba) and Rotary Attention
Optimizer
Muon
Artifact Size
14.5 MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"asynchronous_reduce_scatter":true,"asynchronous_all_gather":true,"orthogonalization":"Newton-Schulz 5","communication_computation_overlap":true}
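The "Newton-Schulz 5" orthogonalization named in `other_params` can be sketched as the quintic Newton-Schulz iteration used by public Muon implementations. The coefficients below come from that public code, not from this PR, and the test shapes are illustrative:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration (Muon-style). After a few steps the singular values of
    the output cluster near 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from public Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                         # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic polynomial in X
    return X.T if transposed else X
```

The asynchronous reduce_scatter/all_gather flags refer to overlapping this per-matrix computation with gradient communication; that scheduling is distributed-runtime logic not shown here.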
Architecture
Selective Scan (Mamba)
Hybrid architecture component used alongside rotary attention for sequence modeling.
parameters: null
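The selective-scan component admits a short sequential reference. The single-channel simplification and shapes below are illustrative; real Mamba implementations fuse this loop into a hardware-aware parallel kernel:

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Sequential reference for the selective-scan (Mamba) recurrence,
    one channel, state size N. Assumed shapes: x, delta: (T,),
    A: (N,), B, C: (T, N)."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        dA = np.exp(delta[t] * A)            # input-dependent decay
        h = dA * h + delta[t] * B[t] * x[t]  # selective state update
        y[t] = C[t] @ h                      # readout
    return y
```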
RoPE
Rotary attention used as part of the hybrid architecture.
parameters: null
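Rotary attention applies a position-dependent rotation to each feature pair. A minimal sketch of standard RoPE follows; the base of 10000 and the half-split pairing are common defaults, not confirmed by the PR:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (T, D), D even.
    Each (x1_i, x2_i) pair is rotated by a position- and
    frequency-dependent angle, so per-row norms are preserved."""
    T, D = x.shape
    half = D // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(T), inv_freq)  # (T, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```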
BigramHash
Hashed-bigram hybrid embedding for vocabulary efficiency and embedding dimensionality reduction.
parameters: null
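One plausible reading of BigramHash is a unigram table augmented with a bigram embedding looked up by hashing each (previous, current) token pair into a small fixed-size table. Everything below (table sizes, the additive combination, the hash function) is an assumption, not the PR's exact scheme:

```python
import numpy as np

def bigram_hash_embed(tokens, uni_table, bi_table):
    """Sketch of a hashed-bigram embedding: each position gets its
    unigram embedding plus a bigram embedding addressed by hashing
    (prev, cur) into a table far smaller than vocab**2."""
    n_bins = bi_table.shape[0]
    out = uni_table[tokens]  # fancy indexing returns a copy
    for i in range(1, len(tokens)):
        h = hash((int(tokens[i - 1]), int(tokens[i]))) % n_bins
        out[i] += bi_table[h]
    return out
```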
Quantization
QAT
bits: 4
scope: all
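QAT at 4 bits over all weights typically uses fake quantization in the forward pass with straight-through gradients. A symmetric per-tensor sketch follows; the scaling choice is a common default, not stated in the PR:

```python
import numpy as np

def fake_quant4(w):
    """Straight-through 4-bit fake quantization for QAT: round to 16
    int4 levels in the forward pass; in training, gradients would pass
    through the rounding unchanged. Symmetric per-tensor scale."""
    scale = np.abs(w).max() / 7.0 + 1e-12   # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale                         # dequantized view of the weights
```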
Test-Time Training
full TTT
parameters: {"epochs":3}
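Full TTT with `epochs: 3` means continuing to optimize the language-model loss on the evaluation text itself before scoring. A minimal sketch, with a hypothetical `grad_fn` closure and plain SGD standing in for whatever optimizer the PR actually uses:

```python
import numpy as np

def ttt_adapt(params, grad_fn, eval_tokens, lr=1e-4, epochs=3):
    """Sketch of full test-time training: run a few epochs of the usual
    LM objective over the evaluation tokens, then score with the
    adapted parameters. grad_fn(params, tokens) is a hypothetical
    closure returning one gradient per parameter."""
    for _ in range(epochs):
        grads = grad_fn(params, eval_tokens)
        for p, g in zip(params, grads):
            p -= lr * g  # in-place SGD step; real PR optimizer not shown
    return params
```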
Compression
zstd
level: 22
Other
other
3D parameter banking with sharded slices stored in larger tensors to reduce kernel launch overhead and facilitate bulk sharding.
parameters: null
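The banking idea, keeping per-layer weights as slices of one larger tensor so that optimizer updates and sharding touch a single buffer instead of many small ones, can be sketched as follows. The class name, shapes, and the SGD-style bulk update are illustrative, not the PR's implementation:

```python
import numpy as np

class ParamBank:
    """Sketch of 3D parameter banking: L same-shaped layer matrices
    stored as slices of one (L, d_out, d_in) tensor, so bulk operations
    launch one kernel over one buffer rather than L small ones."""
    def __init__(self, n_layers, d_out, d_in, rng):
        self.bank = rng.standard_normal((n_layers, d_out, d_in)) * 0.02

    def layer(self, i):
        return self.bank[i]           # a view into the bank, no copy

    def bulk_update(self, grads, lr):
        self.bank -= lr * grads       # one fused update for all layers
```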
other
LeakyReLU(0.5)^2 activation used to accelerate polynomial approximation in MLP blocks.
parameters: {"activation":"LeakyReLU(0.5)^2"}
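The activation is given explicitly in the parameters, so it can be written down directly: LeakyReLU with negative slope 0.5, followed by squaring. Note the square makes the output non-negative and non-monotonic for negative inputs:

```python
import numpy as np

def leaky_relu_05_sq(x):
    """LeakyReLU(0.5)^2 as stated in the PR:
    y = leaky_relu(x, negative_slope=0.5) ** 2."""
    return np.where(x >= 0, x, 0.5 * x) ** 2
```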

Novel Contributions

  • Parallel Muon optimizer with asynchronous reduce_scatter/all_gather to overlap communication and computation
  • 3D parameter banking for sharded core weights
  • High-density 3-epoch test-time training enabled by reclaimed compute budget
  • 4-bit TurboQuant QAT with entropy-flattened weights
  • LeakyReLU(0.5)^2 activation for faster convergence
  • BigramHash-based embedding dimensionality reduction