PR #852

open

Hymba-11L: SOTA High-Density Takeover (1.1189 BPB)

by Prush69
val_bpb
1.1189
Architecture
Hymba-11L hybrid architecture combining Selective Scan (Mamba) and Rotary Attention
Optimizer
Muon
Artifact Size
14.5 MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"asynchronous_reduce_scatter":true,"asynchronous_all_gather":true,"orthogonalization":"Newton-Schulz 5","communication_computation_overlap":true}
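The "Newton-Schulz 5" orthogonalization named in `other_params` can be sketched as the quintic Newton-Schulz iteration used by public Muon implementations. The coefficients below come from that public code, not from this PR, and the test shapes are illustrative:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration (Muon-style). After a few steps the singular values of
    the output cluster near 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from public Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                         # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic polynomial in X
    return X.T if transposed else X
```

The asynchronous reduce_scatter/all_gather flags refer to overlapping this per-matrix computation with gradient communication; that scheduling is distributed-runtime logic not shown here.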
Architecture
Selective Scan (Mamba)
Hybrid architecture component used alongside rotary attention for sequence modeling.
parameters: null
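The selective-scan component admits a short sequential reference. The single-channel simplification and shapes below are illustrative; real Mamba implementations fuse this loop into a hardware-aware parallel kernel:

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Sequential reference for the selective-scan (Mamba) recurrence,
    one channel, state size N. Assumed shapes: x, delta: (T,),
    A: (N,), B, C: (T, N)."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        dA = np.exp(delta[t] * A)            # input-dependent decay
        h = dA * h + delta[t] * B[t] * x[t]  # selective state update
        y[t] = C[t] @ h                      # readout
    return y
```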
RoPE
Rotary attention used as part of the hybrid architecture.
parameters: null
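Rotary attention applies a position-dependent rotation to each feature pair. A minimal sketch of standard RoPE follows; the base of 10000 and the half-split pairing are common defaults, not confirmed by the PR:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (T, D), D even.
    Each (x1_i, x2_i) pair is rotated by a position- and
    frequency-dependent angle, so per-row norms are preserved."""
    T, D = x.shape
    half = D // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(T), inv_freq)  # (T, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)
```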
BigramHash
Hashed-bigram hybrid embedding for vocabulary efficiency and embedding dimensionality reduction.
parameters: null
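One plausible reading of BigramHash is a unigram table augmented with a bigram embedding looked up by hashing each (previous, current) token pair into a small fixed-size table. Everything below (table sizes, the additive combination, the hash function) is an assumption, not the PR's exact scheme:

```python
import numpy as np

def bigram_hash_embed(tokens, uni_table, bi_table):
    """Sketch of a hashed-bigram embedding: each position gets its
    unigram embedding plus a bigram embedding addressed by hashing
    (prev, cur) into a table far smaller than vocab**2."""
    n_bins = bi_table.shape[0]
    out = uni_table[tokens]  # fancy indexing returns a copy
    for i in range(1, len(tokens)):
        h = hash((int(tokens[i - 1]), int(tokens[i]))) % n_bins
        out[i] += bi_table[h]
    return out
```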
Quantization
QAT
bits: 4
scope: all
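QAT at 4 bits over all weights typically uses fake quantization in the forward pass with straight-through gradients. A symmetric per-tensor sketch follows; the scaling choice is a common default, not stated in the PR:

```python
import numpy as np

def fake_quant4(w):
    """Straight-through 4-bit fake quantization for QAT: round to 16
    int4 levels in the forward pass; in training, gradients would pass
    through the rounding unchanged. Symmetric per-tensor scale."""
    scale = np.abs(w).max() / 7.0 + 1e-12   # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale                         # dequantized view of the weights
```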
Test-Time Training
full TTT
parameters: {"epochs":3}
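Full TTT with `epochs: 3` means continuing to optimize the language-model loss on the evaluation text itself before scoring. A minimal sketch, with a hypothetical `grad_fn` closure and plain SGD standing in for whatever optimizer the PR actually uses:

```python
import numpy as np

def ttt_adapt(params, grad_fn, eval_tokens, lr=1e-4, epochs=3):
    """Sketch of full test-time training: run a few epochs of the usual
    LM objective over the evaluation tokens, then score with the
    adapted parameters. grad_fn(params, tokens) is a hypothetical
    closure returning one gradient per parameter."""
    for _ in range(epochs):
        grads = grad_fn(params, eval_tokens)
        for p, g in zip(params, grads):
            p -= lr * g  # in-place SGD step; real PR optimizer not shown
    return params
```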
Compression
zstd
level: 22
Other
other
3D parameter banking with sharded slices stored in larger tensors to reduce kernel launch overhead and facilitate bulk sharding.
parameters: null
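The banking idea, keeping per-layer weights as slices of one larger tensor so that optimizer updates and sharding touch a single buffer instead of many small ones, can be sketched as follows. The class name, shapes, and the SGD-style bulk update are illustrative, not the PR's implementation:

```python
import numpy as np

class ParamBank:
    """Sketch of 3D parameter banking: L same-shaped layer matrices
    stored as slices of one (L, d_out, d_in) tensor, so bulk operations
    launch one kernel over one buffer rather than L small ones."""
    def __init__(self, n_layers, d_out, d_in, rng):
        self.bank = rng.standard_normal((n_layers, d_out, d_in)) * 0.02

    def layer(self, i):
        return self.bank[i]           # a view into the bank, no copy

    def bulk_update(self, grads, lr):
        self.bank -= lr * grads       # one fused update for all layers
```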
other
LeakyReLU(0.5)^2 activation used to accelerate polynomial approximation in MLP blocks.
parameters: {"activation":"LeakyReLU(0.5)^2"}
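The activation is given explicitly in the parameters, so it can be written down directly: LeakyReLU with negative slope 0.5, followed by squaring. Note the square makes the output non-negative and non-monotonic for negative inputs:

```python
import numpy as np

def leaky_relu_05_sq(x):
    """LeakyReLU(0.5)^2 as stated in the PR:
    y = leaky_relu(x, negative_slope=0.5) ** 2."""
    return np.where(x >= 0, x, 0.5 * x) ** 2
```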

Novel Contributions

  • Parallel Muon optimizer with asynchronous reduce_scatter/all_gather to overlap communication and computation
  • 3D parameter banking for sharded core weights
  • High-density 3-epoch test-time training enabled by reclaimed compute budget
  • 4-bit TurboQuant QAT with entropy-flattened weights
  • LeakyReLU(0.5)^2 activation for faster convergence
  • BigramHash-based embedding dimensionality reduction