PR #2145

open

[Non-record] MHALM V2 non-record submission (1.3477 bpb)

by aquemy

val_bpb

1.3477

Architecture

Hybrid

Optimizer

Muon

Artifact Size

13.0 MB

Training Techniques

Architecture

BigramHash

Adds a bigram hash embedding path alongside token embeddings.

parameters: {"dimensions":160,"size":16384}
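
As a rough illustration of the bigram hash path, the sketch below maps each (previous token, current token) pair to one of 16384 buckets; each bucket would then index a 160-dimensional embedding row that is added alongside the token embedding. The mixing constants, the pad token, and the function names are assumptions made for illustration, not taken from the submission.

```python
# Hypothetical sketch: the hash constants are illustrative, not the submission's.
NUM_BUCKETS = 16384  # "size" from the parameters above
EMBED_DIM = 160      # "dimensions" from the parameters above


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a (previous, current) token pair into a bucket index."""
    mixed = (prev_token * 1000003) ^ (token * 998244353)
    return mixed % num_buckets


def bigram_indices(tokens: list[int], pad_token: int = 0) -> list[int]:
    """Return one bucket per position; the first position pairs with a pad token."""
    prev = [pad_token] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```

Identical bigrams always land in the same bucket, while distinct bigrams may collide; a hashed table accepts that in exchange for a fixed size.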

SmearGate

Uses a smear gate after the embedding and bigram hash inputs.

parameters: null
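
The card gives no details on the gate itself. One common form of a smear gate blends each position's input with the previous position's input through a learned per-channel sigmoid gate; the sketch below assumes that form, and the name `smear` and its arguments are hypothetical.

```python
import math


def smear(embeddings: list[list[float]], gate_logits: list[float]) -> list[list[float]]:
    """Blend each position with its predecessor: (1 - g) * x_t + g * x_{t-1}.

    Assumption: one learned gate logit per channel, shared across positions.
    """
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = [list(embeddings[0])]  # the first position has no predecessor
    for t in range(1, len(embeddings)):
        out.append([
            (1.0 - g) * cur + g * prev
            for g, cur, prev in zip(gates, embeddings[t], embeddings[t - 1])
        ])
    return out
```

Because each output depends only on the current and previous positions, the operation stays causal.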

U-Net skip connections

Uses U-Net style skip connections across HybridAtlasBlocks.

parameters: null

XSA

Uses causal self-attention in the temporal path.

parameters: {"heads":8,"layers_per_pass":2}

RoPE

Applies rotary positional embeddings in attention.

parameters: null
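
For reference, a minimal sketch of rotary embeddings on a single query or key vector. It assumes the interleaved pairing convention and the usual base of 10000; the submission may use a different layout or base.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of a vector by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Rotating queries and keys this way makes their dot product depend only on the relative offset between the two positions.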

Gated Attention

Combines the SSM and attention outputs through a learned gate.

parameters: null
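
A minimal sketch of such a gate, assuming a single scalar gate logit passed through a sigmoid. The submission may instead use a per-channel or per-token gate; `gated_mix` is a hypothetical name.

```python
import math


def gated_mix(ssm_out: list[float], attn_out: list[float], gate_logit: float) -> list[float]:
    """Convex combination g * ssm + (1 - g) * attn with g = sigmoid(gate_logit)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * s + (1.0 - g) * a for s, a in zip(ssm_out, attn_out)]
```

With `gate_logit = 0.0` the two paths are averaged; large positive values favor the SSM path and large negative values favor attention.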

weight tying

Ties the output projection to the embedding matrix.

parameters: null

Kernel readout heads

Uses five parallel kernel readout heads: Nyström, Gabor, Laplacian, Tucker GL, and Linear.

parameters: {"heads":5}

other

Geometric language model with Stäckel coordinate encoders and kernel-based spatial readout.

parameters: {"encoders":3,"encoder_dim":160}

other

Two-pass FFT SSM plus attention temporal path.

parameters: {"passes":2}

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"decoupled_with":"AdamW for scalars"}
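
One way to picture the split between the two optimizers is a routing step over parameter shapes. The sketch below assumes a simple rule (tensors with two or more dimensions go to Muon, everything else to AdamW); the card only states that AdamW handles scalars, so the rule and the parameter names in the test data are hypothetical.

```python
def route_parameters(shapes: dict[str, tuple[int, ...]]) -> dict[str, list[str]]:
    """Group parameter names by optimizer using tensor rank.

    Assumption: rank >= 2 goes to Muon; scalars and vectors go to AdamW.
    """
    groups: dict[str, list[str]] = {"muon": [], "adamw": []}
    for name, shape in shapes.items():
        target = "muon" if len(shape) >= 2 else "adamw"
        groups[target].append(name)
    return groups
```

Muon orthogonalizes matrix-shaped updates, which is why a rank-based rule is a natural first cut; a real setup might also route embeddings or output heads to AdamW by name.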

Weight Averaging

SWA

parameters: {"start":"last 40%"}
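
A sketch of uniform weight averaging over the final 40% of training, assuming snapshots are stored as (step, flat weight list) pairs. The snapshot format and the function name are illustrative, not the submission's.

```python
def swa_average(
    snapshots: list[tuple[int, list[float]]],
    total_steps: int,
    tail_percent: int = 40,
) -> list[float]:
    """Average the weight snapshots taken in the last `tail_percent` of steps."""
    # Integer comparison avoids floating-point edge cases at the boundary.
    tail = [
        weights
        for step, weights in snapshots
        if step * 100 >= total_steps * (100 - tail_percent)
    ]
    if not tail:
        raise ValueError("no snapshots fall inside the averaging window")
    return [sum(column) / len(tail) for column in zip(*tail)]
```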

Compression

zstd

level: 22

Regularization

logit softcap

parameters: {"value":30}
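
Logit softcapping is commonly written as cap * tanh(logit / cap). The sketch below assumes that form with the value 30 listed above; the card does not confirm the exact function used.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap instead of growing without bound.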

weight decay

parameters: null

Other

other

Stäckel penalty that softly encourages a diagonal covariance in the coordinate encoders.

parameters: {"beta":0.02}
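
The exact form of the penalty is not given here. One plausible reading is a soft constraint that penalizes the squared off-diagonal entries of the batch covariance of the encoder outputs, scaled by beta; the sketch below implements that assumed form, and the function name is hypothetical.

```python
def offdiag_cov_penalty(batch: list[list[float]], beta: float = 0.02) -> float:
    """Return beta times the sum of squared off-diagonal covariance entries.

    Assumption: `batch` holds one encoder output vector per example.
    """
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(row[i] * row[j] for row in centered) / n
                penalty += cov * cov
    return beta * penalty
```

The penalty is zero when the coordinates are uncorrelated across the batch and grows as they become correlated, so it nudges the covariance toward diagonal without enforcing it exactly.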

Novel Contributions

Geometric language model using learned Stäckel coordinate encoders
Five parallel kernel readout heads for spatial readout
Two-pass FFT SSM plus attention temporal path
Stiefel-enforced chart encoders
Learned mixture between spatial and temporal paths
Surgical Muon routing with AdamW for scalar parameters
SWA and zstd compression to fit the submission artifact