PR #476

Status: open

[Non-record] MHALM v1 (1.4574 bpb)

val_bpb: 1.4574
Architecture: Multi-head language model with kernel-based readout heads and a ComplexSSM + causal self-attention temporal stack
Optimizer: Muon
Artifact Size: 10.8 MB

Training Techniques

Architecture
BigramHash
Uses a BigramHash embedding/bucket mechanism with 10240 buckets to augment token representations.
parameters: {"buckets":10240}
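A minimal sketch of how such a bigram-hash augmentation could work: each (previous, current) token pair is hashed into one of 10240 buckets, and that bucket's embedding is added to the current token's embedding. The hash function, dimensions, and names below are illustrative assumptions, not details from the submission.

```python
import numpy as np

N_BUCKETS = 10240  # from the PR; everything else here is assumed
VOCAB, D_MODEL = 1000, 64

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
bigram_emb = rng.normal(size=(N_BUCKETS, D_MODEL)) * 0.02

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the pair; the real scheme is unspecified.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

def embed(tokens: list[int]) -> np.ndarray:
    # Token embedding plus a hashed-bigram embedding for positions t >= 1.
    x = tok_emb[tokens].copy()
    for t in range(1, len(tokens)):
        x[t] += bigram_emb[bigram_bucket(tokens[t - 1], tokens[t])]
    return x

x = embed([15, 7, 7, 15])
```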
weight tying
Output projection is weight-tied with the embedding.
parameters: null
multi-kernel readout heads
Replaces a single linear output layer with five kernel heads: Spherical, Gabor, Laplacian, Tucker, and Linear, combined by a learned mixer.
parameters: {"heads":5}
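The exact kernel definitions are not given in the PR, so the following sketch substitutes plausible forms for each named head: cosine similarity for "Spherical", a modulated Gaussian for "Gabor", an L1 distance for "Laplacian", a low-rank bilinear map as a stand-in for "Tucker", and a plain linear map. All shapes and names are assumptions for illustration; the real heads may differ substantially.

```python
import numpy as np

D, V, R = 32, 100, 8  # hidden dim, vocab size, rank of the Tucker-like head
rng = np.random.default_rng(0)
Ws, Wg, Wl, Wlin = rng.normal(size=(4, V, D)) * 0.1  # per-head prototypes
U = rng.normal(size=(D, R)) * 0.1
C = rng.normal(size=(R, V)) * 0.1

def head_logits(x: np.ndarray) -> np.ndarray:
    """Return a (5, V) array of logits from one hidden vector x of shape (D,)."""
    xn = x / np.linalg.norm(x)
    # "Spherical": cosine similarity against normalized class prototypes.
    spherical = (Ws / np.linalg.norm(Ws, axis=1, keepdims=True)) @ xn
    # "Gabor": oscillation modulated by a Gaussian envelope on distance.
    d2 = ((Wg - x) ** 2).sum(axis=1)
    gabor = np.cos(Wg @ x) * np.exp(-0.5 * d2)
    # "Laplacian": negative L1 distance to each prototype.
    laplacian = -np.abs(Wl - x).sum(axis=1)
    # "Tucker" stand-in: low-rank bilinear readout.
    tucker = (x @ U) @ C
    linear = Wlin @ x
    return np.stack([spherical, gabor, laplacian, tucker, linear])

logits = head_logits(rng.normal(size=D))
```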
ComplexSSM
Adds a complex-valued state-space model for long-range context processing.
parameters: null
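A hedged sketch of the generic diagonal form such a model usually takes: each channel carries a complex decay λ with |λ| < 1, the state recurrence is h_t = λ ⊙ h_{t-1} + x_t, and the real part is read out. The actual ComplexSSM parameterization is not stated in the PR.

```python
import numpy as np

D = 16
rng = np.random.default_rng(0)
# Stable complex poles: magnitude below 1, random phase per channel.
lam = 0.95 * np.exp(1j * rng.uniform(0, 2 * np.pi, size=D))

def complex_ssm(x: np.ndarray) -> np.ndarray:
    """x: (T, D) real input -> (T, D) real output via a complex recurrence."""
    h = np.zeros(D, dtype=complex)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = lam * h + x[t]
        out[t] = h.real
    return out

x_in = rng.normal(size=(32, D))
y = complex_ssm(x_in)
```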
causal self-attention
Uses 2 layers of causal self-attention with RoPE and query gain for local token interactions.
parameters: {"layers":2,"heads":8}
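A single-head numpy sketch of causal attention with rotary position embeddings (RoPE), plus one reading of "query gain" as a learnable scalar multiplier on the queries; that interpretation, and everything but the layer/head counts, is an assumption.

```python
import numpy as np

def rope(x: np.ndarray) -> np.ndarray:
    """Rotate channel pairs of x (T, D) by position-dependent angles."""
    T, D = x.shape
    half = D // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), freqs)            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def causal_attention(x, Wq, Wk, Wv, q_gain=1.0):
    T, D = x.shape
    q = rope(x @ Wq) * q_gain                      # "query gain" as assumed here
    k, v = rope(x @ Wk), x @ Wv
    scores = q @ k.T / np.sqrt(D)
    scores = np.where(np.tril(np.ones((T, T), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise causal softmax
    return w @ v

rng = np.random.default_rng(0)
D = 8
x = rng.normal(size=(5, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.3 for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv, q_gain=2.0)
```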
U-Net skip connection
Encoder outputs from Block 0 feed into Block 1 via a skip connection.
parameters: null
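A minimal sketch of the skip: Block 0's output is fed into Block 1 and also added back into the stream. Whether the combine is addition, concatenation, or gated is not stated; additive is assumed here, and the blocks are stand-ins.

```python
import numpy as np

def block(x, W):                      # stand-in for a full transformer block
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
D = 8
W0 = rng.normal(size=(D, D)) * 0.3
W1 = rng.normal(size=(D, D)) * 0.3
x = rng.normal(size=(4, D))

h0 = block(x, W0)                     # Block 0 (encoder)
h1 = block(h0, W1) + h0               # Block 1 receives h0 plus a skip of h0
```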
Weight Averaging
SWA (stochastic weight averaging) over checkpoints from the final portion of training.
parameters: {"checkpoints":201,"last_fraction":0.4}
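The core of SWA as described is just an elementwise average over the checkpoints saved in the last 40% of training (201 of them per the parameters). A toy sketch with dict-of-arrays "checkpoints"; the real code presumably averages model state dicts the same way.

```python
import numpy as np

def swa_average(checkpoints):
    """Average a list of {param_name: ndarray} checkpoints elementwise."""
    n = len(checkpoints)
    avg = {k: np.zeros_like(v) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v / n
    return avg

# Toy stand-ins for checkpoints from the averaging window.
ckpts = [{"w": np.full(3, float(i))} for i in range(5)]
avg = swa_average(ckpts)
```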
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"encoder matrices"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"everything else"}
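The PR only says Muon handles the encoder matrices and AdamW everything else; a common way to implement that split is to route matrix-shaped encoder parameters to Muon and the rest (embeddings, norms, biases, head parameters) to AdamW. The routing rule and parameter names below are assumptions.

```python
import numpy as np

# Hypothetical parameter dict; names and shapes are illustrative only.
params = {
    "encoder.attn.Wq": np.zeros((64, 64)),
    "encoder.mlp.W1":  np.zeros((64, 256)),
    "tok_embedding":   np.zeros((1000, 64)),
    "final_norm.g":    np.zeros(64),
    "heads.mixer.w":   np.zeros(5),
}

# Assumed rule: 2-D encoder weights -> Muon, everything else -> AdamW.
muon_params = {n for n, p in params.items()
               if n.startswith("encoder.") and p.ndim == 2}
adamw_params = set(params) - muon_params
```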
Other
other
Learned softmax-weighted mixer combines logits from five kernel heads with a soft cap to prevent domination by any single head.
parameters: null
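One way the mixer described above could look: a softmax over learned per-head weights, with each head's logits passed through a tanh soft cap (cap · tanh(x / cap)) before the weighted sum, so no single head's magnitude can dominate. The tanh form of the cap and the cap value are assumptions; the PR only says "soft cap".

```python
import numpy as np

def mix_heads(head_logits: np.ndarray, mix_w: np.ndarray, cap: float = 30.0):
    """head_logits: (H, V); mix_w: (H,) learned weights -> (V,) mixed logits."""
    w = np.exp(mix_w - mix_w.max())
    w /= w.sum()                                   # softmax mixing weights
    capped = cap * np.tanh(head_logits / cap)      # soft cap each head's logits
    return w @ capped

rng = np.random.default_rng(0)
mixed = mix_heads(rng.normal(size=(5, 100)) * 50, np.zeros(5))
```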

Novel Contributions

  • Multi-kernel language model with five geometric readout heads
  • Kernel heads based on Spherical, Gabor, Laplacian, Tucker, and Linear similarity measures
  • 128-dimensional Stäckel coordinate space for representation learning
  • BigramHash augmentation with 10240 buckets
  • ComplexSSM for long-range context combined with causal self-attention
  • Learned mixer over head logits with soft capping
  • SWA over the last 40% of training