PR #476

Status: open

[Non-record] MHALM v1 (1.4574 bpb)

val_bpb: 1.4574
Architecture: Multi-head language model with kernel-based readout heads and a ComplexSSM + causal self-attention temporal stack
Optimizer: Muon
Artifact Size: 10.8 MB

Training Techniques

Architecture
BigramHash
Uses a BigramHash embedding/bucket mechanism with 10240 buckets to augment token representations.
parameters: {"buckets":10240}
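A minimal sketch of how such a bigram-hash augmentation could work: each (previous, current) token pair is hashed into one of 10240 buckets, and that bucket's embedding is added to the current token's embedding. The hash function, dimensions, and names below are illustrative assumptions, not details from the submission.

```python
import numpy as np

N_BUCKETS = 10240  # from the PR; everything else here is assumed
VOCAB, D_MODEL = 1000, 64

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
bigram_emb = rng.normal(size=(N_BUCKETS, D_MODEL)) * 0.02

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the pair; the real scheme is unspecified.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

def embed(tokens: list[int]) -> np.ndarray:
    # Token embedding plus a hashed-bigram embedding for positions t >= 1.
    x = tok_emb[tokens].copy()
    for t in range(1, len(tokens)):
        x[t] += bigram_emb[bigram_bucket(tokens[t - 1], tokens[t])]
    return x

x = embed([15, 7, 7, 15])
```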
weight tying
Output projection is weight-tied with the embedding.
parameters: null
multi-kernel readout heads
Replaces a single linear output layer with five kernel heads: Spherical, Gabor, Laplacian, Tucker, and Linear, combined by a learned mixer.
parameters: {"heads":5}
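The exact kernel definitions are not given in the PR, so the following sketch substitutes plausible forms for each named head: cosine similarity for "Spherical", a modulated Gaussian for "Gabor", an L1 distance for "Laplacian", a low-rank bilinear map as a stand-in for "Tucker", and a plain linear map. All shapes and names are assumptions for illustration; the real heads may differ substantially.

```python
import numpy as np

D, V, R = 32, 100, 8  # hidden dim, vocab size, rank of the Tucker-like head
rng = np.random.default_rng(0)
Ws, Wg, Wl, Wlin = rng.normal(size=(4, V, D)) * 0.1  # per-head prototypes
U = rng.normal(size=(D, R)) * 0.1
C = rng.normal(size=(R, V)) * 0.1

def head_logits(x: np.ndarray) -> np.ndarray:
    """Return a (5, V) array of logits from one hidden vector x of shape (D,)."""
    xn = x / np.linalg.norm(x)
    # "Spherical": cosine similarity against normalized class prototypes.
    spherical = (Ws / np.linalg.norm(Ws, axis=1, keepdims=True)) @ xn
    # "Gabor": oscillation modulated by a Gaussian envelope on distance.
    d2 = ((Wg - x) ** 2).sum(axis=1)
    gabor = np.cos(Wg @ x) * np.exp(-0.5 * d2)
    # "Laplacian": negative L1 distance to each prototype.
    laplacian = -np.abs(Wl - x).sum(axis=1)
    # "Tucker" stand-in: low-rank bilinear readout.
    tucker = (x @ U) @ C
    linear = Wlin @ x
    return np.stack([spherical, gabor, laplacian, tucker, linear])

logits = head_logits(rng.normal(size=D))
```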
ComplexSSM
Adds a complex-valued state-space model for long-range context processing.
parameters: null
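A hedged sketch of the generic diagonal form such a model usually takes: each channel carries a complex decay λ with |λ| < 1, the state recurrence is h_t = λ ⊙ h_{t-1} + x_t, and the real part is read out. The actual ComplexSSM parameterization is not stated in the PR.

```python
import numpy as np

D = 16
rng = np.random.default_rng(0)
# Stable complex poles: magnitude below 1, random phase per channel.
lam = 0.95 * np.exp(1j * rng.uniform(0, 2 * np.pi, size=D))

def complex_ssm(x: np.ndarray) -> np.ndarray:
    """x: (T, D) real input -> (T, D) real output via a complex recurrence."""
    h = np.zeros(D, dtype=complex)
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = lam * h + x[t]
        out[t] = h.real
    return out

x_in = rng.normal(size=(32, D))
y = complex_ssm(x_in)
```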
causal self-attention
Uses 2 layers of causal self-attention with RoPE and query gain for local token interactions.
parameters: {"layers":2,"heads":8}
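A single-head numpy sketch of causal attention with rotary position embeddings (RoPE), plus one reading of "query gain" as a learnable scalar multiplier on the queries; that interpretation, and everything but the layer/head counts, is an assumption.

```python
import numpy as np

def rope(x: np.ndarray) -> np.ndarray:
    """Rotate channel pairs of x (T, D) by position-dependent angles."""
    T, D = x.shape
    half = D // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), freqs)            # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def causal_attention(x, Wq, Wk, Wv, q_gain=1.0):
    T, D = x.shape
    q = rope(x @ Wq) * q_gain                      # "query gain" as assumed here
    k, v = rope(x @ Wk), x @ Wv
    scores = q @ k.T / np.sqrt(D)
    scores = np.where(np.tril(np.ones((T, T), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise causal softmax
    return w @ v

rng = np.random.default_rng(0)
D = 8
x = rng.normal(size=(5, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.3 for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv, q_gain=2.0)
```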
U-Net skip connection
Encoder outputs from Block 0 feed into Block 1 via a skip connection.
parameters: null
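A minimal sketch of the skip: Block 0's output is fed into Block 1 and also added back into the stream. Whether the combine is addition, concatenation, or gated is not stated; additive is assumed here, and the blocks are stand-ins.

```python
import numpy as np

def block(x, W):                      # stand-in for a full transformer block
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
D = 8
W0 = rng.normal(size=(D, D)) * 0.3
W1 = rng.normal(size=(D, D)) * 0.3
x = rng.normal(size=(4, D))

h0 = block(x, W0)                     # Block 0 (encoder)
h1 = block(h0, W1) + h0               # Block 1 receives h0 plus a skip of h0
```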
Weight Averaging
SWA (stochastic weight averaging) over checkpoints from the final portion of training.
parameters: {"checkpoints":201,"last_fraction":0.4}
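The core of SWA as described is just an elementwise average over the checkpoints saved in the last 40% of training (201 of them per the parameters). A toy sketch with dict-of-arrays "checkpoints"; the real code presumably averages model state dicts the same way.

```python
import numpy as np

def swa_average(checkpoints):
    """Average a list of {param_name: ndarray} checkpoints elementwise."""
    n = len(checkpoints)
    avg = {k: np.zeros_like(v) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v / n
    return avg

# Toy stand-ins for checkpoints from the averaging window.
ckpts = [{"w": np.full(3, float(i))} for i in range(5)]
avg = swa_average(ckpts)
```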
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"encoder matrices"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"everything else"}
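The PR only says Muon handles the encoder matrices and AdamW everything else; a common way to implement that split is to route matrix-shaped encoder parameters to Muon and the rest (embeddings, norms, biases, head parameters) to AdamW. The routing rule and parameter names below are assumptions.

```python
import numpy as np

# Hypothetical parameter dict; names and shapes are illustrative only.
params = {
    "encoder.attn.Wq": np.zeros((64, 64)),
    "encoder.mlp.W1":  np.zeros((64, 256)),
    "tok_embedding":   np.zeros((1000, 64)),
    "final_norm.g":    np.zeros(64),
    "heads.mixer.w":   np.zeros(5),
}

# Assumed rule: 2-D encoder weights -> Muon, everything else -> AdamW.
muon_params = {n for n, p in params.items()
               if n.startswith("encoder.") and p.ndim == 2}
adamw_params = set(params) - muon_params
```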
Other
other
Learned softmax-weighted mixer combines logits from five kernel heads with a soft cap to prevent domination by any single head.
parameters: null
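One way the mixer described above could look: a softmax over learned per-head weights, with each head's logits passed through a tanh soft cap (cap · tanh(x / cap)) before the weighted sum, so no single head's magnitude can dominate. The tanh form of the cap and the cap value are assumptions; the PR only says "soft cap".

```python
import numpy as np

def mix_heads(head_logits: np.ndarray, mix_w: np.ndarray, cap: float = 30.0):
    """head_logits: (H, V); mix_w: (H,) learned weights -> (V,) mixed logits."""
    w = np.exp(mix_w - mix_w.max())
    w /= w.sum()                                   # softmax mixing weights
    capped = cap * np.tanh(head_logits / cap)      # soft cap each head's logits
    return w @ capped

rng = np.random.default_rng(0)
mixed = mix_heads(rng.normal(size=(5, 100)) * 50, np.zeros(5))
```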

Novel Contributions

  • Multi-kernel language model with five geometric readout heads
  • Kernel heads based on Spherical, Gabor, Laplacian, Tucker, and Linear similarity measures
  • 128-dimensional Stäckel coordinate space for representation learning
  • BigramHash augmentation with 10240 buckets
  • ComplexSSM for long-range context combined with causal self-attention
  • Learned mixer over head logits with soft capping
  • SWA over the last 40% of training