PR #1067

open

BSM (Bounded State Manifold) - A box intersection non-transformer architecture, 1.4242 val BPB

by dheeren-tejani
val_bpb: 1.4242
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~17.08 MB

Training Techniques

Architecture
BoxIntersectionMixer
Non-attention causal token mixing via geometric bounding box intersection using max/min pooling over box edges.
parameters: {"layers":12,"dimension":768,"sequence_length":1024}
U-Net skip connections
Planned improvement (mentioned in the README, not yet implemented): skip connections from encoder outputs to decoder inputs.
parameters: null
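
The BoxIntersectionMixer entry above describes causal mixing through box intersection with max/min pooling. A minimal sketch of that idea follows, assuming each token contributes one axis-aligned box per feature dimension and that position t sees the intersection of boxes 0..t. The function names and the list-of-lists layout are hypothetical, not taken from the PR. Because the intersection is a running max of lower edges and a running min of upper edges, one pass over the sequence is enough, which matches the O(N) claim.

```python
from itertools import accumulate

def causal_box_intersection(lowers, uppers):
    """Intersect the boxes of tokens 0..t for every position t.

    lowers[t] and uppers[t] hold the lower and upper edges of token t's box,
    one float per feature dimension (hypothetical layout, not from the PR).
    """
    def running(rows, op):
        # Prefix reduction: each output row only sees rows at or before it.
        return list(accumulate(rows, lambda acc, row: [op(a, b) for a, b in zip(acc, row)]))

    run_lo = running(lowers, max)  # intersection lower edge = max of lower edges
    run_hi = running(uppers, min)  # intersection upper edge = min of upper edges
    return run_lo, run_hi

def intersection_widths(run_lo, run_hi):
    # A negative extent means the boxes no longer overlap on that dimension;
    # clamping to zero is an assumption made for this sketch.
    return [[max(0.0, hi - lo) for lo, hi in zip(lo_row, hi_row)]
            for lo_row, hi_row in zip(run_lo, run_hi)]
```

Changing a later token's box leaves all earlier outputs untouched, which is the causality property a language-model mixer needs.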
Quantization
STE QAT
bits: null
scope: block weights
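
The STE QAT entry above gives no bit width, but the Novel Contributions list names ternary weights. A rough sketch of ternary fake-quantization with a straight-through estimator follows. The threshold rule (0.7 times the mean absolute weight) and all function names are assumptions for illustration; the PR may use a different rule.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map float weights to codes in {-1, 0, +1} plus one shared scale.

    threshold_ratio = 0.7 is an assumed heuristic, not a value from the PR.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    scale = sum(kept) / len(kept) if kept else 0.0
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    return codes, scale

def fake_quantize(weights):
    # Forward pass: the network sees the dequantized ternary weights.
    codes, scale = ternarize(weights)
    return [c * scale for c in codes]

def ste_backward(grad_wrt_quantized):
    # Straight-through estimator: rounding is treated as the identity,
    # so the gradient reaches the latent float weights unchanged.
    return list(grad_wrt_quantized)
```

The latent float weights are what the optimizer updates; only the forward pass uses the ternary values.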
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.04}
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.04}
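
For the Muon entry above (momentum 0.95, lr 0.04), the core step is to accumulate momentum and then replace the update matrix with an approximately orthogonal one. The sketch below uses the simple cubic Newton-Schulz iteration as a stand-in; the reference Muon implementation uses a tuned quintic iteration and Nesterov momentum, and nothing here is taken from the PR's code. All function names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def newton_schulz_orthogonalize(g, steps=10):
    """Approximate the orthogonal polar factor of g.

    Cubic iteration X <- 1.5 X - 0.5 X X^T X (a simplification for this sketch).
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

def muon_step(weight, grad, buf, lr=0.04, momentum=0.95):
    # Plain (non-Nesterov) momentum accumulation, an assumption of this sketch.
    buf = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    update = newton_schulz_orthogonalize(buf)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, buf
```

Non-matrix parameters would typically go to the AdamW group listed alongside Muon rather than through this step.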
Initialization
OrthoInit
Centers are initialized orthogonally.
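
One way to read "initialized orthogonally" is that the center vectors start as mutually orthogonal unit vectors. The sketch below builds such rows with Gram-Schmidt on Gaussian samples. The function name, the seed handling, and the restriction to at most `dim` rows are assumptions; the PR does not say how its centers are shaped or generated.

```python
import random

def orthogonal_rows(num_rows, dim, seed=0):
    """Return num_rows mutually orthogonal unit vectors of length dim."""
    if num_rows > dim:
        raise ValueError("cannot fit more than dim orthogonal rows")
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along every vector already accepted.
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the (unlikely) degenerate sample
            basis.append([x / norm for x in v])
    return basis
```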
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.2}
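
A warmdown fraction of 0.2 is commonly read as: hold the learning rate constant for the first 80% of training, then decay it over the final 20%. The sketch below assumes a linear decay to zero and uses the listed lr of 0.04 as the default. The decay shape and the function name are assumptions, not confirmed by the PR.

```python
def warmdown_lr(step, total_steps, base_lr=0.04, warmdown_fraction=0.2):
    """Constant learning rate, then a linear ramp to zero over the last fraction of steps."""
    warmdown_steps = int(total_steps * warmdown_fraction)
    start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / warmdown_steps
```

With 1000 total steps the rate stays at 0.04 through step 800, is halved by step 900, and reaches zero at step 1000.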
Compression
lzma
level: null
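
Since the compression level is not given, the sketch below uses the defaults of Python's standard `lzma` module. It only illustrates the round trip on a placeholder byte string; the helper names are hypothetical and the PR's actual serialization format is not known.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Default preset and container format; the PR does not state either.
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)

# Placeholder payload standing in for serialized, quantized weights.
payload = bytes([0, 1, 2] * 1000)
blob = compress_artifact(payload)
```

Low-entropy payloads such as ternary codes tend to compress well, which is presumably why quantization and lzma are paired here.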

Novel Contributions

  • Bounded State Manifold (BSM) non-attention architecture
  • Geometric bounding-box intersection for causal token mixing
  • O(N) max/min pooling-based mixer
  • Ternary weight quantization with straight-through estimator
  • Muon optimizer for matrix parameters
  • Orthogonal initialization of token centers