PR #1665 (open)

Non-record: BESE + Mamba-3 SSD Hybrid (1.3571 BPB, 7.6 MB artifact)

val_bpb: 1.3571
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 7.6 MB

Training Techniques

Architecture
Hybrid
6 Mamba-3 SSD blocks combined with 2 attention blocks for global mixing.
parameters: {"layers":8,"mamba_blocks":6,"attention_blocks":2,"attention_positions":[2,5],"dim":512,"d_state":128,"ngroups":1,"expand":2}
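The layer layout implied by the parameters above can be sketched as follows; the 0-indexed interpretation of attention_positions [2, 5] is an assumption:

```python
# Hypothetical sketch of the 8-layer hybrid stack described above:
# attention at positions {2, 5} (assumed 0-indexed), Mamba-3 SSD elsewhere.
ATTENTION_POSITIONS = {2, 5}
NUM_LAYERS = 8

def layer_types(num_layers=NUM_LAYERS, attn_positions=ATTENTION_POSITIONS):
    """Return the per-layer block type for the hybrid stack."""
    return ["attention" if i in attn_positions else "mamba_ssd"
            for i in range(num_layers)]

# 6 Mamba-3 SSD blocks for state-space mixing, 2 attention blocks for global mixing
print(layer_types())
```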
GQA
Attention blocks use grouped-query attention.
parameters: {"num_heads":8,"num_kv_heads":4}
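A minimal numpy sketch of grouped-query attention with the listed config: with num_heads=8 and num_kv_heads=4, each K/V head is shared by num_heads // num_kv_heads = 2 query heads. The causal, single-sequence formulation here is an illustrative assumption, not the submission's code:

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (T, num_heads, hd); k, v: (T, num_kv_heads, hd)
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)              # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(q.shape[-1])
    T = q.shape[0]
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, causal] = -np.inf                  # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over keys
    return np.einsum("hts,shd->thd", w, v)
```

Halving the KV heads halves the K/V projection parameters and cache relative to full multi-head attention.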
depth recurrence
Depth recurrence on SSM layers was disabled because it hurt performance.
parameters: null
weight tying
Shared B/C projections across SSM heads via ngroups=1.
parameters: {"ngroups":1}
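A rough sketch of why ngroups=1 ties the B/C projections: the SSM's B and C input projections produce ngroups * d_state channels that all heads read, instead of one B/C per head (a GQA-like sharing on the state-space side). The per-group comparison below uses a hypothetical 8 groups for contrast:

```python
# Parameter count of the B and C input projections for the listed
# config (dim=512, d_state=128). With ngroups=1 every SSM head shares
# one B/C; larger ngroups would multiply this count accordingly.
def bc_params(dim=512, d_state=128, ngroups=1):
    return 2 * dim * ngroups * d_state   # B and C projections combined

shared = bc_params(ngroups=1)    # tied across heads, as in this submission
per_group = bc_params(ngroups=8) # hypothetical untied comparison: 8x larger
```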
Quantization
mixed int6
bits: 6
scope: MLP, attention, and Mamba projection weights
Compression
lzma
level: 9
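The quantization and compression stages above might look like the following; this is an illustrative sketch (symmetric per-tensor scaling and byte-aligned storage are assumptions, not the submission's exact packing):

```python
import lzma
import numpy as np

def quantize_int6(w):
    # Symmetric int6 range [-31, 31]; per-tensor scale is an assumption.
    scale = max(float(np.abs(w).max()) / 31.0, 1e-12)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack(q):
    # lzma at preset 9, matching the listed compression level.
    return lzma.compress(q.tobytes(), preset=9)

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int6(w)
blob = pack(q)
```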
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
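A sketch of the stride-64 sliding-window evaluation, assuming the usual scheme where each 2048-token window scores only the tokens not covered by the previous window, so every token keeps up to 2048 tokens of left context. `nll_fn` is a hypothetical stand-in for the model; here a uniform model over the 288-token vocab:

```python
import math

def sliding_window_bpb(tokens, nll_fn, context_length=2048, stride=64):
    total_nll, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        n_targets = end - prev_end                 # only newly covered tokens are scored
        total_nll += nll_fn(tokens[begin:end], n_targets)
        prev_end = end
        if end == len(tokens):
            break
    # Byte-level tokens, so nats per token -> bits per byte.
    return total_nll / (len(tokens) * math.log(2))

uniform_nll = lambda window, n: n * math.log(288)  # stand-in model: uniform over vocab
bpb = sliding_window_bpb(list(range(5000)), uniform_nll)
```

For the uniform stand-in this recovers log2(288) ≈ 8.17 BPB, a sanity check that every token is scored exactly once.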
Other
n-gram tilt
Applied an n-gram tilt using a pre-computed trigram prior as an additive logit bias during evaluation.
parameters: {"ngram":"trigram"}
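The trigram tilt can be sketched as an additive log-prior on the logits; the smoothing scheme and the tilt weight `alpha` below are illustrative assumptions (the source only states that a pre-computed trigram prior is added at eval time):

```python
import math

def tilt_logits(logits, context, trigram_counts, vocab_size=288, alpha=0.1):
    """Add a scaled, add-one-smoothed log trigram prior to the model logits."""
    prev2 = tuple(context[-2:])                     # trigram context: last two tokens
    counts = trigram_counts.get(prev2, {})
    total = sum(counts.values()) + vocab_size       # add-one smoothing (assumption)
    return [
        logit + alpha * math.log((counts.get(tok, 0) + 1) / total)
        for tok, logit in enumerate(logits)
    ]
```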
BESE tokenizer
Custom BESE byte-level tokenizer with a 288-token vocabulary.
parameters: {"vocab_size":288}
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"start_step":1200}
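The EMA update with the listed decay is a standard exponential moving average over parameters; the plain-Python per-parameter form below is a sketch (SWA from step 1200 would similarly accumulate a running mean of checkpoints):

```python
# EMA of model parameters with decay 0.9965, as listed above.
def ema_update(ema, params, decay=0.9965):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for step in range(3):
    ema = ema_update(ema, [1.0])   # ema converges toward the live parameters
```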
LR Schedule
warmdown
parameters: {"warmdown_steps":5000}
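A warmdown schedule typically holds the base learning rate and then decays linearly to zero over the final warmdown_steps; the sketch below assumes that shape, and total_steps is an illustrative value not given in the source:

```python
# Hold base LR, then decay linearly to zero over the last 5000 steps.
def lr_at(step, base_lr, total_steps, warmdown_steps=5000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```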
Sequence Length
train_length: 2048
eval_length: 2048

Novel Contributions

  • First submission combining a custom byte-level tokenizer with Mamba-3 SSD
  • BESE 288-vocab tokenizer for improved artifact efficiency
  • Hybrid architecture mixing Mamba-3 SSD blocks with attention blocks
  • Demonstrated competitive BPB at roughly half the artifact budget
  • Pure PyTorch SSD implementation with a causality fix in chunked SSD