PR #1665 (open)

Non-record: BESE + Mamba-3 SSD Hybrid (1.3571 BPB, 7.6 MB artifact)

val_bpb: 1.3571
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 7.6 MB

Training Techniques

Architecture
Hybrid
6 Mamba-3 SSD blocks combined with 2 attention blocks for global mixing.
parameters: {"layers":8,"mamba_blocks":6,"attention_blocks":2,"attention_positions":[2,5],"dim":512,"d_state":128,"ngroups":1,"expand":2}
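The layer layout implied by the parameters above can be sketched as follows; the 0-indexed interpretation of attention_positions [2, 5] is an assumption:

```python
# Hypothetical sketch of the 8-layer hybrid stack described above:
# attention at positions {2, 5} (assumed 0-indexed), Mamba-3 SSD elsewhere.
ATTENTION_POSITIONS = {2, 5}
NUM_LAYERS = 8

def layer_types(num_layers=NUM_LAYERS, attn_positions=ATTENTION_POSITIONS):
    """Return the per-layer block type for the hybrid stack."""
    return ["attention" if i in attn_positions else "mamba_ssd"
            for i in range(num_layers)]

# 6 Mamba-3 SSD blocks for state-space mixing, 2 attention blocks for global mixing
print(layer_types())
```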
GQA
Attention blocks use grouped-query attention.
parameters: {"num_heads":8,"num_kv_heads":4}
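A minimal numpy sketch of grouped-query attention with the listed config: with num_heads=8 and num_kv_heads=4, each K/V head is shared by num_heads // num_kv_heads = 2 query heads. The causal, single-sequence formulation here is an illustrative assumption, not the submission's code:

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (T, num_heads, hd); k, v: (T, num_kv_heads, hd)
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)              # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(q.shape[-1])
    T = q.shape[0]
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, causal] = -np.inf                  # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over keys
    return np.einsum("hts,shd->thd", w, v)
```

Halving the KV heads halves the K/V projection parameters and cache relative to full multi-head attention.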
depth recurrence
Depth recurrence on SSM layers was disabled because it hurt performance.
parameters: null
weight tying
Shared B/C projections across SSM heads via ngroups=1.
parameters: {"ngroups":1}
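A rough sketch of why ngroups=1 ties the B/C projections: the SSM's B and C input projections produce ngroups * d_state channels that all heads read, instead of one B/C per head (a GQA-like sharing on the state-space side). The per-group comparison below uses a hypothetical 8 groups for contrast:

```python
# Parameter count of the B and C input projections for the listed
# config (dim=512, d_state=128). With ngroups=1 every SSM head shares
# one B/C; larger ngroups would multiply this count accordingly.
def bc_params(dim=512, d_state=128, ngroups=1):
    return 2 * dim * ngroups * d_state   # B and C projections combined

shared = bc_params(ngroups=1)    # tied across heads, as in this submission
per_group = bc_params(ngroups=8) # hypothetical untied comparison: 8x larger
```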
Quantization
mixed int6
bits: 6
scope: MLP, attention, and Mamba projection weights
Compression
lzma
level: 9
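The quantization and compression stages above might look like the following; this is an illustrative sketch (symmetric per-tensor scaling and byte-aligned storage are assumptions, not the submission's exact packing):

```python
import lzma
import numpy as np

def quantize_int6(w):
    # Symmetric int6 range [-31, 31]; per-tensor scale is an assumption.
    scale = max(float(np.abs(w).max()) / 31.0, 1e-12)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack(q):
    # lzma at preset 9, matching the listed compression level.
    return lzma.compress(q.tobytes(), preset=9)

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int6(w)
blob = pack(q)
```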
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
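A sketch of the stride-64 sliding-window evaluation, assuming the usual scheme where each 2048-token window scores only the tokens not covered by the previous window, so every token keeps up to 2048 tokens of left context. `nll_fn` is a hypothetical stand-in for the model; here a uniform model over the 288-token vocab:

```python
import math

def sliding_window_bpb(tokens, nll_fn, context_length=2048, stride=64):
    total_nll, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        n_targets = end - prev_end                 # only newly covered tokens are scored
        total_nll += nll_fn(tokens[begin:end], n_targets)
        prev_end = end
        if end == len(tokens):
            break
    # Byte-level tokens, so nats per token -> bits per byte.
    return total_nll / (len(tokens) * math.log(2))

uniform_nll = lambda window, n: n * math.log(288)  # stand-in model: uniform over vocab
bpb = sliding_window_bpb(list(range(5000)), uniform_nll)
```

For the uniform stand-in this recovers log2(288) ≈ 8.17 BPB, a sanity check that every token is scored exactly once.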
Other
n-gram tilt
Applied an n-gram tilt using a pre-computed trigram prior as an additive logit bias during evaluation.
parameters: {"ngram":"trigram"}
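The trigram tilt can be sketched as an additive log-prior on the logits; the smoothing scheme and the tilt weight `alpha` below are illustrative assumptions (the source only states that a pre-computed trigram prior is added at eval time):

```python
import math

def tilt_logits(logits, context, trigram_counts, vocab_size=288, alpha=0.1):
    """Add a scaled, add-one-smoothed log trigram prior to the model logits."""
    prev2 = tuple(context[-2:])                     # trigram context: last two tokens
    counts = trigram_counts.get(prev2, {})
    total = sum(counts.values()) + vocab_size       # add-one smoothing (assumption)
    return [
        logit + alpha * math.log((counts.get(tok, 0) + 1) / total)
        for tok, logit in enumerate(logits)
    ]
```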
BESE tokenizer
Custom BESE byte-level tokenizer with a 288-token vocabulary.
parameters: {"vocab_size":288}
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"start_step":1200}
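The EMA update with the listed decay is a standard exponential moving average over parameters; the plain-Python per-parameter form below is a sketch (SWA from step 1200 would similarly accumulate a running mean of checkpoints):

```python
# EMA of model parameters with decay 0.9965, as listed above.
def ema_update(ema, params, decay=0.9965):
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for step in range(3):
    ema = ema_update(ema, [1.0])   # ema converges toward the live parameters
```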
LR Schedule
warmdown
parameters: {"warmdown_steps":5000}
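A warmdown schedule typically holds the base learning rate and then decays linearly to zero over the final warmdown_steps; the sketch below assumes that shape, and total_steps is an illustrative value not given in the source:

```python
# Hold base LR, then decay linearly to zero over the last 5000 steps.
def lr_at(step, base_lr, total_steps, warmdown_steps=5000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```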
Sequence Length
train_length: 2048
eval_length: 2048

Novel Contributions

  • First submission combining a custom byte-level tokenizer with Mamba-3 SSD
  • BESE 288-vocab tokenizer for improved artifact efficiency
  • Hybrid architecture mixing Mamba-3 SSD blocks with attention blocks
  • Demonstrated competitive BPB at roughly half the artifact budget
  • Pure PyTorch SSD implementation with a causality fix in chunked SSD