PR #664
Non-record: hybrid spiking Transformer (SNN) with a multi-step spiking MLP
by tsbiosky
val_bpb
1.2982
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.78 MB
Training Techniques
Architecture
spiking MLP
Replaces the standard Transformer feed-forward block with a multi-step leaky integrate-and-fire (LIF-style) spiking MLP while keeping dense attention and the rest of the Transformer pipeline unchanged.
parameters: {"layers":9,"width":512,"attention_heads":8,"kv_heads":4,"sequence_length":1024,"snn_steps":2}
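The feed-forward replacement described above can be sketched as a multi-step LIF block. This is a hedged illustration, not the PR's code: `snn_steps=2` comes from the listed parameters, while the threshold `v_th`, leak `beta`, soft reset-by-subtraction, and averaging of step outputs are assumptions chosen for clarity.

```python
import numpy as np

def lif_mlp_forward(x, w_in, w_out, snn_steps=2, v_th=1.0, beta=0.5):
    """Sketch of a multi-step LIF-style spiking MLP block.

    x: (tokens, width) activations; w_in/w_out: dense projections.
    The hidden membrane potential leakily integrates the same input
    current at each of `snn_steps` steps, fires a binary spike when
    it crosses v_th, is reset by subtraction, and the per-step
    outputs are averaged (all of these details are assumptions).
    """
    v = np.zeros((x.shape[0], w_in.shape[1]))   # membrane potential
    out = np.zeros_like(x)
    current = x @ w_in                          # input current, reused per step
    for _ in range(snn_steps):
        v = beta * v + current                  # leaky integration
        spikes = (v >= v_th).astype(x.dtype)    # binary firing
        v = v - spikes * v_th                   # soft reset
        out += (spikes @ w_out) / snn_steps     # average step outputs
    return out, spikes
```

Because only the feed-forward block changes, this drops into a standard pre-norm Transformer layer wherever the dense MLP would normally sit.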
tied embeddings
Uses tied input/output embeddings as part of the baseline architecture.
parameters: null
RoPE
Uses rotary position embeddings in the Transformer baseline.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
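With 8 attention heads and 4 KV heads, each KV head is shared by two query heads. A minimal sketch of the KV expansion step (the function name and tensor layout are hypothetical; only the head counts come from the PR):

```python
import numpy as np

def expand_kv(kv, attention_heads=8, kv_heads=4):
    """Broadcast grouped KV heads to the full query-head count.

    kv: (kv_heads, seq, head_dim). Each KV head serves
    attention_heads // kv_heads consecutive query heads.
    """
    group = attention_heads // kv_heads
    return np.repeat(kv, group, axis=0)  # (attention_heads, seq, head_dim)
```

Halving the KV heads halves the KV-cache and KV-projection parameters while leaving the query side untouched.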
Quantization
int8
bits: 8
scope: final serialized model
Compression
zlib
level: null
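The export path (int8 quantization of the final serialized model, then zlib) can be sketched as below. Per-tensor absmax scaling and the compression level are illustrative assumptions; the PR does not state either.

```python
import numpy as np
import zlib

def serialize_int8_zlib(weights, level=9):
    """Hedged sketch of the artifact export: per-tensor absmax int8
    quantization, then zlib over the concatenated raw bytes.
    Scale choice and zlib level are assumptions, not the PR's code."""
    blob = bytearray()
    for w in weights:
        scale = max(np.abs(w).max() / 127.0, 1e-12)      # absmax scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blob += q.tobytes()                              # 1 byte per weight
    return zlib.compress(bytes(blob), level)
```

At 1 byte per parameter before compression, a model a few million parameters large plausibly lands under the 16 MB artifact limit once zlib is applied.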
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix-shaped parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"token embeddings and scalar/vector parameters"}
Regularization
spike-rate regularization
parameters: {"rate_loss":0.0001,"rate_target":0.15}
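The listed coefficients suggest an auxiliary loss that pulls the mean firing rate toward a target. A minimal sketch, assuming a squared-error penalty (the functional form is an assumption; `rate_loss=1e-4` and `rate_target=0.15` are the PR's parameters):

```python
import numpy as np

def spike_rate_loss(spikes, rate_target=0.15, rate_loss=1e-4):
    """Penalize deviation of the mean firing rate from the target.

    Squared-error form is assumed; the coefficient and target
    come from the PR's regularization parameters.
    """
    rate = spikes.mean()
    return rate_loss * (rate - rate_target) ** 2
```

Added to the language-modeling loss, this discourages both silent neurons (rate near 0) and saturated ones (rate near 1).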
Other
other
Surrogate-gradient training for spiking neurons using a sigmoid straight-through estimator.
parameters: {"grad_scale":4}
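The sigmoid straight-through estimator keeps the hard threshold in the forward pass but substitutes a scaled sigmoid's derivative in the backward pass. A sketch of both quantities (`grad_scale=4` is from the PR; the threshold `v_th` is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spike_with_surrogate(v, v_th=1.0, grad_scale=4.0):
    """Forward: hard Heaviside threshold on membrane potential.
    Backward (returned here explicitly): the derivative of
    sigmoid(grad_scale * (v - v_th)), used as a straight-through
    surrogate for the step function's zero-almost-everywhere gradient.
    """
    spikes = (v >= v_th).astype(v.dtype)
    s = sigmoid(grad_scale * (v - v_th))
    surrogate_grad = grad_scale * s * (1.0 - s)  # peaks at v == v_th
    return spikes, surrogate_grad
```

In an autograd framework this would be packaged as a custom function whose backward multiplies the incoming gradient by `surrogate_grad`; the explicit return here is just for illustration.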
Novel Contributions
- Hybrid Transformer + SNN-MLP design
- Replaces only the feed-forward block with a multi-step LIF-style spiking MLP
- Preserves the original Parameter Golf training, evaluation, and export pipeline
- Uses surrogate-gradient training for the spiking pathway
- Applies spike-rate regularization to control firing behavior
- Fits under the 16 MB submission limit after int8 + zlib compression