PR #664

open

Non-record: hybrid spiking Transformer (SNN) with a multi-step spiking MLP

by tsbiosky
View on GitHub
val_bpb
1.2982
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.78 MB

Training Techniques

Architecture
spiking MLP
Replaces the standard Transformer feed-forward block with a multi-step leaky integrate-and-fire (LIF-style) spiking MLP while keeping dense attention and the rest of the Transformer pipeline unchanged.
parameters: {"layers":9,"width":512,"attention_heads":8,"kv_heads":4,"sequence_length":1024,"snn_steps":2}
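The replacement block described above can be sketched as follows. This is a minimal numpy forward-pass illustration (not the PR's actual implementation), assuming a hard reset on spike, a leak factor of 1/tau, and the snn_steps=2 setting from the parameters; all names are illustrative.

```python
import numpy as np

def lif_mlp_forward(x, w1, w2, steps=2, tau=2.0, v_th=1.0):
    """Multi-step LIF-style spiking MLP (forward pass only, illustrative).

    x: (batch, width) input activations
    w1: (width, hidden), w2: (hidden, width) dense weights
    Feeds the same input current for `steps` time steps, leakily
    integrating membrane potential and emitting binary spikes.
    Returns the time-averaged spike-driven output and the mean firing rate.
    """
    hidden = x @ w1                        # static input current to the LIF layer
    v = np.zeros_like(hidden)              # membrane potential
    out = np.zeros((x.shape[0], w2.shape[1]))
    rates = []
    for _ in range(steps):
        v = v + (hidden - v) / tau         # leaky integration toward the input
        spikes = (v >= v_th).astype(x.dtype)
        v = v * (1.0 - spikes)             # hard reset where a spike fired
        rates.append(spikes.mean())
        out += spikes @ w2                 # dense readout of the binary spikes
    return out / steps, float(np.mean(rates))
```

In training, the binary threshold would be paired with a surrogate gradient (see the Other section below); this sketch covers only the forward dynamics.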
tied embeddings
Uses tied input/output embeddings as part of the baseline architecture.
parameters: null
RoPE
Uses rotary position embeddings in the Transformer baseline.
parameters: null
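For reference, rotary position embeddings can be sketched as below: consecutive feature pairs are rotated by position-dependent angles. A minimal numpy version, assuming the common base of 10000 and no batching; the PR's actual RoPE code is not shown here.

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq, dim) with even dim; rotate pair (x[2i], x[2i+1]) by
    # angle pos * base**(-2i/dim), so relative offsets become rotations.
    seq, dim = x.shape
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    ang = pos * freqs                      # (seq, dim/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Position 0 is left unchanged, and each rotation preserves vector norms.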
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
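With 8 query heads and 4 KV heads, each pair of query heads shares one KV head. A minimal numpy sketch of this grouping, omitting the causal mask, batching, and projections (illustrative only):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (T, n_heads, d); k, v: (S, n_kv_heads, d)
    # Each group of n_heads // n_kv_heads query heads attends to the
    # same KV head, halving the KV cache relative to full multi-head.
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)        # broadcast KV heads across groups
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    att = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    att = np.exp(att - att.max(-1, keepdims=True))   # stable softmax
    att /= att.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", att, v)
```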
Quantization
int8
bits: 8
scope: final serialized model
Compression
zlib
level: null
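The int8 + zlib export pipeline can be sketched as below: per-tensor symmetric int8 quantization followed by zlib compression of the packed bytes. The scaling scheme and function name are assumptions for illustration; the PR only states int8 quantization of the final serialized model plus zlib.

```python
import zlib
import numpy as np

def export_int8_zlib(tensors):
    """Quantize each float tensor to int8 with a per-tensor symmetric
    scale, then zlib-compress the concatenated bytes.
    Returns (compressed blob, per-tensor scales for dequantization)."""
    payload, scales = [], {}
    for name, t in tensors.items():
        scale = float(np.abs(t).max()) / 127.0 or 1.0  # avoid zero scale
        q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
        scales[name] = scale
        payload.append(q.tobytes())
    blob = zlib.compress(b"".join(payload), level=9)
    return blob, scales
```

Dequantization multiplies each int8 tensor by its stored scale; the scales themselves stay in float and add negligibly to the artifact size.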
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix-shaped parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"token embeddings and scalar/vector parameters"}
Regularization
spike-rate regularization
parameters: {"rate_loss":0.0001,"rate_target":0.15}
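One common form such a penalty takes is a squared deviation of the mean firing rate from the target, weighted by the coefficient; the squared-error form is an assumption, since the PR lists only the two parameters above.

```python
def spike_rate_penalty(mean_rate, target=0.15, coeff=1e-4):
    # Auxiliary loss term pushing the average spike rate toward `target`
    # (rate_target=0.15, rate_loss=0.0001 from the listed parameters).
    return coeff * (mean_rate - target) ** 2
```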
Other
other
Surrogate-gradient training for spiking neurons using a sigmoid straight-through estimator.
parameters: {"grad_scale":4}
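A minimal numpy sketch of this estimator, assuming the backward pass substitutes the derivative of a scaled sigmoid for the Heaviside step's zero-almost-everywhere gradient (grad_scale=4 from the parameters; function names are illustrative):

```python
import numpy as np

GRAD_SCALE = 4.0  # from parameters: {"grad_scale": 4}

def spike_forward(v, v_th=1.0):
    # Forward pass: hard threshold (non-differentiable Heaviside step).
    return (v >= v_th).astype(v.dtype)

def spike_surrogate_grad(v, v_th=1.0, scale=GRAD_SCALE):
    # Backward pass: pretend the spike was sigmoid(scale * (v - v_th))
    # and use its derivative, scale * s * (1 - s), in place of 0/inf.
    s = 1.0 / (1.0 + np.exp(-scale * (v - v_th)))
    return scale * s * (1.0 - s)
```

In an autograd framework this pair would live in a custom function whose backward multiplies the incoming gradient by `spike_surrogate_grad`; the surrogate peaks at the threshold (value scale/4) and decays away from it.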

Novel Contributions

  • Hybrid Transformer + SNN-MLP design
  • Replaces only the feed-forward block with a multi-step LIF-style spiking MLP
  • Preserves the original Parameter Golf training, evaluation, and export pipeline
  • Uses surrogate-gradient training for the spiking pathway
  • Applies spike-rate regularization to control firing behavior
  • Fits under the 16 MB submission limit after int8 + zlib compression