PR #664
Non-record: hybrid spiking Transformer (SNN) with a multi-step spiking MLP
by tsbiosky
val_bpb
1.2982
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.78 MB
Training Techniques
Architecture
spiking MLP
Replaces the standard Transformer feed-forward block with a multi-step leaky integrate-and-fire (LIF-style) spiking MLP while keeping dense attention and the rest of the Transformer pipeline unchanged.
parameters: {"layers":9,"width":512,"attention_heads":8,"kv_heads":4,"sequence_length":1024,"snn_steps":2}
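The feed-forward replacement described above can be sketched as a multi-step LIF block. This is a hedged illustration, not the PR's code: `snn_steps=2` comes from the listed parameters, while the threshold `v_th`, leak `beta`, soft reset-by-subtraction, and averaging of step outputs are assumptions chosen for clarity.

```python
import numpy as np

def lif_mlp_forward(x, w_in, w_out, snn_steps=2, v_th=1.0, beta=0.5):
    """Sketch of a multi-step LIF-style spiking MLP block.

    x: (tokens, width) activations; w_in/w_out: dense projections.
    The hidden membrane potential leakily integrates the same input
    current at each of `snn_steps` steps, fires a binary spike when
    it crosses v_th, is reset by subtraction, and the per-step
    outputs are averaged (all of these details are assumptions).
    """
    v = np.zeros((x.shape[0], w_in.shape[1]))   # membrane potential
    out = np.zeros_like(x)
    current = x @ w_in                          # input current, reused per step
    for _ in range(snn_steps):
        v = beta * v + current                  # leaky integration
        spikes = (v >= v_th).astype(x.dtype)    # binary firing
        v = v - spikes * v_th                   # soft reset
        out += (spikes @ w_out) / snn_steps     # average step outputs
    return out, spikes
```

Because only the feed-forward block changes, this drops into a standard pre-norm Transformer layer wherever the dense MLP would normally sit.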
tied embeddings
Uses tied input/output embeddings as part of the baseline architecture.
parameters: null
RoPE
Uses rotary position embeddings in the Transformer baseline.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
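With 8 attention heads and 4 KV heads, each KV head is shared by two query heads. A minimal sketch of the KV expansion step (the function name and tensor layout are hypothetical; only the head counts come from the PR):

```python
import numpy as np

def expand_kv(kv, attention_heads=8, kv_heads=4):
    """Broadcast grouped KV heads to the full query-head count.

    kv: (kv_heads, seq, head_dim). Each KV head serves
    attention_heads // kv_heads consecutive query heads.
    """
    group = attention_heads // kv_heads
    return np.repeat(kv, group, axis=0)  # (attention_heads, seq, head_dim)
```

Halving the KV heads halves the KV-cache and KV-projection parameters while leaving the query side untouched.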
Quantization
int8
bits: 8
scope: final serialized model
Compression
zlib
level: null
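The export path (int8 quantization of the final serialized model, then zlib) can be sketched as below. Per-tensor absmax scaling and the compression level are illustrative assumptions; the PR does not state either.

```python
import numpy as np
import zlib

def serialize_int8_zlib(weights, level=9):
    """Hedged sketch of the artifact export: per-tensor absmax int8
    quantization, then zlib over the concatenated raw bytes.
    Scale choice and zlib level are assumptions, not the PR's code."""
    blob = bytearray()
    for w in weights:
        scale = max(np.abs(w).max() / 127.0, 1e-12)      # absmax scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blob += q.tobytes()                              # 1 byte per weight
    return zlib.compress(bytes(blob), level)
```

At 1 byte per parameter before compression, a model a few million parameters large plausibly lands under the 16 MB artifact limit once zlib is applied.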
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix-shaped parameters"}
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"token embeddings and scalar/vector parameters"}
Regularization
spike-rate regularization
parameters: {"rate_loss":0.0001,"rate_target":0.15}
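The listed coefficients suggest an auxiliary loss that pulls the mean firing rate toward a target. A minimal sketch, assuming a squared-error penalty (the functional form is an assumption; `rate_loss=1e-4` and `rate_target=0.15` are the PR's parameters):

```python
import numpy as np

def spike_rate_loss(spikes, rate_target=0.15, rate_loss=1e-4):
    """Penalize deviation of the mean firing rate from the target.

    Squared-error form is assumed; the coefficient and target
    come from the PR's regularization parameters.
    """
    rate = spikes.mean()
    return rate_loss * (rate - rate_target) ** 2
```

Added to the language-modeling loss, this discourages both silent neurons (rate near 0) and saturated ones (rate near 1).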
Other
other
Surrogate-gradient training for spiking neurons using a sigmoid straight-through estimator.
parameters: {"grad_scale":4}
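The sigmoid straight-through estimator keeps the hard threshold in the forward pass but substitutes a scaled sigmoid's derivative in the backward pass. A sketch of both quantities (`grad_scale=4` is from the PR; the threshold `v_th` is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spike_with_surrogate(v, v_th=1.0, grad_scale=4.0):
    """Forward: hard Heaviside threshold on membrane potential.
    Backward (returned here explicitly): the derivative of
    sigmoid(grad_scale * (v - v_th)), used as a straight-through
    surrogate for the step function's zero-almost-everywhere gradient.
    """
    spikes = (v >= v_th).astype(v.dtype)
    s = sigmoid(grad_scale * (v - v_th))
    surrogate_grad = grad_scale * s * (1.0 - s)  # peaks at v == v_th
    return spikes, surrogate_grad
```

In an autograd framework this would be packaged as a custom function whose backward multiplies the incoming gradient by `surrogate_grad`; the explicit return here is just for illustration.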
Novel Contributions
- Hybrid Transformer + SNN-MLP design
- Replaces only the feed-forward block with a multi-step LIF-style spiking MLP
- Preserves the original Parameter Golf training, evaluation, and export pipeline
- Uses surrogate-gradient training for the spiking pathway
- Applies spike-rate regularization to control firing behavior
- Fits under the 16 MB submission limit after int8 + zlib compression