PR #2064

open

non-record-16mb: SpikingMLP + GRU readout (T=4, H_GRU=64)

by Gary0302
val_bpb: 1.3779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,954,474 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
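The head counts above mean each KV head is shared by 8 / 4 = 2 query heads. A minimal NumPy sketch of the score computation (head dim 16 and the score-only scope are illustrative assumptions; values/softmax omitted):

```python
import numpy as np

def gqa_scores(q, k, n_heads=8, n_kv_heads=4):
    """Grouped-query attention scores: each KV head serves
    n_heads // n_kv_heads query heads. Head dim is an assumption."""
    group = n_heads // n_kv_heads           # 2 query heads per KV head
    k_rep = np.repeat(k, group, axis=0)     # share each KV head across its group
    d = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)   # (n_heads, T, T)

T, d = 5, 16
q = np.random.randn(8, T, d)
k = np.random.randn(4, T, d)
scores = gqa_scores(q, k)
```

Halving the KV heads shrinks the KV cache and the K/V projection weights, which helps under a tight artifact budget.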
RoPE
Rotary positional embeddings.
parameters: null
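RoPE rotates channel pairs of each query/key by position-dependent angles, so attention scores depend on relative position. A sketch, assuming the standard base of 10000 (not stated in the card):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding over (T, d) with d even: channel
    pairs are rotated by angles pos * base**(-2i/d)."""
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = np.outer(np.arange(T), inv_freq)         # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(10, 16)
y = rope(x)
```

Because each pair undergoes a pure rotation, per-position norms are preserved and position 0 is left unchanged.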
RMSNorm
RMS normalization in the Transformer blocks.
parameters: null
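RMSNorm divides by the root-mean-square of the features without subtracting the mean (unlike LayerNorm). A minimal version, with the learnable gain left optional as an assumption:

```python
import numpy as np

def rmsnorm(x, gain=None, eps=1e-6):
    """Scale x by its reciprocal RMS over the last axis; no mean
    subtraction and no bias. Optional per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain

x = np.random.randn(4, 32)
y = rmsnorm(x)
```

Dropping the mean/bias terms saves a little compute and a few parameters per block relative to LayerNorm.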
SpikingMLP
Replaced each block's ReLU² MLP with a per-token LIF spiking MLP plus GRUCell readout over T=4 micro-steps.
parameters: {"T":4,"h_gru":64,"beta":0.9,"thresh":0.5}
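The LIF dynamics implied by these parameters can be sketched as follows: the membrane leaks by beta each micro-step, integrates the input current, spikes when it crosses thresh, and resets. Reset-by-subtraction is an assumption; the card does not state the reset rule.

```python
import numpy as np

def lif_microsteps(current, T=4, beta=0.9, thresh=0.5):
    """Per-token LIF micro-steps (T=4, beta=0.9, thresh=0.5 from the
    PR's config). Returns the binary spike code the GRU reads.
    Reset-by-subtraction is an assumption."""
    v = np.zeros_like(current)
    spikes = []
    for _ in range(T):
        v = beta * v + current                     # leak + integrate
        s = (v > thresh).astype(current.dtype)     # binary spike
        v = v - s * thresh                         # soft reset
        spikes.append(s)
    return np.stack(spikes)                        # (T, ...) spike train

spike_code = lif_microsteps(np.array([0.3, 0.6, 1.2]))
```

The GRUCell readout then consumes these T binary vectors step by step and emits its final hidden state as the MLP output.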
Quantization
int8
bits: 8
scope: weights
Compression
zlib
level: null
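A sketch of the artifact pipeline the card implies: symmetric per-tensor int8 quantization followed by zlib. The scale rule (max-abs / 127) and the zlib level are assumptions, since the card lists the level as null.

```python
import zlib
import numpy as np

def pack_weights(w):
    """Quantize a float32 tensor to int8 with one per-tensor scale,
    then zlib-compress the bytes. Scale rule is an assumption."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale

def unpack_weights(blob, scale, shape):
    """Invert pack_weights: decompress, reshape, rescale."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
blob, scale = pack_weights(w)
w_hat = unpack_weights(blob, scale, w.shape)
```

int8 alone gives 4x over float32; zlib then squeezes out residual redundancy in the quantized bytes, which is what keeps the 14 M-byte artifact under the 16 MB cap.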
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Replaces the standard ReLU² MLP with a Spiking-LIF MLP.
  • Uses a per-token GRUCell readout over 4 LIF micro-steps.
  • Frames the GRU as an adapter reading the binary spike code produced by a random linear map.
  • Introduces per-token recurrent state-space dynamics inside the block forward pass.
  • Keeps the model under the 16 MB cap with int8 + zlib artifact compression.
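The GRU-as-adapter framing above can be sketched as a minimal GRUCell run over the T binary spike vectors. Weight names, sizes other than h_gru=64, and the random init are illustrative; the real readout is learned end to end.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_readout(spikes, params):
    """Run a GRUCell over T binary spike vectors and return the final
    hidden state as the per-token readout. params holds the six weight
    matrices (Wz, Uz, Wr, Ur, Wh, Uh); names are illustrative."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])
    for x in spikes:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde
    return h                                       # (h_gru,)

rng = np.random.default_rng(0)
d_in, h_gru = 32, 64                               # h_gru=64 as in the PR
params = [rng.standard_normal(s) * 0.1
          for s in [(h_gru, d_in), (h_gru, h_gru)] * 3]
spikes = (rng.standard_normal((4, d_in)) > 0.5).astype(float)
h = gru_readout(spikes, params)
```

Because the input at each micro-step is a binary code, the GRU effectively learns to decode the spike train, acting as the adapter the bullet describes.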