val_bpb: 1.3779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,954,474 bytes
Training Techniques
Architecture
- weight tying: tied input and output embeddings (parameters: none).
- GQA: grouped query attention with 8 query heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4}); see the attention sketch after this list.
- RoPE: rotary positional embeddings, applied to queries and keys in the same sketch (parameters: none).
- RMSNorm: RMS normalization in the Transformer blocks (parameters: none).
- SpikingMLP: replaced each block's ReLU² MLP with a per-token LIF spiking MLP plus a GRUCell readout over T=4 micro-steps (parameters: {"T": 4, "h_gru": 64, "beta": 0.9, "thresh": 0.5}); see the SpikingMLP sketch after this list.
Quantization: int8 (8 bits, weights only)
Compression: zlib (level not specified)
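A minimal sketch of how the 13,954,474-byte artifact could be packed from the two settings above: per-tensor symmetric int8 quantization of the weights, then zlib over the serialized result. The pickle container, the per-tensor symmetric scheme, and compression level 9 are assumptions; the card records only int8 weights and zlib with no level.

```python
import pickle
import zlib
import numpy as np
import torch

def pack_checkpoint(state_dict, path, level=9):            # level is an assumption (card: null)
    payload = {}
    for name, w in state_dict.items():
        w = w.detach().cpu().float().numpy()
        scale = float(np.abs(w).max() / 127.0 + 1e-12)      # per-tensor symmetric scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        payload[name] = (q, scale)
    blob = zlib.compress(pickle.dumps(payload), level)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)                                         # artifact size in bytes

def unpack_checkpoint(path):
    with open(path, "rb") as f:
        payload = pickle.loads(zlib.decompress(f.read()))
    return {name: torch.from_numpy(q.astype(np.float32)) * scale
            for name, (q, scale) in payload.items()}
```

Under a scheme like this, the compressed size returned by pack_checkpoint is what counts against the 16 MB cap mentioned under Novel Contributions.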
Optimizer: Muon (weight_decay, momentum, and other parameters not specified)
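The card names Muon but records none of its hyperparameters. For reference, a minimal sketch of the Muon update as described in its public reference implementation: momentum SGD where each 2D weight update is approximately orthogonalized with a quintic Newton-Schulz iteration. The coefficients below follow that reference; the learning rate, momentum, Nesterov option, and any shape-dependent update scaling used for this submission are unknown.

```python
import torch

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (push its singular values toward 1)
    # with a quintic Newton-Schulz iteration; coefficients from the Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(params, momenta, lr=0.02, momentum=0.95):      # lr and momentum are assumptions
    for p, buf in zip(params, momenta):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad)                       # momentum buffer
        update = newton_schulz_orth(buf) if p.ndim == 2 else buf
        p.add_(update, alpha=-lr)
```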
Sequence Length: train_length 1024, eval_length not specified
Novel Contributions
- Replaces the standard ReLU² MLP with a Spiking-LIF MLP.
- Uses a per-token GRUCell readout over 4 LIF micro-steps.
- Frames the GRU as an adapter on top of a random linear map / binary spike code.
- Introduces per-token recurrent state-space dynamics inside the block forward pass.
- Keeps the model under the 16 MB cap with int8 + zlib artifact compression.