val_bpb: 1.3779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,954,474 bytes
Training Techniques
Architecture
- weight tying: tied input and output embeddings (parameters: none).
- GQA: grouped query attention with 8 query heads and 4 KV heads (parameters: {"heads": 8, "kv_heads": 4}); see the attention sketch after this list.
- RoPE: rotary positional embeddings, applied to queries and keys in the same sketch (parameters: none).
- RMSNorm: RMS normalization in the Transformer blocks (parameters: none).
- SpikingMLP: replaced each block's ReLU² MLP with a per-token LIF spiking MLP plus a GRUCell readout over T=4 micro-steps (parameters: {"T": 4, "h_gru": 64, "beta": 0.9, "thresh": 0.5}); see the SpikingMLP sketch after this list.
Quantization: int8 (8 bits, weights only)
Compression: zlib (level not specified)
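A minimal sketch of how the 13,954,474-byte artifact could be packed from the two settings above: per-tensor symmetric int8 quantization of the weights, then zlib over the serialized result. The pickle container, the per-tensor symmetric scheme, and compression level 9 are assumptions; the card records only int8 weights and zlib with no level.

```python
import pickle
import zlib
import numpy as np
import torch

def pack_checkpoint(state_dict, path, level=9):            # level is an assumption (card: null)
    payload = {}
    for name, w in state_dict.items():
        w = w.detach().cpu().float().numpy()
        scale = float(np.abs(w).max() / 127.0 + 1e-12)      # per-tensor symmetric scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        payload[name] = (q, scale)
    blob = zlib.compress(pickle.dumps(payload), level)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)                                         # artifact size in bytes

def unpack_checkpoint(path):
    with open(path, "rb") as f:
        payload = pickle.loads(zlib.decompress(f.read()))
    return {name: torch.from_numpy(q.astype(np.float32)) * scale
            for name, (q, scale) in payload.items()}
```

Under a scheme like this, the compressed size returned by pack_checkpoint is what counts against the 16 MB cap mentioned under Novel Contributions.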
Optimizer: Muon (weight_decay, momentum, and other parameters not specified)
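The card names Muon but records none of its hyperparameters. For reference, a minimal sketch of the Muon update as described in its public reference implementation: momentum SGD where each 2D weight update is approximately orthogonalized with a quintic Newton-Schulz iteration. The coefficients below follow that reference; the learning rate, momentum, Nesterov option, and any shape-dependent update scaling used for this submission are unknown.

```python
import torch

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (push its singular values toward 1)
    # with a quintic Newton-Schulz iteration; coefficients from the Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(params, momenta, lr=0.02, momentum=0.95):      # lr and momentum are assumptions
    for p, buf in zip(params, momenta):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad)                       # momentum buffer
        update = newton_schulz_orth(buf) if p.ndim == 2 else buf
        p.add_(update, alpha=-lr)
```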
Sequence Length: train_length 1024, eval_length not specified
Novel Contributions
- Replaces the standard ReLU² MLP with a Spiking-LIF MLP.
- Uses a per-token GRUCell readout over 4 LIF micro-steps.
- Frames the GRU as an adapter on top of a random linear map / binary spike code.
- Introduces per-token recurrent state-space dynamics inside the block forward pass.
- Keeps the model under the 16 MB cap with int8 + zlib artifact compression.