PR #2116

open

Non-record: Inhibitory layer on PR #1851 stack (val_bpb 1.06438)

by cloud-777-boy
val_bpb
1.0644
Architecture
Transformer
Optimizer
Artifact Size
15,996,198 bytes

Training Techniques

Architecture
inhibitory layers
Low-rank gating mechanism applied to attention and MLP residual paths to provide a subtractive primitive.
parameters: {"rank":22,"paths":["attention residual","MLP residual"]}
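One possible reading of this entry, as a rough sketch and not the PR's actual code: a low-rank bottleneck projects the residual stream down and back up, a sigmoid turns the result into a per-channel gate in (0, 1), and the gate scales the branch output before it is added back. The gate can only shrink what the branch contributes, which is the inhibitory effect. The function names, the gate placement, and the toy sizes are assumptions; only the rank of 22 and the two gated paths come from the listing.

```python
import math
import random

def matvec(m, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def low_rank_gate(x, down, up, bias):
    """Per-channel gate in (0, 1): sigmoid(up @ (down @ x) + bias)."""
    hidden = matvec(down, x)      # d_model -> rank
    logits = matvec(up, hidden)   # rank -> d_model
    return [sigmoid(l + b) for l, b in zip(logits, bias)]

def inhibited_residual(x, branch_out, down, up, bias):
    """Residual update with the branch output scaled by the gate (assumed placement)."""
    gate = low_rank_gate(x, down, up, bias)
    return [xi + g * bi for xi, g, bi in zip(x, gate, branch_out)]

# Toy sizes for illustration; the listing gives rank 22, the width here is hypothetical.
d_model, rank = 8, 2
rng = random.Random(0)
down = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(rank)]
up = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_model)]
bias = [0.0] * d_model
x = [rng.gauss(0, 1) for _ in range(d_model)]
branch_out = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for attention or MLP output
y = inhibited_residual(x, branch_out, down, up, bias)
```

Under this reading each gate adds 2 × d_model × rank weights plus a bias vector, which is small next to the attention and MLP matrices.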
LeakyReLU
The base stack uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
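A minimal scalar sketch of this activation, assuming the square is applied directly to the LeakyReLU output with the listed slope of 0.5. The function name is hypothetical.

```python
def leaky_relu_squared(x, slope=0.5):
    """Square of LeakyReLU: x**2 for x >= 0, (slope * x)**2 for x < 0."""
    y = x if x >= 0 else slope * x
    return y * y
```

Under this assumption negative inputs still contribute, at a quarter of the magnitude of a positive input of the same size, since (0.5x)² = 0.25x².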
Partial RoPE
Rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
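As an illustration of the 16/64 split, the sketch below rotates 16 dimensions of a 64-dimensional head vector and passes the other 48 through unchanged. Which 16 dimensions are rotated, the pairing convention, and the base of 10000 are assumptions, not details from the PR.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; leave the rest as-is."""
    half = rotary_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]  # pair (i, i + half) inside the rotary slice
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [float(i + 1) for i in range(64)]  # one 64-dim query or key head
rotated = partial_rope(head, position=5)
```

The untouched dimensions carry no positional signal, so they can act as position-independent content channels.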
depth recurrence
Layers 3, 4, and 5 are looped within the stack (2 loops), with looping activated at fraction 0.35.
parameters: {"layers":[3,4,5],"loops":2,"activated_at_frac":0.35}
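The parameters above can be turned into a layer execution order. This is a sketch under stated assumptions: the stack has 11 blocks (taken from the XSA entry), `loops` counts total passes over the looped span, and `activated_at_frac` is a fraction of training progress. The helper name is hypothetical.

```python
def layer_schedule(num_layers, looped, loops, progress, activated_at_frac):
    """Order in which block indices run for one forward pass at a given progress."""
    if progress < activated_at_frac:
        return list(range(num_layers))         # plain stack before activation
    first, last = looped[0], looped[-1]
    order = list(range(first))                 # blocks before the looped span
    for _ in range(loops):                     # span runs `loops` times in total (assumption)
        order.extend(looped)
    order.extend(range(last + 1, num_layers))  # blocks after the span
    return order

early = layer_schedule(11, [3, 4, 5], 2, progress=0.10, activated_at_frac=0.35)
late = layer_schedule(11, [3, 4, 5], 2, progress=0.50, activated_at_frac=0.35)
```

Looping reuses the same weights, so effective depth grows from 11 to 14 block applications in this sketch without adding parameters to the artifact.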
weight tying
Tied input embeddings and output embeddings.
parameters: null
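For illustration, weight tying means one table serves both as the token lookup and as the output projection. The toy vocabulary and function names below are hypothetical.

```python
def embed(token_id, table):
    """Input side: look up the row for a token."""
    return table[token_id]

def output_logits(hidden, table):
    """Output side: reuse the same table, logits[v] = dot(hidden, table[v])."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in table]

table = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy vocab of 3, width 2
logits = output_logits(embed(0, table), table)
```

Tying removes one vocab × width matrix from the parameter count, which helps when the artifact size is capped.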
XSA
XSA applied across all layers.
parameters: {"layers":11}
SmearGate
SmearGate used in the model stack.
parameters: {"window":12}
GQA
Grouped-query attention configuration in the base stack.
parameters: {"heads":8,"kv_heads":4}
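With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. A small sketch of that mapping, assuming consecutive query heads share a KV head (the usual convention, but not confirmed by the PR):

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Index of the key/value head that a given query head reads from."""
    group_size = num_heads // num_kv_heads
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```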
MLP
Expanded MLP width in the transformer block.
parameters: {"multiplier":4}
Regularization
logit softcap
parameters: {"value":30}
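A sketch of softcapping with the listed value of 30, assuming the common tanh form `cap * tanh(logit / cap)`. The PR listing does not state which form is used.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap]; near-identity for small inputs."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate at the cap instead of growing without bound.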
Weight Averaging
EMA
parameters: null
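The listing gives no EMA parameters, so the following is only a generic sketch of exponential moving averaging of weights. The decay value and the toy update loop are assumptions chosen to keep the arithmetic easy to follow.

```python
def ema_update(ema, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

weights = [0.0, 0.0]
ema = list(weights)
for step in range(3):
    weights = [w + 1.0 for w in weights]  # stand-in for an optimizer step
    ema = ema_update(ema, weights, decay=0.5)
```

The averaged copy lags the live weights and smooths step-to-step noise; it is the averaged copy that would typically be evaluated or exported.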
Quantization
int8
bits: 8
scope: inhibitor weights
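A minimal sketch of row-wise int8 quantization, assuming "row-int8" means symmetric quantization with one floating-point scale per weight row. The function names are hypothetical and the PR's serialization format may differ.

```python
def quantize_rows_int8(matrix):
    """Symmetric per-row int8: each row keeps its own scale plus integers in [-127, 127]."""
    scales, q_rows = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate floats from the stored integers and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.02, -0.5, 0.25], [3.0, 1.5, -0.75]]
q, s = quantize_rows_int8(w)
w_hat = dequantize_rows(q, s)
```

A per-row scale keeps the rounding error of each row proportional to that row's own largest weight, so a row of small values is not crushed by a large value elsewhere in the matrix.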
Test-Time Training
score-first TTT
parameters: {"phases":3}
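The score-first ordering can be shown on a toy problem: each chunk is scored with the current parameters, and only afterwards is the model adapted on that chunk, so no reported loss benefits from an update made on the same data. The one-parameter model, learning rate, and data below are hypothetical, and the 3 phases from the listing are not modeled.

```python
def score_first_ttt(chunks, mu=0.0, lr=0.25):
    """Score each chunk with the current parameter, then adapt on that same chunk."""
    losses = []
    for chunk in chunks:
        # 1) Score first: this loss uses parameters that have not seen the chunk.
        loss = sum((x - mu) ** 2 for x in chunk) / len(chunk)
        losses.append(loss)
        # 2) Then adapt: one gradient step of the squared error on the scored chunk.
        grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
        mu -= lr * grad
    return losses, mu

chunks = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
losses, mu = score_first_ttt(chunks)
```

Later chunks score better because the model has adapted on earlier ones, while every individual score remains a prediction on unseen data.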
Compression
brotli
level: null
Other
other
Biologically inspired inhibitory primitive motivated by cortical inhibitory interneurons and the fly mushroom body APL neuron.
parameters: null

Novel Contributions

  • Introduces inhibitory layers as a subtractive primitive for transformer residual streams
  • Applies low-rank sigmoid gates to both attention and MLP residual paths
  • Uses a 0.95 sigmoid initialization to avoid saturated, weak-gradient gates
  • Serializes inhibitor weights as row-int8 to fit under the 16MB artifact limit
  • Builds on the PR #1851 stack and combines with phased post-TTT
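The initialization point can be checked numerically. The sketch below reads "0.95 sigmoid initialization" as setting the gate's pre-activation so the sigmoid outputs 0.95 at the start of training, which is an assumption. It then compares the sigmoid's local gradient there with a gate that starts closer to saturation.

```python
import math

def logit(p):
    """Inverse sigmoid: the pre-activation that makes sigmoid output p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

bias_init = logit(0.95)                    # ln(19), about 2.944
gate_at_init = sigmoid(bias_init)          # about 0.95
grad_at_init = sigmoid_grad(bias_init)     # 0.95 * 0.05 = 0.0475
grad_if_0999 = sigmoid_grad(logit(0.999))  # 0.999 * 0.001, roughly 48x weaker
```

A gate starting at 0.95 lets most of the branch output through while keeping a usable gradient; a gate initialized at 0.999 would pass slightly more signal but learn far more slowly.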