PR #2116

open

Non-record: Inhibitory layer on PR #1851 stack (val_bpb 1.06438)

by cloud-777-boy

val_bpb

1.0644

Architecture

Transformer

Optimizer

—

Artifact Size

15,996,198 bytes

Training Techniques

Architecture

inhibitory layers

Low-rank gating mechanism applied to attention and MLP residual paths to provide a subtractive primitive.

parameters: {"rank":22,"paths":["attention residual","MLP residual"]}
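As a rough illustration of what a low-rank subtractive gate on a residual path could look like, the sketch below computes a sigmoid gate through a rank-r bottleneck and uses it to scale a branch output before adding it back to the residual stream. The PR summary does not spell out the exact formulation, so the gate form `sigmoid(up(down(x)) + bias)`, the names `low_rank_gate` and `inhibited_residual`, and the toy sizes are assumptions rather than the author's code.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def low_rank_gate(x, down, up, bias):
    """Per-channel gate in (0, 1) from a rank-r bottleneck (assumed form).

    down: r x d matrix, up: d x r matrix, bias: length-d vector.
    """
    hidden = matvec(down, x)       # project d -> r
    logits = matvec(up, hidden)    # project r -> d
    return [sigmoid(l + b) for l, b in zip(logits, bias)]

def inhibited_residual(x, branch_out, down, up, bias):
    """x + g * branch_out, which equals x + branch_out - (1 - g) * branch_out."""
    g = low_rank_gate(x, down, up, bias)
    return [xi + gi * bi for xi, gi, bi in zip(x, g, branch_out)]

# Toy sizes for readability: d = 4, r = 2 (the PR lists rank 22).
d, r = 4, 2
down = [[0.0] * d for _ in range(r)]
up = [[0.0] * r for _ in range(d)]
bias = [math.log(0.95 / 0.05)] * d   # gate opens at 0.95 when up/down are zero
x = [1.0, -2.0, 0.5, 3.0]
branch = [0.2, 0.4, -0.6, 1.0]
y = inhibited_residual(x, branch, down, up, bias)
```

Under this reading, the "subtractive" part is the `(1 - g) * branch_out` term: the gate can only remove some of what the attention or MLP branch would otherwise write into the residual stream.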

LeakyReLU

The base stack uses a squared LeakyReLU activation in the MLP.

parameters: {"slope":0.5}
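A minimal sketch of a squared LeakyReLU with the listed slope of 0.5 follows. It assumes the activation is a plain square of the LeakyReLU output, so the sign of negative inputs is lost; the summary does not say whether the author's variant preserves sign, and the function name `leaky_relu_sq` is hypothetical.

```python
def leaky_relu_sq(x, slope=0.5):
    """Square of LeakyReLU(x); `slope` scales negative inputs before squaring."""
    y = x if x >= 0 else slope * x
    return y * y

samples = [leaky_relu_sq(v) for v in (-2.0, 0.0, 2.0)]
```

With slope 0.5, a negative input contributes a quarter of what the same-magnitude positive input does after squaring.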

Partial RoPE

Rotary positional embedding applied to a subset of dimensions.

parameters: {"dimensions":"16/64"}
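To make "16/64" concrete, the sketch below rotates only the first 16 entries of a 64-entry head vector and passes the other 48 through unchanged. Which 16 dimensions are rotated, the adjacent-pair layout, and the base of 10000 are assumptions (they follow a common RoPE convention), and `partial_rope` is a hypothetical name.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec` in adjacent pairs."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Pair (i, i+1) rotates with frequency base^(-i / rot_dims).
        theta = position * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head = [float(i + 1) for i in range(64)]
rotated = partial_rope(head, position=5)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the untouched dimensions carry position-independent content.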

depth recurrence

Layers 3, 4, and 5 are looped in the stack (2 loops), with looping activated at fraction 0.35.

parameters: {"layers":[3,4,5],"loops":2,"activated_at_frac":0.35}
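One way to read these parameters is as a layer execution schedule, sketched below. The sketch assumes 0-based layer indices, an 11-layer stack, that `loops` counts total passes through the looped block, and that `activated_at_frac` refers to training progress; none of these is confirmed by the summary, and `layer_schedule` is a hypothetical helper.

```python
def layer_schedule(num_layers, loop_layers, loops, progress, activate_at):
    """Return the order in which layer indices run for one forward pass."""
    if progress < activate_at or not loop_layers:
        return list(range(num_layers))        # plain stack, no recurrence yet
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                # layers before the looped block
    for _ in range(loops):
        order.extend(loop_layers)             # repeat the block `loops` times
    order.extend(range(last + 1, num_layers)) # layers after the looped block
    return order

early = layer_schedule(11, [3, 4, 5], loops=2, progress=0.10, activate_at=0.35)
late = layer_schedule(11, [3, 4, 5], loops=2, progress=0.50, activate_at=0.35)
```

Under these assumptions the looped layers add effective depth (14 layer applications instead of 11) without adding parameters to the artifact.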

weight tying

Tied input embeddings and output embeddings.

parameters: null

XSA

XSA applied across all 11 layers.

parameters: {"layers":11}

SmearGate

SmearGate used in the model stack.

parameters: {"window":12}

GQA

Grouped-query attention in the base stack: 8 query heads share 4 key/value heads.

parameters: {"heads":8,"kv_heads":4}
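The head sharing implied by these numbers can be sketched as an index mapping. The contiguous grouping used here (query heads 0 and 1 share key/value head 0, and so on) is the common convention and an assumption about this stack; `kv_head_for` is a hypothetical name.

```python
def kv_head_for(query_head, num_heads=8, num_kv_heads=4):
    """Index of the key/value head that a given query head attends with."""
    group_size = num_heads // num_kv_heads   # 2 query heads per KV head here
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(8)]
```

Halving the number of key/value heads halves the size of the K and V projections relative to full multi-head attention.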

MLP

MLP hidden width expanded to 4x the model dimension in each transformer block.

parameters: {"multiplier":4}

Regularization

logit softcap

parameters: {"value":30}
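A logit softcap with value 30 is commonly implemented as a scaled tanh, as sketched below. The tanh form is an assumption, since the summary gives only the cap value, and `softcap` is a hypothetical name.

```python
import math

def softcap(logit, cap=30.0):
    """Squash a logit into (-cap, cap) while staying near-identity around 0."""
    return cap * math.tanh(logit / cap)

capped = [softcap(v) for v in (-1000.0, 0.0, 1.0, 1000.0)]
```

Small logits pass through almost unchanged, while extreme ones are bounded by the cap.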

Weight Averaging

EMA

parameters: null
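For reference, an exponential moving average of weights is typically maintained with an update like the one sketched here. No decay value is listed for this PR, so the default `decay=0.999` is purely a placeholder, and `ema_update` is a hypothetical helper.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy example with an exaggerated decay so the effect is visible.
ema = ema_update([0.0, 0.0], [2.0, 4.0], decay=0.5)
```

The averaged weights, rather than the final raw weights, would then be the ones evaluated or serialized.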

Quantization

int8

bits: 8

scope: inhibitor weights
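The PR's contribution list describes this serialization as row-int8. A minimal sketch of per-row symmetric int8 quantization follows, assuming an absmax scale per row and a [-127, 127] range; the actual scale rule and storage layout are not given in the summary, and both function names are hypothetical.

```python
def quantize_rows_int8(matrix):
    """Return (integer rows in [-127, 127], one float scale per row)."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0  # avoid divide-by-zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Reconstruct approximate float rows from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
q, s = quantize_rows_int8(weights)
restored = dequantize_rows(q, s)
```

A per-row scale lets rows with small magnitudes keep their resolution instead of sharing one scale with the largest row in the matrix.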

Test-Time Training

score-first TTT

parameters: {"phases":3}
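The "score-first" ordering can be illustrated with a toy loop in which each chunk is scored with the current parameters before the model adapts on it. This sketch swaps in a one-parameter squared-error model for the language model and does not model the three phases listed above; `score_first_ttt`, the learning rate, and the data are all hypothetical.

```python
def score_first_ttt(chunks, theta=0.0, lr=0.5):
    """Score each chunk with the current parameter, then adapt on it."""
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score: the parameter has not yet seen this chunk.
        total_loss += sum((x - theta) ** 2 for x in chunk)
        count += len(chunk)
        # 2) Adapt: one gradient step on the chunk that was just scored.
        grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
        theta -= lr * grad
    return total_loss / count, theta

loss, theta = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

The first chunk is scored at the initial parameter and only later chunks benefit from adaptation, so the reported loss never includes data the model has already trained on.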

Compression

brotli

level: null

Other

other

Biologically inspired inhibitory primitive motivated by cortical inhibitory interneurons and the fly mushroom body APL neuron.

parameters: null

Novel Contributions

Introduces inhibitory layers as a subtractive primitive for transformer residual streams
Applies low-rank sigmoid gates to both attention and MLP residual paths
Uses a 0.95 sigmoid initialization to avoid saturated, weak-gradient gates
Serializes inhibitor weights as row-int8 to fit under the 16MB artifact limit
Builds on the PR #1851 stack and combines with phased post-TTT
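The arithmetic behind the 0.95 initialization can be checked directly: the bias that makes a sigmoid gate start at 0.95 is logit(0.95) = ln(19), and the local gradient of a sigmoid with output p is p(1 - p). The comparison point of 0.999 below is a hypothetical "saturated" gate chosen for contrast, not a value from the PR.

```python
import math

def logit(p):
    """Inverse sigmoid: the pre-activation that yields output p."""
    return math.log(p / (1.0 - p))

def sigmoid_grad_at(p):
    """d sigmoid(z) / dz, expressed via the output p = sigmoid(z)."""
    return p * (1.0 - p)

bias_init = logit(0.95)              # about 2.944
grad_095 = sigmoid_grad_at(0.95)     # 0.0475
grad_0999 = sigmoid_grad_at(0.999)   # about 0.001, far weaker gradient
```

A gate starting at 0.95 is close to pass-through yet receives roughly 47 times the gradient of one starting at 0.999, which is consistent with the stated goal of avoiding saturated gates.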