PR #377
openHybrid INL + Sort-Split MoE (1.41/1.46 bpb TTT, 15.5MB, 1xH100)
by Complexity-MLView on GitHub
val_bpb
1.4072
Architecture
Hybrid Transformer
Optimizer
—
Artifact Size
15.5MB
Training Techniques
Architecture
GQA + RoPE
Classical grouped-query attention with rotary positional embeddings in early layers.
parameters: {"layers":[0,1,2,3,4]}
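A minimal numpy sketch of the RoPE half used in these early layers (illustrative only; the PR pairs this with grouped-query attention, not shown here):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq, dim).

    Standard RoPE: channels are split into pairs and each pair is
    rotated by a position- and frequency-dependent angle.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair frequencies
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each (x1, x2) pair; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because each pair is only rotated, token norms are unchanged and position 0 passes through untouched.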
INL BetaMu attention
Error-driven O(n) attention that replaces the quadratic QKV attention matrix with a causal cumsum over the error signal (x - mu).
parameters: {"layers":[5,6,7,8]}
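The PR does not show the INL BetaMu update itself; a hypothetical sketch of the cumsum idea (mu, beta, and the running-mean gating are all assumptions):

```python
import numpy as np

def inl_betamu(x, mu, beta):
    """Hypothetical O(n) attention sketch: instead of a T x T attention
    matrix, accumulate the error signal (x - mu) with a causal cumsum,
    so each position sees a running summary of everything before it.
    mu (equilibrium) and beta (gain) would be learned in practice.
    """
    err = x - mu                              # (T, D) error vs. equilibrium
    summary = np.cumsum(err, axis=0)          # causal prefix sum, O(T*D)
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return x + beta * summary / counts        # running-mean correction
```

The whole layer is a single cumsum plus elementwise ops, which is where the O(n) claim comes from.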
Sort-Split MoE
Deterministic argsort-plus-fixed-split routing across 4 experts: every expert receives an equal share of tokens, so all experts stay busy and the routing remains compatible with fullgraph compilation.
parameters: {"experts":4}
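A sketch of the assumed sort-and-split routing (router scores and function names are illustrative):

```python
import numpy as np

def sort_split_route(scores, n_experts=4):
    """Deterministic sort-and-split routing: argsort tokens by a router
    score, then split the sorted order into equal contiguous chunks,
    one per expert.  Every expert gets exactly T / n_experts tokens,
    so load is perfectly balanced and the op has fixed shapes,
    which keeps it compatible with fullgraph compilation.
    """
    order = np.argsort(scores)            # deterministic token ordering
    chunks = np.split(order, n_experts)   # equal fixed-size splits
    assign = np.empty_like(order)
    for e, idx in enumerate(chunks):
        assign[idx] = e                   # token index -> expert id
    return assign
```

Unlike top-k routing, there is no auxiliary load-balancing loss to tune: balance is exact by construction.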
ALiBi
Learned slopes per head used as positional encoding in INL layers.
parameters: null
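The bias construction, sketched in numpy (standard ALiBi fixes the slopes to a geometric series; per this PR they are learned parameters instead):

```python
import numpy as np

def alibi_bias(slopes, seq_len):
    """ALiBi-style bias: per-head slope times token distance, added to
    attention logits so far-away tokens are penalized linearly.
    slopes has shape (heads,); output is (heads, seq, seq).
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])   # |i - j| distance matrix
    return -slopes[:, None, None] * dist[None]   # 0 on the diagonal
```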
Token-routed MoE
Deterministic token_id % 4 routing across 4 experts using a mask-multiply pattern, so all experts run with fixed shapes on every batch.
parameters: {"experts":4}
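A sketch of the mask-multiply pattern (expert callables are stand-ins for the real MLPs):

```python
import numpy as np

def token_routed_moe(x, token_ids, experts):
    """Token-id routing: expert = token_id % n_experts.  Every expert
    runs on the full batch and outputs are combined by masking, so
    there are no data-dependent shapes (compile-friendly), at the cost
    of redundant compute.
    """
    n = len(experts)
    route = token_ids % n                               # (T,) expert ids
    out = np.zeros_like(x)
    for e, f in enumerate(experts):
        mask = (route == e)[:, None].astype(x.dtype)    # (T, 1) selector
        out += mask * f(x)                              # mask-multiply
    return out
```

Since token ids are roughly uniform mod 4, this routing is balanced without any learned router at all.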
PID Dynamics / INL Ultra-Lite
A learnable equilibrium mu is carried through all layers with fixed alpha/beta/gate coefficients and a clamped velocity, stabilizing hidden-state trajectories.
parameters: {"layers":9}
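One plausible reading of the update rule, as a per-layer step (coefficients and the exact PID form are assumptions, not from the PR):

```python
import numpy as np

def pid_step(h, v, mu, alpha=0.9, beta=0.1, gate=0.5, v_max=1.0):
    """Hypothetical INL Ultra-Lite step applied between layers: the
    hidden state h is pulled toward a learnable equilibrium mu via a
    velocity v with fixed alpha/beta/gate coefficients; v is clamped
    to keep trajectories stable.
    """
    v = alpha * v + beta * (mu - h)      # damped pull toward equilibrium
    v = np.clip(v, -v_max, v_max)        # clamped velocity
    return h + gate * v, v
```

With these coefficients the discrete system is a stable damped spiral, so the hidden state converges toward mu instead of drifting across depth.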
SwiGLU
Replaces relu^2 activation with SwiGLU in expert MLPs.
parameters: {"experts":4}
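The replacement activation, sketched for one expert MLP (weight names are illustrative):

```python
import numpy as np

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU expert MLP (replacing relu(x)^2): the up projection is
    gated elementwise by SiLU of a parallel gate projection, then
    projected back down.
    """
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))       # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down
```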
LR Schedule
cosine warm restarts (SGDR)
parameters: {"cycle_lengths":[5000,10000,20000],"peak_lr_decay":0.7}
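The listed parameters imply three cosine cycles of growing length with the peak LR decayed 0.7x at each restart; a sketch (base_lr/min_lr are illustrative values, not from the PR):

```python
import math

def sgdr_lr(step, base_lr=1e-3, cycle_lengths=(5000, 10000, 20000),
            peak_lr_decay=0.7, min_lr=0.0):
    """Cosine warm restarts (SGDR): decay from the cycle's peak to
    min_lr along a half-cosine, restarting at each cycle boundary
    with the peak multiplied by peak_lr_decay."""
    start = 0
    for i, length in enumerate(cycle_lengths):
        if step < start + length:
            peak = base_lr * peak_lr_decay ** i
            t = (step - start) / length          # progress within cycle
            return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t))
        start += length
    return min_lr  # past the final cycle
```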
Weight Averaging
SWA
parameters: {"checkpoints":23}
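SWA here is a plain parameter-wise mean over the 23 saved checkpoints; sketched over dicts of lists for illustration:

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: average each parameter across all
    checkpoints.  checkpoints is a list of {name: [weights]} dicts
    with identical shapes."""
    n = len(checkpoints)
    avg = {k: [0.0] * len(v) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            for i, w in enumerate(v):
                avg[k][i] += w / n
    return avg
```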
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
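A sketch of the artifact pipeline these two sections describe: symmetric per-tensor int8 quantization of all weights, then zlib on the raw bytes (scale handling is simplified; the PR's exact scheme is not shown):

```python
import zlib
import numpy as np

def pack(weights):
    """Quantize a float tensor to int8 with a symmetric per-tensor
    scale, then zlib-compress the bytes."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def unpack(blob, scale, shape):
    """Invert pack(): decompress, reinterpret as int8, rescale."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

Round-trip error is bounded by half the quantization step, and the int8 + zlib combination is what gets the artifact down to the 15.5MB range.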
Novel Contributions
- Hybrid architecture combining classical GQA attention with INL error-driven O(n) attention
- Sort-and-split MoE routing with deterministic argsort + fixed split
- Token-routed MoE with perfectly balanced 4-expert routing
- PID-style dynamics with a learnable equilibrium mu traversing all layers
- ALiBi positional encoding in INL layers
- Cosine warm restarts learning-rate schedule
- SWA checkpoint averaging