Identifying influential nodes in complex contagion mechanism

Song, Jiahui; Wang, Gaoxia

doi:10.3389/fphy.2023.1046077

ORIGINAL RESEARCH article

Front. Phys., 26 June 2023

Sec. Complex Physical Systems

Volume 11 - 2023 | https://doi.org/10.3389/fphy.2023.1046077


Jiahui Song¹

Gaoxia Wang¹,²*

¹College of Science, China Three Gorges University, Yichang, China
²Three Gorges Mathematical Research Center, China Three Gorges University, Yichang, China

Identifying influential nodes in complex networks is one of the most important and challenging problems in network science: it helps to optimize network structure, control the spread of epidemics and accelerate the spread of information. From the perspective of propagation, the node with the strongest propagation capacity in a complex network is regarded as the most influential node. In recent years, identifying key nodes in complex networks has received increasing attention. However, it remains a challenge to design a metric that has low computational complexity yet accurately identifies important nodes. Many current centrality metrics used to evaluate node influence cannot balance high accuracy against low time complexity. Local centrality metrics suffer from accuracy problems, while global metrics have higher time complexity, which makes them inefficient for large-scale networks. In contrast, semi-local metrics achieve higher accuracy than local metrics at a lower time cost than global metrics. In this paper, we propose a new semi-local centrality measure for identifying influential nodes under complex contagion mechanisms. It uses the higher-order structure within the first- and second-order neighborhoods of a node to define node importance with near-linear time complexity, so it can be applied to large-scale networks. To verify the accuracy of the proposed metric, we simulated the disease propagation process in four real and two artificial networks using the SI model under complex propagation. The simulation results show that the proposed method identifies the nodes with the strongest propagation ability more effectively and accurately than other current node importance metrics.

1 Introduction

We are surrounded by a variety of complex systems, and most of those we come across can be abstracted into a complex network model with a certain topology. These networks contain a large number of nodes, which are interconnected by relationships of varying strength. Complex network models are widely used in research across various fields. The identification and ranking of influential nodes has always been one of the most fundamental problems in modern network science, as it may facilitate our understanding of the structure, function and characteristics of networks [1, 2]. In recent years, efforts have been made to identify influential nodes in complex networks. So far, researchers have proposed a variety of metrics, including traditional measures such as degree centrality [3], betweenness centrality [4], closeness centrality [5] and K-core centrality [6]. Depending on the node location and topology under consideration, current node importance metrics are divided into three main categories: local, semi-local and global measures [7]. Local measures are relatively straightforward and have low time complexity because they only consider the nodes in the first-order neighborhood. Global measures, by contrast, use the overall information of the network to determine the importance of a node, which gives them higher accuracy but also higher time complexity. Considering the low accuracy of local measures and the high time complexity of global measures, semi-local centrality measures have been proposed; they use the information in the second-order neighborhood of a node to improve accuracy while maintaining a relatively low time complexity.

In terms of local measures, Xu et al. [8] proposed a local clustering h-index (LCH) centrality measure for identifying and ranking influential nodes in complex networks. It takes into account the topological connections between neighboring nodes as well as the number and quality of neighboring nodes.
With this measure, nodes can be distinguished more effectively and ranked more accurately. Salavati et al. [9] proposed a novel local node ranking algorithm that uses local structure to improve closeness centrality, aiming to reduce computational complexity. The method is able to find the best set of seed nodes with high propagation capacity, and its low time complexity makes it suitable for large-scale networks. Ruan et al. [10] considered the links of a node and the connectivity within its neighborhood, and proposed an efficient method based on semi-local features to identify key nodes that play an important role in maintaining network connectivity. Zhao et al. [11] proposed the NL metric based on the importance of neighborhoods and links; it considers the influence of second-order neighborhoods on a node and uses the connectivity and irreplaceability of links to distinguish the topological positions of nodes. Hu et al. [12] proposed a new ranking method that uses structural holes to identify influential nodes (E-Burt), which takes full account of the total connection strength of nodes within their local area and the number of connected links. Zhu et al. [13] proposed a local h-index (LH-index) centrality method to identify and rank influential nodes in the network. The LH-index method considers both the h-index value of the node itself and those of its neighbors.

In terms of global measures, Enduri et al. [14] used the relative change in the average shortest path of the local network when nodes are removed to define node importance. The method identifies the initial seed nodes and effectively measures the spread of information within the network. From an information theory point of view, Yang et al. [15] proposed the EnRenew algorithm, which aims to identify a set of influential nodes via information entropy. First, the information entropy of each node is calculated as its initial spreading ability.
Then, the node with the largest information entropy is selected and the spreading ability of its l-length reachable nodes is renewed by an attenuation factor; this process is repeated until a specified number of influential nodes have been selected. The method achieved impressive results on the SIR simulation model. Wen et al. [16] proposed a method for identifying influencers in complex networks via the local information dimensionality. The proposed method considers the local structural properties around the central node: a node is more influential when its local information dimensionality is higher. Compared with four other importance measures on six real networks, the simulation results show the superiority of this measure. From the perspective of artificial intelligence, Fan et al. [17] proposed a deep reinforcement learning framework to find the most important group of spreaders in the network. The proposed framework opens up a new direction for using deep learning techniques to understand how complex networks are organized, which allows us to design more powerful networks to enhance or inhibit propagation. Considering the loop structure in the network, Fan et al. [18] defined two loop-based node characteristics, namely, cycle number and cycle ratio, which can be used to measure the importance of nodes in terms of network connectivity. It was verified that nodes with higher cycle ratios are more important for network connectivity, and that the number of cycles quantifies the impact of diffusion based on the cycle structure better than the general clustering-based node centrality. Lin et al. [19] defined the cycle number matrix, a matrix containing cycle information in the network, and a metric to quantify the importance of nodes, i.e., the cycle ratio. Zhang et al. [20] introduced the node cycle ratio to determine how close the network is to a tree-like network. These works [18–20] innovatively introduce the loop structure, which had previously been neglected in the analysis and modeling of networks, into node importance measures, opening a new perspective for identifying nodes with high propagation capacity. It has been shown that loops, as a mesoscopic structure in the network, play an important role in the structure and function of the network. In terms of the transmission dynamics of the network, Chen et al. [21] proposed a new method for the dynamic ordering of nodes that uses a probabilistic model to measure node ranking. This simple and effective method opens up new ideas for the identification of important nodes in network propagation dynamics.

In terms of semi-local measures, a new semi-local centrality measure was proposed by Kamal et al. [22]. It uses the positive effect of the secondary neighborhood clustering coefficient and the negative effect of the node clustering coefficient to define the importance of a node. That is, a node with a high clustering coefficient may be a “hub” node or a “bridge” node. If the sum of the clustering coefficients of a node’s secondary neighbors is high, then the node’s secondary neighbors are located in the dense part of the network. If a node has low clustering, high degree and densely connected secondary neighbors, it is identified as a structural hole. Lei et al. [23] introduced a centrality metric (HIC) to identify influential nodes in complex networks. It combines node neighborhood, location and topology features to identify influential nodes. In terms of identifying multiple influential nodes, Mishra et al. [24] identified a set of influential nodes in undirected and unweighted networks using only the local topology of the network, in the absence of global information about the network.

In this paper, we propose a centrality measure based on the higher-order structure within the second-order neighborhood of a node.
By taking into account the fully connected quadrilateral and diagonal quadrilateral in the first-order neighborhood and the loop and diagonal quadrilateral in the second-order neighborhood, the measure covers a more diverse set of higher-order structures. We extend the traditional clustering coefficient to a higher-order form. We compare this metric with three representative measures of node propagation ability: the node cycle rate (CR) [19], the measurement of positive and negative clustering coefficient based on node neighborhood (FI) [22] and the measurement based on structural hole features (SHF) [10]. We used the SI model to simulate the virus propagation process on six networks under the condition of complex contagion. The numerical simulation results show that the proposed node importance metric can identify the nodes with the strongest propagation ability more quickly and accurately.

The structure of this paper is as follows. The first part gives a brief introduction to the three metrics used for comparison and to the basic concepts related to this paper, focusing on the higher-order structure of networks and defining the concept of high-order node degree; it also illustrates how the higher-order structure affects the propagation dynamics on the network. The second part introduces a new measure for evaluating the importance of nodes, which extends the traditional node-based clustering coefficient to a higher-order form within the second-order neighborhood of nodes; an algorithm for finding the higher-order structures within the second-order neighborhood of nodes is also proposed in the Supplementary Materials. The third part describes the models and data sets used. In the fourth part, simulations are carried out on six networks using the SI model under a complex propagation mechanism, and the superiority of the proposed metric is demonstrated. Finally, we conclude with a summary of the work done and some prospects for future research.

2 Preliminaries

2.1 Basic concepts of complex networks

A complex network can be represented by an undirected connected graph $G = (V, E)$ with $n$ nodes and $m$ edges, where $V$ is the set of nodes and $E$ is the set of links. The adjacency matrix $A = [a_{ij}]$ of $G$ has $n$ rows and $n$ columns, and its elements $a_{ij}$ are defined as follows: (i) $a_{ij} = 1$ if the $i$th node is connected to the $j$th node; (ii) $a_{ij} = 0$ if the $i$th node is not connected to the $j$th node. Degree is the most basic concept in complex networks. The degree $k_{i}$ of a node $i$ is the number of nodes directly connected to it. In an undirected and unweighted network, the distance between node $i$ and each node directly connected to it is 1, so the first-order neighborhood of node $i$ is the set of nodes at distance 1 from $i$. The second-order neighborhood of node $i$ is the set of nodes at distance 2 from $i$.
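The definitions above can be illustrated with a minimal Python sketch. The helper names (`build_adjacency`, `neighborhood`) and the five-node example graph are hypothetical and are not taken from the paper; the sketch simply stores an undirected, unweighted graph as adjacency sets and uses breadth-first search to collect the nodes at distance exactly 1 or 2.

```python
from collections import deque


def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def neighborhood(adj, source, order):
    """Nodes whose shortest-path distance from `source` is exactly `order`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == order:
            continue  # nodes beyond the requested distance are not needed
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v, d in dist.items() if d == order}


# Hypothetical example graph (not from the paper).
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
adj = build_adjacency(edges)

degree_1 = len(adj[1])           # degree k_1
first = neighborhood(adj, 1, 1)  # first-order neighborhood of node 1
second = neighborhood(adj, 1, 2) # second-order neighborhood of node 1
```

Adjacency sets are used here instead of a dense matrix because membership tests stay cheap on sparse networks; the matrix $A$ in the text carries the same information.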

2.2 Importance measure of the nodes

Many indicators of node importance already exist, so only three of the most representative indicators of node propagation ability are selected for comparison.

2.2.1 Node cycle rate (CR)

Loops are another widely observed structure that plays an important role in both structural organization and functional implementation. A loop can be simply defined as a closed path with the same start and end nodes. The size of a loop is equal to the number of links it contains. The loop containing node i with the smallest size is defined as the shortest loop associated with node i, and the corresponding size is called the perimeter of node i. The cycle rate of a node is defined as

$$CR(i) = \sum_{j} \frac{c_{ii}}{c_{ij}}, \quad c_{ii}, c_{ij} > 0 \tag{1}$$

where $c_{ij}$ is the number of loops in the network that pass through both nodes i and j, and $c_{ii}$ is the number of loops in the network that contain node i. Both are available from the loop matrix of the network.

2.2.2 Measurement of positive and negative clustering coefficient based on node neighborhood (FI)

This metric, proposed by Kamal et al. [22], is a semi-local centrality metric used to identify influential propagators in complex networks. A node is an important “connector” between dense parts of the network (modules), i.e., a structural hole (influential propagator), if it has low clustering, high degree and a large sum of clustering coefficients of the nodes in its second-order neighborhood. The formula is defined as

$$FI(i) = k_{i} \times \frac{1}{c(i) + \frac{1}{k_{i}}} + \sum_{j \in V^{(2)}(i)} c(j) \tag{2}$$

where $k_{i}$ is the degree of node i, $c(i)$ is the clustering coefficient of node i, $V^{(2)}(i)$ is the set of nodes in the second-order neighborhood of node i, and $c(j)$ is the clustering coefficient of the second-order neighbor j.
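As an illustration of Eq. 2, the following Python sketch evaluates FI on a small graph. The function names and the seven-edge example graph are hypothetical, and the sketch assumes $k_{i} \geq 1$ so that $1 / k_{i}$ is defined; it is a direct reading of the formula rather than the reference implementation of [22].

```python
def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering(adj, i):
    """Traditional clustering coefficient c(i); 0 when the degree is below 2."""
    nbrs = sorted(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


def second_order_neighbors(adj, i):
    """Nodes at distance exactly 2 from node i."""
    reachable = set()
    for v in adj[i]:
        reachable |= adj[v]
    return reachable - adj[i] - {i}


def fi_score(adj, i):
    """FI(i) = k_i * 1 / (c(i) + 1/k_i) + sum of c(j) over second-order neighbors."""
    k = len(adj[i])  # assumption: k >= 1
    hole_term = k * (1.0 / (clustering(adj, i) + 1.0 / k))
    dense_term = sum(clustering(adj, j) for j in second_order_neighbors(adj, i))
    return hole_term + dense_term


# Hypothetical example: two triangles (1-2-3 and 4-5-6) joined by the link 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = build_adjacency(edges)
fi_1 = fi_score(adj, 1)  # node inside a triangle
fi_3 = fi_score(adj, 3)  # node bridging the two triangles
```

In this toy graph the bridging node 3 scores higher than node 1, which matches the intuition in the text that low clustering combined with densely clustered second-order neighbors marks a structural hole.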

2.2.3 Measurement based on structural hole features (SHF)

A node is more important to the network when it is characterized by more structural holes. Based on this, Ruan et al. [10] proposed a node importance ranking algorithm (SHF) that incorporates structural hole features to identify the network nodes that play an important role in maintaining network connectivity. The formula is defined as

$$SHF(i) = \sum_{j \in V(i)} {\left(\frac{1}{k_{i}} + \frac{1}{k_{i} k_{j}}\right)}^{2} \times \left(1 + |V(i) \cap V(j)|\right) \tag{3}$$

where $k_{i}$ and $k_{j}$ denote the degrees of node i and node j, respectively, and $V(i)$ and $V(j)$ are the sets of first-order neighbors of node i and node j, respectively.
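Eq. 3 can be evaluated term by term, as in the Python sketch below. The function names and the example graph are hypothetical; the code follows the formula exactly as written above and is not claimed to reproduce every detail of the algorithm in [10].

```python
def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shf_score(adj, i):
    """SHF(i) = sum over neighbors j of (1/k_i + 1/(k_i k_j))^2 * (1 + |V(i) & V(j)|)."""
    k_i = len(adj[i])
    total = 0.0
    for j in adj[i]:
        k_j = len(adj[j])
        weight = (1.0 / k_i + 1.0 / (k_i * k_j)) ** 2
        common = len(adj[i] & adj[j])  # shared first-order neighbors of i and j
        total += weight * (1 + common)
    return total


# Hypothetical example: two triangles (1-2-3 and 4-5-6) joined by the link 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = build_adjacency(edges)
shf_3 = shf_score(adj, 3)
```

For node 3, the neighbors 1 and 2 each contribute $(1/3 + 1/6)^{2} \times 2 = 0.5$ and neighbor 4 contributes $(1/3 + 1/9)^{2} \times 1 = 16/81$, so the sketch returns $97/81$.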

2.3 High-order structure and high-order node degree

2.3.1 Higher-order structure in networks

Networks are inherently limited to describing pairwise interactions, whereas real-world systems are often characterized by higher-order interactions involving three or more units. Therefore, higher-order structures, such as hypergraphs and simplicial complexes, are better tools for depicting the real structure of many social, biological and man-made systems [25]. A triangle is the smallest higher-order structure: a third-order (three-node) connected subgraph formed by three mutually connected nodes. It is the basic unit of association and the basic topology of the network. In size, a higher-order structure lies between a single node and a large dense community. It is a fully or partially connected subgraph consisting of a few nodes that form a closed loop; such subgraphs occur more frequently in real networks than in the corresponding random networks, and their number increases with network size and with the connection density between nodes. Examples are shown in Figure 1.

FIGURE 1

FIGURE 1. Higher-order structures of third, fourth and fifth order are shown in (i–iii), respectively, where $G_{i}^{k}$ denotes the ith k-order structure. It can be seen that the variety of higher-order structures grows exponentially as the order increases. In (i), there exists only one higher-order structure, $G_{1}^{3}$. In (ii, iii), all structures are partially connected except for $G_{3}^{4}$ and $G_{9}^{5}$, which are fully connected.

2.3.2 High-order node degree

The traditional node clustering coefficient measures how many triangles in the first-order neighborhood have the node as a vertex, i.e., how densely the neighboring nodes are connected to each other in the first-order neighborhood. The formula is:

$$C_{i} = \frac{2 L_{i}}{k_{i} (k_{i} - 1)} \tag{4}$$

where $L_{i}$ is the actual number of links among the first-order neighbors of node i, and $k_{i} (k_{i} - 1) / 2$ is the maximum possible number of such links. The traditional node clustering coefficient only takes into account the connections among neighbors in the first-order neighborhood. As shown in Figure 2, $k_{A}^{∆} = 6$, where $k_{A}^{∆}$ is the number of triangles with A as a vertex. This local metric uses only information from the first-order neighbors to determine the importance of the nodes. Inspired by this, we enrich the variety of higher-order structures and extend the range to second-order neighborhoods to determine the importance of nodes. High-order node degree is defined as the sum of the numbers of triangles, diagonal quadrilaterals and fully connected quadrilaterals in the first-order neighborhood of a node and the numbers of diagonal quadrilaterals and loops in its second-order neighborhood; it is denoted by $k^{Φ}$. As shown in Figure 2, $k_{A}^{Φ} = 10$, where $k_{A}^{Φ}$ counts the higher-order structures in the first- and second-order neighborhoods that involve node A. This is a semi-local measure: the analysis range is increased to include higher-order structures in second-order neighborhoods, extending from triangles to higher-order quadrilaterals. To determine the importance of a node, node information from second-order neighborhoods is used in addition to node information from first-order neighborhoods in order to improve accuracy. At the same time, this does not significantly increase the time complexity of the computation.

FIGURE 2

FIGURE 2. A toy network. In the first-order neighborhood of node A (red), the nodes located on the same higher-order structures as node A are B, C, D, E, F, H and J (green). The second-order neighbor nodes located on the same higher-order structures as node A are G and I (black). The triangles in the first-order neighborhood of node A are ABC, ABD, ACD, ADE, AEF and AHJ; the diagonal quadrilateral is ADEF and the fully connected quadrilateral is ABCD. The diagonal quadrilateral in the second-order neighborhood of node A is AHIJ, and the empty quadrilateral (loop) is AFGH. In addition, the yellow links represent the internal links of the small community formed by the four types of higher-order structures in the first- and second-order neighborhoods of node A, and the blue links represent the links from the small community to external nodes.
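The triangle count $k_{A}^{∆} = 6$ can be checked with a short Python sketch. The edge list below is reconstructed only from the structures named in the Figure 2 caption (it is an assumption, not the full toy network: the blue external links and any neighbor of A that lies on no higher-order structure are omitted), and the helper names are hypothetical. Because of the omitted links, the clustering coefficient computed here applies to the reconstructed subgraph only.

```python
from itertools import combinations

# Edges implied by the structures listed in the caption (assumed reconstruction):
# triangles ABC, ABD, ACD, ADE, AEF, AHJ; quadrilaterals ABCD, ADEF, AHIJ; loop AFGH.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"), ("D", "E"), ("A", "F"), ("E", "F"),
    ("A", "H"), ("A", "J"), ("H", "J"), ("H", "I"), ("I", "J"),
    ("F", "G"), ("G", "H"),
]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def triangles_through(adj, i):
    """All triangles that have node i as a vertex."""
    return [(i, u, v) for u, v in combinations(sorted(adj[i]), 2) if v in adj[u]]


def clustering(adj, i):
    """Eq. 4: C_i = 2 L_i / (k_i (k_i - 1)), with L_i the links among neighbors."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    return 2.0 * len(triangles_through(adj, i)) / (k * (k - 1))


tri_a = triangles_through(adj, "A")
c_a = clustering(adj, "A")  # valid for this reconstructed subgraph only
second_order_a = set().union(*(adj[v] for v in adj["A"])) - adj["A"] - {"A"}
```

The number of links among the neighbors of a node equals the number of triangles through it, which is why `clustering` reuses `triangles_through`.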

2.3.3 Effect of the higher-order structure on the propagation dynamics

Several studies have shown that the presence of higher-order interactions may severely affect the dynamics of network systems, from diffusion [26, 27] and synchronization [28, 29] to social [30–33] and evolutionary processes [34], possibly leading to the emergence of abrupt (explosive) transitions between states. Under the different mechanisms of simple and complex contagion, the presence of higher-order structures can make the propagation process on the network significantly different [35–38]. Diffusion is often described as either “simple contagion” or “complex contagion”. Simple contagion is a process in which a node is easily infected through a single contact with an infected neighbor. Complex contagion is a collaborative infection process in which nodes are usually exposed to multiple infected neighbors multiple times before changing state. Under a simple transmission mechanism, infected nodes transmit the virus through their links at a fixed infection rate per unit time, and susceptible nodes become infected at a rate that is linearly related to the number of infected neighbors. Under complex contagion, a susceptible node is co-infected by more than one infected neighbor. In fact, there exists a co-infection threshold, and a susceptible node changes state once this threshold is exceeded. As the infection rate increases, the clustered network structure enhances the propagation of the co-infection. This is the reinforcement mechanism of the higher-order structure. Figure 3 shows the different infection pathways of a susceptible node i. Under the simple infection mechanism, node i is in contact with one (A, C) or more (B, D) infected nodes through links and is infected at each time step at the rate $λ$ through these links. In E, nodes i and j are infected through the triangle at the infection rate $λ_{1}^{*}$. In G, node i is infected through the fully connected quadrilateral at the infection rate $λ_{3}^{*}$. We would like to highlight F, which has been ignored in previous higher-order contagion models. Because node e is not directly connected to i, cases E and F have been treated as the same case. In fact, these two cases should be discussed separately. Although node e is not directly connected to i, e exerts its influence directly on the triangle and thereby acts indirectly on i. This corresponds to real-life scenarios in which viruses superimpose strongly, so the contagion in F is stronger than in E. The infection rates for all cases are roughly ordered as $λ_{3}^{*} > λ_{2}^{*} > λ_{1}^{*} > λ$.

FIGURE 3

FIGURE 3. Simple contagion and complex contagion. (A–D) show simple contagion and (E–G) show complex contagion. Because of the higher-order interaction, the rate of infection varies with the mode of contagion. $λ$ represents the transmission rate under the simple transmission mechanism, $λ_{1}^{*}$ represents the transmission rate under the complex transmission mechanism in the form of links, $λ_{2}^{*}$ represents the transmission rate under the complex transmission mechanism in the form of triangles, and $λ_{3}^{*}$ represents the transmission rate under the complex transmission mechanism in the form of triangles and links.

3 The proposed method

The node importance metric proposed in this paper is a generalization of the traditional clustering coefficient of nodes. The scope is extended from the first-order neighborhood of a node to its second-order neighborhood, and the types of higher-order structures are extended from triangles alone to loops and quadrilaterals. In terms of the topology of the network, the influence of the loop structure on propagation is also taken into account. In terms of the propagation mechanism, the influence of fully connected and diagonal quadrilaterals on the propagation of viruses with mutual superposition is considered. The metric is expressed as follows:

$$CK(i) = \frac{k_{i}^{Φ}}{N_{α} + N_{β}} \times \frac{L^{ext}}{L^{int} + L^{ext}} \tag{5}$$

where $N_{α} = k_{i} (k_{i} - 1) / 2 + 2 k_{i}! / (3! (k_{i} - 3)!)$ is the maximum number of triangles, diagonal quadrilaterals and fully connected quadrilaterals in the first-order neighborhood of node i, and $N_{β} = k_{i} (k_{i} - 1) \times k_{i}^{2}$ is the sum of the maximum numbers of diagonal quadrilaterals and loops in the second-order neighborhood of node i. Here $k_{i}$ and $k_{i}^{2}$ are the numbers of nodes in the first- and second-order neighborhoods, respectively. $L^{int}$ and $L^{ext}$ are, respectively, the inner links (links between nodes located on higher-order structures within the first- and second-order neighborhoods of a node, such as the yellow edges in Figure 2) and the outer links (links with only one end node located on higher-order structures within the first- and second-order neighborhoods, such as the blue edges in Figure 2). In fact, we can consider the region containing the higher-order structures within the second-order neighborhood of a node as a small community. $\frac{L^{ext}}{L^{int} + L^{ext}}$ is used to portray the difference in link density inside and outside the community. Under a complex propagation mechanism, the community acts locally as an “accelerator” of propagation, and a high density of external links ensures that the spread becomes global. The metric satisfies $0 \leq CK(i) \leq 1$. Note that this formula is only applicable to the case of $k_{i} \geq 3$. In addition, the proposed measure is only applicable to undirected and unweighted networks.

According to the proposed node importance measure, we take node A in Figure 2 as an example to calculate its CK value. With $k_{A}^{Φ} = 10$, $N_{α} = 8 (8 - 1) / 2 + 2 \times 8! / (3! (8 - 3)!) = 140$ and $N_{β} = 8 (8 - 1) \times 17 = 952$, together with $L^{ext} = 17$ and $L^{int} = 18$, the CK of node A in Figure 2 is $CK(A) = (10 / 1092) \times (17 / 35) \approx 0.004$. Next, we consider the computational complexity of the proposed importance measure. For a network with n nodes and average degree $\langle k \rangle$, the computational complexity of finding the first-order neighbors of each node is $O(\langle k \rangle)$ and that of finding the second-order neighbors of each node is $O({\langle k \rangle}^{2})$, so the total computational complexity is $O(n \langle k \rangle (\langle k \rangle + 1))$. Therefore, for sparse networks in which $\langle k \rangle$ is much smaller than n, the complexity is nearly linear in n, which makes the proposed metric scalable to large-scale networks. For $k_{i}^{Φ}$, it is not easy to find the high-order node degree of node i in large dense networks. In this paper, we also propose a method to quickly find all the above higher-order structures using only the adjacency matrix of the nodes within the second-order neighborhood containing node i (see Supplementary Material).
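The worked example for node A can be reproduced with a few lines of Python. The function name `ck_value` and its parameter names are hypothetical; the inputs (higher-order degree, neighborhood sizes and link counts) are taken as given from the text rather than computed from a graph, since the search for higher-order structures is described in the Supplementary Material.

```python
from math import comb


def ck_value(k_phi, k1, k2, l_int, l_ext):
    """Eq. 5 for one node.

    k_phi : high-order node degree k_i^Phi
    k1    : number of first-order neighbors k_i (must be at least 3)
    k2    : number of second-order neighbors k_i^2
    l_int : number of inner links, l_ext : number of outer links
    """
    if k1 < 3:
        raise ValueError("Eq. 5 is only defined for k_i >= 3")
    # N_alpha = k(k-1)/2 + 2 * k! / (3! (k-3)!) = C(k, 2) + 2 * C(k, 3)
    n_alpha = comb(k1, 2) + 2 * comb(k1, 3)
    # N_beta = k(k-1) * k^2
    n_beta = k1 * (k1 - 1) * k2
    return (k_phi / (n_alpha + n_beta)) * (l_ext / (l_int + l_ext))


# Values quoted in the text for node A of Figure 2.
n_alpha_a = comb(8, 2) + 2 * comb(8, 3)
n_beta_a = 8 * 7 * 17
ck_a = ck_value(k_phi=10, k1=8, k2=17, l_int=18, l_ext=17)
```

Writing $N_{α}$ with binomial coefficients avoids evaluating large factorials and makes the two terms easy to read as counts of neighbor pairs and neighbor triples.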

4 Experimental setup

4.1 Data set

In order to test the effectiveness of the proposed method, we apply it to real-world networks and artificial networks. The real-world networks include: (i) Electricity: a power grid network; (ii) Network science: the coauthorship network in network science, which consists of 1,589 nodes and from which we chose the largest dense community of 400 nodes; (iii) Email: the email network of the Universitat Rovira i Virgili; (iv) Yeast: a protein-protein interaction network. The two artificial networks are the Watts-Strogatz (WS) small-world network [39] and the Barabási-Albert (BA) scale-free network [40]. First, a small-world network (WS) is constructed with $n$ = 1,000 nodes and $m$ = 1,475 links, in which nodes are connected with a probability of $p$ = 0.01. The average degree of the network is $\langle k \rangle \approx 3$. The scale-free network (BA) takes N = 1,000, $m_{0}$ = 5 and $m$ = 4 (starting from an initial network with $m_{0}$ nodes, each newly introduced node connects to $m$ existing nodes, $m$ ≤ $m_{0}$), which generates a scale-free network with N = 1,000 nodes and $m t$ = 3,980 links ($t$ steps are required to generate the network). Table 1 shows the statistical characteristics of the six networks. In this paper, all of these real-world and artificial networks are regarded as undirected and unweighted.

TABLE 1

TABLE 1. Some statistical properties of the six networks. The topological features include: number of nodes ($|V|$), number of edges ($|E|$), average degree ($\langle k \rangle$), maximum degree ($k_{\max}$), average clustering coefficient ($\langle C \rangle$) and propagation threshold ($λ_{th}^{*} = \langle k \rangle / \langle k^{2} \rangle$).

4.2 Dynamic transmission model

To compare the accuracy of the proposed metric in identifying node propagation capabilities, the susceptible-infected (SI) model is used to simulate propagation and evolution in real networks. The SI model contains only two types of nodes, susceptible (S) and infected (I). Once a node is infected, it never recovers. Assume that the total size of the network in which the disease spreads is N, S(t) is the number of susceptible nodes at time t and I(t) is the number of infected nodes at time t, with S(t) + I(t) = N. At t = 0, all nodes are susceptible, that is, S(0) = N and I(0) = 0. At t = 1, we select a node in the network as the initial source of infection. It is assumed that each node has $\langle k \rangle$ contactable neighbor nodes per unit time $∆ t$ (here approximated by the average degree of the network), that $\langle k_{1} \rangle$ of these $\langle k \rangle$ neighbor nodes are susceptible, and that the initial transmission rate of the disease from infected nodes to susceptible nodes is $λ$. At each time step $∆ t$, an infected node can simultaneously spread the virus to more than one susceptible node. According to the assumption of uniform mixing (each node contacts infected nodes with equal probability), the probability that an infected node encounters a susceptible node is $\frac{S(t)}{N}$, so an infected node contacts $\frac{\langle k_{1} \rangle}{\langle k \rangle} ∙ \frac{S(t)}{N}$ susceptible nodes per unit time. Since there are I(t) infected nodes at time t spreading the disease with the transmission rate $λ$, the average number of newly infected nodes dI(t) within the differential time dt is $dI(t) = λ ∙ I(t) ∙ \frac{\langle k_{1} \rangle}{\langle k \rangle} ∙ \frac{S(t)}{N} dt$. In other words, the rate of change of I(t) is $\frac{dI(t)}{dt} = λ \frac{\langle k_{1} \rangle}{\langle k \rangle} ∙ \frac{I(t) ∙ S(t)}{N}$.
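The mean-field rate equation at the end of the paragraph above can be integrated numerically to see the characteristic S-shaped growth of I(t). The sketch below uses a plain forward-Euler scheme; the function name, the step size, the parameter values and the choice to start from a single infected node are all illustrative assumptions, and the ratio $\langle k_{1} \rangle / \langle k \rangle$ is treated as a constant for simplicity.

```python
def integrate_si(n_nodes, lam, susceptible_ratio, i0=1.0, dt=0.01, t_max=50.0):
    """Forward-Euler integration of dI/dt = lam * ratio * I * S / N with S = N - I.

    susceptible_ratio stands for <k_1>/<k> and is held constant (an assumption).
    Returns the list of I(t) values sampled every dt.
    """
    steps = int(round(t_max / dt))
    infected = [i0]
    for _ in range(steps):
        i_t = infected[-1]
        s_t = n_nodes - i_t
        d_i = lam * susceptible_ratio * i_t * s_t / n_nodes * dt
        infected.append(min(float(n_nodes), i_t + d_i))
    return infected


# Illustrative parameters (not from the paper).
N = 1000
curve = integrate_si(n_nodes=N, lam=0.5, susceptible_ratio=1.0)
final_fraction = curve[-1] / N
```

Because the SI model has no recovery, I(t) never decreases and saturates at N; the transmission rate only controls how quickly saturation is reached.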

In this paper, we use the SI model under a complex propagation mechanism. It differs from the SI model under the simple transmission mechanism because we consider the reinforcement mechanism of the higher-order structure in the transmission process and allow diseases to be superimposed and to co-infect. In our model, as soon as the initial infected node infects a node on the same link, the complex propagation mechanism is triggered and the complex propagation process begins in the form of a link. At the same time, the propagation rate does not remain constant, but increases gradually as more and more higher-order structures participate. In this case, the transmission rate is $λ^{*} = c ∙ λ$, where c is a perturbation parameter that increases with the number of higher-order structures participating in the propagation process: the more higher-order structures are involved, the faster c increases, and vice versa. As time goes on, more and more higher-order structures are involved in the propagation process, and the virus becomes more contagious and spreads faster and more widely. Figure 4 shows the basic propagation process under the complex propagation model, where red nodes are infected nodes and blue nodes are susceptible nodes. At the initial time $t_{0}$, a node is selected as the initial source of infection. As transmission progresses, more and more higher-order structures become involved, so that ${∆ t}_{1} > {∆ t}_{2}$ and $λ_{1}^{*} < λ_{2}^{*}$. In addition, whether the propagation process breaks out depends on the propagation threshold [8]. If the transmission rate is much smaller than the threshold, the transmission process will be limited to a small area and will end prematurely. Conversely, if the transmission rate is much greater than the threshold, the transmission process will spread almost instantaneously over a large part of the network.
Therefore, it is reasonable to choose a transmission rate near the neighborhood of the propagation threshold to evaluate the validity of a centrality metric. That is, a transmission rate that is too small will result in a diffusion process of limited size and affect the accuracy of the observation. In contrast, too large a value of the transmission rate will lead to a large-scale diffusion of the network, the propagation ability of individual nodes cannot be identified. Notice here that, unlike simple propagation, the propagation threshold is based on a complex propagation mechanism.
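The text does not give a functional form for the perturbation parameter c, so the following is only an illustrative sketch of how a rate $λ^{*}$ = c $∙ λ$ that grows with the number of participating higher-order structures could be simulated. It assumes, purely for illustration, that the higher-order structures are triangles, that a triangle "participates" once it contains at least one infected node, and that c = 1 + `alpha` × (number of participating triangles). The names `triangles`, `complex_si`, and `alpha` are hypothetical and do not come from the paper.

```python
import random
from itertools import combinations


def triangles(adj):
    """Return the set of triangles (as frozensets) of an undirected graph.

    adj maps each node to the set of its neighbors.
    """
    found = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                found.add(frozenset((u, v, w)))
    return found


def complex_si(adj, source, lam, alpha, steps, rng):
    """Discrete-time SI spread with a rate that grows with active triangles.

    ASSUMPTION: c = 1 + alpha * (triangles containing an infected node).
    Returns a list of (rate, infected_fraction) pairs, one per time step.
    """
    tris = triangles(adj)
    infected = {source}
    history = []
    for _ in range(steps):
        active = sum(1 for tri in tris if tri & infected)
        rate = min(1.0, (1.0 + alpha * active) * lam)  # lambda* = c * lambda
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < rate:
                    newly.add(v)
        infected |= newly
        history.append((rate, len(infected) / len(adj)))
    return history
```

With `alpha = 0` the sketch reduces to an ordinary SI process at constant rate `lam`. Since nodes never recover, the count of participating triangles can only grow, so the recorded rate is non-decreasing over time, which mirrors $λ_{1}^{*} < λ_{2}^{*}$ in the description above.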

FIGURE 4. An evolutionary diagram of a complex propagation process. The red nodes are infected nodes and the blue nodes are susceptible nodes.

5 Evaluation methods

From the perspective of propagation dynamics, the greater the influence of a node, the stronger its diffusion ability, that is, the more nodes in the network are ultimately infected when it is the source. We use the proportion of finally infected nodes $F_{i} (t_{c})$ after $t_{c}$ time steps to reflect the propagation ability of node i in the network. The formula is as follows:

F_{i} (t_{c}) = \frac{1}{M} \sum_{m = 1}^{M} f_{i} (m) (6)

Here, M is the total number of experimental repetitions, and $f_{i} (m) = n_{i} / N$ is the proportion of finally infected nodes in the steady state of the network in the m-th repetition, where $n_{i}$ is the number of infected nodes when node i is the source. The propagation rate is set above the propagation threshold, $λ^{*} > λ_{t h}^{*}$, where the threshold depends on $〈k〉$ and $〈k^{2}〉$, the mean and the second moment of the degree distribution, respectively. We assume that at the beginning all nodes in the network are susceptible except the initial infection source.
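As a small worked illustration of Eq. 6, the sketch below averages the final infected fractions over M repetitions. The function name `spreading_ability` and its argument names are assumptions introduced here; the input is taken to be the list of final infected counts $n_{i}$, one per repetition, for a fixed source node i.

```python
def spreading_ability(final_infected_counts, n_nodes):
    """Eq. 6: mean of f_i(m) = n_i / N over the M repetitions.

    final_infected_counts holds one final infected count per repetition.
    """
    if not final_infected_counts:
        raise ValueError("at least one repetition is required")
    fractions = [count / n_nodes for count in final_infected_counts]
    return sum(fractions) / len(fractions)
```

For example, three repetitions ending with 50, 60 and 70 infected nodes in a 100-node network give fractions 0.5, 0.6 and 0.7, whose mean is 0.6 up to floating-point rounding.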

Kendall’s tau correlation coefficient is used to measure the consistency between the ranking list produced by a given centrality measure and the ranking list obtained from the SI model based on standard Monte Carlo simulation. Given two ranking lists X and Y, let $(x_{i}, y_{i})$ and $(x_{j}, y_{j})$ be the entries for two nodes i and j in these lists. The Kendall’s tau correlation coefficient is defined as:

τ_{(X, Y)} = \frac{2 (n_{1} - n_{2})}{N (N - 1)} (7)

In this paper, X represents the ranking of nodes obtained from a centrality measure, and Y represents the ranking by number of infected nodes obtained from the SI model based on standard Monte Carlo simulation. If $x_{i} > x_{j}$ and $y_{i} > y_{j}$, or $x_{i} < x_{j}$ and $y_{i} < y_{j}$, that is, if the order of $x_{i}, x_{j}$ in X is the same as the order of $y_{i}, y_{j}$ in Y, the pair is called consistent; otherwise, it is called inconsistent. $n_{1}$ and $n_{2}$ are the numbers of consistent and inconsistent pairs, and N is the length of the ranking list X. By construction, $τ$ lies between −1 and 1. When $τ = - 1$, the two lists are in completely opposite order; when $τ = 1$, the two lists are in the same order.
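A direct implementation of Eq. 7 can make the counting of consistent and inconsistent pairs concrete. The sketch below is illustrative only: the function name `kendall_tau` is an assumption, and pairs that are tied in either list are counted as neither consistent nor inconsistent, which is one common convention and is assumed here rather than taken from the text.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Eq. 7: tau = 2 * (n1 - n2) / (N * (N - 1)).

    n1 counts consistent pairs and n2 inconsistent pairs.
    ASSUMPTION: pairs tied in x or y contribute to neither count.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two items are required")
    consistent = inconsistent = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            consistent += 1
        elif product < 0:
            inconsistent += 1
    return 2 * (consistent - inconsistent) / (n * (n - 1))
```

For X = (1, 2, 3) and Y = (1, 3, 2), two of the three pairs are consistent and one is inconsistent, so $τ$ = 2(2 − 1)/(3 · 2) = 1/3. The double loop costs O(N²) comparisons, which is acceptable for a sketch but slow for very large networks.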

To distinguish the propagation ability of all nodes, a centrality measure should assign each node a unique value. Monotonicity quantifies how far a ranking is from this ideal by measuring the share of tied (repeated) values in the sequence. To quantify the monotonicity of different ranking methods, we use the index of [8], defined as:

M (R) = [1 - \frac{\sum_{r \in R} N_{r} (N_{r} - 1)}{N (N - 1)}] (8)

Here, $N$ is the length of the ranking list R, and $N_{r}$ is the number of nodes that share the same ranking value r. M ranges from 0 to 1. The best value of M is 1, which means that each node in the network has a unique and identifiable ranking value. Conversely, the worst value of M is 0, which means that all nodes in the network have the same ranking.
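The monotonicity index can be computed by grouping equal values and summing $N_{r} (N_{r} - 1)$ over the groups. The sketch below follows Eq. 8 exactly as written above; the function name `monotonicity` is an assumption introduced for illustration.

```python
from collections import Counter


def monotonicity(scores):
    """Eq. 8 as written: M(R) = 1 - sum_r N_r (N_r - 1) / (N (N - 1)).

    scores holds one centrality value per node; equal values form a tie group.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are required")
    tied = sum(size * (size - 1) for size in Counter(scores).values())
    return 1 - tied / (n * (n - 1))
```

For the scores (1, 1, 2, 3), only the value 1 is repeated, so the sum is 2 · 1 = 2 and M = 1 − 2/12 ≈ 0.83. A list of all-distinct values gives M = 1, and a list of identical values gives M = 0, matching the two extremes described above.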

The Pearson correlation coefficient (R) is used to measure the correlation between different ranking methods. The formula for R is as follows [22]:

R = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} {(X_{i} - \bar{X})}^{2}} \times \sqrt{\sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}}} (9)

Here, $X = (x_{1}, x_{2}, . . ., x_{n})$ and $Y = (y_{1}, y_{2}, . . ., y_{n})$ are the propagation capacities assigned to the nodes of the network by any two ranking methods, $(x_{i}, y_{i})$ is the pair of values assigned to the same node by the two methods, and $\bar{X}$ and $\bar{Y}$ are the averages of X and Y. For an N-node network, there are $N (N - 1) ∕ 2$ such combinations. The value of R lies in [−1, 1]. If R = 1, the two ranking methods are perfectly positively linearly correlated; if R = 0, they are linearly uncorrelated; if R = −1, they are perfectly negatively linearly correlated. When R is in (0, 1], the two methods are positively correlated, and the closer R is to 1, the stronger the positive correlation; when R is in [−1, 0), the two methods are negatively correlated, and the closer R is to −1, the stronger the negative correlation.
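Eq. 9 translates directly into a few sums. The following sketch is illustrative, with the function name `pearson_r` chosen here; it raises an error when either input is constant, because the denominator of Eq. 9 is then zero and R is undefined.

```python
import math


def pearson_r(x, y):
    """Eq. 9: Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("R is undefined when one list is constant")
    return covariance / (math.sqrt(spread_x) * math.sqrt(spread_y))
```

For instance, X = (1, 2, 3) and Y = (2, 4, 6) are perfectly linearly related and give R = 1, while X = (1, 2, 3, 4) and Y = (1, −1, −1, 1) have zero covariance and give R = 0 even though Y is a deterministic (non-linear) function of X.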

6 Experimental results and analysis

In the experiments, the proposed node importance metric (CK) is compared with three other metrics, namely CR, FI, and SHF. We analyze the performance of the proposed centrality metric in terms of propagation ability, correlation and monotonicity.

6.1 Analysis of transmission capacity

At time step t = 1, the nodes with the largest values of SHF, FI, CR, and CK are each selected as the initial infection source, and the initial transmission rate of the disease is set slightly above the propagation threshold of the network. The final proportion of infected nodes in the six networks was observed after 100 time steps. The experimental results are shown in Figure 5. Figures 5A-E show that the proposed metric CK yields the largest proportion of final infected nodes after 100 time steps, followed by FI, SHF, and CR. Under the complex contagion mechanism, nodes with larger FI values have dense second-order neighborhoods that belong to the dense part of the network, so the disease triggers the complex contagion mechanism once it reaches the second-order neighbors of the node, which facilitates its global spread. However, the clustering coefficients of these nodes are small, which makes it harder for the initial spread of the disease to trigger complex contagion on a large scale, so FI ultimately ranks below CK. SHF considers only the degree of a node and the number of common neighbors in its first-order neighborhood. The initial spread of the disease therefore proceeds under complex transmission through triangles and only subsequently reaches the second-order neighborhood. The lack of information about the second-order neighborhood inhibits the spread of the disease to some extent, so SHF ultimately ranks below FI. It is difficult to trigger complex propagation from an infection source on a loop (other than a triangle), mainly because only the nodes at the two ends of a link on the loop can interact directly; that is, the loop lacks the higher-order structures that trigger complex contagion. As a result, the CR metric yields the lowest number of final infected nodes.

Figure 5F differs only in that FI slightly exceeds CK after t = 35. This is mainly because the BA network consists of many small communities connected by large-degree nodes, and FI happens to identify these large-degree nodes. The tight clustering maximizes the complex propagation mechanism once the virus reaches the second-order neighborhoods of the nodes identified by FI. In addition, we examined the disease transmission rate over 100 time steps. The initial transmission rate was again set slightly above the propagation threshold of each network (0.26, 0.14, 0.06, 0.065, 0.16, and 0.10 for the six networks in turn). The results are shown in Figure 6. When the node identified by CK is taken as the initial transmission source, the transmission rate of the disease is the largest in all six networks throughout the 100 time steps. This indirectly shows that the largest number of higher-order structures is involved in the propagation process, and further supports the accuracy of CK in identifying the most important nodes under the complex propagation mechanism. In panels (1) and (4)–(6) of Figure 6, FI performs better than SHF; in panel (2), FI and SHF are roughly equivalent; and in panel (3), SHF is superior to FI. CR performs worst in all six networks, while FI also performs well in panel (6). To sum up, under the complex transmission mechanism based on the SI model, when the node identified by CK is the initial infection source, the number of infected nodes and the disease transmission rate are the largest after 100 time steps, with the exception noted for Figure 5F.

FIGURE 5. The number of final infected nodes obtained with the sources selected by four different centrality measures on six networks in (A-F). The horizontal axis corresponds to the infection time and the vertical axis represents the diffusion capacity $F_{i} (t_{c})$. The results are averaged over 100 independent experiments on the complex SI model.

FIGURE 6. Network propagation rate $λ^{*}$ as a function of time t in (1), (2), (3), (4), (5) and (6). The result for each network is the average of 200 separate simulations on the complex SI model.

6.2 Kendall’s tau correlation coefficient

To test the accuracy of the proposed centrality measure, it is compared with the three other centrality measures on six networks under the complex SI model. The closer $τ$ is to 1, the better the performance of the ranking method. The value of $λ^{*}$ is varied near the propagation threshold. Because $λ_{t h}^{*}$ depends on the topological characteristics of a given network, the abscissa differs between the networks in Figure 7. Figure 7 shows how the Kendall’s tau correlation coefficient of the four centrality measures varies with the infection probability. In Figures 7G-K, CK is superior to the other three centrality measures over the entire range of infection probabilities. In Figure 7G, the $τ$ values of CK and FI are roughly equal around the propagation threshold. This is mainly because, when $λ^{*}$ is relatively small, the influence of a node is limited to a small range and the virus does not spread. CK gradually outperforms FI as $λ^{*}$ increases. In Figures 7H and 7K, the four centrality measures differ most from each other. This is mainly because the Network science and WS networks are characterized by large clustering coefficients and small average path lengths, which create the most favorable conditions for triggering complex propagation mechanisms. Complex contagion is triggered at the very start of propagation, which makes CK the most accurate in identifying the important nodes. At the same time, SHF is superior to FI when $λ^{*}$ is in [0.124, 0.135] and [0.157, 0.195], which is also due to the larger clustering coefficients. Because SHF takes into account the number of common neighbors in the first-order neighborhood of a node while FI does not, the accuracy of FI in identifying influential nodes decreases for a range of $λ^{*}$ above $λ_{t h}^{*}$.

In Figure 7L, CK is better than FI up to $λ^{*}$ = 0.093. This is because the BA network is composed of “hub” nodes connecting many dense small communities, and little complex propagation is triggered during the initial spread of the virus because there are fewer higher-order structures in the first-order neighborhoods of the nodes identified by FI. Beyond $λ^{*}$ = 0.093, FI gradually becomes better than CK: as the virus continues to spread to second-order neighborhoods, the dense clusters in the second-order neighborhoods of FI-identified nodes play a large role in triggering complex propagation, which greatly improves the accuracy of the FI metric. Furthermore, the $τ$ value of CR is the lowest in all networks, which means that CR cannot identify important nodes under the complex propagation mechanism. Overall, under the complex propagation mechanism based on the SI model, CK attains the highest $τ$ value in Figures 7G-K over the whole range and in Figure 7L up to $λ^{*}$ = 0.093, which means that most nodes identified by CK are the nodes with the strongest propagation ability in the network.

FIGURE 7. The Kendall’s tau correlation coefficient ( $τ$ ) between the ranking results obtained from the four different metrics and the $F_{i} (t_{c})$ ranking list obtained from the SI model with different infection probabilities $λ^{*}$ in (G-L); $λ^{*}$ varies in the neighborhood of $λ_{t h}^{*}$. The results are obtained by averaging 100 independent simulations of the complex SI model, and the vertical dotted line represents the propagation threshold $λ_{t h}^{*}$.

6.3 Monotonicity

A good centrality measure should be able to distinguish between nodes with different propagation capabilities. Table 2 shows the monotonicity of the four centrality measures in six different networks. The closer M is to 1, the better the performance of the centrality measure. Among the four centrality measures, the monotonicity value of CK (shown in bold) is close to 1 in all networks. We conclude that, under the complex propagation mechanism based on the SI model, CK distinguishes well between nodes with different propagation capabilities in the network.

TABLE 2. Monotonicity of four centrality measures in six networks.

6.4 Correlation analysis

Figure 8 shows the correlation matrix between CK and the other three centrality metrics. Each element represents the average of the correlation coefficient R between two metrics over the six networks, reported to two decimal places. CK and FI have the strongest correlation, followed by CK and SHF, and then CK and CR. This is because FI considers triangles in the first-order neighborhood of a node, and diagonal quadrilaterals and loops in the second-order neighborhood. SHF considers only triangles in the first-order neighborhood of a node. CR considers only the loops in the second-order neighborhood of the node. The proposed metric CK, based on the higher-order degree of the node, considers not only loops in the second-order neighborhood but also diagonal quadrilaterals and fully connected quadrilaterals in the first-order neighborhood; it extends the other three metrics by considering a more diverse set of higher-order structures. The same result is obtained from Figure 9. Each point denotes one node in the network, and the color of the point indicates the spreading ability of the node simulated by the complex SI model, denoted by $F_{i} (t_{c})$ ( $t_{c}$ = 100). The stronger the spreading ability of a node, the closer its color is to red. In panels (m), (q) and (v), we can clearly see that CK and FI have the strongest correlation, followed by CK and SHF, and then CK and CR. The correlation between CK and CR is weak because many nodes are assigned the same value. Furthermore, the larger the value of CK, the redder the color of the node, which agrees with the actual spreading ability $F_{i} (t_{c})$ of the nodes. Therefore, when identifying the nodes with the strongest propagation ability under the complex propagation mechanism, FI may be the best alternative if CK is not available.

FIGURE 8. The average correlation matrix for the four indices of node importance over six networks. Each element is the averaged value of the correlation R between the two indices corresponding to its position over the six networks, and the value is visualized by the color. The correlation increases gradually from blue to yellow.

FIGURE 9. Average correlation between the proposed CK centrality and the other three measures of centrality in the six networks in (m), (q) and (v). Each point represents a node, and its color denotes the $F_{i} (t_{c})$ ( $t_{c}$ = 100) value of the node. The data are obtained from the complex SI model by averaging the results of 100 independent simulations on six networks when $λ^{*} > λ_{t h}$ .

7 Conclusion

In this paper, a new node importance metric is proposed for identifying the nodes with the strongest propagation capacity under complex contagion mechanisms. The metric considers diagonal quadrilaterals, loops (empty quadrilaterals) and fully connected quadrilaterals in the first- and second-order neighborhoods of a node, together with the impact of these structures on the complex propagation dynamics on the network. The high-order node degree is defined, and a complex propagation model, which has been neglected in the past, is introduced. Simulation results on real and synthetic networks demonstrate the superiority of the proposed metric. However, only some of the higher-order structures in the first- and second-order neighborhoods of the nodes are considered in this paper; the higher-order structures actually present are much richer. Moreover, as the analysis range (the neighborhood of a node) extends outward, the computational complexity increases exponentially, which raises more difficult problems that we leave to future work. Our future work will mainly focus on finding the optimal combination of higher-order structures in a node’s neighborhood that most efficiently facilitates information dissemination. We will also propose targeted protection strategies for the higher-order organization of the network.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions

JS proposed the advice and conducted the numerical experiments. GW checked the paper in the later stage. All authors contributed to the article and approved the submitted version.

Acknowledgments

We would like to thank all the anonymous reviewers for their helpful suggestions to improve the quality of our manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphy.2023.1046077/full#supplementary-material

References

1. Domenico MD, Solé-Ribalta A, Omodei E. Ranking in interconnected multilayer networks reveals versatile nodes. Nat Commun (2017) 6:6868. doi:10.1038/ncomms7868
