Authors Anna Levina Tanguy Fardet
License CC-BY-4.0
PHYSICAL REVIEW RESEARCH 3, 043124 (2021) Weighted directed clustering: Interpretations and requirements for heterogeneous, inferred, and measured networks Tanguy Fardet and Anna Levina University of Tübingen, Tübingen 72076, Germany and Max Planck Institute for Biological Cybernetics, Tübingen 72076, Germany (Received 8 June 2021; accepted 18 October 2021; published 17 November 2021) Weights and directionality of the edges carry a large part of the information we can extract from a complex network. However, many network measures were formulated initially for undirected binary networks. The necessity to incorporate information about the weights led to the conception of multiple extensions, particularly for definitions of the local clustering coefficient discussed here. We uncover that not all of these extensions are fully weighted; some depend on the degree and thus change a lot when an infinitely-small-weight edge is exchanged for the absence of an edge, a feature that is not always desirable. We call these methods “hybrid” and argue that, in many situations, one should prefer fully weighted definitions. After listing the necessary requirements for a method to analyze many various weighted networks properly, we propose a fully weighted continuous clustering coefficient that satisfies all the previously proposed criteria while also being continuous with respect to vanishing weights. We demonstrate that the behavior and meaning of the Zhang-Horvath clustering and our proposed continuous definition provide complementary results and significantly outperform other definitions in multiple relevant conditions. We demonstrate, using synthetic and real-world networks, that when the network is inferred, noisy, or very heterogeneous, it is essential to use the fully weighted clustering definitions. DOI: 10.1103/PhysRevResearch.3.043124 I. INTRODUCTION stressed in multiple studies [1–4]. This is notably the case for the middleman motif, which is a marker of feedforward The clustering coefficient (CC) was originally introduced loops in transcriptional networks and of information trans- for binary undirected networks to quantify strong connect- fer redundancy, e.g., in neuroscience. More generally, such edness within a local neighborhood. It was defined as the motifs will influence the evolution of dynamical processes fraction of all possible triangles that were realized, i.e., the on the networks, for instance, synchronization patterns, and ratio between all triangles in which node i participates (n,i ) have been shown to characterize families of networks such as and the total number of triangles that could theoretically be transcription or language networks [2]. Finally, clustering is made given its degree di , which is the number of triplets (nT,i ): used in other measurements to access the small-world propen- n,i n,i sity of networks [5], and the choice of a specific definition Cibin = = . (1) nT,i di (di − 1) can therefore influence whether the network of interest will register as small world or not. From a neighbor-centric perspective, it can be seen perhaps In many applications, network topology and weights are more intuitively as the probability that two neighbors of a measured only up to a certain precision [6,7]. For example, in node are connected. However, as network science expanded, neuroscience, the functional connectivity networks are mea- more and more graphs were encountered, where directedness sured using the indirect inference of connections from the and edge weights play a central role. Generalizations of the recorded activity [8,9]. Accepting the inevitability of noise in clustering coefficient were therefore introduced to account a network brings forward new requirements on the network for asymmetry in the connections between pairs of nodes or measures, namely, that they are stable to the noise and do heterogeneity in their strength. not change dramatically if the weights are perturbed or weak The importance of clustering, including its directed vari- connections are randomly omitted. ants, to understand complex dynamics on networks has been There is no agreement among researchers as to which weighted extension of the clustering coefficient definition is most appropriate. The three predominantly used methods at the moment [10–12] differ in many properties of their def- Published by the American Physical Society under the terms of the initions. Part of the reason for the absence of a single best Creative Commons Attribution 4.0 International license. Further weighted clustering lies in a different interpretation of weights distribution of this work must maintain attribution to the author(s) in various data sets. Consequently, a different weighted exten- and the published article’s title, journal citation, and DOI. Open sion might be most appropriate for various data and specific access publication funded by the Max Planck Society. scientific questions. However, to understand which method to 2643-1564/2021/3(4)/043124(16) 043124-1 Published by the American Physical Society TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) use when and why, we need to understand their differences factor [the global max(w)] as long as the normalization con- precisely. dition is fulfilled since only the normalization matters. We The difficulty of extending graph measures to weighted omitted the last two conditions of Saramäki et al.’s paper networks is not specific to the clustering coefficient but can (invariance to weight permutation and ignorance of weights occur whenever ratios of degrees or path length are involved. not participating in any triangle). Although they might be of We will therefore also discuss a second clustering-related interest for some specific applications, we do not consider measure, called the closure coefficient and introduced as the them to be generally desired properties for a clustering co- fraction of all open walks of length 2 starting from node i that efficient. Indeed, the clustering coefficient is a node-centric are part of a triangle [13]. This will also enable us to discuss measure, so there is no reason to expect invariance when the the complementarity of closure and clustering as the former weights of a node’s edges change. As for edge ignorance, in provides an important complement to analyze the tendency of the original classifying paper [14] this property is introduced nodes to form 3- and 4-cliques. as a particular feature of the Onnela et al. [11] definition, with- We introduce here a distinction between fully weighted out reasons why it may be a desired condition. We also did and hybrid definitions and discuss why, for several classes not require that all weights in a triangle should be accounted of networks, fully weighted and directed definitions should for because this condition is necessarily met if the continuity be preferred to other clustering definitions that are currently condition is fulfilled. used for network analysis. We also propose a definition that Continuity can be expressed mathematically as follows: for obeys additional conditions, including continuity of the results a graph G(V, E ), if a weighted edge (u, v, w) with u, v ∈ V with respect to infinitesimal changes in edge weights, which and weight w ∈ R is added to this graph to form a new graph has significant consequences for the resilience to noise in in- G (V, E ), with E = E + {(u, v, w)}, then the clustering mea- ferred networks. We demonstrate why fully weighted methods sure is continuous if and only if ∀i ∈ V , Ci(G ) −−−→ + Ci(G) . are essential for measured and inferred networks, which are w→0 This condition in crucial to ensure a reasonable behavior of pervasive in biological fields such as neuroscience, and for the clustering coefficient in inferred networks. networks dealing with flows of information, money, or goods Though some definitions of previously proposed weighted that display a very broad weight distribution. clustering coefficient definitions obey most of the required properties, none of them completely fulfill the continuity condition—despite previous claims [15]. This is why we will II. INTERPRETATION AND PURPOSE OF WEIGHTED later propose a definition that fulfills all aforementioned con- CLUSTERING ditions. An extensive comparison of the properties fulfilled by A. Desired properties of weighted clustering coefficients different clustering definitions can be found in Appendix B. Weighted measures are crucial for many network types where the binary connectivity is either uninformative (fully connected network) or displays similar or lower heterogeneity B. State of the art for weighted clustering compared with the weighted structure. In this paper, we focus First we introduce and classify the existing weighted clus- on two classes of real-world networks: inferred or measured tering coefficient definition. For all clustering definitions in networks where there can be a large number of spurious the main text, we use the following notation: A is an adja- (false positive) edges with small weights; and networks asso- cency matrix, and W = {wi j } is the normalized weight matrix, ciated with flows of information or goods, which often display obtained from the original weight matrix W̃ = {w̃i j } by wi j = broad weight distributions. This is notably the case for many w̃i j / maxi, j (w̃i j ). networks in neuroscience, and more generally in informa- Hybrid definitions were the first extensions of the binary tion, transportation, or other social and economic networks. clustering coefficient definitions. They combine properties Weights are essential to understand the dynamical processes associated with a weighted connectivity matrix (i.e., the in- that occur in these networks, requiring measures that go be- tensity of the triangle) with properties that could be already yond the binary structure. obtained from an adjacency matrix (i.e., node degrees). These There could be multiple requirements for weighted clus- definitions move from an integer counting the number of trian- tering coefficients [14] depending on the particular question gles (n ) to a sum of real numbers (computed as a function of of interest and on the network properties. The main require- edge weights) that we call “intensities” of triangles (I ). The ments that we considered necessary for a weighted clustering choice of a particular function for the intensity of the triangles coefficient are (i) normalization (Ci ∈ [0, 1]), (ii) consistency determines the properties of the clustering coefficient. with the binary definition (for binary networks, it should Two popular hybrid weighted clusterings were given by give back the classical result), (iii) linearity (scaling by α all Barrat et al. [10] and Onnela et al. [11], which we will refer edges involving node i and all edges in triangles including to as the Barrat and Onnela definitions, respectively. node i scales Ci by α), and (iv) continuity (weak influence of For a node i in a graph, the Barrat definition [10] quantifies the addition or deletion of edges having very small weights, the fraction of the node’s strength that is invested in triangles meaning that an edge with infinitesimally small weight should (see Appendix C 1 for more details): be equivalent to the absence of that edge). Compared with a previously proposed list of conditions given by Saramäki et al. [14], we added a continuity condition (WA2 )ii w CiB = = Cibin i , (2) but did not include a requirement of a specific normalization 2si (di − 1) wi 043124-2 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) where si = j=i wi j is the strength of the node, wi is TABLE I. Limit values for undirected weighted clustering coeffi- cients of vertex i (solid circles) for different weight configurations in the average weight of the edges involving i, and wi = wi j +wik graphs with vanishing weights. Solid lines depict edges of weight j=k 2n,i ai j aik a jk is the average weight of edges involving w = max(w) = 1, and dotted lines denote edges with vanishing i that are part of a triangle computed over all triangles in weight . Only the proposed continuous clustering (bottom row) which node i participates. In terms of triangle intensity, this returns values consistent with the continuity condition, whereas the definition was originally written as Barrat (C B ), Onnela (C O ), and Zhang-Horvath definitions (C Z ) devi- wi j +wik ate for it. j=k 2 ai j aik a jk Ci = B 2si (di − 1) 1 wi j + wik = ai j aik a jk , (3) di (di − 1) j=k 2wi thus defining the intensity of triangle i jk as Ii B jk = wi j +wik 2wi ai j aik a jk as the function of two of the triangle’s weights and the average weight of the edges connected to node i, wi . Proposed a bit later, the Onnela definition [11] scales the Z (max) Ii = ITZi jk = wi j wik if all existing triplets Ti jk were closed binary clustering by the average intensity of the triangles (see jk by an edge of weight 1 (the maximum possible weight in the Appendix C 2 for more details): normalized network). This way, if i is involved in a single 1 3 triangle, the clustering coefficient is equal to the weight of W [ 3 ] ii Ci = O = Cibin Ii O , (4) the edge closing the triplet centered on i (see also Table I). di (di − 1) jk Though this definition does not fulfill the continuity jk = (wi j wik w jk ) O 1/3 property, we will show that it still provides a consistent in- with the triangle intensity defined as Ii and the average intensity taken over all triangles in which i terpretation of weighted clustering, as discussed in Ref. [16], participates. and is well suited to tackle networks with a large fraction of For all hybrid methods the denominator relies on the node’s false positives. degree, meaning that the addition or deletion of edges will Other fully weighted definitions that have been proposed always significantly affect the clustering coefficient even if and discussed since [15,17,18] do not bring significant ad- the edge has an infinitely small weight. Such methods can ditions compared with the Zhang-Horvath definition while thus lead to inaccurate results when applied to the inferred net- actually losing some of its properties and its straightforward works, where a significant fraction of edges are false positives interpretation. They are not considered further in this paper; with small weights. In the following we will also demonstrate see Appendix A for further explanations. that they cannot reliably detect the most strongly clustered nodes in structured networks. Only fully weighted definitions C. A continuous definition for weighted clustering and closure can rise up to these challenges. For an undirected graph G, we define the proposed contin- Fully weighted definitions are variants of the clustering uous clustering of node i as coefficient that do not include any binary measures (anything 2 √ 2 that can be derived from the adjacency matrix alone, e.g., j=k Ii jk j=k ( wi j w jk wik ) 3 degrees). In addition to substituting the number of triangles Ci = = √ . (6) j=k IT i jk j=k w ji wik by the sum of triangle intensities, they also move away from counting triplets—defining the maximum number of possible We define the weighted intensity of triangles and triplets, √ √ triangles max(n,i ) = nT,i = di (di − 1). Instead, they intro- Ii jk = 3 wi j w jk wik and IT i jk = w ji wik , respectively, using duce the triplet intensity IT such that, for a node i, IT,i is a the geometric mean of the weights involved. Because of this, real-valued function of the weights associated with i. one strong weight in a triangle or triplet cannot compensate One of the first fully weighted definitions for the cluster- the presence of smaller weights, in contrast to what may ing coefficient was provided by Zhang and Horvath [12] to happen if one uses the arithmetic mean. This provides the analyze gene coexpression networks: desired property that the intensity of triangles and triplets will Z go to zero if even a single edge weight goes to zero. Note j=k wi j wik w jk Ii jk bin that, though the triangle intensity is defined as the geometric Ci = Z = Ci , (5) w w j=k i j ik Z IT i jk mean of the three weights involved, it is the square of this intensity that is used in the definition. The reason for this which we will refer to as the Zhang-Horvath definition. Note choice is twofold: to assign higher influence to large triangles that the fact that the definition can be expressed as a function and to ensure the linearity of the coefficient. Importantly, this of the binary clustering does not contradict the fully weighted definition of the triangle intensity I assigns the same role to nature of the measure, as it stems from a simple recombination all participating edges. The proposed clustering could be in- of the terms. terpreted as the ratio of the triangle intensity that is invested in This definition can be interpreted as the ratio of strong triangles (given by the sum of the squared intensities of jk = wi j wik w jk of the triangles Z the summed intensities Ii triangles, which increases the importance of strong triangles) i jk = (i, j, k) to the maximum possible summed intensities to the triplet intensity, which would represent the maximum 043124-3 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) TABLE II. Definitions of the continuous intensities for each partial mode pattern in directed graphs. Column 1, pattern names; column 2, pattern illustration; column 3, number of triangles for node i; column 4, number of triplets for node i; column 5, continuous intensities of the triangles for node i; and column 6, continuous intensities of triplets for node i. The clustering coefficients associated with each mode m are given by Ci(m)bin = n,i (m) /nT,i (m) for binary networks and Ci(m) = I,i /IT,i for the continuous definition. (m) (m) possible triangle intensity if all weights connecting adjacent tained via the formula nodes were equal to 1. I,i The proposed continuous definition fulfills all the condi- Cg = i . (9) tions we put forward above, and to the best of our knowledge i IT,i it is the only one to do so. Similarly to previous definitions, the proposed clustering coefficient can also be rewritten in terms D. Directed weighted clustering of node properties: Fagiolo [19] proposed how to generalize clustering to di- 2 3 rected networks. He defined the different patterns or motifs W[3] (shown in Table II) that can exist in these networks and Ci = 1 ii , (7) adapted the Onnela definition [11] as the first weighted di- [ ] 2 si 2 − si rected clustering coefficient. The Barrat definition [10] was generalized to directed graphs in Ref. [20] following the same where si = j=i wi j is the normalized strength of node i distinction into Fagiolo’s cycle, middleman, fan-in, and fan- and W [α] = wiαj and si[α] = j=i wiαj are the fractional weight out motifs. matrix and strength for any α ∈ R. Similarly, the Zhang-Horvath definition [12] and the con- As for the previous definitions, the continuous clustering tinuous definition can be generalized in a straightforward can be interpreted as a function of intensities and the binary manner for directed networks. For this, we only need to clustering: redefine the intensities of each directed triangle and triplet motif (as shown in Table II for the continuous definition and 2 2 2 n Ii jk Ii jk Var(Ii jk ) + Ii jk bin in Table VII for the Zhang-Horvath definition). This simply Ci = = Cibin = Ci , (8) requires replacing A with W in all expressions of n,i (m) and nT IT i jk IT i jk IT i jk (m) a with w in all expressions of nT,i . As the total directed with means Ii jk and IT i jk and variance taken over all the clustering is defined as the sum of all modes, we can write triangles or triplets the node i participates in, respectively. it as In the limit where all triangles associated with node i have Z (tot) I,i (W + W T )3ii similar intensities, we can neglect the variance term, leaving CiZ (tot) = = . (10) j=k (wi j + w ji )(wik + wki ) Z (tot) Ii jk 2 IT,i Cibin . In this case, in contrast to the Zhang-Horvath def- IT i jk inition [Eq. (5)], the absolute value of the intensity matters, Finally, the continuous clustering can be extended for each not only its ratio to the maximum possible intensity. For a directed mode (see Table II), and for the total directed cluster- given average triangle intensity, the positive contribution of ing, this leads to the variance implies that nodes with more variable intensities, 2 2 3 i.e., at least one triangle with a high intensity, will have higher (tot) (tot) I,i 1 W [ 3 ] + W [ 3 ]T ii Ci = (tot) = 1 2 , (11) clustering coefficients than nodes with identical triangles of [2] 2 IT,i si,tot − si,tot − 2si↔ average intensity. Finally, the global clustering can also be defined in a [ 21 ] with si,tot = j (wi j + w ji ) being the total strength, si,tot = straightforward For simplicity, we define I,i = fashion. 1 1 ↔ j=k Ii jk and IT,i = 2 j=k IT i jk , leading to Ci = I,i /IT,i . (w + w ji ) being the total root strength, and si = 2 2 j √ij Using this definition, the continuous global clustering is ob- j wi j w ji being the reciprocal strength. 043124-4 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) (a) (b) FIG. 1. Only the continuous clustering coefficient uncovers the true structure in the weighted core-periphery network. A network has 11 strongly connected core nodes (black edges) that interact with well-clustered periphery nodes with weaker connection strengths (light-brown edges); see Appendix E 1 for details on the network. (a) Graphical view of the network; the edge width gives the strength of the connection, and the node color gives its clustering coefficient. (b) Distribution of clustering coefficients for the three types of nodes over ten realizations of such a core-periphery network. Only the continuous definition differentiates between the central, the outer-core, and the periphery nodes. In all the other methods, the clustering coefficients of the 10 “outer-core” and 22 periphery nodes overlap: The Onnela definition only distinguishes the central node, whereas the Barrat and Zhang-Horvath definitions do not hint at a core-periphery structure. As for the undirected case, the global clustering coefficient A. Sensitivity to weight-encoded topological features associated with each (m) directed (m)pattern can be obtained via the Here, we investigate how weighted structures can be de- formula Cg(m) = i I,i / i IT,i . tected or missed using different clustering definitions. As an example we consider weighted core-periphery graphs in III. THE ADVANTAGES OF FULLY WEIGHTED Fig. 1. There, core nodes are characterized by both a dense DEFINITIONS binary connectivity and large weights, whereas periphery In this section we discuss the sensitivity to the weight- nodes display both sparser connectivity (though they still encoded topological features and stability to noise in network have large degrees) and weaker weights; for more details, see measurements of the different clustering methods. A previous Appendix E 1. We generate ten realizations of the network study [14] already noted the fact that previous definitions did and consider distribution of clustering coefficients of differ- not fulfill the continuity condition by analyzing the behavior ent types of nodes. Continuous definition leads to distinct of the different coefficients for nodes that participate in a clustering of different types of nodes, making the true struc- single triangle. Table I illustrates some of these cases and ture of the network clearly visible already in the clustering shows that the proposed continuous definition is the only one distributions. Because of their hybrid nature, the Barrat and to behave as expected. Onnela definitions cannot capture this underlying structure Yet we note that the Zhang-Horvath definition is also very as it is mostly encoded in the weights and not at all in the resilient to noise because, except for the corner cases asso- degrees (core nodes are not binary hubs). Though it is purely ciated with single triangles, its behavior is continuous in all weighted, the Zhang-Horvath definition is also not suited to other situations. Moreover, contrary to what was asserted in detect this type of weighted structure because its interpretation Ref. [14], it provides a perfectly sensible behavior given its of a node’s clustering only accounts for the relative triangle interpretation of clustering as the ratio of the triangle intensity intensity given the node’s weights. The continuous clustering is sensitive to any topological jk = wi j wik w jk to its maximum possible intensity given the Z Ii Z (max) property that is encoded via specific weight distributions. weights of node i: Ii jk = ITZi jk = wi j wik if w jk = 1. Furthermore, in contrast to the Zhang-Horvath definition, it Because the Barrat and Onnela definitions [10,11] are the accounts not only for the ratio of the triangle intensity over most well known and (to the best of our knowledge) the only the triplet intensity (how strong the triangles are compared methods implemented in popular graph libraries, we restrict with the maximum possible value given the node’s weights) our comparison to the Zhang-Horvath definition and these but also for the absolute value of the intensity: A weak tri- two definitions. A more comprehensive discussion of other angle, even if it corresponds to the highest possible value definitions of the weighted clustering coefficient can be found given the node’s weights, will decrease the node’s clustering in Appendix A. 043124-5 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) (a) (b) (c) (d) (e) (f) (g) (h) (a) (c) (a) (d) (b) (c) (b) (d) FIG. 2. Fully weighted methods are less sensitive to spurious edges. A “measured network” can be represented as the union of a “ground truth” (G.T.)—here, a Watts-Strogatz network in dark brown—and spurious small-weight connections (“noise” graphs with (a) random [Erdős- Rényi (ER)] or (b) scale-free (SF) connectivity, in red). We assess the influence of the weight distribution of spurious connections (red) by checking weights that are (c) all equal and small or (d) following an exponential distribution and overlapping with the real weights (dark brown). (e)–(g) Ground-truth clustering distribution (filled dark brown) compared with the distributions associated with the measured networks for each method (dashed lines). Weight and noise types, combining noise shown in (a) or (b) with weight shown in (c) or (d), which we refer to as conditions “(a)+(c),” “(a)+(d),” “(b)+(c),” and “(b)+(d),” are associated with colors from brown to orange in the same order as in (h). Distributions associated with the exponential noise [conditions (a)+(d) and (b)+(d)] differ most from the original distributions for the Zhang-Horvath and continuous clusterings [(e) and (f)] and are broader for the Onnela clustering (g). (h) Correlation between the ground-truth clustering and clustering in measured networks for indicated spurious edge topology and weights. Fully weighted clusterings retain most of the correlation for condition (a)+(d), with R2 > 0.55, and only lose the original information for condition (b)+(d). The results were obtained for ten realizations of the spurious edges; error bars give confidence intervals. Network properties are detailed in Appendix E. Z, C, O, and B, Zhang-Horvath, continuous, Onnela, and Barrat definitions, respectively. in the continuous definition whereas it increases it with the We illustrate the impact of spurious edges on measured Zhang-Horvath method. In that sense, the continuous clus- clustering coefficients using the example of Watts-Strogatz tering provides a more global evaluation of the clustering small-world networks. We consider different topologies for coefficient compared with the Zhang-Horvath definition that the subnetwork formed by the spurious edges: either an provides more local information. Erdős-Rényi random network [Fig. 2(a)], associated with un- The Barrat definition has several limitations because it is correlated noise, or a scale-free network [Fig. 2(b)], which close to being weight insensitive [11,21]. It is particularly would correlate noise with certain nodes in the network. unsuitable for assessing networks with a potentially large Additionally, weights on the spurious edges could be much number of low-weight spurious connections or very hetero- smaller than the weight of the actual edges [Fig. 2(c)] or geneous weight distributions. For this reason, we will mostly have an overlapping distribution [Fig. 2(d)]. We refer to the leave this clustering aside in the rest of this paper. combinations of these noise and weight types as conditions “(a)+(c),” “(a)+(d),” “(b)+(c),” and “(b)+(d).” Both fully B. Continuity and resilience to noise weighted methods are unaffected by low noise [conditions The stability of a network measure to noise is of particular (a)+(c) and (b)+(c)] and are also less influenced by the spuri- importance for networks that are obtained via experimental ous edges when their weights are large enough to overlap with measurements since these are often subject to noise and sta- the real weight distribution [conditions (a)+(d) and (b)+(d)]. tistical biases, notably for inferred networks. Methods abiding On the other hand, because hybrid methods explicitly depend by the continuity condition are especially resilient to the pres- on the nodes’ degrees, they are very susceptible to the pres- ence of low-weight spurious edges. Violation of continuity ence of spurious edges [Figs. 2(g) and 2(h)]. can have significant and pervasive consequences for inference The difference in behavior between the methods can be of network properties in many network structures. We have easily explained by a first-order expansion. We consider already seen the simple examples in Table I; here, we demon- change in the clustering coefficient of a node i with degree strate that they are not just corner cases, but occur in larger, di after addition of a spurious edge e = (i, v) with weight real-world networks. 1. For the Barrat and Onnela methods, the new clustering 043124-6 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) coefficient becomes spurious that is lower than 1%. We consider a thresholding 1 procedure, where at each level of the threshold (pmax ), only O C I,i +O 3 di − 1 O 1 →0 the edges with smaller p values are kept. This procedure can Ci = = Ci + O 3 CiO , di (di + 1) di + 1 be seen as an attempt to remove spurious edges, though no +wik correct threshold level is known. Compared with the hybrid I,i + k 2 avk aik B di − 1 B →0 CiB = = Ci + O() CiB , method, both fully weighted definitions are much less sensi- IT,i + si + B di tive to thresholding [Figs. 3(b) and 3(c)]. Thus we see that meaning that, for both methods, the coefficients will deviate the resilience to noise we showed analytically and on toy networks is relevant for a real-world network. Furthermore, from the original clustering CiO/B by a noninfinitesimal value, the resilience is not limited to the general shape of the dis- even when the perturbation was infinitesimal; see Appendix C tribution but indeed preserves the precise values and ranks: for a complete derivation. A larger fraction of the nodes displaying high clustering in On the other hand, the continuous clustering becomes the full graph are still among the highest ranking nodes in 2 the thresholded graphs when fully weighted methods are used I,i + k∼i ( wvk wki ) 3 Ci = compared with the hybrid method [Figure 3(e)]. [ 1 ]√ IT,i + 2si 2 The clustering coefficient is a network measure that cap- [ 1 ]√ tures features beyond purely local parameters, such as degree 2si 2 2 or strength. However, as we see for the mouse connec- = Ci 1 − +O 3 IT,i tome, the hybrid method strongly correlates with the average √ weight associated with a node i: si /di . At the same time, = Ci + O( ) −−−→ + Ci , (12) the fully weighted definitions are much less correlated with →0 it [Fig. 3(d)]. As a result, they can bring more independent showing only an infinitesimal deviation to the similarly in- information regarding the weighted network structure than the finitesimal perturbation. hybrid method. This trend is even stronger for the Zhang- Similarly, except for the single-triangle cases discussed in Horvath definition, which does not account for the absolute Table I, the Zhang-Horvath clustering becomes intensity of triangles. In contrast, the continuous definition provides some intermediate behavior as the intensity of trian- Z I,i + O() CiZ = = CiZ + O() −−−→ CiZ . (13) gles often correlates, if only in part, with the average weight Z + O() IT,i + →0 associated with the node. Finally, though the continuous and Zhang-Horvath defini- It is worth noting that one continuity issue, associated with tions often provide somewhat similar results, they may differ nodes participating in only one triangle, occurs for all defini- significantly, e.g., for the cerebellar cortex in Fig. 3(f). Com- tions but the continuous one. Since this situation is pervasive bining the results of both methods can thus be informative, in networks with low degree or binary clustering, using the for example, to single out nodes that possess only weak con- continuous clustering definition can be of particular impor- nections (and will therefore register as weakly clustered for tance in such cases; see Appendix F 4. the continuous definition) yet connect to other nodes that are strongly connected (thus registering as strongly clustered for IV. APPLICATION TO REAL-WORLD NETWORKS the Zhang-Horvath definition). A. Mouse mesoscale connectome B. Decentralized social media: The Fediverse In neuroscience, the networks on different scales give a vital piece of information to understand the brain better. Un- The Fediverse [23] is a set of federated social media that surprisingly, connectomics, the mapping of the connections in can communicate via a collection of common protocols [24], the nervous system, has gained significant attention and devel- the most well known being ActivityPub. This network can be oped dramatically over the last years. Most of the networks in seen as a set of alternatives to corporate platforms such as neuroscience are weighted, and the obtained connectivities are Facebook or Twitter. Social media on the Fediverse usually either measured or inferred, making them a typical example promote ideas of decentralization, interoperability, free/libre for the challenges discussed above. The mouse mesoscale and open-source software (FLOSS), and the absence of algo- connectome [22] is a fascinating example of such networks, rithmic filters in favor of human curation and moderation. both because it provides information about the entire mouse We analyze here a snapshot of this network that was ob- brain and because it contains an evaluation of the probability tained in 2018 by Zignani et al. [25] (data are available [26]). of false positives for all connections. In contrast to the original publication, we chose here to look Here, we investigate how the choice of the clustering co- at the mesoscale level, i.e., at connections between instances efficient definition can alter the results. The network is very (the equivalent of a community on the Fediverse, where at inhomogeneous, with broadly varying degrees [Fig. 3(a)]. The least one, but up to several thousand users can have an ac- edges in the mesoscale connectome are assigned p values count). This mesoscale view leads to a network of weighted that quantify their probability to correspond to a real physical directed interactions between communities of strongly con- connection (p denotes the probability of the connection to be nected users. Indeed, users of a single instance (the technical a spurious edge). They have therefore different significance name for a community on the Fediverse) can see and interact levels, with only 13% of all edges having p values smaller with all public messages posted by other members on that than 0.01, i.e., only 13% of all edges have a probability to be same server. At the same time, they can only see a subset of 043124-7 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) FIG. 3. For the mouse connectome, different clustering methods give significantly different results. (a) Top view of the mouse brain; node size indicates the total degree, and node color indicates the out-degree (lighter colors for higher degrees). (b) Distribution of the total clustering coefficients if only edges with the p value p < pmax are preserved. There are smaller changes in clustering distribution for fully weighted definitions than for the Onnela definition. From pale yellow to dark gray, thresholding keeps 9, 13, 32, and 100%, respectively, of the original network. (c) The fraction of the nodes with the highest total clustering (top 5, 10, and 25%) that are preserved across all subsamplings in (b) for the Zhang-Horvath (brown), continuous (orange), and Onnela definitions (pale yellow). (d) Correlation of the three total clustering definitions with average total node weight (si /di ) shows that fully weighted definitions capture additional information beyond degree and strength. (e) Fraction of the 10% highest clustering nodes that are common between two of the definitions (filled markers in the right panel include the central region of the left panel) or among all three definitions (black crosses, central region) as shown in the Venn diagram. (f) Clustering ranks of the areas within brain regions (showing which regions contain nodes with high clustering coefficients) can significantly vary depending on the definition (Zhang-Horvath, brown; continuous, orange; and Onnela, pale yellow). the posts from people on other servers (either because they 3825 nodes represent more than half of the entire network follow their author or because other members of the instance and are connected via 81 371 edges. A chord diagram of follow the author or shared this specific post). connections between locations shows that, besides the fact For each instance I1 , an edge towards another instance I2 that the position of most instances is unspecified (UNS), the means that at least one user on I1 follows at least one member largest hosting countries are Japan, the United States, and of I2 . The precise value of the weight associated with this France, with Japanese and French communities interacting edge gives the fraction of all followers from the source in- mostly among themselves [Figs. 4(a) and 4(b)]. The network stance that are associated with members of the target instance. is both very sparse and strongly heterogeneous, with a median The weights are thus a proxy to characterize the fraction degree of 5 but node degrees and sizes varying over 3–5 orders of the community’s attention that is associated with content of magnitude. This broad distribution has notable implications produced by another community; this means of course that, for different clustering definitions. For the Zhang-Horvath for each node, its outgoing strength so is equal to 1. Note definition, it increases the likelihood of running into the corner that in this network, information flow occurs in the direction cases of single triangles, increasing the average clustering opposite to that of the edge, because the directed edge denotes compared with the other methods. For the Onnela definition, that the source is paying attention to what the target posts. it strongly correlates clustering values to the degree (and thus The network snapshot contains 3825 nodes corresponding to the size) of the instance. mostly to instances running Mastodon [27], one of the most As for the mouse network, all weighted methods lead to re- prominent microblogging platforms on the Fediverse. These sults that differ strongly from the binary clustering [Figs. 4(d) 043124-8 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) (a) (b) (c) (d) (e) (f) FIG. 4. Properties of the network of Fediverse instances. (a) Chord diagram of connections between locations (DEU, Germany; FRA, France; JPN, Japan; UNS, unspecified; USA, United States of America). (b) Spatial representation of the network showing connections that amount to at least 1% of the strongest connection; nodes are placed on each country’s capital, and their sizes represent the number of instances hosted in that country. The zoom on Europe shows all connections in the European subnetwork. (c) The count of users per instance follows a heavy-tailed distribution. (d) The network displays strong structural clustering, most nodes with nonzero in- and out-degrees displaying binary clustering values close to 1, whereas the expected value for an Erdős-Rényi network with the same number of edges would be almost zero (dotted line). (e) The median values of the fan-out clustering for different instance sizes show that the strong heterogeneity of the network can have a notable influence for the Onnela method (yellow), whereas the Zhang-Horvath (dark brown) and continuous methods (orange) display much weaker correlation with the size of the instance. (f) The fan-in motifs have intensities that are several orders of magnitude lower than fan out and display weaker dependency on the instance size. The binary (gray) clustering coefficient misses the difference between fan out and fan in, which predominantly relies on the weights’ effect. and 4(e)]. Some of the results from the hybrid method tend to the brain is indeed associated with situations where a signal correlate strongly with some first-order properties of the nodes can be transferred not only directly from one node to another, [Fig. 4(e)], whereas the fully weighted methods bring more but also indirectly via a third (middleman) node, as can be independent information. The Fediverse network displays a seen in middleman, fan-in, and fan-out patterns (cf. Table II). peculiar feature as its fan-out and fan-in clusterings differ This overexpression of redundant patterns is visible, for the significantly despite the usual correlation between these two fully weighted methods, in the high values of the middleman, patterns, as can be seen from the comparison of Figs. 4(e) and fan-in, and fan-out motifs in Fig. 5(a). 4(f). The situation for the Fediverse is significantly more com- plex. Indeed, 98% of all edges belong to at least one triangle, and some are involved in up to thousands of triangles. A C. Using local clustering to infer dynamical properties deeper understanding of the clustering requires precise in- Analysis of the clustering coefficient for different struc- vestigation of how the weight distribution correlates with tured patterns offers a way to obtain a precise idea of specific patterns of triangles. For instance, though fan-in and the critical dynamical patterns within a network. To deter- fan-out patterns co-occur and are thus usually correlated, the mine the significance of a particular pattern, we compare its Fediverse displays an unexpected discrepancy between the prominence in the original network with its prominence in weights associated with the two patterns, as seen in Fig. 5(b), null-model networks obtained via appropriate randomization. which notably differentiates it from the mouse connectome For the mouse brain, as a randomized control, we take the (see also Appendix F 3). original network and only shuffle the weights, thus preserving the weight distribution and binary structure. Comparing the V. DISCUSSION actual values in the original graphs with those of the random- ized graphs, we can see the importance of looking at fully In this paper, we introduced directed weighted definitions weighted measures. for clustering analysis. Using analytic derivations, generated Both the Zhang-Horvath and continuous definitions iden- network models, and real data, we showed that the behavior tify the preference of the mouse brain network for redundant of fully weighted measures displayed enhanced sensitivity, information flows, whereas the hybrid method of Onnela does selectivity, and robustness compared with hybrid measures. not capture this feature. Redundant information transfer in To facilitate access to these measures, all clustering methods 043124-9 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) (a) (b) FIG. 5. Different structures of clustering patterns and information flow in the mouse brain and the Fediverse. The original clustering values are compared with graphs with the same adjacency matrix but shuffled edge weights (mouse, 10 uniformly shuffled networks; Fediverse, 200 graphs shuffled across all outgoing edges of each node to preserve out-strength normalization). The contours show different density levels of the point clouds associated with the original and shuffled pairs of clustering coefficients. (a) The Zhang-Horvath and continuous definitions capture the redundancy of information flow in the mouse brain (the middleman, fan-in, and fan-out motifs are higher in the original graph). However, the Onnela method does not capture this feature. (b) For the Fediverse, only fan-out and middleman motifs are significantly stronger than in the random graphs. Patterns where the original values are significantly greater than the randomized ones are marked by a + and the initial of the method (brown Z, Zhang-Horvath method; orange C, continuous method; yellow O, Onnela method). The original clustering is considered to be significantly higher (lower) if 75% of the points are above (below) the dotted identity line. The addition of * denotes one-sided fractions greater than 95%. were implemented in PYTHON and made compatible with the numbers of spurious edges with small weights. We expanded three main libraries in this language (NETWORKX, IGRAPH, and the results from previous studies comparing existing weighted GRAPH-TOOL) [28]. definitions [14–16,21], extending their definition to directed The use of a specific weighted clustering coefficient should networks and providing complete mathematical justifications be an informed choice, based on the precise questions and ob- for previous observations regarding linearity and continuity. jects of study and thus on the specific properties that a relevant Our analyses show that either of the fully weighted defini- measure should fulfill to answer these questions. We therefore tions may be preferred depending on the network properties. wish to highlight the importance of continuity as a potentially For networks with large degrees, the Zhang-Horvath defini- crucial notion for networks with highly heterogeneous weight tion may be preferred if the number of low-weight spurious distribution or numerous spurious edges with low weights. edges is very high, as it is least susceptible to noise. In In previous studies, slight variations on this notion had been networks with heterogeneous weight distributions where the introduced mathematically [15] or hinted at by the study of absolute value of triangle strength is of interest, we showed corner cases [14]. However, the continuity property was in- that the continuous definition provides more relevant results advertently considered only in particular classes of networks than the Zhang-Horvath definition. Indeed, in networks in- in Ref. [15] (disregarding, for example, small-degree cases), volving fluxes of goods or matter as well as for information erroneously marking previous methods as continuous. The processing (e.g., in brain or telecommunications networks), other study (Ref. [14]) asserted discontinuity of previous mea- one may be interested in the absolute amount that can be sures simply as a feature, without discussing its implications transferred between nodes, making nodes with small weights or ways to avoid it. Our proposal of fully continuous clus- of little relevance. A similar issue occurs for networks where tering methods based on simple mathematical principles and many nodes participate in a single triangle (see Appendix F 4) requirements fully solves the issue of continuity. In addition, making the Zhang-Horvath clustering noncontinuous. In such we show in Appendix D that these principles can be extended networks, the use of the Zhang-Horvath definition is likely to other measures such as the local closure. to assign high clustering to single-triangle nodes with low We discussed how each weighted definition is associated weights, whereas the continuous definition will not, as was with a specific interpretation of weighted clustering as a func- shown in Fig. 1. tion of the binary clustering, triangle, and triplet intensities Finally, we illustrate the usefulness of weighted cluster- (see Appendix B and Table VI for a summary). Corre- ing methods to investigate clustering in the examples of a spondingly, these interpretations are associated with specific connectome and a decentralized social network. The fully properties of each clustering coefficient. In rare cases where, weighted methods were especially suitable to reveal key dif- by design, there are edges with small weights that should be ferences in their weighted structural properties. Indeed, we treated significantly differently from an absence of connec- showed that middleman, fan-in, and fan-out patterns, charac- tion, researchers may want to check whether the properties of teristic of pathways enabling redundant information transfer the Barrat or Onnela definitions fit their needs or if they should between nodes, were overexpressed in the mouse brain. In the come up with new, more appropriate definitions. In many Fediverse, the methods revealed an unexpected discrepancy other cases, combining mathematical analysis and concrete between fan-in and fan-out modes, probably associated with examples, we asserted that fully weighted measures outper- social interaction patterns that would mandate further investi- form hybrid ones to evaluate clustering in networks with large gation. 043124-10 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) TABLE III. Undirected weighted clustering coefficients of ver- More specifically, Wang et al. [15] argue in favor of the use of tex i (solid circles) for different weight configurations. Solid lines a specific version using the harmonic mean (hm): depict edges of weight w = max(w) = 1, whereas dotted lines are 2 associated with edges with vanishing weight . Only the proposed jk 2 + w1 1 1 continuous clustering (bottom row) displays the required properties, wi j + wik jk CiM,hm = 2 , (A4) compared with the Zhang-Horvath (C Z ), Holme et al. [17] (C H ), and j=k 2 + max 1 Miyajima-Sakuragawa (hm) definitions (C M,hm ). 1 1 wi j + wik lm (wlm ) which we will refer to as the Miyajima-Sakuragawa (hm) definition. However, this definition suffers from three major shortcomings: (i) Despite what is asserted by the authors, it does not fulfill the continuity condition. (ii) It is not locally linear, meaning that two nodes that have the same neighborhood but with all weights differing by a factor λ will not have the ratio of their clustering coefficients equal to λ. ACKNOWLEDGMENTS (iii) It introduces an undesired asymmetry in the definition of the triangle intensity: For a given triangle i jk , the com- This research was funded by a Humboldt Research Fellow- puted intensity will be different for each node as it depends ship for Postdoctoral Researchers and a Sofja Kovalevskaja on which one is considered as i. Award from the Alexander von Humboldt Foundation, en- dowed by the Federal Ministry of Education and Research. We thank the anonymous reviewers for their comments, which APPENDIX B: COMPARISON OF CLUSTERING helped to significantly improve our manuscript. We acknowl- PROPERTIES edge the support from the BMBF through the Tübingen AI Multiple properties of the main definitions for the cluster- Center (FKZ 01IS18039B). ing coefficient are listed in Table IV. These properties depend on the definitions of triangle and triplet intensity that are APPENDIX A: LIMITATIONS OF OTHER FULLY recapitulated in Table V. WEIGHTED DEFINITIONS The different formulas for the intensities lead to different interpretations of weighted clustering (Table VI). The Barrat See Table III for a complete comparison of the fully and Zhang-Horvath definitions only quantify ratios of trian- weighted clustering definitions. gle and triplet strength, while the Onnela and continuous definitions are sensitive to the absolute value of the triangle 1. Holme et al. [17] intensity. The latter provides an intermediate interpretation between the Zhang-Horvath and Onnela definitions as it reacts Most studies consider the definition of Holme et al. [17] to to both the ratio of intensities and the absolute value of the be triangle intensity. jkwi j wik w jk CiH = , (A1) APPENDIX C: DERIVATION OF THE EVOLUTION maxi j (wi j ) jk wi j w jk OF HYBRID CLUSTERING COEFFICIENTS which would make it inconsistent with the binary definition. 1. Barrat method However, the discussion in their paper states that consistency Upon addition to a graph G(N, E ) of an edge (i, v) of was one of their requirements, letting us think that they actu- weight , giving G (N, E ) = E + {(i, v)}, one can compute ally meant to define it as the evolution of the initial clustering coefficient CiB to its new value CiB by rewriting the definition from Ref. [10] as in j=k wi j wik w jk Ci = H , (A2) Ref. [14]: maxi j (wi j ) j=k wi j w jk 1 wi j + wik CiB = ai j aik a jk which would make it equal to the Zhang-Horvath definition di (di − 1) j=k 2wi [12], and we will therefore not consider Eq. (A1) here. 1 wi j = ai j aik a jk . (C1) di (di − 1) j=k wi 2. Miyajima and Sakuragawa [18] Miyajima and Sakuragawa [18] define a multitude of gen- From this, we can deduce the following relationship eralized clustering coefficients based on the use of an arbitrary between the Barrat definition and the binary clustering coeffi- function h : R2 → R such that cient Cibin : jk h[h(wi j , wik ), w jk ] n,i wi /wi w Ci = M,h . (A3) CiB = = Cibin i , (C2) j=k h[h(wi j , w jk ), maxlm (wlm )] di (di − 1) wi 043124-11 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) TABLE IV. Comparison of the different properties of the different clustering coefficients. The Zhang-Horvath and continuous definitions fulfill the maximum number of desirable properties. An “X” means that the method possesses this property. Property Barrat Onnela Miyajima-Sakuragawa (hm) Zhang-Horvath Continuous Consistent with binary definition X X X X X Normalized (C ∈ [0, 1]) X X X X X All weights participate in I X X X X All weights participate equally in I X X X Linear against local scaling of the weights X X X Sensitive to weight permutations X X X X For a triangle = (i, j, k), I = IT f (w jk ) X Continuous X with n,i being the number of triangles in which node i partic- From this, one can define the evolution upon addition of an ipates and wi being the average weight associated with edges edge (i, v) of weight as connected to i and participating in a triangle. Notice that this expression also explains why the Barrat CiO = Cibin Ii O jk definition is so close to the binary clustering: For networks jk + O() O n Ii n where weights are either rather homogeneous or not strongly = correlated to triangles, wi and wi become very close as the di (di + 1) n number of triangles per node increases. n = I O + O() Using Eq. (C2), the new clustering can be defined as di (di + 1) i jk di − 1 O = C + O(). (C5) wi di + 1 i CiB = Cibin wi 3. Directed versions of the clustering coefficients n,i wi +(n,i −n,i ) n,i n,i To generalize the Zhang-Horvath definition of cluster- = di wi + di (di + 1) di +1 ing [12] for directed graphs, we use the same approach as proposed by Fagiolo [19]. These definitions are visible in n,i wi Table VII together with the directed definitions associated = + O() with Barrat et al. [10] and Onnela et al. [11], respectively di2 wi defined in Refs. [20] and [19]. di − 1 B = Ci + O(). (C3) k APPENDIX D: CLOSURE Closure was introduced in Ref. [13] as a complementary 2. Onnela method measure of clustering for binary undirected networks. As for the other definitions, the clustering from Ref. [11] can be defined as a function of the binary clustering: 1. Undirected weighted closure From the Zhang-Horvath definition of clustering, closure O n Ii jk can be generalized in a fully weighted but noncontinuous way CiO = = O . Cibin Ii (C4) di (di − 1) jk as j=k wi j w jk wki Wii3 Hi = 0 = , (D1) j=k=i wi j w jk j wi j (s j − wi j ) TABLE V. Comparison of the formula for triangle and triplet intensity among the different clustering definitions for undirected again comparing the triangle intensities with their maximum networks. possible value if all their second neighbors are also connected to them (i.e., also first neighbors) by an edge of weight 1. Definition Triangle (Ii jk ) Triplet (IT i jk ) wi j + wik TABLE VI. Comparison of the interpretations of the different Barrat ai j aik a jk di (di − 1) 2wi clustering definitions for undirected networks. Onnela (wi j wik w jk )1/3 di (di − 1) 2 2 Miyajima-Sakuragawa (hm) Definition Barrat Onnela Zhang-Horvath Continuous 1 + 1 + w + 2 1 1 1 jk wi j w ik wi j wik Zhang-Horvath wi j wik w jk w w wi bin bin Z Ii jk 2 Ii jk √ √ i j ik Formula C O Ii jk Ci Cibin Cibin Continuous 3 w w w i j jk ik wi j wik wi i ITZi jk IT i jk 043124-12 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) TABLE VII. Definitions of the Barrat, Onnela, and Zhang-Horvath intensities for each partial mode pattern in directed graphs. Column 1, pattern name; column 2, Onnela triangle intensity for node i; column 3, Barrat triangle intensity for node i; column 4, Barrat triplet intensity for node i; column 5, Zhang-Horvath triangle intensity for node i; column 6, Zhang-Horvath triplet intensity for node i. The clustering coefficients associated with each mode m are given by CiO,(m) = I,i O,(m) /nT,i (m) for the Onnela method, CiB,(m) = I,i /IT,i for the Barrat method, B,(m) B,(m) and Ci Z,(m) = I,i /I Z,(m) Z,(m) T,i for the Zhang-Horvath method. Note that, for the Barrat method, the reciprocal strength has been defined in Ref. [20] B as si,↔ = i= j 21 (wi j + w ji ). O,(m) B,(m) B,(m) Z,(m) Z,(m) Mode I,i I,i IT,i I,i IT,i 1 Cycle (W [ 3 ] )3ii 1 2 (WA2 + (WA2 )T )ii 1 (s d 2 i,in i,out + si,out di,in ) − si,↔ B (W )3ii si,in si,out − si,↔ [2] [ 13 ] 1 1 Middleman (W W [ 3 ]T W [ 3 ] )ii 1 2 (WAT A + W T AAT )ii 1 (s d 2 i,in i,out + si,out di,in ) − si,↔ B (W W T W )ii si,in si,out − si,↔ [2] [ 13 ]T [ 13 ] 2 Fan in (W (W ) )ii 1 2 (W T (A + AT )A)ii si,in (si,in − 1) (W T W W )ii (si,in ) − si,in 2 [2] 1 1 Fan out ((W [ 3 ] )2W [ 3 ]T )ii 1 2 (W (A + AT )AT )ii si,out (si,out − 1) (W W W T )ii (si,out )2 − si,out [2] This is equivalent to comparing with the situation where all or via the continuous definition, open paths of length 2 are closed into a triangle by an edge of √ 2 3 weight 1. j=k 3 w w w 2 i j jk ki W[3] Finally, it can also be defined in a continuous way as Hi,CO = c √ = 1 2 ii , (D7) j=k=i wi j w jk [ 2 ] − si,out j W ij √ 2 3 3 w w w 2 j=k W[3] √ 2 ,T 3 Hi = i j jk ki √ = 1 2ii . (D2) j=k 3 w w w k j ji ik 2 W[3] j=k=i wi j w jk [ 2 ] − si c Hi,CI = √ = 1 2 ii , j W ij j=k=i wk j w ji W [ 2 ,T ] − si,in j ij (D8) 2. Directed weighted closure √ 2 2 2 ,T For directed networks, in contrast to Ref. [29], we only 3 w w w 2 j=k i j jk ki W [ 3 ] W [ 3 ] ii c Hi,FO = √ = 1 2 , (D9) consider the extension of the undirected measure to di- j=k=i wi j w jk [ 2 ] − si,out j W rected paths (that is, we consistently consider only second ij in-neighbors via in-neighbors and second out-neighbors via √ 2 2 W [ 3 ],T W [ 3 ] ii 2 3 w w w 2 j=k k j ji ki out-neighbors, but not the out-neighbors of the node’s in- c Hi,FI = √ = 1 2 . (D10) neighbors or vice versa). Therefore we define only four j=k=i wk j w ji [ 2 ,T ] − si,in j W ij variants of the directed closure, two for outgoing paths [cycle out (CO) and fan out (FO)] and two for incoming paths [cycle APPENDIX E: NETWORK GENERATION ALGORITHMS in (CI) and fan in (FI)]. We define the weighted version either directly using the All networks were generated using the Neural Networks weights, as in the Zhang-Horvath definition for the clustering and Graphs’ Topology (NNGT) library [28]. coefficient, Wii3 1. Core-periphery network j=k wi j w jk wki Hi,CO = 0 = , (D3) j wi j (s j,out − wi j ) j=k=i wi j w jk The core-periphery network in Fig. 1 contains (i) 1 cen- 3 tral core node (CCN), (ii) 10 outer-core nodes (OCNs), and j=k wk j w ji wik (W T )ii (iii) 20 periphery nodes (PNs). Hi,CI = 0 = , (D4) j w ji (s j,in − w ji ) j=k=i wk j w ji The nodes are connected as follows: (i) The ten OCNs for a circular graph have fully reciprocal j=k wi j w jk wik (W 2W T )ii connections to their four nearest neighbors. 0 Hi,FO = = , (D5) j=k=i wi j w jk j wi j (s j,out − wi j ) (ii) The OCNs all connect to the CCN with reciprocal connections. j=k wk j w ji wki (W T W 2 )ii Hi,FI = 0 = , (D6) The weights associated with these connections are drawn j=k=i wk j w ji j w ji (s j,in − w ji ) from U (5, 10). The connections with the PNs are as follows: TABLE VIII. Some characteristic properties of the mesoscale TABLE IX. Some characteristic properties of the Fediverse mouse brain. SD, standard deviation. mesoscale network. Mean SD Median Min Max Mean SD Median Min Max In-degree 153.7 25.6 152 104 223 In-degree 21.3 82.6 4 0 2271 Out-degree 153.7 77.6 151 14 357 Out-degree 21.3 77.8 5 0 2038 CCbin tot 0.43 0.02 0.43 0.38 0.49 CCbin tot 0.68 0.37 0.86 0 1 Weight 0.076 0.36 0.038 3.9 × 10−16 20.4 Weight 0.045 0.14 0.0029 6.0 × 10−7 1 043124-13 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) FIG. 6. Distribution of in- and out-degrees (in orange and yellow, FIG. 7. Distribution of in- and out-degrees (in orange and yellow, respectively) and weights in the mouse connectome. respectively) and weights in the Fediverse instances. (i) Each OCN receives one connection from every other The original lattice L(N, k, r) has 21 Nk(1 + r) edges, lead- PN, starting with the first or second PN depending on the ing to the limit cases (i) 21 Nk edges if r = 0, like the OCN’s evenness. undirected lattice, and (ii) Nk edges if r = 1, with all con- (ii) The OCN reciprocates the connections with probability nections being reciprocal. 0.5. (iii) The PNs are connected among themselves following 3. Network properties for Fig. 2 an Erdős-Rényi scheme of density 0.05. Weights associated with connections involving PNs are In the Watts-Strogatz network used in Fig. 2, we used a drawn from U (0.05, 0.5). coordination number k = 20 and a rewiring probability p = 0.03. The Erdős-Rényi network is generated to have an average 2. Watts-Strogatz network degree of 15, whereas the scale-free network is generated via The original Watts-Strogatz network [30] consists of a reg- a degree list drawn from a power-law distribution with an ular lattice basis (characterized by a coordination number k) exponent of 1.2 and normalized to get an average degree of that is then modified, rewiring each edge with a probability 15. p. For directed networks, we used a generalization of that The weights of the Watts-Strogatz network are drawn from method implemented in NNGT which is strictly equivalent an exponential distribution of parameter 1.5 and shifted by except for the fact that edges are now directed: +0.1 so that they are all nonzero. (1) Start from a directed regular lattice with coordination The weights or the “noise” networks are either all equal number k and reciprocity r (taken as 1 in this paper). to 10−4 or drawn from a second exponential distribution of (2) Rewire each edge with probability p. parameter 0.05, shifted by +10−4 . (a) (b) FIG. 8. Different structures of local closure in the mouse brain and Fediverse. The panels compare the local closure of directed patterns for the mouse brain (a) and Fediverse (b). The original values are compared with the averages over an ensemble of graphs with the same adjacency matrix but shuffled edge weights. For the mouse brain, 10 uniformly shuffled networks; for Fediverse, 200 shuffling across all outgoing edges of a each node (to preserve out-strength normalization). (a) The local closure also shows an increase for patterns promoting redundancy of information flow in the mouse brain (fan-in and fan-out motifs are higher in the original graph). (b) For the Fediverse, only fan out is significantly stronger than in the random graphs. Patterns where the original values are significantly greater (smaller) than the randomized ones are marked by a + (−) and the initial of the method (brown Z, Zhang-Horvath method; orange C, continuous method; yellow O, Onnela method). The original closure is considered to be significantly higher (lower) if 75% of the points are above (below) the dotted identity line. The addition of * denotes one-sided fractions greater than 95%. 043124-14 WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021) TABLE X. Number of nodes (N) and edges (E) and fraction of sponding to the number of users registered on that server. single-triangle (1-T) nodes in several collaboration and gene associ- Edges are associated with four attributes: “num follows”, de- ation networks. fined as Fi j , the number of followers from i to j; “follow/size”, defined as Fi j /Si ; a “weight” given by wi j = Fi j /Fi , the ratio Network N E 1-T nodes (%) between the number of followers from i to j divided by the NetScience 1589 2742 23 total number of followers from i; and a “distance” giving the CompGeo 7343 11 898 21 Euclidean distance between the two servers on the latitude- CE-CX 15 229 245 952 8 longitude plane. See Table IX and Fig. 7. CE-HT 2617 2985 2 HS-HT 2570 13 691 7 3. Closure in the shuffled networks SC-HT 2084 63 027 5 The results obtained for the local closure confirm those obtained using the clustering coefficient for the mouse brain and Fediverse networks, as shown in Fig. 8. Namely, we see APPENDIX F: REAL-WORLD NETWORKS an overexpression of fan-in and fan-out patterns for the mouse network and a discrepancy between those two patterns for the 1. Mouse mesoscale connectome Fediverse network. The mouse mesoscale connectome network was obtained from Ref. [22] and contains 426 nodes corresponding to brain 4. Networks with a high number of single-node triangles regions of intermediate scale connected by 65 465 edges. It The presence of nodes participating in a single triangle is a symmetric version of the original network that separates is frequent in very sparse networks such as collaboration or nodes from both hemispheres. Each node is associated with gene-protein interaction networks. a “name” property composed of an abbreviated denomination Table X details several such networks: NetScience in- for the corresponding brain region, as well as a suffix (left or volves coauthorship in the network science community [31], right) corresponding to the hemisphere. Edges are associated CompGeo involves collaborations in computational geome- with three attributes: a “weight,” a “p value,” and a “distance” try [32], CE-CX is a graph of gene associations inferred (corresponding to the Euclidean distance from the source to from the coexpression pattern of two genes (based on the target node). See Table VIII and Fig. 6. high-dimensional gene expression data) for Caenorhabdi- tis elegans, CE-HT is for gene associations inferred from high-throughput protein-protein interactions for C. elegans, 2. Fediverse mesoscale network HS-HT is for gene associations inferred from high-throughput The Fediverse mesoscale network was obtained from Ref. protein-protein interactions for Homo sapiens, and SC-HT is [25] (data are available [26]). It contains 3825 nodes rep- for gene associations inferred from high-throughput protein- resenting servers (instances) that are connected via 81 371 protein interactions for Saccharomyces cerevisiae. edges. Each node is associated with two attributes: a “name”, Gene functional association networks [33] were down- corresponding to the server domain, and a “size”, Si , corre- loaded [34]. [1] O. Mason and M. Verwoerd, Graph theory and networks in [8] M. D. Fox and M. Greicius, Clinical applications of rest- biology, IET Syst. Biol. 1, 89 (2007). ing state functional connectivity, Front. Syst. Neurosci. 4, 19 [2] S. E. Ahnert and T. M. A. Fink, Clustering signatures classify (2010). directed networks, Phys. Rev. E 78, 036112 (2008). [9] J. G. Orlandi, O. Stetter, J. Soriano, T. Geisel, and D. Battaglia, [3] X.-J. Zhang, B. Gu, X.-M. Guan, Y.-B. Zhu, and R.-L. Lv, Cas- Transfer entropy reconstruction and labeling of neuronal con- cading failure in scale-free networks with tunable clustering, nections from simulated calcium imaging, PLoS One 9, e98842 Int. J. Mod. Phys. C 27, 1650093 (2016). (2014). [4] Y.-W. Li, Z.-H. Zhang, D. Fan, Y.-R. Song, and G.-P. Jiang, [10] A. Barrat, M. Barthelemy, R. Pastor-Satorras, and A. Influence of clustering on network robustness against epidemic Vespignani, The architecture of complex weighted networks, propagation, in Science of Cyber Security, Lecture Notes in Proc. Natl. Acad. Sci. USA 101, 3747 (2004). Computer Science, edited by F. Liu, S. Xu, and M. Yung [11] J.-P. Onnela, J. Saramäki, J. Kertész, and K. Kaski, Intensity (Springer International, Cham, Switzerland, 2018), pp. 19–33. and coherence of motifs in weighted complex networks, Phys. [5] S. F. Muldoon, E. W. Bridgeford, and D. S. Bassett, Small- Rev. E 71, 065103(R) (2005). world propensity and weighted brain networks, Sci. Rep. 6, [12] B. Zhang and S. Horvath, A general framework for weighted 22057 (2016). gene co-expression network analysis, Stat. Appl. Genet. Mol. [6] J.-G. Young, G. T. Cantwell, and M. E. J. Newman, Bayesian Biol. 4, 17 (2005). inference of network structure from unreliable data, J. Complex [13] H. Yin, A. R. Benson, and J. Leskovec, The local closure coeffi- Networks 8, cnaa046 (2021). cient: A new perspective on network clustering, in Proceedings [7] R. Guimerà and M. Sales-Pardo, Missing and spurious interac- of the Twelfth ACM International Conference on Web Search tions and the reconstruction of complex networks, Proc. Natl. and Data Mining (Association for Computing Machinery, Mel- Acad. Sci. USA 106, 22073 (2009). bourne, 2019), pp. 303–311. 043124-15 TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021) [14] J. Saramäki, M. Kivelä, J.-P. Onnela, K. Kaski, and J. Kertész, Ouellette, T. N. Nguyen, S. A. Sorensen, C. R. Slaughterbeck, Generalizations of the clustering coefficient to weighted com- W. Wakeman, Y. Li, D. Feng, A. Ho, E. Nicholas et al., A plex networks, Phys. Rev. E 75, 027105 (2007). mesoscale connectome of the mouse brain, Nature (London) [15] Y. Wang, E. Ghumare, R. Vandenberghe, and P. Dupont, Com- 508, 207 (2014). parison of different generalizations of clustering coefficient [23] https://fediverse.party/. and local efficiency for weighted undirected graphs, Neural [24] https://en.wikipedia.org/wiki/Fediverse#Communication_ Comput. 29, 313 (2016). protocols_used_in_the_fediverse. [16] G. Kalna and D. J. Higham, Clustering coefficients for weighted [25] M. Zignani, C. Quadri, S. Gaito, H. Cherifi, and G. P. networks, in AISB’06: Adaptation in Artificial and Biological Rossi, The footprints of a “mastodon”: How a decentralized Systems, edited by T. Kovacs and J. A. R. Marshall (Society architecture influences online social relationships, in IEEE Con- for the Study of Artificial Intelligence and the Simulation of ference on Computer Communications Workshops 2019 (IEEE, Behaviour, Bath, UK, 2006), Vol. 3, pp. 132–138. Piscataway, NJ, 2019), pp. 472–477. [17] P. Holme, S. M. Park, B. J. Kim, and C. R. Edling, Korean [26] https://dataverse.mpi-sws.org/dataset.xhtml?persistentId=doi: university life in a network perspective: Dynamics of a large 10.5072/FK2/AMYZGS. affiliation network, Phys. A (Amsterdam) 373, 821 (2007). [27] https://joinmastodon.org/. [18] K. Miyajima and T. Sakuragawa, Continuous and robust [28] T. Fardet, NNGT 2.5.1, 2021, doi: 10.5281/zenodo.5571901. clustering coefficients for weighted and directed networks, [29] H. Yin, A. R. Benson, and J. Ugander, Measuring directed arXiv:1412.0059 [physics.soc-ph]. triadic closure with closure coefficients, Network Sci. 8, 551 [19] G. Fagiolo, Clustering in complex directed networks, Phys. Rev. (2020). E 76, 026107 (2007). [30] D. J. Watts and S. H. Strogatz, Collective dynamics of ‘small- [20] G. Clemente and R. Grassi, Directed clustering in weighted world’ networks, Nature (London) 393, 440 (1998). networks: A new perspective, Chaos, Solitons Fractals 107, 26 [31] https://networkrepository.com/netscience.php. (2018). [32] http://vlado.fmf.uni-lj.si/pub/networks/data/collab/geom.htm. [21] I. E. Antoniou and E. T. Tsompa, Statistical analysis of [33] A. Cho, J. Shin, S. Hwang, C. Kim, H. Shim, H. Kim, H. weighted networks, Discrete Dyn. Nat. Soc. 2008, 375452 Kim, and I. Lee, WormNet v3: A network-assisted hypothesis- (2008). generating server for Caenorhabditis elegans, Nucleic Acids [22] S. W. Oh, J. A. Harris, L. Ng, B. Winslow, N. Cain, S. Mihalas, Res. 42, W76 (2014). Q. Wang, C. Lau, L. Kuan, A. M. Henry, M. T. Mortrud, B. [34] https://networkrepository.com/bio.php. 043124-16