Weighted directed clustering: Interpretations and requirements for heterogeneous, inferred, and measured networks

Authors Anna Levina, Tanguy Fardet,

Plaintext

PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

Weighted directed clustering: Interpretations and requirements for heterogeneous,
inferred, and measured networks

Tanguy Fardet and Anna Levina
University of Tübingen, Tübingen 72076, Germany
and Max Planck Institute for Biological Cybernetics, Tübingen 72076, Germany

(Received 8 June 2021; accepted 18 October 2021; published 17 November 2021)

Weights and directionality of the edges carry a large part of the information we can extract from a complex
network. However, many network measures were formulated initially for undirected binary networks. The
necessity to incorporate information about the weights led to the conception of multiple extensions, particularly
for definitions of the local clustering coefficient discussed here. We uncover that not all of these extensions
are fully weighted; some depend on the degree and thus change a lot when an infinitely-small-weight edge is
exchanged for the absence of an edge, a feature that is not always desirable. We call these methods “hybrid”
and argue that, in many situations, one should prefer fully weighted definitions. After listing the necessary
requirements for a method to analyze many various weighted networks properly, we propose a fully weighted
continuous clustering coefficient that satisfies all the previously proposed criteria while also being continuous
with respect to vanishing weights. We demonstrate that the behavior and meaning of the Zhang-Horvath
clustering and our proposed continuous definition provide complementary results and significantly outperform
other definitions in multiple relevant conditions. We demonstrate, using synthetic and real-world networks, that
when the network is inferred, noisy, or very heterogeneous, it is essential to use the fully weighted clustering
definitions.

DOI: 10.1103/PhysRevResearch.3.043124

I. INTRODUCTION stressed in multiple studies [1–4]. This is notably the case
for the middleman motif, which is a marker of feedforward
The clustering coefficient (CC) was originally introduced
loops in transcriptional networks and of information trans-
for binary undirected networks to quantify strong connect-
fer redundancy, e.g., in neuroscience. More generally, such
edness within a local neighborhood. It was defined as the
motifs will influence the evolution of dynamical processes
fraction of all possible triangles that were realized, i.e., the
on the networks, for instance, synchronization patterns, and
ratio between all triangles in which node i participates (n,i )
have been shown to characterize families of networks such as
and the total number of triangles that could theoretically be
transcription or language networks [2]. Finally, clustering is
made given its degree di , which is the number of triplets (nT,i ):
used in other measurements to access the small-world propen-
n,i n,i sity of networks [5], and the choice of a specific definition
Cibin = = . (1)
nT,i di (di − 1) can therefore influence whether the network of interest will
register as small world or not.
From a neighbor-centric perspective, it can be seen perhaps In many applications, network topology and weights are
more intuitively as the probability that two neighbors of a measured only up to a certain precision [6,7]. For example, in
node are connected. However, as network science expanded, neuroscience, the functional connectivity networks are mea-
more and more graphs were encountered, where directedness sured using the indirect inference of connections from the
and edge weights play a central role. Generalizations of the recorded activity [8,9]. Accepting the inevitability of noise in
clustering coefficient were therefore introduced to account a network brings forward new requirements on the network
for asymmetry in the connections between pairs of nodes or measures, namely, that they are stable to the noise and do
heterogeneity in their strength. not change dramatically if the weights are perturbed or weak
The importance of clustering, including its directed vari- connections are randomly omitted.
ants, to understand complex dynamics on networks has been There is no agreement among researchers as to which
weighted extension of the clustering coefficient definition is
most appropriate. The three predominantly used methods at
the moment [10–12] differ in many properties of their def-
Published by the American Physical Society under the terms of the initions. Part of the reason for the absence of a single best
Creative Commons Attribution 4.0 International license. Further weighted clustering lies in a different interpretation of weights
distribution of this work must maintain attribution to the author(s) in various data sets. Consequently, a different weighted exten-
and the published article’s title, journal citation, and DOI. Open sion might be most appropriate for various data and specific
access publication funded by the Max Planck Society. scientific questions. However, to understand which method to

2643-1564/2021/3(4)/043124(16) 043124-1 Published by the American Physical Society
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

use when and why, we need to understand their differences factor [the global max(w)] as long as the normalization con-
precisely. dition is fulfilled since only the normalization matters. We
The difficulty of extending graph measures to weighted omitted the last two conditions of Saramäki et al.’s paper
networks is not specific to the clustering coefficient but can (invariance to weight permutation and ignorance of weights
occur whenever ratios of degrees or path length are involved. not participating in any triangle). Although they might be of
We will therefore also discuss a second clustering-related interest for some specific applications, we do not consider
measure, called the closure coefficient and introduced as the them to be generally desired properties for a clustering co-
fraction of all open walks of length 2 starting from node i that efficient. Indeed, the clustering coefficient is a node-centric
are part of a triangle [13]. This will also enable us to discuss measure, so there is no reason to expect invariance when the
the complementarity of closure and clustering as the former weights of a node’s edges change. As for edge ignorance, in
provides an important complement to analyze the tendency of the original classifying paper [14] this property is introduced
nodes to form 3- and 4-cliques. as a particular feature of the Onnela et al. [11] definition, with-
We introduce here a distinction between fully weighted out reasons why it may be a desired condition. We also did
and hybrid definitions and discuss why, for several classes not require that all weights in a triangle should be accounted
of networks, fully weighted and directed definitions should for because this condition is necessarily met if the continuity
be preferred to other clustering definitions that are currently condition is fulfilled.
used for network analysis. We also propose a definition that Continuity can be expressed mathematically as follows: for
obeys additional conditions, including continuity of the results a graph G(V, E ), if a weighted edge (u, v, w) with u, v ∈ V
with respect to infinitesimal changes in edge weights, which and weight w ∈ R is added to this graph to form a new graph
has significant consequences for the resilience to noise in in- G (V, E ), with E = E + {(u, v, w)}, then the clustering mea-

ferred networks. We demonstrate why fully weighted methods sure is continuous if and only if ∀i ∈ V , Ci(G ) −−−→ +
Ci(G) .
are essential for measured and inferred networks, which are w→0
This condition in crucial to ensure a reasonable behavior of
pervasive in biological fields such as neuroscience, and for
the clustering coefficient in inferred networks.
networks dealing with flows of information, money, or goods
Though some definitions of previously proposed weighted
that display a very broad weight distribution.
clustering coefficient definitions obey most of the required
properties, none of them completely fulfill the continuity
condition—despite previous claims [15]. This is why we will
II. INTERPRETATION AND PURPOSE OF WEIGHTED later propose a definition that fulfills all aforementioned con-
CLUSTERING ditions. An extensive comparison of the properties fulfilled by
A. Desired properties of weighted clustering coefficients different clustering definitions can be found in Appendix B.
Weighted measures are crucial for many network types
where the binary connectivity is either uninformative (fully
connected network) or displays similar or lower heterogeneity B. State of the art for weighted clustering
compared with the weighted structure. In this paper, we focus First we introduce and classify the existing weighted clus-
on two classes of real-world networks: inferred or measured tering coefficient definition. For all clustering definitions in
networks where there can be a large number of spurious the main text, we use the following notation: A is an adja-
(false positive) edges with small weights; and networks asso- cency matrix, and W = {wi j } is the normalized weight matrix,
ciated with flows of information or goods, which often display obtained from the original weight matrix W̃ = {w̃i j } by wi j =
broad weight distributions. This is notably the case for many w̃i j / maxi, j (w̃i j ).
networks in neuroscience, and more generally in informa- Hybrid definitions were the first extensions of the binary
tion, transportation, or other social and economic networks. clustering coefficient definitions. They combine properties
Weights are essential to understand the dynamical processes associated with a weighted connectivity matrix (i.e., the in-
that occur in these networks, requiring measures that go be- tensity of the triangle) with properties that could be already
yond the binary structure. obtained from an adjacency matrix (i.e., node degrees). These
There could be multiple requirements for weighted clus- definitions move from an integer counting the number of trian-
tering coefficients [14] depending on the particular question gles (n ) to a sum of real numbers (computed as a function of
of interest and on the network properties. The main require- edge weights) that we call “intensities” of triangles (I ). The
ments that we considered necessary for a weighted clustering choice of a particular function for the intensity of the triangles
coefficient are (i) normalization (Ci ∈ [0, 1]), (ii) consistency determines the properties of the clustering coefficient.
with the binary definition (for binary networks, it should Two popular hybrid weighted clusterings were given by
give back the classical result), (iii) linearity (scaling by α all Barrat et al. [10] and Onnela et al. [11], which we will refer
edges involving node i and all edges in triangles including to as the Barrat and Onnela definitions, respectively.
node i scales Ci by α), and (iv) continuity (weak influence of For a node i in a graph, the Barrat definition [10] quantifies
the addition or deletion of edges having very small weights, the fraction of the node’s strength that is invested in triangles
meaning that an edge with infinitesimally small weight should (see Appendix C 1 for more details):
be equivalent to the absence of that edge).
Compared with a previously proposed list of conditions
given by Saramäki et al. [14], we added a continuity condition (WA2 )ii w
CiB = = Cibin i , (2)
but did not include a requirement of a specific normalization 2si (di − 1) wi

043124-2
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

where si = j=i wi j is the strength of the node, wi is TABLE I. Limit values for undirected weighted clustering coeffi-
cients of vertex i (solid circles) for different weight configurations in
the average weight of the edges involving i, and wi =
wi j +wik graphs with vanishing weights. Solid lines depict edges of weight
j=k 2n,i ai j aik a jk is the average weight of edges involving w = max(w) = 1, and dotted lines denote edges with vanishing
i that are part of a triangle computed over all triangles in weight . Only the proposed continuous clustering (bottom row)
which node i participates. In terms of triangle intensity, this returns values consistent with the continuity condition, whereas the
definition was originally written as Barrat (C B ), Onnela (C O ), and Zhang-Horvath definitions (C Z ) devi-
wi j +wik ate for it.
j=k 2
ai j aik a jk
Ci =
B
2si (di − 1)
1 wi j + wik
= ai j aik a jk , (3)
di (di − 1) j=k 2wi

thus defining the intensity of triangle i jk as Ii B
jk =
wi j +wik
2wi
ai j aik a jk as the function of two of the triangle’s weights
and the average weight of the edges connected to node i, wi .
Proposed a bit later, the Onnela definition [11] scales the Z (max)
Ii = ITZi jk = wi j wik if all existing triplets Ti jk were closed
binary clustering by the average intensity of the triangles (see jk
by an edge of weight 1 (the maximum possible weight in the
Appendix C 2 for more details):
normalized network). This way, if i is involved in a single
1 3 triangle, the clustering coefficient is equal to the weight of
W [ 3 ] ii
Ci =
O
= Cibin Ii
O , (4) the edge closing the triplet centered on i (see also Table I).
di (di − 1) jk
Though this definition does not fulfill the continuity
jk = (wi j wik w jk )
O 1/3 property, we will show that it still provides a consistent in-
with the triangle intensity defined as Ii
and the average intensity taken over all triangles in which i terpretation of weighted clustering, as discussed in Ref. [16],
participates. and is well suited to tackle networks with a large fraction of
For all hybrid methods the denominator relies on the node’s false positives.
degree, meaning that the addition or deletion of edges will Other fully weighted definitions that have been proposed
always significantly affect the clustering coefficient even if and discussed since [15,17,18] do not bring significant ad-
the edge has an infinitely small weight. Such methods can ditions compared with the Zhang-Horvath definition while
thus lead to inaccurate results when applied to the inferred net- actually losing some of its properties and its straightforward
works, where a significant fraction of edges are false positives interpretation. They are not considered further in this paper;
with small weights. In the following we will also demonstrate see Appendix A for further explanations.
that they cannot reliably detect the most strongly clustered
nodes in structured networks. Only fully weighted definitions C. A continuous definition for weighted clustering and closure
can rise up to these challenges.
For an undirected graph G, we define the proposed contin-
Fully weighted definitions are variants of the clustering
uous clustering of node i as
coefficient that do not include any binary measures (anything
2 √ 2
that can be derived from the adjacency matrix alone, e.g., j=k Ii jk j=k ( wi j w jk wik )
3

degrees). In addition to substituting the number of triangles Ci = = √ . (6)
j=k IT i jk j=k w ji wik
by the sum of triangle intensities, they also move away from
counting triplets—defining the maximum number of possible We define the weighted intensity of triangles and triplets,
√ √
triangles max(n,i ) = nT,i = di (di − 1). Instead, they intro- Ii jk = 3 wi j w jk wik and IT i jk = w ji wik , respectively, using
duce the triplet intensity IT such that, for a node i, IT,i is a the geometric mean of the weights involved. Because of this,
real-valued function of the weights associated with i. one strong weight in a triangle or triplet cannot compensate
One of the first fully weighted definitions for the cluster- the presence of smaller weights, in contrast to what may
ing coefficient was provided by Zhang and Horvath [12] to happen if one uses the arithmetic mean. This provides the
analyze gene coexpression networks: desired property that the intensity of triangles and triplets will
Z
go to zero if even a single edge weight goes to zero. Note
j=k wi j wik w jk Ii jk bin that, though the triangle intensity is defined as the geometric
Ci =
Z
= Ci , (5)
w w
j=k i j ik
Z
IT i jk mean of the three weights involved, it is the square of this
intensity that is used in the definition. The reason for this
which we will refer to as the Zhang-Horvath definition. Note choice is twofold: to assign higher influence to large triangles
that the fact that the definition can be expressed as a function and to ensure the linearity of the coefficient. Importantly, this
of the binary clustering does not contradict the fully weighted definition of the triangle intensity I assigns the same role to
nature of the measure, as it stems from a simple recombination all participating edges. The proposed clustering could be in-
of the terms. terpreted as the ratio of the triangle intensity that is invested in
This definition can be interpreted as the ratio of strong triangles (given by the sum of the squared intensities of
jk = wi j wik w jk of the triangles
Z
the summed intensities Ii triangles, which increases the importance of strong triangles)
i jk = (i, j, k) to the maximum possible summed intensities to the triplet intensity, which would represent the maximum

043124-3
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

TABLE II. Definitions of the continuous intensities for each partial mode pattern in directed graphs. Column 1, pattern names; column 2,
pattern illustration; column 3, number of triangles for node i; column 4, number of triplets for node i; column 5, continuous intensities of the
triangles for node i; and column 6, continuous intensities of triplets for node i. The clustering coefficients associated with each mode m are
given by Ci(m)bin = n,i
(m)
/nT,i
(m)
for binary networks and Ci(m) = I,i /IT,i for the continuous definition.
(m) (m)

possible triangle intensity if all weights connecting adjacent tained via the formula
nodes were equal to 1.
I,i
The proposed continuous definition fulfills all the condi- Cg = i . (9)
tions we put forward above, and to the best of our knowledge i IT,i
it is the only one to do so. Similarly to previous definitions, the
proposed clustering coefficient can also be rewritten in terms D. Directed weighted clustering
of node properties: Fagiolo [19] proposed how to generalize clustering to di-
2 3 rected networks. He defined the different patterns or motifs
W[3] (shown in Table II) that can exist in these networks and
Ci = 1 ii , (7) adapted the Onnela definition [11] as the first weighted di-
[ ] 2
si 2 − si rected clustering coefficient. The Barrat definition [10] was
generalized to directed graphs in Ref. [20] following the same
where si = j=i wi j is the normalized strength of node i distinction into Fagiolo’s cycle, middleman, fan-in, and fan-

and W [α] = wiαj and si[α] = j=i wiαj are the fractional weight out motifs.
matrix and strength for any α ∈ R. Similarly, the Zhang-Horvath definition [12] and the con-
As for the previous definitions, the continuous clustering tinuous definition can be generalized in a straightforward
can be interpreted as a function of intensities and the binary manner for directed networks. For this, we only need to
clustering: redefine the intensities of each directed triangle and triplet
motif (as shown in Table II for the continuous definition and
2 2 2
n Ii jk Ii jk Var(Ii jk ) + Ii jk bin in Table VII for the Zhang-Horvath definition). This simply
Ci = = Cibin = Ci , (8) requires replacing A with W in all expressions of n,i (m)
and
nT IT i jk IT i jk IT i jk
(m)
a with w in all expressions of nT,i . As the total directed
with means Ii jk and IT i jk and variance taken over all the clustering is defined as the sum of all modes, we can write
triangles or triplets the node i participates in, respectively. it as
In the limit where all triangles associated with node i have Z (tot)
I,i (W + W T )3ii
similar intensities, we can neglect the variance term, leaving Z (tot)
Ci = Z (tot) = . (10)
IT,i j=k (wi j + w ji )(wik + wki )
2
Ii jk
C bin . In this case, in contrast to the Zhang-Horvath def-
IT i jk i
inition [Eq. (5)], the absolute value of the intensity matters, Finally, the continuous clustering can be extended for each
not only its ratio to the maximum possible intensity. For a directed mode (see Table II), and for the total directed cluster-
given average triangle intensity, the positive contribution of ing, this leads to
the variance implies that nodes with more variable intensities, 2 2 3
i.e., at least one triangle with a high intensity, will have higher (tot)
(tot)
I,i 1
W [ 3 ] + W [ 3 ]T ii
Ci = (tot) = 1 2
, (11)
clustering coefficients than nodes with identical triangles of [2] 2
IT,i si,tot − si,tot − 2si↔
average intensity.
Finally, the global clustering can also be defined in a [ 21 ]
with si,tot = j (wi j + w ji ) being the total strength, si,tot =
straightforward
For simplicity, we define I,i =
fashion. 1 1
↔
j=k Ii jk and IT,i =
2
j=k IT i jk , leading to Ci = I,i /IT,i .
(w + w ji ) being the total root strength, and si =
2 2
j √ij
Using this definition, the continuous global clustering is ob- j wi j w ji being the reciprocal strength.

043124-4
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

(a)

(b)

FIG. 1. Only the continuous clustering coefficient uncovers the true structure in the weighted core-periphery network. A network has 11
strongly connected core nodes (black edges) that interact with well-clustered periphery nodes with weaker connection strengths (light-brown
edges); see Appendix E 1 for details on the network. (a) Graphical view of the network; the edge width gives the strength of the connection,
and the node color gives its clustering coefficient. (b) Distribution of clustering coefficients for the three types of nodes over ten realizations of
such a core-periphery network. Only the continuous definition differentiates between the central, the outer-core, and the periphery nodes. In all
the other methods, the clustering coefficients of the 10 “outer-core” and 22 periphery nodes overlap: The Onnela definition only distinguishes
the central node, whereas the Barrat and Zhang-Horvath definitions do not hint at a core-periphery structure.

As for the undirected case, the global clustering coefficient A. Sensitivity to weight-encoded topological features
associated with each
(m) directed
(m)pattern can be obtained via the Here, we investigate how weighted structures can be de-
formula Cg(m) = i I,i / i IT,i . tected or missed using different clustering definitions. As
an example we consider weighted core-periphery graphs in
III. THE ADVANTAGES OF FULLY WEIGHTED Fig. 1. There, core nodes are characterized by both a dense
DEFINITIONS binary connectivity and large weights, whereas periphery
In this section we discuss the sensitivity to the weight- nodes display both sparser connectivity (though they still
encoded topological features and stability to noise in network have large degrees) and weaker weights; for more details, see
measurements of the different clustering methods. A previous Appendix E 1. We generate ten realizations of the network
study [14] already noted the fact that previous definitions did and consider distribution of clustering coefficients of differ-
not fulfill the continuity condition by analyzing the behavior ent types of nodes. Continuous definition leads to distinct
of the different coefficients for nodes that participate in a clustering of different types of nodes, making the true struc-
single triangle. Table I illustrates some of these cases and ture of the network clearly visible already in the clustering
shows that the proposed continuous definition is the only one distributions. Because of their hybrid nature, the Barrat and
to behave as expected. Onnela definitions cannot capture this underlying structure
Yet we note that the Zhang-Horvath definition is also very as it is mostly encoded in the weights and not at all in the
resilient to noise because, except for the corner cases asso- degrees (core nodes are not binary hubs). Though it is purely
ciated with single triangles, its behavior is continuous in all weighted, the Zhang-Horvath definition is also not suited to
other situations. Moreover, contrary to what was asserted in detect this type of weighted structure because its interpretation
Ref. [14], it provides a perfectly sensible behavior given its of a node’s clustering only accounts for the relative triangle
interpretation of clustering as the ratio of the triangle intensity intensity given the node’s weights.
The continuous clustering is sensitive to any topological
jk = wi j wik w jk to its maximum possible intensity given the
Z
Ii
Z (max) property that is encoded via specific weight distributions.
weights of node i: Ii jk = ITZi jk = wi j wik if w jk = 1. Furthermore, in contrast to the Zhang-Horvath definition, it
Because the Barrat and Onnela definitions [10,11] are the accounts not only for the ratio of the triangle intensity over
most well known and (to the best of our knowledge) the only the triplet intensity (how strong the triangles are compared
methods implemented in popular graph libraries, we restrict with the maximum possible value given the node’s weights)
our comparison to the Zhang-Horvath definition and these but also for the absolute value of the intensity: A weak tri-
two definitions. A more comprehensive discussion of other angle, even if it corresponds to the highest possible value
definitions of the weighted clustering coefficient can be found given the node’s weights, will decrease the node’s clustering
in Appendix A.

043124-5
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

(a) (b) (c) (d)

(e) (f) (g) (h)
(a) (c)
(a) (d)
(b) (c)
(b) (d)

FIG. 2. Fully weighted methods are less sensitive to spurious edges. A “measured network” can be represented as the union of a “ground
truth” (G.T.)—here, a Watts-Strogatz network in dark brown—and spurious small-weight connections (“noise” graphs with (a) random [Erdős-
Rényi (ER)] or (b) scale-free (SF) connectivity, in red). We assess the influence of the weight distribution of spurious connections (red) by
checking weights that are (c) all equal and small or (d) following an exponential distribution and overlapping with the real weights (dark
brown). (e)–(g) Ground-truth clustering distribution (filled dark brown) compared with the distributions associated with the measured networks
for each method (dashed lines). Weight and noise types, combining noise shown in (a) or (b) with weight shown in (c) or (d), which we refer
to as conditions “(a)+(c),” “(a)+(d),” “(b)+(c),” and “(b)+(d),” are associated with colors from brown to orange in the same order as in
(h). Distributions associated with the exponential noise [conditions (a)+(d) and (b)+(d)] differ most from the original distributions for the
Zhang-Horvath and continuous clusterings [(e) and (f)] and are broader for the Onnela clustering (g). (h) Correlation between the ground-truth
clustering and clustering in measured networks for indicated spurious edge topology and weights. Fully weighted clusterings retain most of
the correlation for condition (a)+(d), with R2 > 0.55, and only lose the original information for condition (b)+(d). The results were obtained
for ten realizations of the spurious edges; error bars give confidence intervals. Network properties are detailed in Appendix E. Z, C, O, and B,
Zhang-Horvath, continuous, Onnela, and Barrat definitions, respectively.

in the continuous definition whereas it increases it with the We illustrate the impact of spurious edges on measured
Zhang-Horvath method. In that sense, the continuous clus- clustering coefficients using the example of Watts-Strogatz
tering provides a more global evaluation of the clustering small-world networks. We consider different topologies for
coefficient compared with the Zhang-Horvath definition that the subnetwork formed by the spurious edges: either an
provides more local information. Erdős-Rényi random network [Fig. 2(a)], associated with un-
The Barrat definition has several limitations because it is correlated noise, or a scale-free network [Fig. 2(b)], which
close to being weight insensitive [11,21]. It is particularly would correlate noise with certain nodes in the network.
unsuitable for assessing networks with a potentially large Additionally, weights on the spurious edges could be much
number of low-weight spurious connections or very hetero- smaller than the weight of the actual edges [Fig. 2(c)] or
geneous weight distributions. For this reason, we will mostly have an overlapping distribution [Fig. 2(d)]. We refer to the
leave this clustering aside in the rest of this paper. combinations of these noise and weight types as conditions
“(a)+(c),” “(a)+(d),” “(b)+(c),” and “(b)+(d).” Both fully
B. Continuity and resilience to noise weighted methods are unaffected by low noise [conditions
The stability of a network measure to noise is of particular (a)+(c) and (b)+(c)] and are also less influenced by the spuri-
importance for networks that are obtained via experimental ous edges when their weights are large enough to overlap with
measurements since these are often subject to noise and sta- the real weight distribution [conditions (a)+(d) and (b)+(d)].
tistical biases, notably for inferred networks. Methods abiding On the other hand, because hybrid methods explicitly depend
by the continuity condition are especially resilient to the pres- on the nodes’ degrees, they are very susceptible to the pres-
ence of low-weight spurious edges. Violation of continuity ence of spurious edges [Figs. 2(g) and 2(h)].
can have significant and pervasive consequences for inference The difference in behavior between the methods can be
of network properties in many network structures. We have easily explained by a first-order expansion. We consider
already seen the simple examples in Table I; here, we demon- change in the clustering coefficient of a node i with degree
strate that they are not just corner cases, but occur in larger, di after addition of a spurious edge e = (i, v) with weight
real-world networks. 1. For the Barrat and Onnela methods, the new clustering

043124-6
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

coefficient becomes spurious that is lower than 1%. We consider a thresholding
1 procedure, where at each level of the threshold (pmax ), only
O
C
I,i +O 3 di − 1 O 1 →0 the edges with smaller p values are kept. This procedure can
Ci = = Ci + O 3 CiO ,
di (di + 1) di + 1 be seen as an attempt to remove spurious edges, though no
+wik correct threshold level is known. Compared with the hybrid
I,i + k 2 avk aik
B
di − 1 B →0
CiB = = Ci + O() CiB , method, both fully weighted definitions are much less sensi-
IT,i + si +
B di tive to thresholding [Figs. 3(b) and 3(c)]. Thus we see that
meaning that, for both methods, the coefficients will deviate the resilience to noise we showed analytically and on toy
networks is relevant for a real-world network. Furthermore,
from the original clustering CiO/B by a noninfinitesimal value,
the resilience is not limited to the general shape of the dis-
even when the perturbation was infinitesimal; see Appendix C
tribution but indeed preserves the precise values and ranks:
for a complete derivation.
A larger fraction of the nodes displaying high clustering in
On the other hand, the continuous clustering becomes
the full graph are still among the highest ranking nodes in
2
the thresholded graphs when fully weighted methods are used
I,i + k∼i ( wvk wki ) 3
Ci = compared with the hybrid method [Figure 3(e)].
[ 1 ]√
IT,i + 2si 2 The clustering coefficient is a network measure that cap-
[ 1 ]√ tures features beyond purely local parameters, such as degree
2si 2 2 or strength. However, as we see for the mouse connec-
= Ci 1 − +O 3
IT,i tome, the hybrid method strongly correlates with the average
√ weight associated with a node i: si /di . At the same time,
= Ci + O( ) −−−→ +
Ci , (12) the fully weighted definitions are much less correlated with
→0
it [Fig. 3(d)]. As a result, they can bring more independent
showing only an infinitesimal deviation to the similarly in- information regarding the weighted network structure than the
finitesimal perturbation. hybrid method. This trend is even stronger for the Zhang-
Similarly, except for the single-triangle cases discussed in Horvath definition, which does not account for the absolute
Table I, the Zhang-Horvath clustering becomes intensity of triangles. In contrast, the continuous definition
provides some intermediate behavior as the intensity of trian-

Z
I,i + O()
CiZ = = CiZ + O() −−−→ CiZ . (13) gles often correlates, if only in part, with the average weight
Z + O()
IT,i +
→0 associated with the node.
Finally, though the continuous and Zhang-Horvath defini-
It is worth noting that one continuity issue, associated with
tions often provide somewhat similar results, they may differ
nodes participating in only one triangle, occurs for all defini-
significantly, e.g., for the cerebellar cortex in Fig. 3(f). Com-
tions but the continuous one. Since this situation is pervasive bining the results of both methods can thus be informative,
in networks with low degree or binary clustering, using the for example, to single out nodes that possess only weak con-
continuous clustering definition can be of particular impor- nections (and will therefore register as weakly clustered for
tance in such cases; see Appendix F 4. the continuous definition) yet connect to other nodes that are
strongly connected (thus registering as strongly clustered for
IV. APPLICATION TO REAL-WORLD NETWORKS the Zhang-Horvath definition).
A. Mouse mesoscale connectome
B. Decentralized social media: The Fediverse
In neuroscience, the networks on different scales give a
vital piece of information to understand the brain better. Un- The Fediverse [23] is a set of federated social media that
surprisingly, connectomics, the mapping of the connections in can communicate via a collection of common protocols [24],
the nervous system, has gained significant attention and devel- the most well known being ActivityPub. This network can be
oped dramatically over the last years. Most of the networks in seen as a set of alternatives to corporate platforms such as
neuroscience are weighted, and the obtained connectivities are Facebook or Twitter. Social media on the Fediverse usually
either measured or inferred, making them a typical example promote ideas of decentralization, interoperability, free/libre
for the challenges discussed above. The mouse mesoscale and open-source software (FLOSS), and the absence of algo-
connectome [22] is a fascinating example of such networks, rithmic filters in favor of human curation and moderation.
both because it provides information about the entire mouse We analyze here a snapshot of this network that was ob-
brain and because it contains an evaluation of the probability tained in 2018 by Zignani et al. [25] (data are available [26]).
of false positives for all connections. In contrast to the original publication, we chose here to look
Here, we investigate how the choice of the clustering co- at the mesoscale level, i.e., at connections between instances
efficient definition can alter the results. The network is very (the equivalent of a community on the Fediverse, where at
inhomogeneous, with broadly varying degrees [Fig. 3(a)]. The least one, but up to several thousand users can have an ac-
edges in the mesoscale connectome are assigned p values count). This mesoscale view leads to a network of weighted
that quantify their probability to correspond to a real physical directed interactions between communities of strongly con-
connection (p denotes the probability of the connection to be nected users. Indeed, users of a single instance (the technical
a spurious edge). They have therefore different significance name for a community on the Fediverse) can see and interact
levels, with only 13% of all edges having p values smaller with all public messages posted by other members on that
than 0.01, i.e., only 13% of all edges have a probability to be same server. At the same time, they can only see a subset of

043124-7
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

FIG. 3. For the mouse connectome, different clustering methods give significantly different results. (a) Top view of the mouse brain; node
size indicates the total degree, and node color indicates the out-degree (lighter colors for higher degrees). (b) Distribution of the total clustering
coefficients if only edges with the p value p < pmax are preserved. There are smaller changes in clustering distribution for fully weighted
definitions than for the Onnela definition. From pale yellow to dark gray, thresholding keeps 9, 13, 32, and 100%, respectively, of the original
network. (c) The fraction of the nodes with the highest total clustering (top 5, 10, and 25%) that are preserved across all subsamplings in (b) for
the Zhang-Horvath (brown), continuous (orange), and Onnela definitions (pale yellow). (d) Correlation of the three total clustering definitions
with average total node weight (si /di ) shows that fully weighted definitions capture additional information beyond degree and strength. (e)
Fraction of the 10% highest clustering nodes that are common between two of the definitions (filled markers in the right panel include the
central region of the left panel) or among all three definitions (black crosses, central region) as shown in the Venn diagram. (f) Clustering ranks
of the areas within brain regions (showing which regions contain nodes with high clustering coefficients) can significantly vary depending on
the definition (Zhang-Horvath, brown; continuous, orange; and Onnela, pale yellow).

the posts from people on other servers (either because they 3825 nodes represent more than half of the entire network
follow their author or because other members of the instance and are connected via 81 371 edges. A chord diagram of
follow the author or shared this specific post). connections between locations shows that, besides the fact
For each instance I1 , an edge towards another instance I2 that the position of most instances is unspecified (UNS), the
means that at least one user on I1 follows at least one member largest hosting countries are Japan, the United States, and
of I2 . The precise value of the weight associated with this France, with Japanese and French communities interacting
edge gives the fraction of all followers from the source in- mostly among themselves [Figs. 4(a) and 4(b)]. The network
stance that are associated with members of the target instance. is both very sparse and strongly heterogeneous, with a median
The weights are thus a proxy to characterize the fraction degree of 5 but node degrees and sizes varying over 3–5 orders
of the community’s attention that is associated with content of magnitude. This broad distribution has notable implications
produced by another community; this means of course that, for different clustering definitions. For the Zhang-Horvath
for each node, its outgoing strength so is equal to 1. Note definition, it increases the likelihood of running into the corner
that in this network, information flow occurs in the direction cases of single triangles, increasing the average clustering
opposite to that of the edge, because the directed edge denotes compared with the other methods. For the Onnela definition,
that the source is paying attention to what the target posts. it strongly correlates clustering values to the degree (and thus
The network snapshot contains 3825 nodes corresponding to the size) of the instance.
mostly to instances running Mastodon [27], one of the most As for the mouse network, all weighted methods lead to re-
prominent microblogging platforms on the Fediverse. These sults that differ strongly from the binary clustering [Figs. 4(d)

043124-8
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

(a) (b)

FIG. 4. Properties of the network of Fediverse instances. (a) Chord diagram of connections between locations (DEU, Germany; FRA,
France; JPN, Japan; UNS, unspecified; USA, United States of America). (b) Spatial representation of the network showing connections that
amount to at least 1% of the strongest connection; nodes are placed on each country’s capital, and their sizes represent the number of instances
hosted in that country. The zoom on Europe shows all connections in the European subnetwork. (c) The count of users per instance follows a
heavy-tailed distribution. (d) The network displays strong structural clustering, most nodes with nonzero in- and out-degrees displaying binary
clustering values close to 1, whereas the expected value for an Erdős-Rényi network with the same number of edges would be almost zero
(dotted line). (e) The median values of the fan-out clustering for different instance sizes show that the strong heterogeneity of the network can
have a notable influence for the Onnela method (yellow), whereas the Zhang-Horvath (dark brown) and continuous methods (orange) display
much weaker correlation with the size of the instance. (f) The fan-in motifs have intensities that are several orders of magnitude lower than fan
out and display weaker dependency on the instance size. The binary (gray) clustering coefficient misses the difference between fan out and fan
in, which predominantly relies on the weights’ effect.

and 4(e)]. Some of the results from the hybrid method tend to the brain is indeed associated with situations where a signal
correlate strongly with some first-order properties of the nodes can be transferred not only directly from one node to another,
[Fig. 4(e)], whereas the fully weighted methods bring more but also indirectly via a third (middleman) node, as can be
independent information. The Fediverse network displays a seen in middleman, fan-in, and fan-out patterns (cf. Table II).
peculiar feature as its fan-out and fan-in clusterings differ This overexpression of redundant patterns is visible, for the
significantly despite the usual correlation between these two fully weighted methods, in the high values of the middleman,
patterns, as can be seen from the comparison of Figs. 4(e) and fan-in, and fan-out motifs in Fig. 5(a).
4(f). The situation for the Fediverse is significantly more com-
plex. Indeed, 98% of all edges belong to at least one triangle,
and some are involved in up to thousands of triangles. A
C. Using local clustering to infer dynamical properties deeper understanding of the clustering requires precise in-
Analysis of the clustering coefficient for different struc- vestigation of how the weight distribution correlates with
tured patterns offers a way to obtain a precise idea of specific patterns of triangles. For instance, though fan-in and
the critical dynamical patterns within a network. To deter- fan-out patterns co-occur and are thus usually correlated, the
mine the significance of a particular pattern, we compare its Fediverse displays an unexpected discrepancy between the
prominence in the original network with its prominence in weights associated with the two patterns, as seen in Fig. 5(b),
null-model networks obtained via appropriate randomization. which notably differentiates it from the mouse connectome
For the mouse brain, as a randomized control, we take the (see also Appendix F 3).
original network and only shuffle the weights, thus preserving
the weight distribution and binary structure. Comparing the
V. DISCUSSION
actual values in the original graphs with those of the random-
ized graphs, we can see the importance of looking at fully In this paper, we introduced directed weighted definitions
weighted measures. for clustering analysis. Using analytic derivations, generated
Both the Zhang-Horvath and continuous definitions iden- network models, and real data, we showed that the behavior
tify the preference of the mouse brain network for redundant of fully weighted measures displayed enhanced sensitivity,
information flows, whereas the hybrid method of Onnela does selectivity, and robustness compared with hybrid measures.
not capture this feature. Redundant information transfer in To facilitate access to these measures, all clustering methods

043124-9
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

(a) (b)

FIG. 5. Different structures of clustering patterns and information flow in the mouse brain and the Fediverse. The original clustering values
are compared with graphs with the same adjacency matrix but shuffled edge weights (mouse, 10 uniformly shuffled networks; Fediverse, 200
graphs shuffled across all outgoing edges of each node to preserve out-strength normalization). The contours show different density levels of
the point clouds associated with the original and shuffled pairs of clustering coefficients. (a) The Zhang-Horvath and continuous definitions
capture the redundancy of information flow in the mouse brain (the middleman, fan-in, and fan-out motifs are higher in the original graph).
However, the Onnela method does not capture this feature. (b) For the Fediverse, only fan-out and middleman motifs are significantly stronger
than in the random graphs. Patterns where the original values are significantly greater than the randomized ones are marked by a + and the
initial of the method (brown Z, Zhang-Horvath method; orange C, continuous method; yellow O, Onnela method). The original clustering is
considered to be significantly higher (lower) if 75% of the points are above (below) the dotted identity line. The addition of * denotes one-sided
fractions greater than 95%.

were implemented in PYTHON and made compatible with the numbers of spurious edges with small weights. We expanded
three main libraries in this language (NETWORKX, IGRAPH, and the results from previous studies comparing existing weighted
GRAPH-TOOL) [28]. definitions [14–16,21], extending their definition to directed
The use of a specific weighted clustering coefficient should networks and providing complete mathematical justifications
be an informed choice, based on the precise questions and ob- for previous observations regarding linearity and continuity.
jects of study and thus on the specific properties that a relevant Our analyses show that either of the fully weighted defini-
measure should fulfill to answer these questions. We therefore tions may be preferred depending on the network properties.
wish to highlight the importance of continuity as a potentially For networks with large degrees, the Zhang-Horvath defini-
crucial notion for networks with highly heterogeneous weight tion may be preferred if the number of low-weight spurious
distribution or numerous spurious edges with low weights. edges is very high, as it is least susceptible to noise. In
In previous studies, slight variations on this notion had been networks with heterogeneous weight distributions where the
introduced mathematically [15] or hinted at by the study of absolute value of triangle strength is of interest, we showed
corner cases [14]. However, the continuity property was in- that the continuous definition provides more relevant results
advertently considered only in particular classes of networks than the Zhang-Horvath definition. Indeed, in networks in-
in Ref. [15] (disregarding, for example, small-degree cases), volving fluxes of goods or matter as well as for information
erroneously marking previous methods as continuous. The processing (e.g., in brain or telecommunications networks),
other study (Ref. [14]) asserted discontinuity of previous mea- one may be interested in the absolute amount that can be
sures simply as a feature, without discussing its implications transferred between nodes, making nodes with small weights
or ways to avoid it. Our proposal of fully continuous clus- of little relevance. A similar issue occurs for networks where
tering methods based on simple mathematical principles and many nodes participate in a single triangle (see Appendix F 4)
requirements fully solves the issue of continuity. In addition, making the Zhang-Horvath clustering noncontinuous. In such
we show in Appendix D that these principles can be extended networks, the use of the Zhang-Horvath definition is likely
to other measures such as the local closure. to assign high clustering to single-triangle nodes with low
We discussed how each weighted definition is associated weights, whereas the continuous definition will not, as was
with a specific interpretation of weighted clustering as a func- shown in Fig. 1.
tion of the binary clustering, triangle, and triplet intensities Finally, we illustrate the usefulness of weighted cluster-
(see Appendix B and Table VI for a summary). Corre- ing methods to investigate clustering in the examples of a
spondingly, these interpretations are associated with specific connectome and a decentralized social network. The fully
properties of each clustering coefficient. In rare cases where, weighted methods were especially suitable to reveal key dif-
by design, there are edges with small weights that should be ferences in their weighted structural properties. Indeed, we
treated significantly differently from an absence of connec- showed that middleman, fan-in, and fan-out patterns, charac-
tion, researchers may want to check whether the properties of teristic of pathways enabling redundant information transfer
the Barrat or Onnela definitions fit their needs or if they should between nodes, were overexpressed in the mouse brain. In the
come up with new, more appropriate definitions. In many Fediverse, the methods revealed an unexpected discrepancy
other cases, combining mathematical analysis and concrete between fan-in and fan-out modes, probably associated with
examples, we asserted that fully weighted measures outper- social interaction patterns that would mandate further investi-
form hybrid ones to evaluate clustering in networks with large gation.

043124-10
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

TABLE III. Undirected weighted clustering coefficients of ver- More specifically, Wang et al. [15] argue in favor of the use of
tex i (solid circles) for different weight configurations. Solid lines a specific version using the harmonic mean (hm):
depict edges of weight w = max(w) = 1, whereas dotted lines are 2
associated with edges with vanishing weight . Only the proposed
1 + w jk
jk 2 1
1
continuous clustering (bottom row) displays the required properties, wi j + wik
CiM,hm = 2
, (A4)
compared with the Zhang-Horvath (C Z ), Holme et al. [17] (C H ), and j=k
1 + maxlm (wlm )
2 1
Miyajima-Sakuragawa (hm) definitions (C M,hm ). 1
wi j + wik

which we will refer to as the Miyajima-Sakuragawa (hm)
definition. However, this definition suffers from three major
shortcomings:
(i) Despite what is asserted by the authors, it does not fulfill
the continuity condition.
(ii) It is not locally linear, meaning that two nodes that have
the same neighborhood but with all weights differing by a
factor λ will not have the ratio of their clustering coefficients
equal to λ.
ACKNOWLEDGMENTS (iii) It introduces an undesired asymmetry in the definition
of the triangle intensity: For a given triangle i jk , the com-
This research was funded by a Humboldt Research Fellow-
puted intensity will be different for each node as it depends
ship for Postdoctoral Researchers and a Sofja Kovalevskaja
on which one is considered as i.
Award from the Alexander von Humboldt Foundation, en-
dowed by the Federal Ministry of Education and Research.
We thank the anonymous reviewers for their comments, which APPENDIX B: COMPARISON OF CLUSTERING
helped to significantly improve our manuscript. We acknowl- PROPERTIES
edge the support from the BMBF through the Tübingen AI Multiple properties of the main definitions for the cluster-
Center (FKZ 01IS18039B). ing coefficient are listed in Table IV. These properties depend
on the definitions of triangle and triplet intensity that are
APPENDIX A: LIMITATIONS OF OTHER FULLY recapitulated in Table V.
WEIGHTED DEFINITIONS The different formulas for the intensities lead to different
interpretations of weighted clustering (Table VI). The Barrat
See Table III for a complete comparison of the fully and Zhang-Horvath definitions only quantify ratios of trian-
weighted clustering definitions. gle and triplet strength, while the Onnela and continuous
definitions are sensitive to the absolute value of the triangle
1. Holme et al. [17] intensity. The latter provides an intermediate interpretation
between the Zhang-Horvath and Onnela definitions as it reacts
Most studies consider the definition of Holme et al. [17] to to both the ratio of intensities and the absolute value of the
be triangle intensity.

jk wi j wik w jk
CiH = , (A1) APPENDIX C: DERIVATION OF THE EVOLUTION
maxi j (wi j ) jk wi j w jk
OF HYBRID CLUSTERING COEFFICIENTS
which would make it inconsistent with the binary definition. 1. Barrat method
However, the discussion in their paper states that consistency
Upon addition to a graph G(N, E ) of an edge (i, v) of
was one of their requirements, letting us think that they actu-
weight , giving G (N, E ) = E + {(i, v)}, one can compute
ally meant to define it as
the evolution of the initial clustering coefficient CiB to its new

value CiB by rewriting the definition from Ref. [10] as in
j=k wi j wik w jk
Ci =
H
, (A2) Ref. [14]:
maxi j (wi j ) j=k wi j w jk
1 wi j + wik
CiB = ai j aik a jk
which would make it equal to the Zhang-Horvath definition di (di − 1) j=k 2wi
[12], and we will therefore not consider Eq. (A1) here.
1 wi j
= ai j aik a jk . (C1)
di (di − 1) j=k wi
2. Miyajima and Sakuragawa [18]
Miyajima and Sakuragawa [18] define a multitude of gen- From this, we can deduce the following relationship
eralized clustering coefficients based on the use of an arbitrary between the Barrat definition and the binary clustering coeffi-
function h : R2 → R such that cient Cibin :

jk h[h(wi j , wik ), w jk ] n,i wi /wi w
Ci =
M,h
. (A3) CiB = = Cibin i , (C2)
j=k h[h(wi j , w jk ), maxlm (wlm )] di (di − 1) wi

043124-11
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

TABLE IV. Comparison of the different properties of the different clustering coefficients. The Zhang-Horvath and continuous definitions
fulfill the maximum number of desirable properties. An “X” means that the method possesses this property.

Property Barrat Onnela Miyajima-Sakuragawa (hm) Zhang-Horvath Continuous
Consistent with binary definition X X X X X
Normalized (C ∈ [0, 1]) X X X X X
All weights participate in I X X X X
All weights participate equally in I X X X
Linear against local scaling of the weights X X X
Sensitive to weight permutations X X X X
For a triangle = (i, j, k), I = IT f (w jk ) X
Continuous X

with n,i being the number of triangles in which node i partic- From this, one can define the evolution upon addition of an
ipates and wi being the average weight associated with edges edge (i, v) of weight as
connected to i and participating in a triangle.
Notice that this expression also explains why the Barrat CiO = Cibin Ii
O
jk
definition is so close to the binary clustering: For networks
jk + O()
O
n Ii
n
where weights are either rather homogeneous or not strongly =
correlated to triangles, wi and wi become very close as the di (di + 1) n
number of triangles per node increases. n
= I O + O()
Using Eq. (C2), the new clustering can be defined as di (di + 1) i jk
di − 1 O

= C + O(). (C5)
w di + 1 i
CiB = Cibin i
wi

3. Directed versions of the clustering coefficients
n,i wi +(n,i −n,i )

n,i
n,i To generalize the Zhang-Horvath definition of cluster-
= di wi +
di (di + 1) di +1
ing [12] for directed graphs, we use the same approach as
proposed by Fagiolo [19]. These definitions are visible in
n,i wi Table VII together with the directed definitions associated
= + O() with Barrat et al. [10] and Onnela et al. [11], respectively
di2 wi
defined in Refs. [20] and [19].
di − 1 B
= Ci + O(). (C3)
k APPENDIX D: CLOSURE
Closure was introduced in Ref. [13] as a complementary
2. Onnela method measure of clustering for binary undirected networks.
As for the other definitions, the clustering from Ref. [11]
can be defined as a function of the binary clustering: 1. Undirected weighted closure
From the Zhang-Horvath definition of clustering, closure
O
n Ii jk can be generalized in a fully weighted but noncontinuous way
CiO = = Cibin Ii
O . (C4)
di (di − 1) jk as

j=k wi j w jk wki Wii3
Hi =
0
= , (D1)
j=k=i wi j w jk j wi j (s j − wi j )
TABLE V. Comparison of the formula for triangle and triplet
intensity among the different clustering definitions for undirected again comparing the triangle intensities with their maximum
networks. possible value if all their second neighbors are also connected
to them (i.e., also first neighbors) by an edge of weight 1.
Definition Triangle (Ii jk ) Triplet (IT i jk )
wi j + wik TABLE VI. Comparison of the interpretations of the different
Barrat ai j aik a jk di (di − 1)
2wi clustering definitions for undirected networks.
Onnela (wi j wik w jk )1/3 di (di − 1)
2 2
Miyajima-Sakuragawa (hm) Definition Barrat Onnela Zhang-Horvath Continuous
1 + 1 + w +
2 1 1 1
jk wi j w ik
wi j wik

Zhang-Horvath wi j wik w jk w w wi bin bin
Z
Ii jk
2
Ii jk
√ √ i j ik Formula C O
Ii jk Ci Cibin Cibin
Continuous 3 w w w
i j jk ik wi j wik wi i ITZi jk IT i jk

043124-12
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

TABLE VII. Definitions of the Barrat, Onnela, and Zhang-Horvath intensities for each partial mode pattern in directed graphs. Column
1, pattern name; column 2, Onnela triangle intensity for node i; column 3, Barrat triangle intensity for node i; column 4, Barrat triplet
intensity for node i; column 5, Zhang-Horvath triangle intensity for node i; column 6, Zhang-Horvath triplet intensity for node i. The clustering
coefficients associated with each mode m are given by CiO,(m) = I,i
O,(m)
/nT,i
(m)
for the Onnela method, CiB,(m) = I,i /IT,i for the Barrat method,
B,(m) B,(m)

and Ci Z,(m)
=
I,i /I
Z,(m) Z,(m)
T,i for the Zhang-Horvath method. Note that, for the Barrat method, the reciprocal strength has been defined in Ref. [20]
B
as si,↔ = i= j 21 (wi j + w ji ).

O,(m) B,(m) B,(m) Z,(m) Z,(m)
Mode I,i I,i IT,i I,i IT,i
1
Cycle (W [ 3 ] )3ii 1
2
(WA2 + (WA2 )T )ii 1
(s d
2 i,in i,out
+ si,out di,in ) − si,↔
B
(W )3ii si,in si,out − si,↔
[2]

[ 13 ] 1 1
Middleman (W W [ 3 ]T W [ 3 ] )ii 1
2
(WAT A + W T AAT )ii 1
(s d
2 i,in i,out
+ si,out di,in ) − si,↔
B
(W W T W )ii si,in si,out − si,↔
[2]

[ 13 ]T [ 13 ] 2
Fan in (W (W ) )ii 1
2
(W T (A + AT )A)ii si,in (si,in − 1) (W T W W )ii (si,in ) − si,in
2 [2]
1 1
Fan out ((W [ 3 ] )2W [ 3 ]T )ii 1
2
(W (A + AT )AT )ii si,out (si,out − 1) (W W W T )ii (si,out )2 − si,out
[2]

This is equivalent to comparing with the situation where all or via the continuous definition,
open paths of length 2 are closed into a triangle by an edge of √ 2 3
weight 1. j=k
3 w w w 2
i j jk ki W[3]
Finally, it can also be defined in a continuous way as Hi,CO =
c
√ = 1 2 ii , (D7)
j=k=i wi j w jk [ 2 ] − si,out
j W ij
√ 2 3
3 w w w 2
j=k W[3] √ 2 ,T 3
Hi =
i j jk ki
√ = 1 2ii . (D2) j=k
3 w w w
k j ji ik
2
W[3]
j=k=i wi j w jk [ 2 ] − si
c
Hi,CI = √ = 1 2 ii ,
j W ij j=k=i wk j w ji W [ 2 ,T ] − si,in
j ij
(D8)
2. Directed weighted closure
√ 2 2 2 ,T
For directed networks, in contrast to Ref. [29], we only j=k
3 w w w 2
i j jk ki W [ 3 ] W [ 3 ] ii
c
Hi,FO = √ = 1 2 , (D9)
consider the extension of the undirected measure to di- wi j w jk
j=k=i W [ 2 ] − si,out
j
rected paths (that is, we consistently consider only second ij
in-neighbors via in-neighbors and second out-neighbors via √ 2 ,T 2 2
j=k
3 w w w 2
k j ji ki W [ 3 ] W [ 3 ] ii
out-neighbors, but not the out-neighbors of the node’s in- Hi,FI =
c
√ = 1 2 . (D10)
neighbors or vice versa). Therefore we define only four j=k=i wk j w ji W [ 2 ,T ] − s
j ij i,in
variants of the directed closure, two for outgoing paths [cycle
out (CO) and fan out (FO)] and two for incoming paths [cycle
APPENDIX E: NETWORK GENERATION ALGORITHMS
in (CI) and fan in (FI)].
We define the weighted version either directly using the All networks were generated using the Neural Networks
weights, as in the Zhang-Horvath definition for the clustering and Graphs’ Topology (NNGT) library [28].
coefficient,

Wii3 1. Core-periphery network
j=k wi j w jk wki
Hi,CO =
0
= , (D3)
j wi j (s j,out − wi j )
j=k=i wi j w jk
The core-periphery network in Fig. 1 contains (i) 1 cen-
3
tral core node (CCN), (ii) 10 outer-core nodes (OCNs), and
j=k wk j w ji wik (W T )ii (iii) 20 periphery nodes (PNs).
Hi,CI =
0
= , (D4)
j w ji (s j,in − w ji )
j=k=i wk j w ji
The nodes are connected as follows:
(i) The ten OCNs for a circular graph have fully reciprocal
j=k wi j w jk wik (W 2W T )ii connections to their four nearest neighbors.
0
Hi,FO = = , (D5)
j=k=i wi j w jk j wi j (s j,out − wi j ) (ii) The OCNs all connect to the CCN with reciprocal
connections.
j=k wk j w ji wki (W T W 2 )ii
Hi,FI =
0
= , (D6) The weights associated with these connections are drawn
j=k=i wk j w ji j w ji (s j,in − w ji ) from U (5, 10). The connections with the PNs are as follows:

TABLE VIII. Some characteristic properties of the mesoscale TABLE IX. Some characteristic properties of the Fediverse
mouse brain. SD, standard deviation. mesoscale network.

Mean SD Median Min Max Mean SD Median Min Max
In-degree 153.7 25.6 152 104 223 In-degree 21.3 82.6 4 0 2271
Out-degree 153.7 77.6 151 14 357 Out-degree 21.3 77.8 5 0 2038
CCbin
tot 0.43 0.02 0.43 0.38 0.49 CCbin
tot 0.68 0.37 0.86 0 1
Weight 0.076 0.36 0.038 3.9 × 10−16 20.4 Weight 0.045 0.14 0.0029 6.0 × 10−7 1

043124-13
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

FIG. 6. Distribution of in- and out-degrees (in orange and yellow, FIG. 7. Distribution of in- and out-degrees (in orange and yellow,
respectively) and weights in the mouse connectome. respectively) and weights in the Fediverse instances.

(i) Each OCN receives one connection from every other The original lattice L(N, k, r) has 21 Nk(1 + r) edges, lead-
PN, starting with the first or second PN depending on the ing to the limit cases (i) 21 Nk edges if r = 0, like the
OCN’s evenness. undirected lattice, and (ii) Nk edges if r = 1, with all con-
(ii) The OCN reciprocates the connections with probability nections being reciprocal.
0.5.
(iii) The PNs are connected among themselves following 3. Network properties for Fig. 2
an Erdős-Rényi scheme of density 0.05.
Weights associated with connections involving PNs are In the Watts-Strogatz network used in Fig. 2, we used a
drawn from U (0.05, 0.5). coordination number k = 20 and a rewiring probability p =
0.03.
The Erdős-Rényi network is generated to have an average
2. Watts-Strogatz network degree of 15, whereas the scale-free network is generated via
The original Watts-Strogatz network [30] consists of a reg- a degree list drawn from a power-law distribution with an
ular lattice basis (characterized by a coordination number k) exponent of 1.2 and normalized to get an average degree of
that is then modified, rewiring each edge with a probability 15.
p. For directed networks, we used a generalization of that The weights of the Watts-Strogatz network are drawn from
method implemented in NNGT which is strictly equivalent an exponential distribution of parameter 1.5 and shifted by
except for the fact that edges are now directed: +0.1 so that they are all nonzero.
(1) Start from a directed regular lattice with coordination The weights or the “noise” networks are either all equal
number k and reciprocity r (taken as 1 in this paper). to 10−4 or drawn from a second exponential distribution of
(2) Rewire each edge with probability p. parameter 0.05, shifted by +10−4 .

(a) (b)

FIG. 8. Different structures of local closure in the mouse brain and Fediverse. The panels compare the local closure of directed patterns
for the mouse brain (a) and Fediverse (b). The original values are compared with the averages over an ensemble of graphs with the same
adjacency matrix but shuffled edge weights. For the mouse brain, 10 uniformly shuffled networks; for Fediverse, 200 shuffling across all
outgoing edges of a each node (to preserve out-strength normalization). (a) The local closure also shows an increase for patterns promoting
redundancy of information flow in the mouse brain (fan-in and fan-out motifs are higher in the original graph). (b) For the Fediverse, only fan
out is significantly stronger than in the random graphs. Patterns where the original values are significantly greater (smaller) than the randomized
ones are marked by a + (−) and the initial of the method (brown Z, Zhang-Horvath method; orange C, continuous method; yellow O, Onnela
method). The original closure is considered to be significantly higher (lower) if 75% of the points are above (below) the dotted identity line.
The addition of * denotes one-sided fractions greater than 95%.

043124-14
WEIGHTED DIRECTED CLUSTERING: INTERPRETATIONS … PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

TABLE X. Number of nodes (N) and edges (E) and fraction of sponding to the number of users registered on that server.
single-triangle (1-T) nodes in several collaboration and gene associ- Edges are associated with four attributes: “num follows”, de-
ation networks. fined as Fi j , the number of followers from i to j; “follow/size”,
defined as Fi j /Si ; a “weight” given by wi j = Fi j /Fi , the ratio
Network N E 1-T nodes (%) between the number of followers from i to j divided by the
NetScience 1589 2742 23 total number of followers from i; and a “distance” giving the
CompGeo 7343 11 898 21 Euclidean distance between the two servers on the latitude-
CE-CX 15 229 245 952 8 longitude plane. See Table IX and Fig. 7.
CE-HT 2617 2985 2
HS-HT 2570 13 691 7 3. Closure in the shuffled networks
SC-HT 2084 63 027 5
The results obtained for the local closure confirm those
obtained using the clustering coefficient for the mouse brain
and Fediverse networks, as shown in Fig. 8. Namely, we see
APPENDIX F: REAL-WORLD NETWORKS
an overexpression of fan-in and fan-out patterns for the mouse
network and a discrepancy between those two patterns for the
1. Mouse mesoscale connectome Fediverse network.
The mouse mesoscale connectome network was obtained
from Ref. [22] and contains 426 nodes corresponding to brain 4. Networks with a high number of single-node triangles
regions of intermediate scale connected by 65 465 edges. It The presence of nodes participating in a single triangle
is a symmetric version of the original network that separates is frequent in very sparse networks such as collaboration or
nodes from both hemispheres. Each node is associated with gene-protein interaction networks.
a “name” property composed of an abbreviated denomination Table X details several such networks: NetScience in-
for the corresponding brain region, as well as a suffix (left or volves coauthorship in the network science community [31],
right) corresponding to the hemisphere. Edges are associated CompGeo involves collaborations in computational geome-
with three attributes: a “weight,” a “p value,” and a “distance” try [32], CE-CX is a graph of gene associations inferred
(corresponding to the Euclidean distance from the source to from the coexpression pattern of two genes (based on
the target node). See Table VIII and Fig. 6. high-dimensional gene expression data) for Caenorhabdi-
tis elegans, CE-HT is for gene associations inferred from
high-throughput protein-protein interactions for C. elegans,
2. Fediverse mesoscale network HS-HT is for gene associations inferred from high-throughput
The Fediverse mesoscale network was obtained from Ref. protein-protein interactions for Homo sapiens, and SC-HT is
[25] (data are available [26]). It contains 3825 nodes rep- for gene associations inferred from high-throughput protein-
resenting servers (instances) that are connected via 81 371 protein interactions for Saccharomyces cerevisiae.
edges. Each node is associated with two attributes: a “name”, Gene functional association networks [33] were down-
corresponding to the server domain, and a “size”, Si , corre- loaded [34].

[1] O. Mason and M. Verwoerd, Graph theory and networks in [8] M. D. Fox and M. Greicius, Clinical applications of rest-
biology, IET Syst. Biol. 1, 89 (2007). ing state functional connectivity, Front. Syst. Neurosci. 4, 19
[2] S. E. Ahnert and T. M. A. Fink, Clustering signatures classify (2010).
directed networks, Phys. Rev. E 78, 036112 (2008). [9] J. G. Orlandi, O. Stetter, J. Soriano, T. Geisel, and D. Battaglia,
[3] X.-J. Zhang, B. Gu, X.-M. Guan, Y.-B. Zhu, and R.-L. Lv, Cas- Transfer entropy reconstruction and labeling of neuronal con-
cading failure in scale-free networks with tunable clustering, nections from simulated calcium imaging, PLoS One 9, e98842
Int. J. Mod. Phys. C 27, 1650093 (2016). (2014).
[4] Y.-W. Li, Z.-H. Zhang, D. Fan, Y.-R. Song, and G.-P. Jiang, [10] A. Barrat, M. Barthelemy, R. Pastor-Satorras, and A.
Influence of clustering on network robustness against epidemic Vespignani, The architecture of complex weighted networks,
propagation, in Science of Cyber Security, Lecture Notes in Proc. Natl. Acad. Sci. USA 101, 3747 (2004).
Computer Science, edited by F. Liu, S. Xu, and M. Yung [11] J.-P. Onnela, J. Saramäki, J. Kertész, and K. Kaski, Intensity
(Springer International, Cham, Switzerland, 2018), pp. 19–33. and coherence of motifs in weighted complex networks, Phys.
[5] S. F. Muldoon, E. W. Bridgeford, and D. S. Bassett, Small- Rev. E 71, 065103(R) (2005).
world propensity and weighted brain networks, Sci. Rep. 6, [12] B. Zhang and S. Horvath, A general framework for weighted
22057 (2016). gene co-expression network analysis, Stat. Appl. Genet. Mol.
[6] J.-G. Young, G. T. Cantwell, and M. E. J. Newman, Bayesian Biol. 4, 17 (2005).
inference of network structure from unreliable data, J. Complex [13] H. Yin, A. R. Benson, and J. Leskovec, The local closure coeffi-
Networks 8, cnaa046 (2021). cient: A new perspective on network clustering, in Proceedings
[7] R. Guimerà and M. Sales-Pardo, Missing and spurious interac- of the Twelfth ACM International Conference on Web Search
tions and the reconstruction of complex networks, Proc. Natl. and Data Mining (Association for Computing Machinery, Mel-
Acad. Sci. USA 106, 22073 (2009). bourne, 2019), pp. 303–311.

043124-15
TANGUY FARDET AND ANNA LEVINA PHYSICAL REVIEW RESEARCH 3, 043124 (2021)

[14] J. Saramäki, M. Kivelä, J.-P. Onnela, K. Kaski, and J. Kertész, Ouellette, T. N. Nguyen, S. A. Sorensen, C. R. Slaughterbeck,
Generalizations of the clustering coefficient to weighted com- W. Wakeman, Y. Li, D. Feng, A. Ho, E. Nicholas et al., A
plex networks, Phys. Rev. E 75, 027105 (2007). mesoscale connectome of the mouse brain, Nature (London)
[15] Y. Wang, E. Ghumare, R. Vandenberghe, and P. Dupont, Com- 508, 207 (2014).
parison of different generalizations of clustering coefficient [23] https://fediverse.party/.
and local efficiency for weighted undirected graphs, Neural [24] https://en.wikipedia.org/wiki/Fediverse#Communication_
Comput. 29, 313 (2016). protocols_used_in_the_fediverse.
[16] G. Kalna and D. J. Higham, Clustering coefficients for weighted [25] M. Zignani, C. Quadri, S. Gaito, H. Cherifi, and G. P.
networks, in AISB’06: Adaptation in Artificial and Biological Rossi, The footprints of a “mastodon”: How a decentralized
Systems, edited by T. Kovacs and J. A. R. Marshall (Society architecture influences online social relationships, in IEEE Con-
for the Study of Artificial Intelligence and the Simulation of ference on Computer Communications Workshops 2019 (IEEE,
Behaviour, Bath, UK, 2006), Vol. 3, pp. 132–138. Piscataway, NJ, 2019), pp. 472–477.
[17] P. Holme, S. M. Park, B. J. Kim, and C. R. Edling, Korean [26] https://dataverse.mpi-sws.org/dataset.xhtml?persistentId=doi:
university life in a network perspective: Dynamics of a large 10.5072/FK2/AMYZGS.
affiliation network, Phys. A (Amsterdam) 373, 821 (2007). [27] https://joinmastodon.org/.
[18] K. Miyajima and T. Sakuragawa, Continuous and robust [28] T. Fardet, NNGT 2.5.1, 2021, doi: 10.5281/zenodo.5571901.
clustering coefficients for weighted and directed networks, [29] H. Yin, A. R. Benson, and J. Ugander, Measuring directed
arXiv:1412.0059 [physics.soc-ph]. triadic closure with closure coefficients, Network Sci. 8, 551
[19] G. Fagiolo, Clustering in complex directed networks, Phys. Rev. (2020).
E 76, 026107 (2007). [30] D. J. Watts and S. H. Strogatz, Collective dynamics of ‘small-
[20] G. Clemente and R. Grassi, Directed clustering in weighted world’ networks, Nature (London) 393, 440 (1998).
networks: A new perspective, Chaos, Solitons Fractals 107, 26 [31] https://networkrepository.com/netscience.php.
(2018). [32] http://vlado.fmf.uni-lj.si/pub/networks/data/collab/geom.htm.
[21] I. E. Antoniou and E. T. Tsompa, Statistical analysis of [33] A. Cho, J. Shin, S. Hwang, C. Kim, H. Shim, H. Kim, H.
weighted networks, Discrete Dyn. Nat. Soc. 2008, 375452 Kim, and I. Lee, WormNet v3: A network-assisted hypothesis-
(2008). generating server for Caenorhabditis elegans, Nucleic Acids
[22] S. W. Oh, J. A. Harris, L. Ng, B. Winslow, N. Cain, S. Mihalas, Res. 42, W76 (2014).
Q. Wang, C. Lau, L. Kuan, A. M. Henry, M. T. Mortrud, B. [34] https://networkrepository.com/bio.php.

043124-16