Correlation Clustering with a Fixed Number of Clusters

Authors Ioannis Giotis, Venkatesan Guruswami,
Plaintext
                                    T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266
                                                http://theoryofcomputing.org




                              Correlation Clustering with a
                               Fixed Number of Clusters∗
                                   Ioannis Giotis                              Venkatesan Guruswami†
                                        Received: October 18, 2005; published: October 22, 2006.




        Abstract: We continue the investigation of problems concerning correlation clustering or
        clustering with qualitative information, which is a clustering formulation that has been studied
        recently (Bansal, Blum, Chawla (2004), Charikar, Guruswami, Wirth (FOCS’03), Charikar,
        Wirth (FOCS’04), Alon et al. (STOC’05)). In this problem, we are given a complete graph
        on n nodes (which correspond to nodes to be clustered) whose edges are labeled + (for similar
        pairs of items) and − (for dissimilar pairs of items). Thus our input consists of only qualitative
        information on similarity and no quantitative distance measure between items. The quality of
        a clustering is measured in terms of its number of agreements, which is simply the number of
        edges it correctly classifies, that is the sum of number of − edges whose endpoints it places
        in different clusters plus the number of + edges both of whose endpoints it places within the
        same cluster.
            In this paper, we study the problem of finding clusterings that maximize the number of
        agreements, and the complementary minimization version where we seek clusterings that min-
        imize the number of disagreements. We focus on the situation when the number of clusters is
        stipulated to be a small constant k. Our main result is that for every k, there is a polynomial
        time approximation scheme for both maximizing agreements and minimizing disagreements.
   ∗ A preliminary version of this paper appeared in the Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete

Algorithms, January 2006.
   † Supported in part by NSF CCF-0343672, an Alfred P. Sloan Research Fellowship, and a David and Lucile Packard Foundation

Fellowship.


ACM Classification: F.2.2, G.1.2, G.1.6
AMS Classification: 68W25, 05C85
Key words and phrases: clustering, approximation algorithm, random sampling, polynomial time approx-
imation scheme.

Authors retain copyright to their work and grant Theory of Computing unlimited rights
to publish the work electronically and in hard copy. Use of the work is permitted as
long as the author(s) and the journal are properly acknowledged. For the detailed
copyright statement, see http://theoryofcomputing.org/copyright.html.

c 2006 Ioannis Giotis and Venkatesan Guruswami                                                DOI: 10.4086/toc.2006.v002a013
                                        I. G IOTIS AND V. G URUSWAMI

      (The problems are NP-hard for every k ≥ 2.) The main technical work is for the minimization
      version, as the PTAS for maximizing agreements follows along the lines of the property tester
      for Max k-CUT by Goldreich, Goldwasser, Ron (1998).
          In contrast, when the number of clusters is not specified, the problem of minimizing dis-
      agreements was shown to be APX-hard (Chawla, Guruswami, Wirth (FOCS’03)), even though
      the maximization version admits a PTAS.


1    Introduction
In this work, we investigate problems concerning an appealing formulation of clustering called correlation
clustering or clustering using qualitative information that has been studied recently in several works, in-
cluding [2, 3, 4, 5, 7, 6, 8, 17]. The basic setup here is to cluster a collection of n items given as input only
qualitative information concerning similarity between pairs of items; specifically for every pair of items, we
are given a (Boolean) label as to whether those items are similar or dissimilar. We are not provided with any
quantitative information on the degree of similarity between elements, as is typically assumed in most clus-
tering formulations. These formulations take as input a metric on the items and then aim to optimize some
function of the pairwise distances of the items within and across clusters. The objective in our formulation
is to produce a partitioning into clusters that places similar objects in the same cluster and dissimilar objects
in different clusters, to the extent possible.
     An obvious graph-theoretic formulation of the problem is the following: given a complete graph on n
nodes with each edge labeled either “+” (similar) or “−” (dissimilar), find a partitioning of the vertices into
clusters that agrees as much as possible with the edge labels. The maximization version, call it M AX AGREE,
seeks to maximize the number of agreements: the number of + edges inside clusters plus the number of −
edges across clusters. The minimization version, denoted M IN D IS AGREE, aims to minimize the number of
disagreements: the number of − edges within clusters plus the number of + edges between clusters.
     In this paper, we study the above problems when the maximum number of clusters that we are allowed
to use is stipulated to be a fixed constant k. We denote the variants of the above problems that have this
constraint as M AX AGREE[k] and M IN D IS AGREE[k]. We note that, unlike most clustering formulations, the
M AX AGREE and M IN D IS AGREE problems are not trivialized if we do not specify the number of clusters
k as a parameter — whether the best clustering uses few or many clusters is automatically dictated by the
edge labels. However, the variants we study are also interesting formulations, which are well-motivated
in settings where the number of clusters might be an external constraint that has to be met, even if there
are “better” clusterings (i. e., one with more agreements) with a different number of clusters. Moreover,
the existing algorithms for, say M IN D IS AGREE, cannot be modified in any easy way to output a quality
solution with at most k clusters. Therefore k-clustering variants pose new, non-trivial challenges that require
different techniques for their solutions.
     In the above description, we assumed that every pair of items is labeled as + or − in the input. However,
in some settings, the classifier providing the input might be unable to label certain pairs of elements as
similar or dissimilar. In these cases, the input is an arbitrary graph G together with ± labels on its edges.
We can again study the above problems M AX AGREE[k] (resp. M IN D IS AGREE[k]) with the objective being
to maximize (resp. minimize) the number of agreements (resp. disagreements) on edges of E (that is, we do
not count non-edges of G as either agreements or disagreements). In situations where we study this more
general variant, we will refer to these problems as M AX AGREE[k] on general graphs and M IN D IS AGREE[k]
on general graphs. When we don’t qualify with the phrase “on general graphs,” we will always mean the
problems on complete graphs.

                          T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                250
                     C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

    Our main result in this paper is a polynomial time approximation scheme (PTAS) for M AX AGREE[k] as
well as M IN D IS AGREE[k] for k ≥ 2. We now discuss prior work on these problems, followed by a more
detailed description of results in this paper.




1.1   Previous and related work

The above problem seems to have been first considered by Ben-Dor et al. [6] motivated by some compu-
tational biology questions. Later, Shamir et al. [17] studied the computational complexity of the problem
and showed that M AX AGREE (and hence also M IN D IS AGREE), as well as M AX AGREE[k] (and hence also
M IN D IS AGREE[k]) for each k ≥ 2 is NP-hard. They, however, used the term “Cluster Editing” to refer to
this problem.
    Partially motivated by machine learning problems concerning document classification, Bansal, Blum,
and Chawla [5] also independently formulated and considered this problem. In particular, they initiated the
study of approximate solutions to M IN D IS AGREE and M AX AGREE, and presented a PTAS for M AX A-
GREE and a constant factor approximation algorithm for M IN D IS AGREE (the approximation guarantee was
a rather large constant, though). They also noted a simple factor 3 approximation algorithm for M IN D IS -
AGREE[2]. Charikar, Guruswami, and Wirth [7] proved that M IN D IS AGREE is APX-hard, and thus one
cannot expect a PTAS for the minimization problem similar to the PTAS for M AX AGREE. They also gave a
factor 4 approximation algorithm for M IN D IS AGREE by rounding a natural LP relaxation using the region
growing technique. Recently, Ailon et al. [2] presented an elegant combinatorial factor 3 approximation
algorithm with a clever analysis for M IN D IS AGREE; they also get a factor 5/2 approximation using LP
techniques on top of their basic approach.
    The problems on general graphs have also received attention. It is known that both M AX AGREE and
M IN D IS AGREE are APX-hard [5, 7]. Using a connection to minimum multicut, several groups [7, 9, 10]
presented an O(log n) approximation algorithm for M IN D IS AGREE. In fact, Emanuel and Fiat noted in [10]
that the problem is as hard to approximate as minimum multicut (and so this log n factor seems very hard to
improve). For the maximization version, algorithms with performance ratio better than 0.766 are known for
M AX AGREE [7, 18]. The latter work by Swamy [18] shows that a factor 0.7666 approximation can also be
achieved when the number of clusters is specified (i. e., for M AX AGREE[k] for k ≥ 2).
     Another problem that has been considered, let us call it M AX C ORR, is that of maximizing correla-
tion, defined to be the difference between the number of agreements and disagreements. Charikar and
Wirth [8] (see also [16]) present a factor O(log n) approximation algorithm for the problem (called M AX QP)
of maximizing a quadratic form with general (possibly negative) coefficients, and use this to deduce a factor
O(log n) approximation for M AX C ORR on complete graphs. (Their approximation guarantee holds also for
the weighted version of M AX C ORR where there are nonnegative weights on the edges and the goal is to
maximize the difference between the sum of the weights of agreements minus the sum of weights of the dis-
agreements.) For M AX C ORR on general graphs, an O(log θ (G)) approximation is presented in [3], where
θ (·) is the Lovász Theta Function. Alon et al. [3] also showed an integrality gap of Ω(log n) for the stan-
dard semidefinite program relaxation of M AX QP (the largest such integrality gap for a graph is called the
Grothendieck constant of the graph — thus these results establish the Grothendieck constant of the complete
graph on n vertices to be Θ(log n)). Very recently, for some constant α > 0, Arora et al. [4] proved a factor
logα n inapproximability result for M AX QP and the weighted version of M AX C ORR.

                         T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                             251
                                               I. G IOTIS AND V. G URUSWAMI

1.2    Our results
The only previous approximation for M IN D IS AGREE[k] was a factor 3 approximation algorithm for the case
k = 2 [5]. The problems were shown to be NP-hard for every k ≥ 2 in [17] using a rather complicated reduc-
tion. In this paper, we will provide a much simpler NP-hardness proof and prove that both M AX AGREE[k]
and M IN D IS AGREE[k] admit a polynomial time approximation scheme for every k ≥ 2.1 The existence of
a PTAS for M IN D IS AGREE[k] is perhaps surprising in light of the APX-hardness of M IN D IS AGREE when
the number of clusters is not specified to be a constant (recall that the maximization version does admit a
PTAS even when k is not specified).
     It is often the case that minimization versions of problems are harder to solve compared to their com-
plementary maximization versions. The APX-hardness of M IN D IS AGREE despite the existence of a PTAS
for M AX AGREE is a notable example. The difficulty in these cases is when the optimum value of the mini-
mization version is very small, since then even a PTAS for the complementary maximization problem need
not provide a good approximation for the minimization problem. In this work, we first give a PTAS for
M AX AGREE[k]. This algorithm uses random sampling and follows closely along the lines of the property
testing algorithm for Max k-Cut due to [13].
     It is interesting to note that our algorithm can also be used as a PTAS for the unbounded clusters case
by setting k = Ω(1/ε). It is then not hard to see by a cluster merging procedure that the solution obtained
by our algorithm is still within an ε-factor of the optimal solution. Furthermore, since our algorithm runs
in linear time (on the number of vertices), it can be a more efficient alternative to the algorithm presented
in [5].
     In the next section, we develop a PTAS for M IN D IS AGREE[k], which is our main result. The difficulty
in obtaining a PTAS for the minimization version is similar to that faced in the problem of M IN -k-S UM
clustering, which has the complementary objective function to M ETRIC M AX -k-C UT. We remark that while
an elegant PTAS for M ETRIC M AX -k-C UT due to de la Vega and Kenyon [12] has been known for several
years, only recently has a PTAS for M IN -k-S UM clustering been obtained [11]. We note that although the
case of M IN -2-S UM clustering was solved in [14] soon after the M ETRIC M AX C UT algorithm of [12], the
case k > 2 appeared harder. Similarly to this, for M IN D IS AGREE[k], we are able to quite easily give a PTAS
for the 2-clustering version using the algorithm for M AX AGREE[2], but we have to work harder for the case
of k > 2 clusters. Some of the difficulty that surfaces when k > 2 is detailed in Section 4.1.
     In Section 5, we also note some results on the complexity of M AX AGREE[k] and M IN D IS AGREE[k]
on general graphs — these are easy consequences of connections to problems like Max Cut and graph
colorability.
     Our work seems to nicely complete the understanding of the complexity of problems related to correla-
tion clustering. Our algorithms not only achieve excellent approximation guarantees but are also sampling-
based and are thus simple and quite easy to implement.


2     NP-hardness of M IN D IS AGREE and M AX AGREE
In this section we show that the exact versions of problems we are trying to solve are NP-hard. An NP-
hardness result for M AX AGREE on complete graphs was shown in [5]; however their reduction crucially
relies on the number of clusters growing with the input size, and thus does not yield any hardness when
the number of clusters is a fixed constant k. It was shown by Shamir, Sharan, and Tsur [17], using a rather
    1 Our approximation schemes will be randomized and deliver a solution with the claimed approximation guarantee with high

probability. For simplicity, we do not explicitly mention this from now on.

                              T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                       252
                      C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

complicated reduction, that these problems are NP-hard for each fixed number k ≥ 2 of clusters. We will
provide a short and intuitive proof that M IN D IS AGREE[k] and M AX AGREE[k] are NP-hard.
    Since M AX AGREE[k] and M IN D IS AGREE[k] have complementary objectives, it suffices to establish the
NP-hardness of M IN D IS AGREE[k]. We will first establish NP-hardness for k = 2, the case for general k will
follow by a simple “padding” with (k − 2) large collections of nodes with + edges between nodes in each
collection and − edges between different collections.

Theorem 2.1. M IN D IS AGREE[2] on complete graphs is NP-hard.

Proof. We know that G RAPH M IN B ISECTION, namely partitioning the vertex set of a graph into two equal
halves so that the number of edges connecting vertices in different halves is minimized, is NP-hard. From
an instance G of G RAPH M IN B ISECTION with n (even) vertices we obtain a complete graph G0 using the
following polynomial time construction.
    Start with G and label all existing edges of G as + edges in G0 and non-existing edges as − edges. For
each vertex v create an additional set of n vertices. Let us call these vertices together with v, a “group” Vv .
Connect with + edges all pairs of vertices within Vv . All other edges with one endpoint in Vv are labeled as
− edges (except those already labeled).
    We will now show that any 2-clustering of G0 with the minimum number of disagreements, has two
clusters of equal size with all vertices of any group in the same cluster. Consider some optimal 2-clustering
W with two clusters W1 and W2 such that |W1 | = 6 |W2 | or not all vertices of some group are in the same cluster.
Pick some group Vv such that not all its vertices are assigned in the same cluster. Place all the vertices of
the group in the same cluster, obtaining W 0 such that the size difference of the two clusters is minimized.
If such a group could not be found, pick a group Vv from the larger cluster and place it in the other cluster.
Since all groups contain the same number of vertices it must be the case that the size difference between the
two clusters is reduced.
    Let us assume that Vv1 vertices of group Vv were in W1 and Vv2 in W2 . Without loss of generality, let us
assume that the clustering W 0 = (W10 ,W20 ) above is obtained by moving the Vv1 group vertices into cluster W2 ;

                                           W10 = W1 \Vv1 , W20 = W2 ∪Vv1 .

    We now observe the following facts about the difference in the number of disagreements between W 0
and W .

    • Clearly the number of disagreements between vertices not in Vv and between one vertex in Vv2 with
      one in W10 remains the same.

    • The number of disagreements is decreased by |Vv1 | · |Vv2 | based on the fact that all edges within Vv are
      + edges.

    • It is also decreased by at least |Vv1 | · |W10 | − (n − 1) based on the fact that all but at most n − 1 edges
      connecting vertices of Vv to the rest of the graph are − edges.

    • The number of disagreements increases at most |Vv1 | · |W2 \ Vv2 | because (possibly) all of the vertices
      in Vv1 are connected with − edges with vertices in W2 outside their group.

    Overall, the difference in the number of disagreements is at most

                              |Vv1 | · |W2 \Vv2 | − |Vv1 | · |Vv2 | − |Vv1 | · |W10 | + (n − 1) .

                           T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                 253
                                               I. G IOTIS AND V. G URUSWAMI

Notice that since |W10 | − |W20 | was minimized it must be the case that |W10 | ≥ |W2 \ Vv2 |. Moreover since a
group has an odd number of vertices and the total number of vertices of G0 is even, it follows that |W10 | 6=
|W2 \Vv2 | and thus |W10 | − |W2 \Vv2 | ≥ 1. Therefore the total number of disagreements increases at most

                                                (n − 1) − |Vv1 | · (|Vv2 | + 1) .

Since |Vv1 | + |Vv2 | = n + 1 and Vv1 cannot be empty, it follows that |Vv1 | · (|Vv2 | + 1) ≥ n and the number of
disagreements strictly decreases, contradicting the optimality of W .
    Therefore the optimal solution to the M IN D IS AGREE[2] instance has two clusters of equal size and
all vertices of any group are contained in a single cluster. It is now trivial to see that an optimal solution
to the G RAPH M IN B ISECTION problem can be easily derived from the M IN D IS AGREE[2] solution which
completes the reduction.

       We now prove the NP-hardness for more than two clusters.

Theorem 2.2. For every fixed k ≥ 2, there exists n such that M AX AGREE[k] and M IN D IS AGREE[k] on
complete graphs are NP-hard.

Proof. Consider an instance of the M IN D IS AGREE[2] problem on a graph G with n vertices. Create a graph
G0 by adding to G, k − 2 “groups” of n + 1 vertices each. All edges within a group are marked as + edges,
while the remaining edges are marked as − edges.
    Consider now a k-clustering of G0 such that the number of disagreements is minimized. It is easy to
see that all the vertices of a group must make up one cluster. Also observe that any of the original vertices
cannot end up in one group’s cluster since that would induce n + 1 disagreements, strictly more than it
could possibly induce in any of the 2 remaining clusters. Therefore the 2 non-group clusters are an optimal
2-clustering of G. The theorem easily follows.


3      PTAS for maximizing agreement with k clusters
In this section we will present a PTAS for M AX AGREE[k] for every fixed constant k. Our algorithm follows
closely the PTAS for Max k-CUT by Goldreich et al. [13]. In the next section, we will present our main
result, namely a PTAS for M IN D IS AGREE[k], using the PTAS for M AX AGREE[k] together with additional
ideas.2

Theorem 3.1. For every k ≥ 2, there is a polynomial time approximation scheme for M AX AGREE[k].

Proof. We first note that for every k ≥ 2, and every instance of M AX AGREE[k], the optimum number OPT
                           2                                                        n
                                                                                      
of agreements is at least n /16. Let n+ be the number of positive edges, and n− = 2 − n+ be the number
of negative edges. By placing all vertices in a single cluster, we get n+ agreements. By placing vertices
randomly in one of k clusters, we get an expected (1 − 1/k)n− agreements just on the negative edges.
Therefore                                                            
                                                           (1 − 1/k) n
                        OPT ≥ max{n+ , (1 − 1/k)n− } ≥                   ≥ n2 /16 .
                                                               2      2
The proof now follows from Theorem 3.2 which guarantees a solution within additive εn2 of OPT for
arbitrary ε > 0.
    2 This is also similar in spirit, for example, to the PTAS for Min 2-sum clustering based on the PTAS for Metric Max CUT [14,

12].

                              T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                           254
                         C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

      Algorithm MaxAg(k, ε):
      Input: A labeling L : n2 → {+, −} of the edges of the complete graph on vertex set V .
                              

      Output: A k-clustering of the graph, i. e., a partition of V into (at most) k parts W1 ,W2 , . . . ,Wk .

      1. Construct an arbitrary partition of the graph into roughly equal parts, (V 1 ,V 2 , . . . ,V m ), m = d ε4 e.
      2. For i = 1 . . . m, choose uniformly at random with replacement from V \V i , a subset Si of size
                322      64mk
          r = 2ε  2 log εδ .
      3. Set W to be an arbitrary (or random) clustering.
      4. For each clustering of each of the sets Si into (S1i , . . . , Ski ) do
         (a) For i = 1 . . . m do the following
            (i) For each vertex v ∈ V i do
               (1) For j = 1 . . . k, let
                    β j (v) = |Γ+ (v) ∩ Sij | + ∑l6= j |Γ− (v) ∩ Sli |.
               (2) Place v in cluster argmax j β j (v).
         (b) If the clustering formed by step (a) has more agreements than W , set W to be that clustering.
      5. Output W .

                                           Figure 1: MaxAg(k, ε) algorithm



Theorem 3.2. On input ε, δ and a labeling L of the edges of a complete graph G with n vertices, with
probability at least 1 − δ , algorithm MaxAg outputs a k-clustering of the graph such that the number of
agreements induced by this k-clustering is at least OPT − εn2 /2, where OPT is the optimal number of
                                                                                          −3
agreements induced by any k-clustering of G. The running time of the algorithm is n · kO(ε log(k/(εδ ))) .
   The proof of this theorem is presented in Section 3.2, and we now proceed to describe the algorithm in
Figure 1.

3.1     Overview
Our algorithm is given a complete graph G(V, E) on n vertices. All the edges are marked as + or −, denoting
agreements or disagreements respectively. For a vertex v, let Γ+ (v) be the set of vertices adjacent to v via +
edges, and Γ− (v) the set of vertices adjacent to v via − edges.
     Initially, note that our algorithm works in m = O(1/ε) steps. We partition the vertices into m almost
equal parts V 1 ,V 2 , . . . ,V m (each with Θ(εn) vertices) in an arbitrary way. In the ith step, we place the ver-
tices of V i into clusters. This is done with the aid of a sufficiently large, but constant-sized, random sample
Si drawn from vertices outside V i . (The random samples for different steps are chosen independently.) The
algorithm tries all possible ways in which all the samples can be clustered, and for each poossibility, it clus-
ters V i by placing each vertex in the cluster that maximizes the agreement with respect to the clustering of
the sample Si .
     The analysis is based on the following rationale. Fix some optimal clustering D as a reference. Assume
that the considered clustering of Si matches the clustering of V 1 ∪ · · · ∪V i−1 the algorithm has computed so
far and the optimal clustering on V i+1 ∪ · · · ∪ V m (since we try all possible clusterings of Si , we will also
try this particular clustering). In this case, using standard random sampling bounds, we can show that with
high probability over the choice of Si , the clustering of V i chosen by the algorithm is quite close, in terms
of number of agreements, to the clustering of V i by the optimal clustering D. This will imply that with

                              T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                        255
                                                I. G IOTIS AND V. G URUSWAMI

constant probability our choice of the Si will allow us to place the vertices in such a way that the decrease in
the number of agreements with respect to an optimal clustering is O(ε 2 n2 ) per step. Therefore, the algorithm
will output a solution that has at most O(εn2 ) fewer agreements compared to the optimal solution.

3.2    Performance analysis of MaxAg(k, ε) algorithm.
Consider an arbitrary optimal k-clustering of the graph D ≡ (D1 , . . . , Dk ). We consider the subsets of each
cluster over our partition of vertices, defined as

                                           Dij ≡ D j ∩V i for j = 1, . . . , k, and
                                           Di ≡ (Di1 , . . . , Dik ) .

Let’s also call the clustering output by our algorithm W ≡ (W1 , . . . ,Wk ) and define in the same fashion
subsets of our clustering:

                                           W ji ≡ W j ∩V i for j = 1, . . . , k, and
                                           W i ≡ (W1i , . . . ,Wki ) .

To be able to measure the performance of our algorithm at each step, we need to have an image of the
graph with all the clustered vertices up to the current point, while the rest of the vertices will be viewed
under the arbitrary clustering defined above. Intuitively, this image of the graph will enable us to bound
within each step the difference in agreements against the arbitrary clustering while taking into account the
clustering decisions made thus far. More formally, we define a sequence of hybrid clusterings, such that
hybrid clustering H i , for i = 1, 2, . . . , m + 1, consists of the vertices as clustered by our algorithm up to (not
including) the ith step and the rest of the vertices as clustered by D;

                                     H i ≡ (H1i , . . . , Hki ) ,
                                     Hi ≡ (H1i , . . . , Hki ) ,
                                     H ij ≡ (∪i−1  l       m    l
                                              l=1W j ) ∪ (∪l=i D j ) for j = 1, . . . , k,
                                     Hij ≡ H ij \V i for j = 1, . . . , k.

    Since we are going through all possible clusterings of the random sample sets Si , for the rest of the
analysis consider the loop iteration when the clustering of each Si exactly matches how it is clustered in
Hi ,3 i. e., for j = 1, 2, . . . , k, we have Sij = Si ∩ Hij . Of course, taking the overall best clustering can only
help us.
    The following lemma captures the fact that our random sample with high probability gives us a good
estimate on the number of agreements towards each cluster for most of the vertices considered.

Lemma 3.3. For i = 1 . . . m, with probability at least 1 − (δ /4m) on the choice of Si , for all but at most an
ε/8 fraction of the vertices v ∈ V i , the following holds for j = 1, . . . k,

                                         |Γ+ (v) ∩ Sij |       |Γ+ (v) ∩ Hij |       ε
                                                           −                     ≤      .                       (3.1)
                                                r                 |V \V i |          32

(Note that if (3.1) above holds, then it also holds with Γ− (v) in place of Γ+ (v).)
   3 The clustering in question might not match the actual final clustering of these vertices.


                               T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                256
                       C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

Proof. Consider an arbitrary vertex v ∈ V i and the randomly chosen set Si = {u1 , . . . , ur }. For each j ∈
{1, . . . , k}, we define the random variables

                                                               1, if u` ∈ Γ+ (v) ∩ Sij ;
                                                             
                                  for ` = 1, . . . r, α `j =
                                                               0, otherwise.

    Clearly ∑r`=1 α `j = |Γ+ (v) ∩ Sij | and

                                                                |Γ+ (v) ∩ Hij |
                                               Pr[α lj = 1] =                     .
                                                                   |V \V i |

Using an additive Chernoff bound we get that

                    |Γ+ (v) ∩ Sij | |Γ+ (v) ∩ Hij |
                  "                                      #             
                                                      ε                 ε 2      εδ
               Pr                  −         i
                                                    >      < 2 · exp −2     r =
                          r           |V \V |         32                32      32mk

where r and m were defined in our algorithm as r = (322 /2ε 2 ) log(64mk/εδ ) and m = d4/εe. In particular,
r was chosen to precisely match this bound’s inequality requirements.
     We can now use a standard probabilistic argument. Defining a random variable to count the number of
vertices not satisfying inequality (3.1) and, using Markov’s inequality, we observe that for that particular
j, inequality (3.1) holds for all but a fraction ε/8 of vertices v ∈ V i , with probability at least 1 − (δ /4mk).
Using a probability union bound the lemma easily follows.

    We define agree(A) to be equal to the number of agreements induced by k-clustering A. Now consider
the placement of the V i vertices in clusters W1i , . . . ,Wki , as performed by the algorithm during step i. We will
examine the number of agreements compared to the placement of the same vertices under H i (placement
under the optimal clustering), more specifically we will bound the difference in the number of agreements
induced by placing vertices differently than H i . The following lemma formalizes this concept.

Lemma 3.4. For i = 0, . . . m, we have agree(H i+1 ) ≥ agree(D) − i · 18 ε 2 n2 .

Proof. Observe that H 1 ≡ D and H m+1 ≡ W . The only vertices placed differently between H i+1 and H i are
the vertices in V i . Suppose that our algorithm places v ∈ V i in cluster x, but v is placed in cluster x0 under
H i . For such vertex v the number of agreements towards clusters other than x, x0 remains the same, therefore
we will focus on the number of agreements towards these two clusters and the number of agreements within
V i.
      The number of agreements we could lose by thus misplacing v is

                diff xx0 (v) = |Γ+ (v) ∩ Hxi 0 | − |Γ+ (v) ∩ Hxi | + |Γ− (v) ∩ Hxi | − |Γ− (v) ∩ Hxi 0 | .

Since our algorithm chose cluster x, by construction

                          |Γ+ (v) ∩ Sxi | + |Γ− (v) ∩ Sxi 0 | ≥ |Γ+ (v) ∩ Sxi 0 | + |Γ− (v) ∩ Sxi | .          (3.2)

If inequality (3.1) holds for vertex v, using it for Γ+ (v) and Γ− (v) in both clusters x, x0 , we obtain bounds on
the difference of agreements between our random sample’s clusters Sxi , Sxi 0 and the hybrid clusters Hxi , Hxi 0 .
Combining with inequality (3.2) we get that diff xx0 (v) is at most (ε/8)n. Therefore the total decrease in the
number of agreements by this type of vertices is at most (ε/8)n|V i | ≤ (ε/8) · n2 /m.

                            T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                  257
                                       I. G IOTIS AND V. G URUSWAMI

    By Lemma 3.3 there are at most (ε/8)|V i | vertices in V i for which inequality (3.1) doesn’t hold. The
total number of agreements originating from these vertices is at most (ε/8)|V i |n ≤ (ε/8) · n2 /m. Finally,
the total number of agreements from within Vi is at most |V i |2 ≤ (ε/4) · n2 /m.
    Overall the number of agreements that we could lose in one step of the algorithm is at most (ε/2) ·
n /m ≤ (ε 2 /8)n2 . The lemma follows by induction.
 2


     The approximation guarantee of Theorem 3.2 easily follows from Lemma 3.4. We need to go through
all possible k-clusterings of our random sample sets, a total of kmr loop iterations. The inner loop (over i)
runs m times, and each of those iterations can be implemented in O(nr) time. The claimed running time
bound of our algorithm thus follows.


4     PTAS for minimizing disagreements with k clusters
This section is devoted to the proof of the following theorem, which is our main result in this paper.

Theorem 4.1 (Main). For every k ≥ 2, there is a PTAS for M IN D IS AGREE[k].

    The algorithm for M IN D IS AGREE[k] will use the approximation scheme for M AX AGREE[k] as a sub-
routine. The latter already provides a very good approximation for the number of disagreements unless this
number is very small. So in the analysis, the main work is for the case when the optimum clustering is right
on most of the edges.

4.1   Idea behind the algorithm
The case of two clusters turns out to be a lot simpler and we use it to first illustrate the basic idea. By
the PTAS for maximization, we only need to focus on the case when the optimum clustering has only
OPT = γn2 disagreements for some small γ > 0. We draw a random sample S and try all partitions of it,
and focus on the run when we guess the right partition S = (S1 , S2 ), namely the way some fixed optimal
clustering D partitions S. Since the optimum has a very large number of agreements, there must exist a set
A of size at least (1 − O(γ))n such that each node in A has a clear choice of which side it prefers to be on.
Moreover, for each node in A, we can find out its choice correctly (with high probability) based on edges
connecting it to nodes in the sample S. Therefore, we can find a clustering which agrees with D on a set
A of at least 1 − O(γ) fraction of the nodes. We can then go through this clustering, and for each node in
parallel, switch it to the other side if that improves the solution to produce the final clustering. Nodes in
A have a clear preference therefore they won’t get switched, thus they will remain clustered exactly as in
the optimum D. The number of extra disagreements compared to D on edges amongst nodes in V \ A is
obviously at most the number of those edges which is O(γ 2 n2 ). For edges connecting a node u ∈ V \ A to
nodes in A, since we placed u on the “better” side, and A is placed exactly as in D in the final clustering,
we can have at most O(γn) extra disagreements per node compared to D (this is the error introduced by the
edges from a node in A towards the misplaced nodes in V \ A). Therefore we get a clustering with at most
OPT + O(γ 2 n2 ) = (1 + O(γ))OPT disagreements.
    Our k-clustering algorithm for k > 2 uses a similar high-level approach, but is more complicated. The
main thing which breaks down compared to the k = 2 case is the following. For two clusters, if D has
agreements on a large, i. e. (1 − O(γ)), fraction of edges incident on a node u (i. e. if u ∈ A in the above
notation), then we are guaranteed to place u exactly as in D based on the sample S (when we guess its
correct clustering), since the other option will have much poorer agreement. This is not the case when

                         T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                             258
                       C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

k > 2, and one can get a large number of agreements by placing a node in say one of two possible clusters.
Therefore, it does not seem possible to argue that each node in A is correctly placed, and then to use this to
finish off the clustering.
     However, what we can show is that nodes in A that are incorrectly placed, call this set B, must be in small
clusters of D, and thus are few in number. Moreover, every node in A that falls in one of the large clusters
that we produce, is guaranteed to be correctly placed. (These facts are the content of Lemma 4.4.) The
nodes in B still need to be clustered, and even a small additional number of mistakes per node in clustering
them is more than we can afford. We get around this predicament by noting that nodes in B and A \ B are in
different sets of clusters in D. It follows that we can cluster B recursively in new clusters (and our algorithm
is guaranteed to terminate because B is clustered using fewer than k clusters). The actual algorithm must
also deal with nodes outside A, and in particular decide which of these nodes are recursively clustered along
with B.
     With this intuition in place, we now proceed to the formal specification of the algorithm that gives
a factor (1 + ε) approximation for M IN D IS AGREE[k] in Figure 2. We will use a small enough absolute
constant c1 in the algorithm; the choice c1 = 1/20 will work.

    Algorithm MinDisAg(k, ε):
    Input: A labeling L : n2 → {+, −} of the edges of the complete graph on vertex set V = {1, . . . , n}.
                            

    Output: A k-clustering of the graph, i. e., a partition of V into (at most) k parts V1 ,V2 , . . . ,Vk .

    0. If k = 1, return the obvious 1-clustering.
                                                                                                 ε 2 c2
    1. Run the PTAS for M AX AGREE[k] from previous section on input L with accuracy 32k14 .
        Let ClusMax be the k-clustering returned.
                  c1 ε                                          5 log n
    2. Set β = 16k    2 . Pick a sample S ⊆ V by drawing          β2
                                                                        vertices u.a.r with replacement.
    3. ClusVal = 0; /* Keeps track of value of best clustering found so far*/
    4. For each partition S̃ of S as S1 ∪ S2 ∪ · · · ∪ Sk , perform the following steps:
       (a) Initialize the clusters Ci ≡ Si for 1 ≤ i ≤ k.
       (b) For each u ∈ V \ S
          (i) For each i = 1, 2, . . . , k, compute pvalS̃ (u, i), defined to be 1/|S| times the number of
               agreements on edges connecting u to nodes in S if u is placed in cluster i along with Si .
                                                              def
         (ii) Let ju = arg maxi pvalS̃ (u, i), and valS̃ (u) = pvalS̃ (u, ju ).
         (iii) Place u in cluster C ju , i. e., C ju ≡ C ju ∪ {u}.
      (c) Compute the set of large and small clusters as
                                                   n
            Large ≡ { j | 1 ≤ j ≤ k, |C j | ≥ 2k     }, and Small ≡ {1, 2, . . . , k} \ Large.
           Let l = |Large| and s = k − l = |Small|. /* Note that s < k. */
                      def S
       (d) Cluster W = j∈Small C j into s clusters using recursive call to algorithm MinDisAg(s, ε/3).
            Let the clustering output by the recursive call be W ≡ W10 ∪W20 ∪ · · · ∪Ws0
            (where some of the Wi0 ’s may be empty)
       (e) Let C be the clustering comprising of the k clusters {C j } j∈Large and {Wi0 }1≤i≤s .
            If the number of agreements of C is at least ClusVal, update ClusVal to this value, and
            update ClusMin ≡ C.
    5. Output the better of the two clusterings ClusMax and ClusMin.

                                        Figure 2: MinDisAg(k, ε) algorithm

                            T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                259
                                                I. G IOTIS AND V. G URUSWAMI

4.2    Performance analysis of the algorithm
We now analyze the approximation guarantee of the above algorithm. We need some notation. Let

                                                     A ≡ A1 ∪ A2 ∪ · · · ∪ Ak

be any k-clustering of the nodes in V . Define the function valA : V → [0, 1] as follows: valA (u) equals the
fraction of edges incident upon u whose labels agree with clustering A (i. e., we count negative edges that are
cut by A and positive edges that lie within the same Ai for some i). Also define disagr(A) to be the number
                                                                                                  A
of disagreements of A with respect to the labeling L. (Clearly disagr(A) = n−1   2 ∑u∈V (1 − val (u)).) For a
node u ∈ V and 1 ≤ i ≤ k, let A(u,i) denote the clustering obtained from A by moving u to Ai and leaving
all other nodes untouched. We define the function pvalA : V × {1, 2, . . . , k} → [0, 1] as follows: pvalA (u, i)
equals the fraction of edges incident upon u that agree with the clustering A(u,i) .
     In the following, we fix D to be any optimal k-clustering that partitions V as V ≡ D1 ∪ D2 ∪ · · · ∪ Dk .
Let γ be defined to be disagr(D)/n2 so that the clustering D has γn2 disagreements with respect to the input
labeling L.
     Call a sample S of nodes, each drawn uniformly at random with replacement, to be α-good if the nodes
in S are distinct4 and for each u ∈ V and i ∈ {1, 2, . . . , k},

                                               |pvalS̃ (u, i) − pvalD (u, i)| ≤ α ,                                             (4.1)

for the partition S̃ of S, defined as S̃ = {S1 , . . . , Sk } with Si = S ∩ Di (where pvalS̃ (·, ·) is as defined in the
algorithm). The following lemma follows by a standard Chernoff and union bound argument similar to
Lemma 3.3.5
                                                                                                          √
Lemma 4.2. The sample S picked in Step 2 is β -good with high probability, at least 1 − O(1/ n), where β
is defined in Figure 2.

    Therefore, in what follows we assume that the sample S is β -good. In the rest of the discussion we focus
on the run of the algorithm for the partition S̃ of S that agrees with the optimal partition D, i. e., Si = S ∩ Di .
(All lemmas stated apply for this run of the algorithm, though we don’t make this explicit in the statements.)
Let (C1 ,C2 , . . . ,Ck ) be the clusters produced by the algorithm at the end of Step 4(c) on this run. Let’s begin
with the following simple observation.

Lemma 4.3. Suppose a node u ∈ Ds is placed in cluster Cr at the end of Step 4(b) for r 6= s, 1 ≤ r, s ≤ k.
Then pvalD (u, r) ≥ pvalD (u, s) − 2β = valD (u) − 2β .

Proof. Note that since u ∈ Ds , valD (u) = pvalD (u, s). By the β -goodness of S (recall inequality (4.1)),
pvalS̃ (u, s) ≥ pvalD (u, s) − β . Since we chose to place u in Cr instead of Cs , we must have pvalS̃ (u, r) ≥
pvalS̃ (u, s). By the β -goodness of S again, we have pvalD (u, r) ≥ pvalS̃ (u, r) − β . Combining these three
inequalities gives us the claim of the lemma.

   Define the set of nodes of low value in the optimal clustering D as Tlow = {u | valD (u) ≤ 1 − c1 /k2 }.
                                                                                                def

The total number of disagreements is at least the number of disagreements induced by these low valued
    4 Note that in the algorithm we draw elements of the sample with replacement, but for the analysis, we can pretend that S consists

of distinct elements, since this happens with high probability.
    5 Since our sample size is Ω(log n) as opposed to O(1) that was used in Lemma 3.3, we can actually ensure (4.1) holds for every

vertex with high probability.

                               T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                               260
                       C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

nodes, therefore
                                                2k2 disagr(D)    2k2 γn2    4k2 γn
                                    |Tlow | ≤                 =           ≤        .                          (4.2)
                                                  (n − 1)c1     (n − 1)c1     c1
The following key lemma asserts that the large clusters produced in Step 4(c) are basically correct.

Lemma 4.4. Suppose γ ≤ c1 /16k3 . Let Large ⊆ {1, 2, . . . , k} be the set of large clusters as in Step 4(c) of
the algorithm. Then for each i ∈ Large, Ci \ Tlow ≡ Di \ Tlow , that is with respect to nodes of high value, Ci
precisely agrees with the optimal cluster Di .

Proof. Choose i arbitrarily from Large. We will first prove the inclusion Ci \ Tlow ⊆ Di \ Tlow . Suppose this is
                                                                                         / Tlow , we have valD (u) ≥
not the case and there exists u ∈ Ci \ (Di ∪ Tlow ), so u ∈ D j for some j 6= i. Since u ∈
1 − c1 /k2 , which implies pvalD (u, j) ≥ 1 − c1 /k2 . By Lemma 4.3, this gives pvalD (u, i) ≥ 1 − c1 /k2 − 2β .
Therefore we have
                                                                               |Di | + |D j | − 1
                      2(1 − c1 /k2 − β ) ≤ pvalD (u, i) + pvalD (u, j) ≤ 2 −
                                                                                       n
where the second inequality follows from the simple but powerful observation that each edge connecting
u to a vertex in Di ∪ D j is correctly classified in exactly one of the two placements of u in the ith and jth
clusters (when leaving every other vertex as in clustering D). We conclude that both |Di | and |D j | are at
most:                                                    c    
                                                            1
                                        |Di |, |D j | ≤ 2 2 + β n + 1 .                                  (4.3)
                                                          k
What we have shown is that if u ∈ Ci \ (Di ∪ Tlow ), then u ∈ D j for some j with |D j | ≤ 2(c1 /k2 + β )n + 1. It
follows that |Ci \ (Di ∪ Tlow )| ≤ 2(c1 /k + β k)n + k. Therefore,
                                    c
                                     1
                                                n    4k2 γn    c
                                                                   1
                                                                               c
                                                                                   1
                                                                                     
      |Di | ≥ |Ci | − |Tlow | − 2      +βk n−k ≥    −        −2      +βk n−k > 2 2 +β n+1
                                     k           2k     c1       k               k

where the last inequality follows since γ ≤ c1 /16k3 , k ≥ 2, c1 = 1/20, β is tiny and by using a crude bound.
This contradicts (4.3), and so we conclude Ci \ Tlow ⊆ Di \ Tlow .
    We now consider the other inclusion Di \ Tlow ⊆ Ci \ Tlow . If a node v ∈ Di \ (Ci ∪ Tlow ) is placed in Cq
for q 6= i, then a similar argument to how we concluded (4.3) establishes |Di | ≤ 2(c1 /k2 + β )n + 1, which is
impossible since we have shown Di ⊇ Ci \ Tlow , and hence

                                                         n    4k2 γn    c
                                                                           1
                                                                             
                             |Di | ≥ |Ci | − |Tlow | ≥      −        > 2 2 +β n+1 ,
                                                         2k     c1       k
where the last step follows similarly as above.

    The next lemma states that there is a clustering which is very close to optimum and agrees exactly with
our large clusters. This justifies the approach taken by the algorithm to find a near-optimal clustering by
focusing on the small clusters among (C1 ,C2 , . . . ,Ck ) and reclustering them recursively.

Lemma 4.5. Assume γ ≤ c1 /16k3 . There exists a clustering F that partitions V as V ≡ F1 ∪ F2 ∪ · · · Fk , that
satisfies the following:

  (i) Fi ≡ Ci for every i ∈ Large,
                                                                                                     
                                                                                         2     2k2 γ 
  (ii) The number of disagreements of the clustering F is at most disagr(F) ≤ γn2 1 + 4k
                                                                                       c1  β +  c1      .

                             T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                                261
                                        I. G IOTIS AND V. G URUSWAMI

Proof. By the previous lemma, we know that the clustering C = (C1 , . . . ,Ck ) agrees with the optimal cluster-
ing D on the large clusters, except possibly for vertices belonging to Tlow . To get the clustering F claimed in
the lemma, we will start with D and move any elements of Tlow where D and C differ to the corresponding
cluster of C. This will yield a clustering F which agrees with C on the large clusters, and we will argue that
only few extra disagreements are introduced in the process. The formal details follow.
    Consider the clustering formed from D by performing the following in parallel for each w ∈ Tlow : If
w ∈ Cr and w ∈ Ds for some r 6= s, move w to Dr . Let F ≡ F1 ∪ · · · ∪ Fk be the resulting clustering. By
construction,
                                       Fi ∩ Tlow ≡ Ci ∩ Tlow , for 1 ≤ i ≤ k.
Since we only move nodes in Tlow , clearly Fi \ Tlow ≡ Di \ Tlow for 1 ≤ i ≤ k. By Lemma 4.4, however,
Ci \ Tlow ≡ Di \ Tlow for i ∈ Large; we conclude that Fi ≡ Ci for each i ∈ Large.
     Now the only extra edges that the clustering F can get wrong, compared to D, are those incident upon
nodes in Tlow , and therefore

                       disagr(F) − disagr(D) ≤ (n − 1) ∑ (valD (w) − valF (w)) .                          (4.4)
                                                          w∈Tlow

If a node w belongs to the same cluster in F and D (i. e., we did not move it), then since no node outside
Tlow is moved in obtaining F from D, we have

                                    valF (w) ≥ valD (w) − |Tlow |/(n − 1) .                               (4.5)

If we moved a node w ∈ Tlow from Ds to Dr , then by Lemma 4.3 we have pvalD (w, r) ≥ valD (w) − 2β .
Therefore for such a node w

                 valF (w) ≥ pvalD (w, r) − |Tlow |/(n − 1) ≥ valD (w) − 2β − |Tlow |/(n − 1) .            (4.6)

Combining (4.4), (4.5) and (4.6), we can conclude
                                                                             
                                                                      |Tlow |
                          disagr(F) − disagr(D) ≤ (n − 1)|Tlow | 2β +           .
                                                                      n−1

The claim now follows using the upper bound on |Tlow | from (4.2) (and using n2 /(n − 1)2 ≤ 2).


Lemma 4.6. If the optimal clustering D has γn2 disagreements for γ ≤ c1 /16k  3 , then the clustering ClusMin

found by the algorithm has at most γn2 (1 + ε/3) 1 + 4k2 β /c1 + 8k4 γ/c21 disagreements.
                                                                          

Proof. We note that when restricted to the set of all edges except those entirely within W , the set of agree-
ments of the clustering C in Step 4(e) coincides precisely with that of F. Let n1 be the number of disagree-
ments of F on edges that lie within W and let n2 be the number of disagreements on all other edges. Since
W is clustered recursively, we know the number of disagreements in C is at most
                                              ε                    ε
                                  n2 + n1 1 +      ≤ (n1 + n2 ) 1 +       .
                                               3                      3
The claim follows from the bound on n1 + n2 from Lemma 4.5, Part (ii).

Theorem 4.7. For every ε > 0, algorithm MinDisAg(k, ε) delivers a clustering with number of disagree-
ments within a factor (1 + ε) of the optimum.

                          T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                               262
                        C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

Proof. Let OPT = γn2 be the number of disagreements of an optimal clustering. The solution ClusMax
returned by the maximization algorithm has at most

                                              ε 2 c21 n2      2
                                                                     ε 2 c21 
                                     OPT +               = γn    1 +
                                               32k4                  32k4 γ

disagreements. The solution ClusMin has at most γn2 (1 + ε/3) 1 + 4k2 β /c1 + 8k4 γ/c21 ) disagreements.
                                                                                             

If γ > εc21 /32k4 , the former is within (1 + ε) of the optimal. If γ ≤ εc21 /32k4 (which also satisfies the
requirement γ ≤ c1 /16k3 we had in Lemma 4.6), the latter clustering ClusMin achieves approximation ratio
(1 + ε/3)(1 + ε/2) ≤ (1 + ε) (recall that β ≤ εc1 /16k2 ). Thus the better of these two solutions is always an
(1 + ε) approximation.

    To conclude Theorem 4.1, we examine the running time of MinDisAg. Step 4 will be run for k|S| =
    4 2
nO(k /ε ) iterations.  During each iteration, the placement of vertices is done in O(n log n) time. Finally,
observe that there is always at least one large cluster, therefore the recursive call is always done on at most
(k − 1) clusters. It follows that the running time of MinDisAg(k, ε) can be described from the recurrence
                 4 2
T (k, ε) ≤ nO(k /ε ) (n log n + T (k − 1, ε/3)) from which we derive that the total running time is bounded by
     k 2
nO(9 /ε ) log n.


5    Complexity on general graphs
So far, we have discussed the M AX AGREE[k] and M IN D IS AGREE[k] problems on complete graphs. In this
section, we note some results on the complexity of these problems when the graph can be arbitrary. As we
will see, the problems become much harder in this case.

Theorem 5.1. There is a polynomial time factor 0.878 approximation algorithm for M AX AGREE[2] on
general graphs. For every k ≥ 3, there is a polynomial time factor 0.7666 approximation algorithm for
M AX AGREE[k] on general graphs.

Proof. The bound for the 2-clusters case follows from the Goemans-Williamson algorithm for M AX C UT
modified in the obvious way to account for the positive edges. Specifically, we can write a semidefinite
program relaxation for M AX AGREE[2] similar to the GW semidefinite relaxation of M AX C UT: There is a
unit vector associated with each vertex, and the objective function, which now includes terms for the positive
edges, equals
                                               1 − hvi , v j i                   1 + hvi , v j i
                                      ∑             2
                                                               +       ∑              2
                                                                                                 .
                               (i, j) negative                   (i, j) positive

The rounding is identical to the GW random hyperplane rounding. By the GW analysis, we know that the
probability that vi and v j are separated by a random hyperplane is at least 0.878 times (1/2)(1 − hvi , v j i).
By a similar calculation, it can be shown that the probability that vi and v j are not separated by a random
hyperplane is at least 0.878 times (1/2)(1 + hvi , v j i). These facts imply that the expected agreement of the
clustering produced by random hyperplane rounding is at least 0.878 times the optimum value of the above
semidefinite program, which in turn is at least as large as the maximum agreement with two clusters.
    The bound for k ≥ 3 is obtained by Swamy [18] who also notes that slightly better bounds are possible
for 3 ≤ k ≤ 5.

                           T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                              263
                                       I. G IOTIS AND V. G URUSWAMI

    The M AX AGREE[2] problem on general graphs includes as a special case the M AX C UT problem. There-
fore, by the recent work on hardness of approximating M AX C UT [15], the above approximation guarantee
for M AX AGREE[2] is the best possible, unless the Unique Games Conjecture is false.
                                               √
Theorem 5.2. There is a polynomial time O( log n) approximation algorithm for M IN D IS AGREE[2] on
general graphs. For k ≥ 3, M IN D IS AGREE[k] on general graphs cannot be approximated within any finite
factor.

Proof. The bound for 2-clustering follows by the simple observation that M IN D IS AGREE[2] on general
graphs reduces to M IN 2CNFD ELETION, i. e., given an instance of 2SAT, determining the minimum number
                                                                                          √
of clauses that have to be deleted to make it satisfiable. The latter problem admits an O( log n) approxima-
tion algorithm [1]. The result on M IN D IS AGREE[k] for k ≥ 3 follows by a reduction from k-coloring. When
k ≥ 3, it is NP-hard to tell if a graph is k-colorable, and thus even given an instance of M IN D IS AGREE[k]
with only negative edges, it is NP-hard to determine if the optimum number of disagreements is zero or
positive.

Acknowledgments. We thank the anonymous referees for several useful comments on the presentation.


References
                                                                                       √
 [1] * A. AGARWAL , M. C HARIKAR , K. M AKARYCHEV, AND Y. M AKARYCHEV: O( log n) approxi-
     mation algorithms for Min Uncut, Min 2CNF deletion, and directed cut problems. In Proc. 37th STOC,
     pp. 573–581. ACM Press, 2005. [STOC:10.1145/1060590.1060675]. 5

 [2] * N. A ILON , M. C HARIKAR , AND A. N EWMAN: Aggregating Inconsistent Information: Ranking
     and Clustering. In Proc. 37th STOC, pp. 684–693. ACM Press, 2005. [STOC:1060590.1060692]. 1,
     1.1

 [3] * N. A LON , K. M AKARYCHEV, Y. M AKARYCHEV, AND A. NAOR: Quadratic forms on graphs. In
     Proc. 37th STOC, pp. 486–493. ACM Press, 2005. [STOC:10.1145/1060590.1060664]. 1, 1.1

 [4] * S. A RORA , E. B ERGER , E. H AZAN , G. K INDLER , AND S. S AFRA: On non-approximability
     for quadratic programs. In Proc. 46th FOCS, pp. 206–215. IEEE Computer Society Press, 2005.
     [FOCS:10.1109/SFCS.2005.57]. 1, 1.1

 [5] * N. BANSAL , A. B LUM , AND S. C HAWLA: Correlation clustering. Machine Learning, Special Issue
     on Clustering, 56:89–113, 2004. [doi:10.1023/B:MACH.0000033116.57574.95]. 1, 1.1, 1.2, 2

 [6] * A. B EN -D OR , R. S HAMIR , AND Z. YAKHINI: Clustering gene expression patterns. Journal of
     Computational Biology, 6:281–297, 1999. 1, 1.1

 [7] * M. C HARIKAR , V. G URUSWAMI , AND A. W IRTH: Clustering with qualitative information. Journal
     of Computer and System Sciences, 71(3):360–383, October 2005. [JCSS:10.1016/j.jcss.2004.10.012].
     1, 1.1

 [8] * M. C HARIKAR AND A. W IRTH: Maximizing quadratic programs: extending Grothendieck’s
     inequality.   In Proc. 45th FOCS, pp. 54–60. IEEE Computer Society Press, 2004.
     [FOCS:10.1109/FOCS.2004.39]. 1, 1.1

                         T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                             264
                   C ORRELATION C LUSTERING WITH A F IXED N UMBER OF C LUSTERS

 [9] * E. D EMAINE AND N. I MMORLICA: Correlation clustering with partial information. In Proc.
     6th Internat. Workshop on Approximation Algorithms for Combinatorial Optimization Problems (AP-
     PROX’03), volume 2764 of LNCS, pp. 1–13. Springer, 2003. [doi:10.1007/b11961]. 1.1

[10] * D. E MANUEL AND A. F IAT: Correlation clustering—minimizing disagreements on arbitrary
     weighted graphs. In Proc. 11th European Symp. on Algorithms (ESA’03), pp. 208–220. Springer,
     2003. [doi:10.1007/b13632]. 1.1

[11] * W. F ERNANDEZ DE LA V EGA , M. K ARPINSKI , C. K ENYON , AND Y. R ABANI: Approxi-
     mation schemes for clustering problems. In Proc. 35th STOC, pp. 50–58. ACM Press, 2003.
     [STOC:10.1145/780542.780550]. 1.2

[12] * W. F ERNANDEZ DE LA V EGA AND C. K ENYON: A randomized approximation scheme
     for metric max-cut. In Proc. 39th FOCS, pp. 468–471. IEEE Computer Society Press, 1998.
     [FOCS:10.1109/SFCS.1998.743497]. 1.2, 2

[13] * O. G OLDREICH , S. G OLDWASSER , AND D. RON: Property testing and its connection to learning
     and approximation. Journal of the ACM, 45(4):653–750, July 1998. [JACM:10.1145/285055.285060].
     1.2, 3

[14] * P. I NDYK: A sublinear-time approximation scheme for clustering in metric spaces. In Proc. 40th
     FOCS, pp. 154–159. IEEE Computer Society Press, 1999. [FOCS:10.1109/SFFCS.1999.814587]. 1.2,
     2

[15] * S. K HOT, G. K INDLER , E. M OSSEL , AND R. O’D ONNELL: Optimal inapproximability results for
     Max Cut and other 2-variable CSPs. In Proc. 45th FOCS, pp. 146–154. IEEE Computer Society Press,
     2004. [FOCS:10.1109/FOCS.2004.49]. 5

[16] * A. N EMIROVSKI , C. ROOS , AND T. T ERLAKY: On maximization of quadratic form over in-
     tersection of ellipsoids with common center. Mathematical Programming, 86(3):463–473, 1999.
     [STACS:922f1ftd6bpfxhk7]. 1.1

[17] * R. S HAMIR , R. S HARAN , AND D. T SUR: Cluster graph modification problems. In Proc. 28th
     Workshop on Graph Theory (WG’02), pp. 379–390, 2002. 1, 1.1, 1.2, 2

[18] * C. S WAMY: Correlation Clustering: Maximizing agreements via semidefinite programming.
     In Proc. 15th ACM-SIAM Symp. on Discrete Algorithms (SODA’04), pp. 519–520. SIAM, 2004.
     [SODA:982866]. 1.1, 5


AUTHORS

     Ioannis Giotis
     Department of Computer Science and Engineering,
     University of Washington, Seattle, WA
     giotis cs washington edu
     http://www.cs.washington.edu/homes/giotis/




                        T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                       265
                                   I. G IOTIS AND V. G URUSWAMI

   Venkatesan Guruswami
   Department of Computer Science and Engineering,
   University of Washington, Seattle, WA
   venkat cs washington edu
   http://www.cs.washington.edu/homes/venkat/


ABOUT THE AUTHORS

   I OANNIS G IOTIS is a graduate student in the Department of Computer Science and Engineer-
       ing at the University of Washington. Ioannis received his undergraduate degree from the
       Computer Engineering and Informatics Department at the University of Patras, Greece. His
       research interests include game theory, probabilistic techniques, and approximation algo-
       rithms.


   V ENKATESAN G URUSWAMI is an Assistant Professor of Computer Science and Engineering
      at the University of Washington in Seattle. Venkat received his Bachelor’s degree from the
      Indian Institute of Technology at Madras in 1997 and his Ph. D. from the Massachusetts
      Institute of Technology in 2001 under the supervision of Madhu Sudan. He was a Miller
      Research Fellow at the University of California, Berkeley during 2001-2002. His research
      interests include the theory of error-correcting codes, approximation algorithms for opti-
      mization problems and corresponding hardness results, algebraic algorithms, explicit com-
      binatorial constructions, and pseudorandomness.
      Venkat is a recipient of the David and Lucile Packard Fellowship (2005), Alfred P. Sloan
      Research Fellowship (2005), an NSF Career Award (2004), ACM’s Doctoral Dissertation
      Award (2002), and the IEEE Information Theory Society Paper Award (2000).
      Venkat was born and grew up in the South Indian metropolis of Chennai. He enjoys play-
      ing badminton and other racquet games, hiking within his humble limits, listening to Car-
      natic (South Indian classical) music, and staring at Mount Rainier (on the few days Seattle
      weather is kind enough to permit a glimpse).




                     T HEORY OF C OMPUTING, Volume 2 (2006), pp. 249–266                            266