A Constant-Factor Approximation Algorithm for Co-clustering
Authors: Aris Anagnostopoulos, Anirban Dasgupta, Ravi Kumar
License: CC-BY-3.0
                             Theory of Computing, Volume 8 (2012), pp. 597–622
                                          www.theoryofcomputing.org




                 A Constant-Factor Approximation Algorithm for Co-clustering∗

             Aris Anagnostopoulos†          Anirban Dasgupta          Ravi Kumar
                                Received: August 4, 2011; published: December 10, 2012.




         Abstract: Co-clustering is the simultaneous partitioning of the rows and columns of a matrix
         such that the blocks induced by the row/column partitions are good clusters. Motivated by
         several applications in text mining, market-basket analysis, and bioinformatics, this problem
         has attracted a lot of attention in the past few years. Unfortunately, to date, most of the
         algorithmic work on this problem has been heuristic in nature.
             In this work we obtain the first approximation algorithms for the co-clustering problem.
         Our algorithms are simple and provide constant-factor approximations to the optimum. We
         also show that co-clustering is NP-hard, thereby complementing our algorithmic result.

ACM Classification: F.2.0
AMS Classification: 68W25
Key words and phrases: co-clustering, biclustering, clustering, approximation

1      Introduction
Clustering is a fundamental primitive in many data-analysis applications, including information retrieval,
databases, text and data mining, bioinformatics, market-basket analysis, and so on [12, 20]. The central
objective in clustering is the following: given a set of points and a pairwise distance measure, partition
the set into clusters such that points that are close to each other according to the distance measure occur
together in a cluster and points that are far away from each other occur in different clusters. This objective
     ∗ A preliminary version appeared in the Proc. 27th ACM Symp. on Principles of Database Systems (PODS 2008). The
present version contains complete proofs. In particular, Section 3.1 has been rewritten to be self-contained.
     † Partially supported by the EU FP7 Project N. 255403 SNAPS.



    © 2012 Aris Anagnostopoulos, Anirban Dasgupta, and Ravi Kumar
    Licensed under a Creative Commons Attribution License                DOI: 10.4086/toc.2012.v008a026

sounds straightforward, but it is not easy to state a universal desideratum for clustering—Kleinberg showed
in a reasonable axiomatic framework that clustering is an impossible problem to solve [22]. In general,
the clustering objectives tend to be application specific, exploiting the underlying structure in the data
and imposing additional structure on the clusters themselves.
     In several applications, the data itself has a lot of structure, which may be hard to capture using a
traditional clustering objective. Consider the example of a Boolean matrix, whose rows correspond to
keywords and whose columns correspond to advertisers, and an entry is one if and only if the advertiser
has placed a bid on the keyword. The goal is to cluster both the advertisers and the keywords. One way
to accomplish this would be to independently cluster the advertisers and keywords using the standard
notion of clustering—cluster similar advertisers and cluster similar keywords. However (even though
for some criteria this might be a reasonable solution, as we argue subsequently in this work), such an
endeavor might fail to elicit subtle structures that might exist in the data: perhaps, there are two disjoint
sets of advertisers A1 , A2 and keywords K1 , K2 such that each advertiser in Ai bids on each keyword in
K j if and only if i = j. In an extreme case, maybe there is a combinatorial decomposition of the matrix
into blocks such that each block is either almost full or almost empty. To be able to discover such things,
the clustering objective has to simultaneously intertwine information about both the advertisers and
keywords that is present in the matrix. This is precisely achieved by co-clustering [17, 7]; other terms for
co-clustering include biclustering, bidimensional clustering, and subspace clustering.
     In the simplest version of (k, ℓ)-co-clustering, we are given a matrix of numbers and two integers k
and ℓ. The goal is to partition the rows into k clusters and the columns into ℓ clusters such that the sum-
squared deviation from the mean within each “block” induced by the row-column partition is minimized.
This definition, along with different objectives, is made precise in Section 2. Co-clustering has received
a lot of attention in recent years, with several applications in text mining [10, 15], market-basket data
analysis, image, speech, and video analysis, and bioinformatics [7, 8, 23]; see the paper by Banerjee
et al. [3] and the survey by Madeira and Oliveira [25].
     Even though co-clustering has been extensively studied in many application areas, very little is known
about it from an algorithmic angle. Very special variants of co-clustering are known to be NP-hard [18].
A natural generalization of the k-means algorithm to co-clustering is known to converge [3]. Apart from
these, most of the algorithmic work done on co-clustering has been heuristic in nature, with no proven
guarantees of performance.


Main contributions In this paper we address the problem of co-clustering from an algorithmic point
of view. Our main contribution is the first constant-factor approximation algorithm for the (k, `)-co-
clustering problem. Our algorithm is simple and builds upon approximation algorithms for a variant
of the k-median problem, which we call k-means_p. The algorithm works for a standard set of matrix
norms and produces a 3α-approximate solution, where α is the approximation factor for the k-means_p
problem; for the latter, we obtain a constant-factor approximation by extending known results on the
k-median problem. We next consider the important special case of the Frobenius norm and obtain a
(√2 + ε)-approximation algorithm when k and ℓ are fixed, by exploiting the geometry of the space and
the results on the k-means problem.
    We complement these results by considering the extreme cases of ℓ = 1 and ℓ = n^ε, where the matrix
is of size m × n and ε is any fixed value in (0, 1). We show that while the (k, 1)-co-clustering problem


can be solved exactly in time O(mn + m²k), the (k, n^ε)-co-clustering problem is NP-hard for k ≥ 2 under
the ℓ_1 and ℓ_2 norms.
    In the data-mining and machine-learning literature there exists a variety of co-clustering problems,
depending on the precise objective function one is trying to optimize; we give some examples in
Section 2. The version that we address in this paper is a natural one, and at the same time it possesses
a structure that allows us to approximately solve it by decomposing it into single-dimensional
clustering problems. This is not true for other types of objectives. For instance, if one considers the
residue given by (2.2) (for more discussion of the relevant concepts refer to Section 2), then one can
construct counterexamples showing that the problem is not decomposable; solving such cases is interesting
future work.



Related work Research on clustering has a long and varied history, with work ranging from approxima-
tion algorithms to axiomatic developments of the objective functions [19, 12, 22, 20, 36, 16]. The problem
of co-clustering itself has found growing applications in several practical fields, for example, simultane-
ously clustering words and documents in information retrieval [10], clustering genes and expression data
for biological data analysis [7, 34], clustering users and products for recommendation systems [1], and
so on. The exact objective function, and the corresponding definition of co-clustering, vary depending
on the type of structure we want to extract from the data. The hardness of the co-clustering problem
depends on the exact merit function to be used [18]. Consequently, work on co-clustering has mostly
focused on heuristics that work well in practice. Excellent references on such methods are the surveys by
Madeira and Oliveira [25] and Tanay, Sharan, and Shamir [32]. Banerjee et al. [3] unified a number of
merit functions for the co-clustering problem under the general setting of Bregman divergences and gave
a k-means style algorithm that is guaranteed to monotonically decrease the merit function. Our objective
function for the p = 2 case is precisely the ‖·‖_F merit function for which their results apply.
     There is little work along the lines of approximation algorithms for the co-clustering problems. The
closest algorithmic work to this problem relates to finding cliques and dense bipartite subgraphs [28].
These variants are often hard even to approximate to within a constant factor. Hassanpour [18] showed
that a version of the co-clustering problem that seeks homogeneous submatrices is hard. Feige and
Kogan [13] showed that the problem of finding the maximum biclique cannot be approximated to within
2^((log n)^δ) for some δ = δ(ε) > 0, assuming that 3-SAT cannot be solved in O(2^(n^(3/4+ε))) deterministic time
for some fixed ε > 0.
    Concurrently with the conference publication of our work, Puolamäki et al. [30] published results
on the co-clustering problem for objective functions of the same form as those we study. They analyze
the same algorithm for two cases: the ℓ_1 norm for 0/1-valued matrices and the ℓ_2 norm for real-valued
matrices. In the first case they obtain a better approximation factor than ours (2.414α as opposed to
3α, where α is the best approximation factor for one-sided clustering). On the other hand, our result
is more general as it holds for any ℓ_p norm and for real-valued matrices. Their ℓ_2 result is the same as
ours (√2·α-approximation) and their proof is similar (although presented differently). Subsequent to the
conference publication of our work, Jegelka et al. [21] showed that our algorithm can be extended to
tensor clustering and to general metrics.


2       Preliminaries and problem definition
In this section we mention briefly some of the variants of the objective function that have been proposed in
the co-clustering literature and are close to the ones we use in this work. Other commonly used objectives
are based on information-theoretic quantities.
    Let A = {a_ij} ∈ R^{m×n} be the matrix that we want to co-cluster. A (k, ℓ)-co-clustering is a k-partitioning
I = {I_1, . . . , I_k} of the set of rows {1, . . . , m} (i. e., we want to find k sets I_1, . . . , I_k, with I_i ⊆ {1, . . . , m},
such that I_i ∩ I_{i′} = ∅ for i ≠ i′ and ∪_{i=1}^{k} I_i = {1, . . . , m}) and an ℓ-partitioning J = {J_1, . . . , J_ℓ} of the set of
columns {1, . . . , n}.
    Cho et al. [8] define two different co-clustering objectives, based on two different definitions of
residue. For every element a_ij that belongs to the (I, J)-co-cluster define its residue h_ij either as

                                                           a_ij − a_IJ ,                                                    (2.1)

or as
                                                    a_ij − a_iJ − a_Ij + a_IJ ,                                              (2.2)

where

    a_IJ = (1/(|I| · |J|)) ∑_{i∈I, j∈J} a_ij     is the average of all the entries in the co-cluster,

    a_iJ = (1/|J|) ∑_{j∈J} a_ij     is the mean of the entries in row i whose columns belong to J, and

    a_Ij = (1/|I|) ∑_{i∈I} a_ij     is the mean of the entries in column j whose rows belong to I.

   Having defined the (two different) residues, the goal is to minimize some norm of the residue matrix
H = (h_ij). The norm most commonly used in the literature is the Frobenius norm, ‖·‖_F, defined as the
square root of the sum of the squares of the elements:

                                    ‖H‖_F = ( ∑_{i=1}^{m} ∑_{j=1}^{n} h_ij² )^{1/2} .
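As a concrete illustration (our own sketch, not machinery from the paper; the helper names are hypothetical), the two residue definitions and the Frobenius norm can be computed directly from the formulas above:

```python
import math

def residues(A, I, J, version=1):
    """Residue h_ij for the entries of co-cluster (I, J) of matrix A.

    version=1 uses (2.1): h_ij = a_ij - a_IJ
    version=2 uses (2.2): h_ij = a_ij - a_iJ - a_Ij + a_IJ
    """
    a_IJ = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))
    H = {}
    for i in I:
        a_iJ = sum(A[i][j] for j in J) / len(J)      # row mean over columns in J
        for j in J:
            a_Ij = sum(A[r][j] for r in I) / len(I)  # column mean over rows in I
            if version == 1:
                H[(i, j)] = A[i][j] - a_IJ
            else:
                H[(i, j)] = A[i][j] - a_iJ - a_Ij + a_IJ
    return H

def frobenius(H):
    """Frobenius norm: square root of the sum of squared residues."""
    return math.sqrt(sum(h * h for h in H.values()))
```

For an "additive" matrix such as [[1, 2], [3, 4]], the residue (2.2) vanishes identically while residue (2.1) does not; this is the kind of structure the second definition is designed to discount.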

   One can attempt to minimize some other norm; for example, Yang et al. [35] attempt to find (possibly
overlapping) clusters that minimize the value

                                    (1/(|I| · |J|)) ∑_{i∈I} ∑_{j∈J} |h_ij|

using definition (2.2), where I and J define a cluster.
    More generally, one can define the norm

                                    |‖H‖|_p = ( ∑_{i=1}^{m} ∑_{j=1}^{n} |h_ij|^p )^{1/p} .                                 (2.3)



Note that the Frobenius norm is a special case, where p = 2.
    In this work we study the general case of norms of the form (2.3), for p ≥ 1, using the residue
definition (2.1); designing approximation algorithms for the residue definition (2.2) is an
open problem. Specifically, we assume that the data is given in the form of a matrix A ∈ R^{m×n}. We denote
the ith row of A by A_{i⋆} and the jth column of A by A_{⋆j}. The aim in co-clustering is to simultaneously
cluster the rows and columns of A, so as to minimize the difference between A and the clustered matrix.
Formally, we want to compute a k-partitioning I = {I_1, . . . , I_k} of the set of rows {1, . . . , m} and an
ℓ-partitioning J = {J_1, . . . , J_ℓ} of the set of columns {1, . . . , n}. For any such partition I = {I_1, . . . , I_k} of
rows (or columns), a convenient way for us to describe it is in terms of a clustering index matrix, say
R = {r_ij} ∈ R^{m×k}, such that each column of R is essentially the index vector of the corresponding
part of the partition I, that is, r_{iI} = 1 if i ∈ I and 0 otherwise (see Figure 1). (Note that we slightly abuse
notation and use the partition as an index. For example, the element r_{i I_s} refers to the element r_{is} of
matrix R, and it equals 1 if i ∈ I_s and 0 otherwise.) We define a similar index matrix C = {c_ij} ∈ R^{ℓ×n}
for the column partition J in a similar fashion: c_{Jj} = 1 if j ∈ J and 0 otherwise. The co-clustering
solution is then completely defined by the three matrices R ∈ R^{m×k}, M ∈ R^{k×ℓ} (to be defined below),
and C ∈ R^{ℓ×n} (see Figure 1). For each pair (I, J) of a row cluster and a column cluster, we refer to the
set of indices I × J as a block.
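To make the index-matrix formalism concrete, here is a small sketch (the function names are ours) that builds R and C from explicit partitions and multiplies out RMC, which is constant on each block:

```python
def row_index_matrix(partition, m):
    """Clustering index matrix R in R^{m x k}: R[i][s] = 1 iff row i is in part I_s."""
    k = len(partition)
    R = [[0] * k for _ in range(m)]
    for s, part in enumerate(partition):
        for i in part:
            R[i][s] = 1
    return R

def col_index_matrix(partition, n):
    """Column index matrix C in R^{l x n}: C[t][j] = 1 iff column j is in part J_t."""
    l = len(partition)
    C = [[0] * n for _ in range(l)]
    for t, part in enumerate(partition):
        for j in part:
            C[t][j] = 1
    return C

def matmul(X, Y):
    """Plain matrix product, enough to form R * M * C."""
    return [[sum(X[i][r] * Y[r][j] for r in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]
```

With I = {{0, 2}, {1}}, J = {{0}, {1, 2}} and M = [[5, 7], [9, 11]], the product RMC equals [[5, 7, 7], [9, 11, 11], [5, 7, 7]]: entry (i, j) is the value µ_st of the block containing it.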
    The clustering error associated with the co-clustering (I, J) is defined to be the quantity

                                                          |‖A − RMC‖|_p ,                                                (2.4)

where M = {µ_ij} is defined as the matrix in R^{k×ℓ} that minimizes

                                    M = argmin_{X ∈ R^{k×ℓ}} |‖A − RXC‖|_p .

Let m_I = |I| be the size of the row cluster I and n_J = |J| the size of the column cluster J. By the
definition of |‖·‖|_p, we can write (see also Figure 1)

      |‖A − RMC‖|_p = ( ∑_{I∈I, J∈J} |‖A_IJ − R_I M C_J‖|_p^p )^{1/p}
                    = ( ∑_{I∈I, J∈J} |‖A_IJ − µ_IJ · 1_{m_I × n_J}‖|_p^p )^{1/p} ,                                       (2.5)

where each A_IJ = {(a_IJ)_ij} ∈ R^{m_I × n_J} and equals

               (a_IJ)_ij = a_{α_i, β_j} ,    if I = {α_1, . . . , α_i, . . . , α_{m_I}} and J = {β_1, . . . , β_j, . . . , β_{n_J}} ,

each R_I = {(r_I)_ij} ∈ R^{m_I × k} and equals

                                              (r_I)_ij = 1 if I = I_j , and 0 otherwise

(which means that if I refers to the jth row cluster then the jth column of R_I is 1), each C_J = {(c_J)_ij} ∈ R^{ℓ × n_J}
and equals
                                    (c_J)_ij = 1 if J = J_i , and 0 otherwise


[Figure 1 shows the matrices A, R, M, and C side by side, with m = 11 rows, n = 8 columns, k = 5 row
clusters, and ℓ = 3 column clusters; the highlighted block A_{I_2 J_3} maps to the single entry µ_{I_2 J_3} of M.]

Figure 1: An example of co-clustering, where rows and columns that appear in the same cluster are placed
next to each other. Here, A_IJ is represented by R_I M C_J = µ_IJ · 1_{m_I × n_J} as in (2.5); for instance, the 2 × 2
submatrix A_{I_2 J_3} is represented by R_{I_2} M C_{J_3} = µ_{I_2 J_3} · 1_{2×2}.


(which means that if J refers to the ith column cluster then the ith row of C_J is 1), each µ_IJ ∈ R is the
mean element associated with co-cluster (I, J), and 1_{m_I × n_J} is the matrix of size m_I × n_J
with all elements equal to 1. Two special cases are of particular interest to us: p = 1 and p = 2. For the p = 2
case, the matrix norm |‖·‖|_p corresponds to the well-known Frobenius norm ‖·‖_F, and the value µ_IJ
is the average of the entries of the corresponding block. For the p = 1 case, the norm is the sum
of the absolute values of the entries of the matrix, and the corresponding µ_IJ value
is the median of the entries in that block.
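The claim that the block representative µ_IJ is the mean for p = 2 and the median for p = 1 can be checked numerically; the brute-force helper below (our own, purely illustrative) scans a grid of candidate centers:

```python
def block_cost(entries, mu, p):
    """Sum over a block of |a - mu|^p, the inner term of the |||.|||_p objective."""
    return sum(abs(a - mu) ** p for a in entries)

def best_center(entries, p, grid):
    """Brute-force the mu minimizing the block cost over candidate values."""
    return min(grid, key=lambda mu: block_cost(entries, mu, p))
```

For the entries {1, 2, 9} the minimizer over a fine grid is 4.0 (the mean) for p = 2, and 2.0 (the median) for p = 1.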
    In general the number of row and column clusters, k and ℓ, is part of the input. In some cases one can
obtain stronger results when k and ℓ are constants; when this is the case we mention it explicitly.


3    Algorithm
In this section we give a simple algorithm for co-clustering for the objective function defined in (2.4).
We first present the algorithm and then show that for the general |‖·‖|_p norm the algorithm gives a
constant-factor approximation. We then do a tighter analysis for the simpler case of |‖·‖|_2 (i. e., the
Frobenius norm), to show that we get a (√2 + ε)-approximation when k and ℓ are constant.

Algorithm 1: Co-cluster(A, k, ℓ)
Require: Matrix A ∈ R^{m×n}, number of row clusters k, number of column clusters ℓ.
 1: Compute an α-approximate clustering of the row vectors with k clusters. Let Î be the resulting
    clustering.
 2: Compute an α-approximate clustering of the column vectors with ℓ clusters. Let Ĵ be the resulting
    clustering.
 3: return (Î, Ĵ).
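A runnable sketch of Algorithm 1 for the p = 2 case follows, with Lloyd's heuristic (plus deterministic farthest-first seeding) standing in for the α-approximate one-sided clustering subroutine. All function names are ours, and Lloyd's heuristic carries no worst-case guarantee, unlike the subroutines assumed in the analysis:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic farthest-first initialization of the cluster centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def lloyd_kmeans(points, k, iters=20):
    """Lloyd's heuristic for k-means: a stand-in for the alpha-approximate
    one-sided clustering subroutine of steps 1 and 2."""
    centers = farthest_first(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: each center moves to the mean of its cluster.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def co_cluster(A, k, l):
    """Algorithm Co-cluster(A, k, l): cluster the rows and the columns
    of A independently and return the two partitions."""
    rows = [list(r) for r in A]
    cols = [list(c) for c in zip(*A)]
    return lloyd_kmeans(rows, k), lloyd_kmeans(cols, l)
```

On a matrix with clear block structure, such as [[1, 1, 9, 9], [1, 1, 9, 9], [9, 9, 1, 1]], the sketch recovers the row partition {rows 0, 1 | row 2} and the column partition {columns 0, 1 | columns 2, 3}.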



    Before analyzing the algorithm we need to show how we can solve the subproblems of steps 1 and 2.
For this we use some results for “standard” one-sided clustering and we develop some new ones needed
to solve the problem for any p-norm.

3.1    One-sided clustering
In the standard clustering problem, we are given n points in a metric space, possibly Rd , and an objective
function that measures the quality of any given clustering of the points. Various such objective functions
have been extensively used in practice, and have been analyzed in the theoretical computer science
literature (k-center, k-median, k-means, etc.). As an aid to our co-clustering algorithm, we are particularly
interested in the following setting of the problem, which we call k-means_p. Given a set of vectors
a_1, a_2, . . . , a_n ∈ R^d, the standard ℓ_p distance metric ‖·‖_p, and an integer k, we first define the cost of a
partitioning I = {I_1, . . . , I_k} of {1, . . . , n} as follows. For each cluster I ∈ I, the center of the cluster I is
defined to be the vector µ_I such that

                                    µ_I = argmin_{x ∈ R^d} ∑_{j∈I} ‖a_j − x‖_p^p .                                 (3.1)

The cost of the clustering I is then defined to be the sum over all points of the pth power of the distance
to the corresponding cluster center, raised to the power 1/p, that is,

                                    ( ∑_{I∈I} ∑_{j∈I} ‖a_j − µ_I‖_p^p )^{1/p} .

The goal of the k-means_p problem is to minimize this cost. In general the number of clusters k is part of
the input. We also consider the case where k is constant, for which we can obtain stronger results; when
this is the case we mention it explicitly.
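For p ∈ {1, 2} the center in (3.1) has a closed form per coordinate, since the ℓ_p^p cost decomposes coordinatewise. The sketch below (our own helper names, restricted to these two values of p) computes the k-means_p cost of a given partitioning:

```python
def center(points, p):
    """Coordinatewise minimizer of sum_j ||a_j - x||_p^p: the cost decomposes
    over coordinates, so each coordinate is a 1-d problem
    (mean for p = 2, median for p = 1; other p are not handled here)."""
    d = len(points[0])
    mu = []
    for t in range(d):
        vals = sorted(pt[t] for pt in points)
        if p == 2:
            mu.append(sum(vals) / len(vals))
        else:  # p == 1: any median minimizes the sum of absolute deviations
            mu.append(vals[len(vals) // 2])
    return mu

def kmeans_p_cost(points, partition, p):
    """Cost of a clustering: (sum over clusters I, points j of ||a_j - mu_I||_p^p)^(1/p)."""
    total = 0.0
    for part in partition:
        mu = center([points[j] for j in part], p)
        for j in part:
            total += sum(abs(a - b) ** p for a, b in zip(points[j], mu))
    return total ** (1.0 / p)
```

For the points (0, 0), (2, 0), (10, 10) and the partition {{0, 1}, {2}}, the cost is √2 for p = 2 and 2 for p = 1.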
    In matrix notation, the points define a matrix A = [a_1, . . . , a_n]^T ∈ R^{n×d}. We represent each
clustering I = {I_1, . . . , I_k} of the n points in R^d by a clustering index matrix R ∈ R^{n×k}. Each column of
R is essentially the index vector of the corresponding cluster: R_{iI} = 1 if a_i belongs to cluster I, and 0
otherwise (see Figure 2). Similarly, the matrix M ∈ R^{k×d} consists of the centers of the clusters,
that is, M = [µ_1, . . . , µ_k]^T. Thus, the aim is to find the clustering index matrix R that minimizes

                                                    |‖A − RM‖|_p ,

where M is defined as the matrix in R^{k×d} that minimizes

                                    M = argmin_{X ∈ R^{k×d}} |‖A − RX‖|_p .

Let m_I be the size of the row cluster I, A_I ∈ R^{m_I × d} the corresponding submatrix of A, and R_I ∈ R^{m_I × k} the
corresponding submatrix of R. Also let A_{i⋆} be the ith row vector of A. We can write

              |‖A − RM‖|_p^p = ∑_{I∈I} |‖A_I − R_I M‖|_p^p = ∑_{I∈I} ∑_{i∈I} ‖A_{i⋆} − µ_I‖_p^p .                    (3.2)



[Figure 2 shows the matrices A, R, and M side by side, with n = 11 points, d = 8 coordinates, and k = 5
clusters.]

Figure 2: An example of a row clustering, where rows that appear in the same cluster are placed next to
each other. Here, A_I is represented by R_I · M as in (3.2).



     Of particular interest to us is the value p = 2. In that case, the center µ_I of each cluster is the average
of all the points A_{i⋆} in that cluster (the average minimizes expression (3.1) for p = 2). This problem is
commonly known as the k-means problem, and it admits a polynomial-time (1 + ε)-approximation algorithm
when k is fixed.

Theorem 3.1 ([24]). For any fixed values of k and ε > 0 there is a polynomial-time algorithm that
achieves a (1 + ε)-factor approximation for the k-means problem, even if the dimension is part of the input.

     In Theorem 3.4 we show that there exists a constant-factor approximation algorithm for the k-means_p
problem for any p (and k). Our proof is based on the analysis of the related k-median problem
by Charikar et al. [6]. In the k-median problem we are given a set A = {a_1, . . . , a_n} of points, the cost
d(a_i, a_j) of assigning point a_i to point a_j, and an integer k > 0, and the goal is to find a k-partitioning I of
{1, . . . , n} that minimizes the cost

                                    ∑_{I∈I} ∑_{i∈I} d(a_i, µ_I^med) ,

where the k-median center µ_I^med of cluster I is given by

                                    µ_I^med = argmin_{a∈A} ∑_{i∈I} d(a_i, a) .                                 (3.3)

In the metric case, d(·, ·) is assumed to be symmetric and to satisfy the triangle inequality. Unfortunately,
in our case we use the norm ℓ_p^p, for which the triangle inequality does not hold when p > 1. However, as
we show next in Lemma 3.3, it does hold approximately. This allows us to use the following result of
Charikar et al. [6], which is a simple extension of their constant-factor approximation for the metric
k-median problem.¹
   ¹ While Charikar et al. [6] study the metric case, they mention that their results can be extended to the case where the triangle
inequality holds only approximately, as described in Theorem 3.2. Following their proof shows that the approximation ratio can
be bounded by 4(δ² + δ) ≤ 8δ², for δ ≥ 1.



Theorem 3.2 ([6]). There is a polynomial-time algorithm that achieves an 8δ²-approximation to the
metric k-median problem if the costs satisfy a δ-approximate triangle inequality for δ ≥ 1, that is,
d(a_i, a_o) ≤ δ (d(a_i, a_j) + d(a_j, a_o)) for each i, j, o.

   Next we prove Lemma 3.3, which shows that the distance measure induced by ℓ_p^p satisfies an
approximate version of the triangle inequality.

Lemma 3.3. Let u, v, w ∈ R^d and let p ≥ 1. Then ‖u − v‖_p^p ≤ 2^{p−1} (‖u − w‖_p^p + ‖w − v‖_p^p).

Proof. Note that the function f(x) = |x|^p is convex for p ≥ 1 and hence

        |(a + b)/2|^p ≤ (|a|^p + |b|^p)/2 ,   or, equivalently,   |a + b|^p ≤ 2^{p−1} (|a|^p + |b|^p) .

Let u = (u_1, . . . , u_d), and similarly for v, w. We have

                                  ‖u − v‖_p^p = ∑_{i=1}^{d} |u_i − v_i|^p
                                              = ∑_{i=1}^{d} |(u_i − w_i) + (w_i − v_i)|^p
                                              ≤ ∑_{i=1}^{d} 2^{p−1} ( |u_i − w_i|^p + |w_i − v_i|^p )
                                              = 2^{p−1} ( ‖u − w‖_p^p + ‖w − v‖_p^p ) .
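Lemma 3.3 is easy to sanity-check numerically; the helper below (ours, purely illustrative) tests the δ-approximate triangle inequality with δ = 2^{p−1}, including the tight one-dimensional case u = 0, w = 1, v = 2 at p = 2, where both sides equal 4:

```python
def lp_p(u, v, p):
    """The distance measure ||u - v||_p^p used in place of a metric."""
    return sum(abs(a - b) ** p for a, b in zip(u, v))

def approx_triangle_holds(u, v, w, p):
    """Check Lemma 3.3: ||u-v||_p^p <= 2^(p-1) (||u-w||_p^p + ||w-v||_p^p)."""
    return lp_p(u, v, p) <= 2 ** (p - 1) * (lp_p(u, w, p) + lp_p(w, v, p)) + 1e-12
```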

    Now we can state and prove the theorem for the one-sided clustering.

Theorem 3.4. There is a polynomial-time algorithm that achieves a 16-approximation to the k-means_p
problem for p ≥ 1.

Proof. The proof is based on the result by Charikar et al. described in Theorem 3.2. We note two issues
that make our problem different from the standard metric k-median: (i) we have d(a_i, a_j) = ‖a_i − a_j‖_p^p,
which does not satisfy the triangle inequality, and (ii) the centers in k-median are chosen from the given
set A (i. e., µ_I^med ∈ A), whereas the centers in k-means_p are not necessarily from A. Note also a small
asymmetry between the cost functions of the k-means_p and k-median objectives: in the former, the
sum of the distances is raised to the power 1/p, unlike in the latter.
    To address issue (i), we showed in Lemma 3.3 that the distance measure ‖·‖_p^p satisfies an approximate
triangle inequality, and hence we can appeal to Theorem 3.2. We address issue (ii) as follows. Let I∗ be the
optimal clustering according to the k-means_p objective (i. e., the cluster centers can be arbitrary) and let
J∗ be the optimal clustering according to the k-median objective (i. e., the cluster centers belong to A). In
Lemma 3.5, proven directly after this proof, we show that the loss incurred by requiring the cluster centers
to belong to A can be bounded.
    Define c(I∗) to be the cost

                                    c(I∗) = ∑_{I∈I∗} ∑_{i∈I} ‖a_i − µ_I‖_p^p ,



with µ_I defined as in (3.1). Notice that this is the cost of the k-means_p objective raised to the pth power.
Define also c(J∗) to be the cost of the k-median objective:

                                    c(J∗) = ∑_{I∈J∗} ∑_{i∈I} ‖a_i − µ_I^med‖_p^p ,

with µ_I^med defined as in (3.3).
    Using δ = 2^{p−1} from Lemma 3.3, the algorithm in Theorem 3.2 outputs a clustering J̃ whose k-means_p cost is bounded by (2^{2p+1} c(J∗))^{1/p}. Using Lemma 3.5, we then conclude that the k-means_p cost of J̃ is bounded by

    (2^{3p+1} c(I∗))^{1/p} ≤ 16 (c(I∗))^{1/p}.

    We now prove the lemma used in the preceding theorem.

Lemma 3.5. If the costs are in ` pp , then c(J∗ ) ≤ 2 p c(I∗ ).

Proof. First, we claim that for each cluster I ∈ I∗, there is some a ∈ A such that

    ∑_{i∈I} ‖a_i − a‖_p^p ≤ 2^p ∑_{i∈I} ‖a_i − µ_I‖_p^p.

The proof of the claim is by a standard averaging argument over the points in cluster I. By Lemma 3.3, we have

    ∑_{j∈I} ∑_{i∈I} ‖a_i − a_j‖_p^p ≤ 2^{p−1} ∑_{j∈I} ∑_{i∈I} (‖a_i − µ_I‖_p^p + ‖µ_I − a_j‖_p^p)
                                    = 2^{p−1} (∑_{j∈I} ∑_{i∈I} ‖a_i − µ_I‖_p^p + ∑_{j∈I} ∑_{i∈I} ‖a_j − µ_I‖_p^p)
                                    = 2^p ∑_{j∈I} ∑_{i∈I} ‖a_i − µ_I‖_p^p
                                    = 2^p |I| ∑_{i∈I} ‖a_i − µ_I‖_p^p,

which means that

    (1/|I|) ∑_{j∈I} ∑_{i∈I} ‖a_i − a_j‖_p^p ≤ 2^p ∑_{i∈I} ‖a_i − µ_I‖_p^p.

Letting f(a) = ∑_{i∈I} ‖a_i − a‖_p^p, we have that

    (1/|I|) ∑_{j∈I} f(a_j) ≤ 2^p ∑_{i∈I} ‖a_i − µ_I‖_p^p.

Since the average (over the elements of cluster I ∈ I∗) of the cost f(a_j) is at most 2^p ∑_{i∈I} ‖a_i − µ_I‖_p^p,


it follows that there exists an element a_j ∈ I with

    f(a_j) ≤ 2^p ∑_{i∈I} ‖a_i − µ_I‖_p^p,

and thus we have found the desired element a ∈ A for cluster I. Applying this claim to every cluster of I∗ shows that each cluster center of I∗ can be replaced by some point of A while increasing the clustering cost by a factor of at most 2^p. The proof then follows from the optimality of J∗.
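The averaging argument above can be exercised numerically. The snippet below is an illustrative check, not part of the proof; it uses the coordinatewise mean as a stand-in center whose cost upper-bounds that of the optimal unrestricted center, so the asserted 2^p bound is implied by Lemma 3.5:

```python
import numpy as np

# Illustrative check of the averaging argument: for a single cluster, some point of the
# cluster itself is a center whose l_p^p cost is at most 2^p times the cost of the optimal
# unrestricted center. cost(mu) for any candidate mu upper-bounds nothing is assumed beyond
# cost(mu) >= optimal cost, so the bound below must hold.
rng = np.random.default_rng(1)
p = 1.5
A = rng.normal(size=(20, 3))           # one cluster of 20 points in R^3

def cost(center):
    return float(np.sum(np.abs(A - center) ** p))

mu = A.mean(axis=0)                    # cost(mu) >= cost of the optimal center
best_medoid = min(cost(a) for a in A)  # best center restricted to the points themselves
assert best_medoid <= 2 ** p * cost(mu)
```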

3.2   Constant-factor approximation
We now show that the co-clustering returned by Algorithm 1 is a constant-factor approximation to the
optimum.

Theorem 3.6. Given an algorithm for obtaining an α-approximation to the k-means p problem, the
algorithm Co-cluster (Algorithm 1) returns a co-clustering that is a 3α-approximation to an optimal
co-clustering of A.

Proof. Let (I∗, J∗) be an optimal co-clustering solution, and define the corresponding index matrices to be R∗ and C∗, respectively. Furthermore, let Î∗ be an optimal row clustering and Ĵ∗ be an optimal column clustering. Define the index matrix R̂∗ from the clustering Î∗, and the index matrix Ĉ∗ from the clustering Ĵ∗. This means that there is a matrix M̂_{R∗} ∈ R^{k×n} such that

    |‖A − R̂∗ M̂_{R∗}‖|_p

is minimized over all such index matrices representing k clusters. Similarly, there is a matrix M̂_{C∗} ∈ R^{m×ℓ} such that

    |‖A − M̂_{C∗} Ĉ∗‖|_p

is minimized over all such index matrices representing ℓ clusters.
    The algorithm Co-cluster uses approximate solutions to the one-sided row and column clustering problems to compute partitions Î and Ĵ. Let R̂ be the clustering index matrix corresponding to this row clustering and M̂_R be the matrix of centers. Similarly, let Ĉ, M̂_C be the corresponding matrices for the column clustering constructed by Co-cluster. By the assumptions of the theorem we have that

    |‖A − R̂ M̂_R‖|_p ≤ α |‖A − R̂∗ M̂_{R∗}‖|_p,                                        (3.4)

and, similarly,

    |‖A − M̂_C Ĉ‖|_p ≤ α |‖A − M̂_{C∗} Ĉ∗‖|_p.                                        (3.5)

    For the co-clustering (M̂_R, M̂_C) that the algorithm computes, define the center matrix M ∈ R^{k×ℓ} as follows. Each entry µ_{IJ} is defined to be

    µ_{IJ} = argmin_x ∑_{i∈I, j∈J} |a_{ij} − x|^p.                                    (3.6)



    Now we will show that the co-clustering (Î, Ĵ) with the center matrix M is a 3α-approximate solution. First, we lower bound the cost of the optimal co-clustering solution by the costs of the optimal row clustering and the optimal column clustering. Since (R̂∗, M̂_{R∗}) is the optimal row clustering, we have that

    |‖A − R̂∗ M̂_{R∗}‖|_p ≤ min_X |‖A − R∗ X‖|_p ≤ |‖A − R∗ M∗ C∗‖|_p.                 (3.7)

Similarly, since (Ĉ∗, M̂_{C∗}) is the optimal column clustering,

    |‖A − M̂_{C∗} Ĉ∗‖|_p ≤ min_X |‖A − X C∗‖|_p ≤ |‖A − R∗ M∗ C∗‖|_p.                 (3.8)

    Let us consider a particular block (I, J) ∈ Î × Ĵ. Note that (R̂ M̂_R)_{ij} = (R̂ M̂_R)_{i′j} for i, i′ ∈ I; we denote r̂_{Ij} = (R̂ M̂_R)_{ij}. Let µ̂_{IJ} be the value x that minimizes the block cost:

    µ̂_{IJ} = argmin_x ∑_{j∈J} |r̂_{Ij} − x|^p.

We also denote ĉ_{iJ} = (M̂_C Ĉ)_{ij}. Then for all i ∈ I we have

    ∑_{j∈J} |r̂_{Ij} − µ̂_{IJ}|^p ≤ ∑_{j∈J} |r̂_{Ij} − ĉ_{iJ}|^p,

which gives

    (∑_{i∈I, j∈J} |r̂_{Ij} − µ̂_{IJ}|^p)^{1/p} ≤ (∑_{i∈I, j∈J} |r̂_{Ij} − ĉ_{iJ}|^p)^{1/p}
                                              ≤ (∑_{i∈I, j∈J} |a_{ij} − r̂_{Ij}|^p)^{1/p} + (∑_{i∈I, j∈J} |a_{ij} − ĉ_{iJ}|^p)^{1/p},        (3.9)

where the last inequality is just an application of the (Minkowski) triangle inequality.
    Then we get

    |‖A − R̂ M Ĉ‖|_p
      (a)= (∑_{I,J} ‖A_{IJ} − µ_{IJ} R̂_I Ĉ_J‖_p^p)^{1/p} = (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − µ_{IJ}|^p)^{1/p}
      (b)≤ (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − µ̂_{IJ}|^p)^{1/p}
      (c)≤ (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − r̂_{Ij}|^p)^{1/p} + (∑_{I,J} ∑_{i∈I, j∈J} |r̂_{Ij} − µ̂_{IJ}|^p)^{1/p}
      (d)≤ (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − r̂_{Ij}|^p)^{1/p} + (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − r̂_{Ij}|^p)^{1/p} + (∑_{I,J} ∑_{i∈I, j∈J} |a_{ij} − ĉ_{iJ}|^p)^{1/p}
         = |‖A − R̂ M̂_R‖|_p + |‖A − R̂ M̂_R‖|_p + |‖A − M̂_C Ĉ‖|_p
      (e)≤ α (|‖A − R̂∗ M̂_{R∗}‖|_p + |‖A − R̂∗ M̂_{R∗}‖|_p + |‖A − M̂_{C∗} Ĉ∗‖|_p)
      (f)≤ 3α |‖A − R∗ M∗ C∗‖|_p,

where (a) follows from (2.5), (b) follows from (3.6), (c) from the triangle inequality, (d) from (3.9), (e) from (3.4) and (3.5), and (f) follows from (3.7) and (3.8).
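The pipeline analyzed above can be sketched in a few lines: cluster the rows, cluster the columns independently, and fill in block centers. This is a minimal illustrative sketch with p = 2, where a plain Lloyd's iteration stands in for the α-approximate one-sided clustering oracle (an assumption; the theorem only requires some α-approximation):

```python
import numpy as np

# Minimal sketch of the Co-cluster scheme (p = 2): one-sided row and column clusterings
# combined into a co-clustering with block-mean centers. Lloyd's algorithm below is a
# stand-in for the alpha-approximate one-sided clustering subroutine.
def lloyd(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(0)
    return labels

def co_cluster(A, k, l):
    rows = lloyd(A, k)        # cluster the rows of A
    cols = lloyd(A.T, l)      # cluster the columns of A
    M = np.zeros((k, l))
    for I in range(k):
        for J in range(l):
            block = A[np.ix_(rows == I, cols == J)]
            if block.size:
                M[I, J] = block.mean()   # block-mean center
    return rows, cols, M

A = np.random.default_rng(2).normal(size=(30, 20))
rows, cols, M = co_cluster(A, 3, 2)
cost = np.sqrt(((A - M[rows][:, cols]) ** 2).sum())
# Block means are a projection of A, so the residual never exceeds ||A||_F.
assert cost <= np.sqrt((A ** 2).sum())
```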


    By combining the above with Theorem 3.4 we obtain the following.

Corollary 3.7. There is a polynomial-time algorithm that returns a (k, ℓ)-co-clustering that is a 48-factor approximation to the optimum under the |‖·‖|_p norm.

3.3     A tighter analysis for the Frobenius norm
A commonly used instance of our objective function is the case of p = 2, the Frobenius norm. The results
of the previous section give us a (3 + ε)-approximation in this particular case, when k, ` are constants.
But it turns out that in this case, we can actually exploit the particular structure of the Frobenius norm
and give a better approximation factor.
     To restate the problem: we want to compute clustering matrices R ∈ R^{m×k} and C ∈ R^{ℓ×n}, with R_{i,I} = 1 if i ∈ I and 0 otherwise, and C_{J,j} = 1 if j ∈ J and 0 otherwise (see Section 2 for more details), such that ‖A − RMC‖_F is minimized, where M ∈ R^{k×ℓ} contains the block averages, that is, M = {µ_{IJ}} with

    µ_{IJ} = (1/(m_I · n_J)) ∑_{i∈I, j∈J} a_{ij},

where m_I is the size of row cluster I and n_J is the size of column cluster J. We show the following theorem.
theorem.

Theorem 3.8. Given an α-approximation algorithm for the k-means clustering problem, the algorithm Co-cluster (Algorithm 1) computes a √2·α-approximate solution to the co-clustering problem with the ‖·‖_F objective function.

Proof. Define R̄ = {r̄_{ij}} ∈ R^{m×k} similarly to R, but with the values scaled down according to the cluster sizes. Specifically, r̄_{i,I} = 1/√m_I if i ∈ I and 0 otherwise. Likewise, define C̄ = {c̄_{ij}} ∈ R^{ℓ×n}, with c̄_{J,j} = 1/√n_J if j ∈ J and 0 otherwise. Then notice that we can write RMC = R̄R̄ᵀAC̄ᵀC̄.
     If we consider also the one-sided clusterings (RM_R and M_C C) then we can also write RM_R = R̄R̄ᵀA and M_C C = AC̄ᵀC̄.
     We define P_R = R̄R̄ᵀ. Then P_R is a projection matrix. To see why this is the case, notice first that R̄ has orthonormal columns:

    (R̄ᵀR̄)_{II} = ∑_{i∈I} 1/m_I = 1,

and (R̄ᵀR̄)_{IJ} = 0 for I ≠ J; thus R̄ᵀR̄ = I_k. Therefore P_R P_R = R̄R̄ᵀR̄R̄ᵀ = R̄R̄ᵀ = P_R, hence P_R is a projection matrix. Define P_R^⊥ = (I_m − P_R), the projection orthogonal to P_R. Similarly we define the projection matrices P_C = C̄ᵀC̄ and P_C^⊥ = (I_n − P_C). In general, in the rest of the section, P_X and P_X^⊥ refer to the projection matrices that correspond to clustering matrix X.
    We can then restate the problem as finding projections of the form P_R = R̄R̄ᵀ and P_C = C̄ᵀC̄ that minimize ‖A − P_R A P_C‖²_F, under the constraint that R̄ and C̄ have the form described above.
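The identities above (R̄ᵀR̄ = I_k, P_R a projection, and RMC = P_R A P_C) admit a small deterministic check. The following is an illustrative sketch on a fixed tiny instance (the partitions chosen are arbitrary):

```python
import numpy as np

# Check: with R_bar and C_bar scaled as above, R_bar^T R_bar = I_k, P_R = R_bar R_bar^T is
# a projection (P_R P_R = P_R), and R M C = P_R A P_C when M holds the block means.
rows = np.array([0, 1, 2, 0, 1, 2])   # k = 3 row clusters, each of size 2
cols = np.array([0, 1, 0, 1, 0])      # l = 2 column clusters
A = np.random.default_rng(3).normal(size=(6, 5))

R = np.eye(3)[rows]                                   # m x k indicator matrix
C = np.eye(2)[cols].T                                 # l x n indicator matrix
Rbar = R / np.sqrt(R.sum(0, keepdims=True))           # scale columns by 1/sqrt(m_I)
Cbar = C / np.sqrt(C.sum(1, keepdims=True))           # scale rows by 1/sqrt(n_J)
PR, PC = Rbar @ Rbar.T, Cbar.T @ Cbar

M = (R.T @ A @ C.T) / np.outer(R.sum(0), C.sum(1))    # block means mu_IJ

assert np.allclose(Rbar.T @ Rbar, np.eye(3))          # orthonormal columns
assert np.allclose(PR @ PR, PR)                       # projection
assert np.allclose(R @ M @ C, PR @ A @ PC)            # RMC = P_R A P_C
```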


   Let R∗ and C∗ be the optimal co-clustering solution, R̂∗ and Ĉ∗ be the optimal one-sided clusterings, and R̂ and Ĉ be the one-sided row and column clusterings that are α-approximations to the optimal ones. We have

    ‖A − R̂ M̂_R‖²_F ≤ α² ‖A − R̂∗ M̂_{R∗}‖²_F                                          (3.10)

and

    ‖A − M̂_C Ĉ‖²_F ≤ α² ‖A − M̂_{C∗} Ĉ∗‖²_F.                                         (3.11)
      We can write

    A = P_R̂ A + P_R̂^⊥ A = P_R̂ A P_Ĉ + P_R̂ A P_Ĉ^⊥ + P_R̂^⊥ A P_Ĉ + P_R̂^⊥ A P_Ĉ^⊥,

and thus

    A − P_R̂ A P_Ĉ = P_R̂ A P_Ĉ^⊥ + P_R̂^⊥ A P_Ĉ + P_R̂^⊥ A P_Ĉ^⊥.

Then,

    ‖A − P_R̂ A P_Ĉ‖²_F = ‖P_R̂ A P_Ĉ^⊥ + P_R̂^⊥ A P_Ĉ + P_R̂^⊥ A P_Ĉ^⊥‖²_F
                        = ‖P_R̂ A P_Ĉ^⊥ + P_R̂^⊥ (A P_Ĉ + A P_Ĉ^⊥)‖²_F
                     (a)= ‖P_R̂ A P_Ĉ^⊥‖²_F + ‖P_R̂^⊥ (A P_Ĉ + A P_Ĉ^⊥)‖²_F
                        = ‖P_R̂ A P_Ĉ^⊥‖²_F + ‖P_R̂^⊥ A P_Ĉ + P_R̂^⊥ A P_Ĉ^⊥‖²_F
                     (b)= ‖P_R̂ A P_Ĉ^⊥‖²_F + ‖P_R̂^⊥ A P_Ĉ‖²_F + ‖P_R̂^⊥ A P_Ĉ^⊥‖²_F,

where (a) follows from the Pythagorean theorem (we apply it to every column separately, and the square of the Frobenius norm is just the sum of the squared column lengths) and the fact that the projection matrices P_R̂ and P_R̂^⊥ are orthogonal to each other, and (b) again from the Pythagorean theorem and the orthogonality of P_Ĉ and P_Ĉ^⊥.
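The orthogonal decomposition above can be verified numerically. This is an illustrative check on a small fixed instance (the row/column partitions are arbitrary):

```python
import numpy as np

# Numerical check of the Pythagorean decomposition:
# ||A - P_R A P_C||_F^2 = ||P_R A P_C_perp||^2 + ||P_R_perp A P_C||^2 + ||P_R_perp A P_C_perp||^2.
rows = np.array([0, 1, 2, 0, 1, 2])   # 3 row clusters
cols = np.array([0, 1, 0, 1, 0])      # 2 column clusters
A = np.random.default_rng(4).normal(size=(6, 5))

Rbar = np.eye(3)[rows] / np.sqrt(np.bincount(rows))[rows][:, None]
Cbar = (np.eye(2)[cols] / np.sqrt(np.bincount(cols))[cols][:, None]).T
PR, PC = Rbar @ Rbar.T, Cbar.T @ Cbar
PRp, PCp = np.eye(6) - PR, np.eye(5) - PC

lhs = np.linalg.norm(A - PR @ A @ PC) ** 2
rhs = (np.linalg.norm(PR @ A @ PCp) ** 2
       + np.linalg.norm(PRp @ A @ PC) ** 2
       + np.linalg.norm(PRp @ A @ PCp) ** 2)
assert np.isclose(lhs, rhs)
```

The three residual pieces are pairwise orthogonal under the trace inner product (each pair shares either P_R̂ with P_R̂^⊥ or P_Ĉ with P_Ĉ^⊥, whose product vanishes), which is exactly why the squared norms add.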
                                                                  2                   2
   Without loss of generality we assume that PR̂ APĈ⊥ ≥ PR̂⊥ APĈ (otherwise we can consider AT ).
                                                       F           F
Then,
                                                           
                           2               2              2
                                        ⊥         ⊥     ⊥
              A − PR̂ APĈ F ≤ 2 PR̂ APĈ    + PR̂ APĈ
                                           F              F
                                                                                         (3.12)
                                                      2            2              2
                                        ⊥     ⊥    ⊥             ⊥
                             = 2 PR̂ APĈ + PR̂ APĈ      = 2 APĈ   = 2 A − APĈ F ,
                                                                 F                    F

where the first equality follows once again from the Pythagorean theorem. By combining (3.11) and (3.12)
we get
                                          2               2                  2
                             A − PR̂ APĈ F ≤ 2 A − APĈ F ≤ 2α 2 A − APĈ∗ F .                   (3.13)
    By (3.8) we know that the error of the optimal one-sided clustering is bounded by the error of the
optimal co-clustering. Considering (3.8) for p = 2, taking the square, and rewriting with the notation of
this section we have that
                                               2
                                    A − APĈ∗ F ≤ kA − PR∗ APC∗ k2F .                              (3.14)


      Combining (3.13) and (3.14) gives

    ‖A − P_R̂ A P_Ĉ‖²_F ≤ 2α² ‖A − P_R∗ A P_C∗‖²_F.

   Thus we obtain a √2·α-approximation to the optimal co-clustering solution under the Frobenius norm.

      We can now use Theorem 3.1 and obtain the following corollary.

Corollary 3.9. For any fixed values of k, ℓ and ε > 0 there is a polynomial-time algorithm that returns a (k, ℓ)-co-clustering that is a (√2 + ε)-approximation to the optimum, under the Frobenius norm.

   We also improve slightly the approximation ratio when k and ℓ are part of the input by using Theorem 3.4:

Corollary 3.10. There is a polynomial-time algorithm that returns a (k, ℓ)-co-clustering that is a 32-factor approximation to the optimum under the Frobenius norm.


3.4     Solving (k, 1)-co-clustering
In this section we show how to solve the problem exactly in the case that we want only one column cluster (note that this is different from one-sided clustering; the latter is equivalent to having n column clusters). While this case is not of significant practical interest, we include it for completeness and to show that even this case can be nontrivial. In particular, while we can solve the problem exactly under the Frobenius norm, it is not clear whether we can solve it for all the norms of the form (2.3), even for p = 1. To summarize, one can consider the following cases of (k, 1)-co-clustering:

      • The problem of (k, 1)-co-clustering for a matrix of dimension m × 1, for any p ≥ 1. This case is
        known to be solvable in polynomial time using dynamic programming and we present the algorithm
        and the analysis in Lemma 3.11 for completeness.

      • The problem of (k, 1)-co-clustering for a matrix of dimension m × n, for any n > 1 and for p = 2. In
        Theorem 3.12 we show how we can solve this problem in polynomial time based on Lemma 3.11.

      • The problem of (k, 1)-co-clustering for a matrix of dimension m × n, for any n > 1 and for p 6= 2.
        It remains open whether this case can be solved exactly in polynomial time.

    We begin with the first case (A ∈ R^{m×1}). Then the problem is easy for any norm of the form (2.3). We state the following simple result.

Lemma 3.11. Let A ∈ R^{m×1} and consider any norm |‖·‖|_p. There is an algorithm that can (k, 1)-cluster matrix A optimally in time O(m²k) and space O(mk).


Proof. The proof is based on Brucker's algorithm for one-dimensional clustering [5]; we present it for completeness. Since A is just a set of real values, (k, 1)-clustering A corresponds to partitioning those values into k clusters. The proof rests on the following fact: if a cluster in the optimal solution contains the values a_i and a_j, then it must also contain all the values in between. The proof is by contradiction: assume that there are two clusters whose value ranges overlap. Then we can replace these two clusters with two clusters whose value ranges are separated, and the cost does not increase.
    This fact implies that we can solve the problem using dynamic programming. Assume that the sorted values of A are {a_1, a_2, . . . , a_m}. We define C(i, r) to be the cost of the optimal r-clustering of {a_1, . . . , a_i}. Then, knowing C(j, r − 1) for all j ≤ i allows us to compute C(i, r) by considering all possible positions at which the rth cluster can begin:

    C(i, r) = min_{j∈{r, r+1, ..., i}} {C(j − 1, r − 1) + c(j, i)},

where c(j, i) is the cost of the single cluster containing the values {a_j, a_{j+1}, . . . , a_i}.
    We need to compute O(mk) values C(i, r), so the space needed is O(mk), and computing each of them requires O(m) time, so the total time required is O(m²k).
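The dynamic program above can be transcribed directly. The sketch below handles the squared-error case p = 2, where the optimal single-cluster center is the mean; it recomputes c(j, i) naively, which costs an extra O(m) factor, and maintaining prefix sums would recover the O(m²k) bound stated in the lemma:

```python
# Dynamic program for (k, 1)-clustering of real values (p = 2 case; the single-cluster
# cost c(j, i) uses the mean, which is optimal for squared error).
def cluster_1d(values, k):
    a = sorted(values)
    m = len(a)

    def c(j, i):                       # cost of one cluster on a[j..i] (inclusive)
        seg = a[j:i + 1]
        mu = sum(seg) / len(seg)
        return sum((x - mu) ** 2 for x in seg)

    INF = float("inf")
    # C[i][r] = optimal cost of clustering the first i sorted values into r clusters
    C = [[INF] * (k + 1) for _ in range(m + 1)]
    C[0][0] = 0.0
    for i in range(1, m + 1):
        for r in range(1, min(k, i) + 1):
            C[i][r] = min(C[j][r - 1] + c(j, i - 1) for j in range(r - 1, i))
    return C[m][k]

# Two well-separated groups are split with zero squared error.
print(cluster_1d([1, 1, 1, 9, 9], 2))  # -> 0.0
```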


   We now use this lemma to solve the problem optimally for general A, under the Frobenius norm. The algorithm is simple. Assume that A = {a_ij} and let

    µ_i = (1/n) ∑_{j=1}^n a_ij

be the mean of row i. Also write a_ij = µ_i + ε_ij, and note that for all i we have ∑_{j=1}^n ε_ij = 0. The algorithm then runs the dynamic-programming algorithm on the vector of the means and returns the clustering produced.

Algorithm 2: Co-cluster-DP(A, k)
Require: Matrix A ∈ R^{m×n}, number of row clusters k.
 1: Create the vector v = (µ_1, µ_2, . . . , µ_m), where µ_i = (1/n) ∑_{j=1}^n a_ij.
 2: Run the dynamic-programming algorithm of Lemma 3.11 on v and let I be the resulting k-clustering.
 3: return (I, {1, . . . , n}).



Theorem 3.12. Let A ∈ Rm×n . Let I be the clustering produced under the k·kF norm by Algorithm 2.
Then I has optimal cost. The running time of the algorithm is O(mn + m2 k).

Proof. Consider the cost of a given cluster I, with |I| = m_I rows. The mean µ_I of the cluster equals

    µ_I = (1/(m_I n)) ∑_{i∈I} ∑_{j=1}^n a_ij = (1/m_I) ∑_{i∈I} µ_i.


The cost of the cluster is

    ∑_{i∈I} ∑_{j=1}^n (a_ij − µ_I)² = ∑_{i∈I} ∑_{j=1}^n a_ij² + m_I n µ_I² − 2µ_I ∑_{i∈I} ∑_{j=1}^n a_ij
                                    = ∑_{i∈I} ∑_{j=1}^n (µ_i + ε_ij)² + m_I n µ_I² − 2µ_I² m_I n
                                    = ∑_{i∈I} ∑_{j=1}^n µ_i² + ∑_{i∈I} ∑_{j=1}^n ε_ij² + 2 ∑_{i∈I} µ_i ∑_{j=1}^n ε_ij − m_I n µ_I²
                                    = n ∑_{i∈I} µ_i² + ∑_{i∈I} ∑_{j=1}^n ε_ij² − m_I n µ_I²,

since ∑_{j=1}^n ε_ij = 0 for all i, as we mentioned previously.
    Therefore, the cost of the entire clustering I = {I_1, . . . , I_k} equals

    n ∑_{i=1}^m µ_i² + ∑_{i=1}^m ∑_{j=1}^n ε_ij² − n ∑_{I∈I} m_I µ_I².                (3.15)

    Consider now the one-dimensional problem of (k, 1)-clustering only the row means µ_i. The cost of a given cluster I is

    ∑_{i∈I} (µ_i − µ_I)² = ∑_{i∈I} µ_i² + m_I µ_I² − 2µ_I ∑_{i∈I} µ_i = ∑_{i∈I} µ_i² − m_I µ_I².

Thus the cost of the clustering is

    ∑_{i=1}^m µ_i² − ∑_{I∈I} m_I µ_I².
Compare the cost of this clustering with that of (3.15). In both cases the optimal row clustering is the one that maximizes the term ∑_{I∈I} m_I µ_I², as all the other terms are independent of the clustering. Thus we can optimally solve the problem for A ∈ R^{m×n} by solving the problem on the means vector alone. The time needed to create the vector of means is O(mn), and by applying Lemma 3.11 we conclude that we can solve the problem in time O(mn + m²k).
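The reduction to the means vector can be checked by brute force on a tiny instance. This is an illustrative check (exhaustive search over all row partitions, compared against the splits that are contiguous in the sorted row means, which is the form the optimal solution takes by the argument above):

```python
import itertools
import numpy as np

# Brute-force check of Theorem 3.12 on a tiny instance: the optimal (k, 1) Frobenius
# co-clustering cost is attained by a partition contiguous in the sorted row means.
rng = np.random.default_rng(5)
A = rng.normal(size=(6, 4))
k = 2

def cost(labels):
    total = 0.0
    for c in set(labels):
        block = A[[i for i, l in enumerate(labels) if l == c]]
        total += ((block - block.mean()) ** 2).sum()   # one column cluster: block mean
    return total

best_any = min(cost(p) for p in itertools.product(range(k), repeat=len(A)))

order = np.argsort(A.mean(axis=1))        # rows sorted by row mean
best_contiguous = float("inf")
for s in range(1, len(A)):                # all contiguous splits of the sorted means
    labels = [0] * len(A)
    for idx in order[s:]:
        labels[idx] = 1
    best_contiguous = min(best_contiguous, cost(tuple(labels)))

assert np.isclose(best_any, best_contiguous)
```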


4    Hardness of the objective function
In this section we show that the problem of co-clustering an m × n matrix A is NP-hard when the number of clusters on the column side is at least n^ε for any fixed 0 < ε < 1. While there are several results in the literature that show hardness of similar problems [31, 18, 4, 29], we are not aware of any previous result that proves the hardness of co-clustering for the objectives that we study in this paper.

Theorem 4.1. The problem of finding an optimal (k, ℓ)-co-clustering under the ℓ_1 norm for a matrix A ∈ R^{m×n} is NP-hard for (k, ℓ) = (k, n^ε), for any k ≥ 2 and any fixed 0 < ε < 1.
Proof. The proof proceeds in several steps. First we reduce the one-sided s-median problem (where s = n/3 + o(n)) under the ℓ_1 norm to (2, n/3 + o(n))-co-clustering for A ∈ R^{2×n}. We then reduce the latter problem to the case of A ∈ R^{m×n} and (k, n/3 + o(n)), and this, finally, to the case of (k, n^ε)-co-clustering. We now proceed with the details.


Hardness of (2, n/3 + o(n)) Megiddo and Supowit [27] show that the (one-sided) continuous (i. e., cluster centers can be arbitrary points in space) p-median problem is NP-hard under the ℓ_1 norm in R². By looking carefully at the pertinent proof we can see that the problem is hard even if we restrict the number of clusters to be n/3 + o(n); here, n is the number of points. Let us assume that we have such a problem instance of n points {b_j = (b_{j1}, b_{j2}) : j = 1, . . . , n} that we want to assign to ℓ clusters, ℓ = n/3 + o(n), so as to minimize the ℓ_1 cost. Specifically, we want to compute a partition J = {J_1, . . . , J_ℓ} of {1, . . . , n} and points µ_1, . . . , µ_ℓ ∈ R² such that the objective

    ∑_{J∈J} ∑_{j∈J} ‖b_j − µ_J‖_1 = ∑_{J∈J} ∑_{j∈J} (|b_{j1} − µ_{J1}| + |b_{j2} − µ_{J2}|)        (4.1)

is minimized.
    We construct a co-clustering instance by constructing the matrix A with A_ij = b_{ji}, for i = 1, 2 and j = 1, . . . , n:

    A = ( A_11  A_12  · · ·  A_1n )  =  ( b_11  b_21  · · ·  b_n1 )
        ( A_21  A_22  · · ·  A_2n )     ( b_12  b_22  · · ·  b_n2 ),

which we want to (2, ℓ)-co-cluster. Solving this problem is equivalent to solving the one-sided clustering problem. To spell out the details: there is only one row clustering, I = {{1}, {2}} (strictly speaking, there is also the clustering {{1, 2}, {}}, which has higher cost unless b_{j1} = b_{j2} for all j); consider the column clustering J = {J_1, . . . , J_ℓ} and the corresponding center matrix M ∈ R^{2×ℓ}. The cost of the solution equals

    ∑_{I,J} ∑_{i∈I, j∈J} |A_ij − M_IJ| = ∑_{J∈J} ∑_{j∈J} (|b_{j1} − M_{1J}| + |b_{j2} − M_{2J}|).        (4.2)

Note that this expression is minimized when the point (M_{1J}, M_{2J}) is the median of the points b_j = (b_{j1}, b_{j2}), j ∈ J, in which case the cost equals that of (4.1) if the partition J is the same. Thus a solution to the co-clustering problem induces a solution to the one-sided problem. Therefore, solving the (2, ℓ)-co-clustering problem in R^{2×n} is NP-hard.
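The equivalence between (4.1) and (4.2) rests on the ℓ_1 block cost decomposing coordinatewise, with per-row medians as optimal centers. A small illustrative check (arbitrary fixed column partition, random points):

```python
import numpy as np

# Check: for a fixed column partition of the 2 x n matrix A (rows = coordinates of the
# points b_j), the optimal entrywise-l1 block cost equals the planar l1 (median)
# clustering cost of the points under the same partition.
rng = np.random.default_rng(6)
b = rng.normal(size=(9, 2))                    # points b_j in R^2
A = b.T                                        # the 2 x n co-clustering instance
cols = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])   # an arbitrary column partition, l = 3

cocluster_cost = 0.0
median_cost = 0.0
for J in range(3):
    block = A[:, cols == J]                        # 2 x |J| block
    M = np.median(block, axis=1, keepdims=True)    # optimal l1 center, per row
    cocluster_cost += np.abs(block - M).sum()
    mu = np.median(b[cols == J], axis=0)           # planar l1 median of the points in J
    median_cost += np.abs(b[cols == J] - mu).sum()

assert np.isclose(cocluster_cost, median_cost)
```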

Hardness of (k, n/3 + o(n)), k > 2 The next step is to show that it is NP-hard to (k, ℓ)-co-cluster any matrix of dimensions m × n for any m, n, k > 2 and ℓ = n/3 + o(n). To prove this we use the previous hardness result for (2, ℓ)-co-clustering in R^{2×n}. We start with an instance of the (2, ℓ)-co-clustering problem in R^{2×n} given by a matrix A as described previously, and we create a matrix Ã = {Ã_ij} ∈ R^{m×n} by shifting one of the two rows by B and appending to A m − 2 rows with all entries equal to 4B, where B is sufficiently large (say B > 8m max{|A_ij|}):

    Ã = ( A_11       A_12       · · ·   A_1n      )
        ( A_21 + B   A_22 + B   · · ·   A_2n + B  )
        ( 4B         4B         · · ·   4B        )
        ( ...        ...        ...     ...       )
        ( 4B         4B         · · ·   4B        ).
Indeed, we can achieve a solution with the same cost as (4.2) by using the same column partitioning J and a
row partitioning that puts each of rows 1 and 2 into its own cluster and clusters the rest of the rows (where
all the values equal 4B) arbitrarily. Notice that this is an optimal solution, since any other row clustering
has one of the following forms: either it puts rows 1 and 2 in the same cluster, or it puts row 1 or row 2 in
the same cluster as some of the other rows. In the first case, let I with 1, 2 ∈ I be the row cluster and J be
any column cluster. Then, if M_IJ is the median value of the entries in the cluster, we have that M_IJ > B/2,
since half of the entries are at least B. Thus, there will be a cell (1, j), j ∈ J, for which the residue will be

                              |A_1j − M_IJ| ≥ M_IJ − A_1j > B/2 − max{A_ij} > B/4,

since B > 4 · max{A_ij}. In the second case, since half of the elements of the cluster have value at least 4B,
the median is at least 2B; therefore there is a cell Ã_ij, with i ∈ {1, 2} and j ∈ J, whose residue is at least

                              |Ã_ij − M_IJ| ≥ M_IJ − Ã_ij ≥ 2B − B − max{A_ij} > B/4 .

Thus, in both cases, we incur a penalty of at least B/4, which is larger than the cost (4.2), for
B > 8m · max{A_ij}.
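The argument above can be checked numerically. The sketch below (plain Python; the random instance, the dimensions, and the column partition are arbitrary illustrations) builds Ã from a 2 × n matrix, verifies that the intended row partition {{1}, {2}, rest} preserves the (2, ℓ) cost, and that merging row 1 with a 4B-row incurs a penalty far exceeding B/4:

```python
import random
import statistics

def cocluster_cost(A, row_part, col_part):
    """l1 co-clustering cost with block medians as centers."""
    cost = 0.0
    for I in row_part:
        for J in col_part:
            block = [A[i][j] for i in I for j in J]
            center = statistics.median(block)
            cost += sum(abs(x - center) for x in block)
    return cost

random.seed(1)
m, n = 6, 12
A2 = [[random.random() for _ in range(n)] for _ in range(2)]    # the 2 x n instance
B = 8 * m * max(max(row) for row in A2)                         # B > 8m * max{A_ij}
# row 2 shifted by B, then m - 2 rows of all 4B appended
At = [A2[0][:], [x + B for x in A2[1]]] + [[4 * B] * n for _ in range(m - 2)]

cols = [list(range(0, 5)), list(range(5, 12))]                  # some (2, l) column partition
# intended row partition: rows 1 and 2 alone, the 4B-rows together
intended = cocluster_cost(At, [[0], [1], list(range(2, m))], cols)
assert abs(intended - cocluster_cost(A2, [[0], [1]], cols)) < 1e-9
# merging row 1 with a 4B-row is far costlier than the intended solution
merged = cocluster_cost(At, [[0, 2], [1], list(range(3, m))], cols)
assert merged > intended + B / 4
```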


Hardness of (k, nε)   The final step is to reduce a problem instance of finding a (k, ℓ′)-co-clustering of
a matrix A′ ∈ R^{m×n′}, with ℓ′ = n′/3 + o(n′), to a problem instance of finding a (k, ℓ)-co-clustering of a
matrix A ∈ R^{m×n}, with ℓ = n^ε, for any fixed 0 < ε < 1.
    The construction is similar to the previous one. Let A′ = {A′_ij}. Define n = (ℓ′ + 1)^{1/ε} and let
A ∈ R^{m×n}. For n′ > 3^{1/(1−ε)} we have n > n′. Furthermore, n = O(n′^{1/ε}), hence the reduction runs in
polynomial time. For 1 ≤ j ≤ n′, define A_ij = A′_ij, and for any j > n′, define A_ij = B′, where B′ is some
sufficiently large value (e. g., B′ > 4mn · max|A′_ij|):
                                   0
                                    A11 A012 · · · A01n0 B0 B0 · · · B0
                                                                                
                                   A0       0           0        0    0       0
                                   21 A22 · · · A2n0 B B · · · B 
                              A= .         ..   ..      ..    ..   .. . .    . .
                                   ..       .      .     .     .    .     . .. 
                                       A0m1 A0m2 · · ·      A0mn0    B0 B0 · · ·    B0

Now we only need to prove that the optimal solution of a (k, ℓ′ + 1) = (k, n^ε)-co-clustering of A corre-
sponds to the optimal solution of the (k, ℓ′)-co-clustering of A′.
     Assume that the optimal solution for matrix A′ is given by the partitions I′ = {I′_1, ..., I′_k} and
J′ = {J′_1, ..., J′_ℓ′}. The cost of the solution is

                              C′(I′, J′) = ∑_{I∈I′, J∈J′} ∑_{i∈I, j∈J} |A′_ij − M′_IJ| ,

where M′_IJ is defined as the median of the values {A′_ij : i ∈ I, j ∈ J}.
    Let us compute the optimal solution for the (k, ℓ′ + 1)-co-clustering of A. First note that we can
compute a solution (I, J) with cost C′(I′, J′). We let I = I′, and for J = {J_1, ..., J_{ℓ′+1}} we set J_j = J′_j for
j ≤ ℓ′, and J_{ℓ′+1} = {n′ + 1, n′ + 2, ..., n}. For the matrix M we have M_{IJ_j} = M′_{IJ′_j} for j ≤ ℓ′ and
M_{IJ_{ℓ′+1}} = B′.



The cost C(I, J) of the co-clustering equals

         C(I, J) = ∑_{I∈I, J∈J} ∑_{i∈I, j∈J} |A_ij − M_IJ|
                 = ∑_{I∈I, J∈J′} ∑_{i∈I, j∈J} |A_ij − M_IJ| + ∑_{I∈I} ∑_{i∈I, j∈J_{ℓ′+1}} |A_ij − M_IJ|
                 = ∑_{I∈I′, J∈J′} ∑_{i∈I, j∈J} |A′_ij − M′_IJ| + ∑_{I∈I} ∑_{i∈I, j∈J_{ℓ′+1}} |B′ − B′|
                 = C′(I′, J′) .
     Now we have to show that the optimal solution to the co-clustering problem has to have the above
structure; that is, if J = {J_1, J_2, ..., J_{ℓ′+1}} are the column clusters, then it has to be the case that, modulo
a permutation of cluster indices, J_j = J′_j for j ≤ ℓ′, J_{ℓ′+1} = {n′ + 1, ..., n}, and I = I′. Suppose not;
then we consider two cases. The first is that there exists a column A_{*j} for j > n′ that is put into the same
cluster (say cluster J) as a column A_{*y} for y ≤ n′. In this case we show that the resulting co-clustering
cost will be much more than C′(I′, J′). To show this, consider, for instance, the error from the two coordinates
A_1j and A_1y. The value of the center for this row is some M_1J = x. Now, if x > B′/2, then,
since (by the assumption on the value of B′) A_1y < B′/4, we have that |A_1y − x| > B′/4 > C′(I′, J′). On
the other hand, if x ≤ B′/2, then A_1j − x > B′/4 > C′(I′, J′). Thus the cost of this solution is much larger
than the cost of the optimal solution.
     We have now established that no column j > n′ is placed in the same cluster as a column y ≤ n′. Now, note
that since the columns {n′ + 1, ..., n} are clustered amongst themselves, their contribution to the total
cost is 0, irrespective of the number of column clusters used. Suppose we use ℓ″ ≤ ℓ′ column clusters
for the columns {1, ..., n′}, and ℓ′ + 1 − ℓ″ clusters for {n′ + 1, ..., n}. Then the total cost equals
the cost of an optimal (k, ℓ″)-co-clustering, ℓ″ ≤ ℓ′, of the submatrix A[1, ..., m; 1, ..., n′]. This cost is
minimized by using ℓ″ = ℓ′. Thus, the solution (I, J) is optimal.
     Note that ℓ′ + 1 = n^ε. Thus, solving the (k, ℓ′ + 1) = (k, n^ε)-co-clustering problem on the new matrix
gives us a solution to the original (k, ℓ′)-co-clustering problem, and hence to the underlying k-median
instance. Therefore the (k, ℓ)-co-clustering problem under the ℓ1 norm is NP-hard, for any k > 1 and ℓ = n^ε.
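As a sanity check on this final padding step, the sketch below (plain Python; the dimensions, the partitions, and the value B′ are arbitrary stand-ins for the quantities in the reduction) verifies that lifting a (k, ℓ′) solution of A′ by one extra column cluster holding the B′-columns leaves the ℓ1 cost unchanged:

```python
import random
import statistics

def cocluster_cost(A, row_part, col_part):
    """l1 co-clustering cost with block medians as centers."""
    cost = 0.0
    for I in row_part:
        for J in col_part:
            block = [A[i][j] for i in I for j in J]
            center = statistics.median(block)
            cost += sum(abs(x - center) for x in block)
    return cost

random.seed(2)
m, n1 = 4, 9                                    # n1 plays the role of n'
Ap = [[random.random() for _ in range(n1)] for _ in range(m)]   # the matrix A'
n = 15                                          # padded width (stands in for (l'+1)^(1/eps))
Bp = 1000.0                                     # some B' > 4mn * max|A'_ij| (entries are < 1)
A = [row + [Bp] * (n - n1) for row in Ap]       # pad with B'-columns

rows = [[0, 1], [2, 3]]                         # some row partition I'
cols_p = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]      # some column partition J' (l' = 3)
cols = cols_p + [list(range(n1, n))]            # lifted partition: one extra cluster
lifted = cocluster_cost(A, rows, cols)
original = cocluster_cost(Ap, rows, cols_p)
assert abs(lifted - original) < 1e-9            # the B'-columns contribute zero cost
```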

    Observe that while we showed hardness for the ℓ1 norm, our reduction derives hardness of co-clustering
from hardness of the corresponding one-sided clustering problem, and thus applies to other norms as well.
The following corollary about co-clustering under the Frobenius norm is immediate.
Corollary 4.2. For any fixed ε > 0, any k > 1, and ℓ > n^ε, finding a (k, ℓ)-co-clustering that is optimal
under the Frobenius norm is NP-hard.
    The proof follows from the above reduction and from the hardness of the k-means objective shown in [11, 9, 8].


5    Discussion and future work
In this paper we consider the problem of co-clustering. We obtain the first algorithms for this problem with
provable performance guarantees. Our algorithms are simple and achieve constant-factor approximations


with respect to the optimum. We also show that the co-clustering problem is NP-hard for a wide range of
the input parameters. Finally, as a byproduct, we introduce the k-means_p problem, which generalizes the
k-median and k-means problems, and give a constant-factor approximation algorithm for it.
    Our work leads to several interesting questions. In Section 4 we showed that the co-clustering problem
is NP-hard if ℓ = Ω(n^ε) under the ℓ1 and ℓ2 norms. The restriction on ℓ arises from the fact that the
base result on k-median hardness by Megiddo and Supowit [27] only works for Ω(n) clusters. Recent
results by Dasgupta [9] and Aloise et al. [2] showed that k-means is hard for k = 2 when the number
of dimensions is part of the input. Mahajan et al. [26] and Vattani [33] showed hardness for constant
dimension and k being part of the input. Similar results for k-median remain unresolved. For co-clustering,
even the hardness questions for the (2, 2) or the (O(1), O(1)) cases are, as far as we know, unresolved.
While we conjecture that these cases are hard, we do not yet have a proof. It would also be interesting
to extend the hardness results to any ℓp norm, p ≥ 1; relevant here is the hardness result for
k-center [14]. Similarly, it would be interesting to see if we can obtain a PTAS for co-clustering for
some norms.
    Another question is whether the problem becomes easy for matrices A that have a particular structure.
For instance, if A is symmetric and k = ℓ, is it the case that the optimal co-clustering is also symmetric?
The answer turns out to be negative, even if we are restricted to 0/1-matrices, and the counterexample
reveals some of the difficulty in co-clustering. Consider the matrix

                                       ⎡ 1 1 0 ⎤
                                  A =  ⎢ 1 1 1 ⎥ .
                                       ⎣ 0 1 1 ⎦

We are interested in a (2, 2)-co-clustering, say using ‖·‖_F. There are three symmetric solutions,
I = J = {{1, 2}, {3}}, I = J = {{2, 3}, {1}}, and I = J = {{1, 3}, {2}}, and all have a cost of 1. Instead,
the nonsymmetric solution (I, J) = ({{1}, {2, 3}}, {{1, 2}, {3}}) has cost √3/2. Therefore, even for
symmetric matrices, one-sided clustering cannot be used to obtain the optimal co-clustering.
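This small computation is easy to verify mechanically. The following sketch (plain Python) evaluates the Frobenius cost, with block means as centers, of the three symmetric (2, 2)-co-clusterings and of the nonsymmetric one:

```python
import math

def frob_cost(A, row_part, col_part):
    """Frobenius cost: sqrt of the total squared residual, block means as centers."""
    sq = 0.0
    for I in row_part:
        for J in col_part:
            block = [A[i][j] for i in I for j in J]
            mu = sum(block) / len(block)
            sq += sum((x - mu) ** 2 for x in block)
    return math.sqrt(sq)

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# the three symmetric (2,2)-co-clusterings (0-indexed) all have cost 1
symmetric = [[[0, 1], [2]], [[1, 2], [0]], [[0, 2], [1]]]
for P in symmetric:
    assert abs(frob_cost(A, P, P) - 1.0) < 1e-9

# the nonsymmetric solution is strictly better: cost sqrt(3)/2
nonsym = frob_cost(A, [[0], [1, 2]], [[0, 1], [2]])
assert abs(nonsym - math.sqrt(3) / 2) < 1e-9
```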
     Another interesting direction is to find approximation algorithms for other commonly used objective
functions for the co-clustering problem. It appears that our techniques cannot be directly applied to any
of those. As we mentioned before, the work by Banerjee et al. [3] unifies a number of such objectives and
gives an expectation-maximization-style heuristic for such merit functions. It would be interesting to see
whether, given an approximation algorithm for solving the clustering problem under a Bregman divergence,
we can construct a co-clustering approximation algorithm from it. Another objective function for which our
approach is not immediately applicable is (2.3) using the residual definition of (2.2). For several problem
domains this class of objective functions might be more appropriate than the one that we analyze here.



6    Acknowledgments

We would like to thank the anonymous reviewers of this article, whose constructive comments significantly
improved the presentation of our results.


References

 [1] Deepak Agarwal and Srujana Merugu: Predictive discrete latent factor models for large scale dyadic
     data. In Proc. 13th Internat. Conf. on Knowledge Discovery and Data Mining (KDD'07), pp. 26–35,
     2007. [doi:10.1145/1281192.1281199]

 [2] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat: NP-hardness of Euclidean
     sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009. [doi:10.1007/s10994-009-5103-0]

 [3] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu, and Dharmendra S. Modha:
     A generalized maximum entropy approach to Bregman co-clustering and matrix approximation.
     Journal of Machine Learning Research, 8:1919–1986, 2007. Preliminary version in KDD'04.
     [ACM:1314498.1314563]

 [4] Nikhil Bansal, Avrim Blum, and Shuchi Chawla: Correlation clustering. Machine Learning,
     56(1–3):89–113, 2004. Preliminary version in FOCS'02. [doi:10.1023/B:MACH.0000033116.57574.95]

 [5] Peter Brucker: On the complexity of clustering problems. In Optimization and Operations
     Research: Proc. of a Workshop held at the University of Bonn, pp. 45–54. Springer, 1977.

 [6] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys: A constant-factor approximation
     algorithm for the k-median problem. J. Comput. System Sci., 65(1):129–149, 2002. Preliminary
     version in STOC'99. [doi:10.1006/jcss.2002.1882]

 [7] Yizong Cheng and George M. Church: Biclustering of expression data. In Proc. 8th Internat.
     Conf. on Intelligent Systems for Molecular Biology (ISMB'00), pp. 93–103. AAAI Press, 2000.
     [ACM:645635.660833]

 [8] Hyuk Cho, Inderjit S. Dhillon, Yuqiang Guan, and Suvrit Sra: Minimum sum-squared residue
     co-clustering of gene expression data. In Proc. 4th SIAM Internat. Conf. on Data Mining (SDM'04),
     pp. 114–125. SIAM, 2004. Accessible at
     http://siam.org/proceedings/datamining/2004/dm04_011choh.pdf.

 [9] Sanjoy Dasgupta: The hardness of k-means clustering. Technical Report CS2008-0916, University
     of California, San Diego, 2008. Accessible at
     http://csetechrep.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2008-0916.

[10] Inderjit S. Dhillon: Co-clustering documents and words using bipartite spectral graph partitioning.
     In Proc. 7th Internat. Conf. on Knowledge Discovery and Data Mining (KDD'01), pp. 269–274.
     ACM Press, 2001. [doi:10.1145/502512.502550]

[11] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay: Clustering large
     graphs via the singular value decomposition. Machine Learning, 56(1–3):9–33, 2004. Preliminary
     version in SODA'99. [doi:10.1023/B:MACH.0000033113.59016.96]

[12] Richard O. Duda, Peter E. Hart, and David G. Stork: Pattern Classification. Wiley Interscience,
     2nd edition, 2000.

[13] Uriel Feige and Shimon Kogan: Hardness of approximation of the balanced complete bipartite
     subgraph problem. Technical Report MCS04-04, Department of Computer Science and Applied
     Math., The Weizmann Institute of Science, 2004. Accessible at
     http://www.wisdom.weizmann.ac.il/math/reports/years/2004.shtml.

[14] Robert J. Fowler, Mike Paterson, and Steven L. Tanimoto: Optimal packing and covering in the
     plane are NP-complete. Inform. Process. Lett., 12(3):133–137, 1981. [doi:10.1016/0020-0190(81)90111-3]

[15] Bin Gao, Tie-Yan Liu, Xin Zheng, Qian-Sheng Cheng, and Wei-Ying Ma: Consistent bipartite
     graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proc. 11th
     Internat. Conf. on Knowledge Discovery and Data Mining (KDD'05), pp. 41–50, 2005.
     [doi:10.1145/1081870.1081879]

[16] Sreenivas Gollapudi, Ravi Kumar, and D. Sivakumar: Programmable clustering. In Proc. 25th Symp.
     on Principles of Database Systems (PODS'06), pp. 348–354, 2006. [doi:10.1145/1142351.1142400]

[17] John A. Hartigan: Direct clustering of a data matrix. J. Amer. Stat. Assoc., 67(337):123–129, 1972.
     [doi:10.1080/01621459.1972.10481214]

[18] Saeed Hassanpour: Computational complexity of bi-clustering. Master's thesis, University of
     Waterloo, 2007. Accessible at http://uwspace.uwaterloo.ca/bitstream/10012/3359/1/Thesis.pdf.

[19] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn: Data clustering: A review. ACM Comput.
     Surv., 31(3):264–323, 1999. [doi:10.1145/331499.331504]

[20] Michel Jambu and Marie-Odile Lebeaux: Cluster Analysis and Data Analysis. Elsevier Science
     Ltd., 1983.

[21] Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee: Approximation algorithms for tensor clustering.
     In Proc. 20th Internat. Conf. on Algorithmic Learning Theory (ALT'09), pp. 368–383. Springer,
     2009. [doi:10.1007/978-3-642-04414-4_30]

[22] Jon Kleinberg: An impossibility theorem for clustering. In Proc. Advances in Neural Information
     Processing Systems (NIPS'02), pp. 446–453. MIT Press, 2002.

[23] Yuval Kluger, Ronen Basri, Joseph T. Chang, and Mark Gerstein: Spectral biclustering of microarray
     data: Coclustering genes and conditions. Genome Research, 13:703–716, 2003. [doi:10.1101/gr.648603]

[24] Amit Kumar, Yogish Sabharwal, and Sandeep Sen: A simple linear time (1 + ε)-approximation
     algorithm for k-means clustering in any dimensions. In Proc. 45th FOCS, pp. 454–462. IEEE
     Comp. Soc. Press, 2004. [doi:10.1109/FOCS.2004.7]

[25] Sara C. Madeira and Arlindo L. Oliveira: Biclustering algorithms for biological data analysis: A
     survey. IEEE/ACM Trans. Comput. Biology Bioinform., 1(1):24–45, 2004. [doi:10.1109/TCBB.2004.2]

[26] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan: The planar k-means problem is
     NP-hard. Theoret. Comput. Sci., 442:13–21, 2012. Preliminary version in WALCOM'09.
     [doi:10.1016/j.tcs.2010.05.034]

[27] Nimrod Megiddo and Kenneth J. Supowit: On the complexity of some common geometric location
     problems. SIAM J. Comput., 13(1):182–196, 1984. [doi:10.1137/0213014]

[28] Nina Mishra, Dana Ron, and Ram Swaminathan: A new conceptual clustering framework. Machine
     Learning, 56(1–3):115–151, 2004. Preliminary version in COLT'03.
     [doi:10.1023/B:MACH.0000033117.77257.41]

[29] René Peeters: The maximum edge biclique problem is NP-complete. Discr. Appl. Math.,
     131(3):651–654, 2003. [doi:10.1016/S0166-218X(03)00333-0]

[30] Kai Puolamäki, Sami Hanhijärvi, and Gemma C. Garriga: An approximation ratio for biclustering.
     Inform. Process. Lett., 108(2):45–49, 2008. [doi:10.1016/j.ipl.2008.03.013]

[31] Ron Shamir, Roded Sharan, and Dekel Tsur: Cluster graph modification problems. Discr. Appl.
     Math., 144(1–2):173–182, 2004. Preliminary version in WG'02. [doi:10.1016/j.dam.2004.01.007]

[32] Amos Tanay, Roded Sharan, and Ron Shamir: Biclustering algorithms: A survey. In Srinivas Aluru,
     editor, Handbook of Computational Molecular Biology. Chapman & Hall/CRC, Computer and
     Information Science Series, 2005.

[33] Andrea Vattani: The hardness of k-means clustering in the plane. Manuscript, accessible at
     http://cseweb.ucsd.edu/~avattani/papers/kmeans_hardness.pdf.

[34] Jiong Yang, Haixun Wang, Wei Wang, and Philip S. Yu: Enhanced biclustering on expression data.
     In Proc. 3rd IEEE Conf. on Bioinformatics and Bioengineering (BIBE'03), pp. 321–327. IEEE
     Comp. Soc. Press, 2003. [doi:10.1109/BIBE.2003.1188969]

[35] Jiong Yang, Wei Wang, Haixun Wang, and Philip S. Yu: δ-clusters: Capturing subspace correlation
     in a large data set. In Proc. 18th Internat. Conf. on Data Engineering (ICDE'02), pp. 517–528.
     IEEE Comp. Soc. Press, 2002. [doi:10.1109/ICDE.2002.994771]

[36] Hanson Zhou and David P. Woodruff: Clustering via matrix powering. In Proc. 23rd Symp. on
     Principles of Database Systems (PODS'04), pp. 136–142. ACM Press, 2004.
     [doi:10.1145/1055558.1055579]





AUTHORS

    Aris Anagnostopoulos
    Assistant Professor
    Department of Computer, Control, and Management Engineering
    Sapienza University of Rome, Italy
    aris dis uniroma1 it
    http://aris.me


    Anirban Dasgupta
    Yahoo!
    Sunnyvale, CA
    anirban.dasgupta gmail com
    http://sites.google.com/site/anirbandasgupta


    Ravi Kumar
    Google
    Mountain View, CA
    ravi.k53 gmail com
    https://sites.google.com/site/ravik53


ABOUT THE AUTHORS

    Aris Anagnostopoulos is an assistant professor at the Department of Computer, Control,
       and Management Engineering of the Sapienza University of Rome, Italy. Before joining
       the department he was a postdoctoral fellow at the Yahoo! Research labs in Santa Clara,
       CA. He graduated in 2000 in computer engineering and informatics from the University
       of Patras, Greece, and in 2006 he obtained a Ph. D. in computer science from Brown
       University in the area of analysis of stochastic processes in computer science, under the
       supervision of Eli Upfal. His research interests lie in the broad areas of algorithms and
       probabilistic analysis, with emphasis on data-mining, web-mining, and social-network
       applications.





Anirban Dasgupta did his undergraduate studies at the Computer Science department
   of IIT Kharagpur, and joined the Cornell CS department as a graduate student in 2000.
   Anirban finished his Ph. D. in 2006 under the supervision of John Hopcroft, having
   worked on spectral methods for learning mixtures of distributions. Since then, Anirban
   has been employed as a scientist at Yahoo! Research. His research interests span linear
   algebraic techniques for information retrieval, algorithmic game theory, modeling of
   and algorithms for social networks, and the design and analysis of randomized and
   approximation algorithms in general.


Ravi Kumar has been a senior staff research scientist at Google since June 2012. Prior
   to this, he was a research staff member at the IBM Almaden Research Center and a
   principal research scientist at Yahoo! Research. He obtained his Ph. D. in Computer
   Science from Cornell University in 1998, under the guidance of Ronitt Rubinfeld; his
   thesis topic was on program checking. His primary interests are web and data mining,
   social networks, algorithms for large data sets, and theory of computation.



