                                 THEORY OF COMPUTING, Volume 2 (2006), pp. 225–247
                                             http://theoryofcomputing.org




             Matrix Approximation and Projective Clustering via Volume Sampling

               Amit Deshpande∗        Luis Rademacher†        Santosh Vempala‡        Grant Wang§
                Received: August 11, 2005; revised: September 28, 2006; published: October 12, 2006.



        Abstract: Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows
        of a given matrix A spans the rows of a low-rank approximation D that minimizes kA − DkF
        within a small additive error, and the sampling can be done efficiently using just two passes
        over the matrix. In this paper, we generalize this result in two ways. First, we prove
        that the additive error drops exponentially by iterating the sampling in an adaptive manner
        (adaptive sampling). Using this result, we give a pass-efficient algorithm for computing a
        low-rank approximation with reduced additive error. Our second result is that there exist
        k rows of A whose span contains the rows of a multiplicative (k + 1)-approximation to
        the best rank-k matrix; moreover, this subset can be found by sampling k-subsets of rows
        from a natural distribution (volume sampling). Combining volume sampling with adaptive
        sampling yields the existence of a set of k + k(k + 1)/ε rows whose span contains the rows
        of a multiplicative (1 + ε)-approximation. This leads to a PTAS for the following NP-hard
        projective clustering problem: Given a set P of points in Rd , and integers j and k, find
        subspaces F1 , . . . , Fj , each of dimension at most k, that minimize ∑ p∈P mini d(p, Fi )2 .
   ∗ Supported by NSF Award CCR-0312339.
   † Supported by NSF Award CCR-0312339.
   ‡ Supported by NSF Award CCR-0312339 and a Guggenheim Foundation Fellowship.
   § Supported by NSF Award CCR-0312339.



ACM Classification: G.1.3, F.2.1
AMS Classification: 68W25, 65F15
Key words and phrases: algorithms, matrix approximation, projective clustering.


Authors retain copyright to their work and grant Theory of Computing unlimited rights
to publish the work electronically and in hard copy. Use of the work is permitted as
long as the author(s) and the journal are properly acknowledged. For the detailed
copyright statement, see http://theoryofcomputing.org/copyright.html.

© 2006 Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang                  DOI: 10.4086/toc.2006.v002a012




1     Introduction
1.1   Motivation
Let the rows of a matrix be points in a high-dimensional space. It is often of interest to find a low-
dimensional representation. The subspace spanned by the top k right singular vectors of the matrix
is a good choice for many applications. The problem of efficiently finding an approximation to this
subspace has received much attention in the past decade [19, 15, 1, 14, 16]. In this paper, we give new
algorithms for this problem and show existence of subspaces lying in the span of a small set of rows
with better additive approximation as well as multiplicative approximation. At the heart of our analysis
are generalizations of previous sampling schemes [19]. We apply these results to the general problem of
finding j subspaces, each of dimension at most k, so as to minimize the sum of squared distances of each
point to its nearest subspace, a measure of the “error” incurred by this representation; as a result, we
obtain the first polynomial-time approximation scheme for this projective clustering problem [5, 32, 6, 3]
when j and k are fixed.
     The case of j = 1, i. e., finding a single k-dimensional subspace, is an important problem in itself
and can be solved efficiently (for j ≥ 2, the problem is NP-hard [30], even for k = 1 [15]). The optimal
projection is given by the rank-k matrix Ak = AYY T where the columns of Y are the top k right singular
vectors of A and can be computed using the Singular Value Decomposition. Note that among all rank-k
matrices D, Ak is the one that minimizes kA−Dk2F = ∑i, j (Ai j −Di j )2 . The running time of this algorithm,
dominated by the SVD computation of an m × n matrix, is O(min{mn2 , nm2 }). Although polynomial,
this is still too high for some applications.
     For problems on data sets that are too large to store/process in their entirety, one can view the
data as a stream of data items arriving in arbitrary order, and the goal is to process a subset chosen
judiciously on the fly and then extrapolate from this subset. Motivated by the question of finding a faster
algorithm, Frieze et al. [19] showed that any matrix A has k/ε rows whose span contains the rows of a
rank-k approximation to A within additive error εkAk2F . In fact, the subset of rows can be obtained as
independent samples from a distribution that depends only on the norms of the rows. (In what follows,
A(i) denotes the ith row of A.)

Theorem 1.1 ([19]). Let S be a sample of s rows of an m × n matrix A, each chosen independently from
the following distribution: row i is picked with probability

    P_i = \frac{\|A^{(i)}\|^2}{\|A\|_F^2} .

If s ≥ k/ε, then span(S) contains the rows of a matrix Ãk of rank at most k for which

                                 ES (kA − Ãk k2F ) ≤ kA − Ak k2F + εkAk2F .
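
As an illustration of Theorem 1.1, the sketch below (Python with NumPy; the helper names and the toy matrix are ours, not the paper's) draws s = ⌈k/ε⌉ rows from the squared-norm distribution and compares one trial of kA − Ãk k2F against the additive bound, which holds in expectation. Projecting A onto span(S) and truncating the SVD to rank k gives the best rank-k matrix whose rows lie in that span.

import numpy as np

def squared_norm_sample(A, s, rng):
    """Sample s row indices i.i.d. with P_i = ||A^(i)||^2 / ||A||_F^2."""
    p = (A * A).sum(axis=1)
    return rng.choice(A.shape[0], size=s, replace=True, p=p / p.sum())

def best_rank_k_in_row_span(A, S, k):
    """Best rank-k approximation of A whose rows lie in the span of the rows A[S]."""
    Q, _ = np.linalg.qr(A[S].T)                  # orthonormal basis of span(S), as columns
    U, sig, Vt = np.linalg.svd(A @ Q, full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Vt[:k] @ Q.T   # rank-k truncation, mapped back to R^n

rng = np.random.default_rng(0)
# toy data: a roughly rank-5 matrix plus noise (duplicate sampled rows are harmless here)
A = (rng.standard_normal((200, 5)) @ rng.standard_normal((5, 40))
     + 0.1 * rng.standard_normal((200, 40)))
k, eps = 5, 0.5
S = squared_norm_sample(A, s=int(np.ceil(k / eps)), rng=rng)
A_k_tilde = best_rank_k_in_row_span(A, S, k)

sig = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A_k_tilde) ** 2)             # one trial of ||A - A~_k||_F^2
print((sig[k:] ** 2).sum() + eps * (sig ** 2).sum())  # ||A - A_k||_F^2 + eps ||A||_F^2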


    This can be turned into an efficient algorithm based on sampling [15].1 The algorithm makes one
pass through A to figure out the sampling distribution and another pass to sample and compute the
approximation. Its complexity is O(min{m, n}k2 /ε 4 ). These results lead us to the following questions:
(1) Can the error be reduced significantly by using multiple passes through the data? (2) Can we get
multiplicative (1 + ε)-approximations? (3) Do these sampling algorithms have any consequences for the
general projective clustering problem?

1.2    Our results
We begin with the observation that the additive error term drops exponentially with the number of passes
when the sampling is done adaptively. Thus, low-rank approximation is a natural problem for which
multiple passes through the data are highly beneficial.
    The idea behind the algorithm is quite simple. As an illustrative example, suppose the data consists
of points along a 1-dimensional subspace of Rn except for one point. The best rank-2 subspace has zero
error. However, one round of sampling will most likely miss the point far from the line. So we use a
two-round approach. In the first pass, we get a sample from the squared norm distribution. Then we
sample again, but adaptively—we sample with probability proportional to the squared distance to the
span of the first sample. We call this procedure adaptive sampling. If the lone far-off point is missed
in the first pass, it will have a high probability of being chosen in the second pass. The span of the full
sample now contains the rows of a good rank-2 approximation. In the theorem below, for a set S of rows
of a matrix A, we denote by π S (A) the matrix whose rows are the projection of the rows of A to the span
of S.
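
This two-round behavior is easy to simulate (a throwaway NumPy sketch of ours, not from the paper): almost all points lie on a line through the origin and one point sits far off it; the first-round squared-norm distribution gives the outlier little weight, while the second-round residual distribution puts nearly all of its mass on the outlier whenever the first round missed it.

import numpy as np

rng = np.random.default_rng(1)
n = 200
A = np.outer(10 * rng.standard_normal(n), [1.0, 2.0, 0.0])  # points on a line in R^3
A[0] = [0.0, 0.0, 20.0]                                     # one far-off point

p1 = (A * A).sum(axis=1)
p1 = p1 / p1.sum()                       # round 1: squared row norms
S1 = rng.choice(n, size=2, replace=True, p=p1)

Q, _ = np.linalg.qr(A[S1].T)
E = A - A @ Q @ Q.T                      # residual after projecting onto span(S1)
p2 = (E * E).sum(axis=1)
p2 = p2 / p2.sum()                       # round 2: adaptive (residual) distribution

print(p1[0], p2[0])  # weight of the far-off point in round 1 vs. round 2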

Theorem 1.2. Let S = S1 ∪ · · · ∪ St be a random sample of rows of an m × n matrix A where, for j =
1, . . . ,t, each set S j is a sample of s rows of A chosen independently from the following distribution: row
i is picked with probability
    P_i^{(j)} = \frac{\|E_j^{(i)}\|^2}{\|E_j\|_F^2}
where E1 = A, E j = A − π S1 ∪···∪S j−1 (A). Then for s ≥ k/ε, span(S) contains the rows of a matrix Ãk of
rank at most k such that
    E_S(\|A - \tilde{A}_k\|_F^2) \le \frac{1}{1-\varepsilon}\,\|A - A_k\|_F^2 + \varepsilon^t\,\|A\|_F^2 .
    The proof of Theorem 1.2 is given in Section 2.1. The resulting algorithm, described in Section 2.2
uses 2t passes through the data. Although the sampling distribution is modified t times, the matrix itself
is not changed. This is especially significant when A is sparse as the sparsity of the matrix is maintained.
    Theorem 1.2 raises the question of whether we can get a multiplicative approximation instead of an
additive approximation. To answer this question, we generalize the sampling approach. For a set S of k
points in Rn , let ∆(S) denote the k-dimensional simplex formed by them along with the origin. We pick
a random k-subset S of rows of A with probability proportional to vol(∆(S))2 . This procedure, which
   1 Frieze et al. [19] go further to show that there is an s × s submatrix for s = poly(k/ε) from which the low-rank approximation can be computed in poly(k, 1/ε) time in an implicit form.



we call volume sampling, is a generalization of the earlier sampling approach which picks single rows
according to their squared norms.

Theorem 1.3. Let S be a random k-subset of rows of a given matrix A chosen with probability

    P_S = \frac{\mathrm{vol}(\Delta(S))^2}{\sum_{T:|T|=k} \mathrm{vol}(\Delta(T))^2} .

Then Ãk , the projection of A to the span of S, satisfies

                                   ES (kA − Ãk k2F ) ≤ (k + 1)kA − Ak k2F .
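
For very small matrices, volume sampling can be carried out by brute force, which also makes the (k + 1) bound easy to check empirically. The sketch below (ours, not the paper's; Python/NumPy, enumerating all k-subsets) uses vol(∆(S))2 = det(AS ATS )/(k!)2 , so the (k!)2 factor cancels when normalizing the distribution.

import itertools
import numpy as np

def volume_sample(A, k, rng):
    """Pick a k-subset S of row indices with P_S proportional to vol(Delta(S))^2."""
    subsets = list(itertools.combinations(range(A.shape[0]), k))
    # vol_k(Delta_S)^2 = det(A_S A_S^T) / (k!)^2; the constant cancels in the distribution.
    w = np.array([np.linalg.det(A[list(S)] @ A[list(S)].T) for S in subsets])
    w = np.clip(w, 0.0, None)            # guard tiny negative values from round-off
    return list(subsets[rng.choice(len(subsets), p=w / w.sum())])

def projection_error(A, S):
    """||A - pi_S(A)||_F^2 for the span of the rows indexed by S."""
    Q, _ = np.linalg.qr(A[S].T)
    return np.linalg.norm(A - A @ Q @ Q.T) ** 2

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 8))
k = 2
S = volume_sample(A, k, rng)
sig = np.linalg.svd(A, compute_uv=False)
print(projection_error(A, S))            # one draw of ||A - pi_S(A)||_F^2
print((k + 1) * (sig[k:] ** 2).sum())    # (k+1) ||A - A_k||_F^2, a bound in expectation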

     We prove this theorem in Section 3. Moreover, the factor of k + 1 is the best possible for a
k-subset (Proposition 3.3). By combining Theorem 1.3 with the adaptive sampling idea from The-
orem 1.2, we show that there exist O(k2 /ε) rows whose span contains the rows of a multiplicative
(1 + ε)-approximation.

Theorem 1.4. For any m × n matrix A, there exist k + k(k + 1)/ε rows whose span contains the rows of
a rank-k matrix Ãk such that
                                kA − Ãk k2F ≤ (1 + ε) kA − Ak k2F .

    The existence of a small number of rows containing a good multiplicative approximation is the key
ingredient in our last result—a polynomial-time approximation scheme (PTAS) for the general projec-
tive clustering problem. This result makes a connection between matrix approximation and projective
clustering. A key idea in the matrix approximation work of [14, 15, 19] is that, for any matrix, there
is a small subset of its rows whose span contains a good approximation to the row space of the entire
matrix. This is similar to the idea of core-sets [2], which have been studied in computing extent mea-
sures in computational geometry (and applied to a variant of the projective clustering problem [24]).
Roughly speaking, a core-set is a subset of a point-set such that computing the extent measure on the
core-set provides an approximation to the extent measure on the entire point set. An extent measure is
just a statistic on the point set (for example, the diameter of a point set, or the radius of the minimum
enclosing cylinder); typically, it measures the size of the point set or the minimum size of an object that
encloses the point set [2].
    We state the projective clustering problem using the notation from computational geometry: Let
d(p, F) be the orthogonal distance of a point p to a subspace F. Given a set P of n points in Rd , find j
subspaces F1 , . . . , Fj , each of dimension k, such that

    C(F_1, \ldots, F_j) = \sum_{p \in P} \min_i d(p, F_i)^2    (1.1)

is minimized. When subspaces are replaced by flats, the case k = 0 corresponds to the “ j-means prob-
lem” in computational geometry.
    Theorem 1.4 suggests an enumerative algorithm. The optimal set of k-dimensional subspaces in-
duces a partition P1 ∪ · · · ∪ Pj of the given point set. Each set Pi contains a subset of size O(k2 /ε) in
whose span lies a multiplicative (1 + ε)-approximation to the optimal k-dimensional subspace for Pi . So


we consider all possible combinations of j subsets each of size O(k2 /ε), and a δ -net of k-dimensional
subspaces in the span of each subset. The δ -net depends on the points in each subset and is not just a
grid, as is often the case. Each possible combination of subspaces induces a partition and we output the
best of these. Since the subset size is bounded (and so is the size of the net), this gives a PTAS for the
problem (see Section 4) when j and k are taken to be fixed constants.
Theorem 1.5. Given n points in Rd and parameters B and ε, in time
    d\left(\frac{n}{\varepsilon}\right)^{O(jk^3/\varepsilon)}
we can find a solution to the projective clustering problem which is of cost at most (1 + ε)B provided
there is a solution of cost B.

1.3   Related work
The work of Frieze et al. [19] and Drineas et al. [15] introduced matrix sampling for fast low-rank ap-
proximation. Subsequently, an alternative sampling-based algorithm was given by Achlioptas and Mc-
Sherry [1]. That algorithm achieves somewhat different bounds (see [1] for a detailed comparison) using
only one pass. It does not seem amenable to the multipass improvements presented here. Bar-Yossef [9]
has shown that the bounds of these algorithms for one or two passes are optimal up to polynomial factors
in 1/ε.
     These algorithms can also be viewed in the streaming model of computation [25]. In this model,
we do not have random access to data; the data comes as a stream of data items in arbitrary order and
we are allowed one or a few sequential passes over the data. Algorithms for the streaming model have
been designed for computing frequency moments [7], histograms [22], etc. and have mainly focused
on what can be done in one pass. There has been some recent work on what can be done in multiple
passes [14, 18]. The “pass-efficient” model of computation was introduced in [25]. Our multipass
algorithms fit this model and relate the quality of approximation to the number of passes. Feigenbaum et
al. [18] show such a relationship for computing the maximum unweighted matching in bipartite graphs.
     The Lanczos method is an iterative algorithm that is used in practice to compute the Singular Value
Decomposition [20, 26]. An exponential decrease in an additive error term has also been proven for
the Lanczos method under a different notion of additive error ([20, 26]). However, the exponential
decrease in error depends on the gap between singular values. In particular, the following is known for
the Lanczos method: after k iterations, each approximate singular value θi2 obeys:

    \theta_i^2 \ge \sigma_i^2 - c^k C    (1.2)

where σi is the ith singular value. Both constants c and C depend on the gap between singular values; in
particular, c is proportional to (σ22 − σr2 )/(2σ12 − σ22 − σr2 ) and C = σ12 − σr2 . The guarantee (1.2) can
be transformed into an inequality:

    \|A - \tilde{A}_k\|_F^2 \le \|A - A_k\|_F^2 + c^k C    (1.3)

very similar to Theorem 1.2, but without the multiplicative error term for kA − Ak k2F . However, note that
when σ12 = σ22 , successive rounds of the Lanczos method fail to converge to σi2 . In the Lanczos method,


each iteration can be implemented in one pass over A, whereas our algorithm requires two passes over
A in each iteration. Kuczyński and Woźniakowski [27] prove that the Lanczos method, with a randomly
chosen starting vector, outputs a vector v with Ã1 = π span(v) (A) such that kÃ1 k2F ≥ (1 − ε)kA1 k2F in
log(n)/√ε iterations. However, this does not imply a multiplicative (1 + ε) error to kA − A1 k2F .
     While the idea of volume sampling appears to be new, in [21] it is proved that, using the rows
and columns that correspond to the k × k submatrix of maximal volume, one can compute a rank-k
approximation to the original matrix which differs in each entry by at most (k + 1)σk+1 (leading to much
weaker √(mn) approximations for the Frobenius norm).
     The results of our paper connect two previously separate fields: low-rank approximation and pro-
jective clustering. Algorithms and systems based on projective clustering have been applied to facial
recognition, data-mining, and synthetic data [5, 32, 6], motivated by the observation that no single
subspace performs as well as a few different subspaces. It should be noted that the advantage of a low-
dimensional representation is not merely in the computational savings, but also the improved quality of
retrieval. In [3], Agarwal and Mustafa consider the same problem as in this paper, and propose a variant
of the j-means algorithm for it. Their paper has promising experimental results but does not provide any
theoretical guarantees. There are theoretical results for special cases of projective clustering, especially
the j-means problem (find j points). Drineas et al. [15] gave a 2-approximation to j-means using SVD.
Subsequently, Ostrovsky and Rabani [31] gave the first randomized polynomial time approximation
schemes for j-means (and also the j-median problem). Matoušek [29] and Effros and Schulman [17]
both gave deterministic PTAS’s for j-means. Fernandez de la Vega et al. [11] describe a randomized
algorithm with a running time of n(log n)O(1) . Using the idea of core-sets, Har-Peled and Mazumdar [23]
showed a multiplicative (1 + ε)-approximation algorithm that runs in linear time for fixed j, ε. Kumar et
al. [28] give a linear-time PTAS that uses random sampling. There is a PTAS for k = 1 (lines) as well [4].
Other objective functions have also been studied, e. g. sum of distances ( j-median when k = 0, [31, 23])
and maximum distance ( j-center when k = 0, [10]). For general k, Har-Peled and Varadarajan [24] give
a multiplicative (1 + ε)-approximation algorithm for the maximum distance objective function. Their
algorithm runs in time dn^{O(jk^6 log(1/ε)/ε^5)} and is based on core-sets (see [2] for a survey).

1.4   Previous versions of this paper
This paper is a journal version of the paper by the same set of authors presented at the 17th Annual
Symposium on Discrete Algorithms (SODA 2006) [12]. This paper contains the same major results, but
additionally provides more intuition and more complete proofs. A previous version [33] of this paper
by a subset of three of the authors (L. Rademacher, S. Vempala, G. Wang) appeared as an MIT CSAIL
technical report. That version contained a subset of the results in this paper. In particular, the notion of
volume sampling and related results did not appear in the technical report.

1.5   Notation and preliminaries
Any m × n real matrix A has a singular value decomposition, that is, it can be written in the form
    A = \sum_{i=1}^{r} \sigma_i\, u^{(i)} v^{(i)T}



where r is the rank of A and σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0 are called the singular values; {u(1) , . . . , u(r) } ∈ Rm ,
{v(1) , . . . , v(r) } ∈ Rn are sets of orthonormal vectors, called the left and right singular vectors, respec-
tively. It follows that AT u(i) = σi v(i) and Av(i) = σi u(i) for 1 ≤ i ≤ r.
    The Frobenius norm of a matrix A ∈ Rm×n having elements (ai j ) is denoted kAkF and is given by
    \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 .

It satisfies kAk2F = ∑ri=1 σi2 .
     For a subspace V ⊆ Rn , let π V,k (A) denote the best rank-k approximation (under the Frobenius norm)
of A with its rows in V . Let
    \pi_k(A) = \pi_{\mathbb{R}^n,k}(A) = \sum_{i=1}^{k} \sigma_i\, u^{(i)} v^{(i)T}
be the best rank-k approximation of A. Also π V (A) = π V,n (A) is the orthogonal projection of A onto V .
When we say “a set (or sample) of rows of A” we mean a set of indices of rows, rather than the actual
rows. For a set S of rows of A, let span(S) ⊆ Rn be the subspace generated by those rows; we use the
simplified notation π S (A) for π span(S) (A) and π S,k (A) for π span(S),k (A).
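
The two operators π S (A) and π S,k (A) are used throughout; a minimal NumPy rendering (our helper names, not the paper's) is given below. The rank-k version is simply the rank-k SVD truncation of π S (A): projecting first and truncating afterwards gives the best rank-k matrix with rows in span(S), since the projection error is orthogonal to any such matrix.

import numpy as np

def pi_S(A, S):
    """pi_S(A): the rows of A projected onto the span of the rows indexed by S."""
    Q, _ = np.linalg.qr(A[list(S)].T)    # orthonormal basis of span(S), as columns
    return A @ Q @ Q.T

def pi_S_k(A, S, k):
    """pi_{S,k}(A): best rank-k approximation of A with rows in span(S),
    obtained as the rank-k SVD truncation of pi_S(A)."""
    U, sig, Vt = np.linalg.svd(pi_S(A, S), full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Vt[:k]

# tiny check: with S = all row indices, pi_{S,k}(A) is the usual A_k
A = np.arange(12.0).reshape(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(pi_S_k(A, range(4), 2), (U[:, :2] * s[:2]) @ Vt[:2]))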
    For subspaces V,W ⊆ Rn , their sum is denoted V +W and is given by
                                   V +W = {x + y ∈ Rn : x ∈ V, y ∈ W } .
      The following elementary properties of the operator π V will be used:
      • π V is linear, that is, π V (λ A + B) = λ π V (A) + π V (B) for any λ ∈ R and matrices A, B ∈ Rm×n .
      • If V,W ⊆ Rn are orthogonal linear subspaces, then π V +W (A) = π V (A) + π W (A), for any matrix
        A ∈ Rm×n .
    For a random vector v, its expectation, denoted E(v), is the vector having as components the expected
values of the components of v.


2      Improved approximation via adaptive sampling
2.1     Proof that adaptive sampling works
We will prove Theorem 1.2 in this section. It will be convenient to formulate an intermediate theorem
as follows, whose proof is quite similar to one in [19].
Theorem 2.1. Let A ∈ Rm×n , and V ⊆ Rn be a vector subspace. Let E = A − π V (A) and let S be a
random sample of s rows of A from a distribution D such that row i is chosen with probability
    P_i = \frac{\|E^{(i)}\|^2}{\|E\|_F^2} .    (2.1)
Then, for any nonnegative integer k,
    E_S(\|A - \pi_{V+\mathrm{span}(S),k}(A)\|_F^2) \le \|A - \pi_k(A)\|_F^2 + \frac{k}{s}\,\|E\|_F^2 .


Proof. We define vectors w(1) , . . . , w(k) ∈ V + span(S) such that W = span{w(1) , . . . , w(k) } and show that
W is a good approximation to span{v(1) , . . . , v(k) } in the sense that

    E_S(\|A - \pi_W(A)\|_F^2) \le \|A - \pi_k(A)\|_F^2 + \frac{k}{s}\,\|E\|_F^2 .    (2.2)

Recall that π k (A) = π span{v(1) ,...,v(k) } (A), i. e., span{v(1) , . . . , v(k) } is the optimal subspace upon which to
project. Proving (2.2) proves the theorem, since W ⊆ V + span(S).
    To this end, define X_l^{(j)} to be a random variable such that for i = 1, . . . , m and l = 1, . . . , s,

    X_l^{(j)} = \frac{u_i^{(j)}}{P_i}\,E^{(i)} = \frac{u_i^{(j)}}{P_i}\left(A^{(i)} - \pi_V(A^{(i)})\right) \quad \text{with probability } P_i .

Note that X_l^{(j)} is a linear function of a row of A sampled from the distribution D. Let X^{(j)} = \frac{1}{s}\sum_{l=1}^{s} X_l^{(j)},
and note that E_S(X^{(j)}) = E^T u^{(j)}.
   For 1 ≤ j ≤ k, define:
                                         w( j) = π V (A)T u( j) + X ( j) .                             (2.3)
Then we have that ES (w( j) ) = σ j v( j) . We seek to bound the second central moment of w( j) , i. e.,
ES (kw( j) − σ j v( j) k2 ). We have that

                                                 w( j) − σ j v( j) = X ( j) − E T u( j) ,

which gives us

                    ES (kw( j) − σ j v( j) k2 ) = ES (kX ( j) − E T u( j) k2 )
                                                   = ES (kX ( j) k2 ) − 2 ES (X ( j) ) · E T u( j) + kE T u( j) k2
                                                   = ES (kX ( j) k2 ) − kE T u( j) k2 .                              (2.4)

    We evaluate the first term in (2.4),
    E_S(\|X^{(j)}\|^2) = E_S\left(\left\|\frac{1}{s}\sum_{l=1}^{s} X_l^{(j)}\right\|^2\right)
                       = \frac{1}{s^2}\sum_{l=1}^{s} E_S(\|X_l^{(j)}\|^2) + \frac{2}{s^2}\sum_{1\le l_1<l_2\le s} E_S(X_{l_1}^{(j)} \cdot X_{l_2}^{(j)})
                       = \frac{1}{s^2}\sum_{l=1}^{s} E_S(\|X_l^{(j)}\|^2) + \frac{s-1}{s}\,\|E^T u^{(j)}\|^2 .    (2.5)

In (2.5) we used that X_{l_1}^{(j)} and X_{l_2}^{(j)} are independent. From (2.4) and (2.5) we have that

    E_S(\|w^{(j)} - \sigma_j v^{(j)}\|^2) = \frac{1}{s^2}\sum_{l=1}^{s} E_S(\|X_l^{(j)}\|^2) - \frac{1}{s}\,\|E^T u^{(j)}\|^2 .    (2.6)


The definition of Pi gives us
    E_S(\|X_l^{(j)}\|^2) = \sum_{i=1}^{m} P_i\,\frac{\|u_i^{(j)} E^{(i)}\|^2}{P_i^2} \le \|E\|_F^2 .    (2.7)

Thus, we have obtained a bound on the second central moment of w( j) :
    E_S(\|w^{(j)} - \sigma_j v^{(j)}\|^2) \le \frac{1}{s}\,\|E\|_F^2 .    (2.8)
    With this bound in hand, we can complete the proof. Let y^{(j)} = w^{(j)}/σ_j for 1 ≤ j ≤ k, and consider
the matrix F = A \sum_{i=1}^{k} v^{(i)} y^{(i)T}. The row space of F is contained in W = span{w^{(1)}, . . . , w^{(k)}}. Therefore,
\|A - \pi_W(A)\|_F^2 \le \|A - F\|_F^2. We will use F to bound the error \|A - \pi_W(A)\|_F^2.
    By decomposing A − F along the left singular vectors u(1) , . . . , u(r) , we can use the inequality (2.8)
to bound kA − Fk2F :
    E_S(\|A - \pi_W(A)\|_F^2) \le E_S(\|A - F\|_F^2) = \sum_{i=1}^{r} E_S(\|(A - F)^T u^{(i)}\|^2)
        = \sum_{i=1}^{k} E_S(\|\sigma_i v^{(i)} - w^{(i)}\|^2) + \sum_{i=k+1}^{r} \sigma_i^2
        \le \frac{k}{s}\,\|E\|_F^2 + \|A - \pi_k(A)\|_F^2 .    (2.9)


    We can now prove Theorem 1.2 inductively using Theorem 2.1.

Proof of Theorem 1.2. We will prove the inequality by induction on t. Theorem 1.1 gives us the base
case t = 1.
     For the inductive step, let E = A − π S1 ∪···∪St−1 (A). By means of Theorem 2.1 with s ≥ k/ε we have
that

                          ESt (kA − π S1 ∪···∪St ,k (A)k2F ) ≤ kA − π k (A)k2F + εkEk2F .

Combining this inequality with the fact that kEk2F ≤ kA − π S1 ∪···∪St−1 ,k (A)k2F we get

                ESt (kA − π S1 ∪···∪St ,k (A)k2F ) ≤ kA − π k (A)k2F + εkA − π S1 ∪···∪St−1 ,k (A)k2F .    (2.10)

Taking the expectation over S1 , . . . , St−1 , and using the induction hypothesis for t − 1 gives the result:

    E_S(\|A - \pi_{S_1\cup\cdots\cup S_t,k}(A)\|_F^2) \le \|A - \pi_k(A)\|_F^2 + \varepsilon\, E_{S_1,\ldots,S_{t-1}}\!\left(\|A - \pi_{S_1\cup\cdots\cup S_{t-1},k}(A)\|_F^2\right)    (2.11)
        \le \|A - \pi_k(A)\|_F^2 + \varepsilon\left(\frac{1}{1-\varepsilon}\,\|A - \pi_k(A)\|_F^2 + \varepsilon^{t-1}\,\|A\|_F^2\right)    (2.12)
        = \frac{1}{1-\varepsilon}\,\|A - \pi_k(A)\|_F^2 + \varepsilon^t\,\|A\|_F^2 .    (2.13)




2.2     Algorithm
In this section, we present the multipass algorithm for low-rank approximation. We first describe it at a
conceptual level and then give the details of the implementation.
     Informally, the algorithm will find an approximation to the best rank-k subspace (the span of
v(1) , . . . , v(k) ) by first choosing a sample T of s random rows with probabilities proportional to the squared
norms of the rows (as in Theorem 1.1). Then we focus on the space orthogonal to the span of the chosen
rows, that is, we consider the matrix E = A − π T (A), which represents the error of our current approx-
imation, and we sample s additional rows with probabilities proportional to the squared norms of the
rows of E. We consider the union of this sample with our previous sample, and we continue adding
samples in this way, up to the number of passes that we have chosen. Theorem 1.2 gives a bound on the
error of this procedure.

 Fast Approximate SVD

 Input: A ∈ Rm×n , integers k ≤ m, t, error parameter ε > 0.
 Output: A set of k vectors in Rn .

      1. Let S = ∅, s = k/ε.

      2. Repeat t times:

          (a) Let E = A − π S (A) (so E = A when S = ∅) and compute the probabilities Pi = kE (i) k2 /kEk2F
              for i = 1, . . . , m. (One pass.)
          (b) Let T be a sample of s rows of A according to the distribution that assigns probability
              Pi to row i. (Another pass.)
          (c) Let S = S ∪ T .

      3. Let h1 , . . . , hk be the top k right singular vectors of π S (A).
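
A minimal in-memory sketch of this procedure is given below (Python/NumPy, ours, not the paper's implementation). It ignores the pass/streaming structure and the incremental bookkeeping analyzed in Theorem 2.2, simply recomputing the residual E = A − π S (A) at the start of each of the t rounds, and it assumes the residual is nonzero whenever it samples.

import numpy as np

def fast_approximate_svd(A, k, t, eps, rng):
    """t rounds of adaptive sampling with s = ceil(k/eps) rows per round,
    followed by the top k right singular vectors of pi_S(A)."""
    m, _ = A.shape
    s = int(np.ceil(k / eps))
    S = []
    for _ in range(t):
        if S:
            Q, _ = np.linalg.qr(A[S].T)
            E = A - A @ Q @ Q.T            # residual E = A - pi_S(A)
        else:
            E = A                          # first round: E_1 = A
        p = (E * E).sum(axis=1)
        p = p / p.sum()                    # assumes the residual is nonzero
        S.extend(rng.choice(m, size=s, replace=True, p=p).tolist())
    Q, _ = np.linalg.qr(A[S].T)
    _, _, Vt = np.linalg.svd(A @ Q @ Q.T, full_matrices=False)  # SVD of pi_S(A)
    return S, Vt[:k].T                     # h_1, ..., h_k as columns

rng = np.random.default_rng(3)
A = (rng.standard_normal((300, 6)) @ rng.standard_normal((6, 40))
     + 0.05 * rng.standard_normal((300, 40)))
S, H = fast_approximate_svd(A, k=6, t=3, eps=0.5, rng=rng)
sig = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A @ H @ H.T) ** 2)  # ||A - pi_V(A)||_F^2 for V = span(h_1, ..., h_k)
print((sig[6:] ** 2).sum())                  # ||A - A_k||_F^2 for comparison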

      In what follows, let M be the number of non-zeros of A.
Theorem 2.2. In 2t passes over the data, where in each pass the entries of the matrix arrive in arbitrary
order, the algorithm finds vectors h1 , . . . , hk ∈ Rn such that with probability at least 3/4 their span V
satisfies
    \|A - \pi_V(A)\|_F^2 \le \left(1 + \frac{4\varepsilon}{1-\varepsilon}\right)\|A - \pi_k(A)\|_F^2 + 4\varepsilon^t\,\|A\|_F^2 .    (2.14)

The running time is O\!\left(M\,\frac{kt}{\varepsilon} + (m+n)\,\frac{k^2 t^2}{\varepsilon^2}\right).

Proof. For the correctness, observe that π V (A) is a random variable with the same distribution as π S,k (A)
as defined in Theorem 1.2. Also, kA − π S,k (A)k2F − kA − π k (A)k2F is a nonnegative random variable and
Theorem 1.2 gives a bound on its expectation:
    E_S(\|A - \pi_{S,k}(A)\|_F^2 - \|A - \pi_k(A)\|_F^2) \le \frac{\varepsilon}{1-\varepsilon}\,\|A - \pi_k(A)\|_F^2 + \varepsilon^t\,\|A\|_F^2 .    (2.15)


Markov’s inequality applied to this variable gives that with probability at least 3/4
    \|A - \pi_V(A)\|_F^2 - \|A - \pi_k(A)\|_F^2 \le \frac{4\varepsilon}{1-\varepsilon}\,\|A - \pi_k(A)\|_F^2 + 4\varepsilon^t\,\|A\|_F^2 ,    (2.16)
which implies inequality (2.14).
     We will now bound the running time. We maintain a basis of the rows indexed by S. In each iteration,
we extend this basis orthogonally with a new set Y of vectors, so that it spans the new sample T . The
residual squared norm of each row, kE (i) k2 , as well as the total, kEk2F , are computed by subtracting
the contribution of π T (A) from the values that they had during the previous iteration. In each iteration,
the projection onto Y needed for computing this contribution takes time O(Ms). In iteration i, the
computation of the orthonormal basis Y takes time O(ns2 i) (Gram-Schmidt orthonormalization of s
vectors in Rn with reference to an orthonormal basis of size at most s(i + 1)). Thus, the total time
in iteration i is O(Ms + ns2 i); with t iterations, this is O(Mst + ns2t 2 ). At the end of Step 2 we have
π S (A) in terms of our basis (an m × st matrix). Finding the top k singular vectors in Step 3 takes time
O(ms2t 2 ). Bringing them back to the original basis takes time O(nkst). Thus, the total running time is
O(Mst + ns2t 2 + ms2t 2 + nkst) or, in other words, O Mkt/ε + (m + n)k2t 2 /ε 2 .


3    Volume sampling and multiplicative approximation
We begin with a proof of Theorem 1.3, namely that volume sampling leads to a multiplicative (k + 1)-
approximation (in expectation).

Proof of Theorem 1.3. For every S ⊆ {1, . . . , m}, let ∆S be the simplex formed by {A(i) : i ∈ S} and the
origin. We wish to bound ES (kA − Ãk k2F ) which can be written as follows:
    E_S(\|A - \tilde{A}_k\|_F^2) = \frac{1}{\sum_{S:|S|=k}\mathrm{vol}_k(\Delta_S)^2}\sum_{S:|S|=k}\mathrm{vol}_k(\Delta_S)^2\,\|A - \pi_S(A)\|_F^2 .    (3.1)


For any (k + 1)-subset S = {i1 , . . . , ik+1 } of rows of A, we can express volk+1 (∆S )2 in terms of volk (∆T )2
for T = {i1 , . . . , ik }, along with the squared distance from A(ik+1 ) to HT , where HT is the linear subspace
spanned by {A(i) : i ∈ T }:
    \mathrm{vol}_{k+1}(\Delta_S)^2 = \frac{1}{(k+1)^2}\,\mathrm{vol}_k(\Delta_T)^2\,d(A^{(i_{k+1})}, H_T)^2 .    (3.2)
Summing over all subsets S of size k + 1:
    \sum_{S:|S|=k+1}\mathrm{vol}_{k+1}(\Delta_S)^2 = \frac{1}{k+1}\sum_{T:|T|=k}\sum_{j=1}^{m}\frac{1}{(k+1)^2}\,\mathrm{vol}_k(\Delta_T)^2\,d(A^{(j)}, H_T)^2
        = \frac{1}{(k+1)^3}\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2\sum_{j=1}^{m} d(A^{(j)}, H_T)^2
        = \frac{1}{(k+1)^3}\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2\,\|A - \pi_T(A)\|_F^2 ,    (3.3)




where in the last step we noted that ∑mj=1 d(A( j) , HT )2 = kA − π T (A)k2F . By substituting this equality in
(3.1) and applying Lemma 3.1 (proved next) twice, we have:
    E_S(\|A - \tilde{A}_k\|_F^2) = \frac{1}{\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2}\left((k+1)^3\sum_{S:|S|=k+1}\mathrm{vol}_{k+1}(\Delta_S)^2\right)
        = \frac{1}{\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2}\left(\frac{k+1}{(k!)^2}\sum_{1\le t_1<\cdots<t_{k+1}\le n}\sigma_{t_1}^2\cdots\sigma_{t_{k+1}}^2\right)
        \le \frac{1}{\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2}\left(\frac{k+1}{(k!)^2}\sum_{1\le t_1<\cdots<t_k\le r}\sigma_{t_1}^2\cdots\sigma_{t_k}^2\sum_{j=k+1}^{r}\sigma_j^2\right)    (3.4)
        = \frac{k+1}{\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2}\left(\sum_{T:|T|=k}\mathrm{vol}_k(\Delta_T)^2\sum_{j=k+1}^{r}\sigma_j^2\right)
        = (k+1)\,\|A - A_k\|_F^2 .



Lemma 3.1.
    \sum_{S:|S|=k}\mathrm{vol}_k(\Delta_S)^2 = \frac{1}{(k!)^2}\sum_{1\le t_1<t_2<\cdots<t_k\le r}\sigma_{t_1}^2\sigma_{t_2}^2\cdots\sigma_{t_k}^2

where σ1 , σ2 , · · · , σr > 0 = σr+1 = · · · = σn are the singular values of A.

Proof. Let A_S be the sub-matrix of A formed by the rows {A^{(i)} : i ∈ S}. Then we know that the volume
of the k-simplex formed by these rows is given by vol_k(∆_S) = \frac{1}{k!}\sqrt{\det(A_S A_S^T)}. Therefore,

    \sum_{S:|S|=k}\mathrm{vol}_k(\Delta_S)^2 = \frac{1}{(k!)^2}\sum_{S:|S|=k}\det(A_S A_S^T) = \frac{1}{(k!)^2}\sum_{\substack{B:\ \text{principal}\\ k\text{-minor of }AA^T}}\det(B) .    (3.5)

Let det(AAT − λ I) = (−1)m λ m + cm−1 λ m−1 + · · · + c0 be the characteristic polynomial of AAT . From
basic linear algebra we know that the roots of this polynomial are precisely the eigenvalues of AAT , i. e.,
σ12 , σ22 , . . . , σr2 and 0 with multiplicity (m − r). Moreover the coefficient cm−k can be expressed in terms
of these roots as:
    c_{m-k} = (-1)^{m-k}\sum_{1\le t_1<t_2<\cdots<t_k\le r}\sigma_{t_1}^2\sigma_{t_2}^2\cdots\sigma_{t_k}^2 .    (3.6)

But we also know that cm−k is the coefficient of λ m−k in det(AAT − λ I), which means

    c_{m-k} = (-1)^{m-k}\sum_{\substack{B:\ \text{principal}\\ k\text{-minor of }AA^T}}\det(B)    (3.7)

(see e. g., [8]; we prove it next as Proposition 3.2). This gives us our desired result.
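
Lemma 3.1 is also easy to sanity-check numerically (a throwaway NumPy check of ours, not part of the paper): the left-hand side sums det(A_S A_S^T)/(k!)^2 over all k-subsets of rows, and the right-hand side is the k-th elementary symmetric polynomial of the squared singular values, divided by (k!)^2.

import itertools
from math import factorial
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 7, 5, 3
A = rng.standard_normal((m, n))

# left side: sum over all k-subsets S of vol_k(Delta_S)^2 = det(A_S A_S^T) / (k!)^2
lhs = sum(np.linalg.det(A[list(S)] @ A[list(S)].T)
          for S in itertools.combinations(range(m), k)) / factorial(k) ** 2

# right side: (1/(k!)^2) * e_k(sigma_1^2, ..., sigma_r^2)
sig2 = np.linalg.svd(A, compute_uv=False) ** 2
rhs = sum(np.prod(c) for c in itertools.combinations(sig2, k)) / factorial(k) ** 2

print(lhs, rhs)  # the two sides agree up to floating-point error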


Proposition 3.2. Let the characteristic polynomial of M ∈ Rm×m be

    \det(M - \lambda I_m) = (-1)^m\lambda^m + c_{m-1}\lambda^{m-1} + \cdots + c_0 .

Then
    c_{m-k} = (-1)^{m-k}\sum_{\substack{B:\ \text{principal}\\ k\text{-minor of }M}}\det(B) \qquad \text{for } 1\le k\le m .

Proof. First, it is clear that c0 = det(M). Next, let M′ = M − λ I, and let Perm([m]) denote the set of
permutations of {1, 2, . . . , m}. The sign of a permutation sgn(τ), for τ ∈ Perm([m]), is equal to 1 if it is
a product of an even number of transpositions and −1 otherwise. For a subset S of rows, we denote the
submatrix of entries (Mi, j )i, j∈S by MS .

    \det(M - \lambda I_m) = \det(M') = \sum_{\tau\in\mathrm{Perm}([m])}\mathrm{sgn}(\tau)\, M'_{1,\tau(1)} M'_{2,\tau(2)}\cdots M'_{m,\tau(m)} .    (3.8)


The term c_{m−k} λ^{m−k} is the sum over those τ which fix some set S ⊆ [m] of size (m − k); the elements
∏_{i∈S} M′_{i,i} contribute (−1)^{m−k} λ^{m−k}, and the coefficient comes from the constant term in

    \sum_{\tau\in\mathrm{Perm}([m]-S)}\mathrm{sgn}(\tau)\prod_{i\notin S} M'_{i,\tau(i)} .

Each term in this sum is the c_0 term of a principal minor of M and so the sum is equal to

    \sum_{S:|S|=m-k}\det(M_{[m]-S}) .

Hence
    c_{m-k} = (-1)^{m-k}\sum_{S:|S|=m-k}\det(M_{[m]-S}) = (-1)^{m-k}\sum_{\substack{B:\ \text{principal}\\ k\text{-minor of }M}}\det(B) .    (3.9)




    The bound proved in Theorem 1.3 is in fact asymptotically tight:

Proposition 3.3. Given any ε > 0, there exists a (k + 1) × (k + 1) matrix A such that for any k-subset S
of rows of A,
                           kA − πS,k (A)k2F ≥ (1 − ε) (k + 1) kA − Ak k2F .

Proof. The tight example consists of a matrix with k + 1 rows which are the vertices of a regular k-
dimensional simplex lying on the affine hyperplane {Xk+1 = α} in Rk+1 . Let A(1) , A(2) , . . . , A(k+1)
be the vertices with the point p = (0, 0, . . . , 0, α) as their centroid. For α small enough, the best
k-dimensional subspace for these points is given by {Xk+1 = 0} and

                                         kA − Ak k2F = (k + 1)α 2 .                                    (3.10)


Consider any k-subset of rows from these, say S = {A(1) , A(2) , . . . , A(k) }, and let HS be the linear subspace
spanning them. Then,
                                  kA − πS,k (A)k2F = d(A(k+1) , HS )2 .                                      (3.11)
We claim that for any ε > 0, α can be chosen small enough so that

    d(A^{(k+1)}, H_S) \ge \sqrt{1-\varepsilon}\,(k+1)\,\alpha .    (3.12)

Choose α small enough so that d(p, H_S) ≥ √(1 − ε) α. Now

    \frac{d(A^{(k+1)}, H_S)}{d(p, H_S)} = \frac{d(A^{(k+1)}, \mathrm{conv}(A^{(1)},\ldots,A^{(k)}))}{d(p, \mathrm{conv}(A^{(1)},\ldots,A^{(k)}))} = k+1    (3.13)
since the points form a simplex and p is their centroid. The claim follows. Hence,

      kA − πS,k (A)k2F = d(A(k+1) , HS )2 ≥ (1 − ε) (k + 1)2 α 2 = (1 − ε) (k + 1) kA − Ak k2F .            (3.14)
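
The tight example can also be checked numerically (a small NumPy computation of ours, not from the paper): build the regular k-simplex at height α, read kA − Ak k2F off the singular values, and compare it with the error of the best k-subset of rows; the ratio approaches k + 1 as α → 0.

import numpy as np

def tight_example(k, alpha):
    """Rows: vertices of a regular k-simplex on {x_{k+1} = alpha} in R^{k+1},
    centered at p = (0, ..., 0, alpha)."""
    C = np.eye(k + 1) - np.full((k + 1, k + 1), 1.0 / (k + 1))  # centered standard simplex
    U, _, _ = np.linalg.svd(C)
    X = C @ U[:, :k]                       # its vertices written in k coordinates
    return np.hstack([X, np.full((k + 1, 1), alpha)])

def drop_one_error(A, i):
    """||A - pi_{S,k}(A)||_F^2 for the k-subset S omitting row i."""
    S = [j for j in range(A.shape[0]) if j != i]
    Q, _ = np.linalg.qr(A[S].T)
    return np.linalg.norm(A - A @ Q @ Q.T) ** 2

k, alpha = 3, 1e-3
A = tight_example(k, alpha)
sig = np.linalg.svd(A, compute_uv=False)
opt = (sig[k:] ** 2).sum()                 # ||A - A_k||_F^2, here (k+1) * alpha^2
best = min(drop_one_error(A, i) for i in range(k + 1))
print(best / opt)                          # approaches k + 1 = 4 as alpha -> 0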



    Next, we prove Theorem 1.4. This theorem follows by interpreting Theorem 1.3 as an existence
theorem and applying Theorem 2.1.

Proof of Theorem 1.4. By Theorem 1.3, there exists a k-subset S1 of rows of A such that

                                    kA − π S1 (A)k2F ≤ (k + 1)kA − Ak k2F .                                 (3.15)

Now, applying Theorem 2.1 with V = span(S1 ) and s = k(k + 1)/ε we get that a random sample S2 of
the rows of A (according to the specified distribution) satisfies

                             ES2 (kA − π V +span(S2 ),k (A)k2F ) ≤ (1 + ε)kA − Ak k2F                       (3.16)

so there exists a subset of the rows achieving the expectation. Since V + span(S2 ) = span(S1 ∪ S2 ), and
|S1 ∪ S2 | = k + k(k + 1)/ε, we have the desired result.


4    Application: Projective clustering
In this section, we give a polynomial-time approximation scheme for the projective clustering problem
described in Section 1.2. We note that a simple multiplicative (k + 1)-approximation follows from
Theorem 1.3 with running time O(dn^{jk}). Let V1 , . . . ,V j be the optimal subspaces partitioning the point
set into P1 ∪ · · · ∪ Pj , where Pi is the subset of points closest to Vi . Theorem 1.3 tells us that each Pi
contains a subset Si of k points whose span Vi′ is a (k + 1)-approximation, i. e.,

    \sum_{p\in P_i} d(p, V_i')^2 \le (k+1)\sum_{p\in P_i} d(p, V_i)^2 .

We can find the Si ’s by simply enumerating all possible subsets of k points, considering j of them at a
time, and taking the best of these. This leads to the complexity of dn^{jk}.
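
A direct transcription of this enumeration is sketched below (ours; Python/NumPy). It is exponential in jk and only meant for toy inputs: it tries the span of every choice of j k-subsets of points and keeps the cheapest assignment; by Theorem 1.3, some choice achieves cost at most (k + 1) times the optimum.

import itertools
import numpy as np

def clustering_cost(P, subspaces):
    """sum_p min_i d(p, F_i)^2, with each F_i given by an orthonormal basis (d x k)."""
    dists = [np.linalg.norm(P - P @ Q @ Q.T, axis=1) ** 2 for Q in subspaces]
    return np.min(np.vstack(dists), axis=0).sum()

def enumerate_spans(P, j, k):
    """Try the spans of every choice of j k-subsets of points and keep the cheapest."""
    idx_subsets = list(itertools.combinations(range(P.shape[0]), k))
    best_cost, best = np.inf, None
    for choice in itertools.product(idx_subsets, repeat=j):
        subspaces = [np.linalg.qr(P[list(S)].T)[0] for S in choice]
        cost = clustering_cost(P, subspaces)
        if cost < best_cost:
            best_cost, best = cost, subspaces
    return best_cost, best

rng = np.random.default_rng(5)
# toy input: two noisy 1-dimensional clusters in R^3 (j = 2, k = 1)
P = (np.vstack([np.outer(rng.standard_normal(8), [1.0, 0.0, 0.0]),
                np.outer(rng.standard_normal(8), [0.0, 1.0, 1.0])])
     + 0.01 * rng.standard_normal((16, 3)))
print(enumerate_spans(P, j=2, k=1)[0])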


     Getting a PTAS will be a bit more complicated. Theorem 1.4 implies that there exists a set P̂i ⊆ Pi
of size k + k(k + 1)/ε in whose span lies an approximately optimal k-dimensional subspace Wi . We can
enumerate over all combinations of j subsets, each of size k + k(k + 1)/ε to find the P̂i , but we cannot
enumerate the infinitely many k-dimensional subspaces lying in the span of P̂i . One natural approach to
solve this problem would be to put a finite grid down in a unit ball in the span of P̂i . The hope would be
that there are k grid points whose span G is “close” to Wi , since each basis vector for Wi is close to a grid
point. However, this will not work; consider a point p very far from the origin. Although the distance
between a basis vector and a grid point might be small, the error induced by projecting p onto a grid
point is proportional to its distance to the origin, which could be too large.
     The problem described above suggests that a grid construction must be dependent on the point set Pi .
Our grid construction considers grid points in the span of P̂i , but instead of a uniform grid in a unit ball,
we consider grid points at bounded distance from each p ∈ π span(P̂i ) (Pi ), i. e., the points in Pi projected
to the span of P̂i . This avoids the problem of points far from the origin, since there are grid points
around each point. Note that we only put grid points around projected points. This is because we seek
a subspace “close” to Wi , which itself lies in the span of P̂i ; Wi and any subspace lying in the span of P̂i
incur the same error for the component of a point orthogonal to the span of P̂i . In Lemma 4.1, we show
that there exists a subspace spanned by k points in our grid that is not much worse than Wi . The lemma
is stated for a general point set, but we apply it to the projected points in Theorem 1.5.
     The algorithm is given below.

Algorithm Cluster

Input: P ⊆ Rd , error parameter 0 < ε < 1, and upper bound B on the optimal cost.
Output: A set {F1 , . . . , Fj } of j k-dimensional subspaces.
    1. Set δ = \frac{\varepsilon\sqrt{B}}{16\,jk\sqrt{(1+\varepsilon/2)\,n}} and R = \sqrt{(1+\varepsilon/2)\,B} + 2\delta k.


    2. For each subset T of P of size j(k + 2k(k + 1)/ε):

        (a) For each equipartition T = T1 ∪ · · · ∪ T j :
                i. For each i, construct a δ -net Di with radius R for the projection of P to the span
                   of Ti .
               ii. For each way of choosing j subspaces F1 , . . . , Fj , where Fi is the span of k points
                   from Di , compute the cost C(F1 , . . . , Fj ).

    3. Report the subspaces F1 , . . . , Fj of minimum cost C(F1 , . . . , Fj ).

    In Step 2(a)i, we construct a δ -net Di . A δ -net D with radius R for S is a set such that for any point
q for which d(q, p) ≤ R, for some p ∈ S, there exists a g ∈ D such that d(q, g) ≤ δ . The size of a δ -net
is exponential in the dimension of S. This is why it is crucial that we construct the δ -net for P projected
to the span of Ti . By doing so, we reduce the dimension from d to O(k2 /ε). The correctness of the
algorithm relies crucially on the next lemma.


Lemma 4.1. Let δ > 0. Let P be a point set, with |P| = n and W be a subspace of dimension k. Let D
be a δ -net with radius R for P, satisfying

    R \ge \sqrt{\sum_{p\in P} d(p,W)^2} + 2\delta k .

Then there exists a subspace F spanned by k points in D such that:

    \sum_{p\in P} d(p,F)^2 \le \sum_{p\in P} d(p,W)^2 + 4k^2 n\delta^2 + 4k\delta\sum_{p\in P} d(p,W) .    (4.1)

Proof. We construct the subspace F in k steps. Let F0 = W . Inductively, in step i, we choose a point pi
and rotate Fi−1 so that it includes a grid point gi around pi . The subspace resulting from the last rotation,
Fk , is the subspace F with the bound promised by the lemma. To prove that (4.1) holds, we prove the
following inequality for any point p ∈ P going from Fi−1 to Fi
                                               d(p, Fi ) ≤ d(p, Fi−1 ) + 2δ .                                 (4.2)
Summing over the k steps, squaring, and summing over n points, we have the desired result.
    Let G_1 = {0}. G_i will be the span of the grid points {g_1, g_2, . . . , g_{i−1}}. We describe how to construct
the rotation R_i. Let p_i ∈ P maximize ‖π_{G_i^⊥}(π_{F_{i−1}}(p))‖ over p ∈ P, and let g_i ∈ D minimize d(π_{F_{i−1}}(p_i), g_i).
That is, p_i is chosen as the furthest point from the origin in the subspace of F_{i−1} orthogonal to G_i, and g_i is
the grid point closest to the projection of p_i onto F_{i−1}. Consider the plane Z defined by

    \pi_{G_i^\perp}(g_i), \quad \pi_{G_i^\perp}(\pi_{F_{i-1}}(p_i)), \quad \text{and} \quad \vec{0} .

Let θ be the angle between π_{G_i^⊥}(g_i) and π_{G_i^⊥}(π_{F_{i−1}}(p_i)). Let R_i be the rotation in the plane Z by the
angle θ, and define F_i = R_i F_{i−1}. Set G_{i+1} = G_i + span{g_i}. By choosing the rotation dependent on p_i,
we ensure that no point moves more than p_i. This allows us to prove (4.2) for all points p.
    Now we prove inequality (4.2). We do so by proving the following inequality by induction on i for
any point p:
                                       d(π Fi−1 (p), Ri π Fi−1 (p)) ≤ 2δ .                            (4.3)
Note that this proves (4.2) by applying the triangle inequality, since:
                                 d(p, Fi ) ≤ d(p, Fi−1 ) + d(π Fi−1 (p), π Fi (p))                            (4.4)
                                          ≤ d(p, Fi−1 ) + d(π Fi−1 (p), Ri π Fi−1 (p)) .                      (4.5)
The base case of the inequality, i = 1, is trivial. Consider the inductive case; here, we are bounding the
distance between π Fi−1 (p) and Ri π Fi−1 (p). It suffices to bound the distance between these two points in
the subspace orthogonal to Gi , since the rotation Ri is chosen orthogonal to Gi . That is,
    d(\pi_{F_{i-1}}(p), R_i\,\pi_{F_{i-1}}(p)) \le d(\pi_{G_i^\perp}(\pi_{F_{i-1}}(p)), R_i\,\pi_{G_i^\perp}(\pi_{F_{i-1}}(p))) .    (4.6)

Now, consider the distance between a point π_{G_i^⊥}(π_{F_{i−1}}(p)) and its rotation, R_i π_{G_i^⊥}(π_{F_{i−1}}(p)). This
distance is maximized when ‖π_{G_i^⊥}(π_{F_{i−1}}(p))‖ is maximized, so we have, by construction, that the
maximum value is achieved by p_i:

    d(\pi_{G_i^\perp}(\pi_{F_{i-1}}(p)), R_i\,\pi_{G_i^\perp}(\pi_{F_{i-1}}(p))) \le d(\pi_{G_i^\perp}(\pi_{F_{i-1}}(p_i)), R_i\,\pi_{G_i^\perp}(\pi_{F_{i-1}}(p_i))) .    (4.7)




By the triangle inequality we have:

    d(π_{G_i^⊥}(π_{F_{i−1}}(p_i)), R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i)))
        ≤ d(π_{G_i^⊥}(π_{F_{i−1}}(p_i)), π_{G_i^⊥}(g_i)) + d(π_{G_i^⊥}(g_i), R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i))) .

To bound the first term, d(π_{G_i^⊥}(π_{F_{i−1}}(p_i)), π_{G_i^⊥}(g_i)), note that

                    d(π_{G_i^⊥}(π_{F_{i−1}}(p_i)), π_{G_i^⊥}(g_i)) ≤ d(π_{F_{i−1}}(p_i), g_i) .


We show that π_{F_{i−1}}(p_i) is within a ball of radius R around p_i; this implies

                                 d(π_{F_{i−1}}(p_i), g_i) ≤ δ                                   (4.8)

by construction of the δ-net around p_i. We have:

        d(p_i, π_{F_{i−1}}(p_i)) ≤ d(p_i, F_0) + ∑_{j=1}^{i−2} d(π_{F_j}(p_i), π_{F_{j+1}}(p_i))

                                 ≤ √(∑_{p∈P} d(p,W)²) + ∑_{j=1}^{i−2} d(π_{F_j}(p_i), R_{j+1} π_{F_j}(p_i))

                                 ≤ √(∑_{p∈P} d(p,W)²) + 2δk ≤ R .

The third line uses the induction hypothesis.
    Now we bound the second term, d(π_{G_i^⊥}(g_i), R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i))). Note that R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i)) is just a
rescaling of π_{G_i^⊥}(g_i), and that ‖π_{G_i^⊥}(π_{F_{i−1}}(p_i))‖ = ‖R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i))‖, since rotation preserves norms.
The bound on the first term implies that ‖π_{G_i^⊥}(g_i)‖ ≥ ‖π_{G_i^⊥}(π_{F_{i−1}}(p_i))‖ − δ, so

                        d(π_{G_i^⊥}(g_i), R_i π_{G_i^⊥}(π_{F_{i−1}}(p_i))) ≤ δ .                (4.9)

Combining (4.8) and (4.9), we have proved (4.3).

    Now, we are ready to prove Theorem 1.5, which includes the correctness of the algorithm.

Proof of Theorem 1.5. Assume that the optimal solution is of value at most B. Let V_1, . . . , V_j be the
optimal subspaces, and let P_1, . . . , P_j be the partition of P such that P_i is the subset of points closest to
V_i. Theorem 1.4 implies that there exists S_i ⊆ P_i of size at most k + 2k(k + 1)/ε such that there is a
k-dimensional subspace W_i in the span of S_i with

        ∑_{p∈P_i} d(p,W_i)² ≤ (1 + ε/2) ∑_{p∈P_i} d(p,V_i)² ≤ (1 + ε/2) B .                     (4.10)

Consider π_{span(S_i)}(P_i), the projection of P_i onto span(S_i). We want to apply Lemma 4.1 to π_{span(S_i)}(P_i) and
W_i with radius R and δ as in the algorithm. Note that the optimal solution is of value at most B, so we
have:

    √(∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)²) + 2δk ≤ √(∑_{p∈P_i} d(p,W_i)²) + 2δk ≤ √((1 + ε/2) B) + 2δk = R .   (4.11)




Let F_i be the subspace spanned by k points from the δ-net D_i for π_{span(S_i)}(P_i) promised by Lemma 4.1.
For every i, we have:

    ∑_{p∈π_{span(S_i)}(P_i)} d(p,F_i)² ≤ ∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)² + 4k²nδ² + 4kδ ∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)

                                       ≤ ∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)² + (ε/4j) B + 4kδ (n ∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)²)^{1/2}

                                       ≤ ∑_{p∈π_{span(S_i)}(P_i)} d(p,W_i)² + (ε/2j) B .        (4.12)



Now, for any point p ∈ P_i, we can decompose the squared distance from p to F_i as follows:

                  d(p, F_i)² = d(π_{span(S_i)}(p), F_i)² + d(π_{span(S_i)^⊥}(p), F_i)² .

The same decomposition can be done for d(p,W_i)². Now, since F_i and W_i both lie in the span of S_i, we
have, for any point p ∈ P_i, that d(π_{span(S_i)^⊥}(p), F_i)² = d(π_{span(S_i)^⊥}(p), W_i)². Applying this to (4.12)
and (4.10), we have:

                  ∑_{p∈P_i} d(p, F_i)² ≤ (1 + ε/2) ∑_{p∈P_i} d(p,V_i)² + (ε/2j) B .
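Spelling out how the decomposition combines with (4.12) and (4.10) (writing S for span(S_i)):

    ∑_{p∈P_i} d(p,F_i)² = ∑_{p∈P_i} d(π_S(p),F_i)² + ∑_{p∈P_i} d(π_{S^⊥}(p),F_i)²
                        ≤ ∑_{p∈P_i} d(π_S(p),W_i)² + (ε/2j) B + ∑_{p∈P_i} d(π_{S^⊥}(p),W_i)²      (by (4.12) and the equality above)
                        = ∑_{p∈P_i} d(p,W_i)² + (ε/2j) B
                        ≤ (1 + ε/2) ∑_{p∈P_i} d(p,V_i)² + (ε/2j) B                                 (by (4.10)).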

Let S = ⋃_i S_i. The algorithm will enumerate S together with its partition S_1 ∪ · · · ∪ S_j in Step 2a. In
Step 2(a)i, the algorithm will, for each i, construct a δ-net D_i for π_{span(S_i)}(P). Lastly, in Step 2(a)ii it will
consider the subspaces F_1, . . . , F_j whose existence is proven above. The cost associated with this solution is:

    C(F_1, . . . , F_j) ≤ ∑_{i=1}^{j} ∑_{p∈P_i} d(p, F_i)² ≤ (1 + ε/2) ∑_{i=1}^{j} ∑_{p∈P_i} d(p,V_i)² + (ε/2) B ≤ (1 + ε) B .
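For concreteness, evaluating the cost of one candidate family of subspaces, each given by an orthonormal
basis, is the routine computation whose O(nd jk) cost is accounted for below; a minimal numpy sketch (the
function name and interface are ours, not the paper's):

    import numpy as np

    def clustering_cost(P, bases):
        """C(F_1, ..., F_j) = sum over p of min_i d(p, F_i)^2, where P is an
        n x d array of points and each element of `bases` is a d x k matrix
        with orthonormal columns spanning the subspace F_i."""
        per_subspace = []
        for B in bases:
            residual = P - P @ B @ B.T              # component of each point orthogonal to F_i
            per_subspace.append(np.sum(residual ** 2, axis=1))
        return float(np.sum(np.min(np.vstack(per_subspace), axis=0)))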

    The number of subsets of size k + 2k(k + 1)/ε enumerated by the algorithm is at most \binom{n}{2(k+1)²/ε}^j.
A δ-net D with radius R for a point set Q of dimension d is implemented by putting a box with side
length 2R of grid width δ/√d around each point in Q. Let X be the set of grid points in the box around
a point p. The number of subspaces spanned by k points of each δ-net D_i is therefore at most (n|X|)^k, so
the number of j-tuples of subspaces that one can choose for a partition T_1 ∪ · · · ∪ T_j is (n|X|)^{jk}. The
computation for projecting points, finding a basis, and determining the cost of a candidate family of
subspaces takes time O(nd jk). The cardinality of X is (for ε < 1):

        |X| = (2R / (δ/√(2(k+1)²/ε)))^{2(k+1)²/ε} ≤ O(jk √(n/ε))^{2(k+1)²/ε} .                  (4.13)

Therefore, the running time of the algorithm is at most O(nd jk) · n^{2j(k+1)²/ε} · (n|X|)^{jk} = d (n/ε)^{O(jk³/ε)} .
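To make the grid construction concrete, here is a minimal numpy sketch of such a δ-net (an illustrative helper
of ours; as the bound on |X| above makes clear, it is only feasible when the dimension of Q is small):

    import numpy as np
    from itertools import product

    def delta_net(Q, R, delta):
        """Grid points of a delta-net with radius R for the rows of Q (n x d):
        around each point, an axis-aligned box of side at least 2R filled with
        a grid of width delta / sqrt(d)."""
        n, d = Q.shape
        width = delta / np.sqrt(d)
        steps = int(np.ceil(R / width))                      # grid steps from the center to the box boundary
        axis = np.arange(-steps, steps + 1) * width          # 1-D grid coordinates
        offsets = np.array(list(product(axis, repeat=d)))    # (2*steps + 1)^d grid points per box
        return np.vstack([q + offsets for q in Q])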



5    Conclusions and subsequent work
Theorem 1.4 was further improved in [13] to show that for any real matrix, there exist O(k/ε + k log k)
rows whose span contains the rows of a multiplicative (1 + ε)-approximation to the best rank-k matrix.
Using this subset of O(k/ε + k log k) rows, the exponent in the running time of the projective clustering
algorithm decreases from O( jk3 /ε) to O( jk2 /ε). It would be interesting to know if we can compute this
set of O(k/ε + k log k) rows efficiently in a small number of passes. It would also be nice to see other
applications of volume sampling.
    In a recent result, Sarlós [34] proved that, using random linear combinations of rows instead of a
subset of rows, one can compute a multiplicative (1 + ε)-approximation using only two passes.


References
 [1] * D. Achlioptas and F. McSherry: Fast computation of low rank matrix approximations. In
     Proc. 33rd STOC, pp. 611–618. ACM Press, 2001. [STOC:380752.380858]. 1.1, 1.3

 [2] * P. Agarwal, S. Har-Peled, and K. Varadarajan: Geometric approximations via core-sets.
     Combinatorial and Computational Geometry – MSRI Publications, 52:1–30, 2005. 1.2, 1.3

 [3] * P. Agarwal and N. Mustafa: k-means projective clustering. In Proc. Principles of Database
     Systems (PODS’04), pp. 155–165. ACM Press, 2004. [PODS:1055558.1055581]. 1.1, 1.3

 [4] * P. Agarwal, C. Procopiuc, and K. Varadarajan: Approximation algorithms for a k-line
     center. Algorithmica, 42:221–230, 2005. [Algorithmica:p321725768012505]. 1.3

 [5] * R. Agarwal, J. Gehrke, D. Gunopulos, and P. Raghavan: Automatic subspace clustering of
     high dimensional data for data mining applications. Data Mining and Knowledge Discovery,
     11:5–33, 2005. [Springer:m18077t9134v55n2]. 1.1, 1.3

 [6] * C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park: Fast algorithms for projected
     clustering. In Proc. Conf. on Management of Data (SIGMOD’99), pp. 61–72. ACM Press, 1999.
     [SIGMOD:304182.304188]. 1.1, 1.3

 [7] * N. Alon, Y. Matias, and M. Szegedy: The space complexity of approximating the frequency
     moments. J. Computer and System Sciences, 58:137–147, 1999. [JCSS:10.1006/jcss.1997.1545]. 1.3

 [8] * M. Artin: Algebra. Prentice-Hall, 1991. 3

 [9] * Z. Bar-Yossef: Sampling lower bounds via information theory. In Proc. 35th STOC, pp.
     335–344. ACM Press, 2003. [STOC:780542.780593]. 1.3

[10] * M. Bădoiu, S. Har-Peled, and P. Indyk: Approximate clustering via core-sets. In Proc. 34th
     STOC, pp. 250–257. ACM Press, 2002. [STOC:509907.509947]. 1.3


[11] * W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani: Approximation schemes
     for clustering problems. In Proc. 35th STOC, pp. 50–58. ACM Press, 2003.
     [STOC:780542.780550]. 1.3

[12] * A. Deshpande, L. Rademacher, S. Vempala, and G. Wang: Matrix approximation and
     projective clustering via volume sampling. In Proc. 17th ACM-SIAM Symp. on Discrete Algorithms
     (SODA’06), pp. 1117–1126. SIAM, 2006. [SODA:1109557.1109681]. 1.4

[13] * A. Deshpande and S. Vempala: Adaptive sampling and fast low-rank matrix approximation.
     In Proc. 10th Internat. Workshop on Randomization and Computation (RANDOM’06), pp. 292–303.
     Springer, 2006. 5

[14] * P. Drineas and R. Kannan: Pass efficient algorithms for approximating large matrices.
     In Proc. 14th ACM-SIAM Symp. on Discrete Algorithms (SODA’03), pp. 223–232. SIAM, 2003.
     [SODA:644108.644147]. 1.1, 1.2, 1.3

[15] * P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay: Clustering large graphs via
     the singular value decomposition. Machine Learning, 56:9–33, 2004. [ML:u424k6nn6k622788].
     1.1, 1.1, 1.2, 1.3, 1.3

[16] * P. Drineas, R. Kannan, and M. Mahoney: Fast Monte Carlo algorithms for matrices II:
     Computing a low-rank approximation to a matrix. SIAM J. on Computing, 36:158–183, 2006.
     [SICOMP:10.1137/S0097539704442696]. 1.1

[17] * M. Effros and L. J. Schulman: Rapid near-optimal VQ design with a deterministic data net.
     In Proc. Int. Symp. on Information Theory, p. 298. IEEE, 2004. [ISIT:10.1109/ISIT.2004.1365336].
     1.3

[18] * J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang: On graph problems
     in a semi-streaming model. Theoretical Computer Science, 348:207–216, 2005.
     [TCS:10.1016/j.tcs.2005.09.013]. 1.3

[19] * A. Frieze, R. Kannan, and S. Vempala: Fast Monte Carlo algorithms for finding low-rank
     approximations. Journal of the ACM, 51:1025–1041, 2004. [JACM:1039488.1039494]. 1.1, 1.1,
     1, 1.2, 1.3, 2.1

[20] * G. H. Golub and C. F. Van Loan: Matrix Computations. Johns Hopkins, third edition, 1996.
     1.3

[21] * S. A. Goreinov and E. E. Tyrtyshnikov: The maximal-volume concept in approximation by
     low-rank matrices. Contemporary Mathematics, 280:47–51, 2001. 1.3

[22] * S. Guha, N. Koudas, and K. Shim: Approximation and streaming algorithms for histogram
     construction problems. ACM Transactions on Database Systems, 31:396–438, 2006.
     [TODS:1132863.1132873]. 1.3


[23] * S. Har-Peled and S. Mazumdar: On coresets for k-means and k-median clustering. In Proc.
     36th STOC, pp. 291–300. ACM Press, 2004. [STOC:1007352.1007400]. 1.3

[24] * S. Har-Peled and K. Varadarajan: Projective clustering in high dimensions using core-sets.
     In Proc. 18th Symp. on Comp. Geometry (SOCG), pp. 312–318. ACM Press, 2002.
     [SOCG:513400.513440]. 1.2, 1.3

[25] * M. Henzinger, P. Raghavan, and S. Rajagopalan: Computing on data streams. External
     Memory Algorithms – DIMACS Series in Discrete Mathematics and Computer Science, 50:107–118,
     1999. 1.3

[26] * D. Kempe and F. McSherry: A decentralized algorithm for spectral analysis. In Proc. 36th
     STOC, pp. 561–568. ACM Press, 2004. [STOC:1007352.1007438]. 1.3

[27] * J. Kuczyński and H. Woźniakowski: Estimating the largest eigenvalues by the power and
     Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13:1094–1122, 1992.
     [SIMAX:10.1137/0613066]. 1.3

[28] * A. Kumar, Y. Sabharwal, and S. Sen: A simple linear time (1 + ε)-approximation algorithm
     for k-means clustering in any dimensions. In Proc. 45th FOCS, pp. 454–462. IEEE Computer
     Society Press, 2004. [FOCS:10.1109/FOCS.2004.7]. 1.3

[29] * J. Matoušek: On approximate geometric k-clustering. Discrete and Computational Geometry,
     24:61–84, 2000. [Springer:xawxbau59gtjtvv6]. 1.3

[30] * N. Megiddo and A. Tamir: On the complexity of locating linear facilities in the plane.
     Operations Research Letters, 13:194–197, 1982. [Elsevier:10.1016/0167-6377(82)90039-6]. 1.1

[31] * R. Ostrovsky and Y. Rabani: Polynomial time approximation schemes for geometric min-sum
     median clustering. Journal of the ACM, 49:139–156, 2002. [JACM:506147.506149]. 1.3

[32] * C. Procopiuc, P. Agarwal, T. Murali, and M. Jones: A Monte Carlo algorithm for fast
     projective clustering. In Proc. Conf. on Management of Data (SIGMOD’02), pp. 418–427. ACM
     Press, 2002. [SIGMOD:564691.564739]. 1.1, 1.3

[33] * L. Rademacher, S. Vempala, and G. Wang: Matrix approximation and projective clustering.
     Technical Report 2005–018, MIT CSAIL, 2005. 1.4

[34] * T. Sarlós: Improved approximation algorithms for large matrices via random projection. In
     Proc. 47th FOCS, pp. 143–152. IEEE Computer Society Press, 2006. 5


AUTHORS

   Amit Deshpande
   2-342
   77 Mass. Ave.,
   Cambridge, MA 02139, USA.
   amitd mit edu
   http://www.mit.edu/~amitd/


   Luis Rademacher
   2-331
   77 Mass. Ave.
   Cambridge, MA 02139, USA.
   lrademac math mit edu
   http://www-math.mit.edu/~lrademac/


   Santosh Vempala
   College of Computing
   Georgia Institute of Technology
   Atlanta, GA 30332, USA.
   vempala cc gatech edu
   http://www-math.mit.edu/~vempala/


   Grant Wang
   Yahoo!
   701 First Avenue
   Sunnyvale, CA 94089, USA.
   gjw alum mit edu
   http://theory.csail.mit.edu/~gjw/


ABOUT THE AUTHORS

   Amit Deshpande is a graduate student in Applied Math at MIT. He did his undergraduate
      studies at Chennai Mathematical Institute (CMI) in India. Apart from math and theory,
      he enjoys north Indian classical music.





Luis Rademacher is a Ph. D. candidate in the Department of Mathematics at MIT, su-
    pervised by Santosh Vempala. His research interests include game theory, matrix ap-
    proximation, computational lower bounds, and the intersection between geometry and
    algorithms. He grew up in Chile and enjoys music as a hobby.


Santosh Vempala is a professor in the College of Computing and director of the newly
   formed Algorithms and Randomness Center at Georgia Tech. His research interests,
   ironically, are in algorithms, randomness, and geometry. He graduated from CMU in
   1997 following the advice of Avrim Blum and was at MIT till 2006 except for a year as
   a Miller fellow at UC Berkeley. He gets unreasonably excited when a phenomenon that
   appears complex from one perspective turns out to be simple from another.


Grant Wang graduated from MIT in August 2006 with a Ph. D. in computer science. His
   advisor was Santosh Vempala. He graduated from Cornell University with a B. S. in
   Computer Science in 2001. His research interests are in algorithms, machine learning,
   and data mining. As of September 2006, he is working at Yahoo!



