Theory of Computing, Volume 16 (15), 2020, pp. 1–25
www.theoryofcomputing.org
Special Issue: APPROX-RANDOM 2016
Online Row Sampling
Michael B. Cohen*   Cameron Musco†   Jakub Pachocki‡
Received April 14, 2017; Revised April 10, 2020; Published December 11, 2020
Abstract. Finding a small spectral approximation for a tall n × d matrix A is a fundamental
numerical primitive. For a number of reasons, one often seeks an approximation whose rows
are sampled from those of A. Row sampling improves interpretability, saves space when A
is sparse, and preserves structure, which is important, e.g., when A represents a graph.
However, correctly sampling rows from A can be costly when the matrix is large and
cannot be stored and processed in memory. Hence, a number of recent publications focus
on row sampling in the streaming setting, using little more space than what is required to
store the returned approximation (Kelner–Levin, Theory Comput. Sys. 2013, Kapralov et al.,
SIAM J. Comp. 2017).
Inspired by a growing body of work on online algorithms for machine learning and
data analysis, we extend this work to a more restrictive online setting: we read rows of
A one by one and immediately decide whether each row should be kept in the spectral
approximation or discarded, without ever retracting these decisions. We present an extremely
simple algorithm that approximates A up to multiplicative error 1 + ε and additive error δ
A preliminary version of this paper appeared in the Proceedings of the 19th International Workshop on Approximation
Algorithms for Combinatorial Optimization Problems (APPROX 2016).
* Work supported in part by NSF grant CCF-1111109.
† Work supported by NSF Graduate Research Fellowship No. 1122374, AFOSR grant FA9550-13-1-0042 and the NSF
Center for Science of Information.
‡ Work supported by NSF grant CCF-1065106.
ACM Classification: F.2.1, F.1.2, G.2.2
AMS Classification: 68W25, 68W20, 68R10
Key words and phrases: spectral sparsification, leverage scores, online algorithms, matrix approximation
© 2020 Michael B. Cohen, Cameron Musco, and Jakub Pachocki
Licensed under a Creative Commons Attribution License (CC-BY). DOI: 10.4086/toc.2020.v016a015
using O(d log d log(ε||A||_2^2/δ)/ε^2) online samples, with memory overhead proportional to
the cost of storing the spectral approximation. We also present an algorithm that uses O(d^2)
memory but only requires O(d log(ε||A||_2^2/δ)/ε^2) samples, which we show is optimal.
Our methods are clean and intuitive, allow for lower memory usage than prior work, and
expose new theoretical properties of leverage score based matrix approximation.
1 Introduction
1.1 Background
A spectral approximation to a tall n × d matrix A is a smaller, typically Õ(d) × d matrix Ã such that
||Ãx||_2 ≈ ||Ax||_2 for all x. Typically one asks for a multiplicative approximation, which guarantees that
(1 − ε)||Ax||_2^2 ≤ ||Ãx||_2^2 ≤ (1 + ε)||Ax||_2^2. In other notation,
(1 − ε)A^T A ⪯ Ã^T Ã ⪯ (1 + ε)A^T A.
Such approximations have many applications, most notably for solving least squares regression over
A [9, 11]. If A is the vertex-edge incidence matrix of a graph, Ã is a spectral sparsifier [26]. It can be
used to approximate effective resistances, spectral clustering, mixing time and random walk properties,
and many other computations.
A number of recent papers focus on fast algorithms for spectral approximation. Using sparse random
subspace embeddings [9, 23, 22], it is possible to find Ã in input sparsity time—i.e., running time scaling
linearly in the number of nonzero entries in A. These methods produce à by randomly recombining
the rows of A into a smaller number of rows. In some cases these embeddings are not enough, as it is
desirable for the rows of à to be a subset of the rows of A. If A is sparse, this ensures that à is also sparse.
If A represents a graph, it ensures that à is also a graph, specifically a weighted subgraph of the original.
It is well known that sampling O(d log d/ε^2) rows of A with probabilities proportional to their
leverage scores yields a (1 ± ε)-factor spectral approximation to A. Further, this sampling can be done in
input sparsity time, either using subspace embeddings to approximate leverage scores, or using iterative
sampling techniques [20], some of which only work with subsampled versions of the original matrix [11].
1.2 Streaming and online row sampling
When A is very large, input sparsity running times are not enough—memory restrictions also become
important. Hence, recent work has tackled row sampling in a streaming model of computation. [16]
presents a simple algorithm for sampling rows from an insertion-only stream, using space approximately
proportional to the size of the final approximation. [15] gives a sparse-recovery based algorithm that works
in dynamic streams with row insertions and deletions, also using nearly optimal space. Unfortunately, to
handle dynamic streams, the algorithm in [15] is complex, requires additional restrictions on the input
matrix, and uses significantly suboptimal running time to recover a spectral approximation from its low
memory representation of the input stream.
While the algorithm in [16] is simple and efficient, we believe that its proof is incomplete, and do
not see an obvious way to fix it. The main idea behind the algorithm is to sample rows by their leverage
scores with respect to the stream seen so far. These leverage scores may be coarse overestimates of the
true scores. However as more rows are streamed in, better estimates can be obtained and the sampled
rows pruned to a smaller set. Unfortunately, the probability of sampling a row becomes dependent on
which other rows are sampled. This seems to break the argument in that paper, which essentially claims
that their process has the same distribution as would a single round of leverage score sampling.1
In this paper we initiate the study of row sampling in an online setting. As in an insertion stream,
we read rows of A one by one. However, upon seeing a row, we immediately decide whether it should
be kept in the spectral approximation or discarded, without ever retracting these decisions. We present
a similar algorithm to [16], however, since we never prune previously sampled rows, the probability
of sampling a row only depends on whether previous rows in the stream were sampled. This limited
dependency structure allows us to rigorously argue that a spectral approximation is obtained.
In addition to addressing gaps in the literature on streaming spectral approximation, our restricted
model extends work on online algorithms for a variety of other machine learning and data analysis
problems, including principal component analysis [4], clustering [21], classification [3, 14], and regression
[14]. In practice, online algorithms are beneficial since they can be highly computationally and memory
efficient. Further, they can be applied in scenarios in which data is produced in a continuous stream
and intermediate results must be output as the stream is processed. Spectral approximation is a widely
applicable primitive for approximate learning and computation, so studying its implementation in an
online setting is a natural direction. Since the initial publication of this work, online row sampling
methods have found applications in kernel matrix approximation [7, 8] and sliding window algorithms
for streaming matrix approximation [6].
1.3 Our results
Our primary contribution is a very simple algorithm for leverage score sampling in an online manner. The
main difficulty with row sampling using leverage scores is that the leverage scores themselves are not easy
to compute. They are given by l_i = a_i^T (A^T A)^{-1} a_i, and so require solving systems in A^T A if computed
naively. This is not only expensive, but also impossible in an online setting, where we do not have access
to all of A.
A critical observation is that it always suffices to sample rows by overestimates of their true leverage
scores. The number of rows that must be sampled is proportional to the sum of these overestimates. Since
the leverage score of a row can only go up when we remove rows from the matrix, a simple way to obtain
an overestimate is to compute the leverage score using just a subset of the other rows of A. That is, letting A_j
contain just j of the n rows of A, we can overestimate l_i by l̃_i = a_i^T (A_j^T A_j)^{-1} a_i.
[11] shows that if A_j is a subset of rows sampled uniformly at random, then the expected leverage
score of a_i is d/j. This simple fact immediately gives a result for online sampling from a randomly
ordered stream. If we compute the leverage score of the current row a_i against all previously seen
rows (or some approximation to these rows), then the expected sum of our overestimates is bounded by
d + d/2 + · · · + d/n = O(d log n). So, sampling O(d log d log n/ε^2) rows is enough to obtain a (1 + ε)
multiplicative-error spectral approximation.
1 Since the initial publication of this work, the independence issue in [16] has been resolved by [18], which presents a similar
algorithm for insertion-only streams that admits a correct proof.
What if we cannot guarantee a randomly ordered input stream? Is there any hope of being able to
compute good leverage score estimates in an online manner? Surprisingly the answer to this is yes—we
can in fact run nearly the exact same algorithm and be guaranteed that the sum of estimated leverage
scores is low, regardless of stream order. Roughly, each time we receive a row which has high leverage
score with respect to the previous rows, it must compose a significant part of the spectrum of A. If A
does not continue to grow unboundedly, there simply cannot be too many of these significant rows.
Specifically, we show that if we sample by the ridge leverage scores [1] over all previously seen
rows, which are the leverage scores computed over A_i^T A_i + λI for some small regularizing factor λ, then
with just O(d log d log(ε||A||_2^2/δ)/ε^2) samples we obtain a (1 + ε) multiplicative-error, δ additive-error
spectral approximation. That is, with high probability we sample a matrix Ã with
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI.
To gain intuition behind this bound, note that we can convert it into a multiplicative one by setting
δ = ε·σ_min(A)^2, where σ_min(A) is the minimum singular value of A (as long as we have some estimate
of σ_min(A)). This setting of δ will require taking O(d log d log(κ(A))/ε^2) samples, where κ(A) =
σ_max(A)/σ_min(A) is the condition number of A. If we have a polynomial bound on this condition
number, as we do, for instance, for graphs with polynomially bounded edge weights, this becomes
O(d log^2 d/ε^2)—nearly matching the O(d log d/ε^2) achievable if sampling by true leverage scores.
Our online sampling algorithm is extremely simple. When each row comes in, we compute its online
ridge leverage score, or an estimate of it, and then irrevocably either add the row to our approximation
or discard it. As mentioned, it is similar in form to the streaming algorithm of [16], except that
it does not require pruning previously sampled rows. This allows us to avoid difficult dependency
issues. Additionally, without pruning, we do not even need to store all previously sampled rows. As
long as we store a constant-factor spectral approximation to our previous samples, we can compute good
approximations to the online ridge leverage scores. In this way, we can store just O(d log d log(ε||A||_2^2/δ))
rows in working memory (O(d log^2 d) if we want a spectral graph sparsifier), filtering our input stream
into an O(d log d log(κ(A))/ε^2)-size output stream. Note that this memory bound in fact improves as ε
decreases, and regardless, can be significantly smaller than the output size of the algorithm.
In addition to our main sampling result, we use our bounds on online ridge leverage score
approximations to show that an algorithm in the style of [2] allows us to remove a log d factor and sample just
O(d log(ε||A||_2^2/δ)/ε^2) rows (Theorem 4.1). This algorithm is more complex and can require O(d^2)
working memory. However, in Theorem 5.1 we show that it is asymptotically optimal. The log(ε||A||_2^2/δ)
factor is not an artifact of our analysis, but is truly the cost of restricting ourselves to online sampling.
No algorithm can obtain a (1 + ε) multiplicative-error, δ additive-error spectral approximation taking
fewer than Ω(d log(ε||A||_2^2/δ)/ε^2) rows in an online manner.
2 Overview
Let A be an n × d matrix with rows a1 , . . . , an . A natural approach to row sampling from A is picking an a
priori probability with which each row is kept, and then deciding whether to keep each row independently.
A common choice is for the sampling probabilities to be proportional to the leverage scores of the rows.
The leverage score of the i-th row of A is defined to be
a_i^T (A^T A)^† a_i,
where the dagger symbol denotes the pseudoinverse. In this work, we will be interested in approximating
A^T A with some (very) small multiple of the identity added. Hence, we will be interested in the λ-ridge
leverage scores [1]:
a_i^T (A^T A + λI)^{-1} a_i,
for a parameter λ > 0.
In many applications, obtaining the (nearly) exact values of a_i^T (A^T A + λI)^{-1} a_i for sampling is
difficult or outright impossible. A key idea is that as long as we have a sequence l_1, ..., l_n of overestimates
of the λ-ridge leverage scores, that is, for i = 1, ..., n,
l_i ≥ a_i^T (A^T A + λI)^{-1} a_i,
we can sample by these overestimates and obtain rigorous guarantees on the quality of the obtained
spectral approximation. This notion is formalized in Theorem 2.1.
Theorem 2.1. Let A be an n × d matrix with rows a_1, ..., a_n. Let ε ∈ (0, 1), δ > 0, λ := δ/ε, c :=
8 log d/ε^2. Assume we are given l_1, ..., l_n such that for all i = 1, ..., n,
l_i ≥ a_i^T (A^T A + λI)^{-1} a_i.
For i = 1, ..., n, let p_i := min(c·l_i, 1). Construct Ã by independently sampling each row a_i of A with
probability p_i, and rescaling it by 1/√p_i if it is included in the sample. Then, with high probability,
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI,
and the number of rows in Ã is O((∑_{i=1}^n l_i) log d/ε^2).
Proof. This sort of guarantee for leverage score sampling is well known. See for example Lemma 4 of
[11]. If we sampled both the rows of A and the rows of √λ·I with the leverage scores over (A^T A + λI),
we would have (1 − ε)(A^T A + λI) ⪯ Ã^T Ã ⪯ (1 + ε)(A^T A + λI). However, we do not sample the rows
of the identity. Since we could have sampled them each with probability 1, we can simply subtract
λI = (δ/ε)I from the multiplicative bound and have (1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI.
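For concreteness, the sampling scheme of Theorem 2.1 can be sketched in NumPy as follows. This is only an illustration, using the exact λ-ridge leverage scores as their own overestimates; the function and variable names are ours, not the paper's.

```python
import numpy as np

def ridge_leverage_sample(A, eps, delta, rng=None):
    """Sketch of the scheme of Theorem 2.1: keep row i independently
    with probability p_i = min(c*l_i, 1), where l_i overestimates the
    lambda-ridge leverage score, rescaling kept rows by 1/sqrt(p_i)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    lam = delta / eps                     # lambda := delta / eps
    c = 8.0 * np.log(d) / eps**2          # c := 8 log d / eps^2
    Minv = np.linalg.inv(A.T @ A + lam * np.eye(d))
    # exact lambda-ridge scores, used here as their own overestimates
    scores = np.einsum('ij,jk,ik->i', A, Minv, A)
    probs = np.minimum(c * scores, 1.0)
    keep = rng.random(n) < probs
    return A[keep] / np.sqrt(probs[keep])[:, None]
```

Note that when c·l_i ≥ 1 the row is kept deterministically and left unscaled, so on well-conditioned inputs with generous δ the procedure degenerates to keeping every row.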
The idea of using overestimates of leverage scores to perform row sampling has been applied
successfully to various problems (see, e.g., [17, 11]). However, in these applications, access to the entire
matrix is required beforehand. In the streaming and online settings, we have to rely on partial data to
approximate the true leverage scores. The most natural idea is to just use the portion of the matrix seen
thus far as an approximation to A. This leads us to introduce the online λ-ridge leverage scores:
l_i := min(a_i^T (A_{i-1}^T A_{i-1} + λI)^{-1} a_i, 1),
where A_i (i = 0, ..., n) is defined as the matrix consisting of the first i rows of A.²
Since clearly A_i^T A_i ⪯ A^T A for all i, it is not hard to see that l_i does overestimate the true λ-ridge
leverage score for row a_i. A more complex question, however, is establishing an upper bound on ∑_{i=1}^n l_i
so that we can bound the number of samples needed by Theorem 2.1.
A core result of this work, stated in Theorem 2.2, is establishing such an upper bound; in fact, this
bound is shown to be tight up to constants (Theorem 5.1) and is nearly linear in most cases.
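The online scores themselves are straightforward to compute exactly in one pass with O(d^2) working memory; a minimal sketch (names ours):

```python
import numpy as np

def online_ridge_scores(A, lam):
    """Online lambda-ridge leverage scores: the score of row i is taken
    against only the rows seen before it,
    l_i = min(a_i^T (A_{i-1}^T A_{i-1} + lam*I)^{-1} a_i, 1)."""
    n, d = A.shape
    M = lam * np.eye(d)        # running A_{i-1}^T A_{i-1} + lam*I
    scores = np.empty(n)
    for i, a in enumerate(A):
        scores[i] = min(a @ np.linalg.solve(M, a), 1.0)
        M += np.outer(a, a)    # fold row i in before scoring row i+1
    return scores
```

Both key properties hold deterministically for the values this computes: each l_i dominates the corresponding λ-ridge leverage score of the full matrix, and ∑ l_i ≤ 2d log(1 + ||A||_2^2/λ), the bound of Theorem 2.2.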
2 We use the proposed scores l_i for simplicity; however, note that the following, perhaps more natural, definition of online
leverage scores would also be effective: l'_i := a_i^T (A_i^T A_i + λI)^{-1} a_i.
Theorem 2.2. Let A be an n × d matrix with rows a_1, ..., a_n. Let A_i for i ∈ {0, ..., n} be the matrix
consisting of the first i rows of A. For λ > 0, let
l_i := min(a_i^T (A_{i-1}^T A_{i-1} + λI)^{-1} a_i, 1)
be the online λ-ridge leverage score of the i-th row of A. Then
∑_{i=1}^n l_i = O(d log(||A||_2^2/λ)).
Theorems 2.1 and 2.2 suggest a simple algorithm for online row sampling: simply use the online λ-
ridge leverage scores, for λ := δ/ε. This gives a spectral approximation with O(d log d log(ε||A||_2^2/δ)/ε^2)
rows. Unfortunately, computing each l_i exactly requires us to store all the rows we have seen in memory
(or alternatively to store the sum of their outer products, A_i^T A_i). In many cases, such a requirement would
defeat the purpose of streaming row sampling.
A natural idea is to use the sample we have kept thus far as an approximation to A_i when computing
l_i. It turns out that the approximate online ridge leverage scores l̃_i computed in this way will not always
be good approximations to l_i; however, we can still prove that they satisfy the requisite bounds and yield
the same row sample size! We formalize these results in the algorithm ONLINE-SAMPLE (Figure 1) and
Theorem 2.3.
à = O NLINE -S AMPLE(A, ε, δ ), where A is an n × d matrix with rows a1 , . . . , an ,
ε ∈ (0, 1), δ > 0.
1. Set λ := δ /ε, c := 8 log d/ε 2 .
2. Let Ã0 be a 0 × d matrix.
3. For i = 1, . . . , n:
(a) Let l˜i := min((1 + ε)aTi (ÃTi−1 Ãi−1 + λ I)−1 ai , 1).
(b) Let pi := min(cl˜i , 1).
" #
Ãi−1
√ with probability pi ,
(c) Set Ãi := ai / pi
Ãi−1 otherwise.
4. Return à := Ãn .
Figure 1: The basic online sampling algorithm
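The pseudocode above translates almost line for line into NumPy. The sketch below is only an illustration (names ours); it maintains M = Ã_{i-1}^T Ã_{i-1} + λI explicitly, i.e., O(d^2) numbers, rather than the stored rows themselves.

```python
import numpy as np

def online_sample(A, eps, delta, rng=None):
    """Sketch of ONLINE-SAMPLE (Figure 1): each arriving row is kept or
    discarded irrevocably, with probability driven by its approximate
    online ridge leverage score."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    lam = delta / eps                    # lambda := delta / eps
    c = 8.0 * np.log(d) / eps**2         # c := 8 log d / eps^2
    M = lam * np.eye(d)                  # Atil_{i-1}^T Atil_{i-1} + lam*I
    kept = []
    for a in A:
        l = min((1.0 + eps) * (a @ np.linalg.solve(M, a)), 1.0)
        p = min(c * l, 1.0)
        if rng.random() < p:             # keep, rescaled by 1/sqrt(p)
            row = a / np.sqrt(p)
            kept.append(row)
            M += np.outer(row, row)
    return np.array(kept)
```

The decision for row i depends only on the coin flips for rows 1, ..., i−1, which is exactly the limited dependency structure exploited by the martingale analysis in Section 3.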
Theorem 2.3. Let Ã be the matrix returned by ONLINE-SAMPLE(A, ε, δ). With high probability,
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI,
and the number of rows in Ã is O(d log d log(ε||A||_2^2/δ)/ε^2).
To save computation, we note that, with a small modification, we can run ONLINE-SAMPLE with
batch processing of rows. Specifically, say we start from the i-th position in the stream. We can store
the next b = O(d) rows. We can then compute sampling probabilities for these rows all at once using a
system solver for (Ã_{i+b}^T Ã_{i+b} + λI). Using a trick introduced in [25], by applying a Johnson–Lindenstrauss
random projection to the rows whose scores we are computing, we need just O(log(1/γ)) system solves to
compute constant-factor approximations to the ridge scores with probability 1 − γ. If we set γ = 1/poly(n),
then we can union bound over our whole stream, using this trick with each batch of O(d) input rows. The
batch probabilities will only be closer to the true ridge leverage scores than the non-batch probabilities,
and we will enjoy the same guarantees as ONLINE-SAMPLE.
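A sketch of this Johnson–Lindenstrauss batching idea follows; the interface, the sketch size k, and all names are illustrative choices of ours, not the paper's. It uses the identity a^T M^{-1} a = ||C M^{-1} a||^2 for M = C^T C with C = [Ã; √λ·I], so that sketching C with a k-row Gaussian reduces a whole batch to k system solves in M.

```python
import numpy as np

def batch_ridge_scores_jl(Atil, batch, lam, k=200, rng=None):
    """Approximate ridge leverage scores a^T M^{-1} a for each row of
    `batch`, where M = Atil^T Atil + lam*I, via a k-row Gaussian sketch
    of C = [Atil; sqrt(lam)*I] (only k solves in M are needed)."""
    rng = np.random.default_rng(rng)
    m, d = Atil.shape
    M = Atil.T @ Atil + lam * np.eye(d)
    C = np.vstack([Atil, np.sqrt(lam) * np.eye(d)])
    G = rng.standard_normal((k, m + d)) / np.sqrt(k)   # JL sketch of C
    S = np.linalg.solve(M, (G @ C).T).T                # S = G C M^{-1}
    return np.sum((batch @ S.T) ** 2, axis=1)          # ||S a||^2 per row
```

Each estimate is an average of k squared Gaussian projections, so it concentrates to a constant-factor approximation once k = O(log(1/γ)); in a real streaming implementation the dense inverse would of course be replaced by a fast solver for M.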
Additionally, it turns out that with a simple trick, it is possible to reduce the memory usage of the
algorithm by a factor of ε^{-2}, bringing it down to O(d log d log(ε||A||_2^2/δ)) (assuming the row sample
is output to an output stream). Note that this expression gets smaller with ε; hence we obtain a row
sampling algorithm with memory complexity independent of the desired multiplicative precision. The basic
idea is that, instead of keeping all previously sampled rows in memory, we store a smaller set of rows
that give a constant-factor spectral approximation, still enough to give good estimates of the online ridge
leverage scores.
This result is presented in the algorithm SLIM-SAMPLE (Figure 2) and Lemma 3.5. A particularly
interesting consequence for graphs with polynomially bounded edge weights is
Corollary 2.4. Let G be a simple graph on d vertices, and ε ∈ (0, 1). We can construct a (1 + ε)-sparsifier
of G of size O(d log^2 d/ε^2), using only O(d log^2 d) working memory in the online model.
Proof. Let σ_min(A) and σ_max(A) = ||A||_2 be the smallest and largest singular values of A, respectively.
Let κ(A) = σ_max(A)/σ_min(A) be the condition number of A. If we apply Theorem 2.3 with δ =
ε·σ_min(A)^2, we require sample complexity O(d log d log(ε||A||_2^2/δ)/ε^2) = O(d log d log(κ(A)^2)/ε^2).
For an unweighted graph on d vertices, σ_max(A)^2 ≤ d, since d is the largest squared singular value of the
complete graph. Combining with Lemma 6.1 of [27], we have that the condition number of a graph on
d vertices whose edge weights are within a multiplicative poly(d) of each other is polynomial in d. So
log(κ^2(A)) = O(log d), which gives the corollary.
We remark that the algorithm of Corollary 2.4 can be made to run in nearly linear time in the stream
size. We combine SLIM-SAMPLE with the batch processing idea described above. Because A represents a graph,
our matrix approximation is always a symmetric diagonally dominant matrix, with O(d) nonzero entries.
We can solve systems in it in time Õ(d). Using the Johnson–Lindenstrauss random projection trick of
[25], we can compute approximate ridge leverage scores for a batch of O(d) rows with failure probability
polynomially small in n in Õ(d log n) time. Union bounding over the whole stream, we obtain nearly
linear running time.
To complement the row sampling results discussed above, we explore the limits of the proposed online
setting. In Section 4 we present the algorithm ONLINE-BSS, which obtains spectral approximations with
O(d log(ε||A||_2^2/δ)/ε^2) rows in the online setting (with larger memory requirements than the simpler
sampling algorithms). Its analysis is given in Theorem 4.1. In Section 5, we show that this number of
samples is in fact the best achievable, up to constant factors (Theorem 5.1). The log(ε||A||_2^2/δ) factor is
truly the cost of requiring rows to be selected in an online manner.
3 Analysis of sampling schemes
We begin by bounding the sum of online λ-ridge leverage scores. The intuition behind the proof of
Theorem 2.2 is that whenever we add a row with a large online leverage score to a matrix, we increase
its determinant significantly, as follows from the matrix determinant lemma (Lemma 3.1). Thus we can
reduce upper bounding the online leverage scores to bounding the matrix determinant.
Lemma 3.1 (Matrix determinant lemma). Assume S is an invertible square matrix and u is a vector. Then
det(S + uu^T) = det(S)·(1 + u^T S^{-1} u).
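A quick numerical sanity check of the lemma, for illustration only (the random test instance is ours):

```python
import numpy as np

# Check det(S + u u^T) = det(S) * (1 + u^T S^{-1} u) numerically
# for a random positive definite S and random vector u.
rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
S = B @ B.T + np.eye(d)          # positive definite, hence invertible
u = rng.standard_normal(d)
lhs = np.linalg.det(S + np.outer(u, u))
rhs = np.linalg.det(S) * (1.0 + u @ np.linalg.solve(S, u))
assert np.isclose(lhs, rhs)
```

Since S is positive definite and the rank-one update uu^T is positive semidefinite, both sides are positive, matching the determinant-growth intuition used in the proof below.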
Proof of Theorem 2.2. By Lemma 3.1, we have
det(A_{i+1}^T A_{i+1} + λI) = det(A_i^T A_i + λI) · (1 + a_{i+1}^T (A_i^T A_i + λI)^{-1} a_{i+1})
 ≥ det(A_i^T A_i + λI) · (1 + l_{i+1})
 ≥ det(A_i^T A_i + λI) · e^{l_{i+1}/2}.
Hence,
det(A^T A + λI) = det(A_n^T A_n + λI) ≥ det(λI) · e^{∑ l_i/2} = λ^d e^{∑ l_i/2}.
We have det(A^T A + λI) ≤ (||A||_2^2 + λ)^d. Therefore
(||A||_2^2 + λ)^d ≥ λ^d e^{∑ l_i/2}.
Taking logarithms of both sides, we obtain
d log(||A||_2^2 + λ) ≥ d log λ + ∑ l_i/2,
∑ l_i ≤ 2d log(1 + ||A||_2^2/λ).
We now turn to analyzing the algorithm O NLINE -S AMPLE. Because the samples taken by the
algorithm are not independent, we are not able to use a standard matrix Chernoff bound like the one in
Theorem 2.1. However, we do know that whether we take row i does not depend on later rows; thus, we
are able to analyze the process as a martingale. We will use a matrix version of the Freedman inequality
given by Tropp.
Theorem 3.2 (Matrix Freedman inequality [28]). Let Y_0, Y_1, ..., Y_n be a matrix martingale whose values
are self-adjoint matrices with dimension d, and let X_1, ..., X_n be the difference sequence. Assume that
the difference sequence is uniformly bounded in the sense that
||X_k||_2 ≤ R almost surely, for k = 1, ..., n.
Define the predictable quadratic variation process of the martingale as
W_k := ∑_{j=1}^k E[X_j^2 | Y_{j-1}, ..., Y_0], for k = 1, ..., n.
Then, for all ε > 0 and σ^2 > 0,
P[||Y_n||_2 ≥ ε and ||W_n||_2 ≤ σ^2] ≤ d · exp(−(ε^2/2)/(σ^2 + Rε/3)).
We begin by showing that the output of ONLINE-SAMPLE is in fact an approximation of A, and that
the approximate online leverage scores are lower bounded by the actual online leverage scores.
Lemma 3.3. After running ONLINE-SAMPLE, it holds with high probability that
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI,
and also
l̃_i ≥ a_i^T (A^T A + λI)^{-1} a_i
for i = 1, ..., n.
Proof. Let
u_i := (A^T A + λI)^{-1/2} a_i.
We construct a matrix martingale Y_0, Y_1, ..., Y_n ∈ R^{d×d} with the difference sequence X_1, ..., X_n. Set
Y_0 = 0. If ||Y_{i-1}||_2 ≥ ε, we set X_i := 0. Otherwise, let
X_i := (1/p_i − 1) u_i u_i^T   if a_i is sampled in Ã,
X_i := −u_i u_i^T   otherwise.
In the case that ||Y_{i-1}||_2 < ε, by construction, ||Y_j||_2 < ε for all j ≤ i − 1. So we have
Y_{i-1} = (A^T A + λI)^{-1/2} (Ã_{i-1}^T Ã_{i-1} − A_{i-1}^T A_{i-1}) (A^T A + λI)^{-1/2}.
Since ||Y_{i-1}||_2 ≤ ε, we can see that
(A^T A + λI)^{-1/2} (Ã_{i-1}^T Ã_{i-1}) (A^T A + λI)^{-1/2} ⪯ (A^T A + λI)^{-1/2} (A_{i-1}^T A_{i-1}) (A^T A + λI)^{-1/2} + εI.
Multiplying on both right and left by (A^T A + λI)^{1/2} gives
Ã_{i-1}^T Ã_{i-1} ⪯ A_{i-1}^T A_{i-1} + ε(A^T A + λI).
Hence, we have
l̃_i = min((1 + ε) a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i, 1)
 ≥ min((1 + ε) a_i^T (A_{i-1}^T A_{i-1} + λI + ε(A^T A + λI))^{-1} a_i, 1)
 ≥ min((1 + ε) a_i^T ((1 + ε)(A^T A + λI))^{-1} a_i, 1)
 = a_i^T (A^T A + λI)^{-1} a_i        (3.1)
 = u_i^T u_i.
Thus, p_i = min(c·l̃_i, 1) ≥ min(c·u_i^T u_i, 1). If p_i = 1, then X_i = 0. Otherwise, we have p_i ≥ c·u_i^T u_i and
||X_i||_2 ≤ max{1, 1/p_i − 1} · ||u_i u_i^T||_2 ≤ u_i^T u_i / p_i ≤ 1/c.        (3.2)
Further,
E[X_i^2 | Y_{i-1}, ..., Y_0] = p_i · (1/p_i − 1)^2 (u_i u_i^T)^2 + (1 − p_i) · (u_i u_i^T)^2
 = (u_i u_i^T)^2 · (1 − p_i)/p_i
 ⪯ u_i u_i^T · u_i^T u_i / p_i
 ⪯ u_i u_i^T / c.        (by equation (3.2))
And so, for the predictable quadratic variation process of the martingale {Y_i},
W_i := ∑_{k=1}^i E[X_k^2 | Y_{k-1}, ..., Y_0],
we have
||W_i||_2 ≤ ||∑_{k=1}^i u_k u_k^T||_2 / c ≤ 1/c.
Therefore, by Theorem 3.2, we have
P[||Y_n||_2 ≥ ε] ≤ d · exp(−(ε^2/2)/(1/c + ε/(3c)))
 ≤ d · exp(−cε^2/4)
 = 1/d.
This implies that with high probability
||(A^T A + λI)^{-1/2} (Ã^T Ã + λI) (A^T A + λI)^{-1/2} − I||_2 ≤ ε
and so
(1 − ε)(A^T A + λI) ⪯ Ã^T Ã + λI ⪯ (1 + ε)(A^T A + λI).
Subtracting λI = (δ/ε)I from all sides, we get
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI.
Finally, note that, since we set X_i = 0 if ||Y_{i-1}||_2 ≥ ε, ||Y_n||_2 < ε implies ||Y_i||_2 < ε for all i < n. We
thus have the desired bound on l̃_i by equation (3.1).
If we set c in ONLINE-SAMPLE to be proportional to log n rather than log d, we would be able
to take a union bound over all the rows and guarantee that with high probability all the approximate
online leverage scores l̃_i are close to the true online leverage scores l_i. Thus Theorem 2.2 would imply that
ONLINE-SAMPLE only selects O(d log n log(||A||_2^2/λ)/ε^2) rows with high probability.
In order to remove the dependency on n, we have to sacrifice achieving close approximations to l_i at
every step. Instead, we show that the sum of the computed approximate online leverage scores is still
small with high probability, using a custom Chernoff bound.
Lemma 3.4. After running ONLINE-SAMPLE, it holds with high probability that
∑_{i=1}^n l̃_i = O(d log(||A||_2^2/λ)).
Proof. Define
δ_i := log det(Ã_i^T Ã_i + λI) − log det(Ã_{i-1}^T Ã_{i-1} + λI).
The proof closely follows the idea from the proof of Theorem 2.2. We will aim to show that large values
of l̃_i correlate with large values of δ_i. Then, the sum of the δ_i can be bounded by the logarithm of the ratio
of the determinants of Ã^T Ã + λI and λI, giving us a bound on the sum of the l̃_i. First, we will show that
E[exp(l̃_i/8 − δ_i) | Ã_{i-1}, ..., Ã_0] is always at most 1. Note that if row i is sampled by ONLINE-SAMPLE,
then Ã_i^T Ã_i + λI = Ã_{i-1}^T Ã_{i-1} + λI + a_i a_i^T / p_i. By the matrix determinant lemma (Lemma 3.1), we thus have
e^{δ_i} = 1 + a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i / p_i in this case. Otherwise, if row i is not sampled, δ_i = 0. Thus,
E[exp(l̃_i/8 − δ_i) | Ã_{i-1}, ..., Ã_0] = p_i · e^{l̃_i/8} (1 + a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i / p_i)^{-1} + (1 − p_i) e^{l̃_i/8}
 ≤ p_i · (1 + l̃_i/4)(1 + a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i / p_i)^{-1} + (1 − p_i)(1 + l̃_i/4).        (3.3)
If c·l̃_i < 1, we have p_i = c·l̃_i and l̃_i = (1 + ε) a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i, and so,
E[exp(l̃_i/8 − δ_i) | Ã_{i-1}, ..., Ã_0] ≤ c·l̃_i · (1 + l̃_i/4)(1 + 1/((1 + ε)c))^{-1} + (1 − c·l̃_i)(1 + l̃_i/4)
 = (1 + l̃_i/4)(c·l̃_i (1 + 1/((1 + ε)c))^{-1} + 1 − c·l̃_i)
 ≤ (1 + l̃_i/4)(1 − l̃_i/4)
 ≤ 1.
Otherwise, we have p_i = 1 and so, by (3.3),
E[exp(l̃_i/8 − δ_i) | Ã_{i-1}, ..., Ã_0] ≤ (1 + l̃_i/4)(1 + a_i^T (Ã_{i-1}^T Ã_{i-1} + λI)^{-1} a_i)^{-1}
 ≤ (1 + l̃_i/4)(1 + l̃_i)^{-1}
 ≤ 1.
We will now analyze the expected product of exp(l̃_i/8 − δ_i) over the first k steps, E[exp(∑_{i=1}^k (l̃_i/8 − δ_i))].
Since, conditioned on the first k − 1 steps, exp(l̃_k/8 − δ_k) is independent of exp(l̃_i/8 − δ_i) for all i < k, for
k ≥ 1 we have
E[exp(∑_{i=1}^k (l̃_i/8 − δ_i))] = E[exp(∑_{i=1}^{k-1} (l̃_i/8 − δ_i)) · E[exp(l̃_k/8 − δ_k) | Ã_{k-1}, ..., Ã_0]]
 ≤ E[exp(∑_{i=1}^{k-1} (l̃_i/8 − δ_i))],
and so by induction on k,
E[exp(∑_{i=1}^n (l̃_i/8 − δ_i))] ≤ 1.
Hence by Markov's inequality,
P[∑_{i=1}^n l̃_i > 8d + 8 ∑_{i=1}^n δ_i] ≤ e^{−d}.
By Lemma 3.3, with high probability we have Ã^T Ã + λI ⪯ (1 + ε)(A^T A + λI). We also have with high
probability
det(Ã^T Ã + λI) ≤ (1 + ε)^d (||A||_2^2 + λ)^d,
log det(Ã^T Ã + λI) ≤ d(1 + log(||A||_2^2 + λ)).
Hence, with high probability it holds that
∑_{i=1}^n δ_i = log det(Ã^T Ã + λI) − d log λ
 ≤ d(1 + log(||A||_2^2 + λ) − log λ)
 = d(1 + log(1 + ||A||_2^2/λ)).
And so, with high probability,
∑_{i=1}^n l̃_i ≤ 8d + 8 ∑_{i=1}^n δ_i
 ≤ 16d + 8d log(1 + ||A||_2^2/λ)
 = O(d log(||A||_2^2/λ)).
Proof of Theorem 2.3. The statement follows immediately from Lemmas 3.3 and 3.4.
Observe that by Theorem 2.3, ONLINE-SAMPLE stores O(d log d log(ε||A||_2^2/δ)/ε^2) rows in memory.
We now consider a simple modification of the algorithm, SLIM-SAMPLE (Figure 2), that removes the
1/ε^2 factor from the working memory usage with no additional cost.
Lemma 3.5. Let Ã be the matrix returned by SLIM-SAMPLE(A, ε, δ). Then, with high probability,
(1 − ε)A^T A − δI ⪯ Ã^T Ã ⪯ (1 + ε)A^T A + δI,
and the number of rows in Ã is O(d log d log(ε||A||_2^2/δ)/ε^2).
Moreover, with high probability, the memory requirement of SLIM-SAMPLE is dominated by storing
O(d log d log(ε||A||_2^2/δ)) rows of A.
Proof. As the samples are independent, the statement follows from Theorem 2.1 and Lemmas 3.3
and 3.4.
à = S LIM -S AMPLE(A, ε, δ ), where A is an n × d matrix with rows a1 , . . . , an ,
ε ∈ (0, 1), δ > 0.
1. Set λ := δ /ε, c := 8 log d/ε 2 .
2. Let Ã0 be a 0 × d matrix.
3. Let l˜1 , . . . , l˜n be the approximate online leverage scores computed by an independent
instance of O NLINE -S AMPLE(A, 1/2, δ /(2ε)).
4. For i = 1, . . . , n:
(a) Let pi := min(cl˜i , 1).
" #
à i−1
√ with probability pi ,
(b) Set Ãi := ai / pi
Ãi−1 otherwise.
5. Return à := Ãn .
Figure 2: The low-memory online sampling algorithm
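The structure of SLIM-SAMPLE (one coarse instance kept in memory purely for score estimation, feeding an independent fine sample) can be sketched as follows; the inlined coarse instance and all names are ours, for illustration only:

```python
import numpy as np

def slim_sample(A, eps, delta, rng=None):
    """Sketch of SLIM-SAMPLE (Figure 2): a coarse ONLINE-SAMPLE
    instance (eps = 1/2) is maintained only to produce score estimates;
    the final (1 +/- eps) sample is drawn independently from those
    estimates and could be written straight to an output stream."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    lam = delta / eps
    c = 8.0 * np.log(d) / eps**2
    # coarse instance: ONLINE-SAMPLE(A, 1/2, delta/(2*eps))
    eps0 = 0.5
    lam0 = (delta / (2.0 * eps)) / eps0   # equals delta/eps
    c0 = 8.0 * np.log(d) / eps0**2
    M0 = lam0 * np.eye(d)                 # coarse instance's sample
    out = []
    for a in A:
        l = min((1.0 + eps0) * (a @ np.linalg.solve(M0, a)), 1.0)
        p0 = min(c0 * l, 1.0)
        if rng.random() < p0:             # coarse sample, kept in memory
            r = a / np.sqrt(p0)
            M0 += np.outer(r, r)
        p = min(c * l, 1.0)               # fine sample, streamed out
        if rng.random() < p:
            out.append(a / np.sqrt(p))
    return np.array(out)
```

Only the coarse instance's O(d log d log(ε||A||_2^2/δ)) rows ever occupy working memory, while the ε-dependent oversampling appears only in the output.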
4 Asymptotically optimal algorithm
In addition to sampling by online leverage scores, we introduce a row sampling algorithm, ONLINE-BSS
(Figure 3), which improves the row count of ONLINE-SAMPLE by a log d factor, to
O(d log(ε||A||_2^2/δ)/ε^2).
This improved bound matches the lower bound for online sampling given in Theorem 5.1. This approach
uses a variant of the deterministic “BSS” method, introduced by Batson, Spielman, and Srivastava in
[2]. It is well known that this method yields spectral approximations with a log d factor fewer rows than
leverage score sampling in the offline setting, and we show that this improvement extends to online
approximation.
Unlike the original BSS algorithm of [2], our algorithm is randomized. It is similar to, and inspired by,
the randomized version of BSS from [19], especially "Algorithm 1" of that paper. In both algorithms,
as in online leverage score sampling, when a new row is processed, a probability p_i is assigned to it, and
the row is kept with probability p_i and rejected otherwise. The key difference between the algorithms is in the
definition of p_i. Like ONLINE-SAMPLE, at each step, ONLINE-BSS maintains a row sample Ã_i which
approximates the matrix A_i that has been seen so far. However, p_i cannot be computed solely from
Ã_{i−1}: it is necessary to "remember" the entire input. Thus, ONLINE-BSS is not memory efficient,
using O(d^2) space. One may improve the memory dependence by simply running ONLINE-BSS on
the output stream of rows produced by ONLINE-SAMPLE. This reduces the storage cost to the size of
that output spectral approximation. Of course, this does not mean that ONLINE-BSS leads to a space
savings over ONLINE-SAMPLE. However, the number of rows in its output stream will be smaller than that
of ONLINE-SAMPLE, by a log d factor.
We also remark that ONLINE-SAMPLE gives bounds on both the size of the output spectral approximation
and its accuracy with high probability. In contrast, ONLINE-BSS gives an expected bound on the
output size, while it never fails to output a correct spectral approximation. These guarantees are similar to
those given in [19]. Below, we present the performance guarantees of ONLINE-BSS and their analysis.
Theorem 4.1. Let à be the matrix output by O NLINE -BSS(A, ε, δ ) (Figure 3).
1. We always have (1 − ε)AT A − δ I ≺ ÃT Ã ≺ (1 + ε)AT A + δ I.
2. The expected number of rows in à is O(d log(εkAk22 /δ )/ε 2 ).
à = O NLINE -BSS(A, ε, δ ), where A is an n × d matrix with rows a1 , . . . , an ,
ε ∈ (0, 1), δ > 0.
1. Set cU = ε2 + 1 and cL = ε2 − 1.
2. Let Ã0 be a 0 × d matrix, BU0 = δ I, BL0 = −δ I.
3. For i = 1, . . . , n:
(a) Let XUi−1 = (BUi−1 − ÃTi−1 Ãi−1 ), XLi−1 = (ÃTi−1 Ãi−1 − BLi−1 ).
(b) Let pi := min(cU aTi (XUi−1 )−1 ai + cL aTi (XLi−1 )−1 ai , 1).
" #
Ãi−1
√ with probability pi ,
(c) Set Ãi := ai / pi
Ãi−1 otherwise.
(d) Set BUi = BUi−1 + (1 + ε)ai aTi , BLi = BLi−1 + (1 − ε)ai aTi .
4. Return à := Ãn .
Figure 3: The Online BSS Algorithm
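For concreteness, here is a direct Python transcription of Figure 3 (a sketch under our own naming; it stores B^U, B^L, and the running Ã^T Ã as dense d × d matrices, reflecting the O(d^2) space usage discussed above):

```python
import numpy as np

def online_bss(A, eps, delta, seed=0):
    """Transcription of ONLINE-BSS (Figure 3), as a sketch.

    Maintains upper/lower barriers BU, BL and the running sample Gram
    matrix S = Atilde_i^T Atilde_i; row i is kept with probability
    p_i = min(cU * a^T XU^{-1} a + cL * a^T XL^{-1} a, 1)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    cU, cL = 2 / eps + 1, 2 / eps - 1
    BU = delta * np.eye(d)         # upper barrier B^U_0 = delta * I
    BL = -delta * np.eye(d)        # lower barrier B^L_0 = -delta * I
    S = np.zeros((d, d))
    kept = []
    for a in A:
        XU = BU - S                # distance to the upper barrier
        XL = S - BL                # distance to the lower barrier
        p = min(cU * (a @ np.linalg.solve(XU, a))
                + cL * (a @ np.linalg.solve(XL, a)), 1.0)
        if rng.random() < p:
            w = a / np.sqrt(p)
            kept.append(w)
            S += np.outer(w, w)
        BU += (1 + eps) * np.outer(a, a)
        BL += (1 - eps) * np.outer(a, a)
    return np.array(kept)
```

By part 1 of Theorem 4.1 the returned sample always satisfies the spectral bound, so this can be checked deterministically (up to floating-point tolerance) on any input.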
Proof of Theorem 4.1 Part 1. As in [2], a key idea of ONLINE-BSS is to maintain two matrices, B^U_i
and B^L_i, acting as upper and lower "barriers." We will prove that the current approximation Ã_i always
falls between them:

B^L_i ≺ Ã_i^T Ã_i ≺ B^U_i.   (4.1)

Equivalently, X^U_i and X^L_i always remain positive definite. Since, at the completion of the algorithm,
B^U_n = (1 + ε)A^T A + δI and B^L_n = (1 − ε)A^T A − δI, this ensures that the final approximation always
satisfies the approximation bound in claim (1) of the theorem. p_i is chosen at step 3(b) to ensure this
invariant: if either X^U_i or X^L_i is too small (we are too close to one of the barriers), then at least one of
a_i^T (X^U_{i−1})^{−1} a_i or a_i^T (X^L_{i−1})^{−1} a_i will be large, and so p_i will be large.

We prove that (4.1) holds by induction on i. The base case follows from the initialization of Ã_0 with
Ã_0^T Ã_0 = 0, B^U_0 = δI, and B^L_0 = −δI, since clearly −δI ≺ 0 ≺ δI. For each successive step, we
consider two possibilities.
Case 1: p_i = 1. When p_i = 1, Ã_i^T Ã_i = Ã_{i−1}^T Ã_{i−1} + a_i a_i^T. Since we set B^U_i = B^U_{i−1} + (1 + ε)a_i a_i^T
and B^L_i = B^L_{i−1} + (1 − ε)a_i a_i^T, we can see that X^U_i = X^U_{i−1} + ε·a_i a_i^T and X^L_i = X^L_{i−1} + ε·a_i a_i^T. Since, by the
induction assumption, X^U_{i−1} and X^L_{i−1} are both positive definite, so are X^U_i and X^L_i, giving the claim.
Case 2: p_i < 1. In this case, with probability p_i, Ã_i^T Ã_i = Ã_{i−1}^T Ã_{i−1} + a_i a_i^T/p_i. With probability
1 − p_i, Ã_i^T Ã_i = Ã_{i−1}^T Ã_{i−1}. Thus, in either case, Ã_i^T Ã_i ⪯ Ã_{i−1}^T Ã_{i−1} + a_i a_i^T/p_i. In turn, X^U_i ⪰ X^U_{i−1} − a_i a_i^T/p_i.
Observe that c_U = 2/ε + 1 > 1 and thus p_i > a_i^T (X^U_{i−1})^{−1} a_i. This gives X^U_{i−1} ≻ a_i a_i^T/p_i, and so since
X^U_i ⪰ X^U_{i−1} − a_i a_i^T/p_i, it must be positive definite.
Analogously, since B^L_i = B^L_{i−1} + (1 − ε)a_i a_i^T, we have X^L_i ⪰ X^L_{i−1} − (1 − ε)a_i a_i^T ⪰ X^L_{i−1} − a_i a_i^T.
Since c_L = 2/ε − 1 > 1 for ε ∈ (0, 1), and since p_i < 1 (by the fact that we are in Case 2), we have
a_i^T (X^L_{i−1})^{−1} a_i < 1. This in turn gives X^L_{i−1} ≻ a_i a_i^T, and thus since X^L_i ⪰ X^L_{i−1} − a_i a_i^T, it must be positive
definite, giving the claim in this case.
Thus, we have shown (4.1) for all i. In particular, B^L_n ≺ Ã^T Ã ≺ B^U_n. We can see by construction that
B^U_n = (1 + ε)A^T A + δI and B^L_n = (1 − ε)A^T A − δI.
Thus, we have (1 − ε)A^T A − δI ≺ Ã^T Ã ≺ (1 + ε)A^T A + δI, which gives the first claim of the theorem.
In our proof of the second claim, bounding the expected number of rows sampled, we will need the
following technical lemma, which is derived from the Sherman–Morrison formula [24].

Lemma 4.2. Given a positive definite matrix X, two vectors u and v, two scalar multipliers a and b, and
a probability p, define the random variable X̂ to be X − a·uu^T with probability p and X − b·uu^T otherwise.
Then, if u^T X^{−1} u = 1,

E[v^T X̂^{−1} v] − v^T X^{−1} v = (v^T X^{−1} u)^2 · (pa + (1 − p)b − ab) / ((1 − a)(1 − b)).
Proof. We apply the Sherman–Morrison formula to each of the two possibilities (X̂ = X − a·uu^T and
X̂ = X − b·uu^T, respectively). These give respective values of

X̂^{−1} = X^{−1} + a · X^{−1} uu^T X^{−1} / (1 − a·u^T X^{−1} u) = X^{−1} + a/(1 − a) · X^{−1} uu^T X^{−1}

and

X̂^{−1} = X^{−1} + b/(1 − b) · X^{−1} uu^T X^{−1}.

The values of v^T X̂^{−1} v − v^T X^{−1} v are then respectively

a/(1 − a) · v^T X^{−1} uu^T X^{−1} v = (v^T X^{−1} u)^2 · a/(1 − a)

and

b/(1 − b) · v^T X^{−1} uu^T X^{−1} v = (v^T X^{−1} u)^2 · b/(1 − b).

Taking the expectation over the two cases and putting both terms over the common denominator
(1 − a)(1 − b) gives the stated result.
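Lemma 4.2 is straightforward to sanity-check numerically. The following sketch (names are ours) compares the exact two-outcome expectation against the closed form, for a random positive definite X with u rescaled so that u^T X^{−1} u = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
G = rng.standard_normal((d, d))
X = G @ G.T + d * np.eye(d)              # random positive definite X
u = rng.standard_normal(d)
u /= np.sqrt(u @ np.linalg.solve(X, u))  # enforce u^T X^{-1} u = 1
v = rng.standard_normal(d)
a, b, p = 0.3, -0.5, 0.7

def quad(c):
    """v^T (X - c * u u^T)^{-1} v."""
    return v @ np.linalg.solve(X - c * np.outer(u, u), v)

# exact expectation over the two outcomes, minus v^T X^{-1} v
lhs = p * quad(a) + (1 - p) * quad(b) - v @ np.linalg.solve(X, v)
s = v @ np.linalg.solve(X, u)            # v^T X^{-1} u
rhs = s**2 * (p * a + (1 - p) * b - a * b) / ((1 - a) * (1 - b))
assert abs(lhs - rhs) < 1e-8
```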
Proof of Theorem 4.1 Part 2. We will show that the probability that row a_i is included in Ã is at most
(8/ε^2)·l_i, where l_i is the online 2δ/ε-ridge leverage score of a_i, i. e.,

l_i = min(a_i^T (A_{i−1}^T A_{i−1} + (2δ/ε)·I)^{−1} a_i, 1).

Since ∑_{i=1}^n l_i = O(d log(ε‖A‖_2^2/δ)) by Theorem 2.2, this implies that Ã has O(d log(ε‖A‖_2^2/δ)/ε^2) rows
in expectation, completing the second claim of the theorem.
First, we introduce some notation to help in the analysis. Let q_i be the probability that row a_i is
sampled in the algorithm. Note that q_i is fixed, and we seek to prove that q_i ≤ (8/ε^2)·l_i. The probability
p_i that a_i is sampled at step i is a random variable. We have q_i = E[p_i]. Thus it suffices to prove
E[p_i] ≤ (8/ε^2)·l_i. We define

C^U_{i,j} = δI + (ε/2)·A_i^T A_i + (1 + ε/2)·A_j^T A_j,
C^L_{i,j} = −δI − (ε/2)·A_i^T A_i + (1 − ε/2)·A_j^T A_j.

Note that C^U_{i,i} = B^U_i, C^L_{i,i} = B^L_i, and for j ≤ i, C^U_{i,j} ⪰ B^U_j and C^L_{i,j} ⪯ B^L_j. We can then define

Y^U_{i,j} = C^U_{i,j} − Ã_j^T Ã_j,
Y^L_{i,j} = Ã_j^T Ã_j − C^L_{i,j}.

We then have, similarly, Y^U_{i,i} = X^U_i, Y^L_{i,i} = X^L_i, and for j ≤ i, Y^U_{i,j} ⪰ X^U_j and Y^L_{i,j} ⪰ X^L_j.
Assume that l_i < 1. (Otherwise, since p_i ≤ 1 as it is a probability, we trivially have E[p_i] ≤ (8/ε^2)·l_i as
desired.) Now, note that for all i > 0,

a_i^T (Y^U_{i−1,0})^{−1} a_i = a_i^T (Y^L_{i−1,0})^{−1} a_i
  = a_i^T ((ε/2)·A_{i−1}^T A_{i−1} + δI)^{−1} a_i
  = (2/ε) · a_i^T (A_{i−1}^T A_{i−1} + (2δ/ε)·I)^{−1} a_i ≤ (2/ε)·l_i.   (4.2)
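The rescaling step in (4.2) is just the matrix identity ((ε/2)·M + δI)^{−1} = (2/ε)·(M + (2δ/ε)·I)^{−1}, pulling the scalar ε/2 out of the inverse; a quick numerical check (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, delta = 5, 0.5, 0.1
B = rng.standard_normal((20, d))
M = B.T @ B                      # stands in for A_{i-1}^T A_{i-1}
a = rng.standard_normal(d)
lhs = a @ np.linalg.solve(eps / 2 * M + delta * np.eye(d), a)
rhs = (2 / eps) * (a @ np.linalg.solve(M + (2 * delta / eps) * np.eye(d), a))
assert np.isclose(lhs, rhs)
```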
Next, we will show that for j < i − 1,

E[a_i^T (Y^U_{i−1,j+1})^{−1} a_i] ≤ E[a_i^T (Y^U_{i−1,j})^{−1} a_i]   (4.3)

and

E[a_i^T (Y^L_{i−1,j+1})^{−1} a_i] ≤ E[a_i^T (Y^L_{i−1,j})^{−1} a_i].   (4.4)

Combined with (4.2) and the fact that Y^U_{i,i} = X^U_i and Y^L_{i,i} = X^L_i, (4.3) and (4.4) give

E[p_i] ≤ c_U·E[a_i^T (X^U_{i−1})^{−1} a_i] + c_L·E[a_i^T (X^L_{i−1})^{−1} a_i]
  = c_U·E[a_i^T (Y^U_{i−1,i−1})^{−1} a_i] + c_L·E[a_i^T (Y^L_{i−1,i−1})^{−1} a_i]
  ≤ c_U·E[a_i^T (Y^U_{i−1,0})^{−1} a_i] + c_L·E[a_i^T (Y^L_{i−1,0})^{−1} a_i]
  ≤ (2/ε)·l_i·(c_U + c_L) = (8/ε^2)·l_i.

The last equality follows from the fact that in ONLINE-BSS we set c_U = 2/ε + 1 and c_L = 2/ε − 1, so
c_U + c_L = 4/ε. This completes the claim that for all i, the probability q_i that row a_i is sampled is bounded as
q_i = E[p_i] ≤ (8/ε^2)·l_i, giving the second part of Theorem 4.1.
It remains to prove (4.3) and (4.4). To do this, we will show a somewhat stronger statement: conditioned
on any choices for the first j rows, the expected value of a_i^T (Y^U_{i−1,j+1})^{−1} a_i is no larger than that of
a_i^T (Y^U_{i−1,j})^{−1} a_i, and analogously for (Y^L_{i−1,j+1})^{−1}. Similarly to the proof of Part 1, we consider two cases.

Case 1: p_{j+1} = 1. In this case, the positive semidefinite matrix a_{j+1} a_{j+1}^T is added at step j + 1 to give
Ã_{j+1}^T Ã_{j+1} = Ã_j^T Ã_j + a_{j+1} a_{j+1}^T. This gives

Y^U_{i−1,j+1} = C^U_{i−1,j+1} − Ã_{j+1}^T Ã_{j+1}
  = C^U_{i−1,j} + (1 + ε/2)·a_{j+1} a_{j+1}^T − (Ã_j^T Ã_j + a_{j+1} a_{j+1}^T)
  = Y^U_{i−1,j} + (ε/2)·a_{j+1} a_{j+1}^T.

Thus we have Y^U_{i−1,j+1} ⪰ Y^U_{i−1,j}, and so a_i^T (Y^U_{i−1,j+1})^{−1} a_i ≤ a_i^T (Y^U_{i−1,j})^{−1} a_i, giving (4.3). An analogous
argument holds for Y^L_{i−1,j+1}, giving (4.4).
Case 2: p_{j+1} < 1. This case is trickier. Importantly, by how p_{j+1} is set in step 3(b) of ONLINE-BSS,
and by the observation that Y^U_{i−1,j} ⪰ X^U_j and Y^L_{i−1,j} ⪰ X^L_j for j ≤ i − 1 (recall that we must prove
(4.3) and (4.4) under the assumption that j ≤ i − 1), we have

p_{j+1} ≥ c_U · a_{j+1}^T (X^U_j)^{−1} a_{j+1} ≥ c_U · a_{j+1}^T (Y^U_{i−1,j})^{−1} a_{j+1}   (4.5)

and

p_{j+1} ≥ c_L · a_{j+1}^T (X^L_j)^{−1} a_{j+1} ≥ c_L · a_{j+1}^T (Y^L_{i−1,j})^{−1} a_{j+1}.   (4.6)

Now, we define w_{j+1} = a_{j+1}/√p_{j+1} and additionally

s^U_{j+1} = w_{j+1}^T (Y^U_{i−1,j})^{−1} w_{j+1},
s^L_{j+1} = w_{j+1}^T (Y^L_{i−1,j})^{−1} w_{j+1},
u^U_{j+1} = w_{j+1}/√(s^U_{j+1}),
u^L_{j+1} = w_{j+1}/√(s^L_{j+1}).
We then deploy Lemma 4.2 to bound the expectations in (4.3) and (4.4).

Upper barrier bound (4.3): Here we apply Lemma 4.2 with X = Y^U_{i−1,j}, u = u^U_{j+1}, v = a_i,
a = s^U_{j+1}·(1 − p_{j+1}(1 + ε/2)), b = −s^U_{j+1}·p_{j+1}(1 + ε/2), and p = p_{j+1}. Note that we have

u^T X^{−1} u = (1/s^U_{j+1}) · w_{j+1}^T (Y^U_{i−1,j})^{−1} w_{j+1} = 1,

as required. Additionally, with probability p_{j+1},

X̂ = X − a·uu^T
  = Y^U_{i−1,j} − s^U_{j+1}·(1 − p_{j+1}(1 + ε/2)) · u^U_{j+1} (u^U_{j+1})^T
  = Y^U_{i−1,j} − (1 − p_{j+1}(1 + ε/2)) · w_{j+1} w_{j+1}^T
  = Y^U_{i−1,j} − (1/p_{j+1})·a_{j+1} a_{j+1}^T + (1 + ε/2)·a_{j+1} a_{j+1}^T.

Similarly, with probability 1 − p_{j+1},

X̂ = X − b·uu^T
  = Y^U_{i−1,j} + s^U_{j+1}·p_{j+1}(1 + ε/2) · u^U_{j+1} (u^U_{j+1})^T
  = Y^U_{i−1,j} + (1 + ε/2)·a_{j+1} a_{j+1}^T.

That is, X̂ = Y^U_{i−1,j+1}. Thus, Lemma 4.2 gives

E[a_i^T X̂^{−1} a_i] − v^T X^{−1} v = E[a_i^T (Y^U_{i−1,j+1})^{−1} a_i] − E[a_i^T (Y^U_{i−1,j})^{−1} a_i]
  = (v^T X^{−1} u)^2 · (pa + (1 − p)b − ab) / ((1 − a)(1 − b)).
To prove (4.3), it suffices to show that

(pa + (1 − p)b − ab) / ((1 − a)(1 − b))

is non-positive. Letting r = a_{j+1}^T (Y^U_{i−1,j})^{−1} a_{j+1}, we can write a = r·(1/p_{j+1} − (1 + ε/2)) and
b = −r·(1 + ε/2) < 0. By (4.5), r ≤ p_{j+1}/c_U and thus a ≤ r/p_{j+1} ≤ 1/c_U = 1/(2/ε + 1) < 1. Thus, the
denominator (1 − a)(1 − b) is positive, and so it remains to show that the numerator pa + (1 − p)b − ab
is non-positive. We can write

pa + (1 − p)b − ab = −r·(ε/2) + r^2·(1/p_{j+1} − (1 + ε/2))·(1 + ε/2)
  ≤ −r·(ε/2) + r^2·(1/p_{j+1})·(1 + ε/2)
  ≤ −r·(ε/2) + r·(1/(2/ε + 1))·(1 + ε/2)
  = −r·(ε/2) + r·(ε/2) ≤ 0.
This completes the argument that E[a_i^T (Y^U_{i−1,j+1})^{−1} a_i] − E[a_i^T (Y^U_{i−1,j})^{−1} a_i] ≤ 0, giving (4.3).
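The chain of inequalities above is easy to stress-test numerically: sampling random ε, p_{j+1}, and r subject to the constraint r ≤ p_{j+1}/c_U from (4.5), the numerator pa + (1 − p)b − ab is never positive (a sketch with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(10000):
    eps = rng.uniform(0.01, 0.99)
    cU = 2 / eps + 1
    p = rng.uniform(1e-3, 1.0)
    r = rng.uniform(0, p / cU)          # constraint (4.5): r <= p_{j+1}/cU
    a = r * (1 / p - (1 + eps / 2))
    b = -r * (1 + eps / 2)
    assert p * a + (1 - p) * b - a * b <= 1e-12
```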
Lower barrier bound (4.4): For the lower barrier bound we give a similar argument. We use X = Y^L_{i−1,j},
u = u^L_{j+1}, v = a_i, a = −s^L_{j+1}·(1 − p_{j+1}(1 − ε/2)), b = s^L_{j+1}·p_{j+1}(1 − ε/2), and p = p_{j+1}. We again have

u^T X^{−1} u = (1/s^L_{j+1}) · w_{j+1}^T (Y^L_{i−1,j})^{−1} w_{j+1} = 1,

as required. Additionally, with probability p_{j+1}, we have

X̂ = X − a·uu^T
  = Y^L_{i−1,j} + s^L_{j+1}·(1 − p_{j+1}(1 − ε/2)) · u^L_{j+1} (u^L_{j+1})^T
  = Y^L_{i−1,j} + (1/p_{j+1})·a_{j+1} a_{j+1}^T − (1 − ε/2)·a_{j+1} a_{j+1}^T.

Similarly, with probability 1 − p_{j+1},

X̂ = X − b·uu^T
  = Y^L_{i−1,j} − s^L_{j+1}·p_{j+1}(1 − ε/2) · u^L_{j+1} (u^L_{j+1})^T
  = Y^L_{i−1,j} − (1 − ε/2)·a_{j+1} a_{j+1}^T.

That is, X̂ = Y^L_{i−1,j+1}. Thus, Lemma 4.2 gives

E[a_i^T X̂^{−1} a_i] − v^T X^{−1} v = E[a_i^T (Y^L_{i−1,j+1})^{−1} a_i] − E[a_i^T (Y^L_{i−1,j})^{−1} a_i]
  = (v^T X^{−1} u)^2 · (pa + (1 − p)b − ab) / ((1 − a)(1 − b)).
Again, to prove (4.4) it suffices to show that

(pa + (1 − p)b − ab) / ((1 − a)(1 − b))

is non-positive. Let r = a_{j+1}^T (Y^L_{i−1,j})^{−1} a_{j+1}. We can write a = −r·(1/p_{j+1} − (1 − ε/2)) < 0 and
b = r·(1 − ε/2). Note that by (4.6), r ≤ p_{j+1}/c_L = p_{j+1}/(2/ε − 1) < 1, and thus b < 1. So the denominator
(1 − a)(1 − b) is positive. It thus remains to show that the numerator pa + (1 − p)b − ab is non-positive.
We simplify this numerator as

pa + (1 − p)b − ab = −r·(ε/2) + r^2·(1/p_{j+1} − (1 − ε/2))·(1 − ε/2)
  ≤ −r·(ε/2) + r^2·(1/p_{j+1})·(1 − ε/2)
  ≤ −r·(ε/2) + r·(1/(2/ε − 1))·(1 − ε/2)
  = −r·(ε/2) + r·(ε/2) ≤ 0,

giving the required bound. This proves (4.4) and completes the theorem.
5 Matching lower bound

Here we show that the row count obtained by Theorem 4.1 is in fact optimal. While it is possible to obtain
a spectral approximation with O(d/ε^2) rows in the offline setting, online sampling always incurs a loss
of Ω(log(ε‖A‖_2^2/δ)) and must sample Ω(d log(ε‖A‖_2^2/δ)/ε^2) rows.
Theorem 5.1. Assume that ε‖A‖_2^2 ≥ c_1·δ and ε ≥ c_2/√d, for fixed constants c_1 and c_2. Then any
algorithm that selects rows in an online manner and outputs a spectral approximation to A^T A with (1 + ε)
multiplicative error and δ additive error with probability at least 1/2 must sample Ω(d log(ε‖A‖_2^2/δ)/ε^2)
rows of A in expectation.
Note that the lower bounds we assume on ε‖A‖_2^2 and ε are very minor. They just ensure that
log(ε‖A‖_2^2/δ) ≥ 1 and that ε is not so small that we can essentially sample all rows.
Proof. We apply Yao's minimax principle, constructing, for any large enough M, a distribution on
inputs A with ‖A‖_2^2 ≤ M for which any deterministic online row selection algorithm that succeeds with
probability at least 1/2 must output Ω(d log(εM/δ)/ε^2) rows in expectation. The best randomized
algorithm that works with probability 1/2 on any input matrix with ‖A‖_2^2 ≤ M must therefore select at
least Ω(d log(εM/δ)/ε^2) rows in expectation on the worst-case input, giving us the theorem.
Our distribution is as follows. We select an integer N uniformly at random from [1, log(Mε/δ)]. We
then stream in the vertex-edge incidence matrices of N complete graphs on d vertices, doubling the
weight of each successive graph. Intuitively, spectrally approximating a complete graph requires selecting
Ω(d/ε^2) edges [2] (as long as ε ≥ c_2/√d for some fixed constant c_2). Each time we stream in a new
graph with double the weight, we force the algorithm to add Ω(d/ε^2) more edges to its output, eventually
forcing it to return Ω(d/ε^2 · N) edges, which is Ω(d log(Mε/δ)/ε^2) in expectation.
Specifically, let K_d be the (d choose 2) × d vertex-edge incidence matrix of the complete graph on d vertices;
K_d^T K_d is the Laplacian matrix of the complete graph on d vertices. We weight the first graph so that its
Laplacian has all its nonzero eigenvalues equal to δ/ε. (That is, each edge has weight δ/(dε).) In this way,
even if we select N = ⌊log(Mε/δ)⌋, we have overall ‖A‖_2^2 ≤ δ/ε + 2δ/ε + ... + 2^{⌊log(Mε/δ)⌋−1}·δ/ε ≤ M.
Even if N = 1, all nonzero eigenvalues of A^T A are at least δ/ε, so achieving (1 + ε) multiplicative
error and δI additive error is equivalent to achieving (1 + 2ε) multiplicative error. A^T A is a graph
Laplacian, so it has a null space. However, as all rows are orthogonal to the null space, achieving additive
error δI is equivalent to achieving additive error δI_r, where I_r is the identity projected to the span of A^T A.
We have δI_r ⪯ εA^T A, which is why we must achieve (1 + 2ε) multiplicative error.
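The construction in this paragraph is easy to reproduce: the sketch below (our own helper) builds the weighted vertex-edge incidence matrix of K_d with edge weight δ/(dε) and checks that the Laplacian K_d^T K_d has one zero eigenvalue and all remaining eigenvalues equal to δ/ε:

```python
import numpy as np
from itertools import combinations

def incidence_complete(d, weight=1.0):
    """(d choose 2) x d vertex-edge incidence matrix of K_d, rows scaled
    by sqrt(weight) so that K^T K is the weighted graph Laplacian."""
    K = np.zeros((d * (d - 1) // 2, d))
    for row, (s, t) in enumerate(combinations(range(d), 2)):
        K[row, s], K[row, t] = 1.0, -1.0
    return np.sqrt(weight) * K

d, eps, delta = 6, 0.5, 0.2
K = incidence_complete(d, weight=delta / (d * eps))
evals = np.sort(np.linalg.eigvalsh(K.T @ K))
# one zero eigenvalue (the all-ones vector), the rest all equal delta/eps
assert abs(evals[0]) < 1e-9
assert np.allclose(evals[1:], delta / eps)
```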
In order for a deterministic algorithm to be correct with probability 1/2 on our distribution, it must
be correct for at least 1/2 of our ⌊log(Mε/δ)⌋ possible choices of N.
Let i be the lowest choice of N for which the algorithm is correct. By the lower bound of [2], the
algorithm must output Ω(d/ε^2) rows of A_i to achieve a (1 + 2ε) multiplicative-error spectral approximation.
Here A_i is the input consisting of the vertex-edge incidence matrices of i increasingly weighted
complete graphs. Call the output on this input Ã_i. Now let j be the second lowest choice of N on which
the algorithm is correct. Since the algorithm was correct on A_i to within a multiplicative (1 + 2ε), to be
correct on A_j, it must output a set of edges Ã_j such that

(A_j^T A_j − A_i^T A_i) − 4ε·A_j^T A_j ⪯ Ã_j^T Ã_j − Ã_i^T Ã_i ⪯ (A_j^T A_j − A_i^T A_i) + 4ε·A_j^T A_j.
Since we double each successive copy of the complete graph, A_j^T A_j ⪯ 2(A_j^T A_j − A_i^T A_i). So
Ã_j^T Ã_j − Ã_i^T Ã_i must be a (1 + 8ε) spectral approximation to the true difference A_j^T A_j − A_i^T A_i. Noting
that this difference is itself just a weighting of the complete graph, by the lower bound in [2] the
algorithm must select Ω(d/ε^2) additional edges between the ith and jth input graphs. Iterating this
argument over all ⌊log(Mε/δ)⌋/2 inputs on which the algorithm must be correct, it must select a total of
Ω(d log(Mε/δ)/ε^2) edges in expectation over all inputs.
6 Future work

The main open question arising from the original publication of this work [13] was whether one could prove
that the algorithm of [16] works despite dependencies arising from the row pruning step. By operating
in the online setting, our algorithm avoids row pruning, and hence is able to skirt these dependencies, as
the probability that a row is sampled depends only on earlier rows in the stream. However, because the
streaming setting offers the potential for sampling fewer rows than in the online case, obtaining a rigorous
proof for [16] remains very interesting. This open question was essentially resolved in [18], which presents an
algorithm similar to the one in [16] for insertion-only streams that admits a correct proof.
While our work focuses on spectral approximation, variants of (ridge) leverage score sampling and the
BSS algorithm are also used to solve low-rank approximation problems, including column subset selection
[5, 12] and projection-cost-preserving sketching [10, 12]. Compared with spectral approximation, there is
less work on streaming sampling for low-rank approximation, and understanding how online algorithms
may be used in this setting would be an interesting direction. Since the initial publication, this question has
been studied extensively [6, 8, 7], with online ridge leverage scores being employed for online low-rank
approximation of kernel matrices and for low-rank approximation in sliding-window streams.
Acknowledgements. The authors would like to thank Kenneth Clarkson, Jonathan Kelner, Gary Miller,
Christopher Musco and Richard Peng for helpful discussions and comments. Cameron Musco and
Jakub Pachocki both acknowledge the Gene Golub SIAM Summer School program on Randomization in
Numerical Linear Algebra, where work on this project was initiated.
References

[1] AHMED ALAOUI AND MICHAEL W. MAHONEY: Fast randomized kernel ridge regression with statistical guarantees. In Adv. Neural Info. Proc. Sys. 30 (NIPS'15), pp. 775–783. Curran Assoc., Inc., 2015. NIPS. [arXiv:1411.0306] 4, 5

[2] JOSHUA BATSON, DANIEL A. SPIELMAN, AND NIKHIL SRIVASTAVA: Twice-Ramanujan sparsifiers. SIAM J. Comput., 41(6):1704–1721, 2012. Preliminary version in STOC'09. [doi:10.1137/090772873, arXiv:0808.0163] 4, 13, 14, 20, 21

[3] ANTOINE BORDES AND LÉON BOTTOU: The huller: a simple and efficient online SVM. In Machine Learning: ECML'05, pp. 505–512. Springer, 2005. [doi:10.1007/11564096_48] 3
[4] CHRISTOS BOUTSIDIS, DAN GARBER, ZOHAR KARNIN, AND EDO LIBERTY: Online principal components analysis. In Proc. 26th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'15), pp. 887–901. ACM Press, 2015. [doi:10.1137/1.9781611973730.61] 3

[5] CHRISTOS BOUTSIDIS AND DAVID P. WOODRUFF: Optimal CUR matrix decompositions. SIAM J. Comput., 46(2):543–589, 2017. Preliminary version in STOC'14. [arXiv:1405.7910] 21

[6] VLADIMIR BRAVERMAN, PETROS DRINEAS, CAMERON MUSCO, CHRISTOPHER MUSCO, JALAJ UPADHYAY, DAVID P. WOODRUFF, AND SAMSON ZHOU: Numerical linear algebra in the online and sliding window models. In Proc. 61st FOCS, pp. 517–528. IEEE Comp. Soc., 2020. [doi:10.1109/FOCS46700.2020.00055, arXiv:1805.03765] 3, 21

[7] DANIELE CALANDRIELLO, ALESSANDRO LAZARIC, AND MICHAL VALKO: Efficient second-order online kernel learning with adaptive embedding. In Adv. Neural Info. Proc. Sys. 30 (NIPS'17), pp. 6140–6150. Curran Assoc., Inc., 2017. NIPS. 3, 21

[8] DANIELE CALANDRIELLO, ALESSANDRO LAZARIC, AND MICHAL VALKO: Second-order kernel online convex optimization with adaptive sketching. In Proc. 34th Int. Conf. on Machine Learning (ICML'17), pp. 645–653, 2017. MLR Press. [arXiv:1706.04892] 3, 21

[9] KENNETH L. CLARKSON AND DAVID P. WOODRUFF: Low rank approximation and regression in input sparsity time. J. ACM, 63(6):54:1–54:45, 2017. Preliminary version in STOC'13. [doi:10.1145/3019134, arXiv:1207.6365] 2

[10] MICHAEL B. COHEN, SAM ELDER, CAMERON MUSCO, CHRISTOPHER MUSCO, AND MADALINA PERSU: Dimensionality reduction for k-means clustering and low rank approximation. In Proc. 47th STOC, pp. 163–172. ACM Press, 2015. [doi:10.1145/2746539.2746569, arXiv:1410.6801] 21

[11] MICHAEL B. COHEN, YIN TAT LEE, CAMERON MUSCO, CHRISTOPHER MUSCO, RICHARD PENG, AND AARON SIDFORD: Uniform sampling for matrix approximation. In Proc. 6th Innovations in Theoret. Comp. Sci. Conf. (ITCS'15), pp. 181–190. ACM Press, 2015. [doi:10.1145/2688073.2688113, arXiv:1408.5099] 2, 3, 5

[12] MICHAEL B. COHEN, CAMERON MUSCO, AND CHRISTOPHER MUSCO: Input sparsity time low-rank approximation via ridge leverage score sampling. In Proc. 28th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'17), pp. 1758–1777. ACM Press, 2017. [doi:10.1137/1.9781611974782.115, arXiv:1511.07263] 21

[13] MICHAEL B. COHEN, CAMERON MUSCO, AND JAKUB PACHOCKI: Online row sampling. In Proc. 19th Internat. Workshop on Approximation Algorithms for Combinat. Opt. Probl. (APPROX'16), pp. 7:1–7:18. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2016. [doi:10.4230/LIPIcs.APPROX-RANDOM.2016.7] 21

[14] KOBY CRAMMER, OFER DEKEL, JOSEPH KESHET, SHAI SHALEV-SHWARTZ, AND YORAM SINGER: Online passive-aggressive algorithms. J. Mach. Learning Res., 7:551–585, 2006. JMLR. 3
[15] MICHAEL KAPRALOV, YIN TAT LEE, CAMERON MUSCO, CHRISTOPHER MUSCO, AND AARON SIDFORD: Single pass spectral sparsification in dynamic streams. SIAM J. Comput., 46(1):456–477, 2017. Preliminary version in FOCS'14. [doi:10.1137/141002281, arXiv:1407.1289] 2

[16] JONATHAN A. KELNER AND ALEX LEVIN: Spectral sparsification in the semi-streaming setting. Theory Comput. Sys., 53(2):243–262, 2013. [doi:10.1007/s00224-012-9396-1] 2, 3, 4, 21

[17] IOANNIS KOUTIS, GARY L. MILLER, AND RICHARD PENG: Approaching optimality for solving SDD linear systems. SIAM J. Comput., 43(1):337–354, 2014. Preliminary version in FOCS'10. [doi:10.1137/110845914, arXiv:1003.2958] 5

[18] RASMUS KYNG, JAKUB PACHOCKI, RICHARD PENG, AND SUSHANT SACHDEVA: A framework for analyzing resparsification algorithms. In Proc. 28th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'17), pp. 2032–2043. ACM Press, 2017. [doi:10.1137/1.9781611974782.132, arXiv:1611.06940] 3, 21

[19] YIN TAT LEE AND HE SUN: Constructing linear-sized spectral sparsification in almost-linear time. SIAM J. Comput., 47(6):2315–2336, 2018. Preliminary version in FOCS'15. [doi:10.1137/16M1061850, arXiv:1508.03261] 13, 14

[20] MU LI, GARY L. MILLER, AND RICHARD PENG: Iterative row sampling. In Proc. 54th FOCS, pp. 127–136. IEEE Comp. Soc., 2013. [doi:10.1109/FOCS.2013.22, arXiv:1211.2713] 2

[21] EDO LIBERTY, RAM SRIHARSHA, AND MAXIM SVIRIDENKO: An algorithm for online k-means clustering. In Proc. 18th Workshop on Algorithm Engineering and Experiments (ALENEX'16), pp. 81–89, 2016. [doi:10.1137/1.9781611974317.7, arXiv:1412.5721] 3

[22] XIANGRUI MENG AND MICHAEL W. MAHONEY: Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proc. 45th STOC, pp. 91–100. ACM Press, 2013. [doi:10.1145/2488608.2488621, arXiv:1210.3135] 2

[23] JELANI NELSON AND HUY LÊ NGUYỄN: OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proc. 54th FOCS, pp. 117–126. IEEE Comp. Soc., 2013. [doi:10.1109/FOCS.2013.21, arXiv:1211.1002] 2

[24] JACK SHERMAN AND WINIFRED J. MORRISON: Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat., 21(1):124–127, 1950. JSTOR. 15

[25] DANIEL A. SPIELMAN AND NIKHIL SRIVASTAVA: Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011. Preliminary version in STOC'08. [doi:10.1137/080734029, arXiv:0803.0929] 7

[26] DANIEL A. SPIELMAN AND SHANG-HUA TENG: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proc. 36th STOC, pp. 81–90. ACM Press, 2004. [doi:10.1145/1007352.1007372, arXiv:cs/0310051] 2
[27] DANIEL A. SPIELMAN AND SHANG-HUA TENG: Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM J. Matrix Anal. Appl., 35(3):835–885, 2014. [doi:10.1137/090771430] 7

[28] JOEL A. TROPP: Freedman's inequality for matrix martingales. Electr. Comm. Probability, 16(25):262–270, 2011. [doi:10.1214/ECP.v16-1624, arXiv:1101.3039] 8
AUTHORS
Michael B. Cohen†
Massachusetts Institute of Technology
Cambridge, MA, USA
Cameron Musco
Assistant Professor
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA, USA
cmusco cs umass edu
https://people.cs.umass.edu/~cmusco/
Jakub Pachocki
Research Lead
OpenAI
San Francisco, CA, USA
jakub openai com
ABOUT THE AUTHORS

MICHAEL B. COHEN (1992–2017) received his Ph. D. from the Massachusetts Institute of Technology posthumously in 2018. He was advised by Jonathan Kelner. He was supported by an NSF Graduate Student Fellowship and was awarded the Machtey Award for the best student paper at FOCS 2016. He made important contributions to algorithmic spectral graph theory, randomized numerical linear algebra, and convex optimization. Michael tragically passed away in September 2017 due to complications from undiagnosed Type I diabetes. He is dearly missed and remembered for his unbounded excitement, his energy, and his love of both knowledge and people.
CAMERON MUSCO is an Assistant Professor at the University of Massachusetts Amherst. He received his Ph. D. from the Massachusetts Institute of Technology in 2018, where he was advised by Nancy Lynch and supported by an NSF Graduate Student Fellowship. After MIT, Cameron was a postdoctoral researcher at Microsoft Research New England. He studies algorithms, often randomized ones, working at the intersection of theoretical computer science, numerical linear algebra, optimization, and machine learning.
JAKUB PACHOCKI is a Research Lead at OpenAI. He received his Ph. D. from Carnegie Mellon University in 2016, where he was advised by Gary Miller. After that he was a postdoctoral fellow under Jelani Nelson at Harvard University.