Authors Reut Levi, Dana Ron, Ronitt Rubinfeld,
License CC-BY-3.0
Theory of Computing, Volume 9 (8), 2013, pp. 295–347
www.theoryofcomputing.org
Testing Properties of Collections of Distributions∗
Reut Levi† Dana Ron‡ Ronitt Rubinfeld§
Received June 14, 2011; Revised August 28, 2012; Published March 10, 2013
Abstract: We propose a framework for studying property testing of collections of distributions, where the number of distributions in the collection is a parameter of the problem.
Previous work on property testing of distributions considered single distributions or pairs of
distributions. We suggest two models that differ in the way the algorithm is given access to
samples from the distributions. In one model the algorithm may ask for a sample from any
distribution of its choice, and in the other the choice of the distribution is random.
Our main focus is on the basic problem of distinguishing between the case that all the
distributions in the collection are the same (or very similar), and the case that it is necessary
to modify the distributions in the collection in a non-negligible manner so as to obtain this
property. We give almost tight upper and lower bounds for this testing problem, as well as
study an extension to a clusterability property. One of our lower bounds directly implies a
lower bound on testing independence of a joint distribution, a result which was left open by
previous work.
ACM Classification: G.3, F.2
AMS Classification: 68Q25
Key words and phrases: property testing, distributions, independence
∗ A preliminary version of this paper appeared in the proceedings of the Second Symposium on Innovations in Computer Science (ICS) 2011 [32].
† Supported by the Israel Science Foundation grant numbers 1147/09 and 246/08.
‡ Supported by the Israel Science Foundation grant number 246/08.
§ Supported by NSF grants 0732334 and 0728645, Marie Curie Reintegration grant PIRG03-GA-2008-231077 and the Israel
Science Foundation grant nos. 1147/09 and 1675/09.
© 2013 Reut Levi, Dana Ron, and Ronitt Rubinfeld
Licensed under a Creative Commons Attribution License (CC-BY) DOI: 10.4086/toc.2013.v009a008
1 Introduction
In recent years, several works have investigated the problem of testing various properties of data that is
most naturally thought of as samples of an unknown distribution. More specifically, the goal in testing a
specific property is to distinguish the case that the samples come from a distribution that has the property
from the case that the samples come from a distribution that is far (usually in terms of the ℓ1 norm, but
other norms have been studied as well) from any distribution that has the property. To give just a few
examples, such tasks include testing whether a distribution is uniform [25, 39] or similar to another known
distribution [12], and testing whether a joint distribution is independent [10]. Related tasks concern
sublinear estimation of various measures of a distribution, such as its entropy [9, 26] or its support
size [41]. Recently, general techniques have been designed to obtain nearly tight lower bounds on such
testing and estimation problems [48, 47].
These types of questions have arisen in several disparate areas, including physics [33, 45, 36],
cryptography and pseudorandom number generation [31], statistics [19, 27, 50, 38, 39, 37], learning
theory [51], property testing of graphs and sequences (e. g., [25, 20, 30, 35, 40, 23]) and streaming
algorithms (e. g., [4, 21, 24, 26, 18, 17, 7, 28, 15, 16, 14, 29]). In these works, there has been significant
focus on properties of distributions over very large domains, where standard statistical techniques based
on learning an approximation of the distribution may be very inefficient.
In this work we consider the setting in which one receives data which is most naturally thought of
as samples of several distributions, for example, when studying purchase patterns in several geographic
locations, or the behavior of linguistic data among varied text sources. Such data could also be generated
when samples of the distributions come from various sensors that are each part of a large sensor-net. In
these examples, it may be reasonable to assume that the number of such distributions might be quite large,
even on the order of a thousand or more. However, for the most part, previous research has considered
properties of at most two distributions [11, 48]. We propose new models of property testing that apply to
properties of several distributions. We then consider the complexity of testing properties within these
models, beginning with properties that we view as basic and expect to serve as useful building blocks for future work. We focus on quantifying the dependence of the sample complexity of the testing algorithms on the number of distributions being considered, as well as on the size of the domain of the distributions.
1.1 Our contributions
1.1.1 The models
We begin by proposing two models that describe possible access patterns to multiple distributions D1, . . . , Dm over the same domain [n]. In these models there is no explicit description of the distributions; the algorithm is only given access to them via independent samples. In the first model, referred
to as the sampling model, at each time step, the algorithm receives a pair of the form (i, j) where i is
selected uniformly in [m] and j ∈ [n] is distributed according to Di . In the second model, referred to
as the query model, at each time step, the algorithm is allowed to specify i ∈ [m] and receives j that is
distributed according to Di . It is immediate that any algorithm in the sampling model can also be used
in the query model. On the other hand, as is implied by our results, there are property testing problems
which have a significantly larger sample complexity in the sampling model than in the query model.
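To make the two access patterns concrete, here is a minimal Python sketch of both oracles (our own illustration, not part of the original text; the collection is represented as an m × n matrix of probabilities, and all function names are ours):

    import numpy as np

    rng = np.random.default_rng(0)

    def query_sample(D, i):
        # Query model: the tester chooses i and receives j distributed as D_i.
        return rng.choice(D.shape[1], p=D[i])

    def uniform_sample(D):
        # (Uniform) sampling model: i is uniform over [m], then j ~ D_i.
        i = rng.integers(D.shape[0])
        return i, query_sample(D, i)

    # Example: m = 3 identical uniform distributions over a domain of size n = 4.
    D = np.full((3, 4), 0.25)
    print(uniform_sample(D))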
In both models the task is to distinguish between the case that the tested distributions have the property
and the case that they are ε-far from having the property, for a given distance parameter ε. Distance to the
property is measured in terms of the average ℓ1-distance between the tested distributions and the closest collection of distributions that have the property. The ℓ1-distance between two probability distributions D1, D2 over domain [n],

‖D1 − D2‖1 = ∑_{i∈[n]} |D1(i) − D2(i)| ,

is perhaps the most standard measure of distance between distributions, as it measures (up to a factor of 2) the maximum difference between the probability of any event (i. e., a set S ⊆ [n]) occurring according to one distribution as compared to the other distribution. In other words, if the distance is small, then the distributions are essentially indistinguishable in terms of their behavior. In all of our results, the dependence of the
algorithms on the distance parameter ε is (inverse) polynomial. Hence, for the sake of succinctness,
in all that follows we do not mention this dependence explicitly. We also consider an extension to the
sampling model in which the distribution over distributions (that is, the distribution of the index i) is
non-uniform (i. e., determined by a weight wi) and the distance measure is adapted accordingly (see the preliminaries section for details).
1.1.2 Testing equivalence in the sampling model
One of the first properties of distributions studied in the property testing model is that of determining
whether two distributions over domain [n] are identical (alternatively, very close) or far (according to the
ℓ1-distance). In [12], an algorithm is given that uses Õ(n^{2/3}) samples^1 and distinguishes between the case that the two distributions are ε-far and the case that they are O(ε/√n)-close. This algorithm has been shown to be nearly tight (in terms of the dependence on n) by Valiant [47]. Valiant also shows that distinguishing between the case that the distributions are ε-far and the case that they are β-close, for two constants ε and β, requires an almost linear dependence on n.
Our main focus is on a natural generalization, which we refer to as the equivalence property of distributions D1, . . . , Dm, in which the goal of the tester is to distinguish the case in which all distributions are the same (or, slightly more generally, that there is a distribution D∗ for which (1/m) ∑_{i=1}^m ‖Di − D∗‖1 ≤ poly(ε)/√n), from the case in which there is no distribution D∗ for which (1/m) ∑_{i=1}^m ‖Di − D∗‖1 ≤ ε. To solve this problem in the (uniform) sampling model with sample complexity Õ(n^{2/3} m) (which ensures with high probability that each distribution is sampled Ω̃(n^{2/3} log m) times), one can make m − 1 calls to the algorithm of [12] to check that every distribution is close to D1.
Our algorithms We show that one can get a better sample complexity dependence on m. Specifically, we give two algorithms, one with sample complexity Õ(n^{2/3}m^{1/3} + m) and the other with sample complexity Õ(n^{1/2}m^{1/2} + n). The first result in fact holds for the case that for each sample pair (i, j), the distribution Di (which generated j) is not necessarily selected uniformly, and furthermore, the weight according to which it is selected is unknown. The second result holds for the case where the selection is non-uniform, but the weights are known. Moreover, the second result extends to the case in which it is desired that the tester pass distributions that are close for each element, to within a multiplicative factor of (1 ± ε/c) for some constant c > 1, and for sufficiently large frequencies. Thus, starting from the known result for m = 2, as long as n ≥ m, the complexity grows as Õ(n^{2/3}m^{1/3} + m) = Õ(n^{2/3}m^{1/3}), and once m ≥ n, the complexity is Õ(n^{1/2}m^{1/2} + n) = Õ(n^{1/2}m^{1/2}) (which is lower than the former expression when m ≥ n).

^1 For a function f over t ≥ 1 variables n_1, . . . , n_t we use Õ(f(n_1, . . . , n_t)) as a shorthand for O(f(n_1, . . . , n_t) · log^k f(n_1, . . . , n_t)) for some constant k.
Both of our algorithms build on the close relation between testing equivalence and testing independence of a joint distribution over [m] × [n], which was studied in [10]. The Õ(n^{2/3}m^{1/3} + m) algorithm follows from [10] after we fill in a certain gap in the analysis of their algorithm, due to an imprecision of a claim given in [11]. The Õ(n^{1/2}m^{1/2} + n) algorithm exploits the fact that i is selected uniformly (or, more generally, according to a known weight wi) to improve on the Õ(n^{2/3}m^{1/3} + m) algorithm (in the case that m ≥ n).
Almost matching lower bounds We show that the behavior of the upper bound on the sample complexity of the problem is not just an artifact of our algorithms, but rather (almost) captures the complexity of the problem. Indeed, we are able to give almost matching lower bounds of Ω(n^{2/3}m^{1/3}) for n = Ω(m log m) and Ω(n^{1/2}m^{1/2}) (for every n and m). The latter lower bound can be viewed as a generalization of a lower bound given in [12], but the analysis is somewhat more subtle.
Our lower bound of Ω(n^{2/3}m^{1/3}) consists of two parts. The first is a general theorem concerning testing symmetric properties of collections of distributions. This theorem extends a central lemma of Valiant [47] on which he builds his lower bounds, and in particular the lower bound of Ω(n^{2/3}) for testing whether two distributions are identical or far from each other (i. e., the case of equivalence for m = 2).
The second part is a construction of two collections of distributions to which the theorem is applied
(where the construction is based on the one proposed in [10] for testing independence). As in [47], the
lower bound is shown by focusing on the similarity between the typical collision statistics of a family
of collections of distributions that have the property and a family of collections of distributions that are
far from having the property. However, since many more types of collisions are expected to occur in the
case of collections of distributions, our proof outline is more intricate and requires new ways of upper
bounding the probabilities of certain types of events.
1.1.3 Testing clusterability in the query model
The second property that we consider is a natural generalization of the equivalence property. The question
we ask is whether the distributions can be partitioned into at most k subsets (clusters), such that within
every cluster the distance between every two distributions is (very) small. We study this property in the
query model, and give an algorithm whose complexity does not depend on the number of distributions
and for which the dependence on n is Õ(n2/3 ). The dependence on k is almost linear. The algorithms
works by combining the diameter clustering algorithm of [3] (for points in a general metric space where
the algorithm has access to the corresponding distance matrix) with the tester for equivalence of a pair of
distributions of [12]. We observe that the above-mentioned lower bound of Ω(n2/3 ) in [47] for testing
equivalence of a pair of distributions implies that sample complexity of Õ(n2/3 ) for the clustering problem
is tight to within polylogarithmic factors in n, for constant k. This is true because one can reduce testing
equivalence of a pair of distributions to testing clusterability for a constant number of clusters.^2
1.1.4 Implications of our results
As noted previously, in the course of proving the lower bound of Ω(n^{2/3}m^{1/3}) for the equivalence property, we prove a general theorem concerning testability of symmetric properties of collections of distributions (which extends a lemma in [47]). This theorem may have applications to proving other lower bounds on collections of distributions. Further byproducts of our research concern the sample complexity of testing whether a joint distribution is independent. More precisely, the following question is considered in [10]: Let Q be a distribution over pairs of elements drawn from [m] × [n] (without loss of generality, assume n ≥ m); what is the sample complexity, in terms of m and n, required to distinguish independent joint distributions from those that are far from the nearest independent joint distribution (in terms of ℓ1 distance)? The proof of the lower bound claimed in [10] contains a known gap. Similar gaps in the lower bounds of [12] for testing the equivalence of a pair of distributions and of [9] for estimating the entropy of a distribution were settled by the work of [47], which applies to symmetric properties. Since independence is not a symmetric property, the work of [47] cannot be directly applied here. In this work, we show that the lower bound of Ω(n^{2/3}m^{1/3}) indeed holds. Furthermore, by the aforementioned correction of the upper bound of Õ(n^{2/3}m^{1/3}) from [10], we get nearly tight bounds on the complexity of testing independence.
1.2 Other related work
Other works on testing and estimating properties of (single or pairs of) distributions include [8, 26, 13,
43, 2, 44, 6, 1, 5].
1.3 Open problems and further research
There are several possible directions for further research on testing properties of collections of distribu-
tions, and we next give a few examples. One natural extension of our results is to give algorithms for
testing the property of clusterability for k > 1 in the sampling model. One may also consider testing
properties of collections of distributions that are defined by certain measures of distributions, and may
be less sensitive to the exact form of the distributions. For example, a very basic measure is the mean
(expected value) of the distribution, when we view the domain [n] as integers instead of element names, or
when we consider other domains. Given this measure, we may consider testing whether the distributions
all have similar means (or whether they should be modified significantly so that this holds). It is not hard
to verify that this property can be quite easily tested in the query model by selecting Θ(1/ε) distributions uniformly and estimating the mean of each. On the other hand, in the sampling model an Ω(√m) lower
^2 Consider the following reduction. Let D∗1 and D∗2 be a pair of distributions over [n] that we wish to test. We wish to accept if D∗1 = D∗2 and to reject if ‖D∗1 − D∗2‖1 > ε. Let D be a collection of m distributions over domain [2n] such that an ε-fraction of the distributions are identical to D∗1 and an ε-fraction are identical to D∗2. We select the remaining t = (1 − 2ε)m distributions, denoted D1, . . . , Dt, so that they have the following properties. First, the support of each is (a subset of) [n + 1, . . . , 2n] (so that it is disjoint from the supports of D∗1 and D∗2). Second, the collection {D1, . . . , Dt} is (k − 1, 0)-clusterable (see Definition 7.1) and ε/(1 − 2ε)-far from being (k − 2, 0)-clusterable. The reduction follows from the fact that D is (k, 0)-clusterable if and only if D∗1 = D∗2, and ε-far from being (k, 0)-clusterable if and only if ‖D∗1 − D∗2‖1 > ε.
bound is quite immediate even for n = 2 (and a constant ε). We are currently investigating whether
the complexity of this problem (in the sampling model) is in fact higher, and it would be interesting to
consider other measures as well.
1.4 Organization
We start by providing notation and definitions in Section 2. In Section 3 we give the lower bound of Ω(n^{2/3}m^{1/3}) for testing equivalence in the uniform sampling model, which is the main technical contribution of this paper. In Section 4 we give our second lower bound (of Ω(n^{1/2}m^{1/2})) for testing equivalence, and our algorithms for the problem follow in Sections 5 and 6. We conclude with our
algorithm for testing clusterability in the query model in Section 7.
2 Preliminaries
Let [n] = {1, . . . , n}, and let D = (D1, . . . , Dm) be a list of m distributions, where Di : [n] → [0, 1] and ∑_{j=1}^n Di(j) = 1 for every 1 ≤ i ≤ m. For a vector v = (v1, . . . , vn) ∈ R^n, let ‖v‖1 = ∑_{i=1}^n |vi| denote the ℓ1 norm of the vector v.
For a property P of lists of distributions and 0 ≤ ε ≤ 1, we say that D is ε-far from (having) P if (1/m) ∑_{i=1}^m ‖Di − D∗i‖1 > ε for every list D∗ = (D∗1, . . . , D∗m) that has the property P. Note that the statistical distance (also known as the total variation distance) between two discrete distributions D and D∗ over [n], that is, max_{S⊆[n]} |∑_{i∈S} D(i) − ∑_{i∈S} D∗(i)|, equals ‖D − D∗‖1/2.
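As a quick numeric illustration of these definitions (our own example, not part of the original text), the following sketch computes the ℓ1 distance and the statistical distance of two small distributions:

    import numpy as np

    def l1_distance(p, q):
        # ℓ1 distance between two distributions given as probability vectors.
        return np.abs(p - q).sum()

    p = np.array([0.5, 0.5, 0.0])
    q = np.array([1/3, 1/3, 1/3])
    print(l1_distance(p, q))       # 2/3
    print(l1_distance(p, q) / 2)   # statistical (total variation) distance: 1/3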
Given a distance parameter ε, a testing algorithm for a property P should distinguish between the
case that D has the property P and the case that it is ε-far from P. We consider two models within which
this task is performed.
The Query Model In this model the testing algorithm may indicate an index 1 ≤ i ≤ m of its choice and
it gets an independent sample j distributed according to Di .
The Sampling Model In this model the algorithm cannot select (“query”) a distribution of its choice.
Rather, it may obtain a pair (i, j) where i is selected uniformly (we refer to this as the Uniform
sampling model) and j is an independent sample distributed according to Di .
We also consider a generalization in which there is an underlying weight vector w = (w1 , . . . , wm )
(where ∑m i=1 wi = 1), and the distribution Di is selected according to w. In this case the notion of
“ε-far” needs to be modified accordingly. We shall say that D is ε-far from P with respect to w if ∑_{i=1}^m wi · ‖Di − D∗i‖1 > ε for every list D∗ = (D∗1, . . . , D∗m) that has the property P.
We consider two variants of this non-uniform model: the known-weights sampling model, in which w is known to the algorithm, and the unknown-weights sampling model, in which w is unknown.
A main focus of this work is on the following property. We shall say that a list D = (D1, . . . , Dm) of m distributions over [n] belongs to P^{eq}_{m,n} (or has the property P^{eq}_{m,n}) if Di = Di′ for all 1 ≤ i, i′ ≤ m.
3 A lower bound for testing equivalence in the uniform sampling model
In this section we prove the following theorem:
Theorem 3.1. Any testing algorithm for the property P^{eq}_{m,n} in the uniform sampling model, for every ε ≤ 1/20 and for n > cm log m where c is some sufficiently large constant, requires Ω(n^{2/3}m^{1/3}) samples.
The proof of Theorem 3.1 consists of two parts. The first is a general theorem (Theorem 3.5)
concerning testing symmetric properties of lists of distributions. This theorem extends a lemma of
Valiant [47, Lem. 4.5.4] (which leads to what Valiant refers to as the “Wishful Thinking Theorem”). The
second part is a construction of two lists of distributions to which Theorem 3.5 is applied. Our analysis
uses a technique called Poissonization [46] (which was used in the past in the context of lower bounds
for testing and estimating properties of distributions in [41, 48, 47]), and hence we first introduce some
preliminaries concerning Poisson distributions. We later provide some intuition regarding the benefits of
Poissonization.
3.1 Preliminaries concerning Poisson distributions
For a positive real number λ, the Poisson distribution poi(λ) takes the value x ∈ N (where N = {0, 1, 2, . . .}) with probability poi(x; λ) = e^{−λ} λ^x / x!. The expectation and variance of poi(λ) are both λ. For λ1 and λ2 we shall use the following bound on the ℓ1 distance between the corresponding Poisson distributions (for a proof see, for example, [41, Claim A.2]):

‖poi(λ1) − poi(λ2)‖1 ≤ 2|λ1 − λ2| .   (3.1)
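As a quick sanity check of equation (3.1) (ours, purely illustrative), one can compare the ℓ1 distance between two truncated Poisson distributions against the bound numerically:

    from scipy.stats import poisson

    def poi_l1(lam1, lam2, cutoff=200):
        # ℓ1 distance between poi(lam1) and poi(lam2), truncated at cutoff.
        return sum(abs(poisson.pmf(x, lam1) - poisson.pmf(x, lam2))
                   for x in range(cutoff))

    lam1, lam2 = 3.0, 3.5
    print(poi_l1(lam1, lam2))      # empirically well below the bound
    print(2 * abs(lam1 - lam2))    # upper bound from (3.1): 1.0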
For a vector ~λ = (λ1, . . . , λd) of positive real numbers, the corresponding multivariate Poisson distribution poi(~λ) is the product distribution poi(λ1) × · · · × poi(λd). That is, poi(~λ) assigns each vector ~x = (x1, . . . , xd) ∈ N^d the probability ∏_{i=1}^d poi(xi; λi).
We shall sometimes consider vectors ~λ whose coordinates are indexed by vectors ~a = (a1, . . . , am) ∈ N^m, and will use ~λ(~a) to denote the coordinate of ~λ that corresponds to ~a. Thus, poi(~λ(~a)) is a univariate Poisson distribution. With a slight abuse of notation, for a subset I ⊆ [d] (or I ⊆ N^m), we let poi(~λ(I)) denote the multivariate Poisson distribution restricted to the coordinates of ~λ in I.
For any two d-dimensional vectors ~λ+ = (λ1+, . . . , λd+) and ~λ− = (λ1−, . . . , λd−) of positive real values, we get from the proof of [47, Lemma 4.5.3] that

‖poi(~λ+) − poi(~λ−)‖1 ≤ ∑_{j=1}^d ‖poi(λj+) − poi(λj−)‖1 .   (3.2)
For our purposes we generalize equation (3.2) in the following lemma.

Lemma 3.2. For any two d-dimensional vectors ~λ+ = (λ1+, . . . , λd+) and ~λ− = (λ1−, . . . , λd−) of positive real values, and for any partition {Ii}_{i=1}^ℓ of [d],

‖poi(~λ+) − poi(~λ−)‖1 ≤ ∑_{i=1}^ℓ ‖poi(~λ+(Ii)) − poi(~λ−(Ii))‖1 .   (3.3)
Proof. Let {Ii}_{i=1}^ℓ be a partition of [d] and let ~i denote (i1, . . . , id). By the triangle inequality we have that for every k ∈ [ℓ],

|poi(~i; ~λ+) − poi(~i; ~λ−)| = |∏_{j∈[d]} poi(ij; λj+) − ∏_{j∈[d]} poi(ij; λj−)|   (3.4)
≤ |∏_{j∈[d]} poi(ij; λj+) − ∏_{j∈[d]\Ik} poi(ij; λj+) ∏_{j∈Ik} poi(ij; λj−)|   (3.5)
+ |∏_{j∈[d]\Ik} poi(ij; λj+) ∏_{j∈Ik} poi(ij; λj−) − ∏_{j∈[d]} poi(ij; λj−)| .   (3.6)

Hence, we obtain that

‖poi(~λ+) − poi(~λ−)‖1 = ∑_{~i∈N^d} |poi(~i; ~λ+) − poi(~i; ~λ−)|   (3.7)
≤ ‖poi(~λ+(Ik)) − poi(~λ−(Ik))‖1   (3.8)
+ ‖poi(~λ+([d] \ Ik)) − poi(~λ−([d] \ Ik))‖1 .   (3.9)

Thus, the lemma follows by induction on ℓ.
We shall also make use of the following lemma.

Lemma 3.3. For any two d-dimensional vectors ~λ+ = (λ1+, . . . , λd+) and ~λ− = (λ1−, . . . , λd−) of positive real values,

‖poi(~λ+) − poi(~λ−)‖1 ≤ 2 √( 2 ∑_{j=1}^d (λj− − λj+)² / λj− ) .   (3.10)
Proof. In order to prove the lemma we shall use the KL-divergence between distributions. For two distributions p1 and p2 over a domain X, we set

D_KL(p1 ‖ p2) = ∑_{x∈X} p1(x) · ln( p1(x)/p2(x) ) .

Let ~λ+ = (λ1+, . . . , λd+), ~λ− = (λ1−, . . . , λd−), and let ~i denote (i1, . . . , id). We have that

ln( poi(~i; ~λ+) / poi(~i; ~λ−) ) = ∑_{j=1}^d ln( e^{λj− − λj+} (λj+/λj−)^{ij} )   (3.11)
= ∑_{j=1}^d ( (λj− − λj+) + ij · ln(λj+/λj−) )   (3.12)
≤ ∑_{j=1}^d ( (λj− − λj+) + ij · (λj+/λj− − 1) ) ,   (3.13)

where in the last inequality we used the fact that ln x ≤ x − 1 for every x > 0. Therefore, we obtain that

D_KL( poi(~λ+) ‖ poi(~λ−) ) = ∑_{~i∈N^d} poi(~i; ~λ+) · ln( poi(~i; ~λ+) / poi(~i; ~λ−) )   (3.14)
≤ ∑_{j=1}^d ( (λj− − λj+) + λj+ · (λj+/λj− − 1) )   (3.15)
= ∑_{j=1}^d (λj− − λj+)² / λj− ,   (3.16)

where in equation (3.15) we used the facts that ∑_{i∈N} poi(i; λ) = 1 and ∑_{i∈N} poi(i; λ) · i = λ. The ℓ1 distance is related to the KL-divergence by

‖D − D′‖1 ≤ 2 √( 2 D_KL(D ‖ D′) ) ,

and thus we obtain the lemma.
The next lemma bounds the probability that a Poisson random variable is significantly smaller than its expected value.

Lemma 3.4. Let X ∼ poi(λ). Then

Pr[X < λ/2] < (3/4)^{λ/4} .   (3.17)
Proof. Consider the matching between j and j + λ/2 for every j = 0, . . . , λ/2 − 1. We consider the ratio between poi(j; λ) and poi(j + λ/2; λ):

poi(j + λ/2; λ) / poi(j; λ) = ( e^{−λ} λ^{j+λ/2} / (j + λ/2)! ) / ( e^{−λ} λ^j / j! )   (3.18)
= λ^{λ/2} / ( (j + λ/2)(j + λ/2 − 1) · · · (j + 1) )   (3.19)
= ( λ/(j + λ/2) ) · ( λ/(j + λ/2 − 1) ) · · · ( λ/(j + 1) )   (3.20)
≥ ( λ/(λ − 1) ) · ( λ/(λ − 2) ) · · · ( λ/(λ/2) )   (3.21)
> ( λ/((3/4)λ) )^{λ/4}   (3.22)
= (4/3)^{λ/4} .   (3.23)

This implies that

Pr[X < λ/2] = ( Pr[X < λ/2] / Pr[λ/2 ≤ X < λ] ) · Pr[λ/2 ≤ X < λ]   (3.24)
< Pr[X < λ/2] / Pr[λ/2 ≤ X < λ]   (3.25)
< (3/4)^{λ/4} ,   (3.26)
and the proof is completed.
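A quick empirical check of Lemma 3.4 (our own illustration): for, say, λ = 40, the bound (3/4)^{λ/4} ≈ 0.056 can be compared against the exact lower-tail probability of the Poisson distribution:

    from scipy.stats import poisson

    lam = 40
    exact = poisson.cdf(lam // 2 - 1, lam)   # Pr[X < λ/2] = Pr[X ≤ λ/2 − 1]
    bound = (3/4) ** (lam / 4)
    print(exact, bound)                      # the exact tail is far below the bound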
The next two notations will play an important technical role in our analysis. For a list of distributions D = (D1, . . . , Dm), an integer κ and a vector ~a = (a1, . . . , am) ∈ N^m, let

p_{D,κ}(j; ~a) = ∏_{i=1}^m poi(ai; κ · Di(j)) .   (3.27)

That is, for a fixed choice of a domain element j ∈ [n], consider performing m independent trials, one for each distribution Di, where in trial i we select a non-negative integer according to the Poisson distribution poi(λ) for λ = κ · Di(j). Then p_{D,κ}(j; ~a) is the probability of the joint event that we get an outcome of ai in trial i, for each i ∈ [m]. Let ~λ_{D,κ} be a vector whose coordinates are indexed by all ~a ∈ N^m, such that

~λ_{D,κ}(~a) = ∑_{j=1}^n p_{D,κ}(j; ~a) .   (3.28)

That is, ~λ_{D,κ}(~a) is the expected number of times we get the joint outcome (a1, . . . , am) if we perform the probabilistic process defined above independently for every j ∈ [n]. In the next subsection we show that the quantity ~λ_{D,κ}(~a) characterizes the view of the tester, provided that it tests a symmetric property, and thus a lower bound can be proved via a construction of two instances of collections that agree on this quantity while one instance has the property and the other is far from having the property.
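In code, both quantities are straightforward to evaluate for any fixed histogram ~a (a sketch of equations (3.27) and (3.28) under our own naming):

    import numpy as np
    from scipy.stats import poisson

    def p_Dk(D, kappa, j, a):
        # p_{D,κ}(j; a) = ∏_i poi(a_i; κ·D_i(j)), equation (3.27).
        return np.prod(poisson.pmf(a, kappa * D[:, j]))

    def lam_Dk(D, kappa, a):
        # λ_{D,κ}(a) = ∑_j p_{D,κ}(j; a), equation (3.28).
        return sum(p_Dk(D, kappa, j, a) for j in range(D.shape[1]))

    # Two identical uniform distributions over [4], histogram a = (1, 1).
    D = np.full((2, 4), 0.25)
    print(lam_Dk(D, kappa=8, a=np.array([1, 1])))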
3.2 Testability of symmetric properties of lists of distributions
In this subsection we prove the following theorem (which is used to prove Theorem 3.1).

Theorem 3.5. Let D+ and D− be two lists of m distributions over [n], all of whose frequencies are at most δ/(κ·m), where κ is some positive integer and 0 < δ ≤ 1/2. If

‖poi(~λ_{D+,κ}) − poi(~λ_{D−,κ})‖1 < 31/60 − 352δ/5 ,   (3.29)

then testing in the uniform sampling model any symmetric property of distributions such that D+ has the property, while D− is Ω(1)-far from having the property, requires Ω(κ · m) samples.
A high-level discussion of the proof of Theorem 3.5 For an element j ∈ [n] and a distribution Di, i ∈ [m], let αi,j be the number of times the pair (i, j) appears in the sample (when the sample is selected according to some sampling model). Thus (α1,j, . . . , αm,j) is the sample histogram of the element j. The histogram of the elements' histograms is called the fingerprint of the sample. That is, the fingerprint indicates, for every ~a ∈ N^m, the number of elements j such that (α1,j, . . . , αm,j) = ~a. As shown in [12], when testing symmetric properties of distributions, it can be assumed without loss of generality that the testing algorithm is provided only with the fingerprint of the sample. Furthermore, since the number, n, of elements is fixed, it suffices to give the tester the fingerprint of the sample without the ~0 = (0, . . . , 0) entry.
For example, consider the distributions D1 and D2 over {1, 2, 3} such that D1[j] = 1/3 for every j ∈ {1, 2, 3}, D2[1] = D2[2] = 1/2 and D2[3] = 0. Assume that we sample (D1, D2) four times, according to the uniform sampling model, and we get the samples (1, 1), (2, 1), (2, 2), (1, 3), where the first coordinate denotes the distribution and the second coordinate denotes the element. Then the sample histogram of element 1 is (1, 1), because element 1 was selected once by D1 and once by D2. For the elements 2, 3 we have the sample histograms (0, 1) and (1, 0), respectively. The fingerprint of the sample is (0, 1, 1, 0, 1, 0, 0, . . .) for the following order of histograms: ((0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (3, 0), . . .).
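The following sketch (ours) computes the sample histograms and reproduces the fingerprint of this small example:

    from collections import Counter

    m, samples = 2, [(1, 1), (2, 1), (2, 2), (1, 3)]

    # Sample histogram of each element j: how many times each D_i produced j.
    hist = {}
    for i, j in samples:
        hist.setdefault(j, [0] * m)[i - 1] += 1

    # Fingerprint: for each histogram vector, the number of elements having it.
    fingerprint = Counter(tuple(h) for h in hist.values())
    print(dict(fingerprint))   # {(1, 1): 1, (0, 1): 1, (1, 0): 1}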
In order to prove Theorem 3.5, we would like to show that the distributions of the fingerprints when
the sample is generated according to D+ and when it is generated according to D− are similar, for a
sample size that is below the lower bound stated in the theorem. For each choice of element j ∈ [n] and a
distribution Di, the number of times the pair (i, j) appears, i. e., αi,j, depends on the number of times the other samples appear, simply because the total number of samples is fixed. Furthermore, for each histogram ~a, the number of elements with sample histogram identical to ~a depends on the number of times the other histograms appear, because the number of samples is fixed. For instance, in the example above, if we know that we have the histogram (0, 1) once and the histogram (1, 1) once, then we know that the third histogram cannot be (2, 0). There is a further dependence because the number of elements is fixed as well.
We thus see that the distribution of the fingerprints is rather difficult to analyze (and therefore it is difficult to bound the statistical distance between two different such distributions). Therefore, we would like to break as many of the above dependencies as possible. To this end we define a slightly different process for generating the samples that involves Poissonization [46]. In the Poissonized process the number of samples we take from each distribution Di, denoted by κi′, is distributed according to the Poisson distribution. We prove that, while the overall number of samples the Poissonized process takes is larger than that of the uniform process by only a constant factor, with very high probability κi′ > κi for every i, where κi is the number of samples taken from Di. This implies that if we prove a lower bound for algorithms that receive samples generated by the Poissonized process, then we obtain a related lower bound for algorithms that work in the uniform sampling model.
As opposed to the process that takes a fixed number of samples according to the uniform sampling model, the benefit of the Poissonized process is that the αi,j's determined by this process are independent. Therefore, the type of sample histogram that element j has is completely independent of the types of sample histograms the other elements have. We get that the fingerprint distribution is a generalized multinomial distribution, which has been studied by Roos [42] (the connection is due to Valiant [48]).
Definition 3.6. In the Poissonized uniform sampling model with parameter κ (which we’ll refer to as the
κ-Poissonized model), given a list D = (D1 , . . . , Dm ) of m distributions, a sample is generated as follows.
• Draw m independent samples κ1 , . . . , κm from poi(κ),
• Return κi independent samples distributed according to Di for each i ∈ [m].
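A minimal sketch of this sampling process (ours; the representation of D as a matrix of probabilities follows the earlier sketches):

    import numpy as np

    rng = np.random.default_rng(0)

    def poissonized_sample(D, kappa):
        # Generate a sample in the κ-Poissonized model (Definition 3.6).
        m, n = D.shape
        counts = rng.poisson(kappa, size=m)       # κ_1, ..., κ_m ~ poi(κ)
        return [(i, j) for i in range(m)
                for j in rng.choice(n, size=counts[i], p=D[i])]

    D = np.full((3, 4), 0.25)
    print(len(poissonized_sample(D, kappa=5)))    # around κ·m = 15 samples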
Lemma 3.7. Assume there exists a tester T in the uniform sampling model for a property P of lists of m distributions that takes a sample of size s = κm, where κ ≥ c for some sufficiently large constant c, and works for every ε ≥ ε0 where ε0 is a constant (and whose success probability is at least 2/3). Then there exists a tester T′ for P in the Poissonized uniform sampling model with parameter 2κ that works for every ε ≥ ε0 and whose success probability is at least 19/30.
Proof. Roughly speaking, the tester T′ tries to simulate T if it has a sufficiently large sample, and otherwise it guesses the answer. More precisely, let T′ be a tester in the Poissonized uniform sampling model with parameter 2κ. The tester T′ receives a set of samples, S′, where each sample in the set is distributed uniformly and independently over the m distributions, and their number, κ′, is distributed as κ′ ∼ poi(2κm) (see [22, p. 216]). By Lemma 3.4 we have that

Pr[κ′ < κm] ≤ (3/4)^{κm/2} .   (3.30)

If κ′ ≥ κm then T′ picks a random subset of samples, S ⊂ S′, of size κm and simulates T on S. Otherwise it “accepts” or “rejects” with equal probability. Since the samples in S are distributed as κm samples drawn according to the uniform sampling model, whenever κ′ ≥ κm the tester T′ successfully simulates a tester in the uniform sampling model.

The probability that κ′ ≥ κm is at least 1 − (3/4)^{κm/2}, which is greater than 4/5 for κ > c and a sufficiently large constant c. Therefore, the success probability of T′ is at least

(4/5) · (2/3) + (1/5) · (1/2) = 19/30 ,
as desired.
Given Lemma 3.7 it suffices to consider samples that are generated in the Poissonized uniform sampling model. The process for generating a sample {α1,j, . . . , αm,j}_{j∈[n]} (recall that αi,j is the number of times that element j was selected by distribution Di) in the κ-Poissonized model is equivalent to the following process: for each i ∈ [m] and j ∈ [n], independently select αi,j according to poi(κ · Di(j)) (see [22, p. 216]). Thus the probability of getting a particular histogram ~a = (a1, . . . , am) for element j is p_{D,κ}(j; ~a) (as defined in equation (3.27)). We can represent the event that the histogram of element j is ~a by a Bernoulli random vector ~bj that is indexed by all ~v ∈ N^m, is 1 in the coordinate corresponding to ~a, and is 0 elsewhere. Given this representation, the fingerprint of the sample corresponds to ∑_{j=1}^n ~bj.
As defined above, the dimension of the ~bj's is unbounded. This is simply because poi(λ) is unbounded for every λ > 0, and so every possible histogram has non-zero probability. However, we may consider vectors of finite dimension (we show in the proof of Theorem 3.5 that this is sufficient) and map the histograms that we do not consider to the vector (0, . . . , 0). Roos's theorem, stated next, shows that the distribution of the fingerprints can be approximated by a multivariate Poisson distribution (the Poisson here is related to the fact that the fingerprints' distributions are generalized multinomial distributions, and not related to the Poisson from the Poissonization process). For simplicity, the theorem is stated for vectors ~bj that are indexed directly, that is, ~bj = (bj,1, . . . , bj,h).
Theorem 3.8 ([42]). Let D_{Sn} be the distribution of the sum Sn of n independent Bernoulli random vectors ~b1, . . . , ~bn in R^h, where Pr[~bj = ~eℓ] = p_{j,ℓ} and Pr[~bj = (0, . . . , 0)] = 1 − ∑_{ℓ=1}^h p_{j,ℓ} (here ~eℓ denotes the standard unit vector whose ℓ-th coordinate is 1 and whose other coordinates are 0). Define an h-dimensional vector ~λ = (λ1, . . . , λh) by λℓ := ∑_{j=1}^n p_{j,ℓ}. Then

‖D_{Sn} − poi(~λ)‖1 ≤ (88/5) ∑_{ℓ=1}^h ( ∑_{j=1}^n p_{j,ℓ}² ) / ( ∑_{j=1}^n p_{j,ℓ} ) .   (3.31)
We next show how to obtain a bound on sums of the form given in equation (3.31) under appropriate conditions.

Lemma 3.9. Given a list D = (D1, . . . , Dm) of m distributions over [n] and a real number 0 < δ ≤ 1/2 such that for all i ∈ [m] and for all j ∈ [n], Di(j) ≤ δ/(m · κ) for some integer κ, we have that

∑_{~a∈N^m\{~0}} ( ∑_{j=1}^n p_{D,κ}(j; ~a)² ) / ( ∑_{j=1}^n p_{D,κ}(j; ~a) ) ≤ 2δ .   (3.32)
Proof.

∑_{~a∈N^m\{~0}} ( ∑_{j=1}^n p_{D,κ}(j; ~a)² ) / ( ∑_{j=1}^n p_{D,κ}(j; ~a) ) ≤ ∑_{~a∈N^m\{~0}} max_j p_{D,κ}(j; ~a)   (3.33)
= ∑_{~a∈N^m\{~0}} max_j ∏_{i=1}^m poi(ai; κ · Di(j))   (3.34)
≤ ∑_{~a∈N^m\{~0}} (δ/m)^{a1+···+am}   (3.35)
≤ ∑_{a=1}^∞ m^a (δ/m)^a   (3.36)
≤ 2δ ,   (3.37)

where the inequality in equation (3.37) holds for δ ≤ 1/2, and the inequality in equation (3.35) follows from

poi(a; κ · Di(j)) = e^{−κ·Di(j)} (κ · Di(j))^a / a!   (3.38)
≤ (κ · Di(j))^a   (3.39)
≤ (δ/m)^a ,   (3.40)

and the proof is completed.
Proof of Theorem 3.5. Let D^{D−} and D^{D+} denote the distributions of the fingerprints when taking samples in the κ-Poissonized model (see Definition 3.6) from D− and D+, respectively. We first bound the ℓ1-distance between D^{D−} and D^{D+}, and then prove the desired result for the uniform sampling model by applying Lemma 3.7. Let A ⊂ N^m be some finite subset of histograms such that for both D− and D+, the probability (in the κ-Poissonized model) that at least one of the elements has a sample histogram ~a ∉ A is smaller than 1/120 (it is easy to verify, for example by Markov's inequality, that such a subset exists). Denote by F the subset of fingerprints describing outcomes where for each element in [n] the sample histogram is in A. Hence,

∑_{~f∈F̄} |D^{D−}(~f) − D^{D+}(~f)| ≤ 1/60 .
We next turn to bounding

∑_{~f∈F} |D^{D−}(~f) − D^{D+}(~f)| .

By the first premise of the theorem, D+i(j), D−i(j) ≤ δ/(κm) for every i ∈ [m] and j ∈ [n]. By Lemma 3.9 this implies that equation (3.32) holds both for D = D+ and for D = D−. Combining this with Theorem 3.8, we get that the ℓ1-distance between the fingerprint distribution over F when the sample is generated according to D+ (in the κ-Poissonized model) and the distribution poi(~λ_{D+,κ}) is at most (88/5) · 2δ = 176δ/5, and an analogous statement holds for D−. By applying the premise in equation (3.29) (concerning the ℓ1 distance between poi(~λ_{D+,κ}) and poi(~λ_{D−,κ})) and the triangle inequality, we get that

∑_{~f∈F̄∪F} |D^{D−}(~f) − D^{D+}(~f)| ≤ 2 · (176δ/5) + (31/60 − 352δ/5) + 1/60 = 16/30 ,

which implies that the statistical distance is smaller than 8/30. Thus a tester that accepts D+ and rejects D− with probability greater than

1/2 + (1/2) · (8/30) = 19/30 ,

viewing only the above-mentioned fingerprints, does not exist. By Lemma 3.7 the theorem follows.
3.3 Proof of Theorem 3.1
In this subsection we show how to apply Theorem 3.5 to two lists of distributions, D+ and D−, which we will define shortly, where D+ ∈ Peq = P^{eq}_{m,n} while D− is (1/20)-far from Peq. Recall that by the premise of Theorem 3.1, n ≥ cm log m for some sufficiently large constant c > 1. In the proof it will be convenient to assume that m is even and that n (which corresponds in Lemma 3.10 to 2t) is divisible by 4. It is not hard to verify that it is possible to reduce the general case to this case. In order to define D−, we shall need the next lemma.
Lemma 3.10. For every two even integers m and t, there exists a 0/1-valued matrix M with m rows and t columns for which the following holds:

1. In each row and each column of M, exactly half of the elements are 1 and the other half are 0.

2. For every integer 2 ≤ x < m/2, and for every subset S ⊆ [m] of size x, the number of columns j such that M[i, j] = 1 for every i ∈ S is at least

t · ( (1/2^x)(1 − 2x²/m) − √(2x ln m / t) ) ,

and at most

t · ( 1/2^x + √(2x ln m / t) ) .
Proof. Consider selecting a matrix M randomly as follows. Denote the first t/2 columns of M by F. For
each column in F, pick, independently from the other t/2 − 1 columns in F, a random half of its elements
to be 1, and the other half of the elements to be 0. Columns t/2 + 1, . . . ,t are the negations of columns
1, . . . ,t/2, respectively. Thus, in each row and each column of M, exactly half of the elements are 1 and
the other half are 0.
Consider a fixed choice of x. For each column j between 1 and t, each subset of rows S ⊆ [m] of size x, and b ∈ {0, 1}, define the indicator random variable I_{S,j,b} to be 1 if and only if M[i, j] = b for every i ∈ S. Hence,

Pr[I_{S,j,b} = 1] = (1/2) · (1/2 − 1/m) · . . . · (1/2 − (x − 1)/m) .   (3.41)

Clearly, Pr[I_{S,j,b} = 1] < 1/2^x. On the other hand,

Pr[I_{S,j,b} = 1] ≥ (1/2 − x/m)^x   (3.42)
= (1/2^x)(1 − 2x/m)^x   (3.43)
≥ (1/2^x)(1 − 2x²/m) ,   (3.44)

where the last inequality is due to Bernoulli's inequality, which states that (1 + x)^n > 1 + nx for every real, nonzero x > −1 and integer n > 1 ([34]).
Let E_{S,b} denote the expected value of ∑_{j=1}^{t/2} I_{S,j,b}. From the fact that columns t/2 + 1, . . . , t are the negations of columns 1, . . . , t/2 it follows that

∑_{j=t/2+1}^t I_{S,j,1} = ∑_{j=1}^{t/2} I_{S,j,0} .

Therefore, the expected number of columns 1 ≤ j ≤ t such that M[i, j] = 1 for every i ∈ S is simply E_{S,1} + E_{S,0} (that is, at most t · 1/2^x and at least (t · 1/2^x)(1 − 2x²/m)). By the additive Chernoff bound,

Pr[ |∑_{j=1}^{t/2} I_{S,j,b} − E_{S,b}| > √(tx ln m / 2) ] < 2 exp(−2(t/2)(2x ln m)/t)   (3.45)
= 2m^{−2x} .   (3.46)

Thus, by taking a union bound (over b ∈ {0, 1}),

Pr[ |∑_{j=1}^t I_{S,j,1} − (E_{S,1} + E_{S,0})| > √(2tx ln m) ] < 4m^{−2x} .   (3.47)

By taking a union bound over all subsets S we get that M has the desired properties with probability greater than 0.
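The probabilistic construction in this proof is easy to instantiate (a sketch under our own naming; since the desired properties hold with positive probability, in practice one would verify the counting conditions of the lemma and re-draw if necessary):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_M(m, t):
        # Draw M as in the proof: random balanced left half, negated right half.
        F = np.zeros((m, t // 2), dtype=int)
        for col in range(t // 2):
            ones = rng.choice(m, size=m // 2, replace=False)
            F[ones, col] = 1
        return np.hstack([F, 1 - F])   # columns t/2+1, ..., t negate columns 1, ..., t/2

    M = random_M(m=8, t=16)
    print(M.sum(axis=0))   # every column sums to m/2 = 4
    print(M.sum(axis=1))   # every row sums to t/2 = 8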
We first define D+, in which all distributions are identical. Specifically, for each i ∈ [m]:

D+i(j) = 1/(n^{2/3}m^{1/3})   if 1 ≤ j ≤ n^{2/3}m^{1/3}/2 ,
D+i(j) = 1/n                  if n/2 < j ≤ n ,
D+i(j) = 0                    otherwise.   (3.48)

We now turn to defining D−. Let M be a matrix as in Lemma 3.10 for t = n/2. For every i ∈ [m]:

D−i(j) = 1/(n^{2/3}m^{1/3})   if 1 ≤ j ≤ n^{2/3}m^{1/3}/2 ,
D−i(j) = 2/n                  if n/2 < j ≤ n and M[i, j − n/2] = 1 ,
D−i(j) = 0                    otherwise.   (3.49)

For both D+ and D−, we refer to the elements 1 ≤ j ≤ n^{2/3}m^{1/3}/2 as the heavy elements, and to the elements n/2 < j ≤ n as the light elements. Observe that each heavy element has exactly the same probability weight, 1/(n^{2/3}m^{1/3}), in all distributions D+i and D−i. On the other hand, for each light element j, while D+i(j) = 1/n (for every i), in D− we have that D−i(j) = 2/n for half of the distributions (those selected by M) and D−i(j) = 0 for the other half (those not selected by M). We later use the properties of M to bound the ℓ1 distance between the fingerprint distributions of D+ and D−.
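For concreteness, here is a sketch (ours) that builds D+ and D− as probability matrices from a given matrix M, following equations (3.48) and (3.49):

    import numpy as np

    def build_plus_minus(M, n):
        # Construct D+ and D− of equations (3.48)/(3.49); M has m rows and n/2 columns.
        m = M.shape[0]
        heavy = round(n ** (2 / 3) * m ** (1 / 3) / 2)   # number of heavy elements
        Dp = np.zeros((m, n)); Dm = np.zeros((m, n))
        Dp[:, :heavy] = Dm[:, :heavy] = 1 / (n ** (2 / 3) * m ** (1 / 3))
        Dp[:, n // 2:] = 1 / n        # light elements in D+
        Dm[:, n // 2:] = 2 * M / n    # light elements in D−
        return Dp, Dm

    # Toy balanced M (it satisfies only the balance condition of Lemma 3.10).
    i, j = np.indices((8, 32))
    Dp, Dm = build_plus_minus((i + j) % 2, n=64)
    print(Dp.sum(axis=1), Dm.sum(axis=1))   # every row sums to 1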
A high-level discussion To gain some intuition before delving into the detailed proof, consider first the special case that m = 2 (which was studied by Valiant [48]; indeed the construction is the same as the one he analyzes, and was initially proposed in [11]). In this case each heavy element has probability weight Θ(1/n^{2/3}) and we would like to establish a lower bound of Ω(n^{2/3}) on the number of samples required to distinguish between D+ and D−. That is, we would like to show that the corresponding fingerprints' distributions when the sample is of size o(n^{2/3}) are very similar.

The first main observation is that since the probability weight of light elements is Θ(1/n) in both D+ and D−, the probability that a light element will appear more than twice in a sample of size o(n^{2/3}) is very small. That is (using the fingerprints of histograms notation we introduced previously), for each ~a = (a1, a2) such that a1 + a2 > 2, the sample will not include (with high probability) any light element j such that α1,j = a1 and α2,j = a2 (for both D+ and D−). Moreover, for every x ∈ {1, 2}, the expected number of elements j such that (α1,j, α2,j) = (x, 0) is the same in D+ and D−, as is the variance (by symmetry, the same applies to (0, x)). Thus, most of the difference between the fingerprints' distributions is due to the numbers of elements j such that (α1,j, α2,j) = (1, 1). For this setting we do expect to see a non-negligible difference for light elements between D+ and D− (in particular, we cannot get the (1, 1) histogram for light elements in D−, as opposed to D+).
Here is where the heavy elements come into play. Recall that in both D+ and D− the heavy elements have the same probability weight, so that the expected number of heavy elements j such that (α1,j, α2,j) = (1, 1) is the same for D+ and D−. However, intuitively, the variance of these numbers for the heavy elements “swamps” the differences between the light elements, so that it is not possible to distinguish between D+ and D−. The actual proof, which formalizes (and quantifies) this intuition, considers the difference between the values of the vectors ~λ_{D+,κ} and ~λ_{D−,κ} (as defined in equation (3.28)) in the coordinates corresponding to ~a such that a1 + a2 = 2. We can then apply Lemmas 3.2 and 3.3 to obtain equation (3.29) in Theorem 3.5.
Turning to m > 2, it is no longer true that in a sample of size o(n^{2/3}m^{1/3}) we will not get histogram vectors ~a such that ∑_{i=1}^m ai > 2 for light elements. Thus we have to deal with many more vectors ~a (of dimension m) and to bound the total contribution of all of them to the difference between the fingerprints of D+ and of D−. To this end we partition the set of all possible histogram vectors into several subsets, according to their Hamming weight ∑_{i=1}^m ai and depending on whether all the ai's are in {0, 1} or there exists at least one ai such that ai ≥ 2. In particular, to deal with the former (whose number, for each choice of Hamming weight x, is relatively large, i. e., roughly m^x), we use the properties of the matrix M based on which D− is defined. We note that from the analysis we see that, similarly to the case m = 2, we need the variance of the heavy elements to play a role only for the cases where ∑_{i=1}^m ai = 2, while in the other cases the total contribution of the light elements is rather small.

In the remainder of this section we provide the details of the analysis.
Before establishing that indeed D− is Ω(1)-far from Peq, we introduce some more notation (which will be used throughout the remainder of the proof of Theorem 3.1). Let Sx be the set of vectors of dimension m that contain exactly x coordinates that are 1, while all the rest are 0 (which corresponds to an element that was sampled once or 0 times by each distribution). Let Ax be the set of vectors of dimension m whose coordinates sum up to x and that contain at least one coordinate that is at least 2 (which corresponds to an element that was sampled at least twice by at least one distribution). More formally, for any integer x, we define the following two subsets of N^m:

Sx = { ~a ∈ N^m : ∑_{i=1}^m ai = x and ai < 2 for every i ∈ [m] } ,   (3.50)

and

Ax = { ~a ∈ N^m : ∑_{i=1}^m ai = x and ai ≥ 2 for some i ∈ [m] } .   (3.51)

For ~a ∈ N^m, let sup(~a) = {i : ai ≠ 0} denote the support of ~a, and let

I_M(~a) = { j : D−i(j) = 2/n for every i ∈ sup(~a) } .   (3.52)

Note that in terms of the matrix M (based on which D− is defined), I_M(~a) consists of the columns in M whose restriction to the support of ~a contains only 1's. In terms of D−, it corresponds to the set of light elements that might have a sample histogram of ~a (when sampling according to D−).
Lemma 3.11. For every m > 5 and n ≥ c ln m for some sufficiently large constant c, we have that

∑_{i=1}^m ‖D−i − D∗‖1 > m/20

for every distribution D∗ over [n]. That is, the list D− is (1/20)-far from Peq.
Proof. Consider any ~a ∈ S2. By Lemma 3.10, setting t = n/2, the size of I_M(~a), i. e., the number of light elements ℓ such that D−i[ℓ] = 2/n for every i ∈ sup(~a), is at most

(n/2) · ( 1/4 + √(8 ln m / n) ) .

The same upper bound holds for the number of light elements ℓ such that D−i[ℓ] = 0 for every i ∈ sup(~a). This implies that for every i ≠ i′ in [m], for at least

n/2 − n · ( 1/4 + √(8 ln m / n) )

of the light elements ℓ we have that D−i[ℓ] = 2/n while D−i′[ℓ] = 0, or that D−i′[ℓ] = 2/n while D−i[ℓ] = 0. Therefore,

‖D−i − D−i′‖1 ≥ 1/2 − 2√(8 ln m / n) ,

which for n ≥ c ln m and a sufficiently large constant c is at least 1/8. Thus, by the triangle inequality we have that for every D∗,

∑_{i=1}^m ‖D−i − D∗‖1 ≥ ⌊m/2⌋ · (1/8) ,

which is greater than m/20 for m > 5.
In what follows we work towards establishing that equation (3.29) in Theorem 3.5 holds for D+ and D−. Set κ = δ · n^{2/3}/m^{2/3}, where δ is a constant to be determined later. We shall use the shorthand ~λ+ for ~λ_{D+,κ}, and ~λ− for ~λ_{D−,κ} (recall that the notation ~λ_{D,κ} was introduced in equation (3.28)). By the definition of ~λ+, for each ~a ∈ N^m,

~λ+(~a) = ∑_{j=1}^n ∏_{i=1}^m e^{−κ·D+i(j)} (κ · D+i(j))^{ai} / ai!   (3.53)
= ∑_{j=1}^{n^{2/3}m^{1/3}/2} ∏_{i=1}^m e^{−δ/m} (δ/m)^{ai} / ai! + ∑_{j=n/2+1}^n ∏_{i=1}^m e^{−δ/(n^{1/3}m^{2/3})} (δ/(n^{1/3}m^{2/3}))^{ai} / ai!   (3.54)
= (n^{2/3}m^{1/3} / (2e^δ)) ∏_{i=1}^m (δ/m)^{ai} / ai! + (n / (2e^{δ(m/n)^{1/3}})) ∏_{i=1}^m (δ/(n^{1/3}m^{2/3}))^{ai} / ai! .   (3.55)

By the construction of M, for every light j, ∑_{i=1}^m D−i(j) = (m/2) · (2/n) = m/n. Therefore,

~λ−(~a) = (n^{2/3}m^{1/3} / (2e^δ)) ∏_{i=1}^m (δ/m)^{ai} / ai! + (1 / e^{δ(m/n)^{1/3}}) ∑_{j∈I_M(~a)} ∏_{i=1}^m (2δ/(n^{1/3}m^{2/3}))^{ai} / ai! .   (3.56)

Hence, ~λ+(~a) and ~λ−(~a) differ only in the term that corresponds to the contribution of the light elements. Equations (3.55) and (3.56) demonstrate why we choose M with the specific properties defined
in Lemma 3.10. First of all, in order for every D−i to be a probability distribution, we want each row of M to sum up to exactly n/4. We also want each column of M to sum up to exactly m/2, in order to get ∏_{i=1}^m e^{−κ·D+i(j)} = ∏_{i=1}^m e^{−κ·D−i(j)}. Finally, we would have liked |I_M(~a)| · ∏_{i=1}^m 2^{ai} to equal n/2 for every ~a. This would imply that ~λ+(~a) and ~λ−(~a) are equal. As we show below, this is in fact true for every ~a ∈ S1. For vectors ~a ∈ Sx where x > 1, the second condition in Lemma 3.10 ensures that |I_M(~a)| is sufficiently close to (n/2) · (1/2^x). This property of M is not necessary in order to bound the contribution of the vectors in Ax. The bound that we give for those vectors is less tight, but since there are fewer such vectors, it suffices.
We start by considering the contribution to equation (3.29) of histogram vectors ~a ∈ S1 (i. e., vectors of the form (0, . . . , 0, 1, 0, . . . , 0)), which correspond to the number of elements that are sampled only by one distribution, once. We prove that in the Poissonized uniform sampling model, for every ~a ∈ S1 the number of elements with such a sample histogram is distributed exactly the same in D+ and D−.

Lemma 3.12.

∑_{~a∈S1} ‖poi(~λ+(~a)) − poi(~λ−(~a))‖1 = 0 .   (3.57)
Proof. For every ~a ∈ S1, the size of I_M(~a) is n/4, thus

∑_{j∈I_M(~a)} ∏_{i=1}^m (2δ/(n^{1/3}m^{2/3}))^{ai} / ai! = (n/2) ∏_{i=1}^m (δ/(n^{1/3}m^{2/3}))^{ai} / ai! .   (3.58)

By equations (3.55) and (3.56), it follows that ~λ+(~a) − ~λ−(~a) = 0 for every ~a ∈ S1. The lemma follows by applying equation (3.1).
We now turn to bounding the contribution to equation (3.29) of histogram vectors ~a ∈ A2 (i. e., vectors of the form (0, . . . , 0, 2, 0, . . . , 0)), which correspond to the number of elements that are sampled only by one distribution, twice.

Lemma 3.13.

‖poi(~λ+(A2)) − poi(~λ−(A2))‖1 ≤ 3δ .   (3.59)
Proof. For every ~a ∈ A2, the size of I_M(~a) is n/4, thus

∑_{j∈I_M(~a)} ∏_{i=1}^m (2δ/(n^{1/3}m^{2/3}))^{ai} / ai! = n ∏_{i=1}^m (δ/(n^{1/3}m^{2/3}))^{ai} / ai! .   (3.60)

By equations (3.55), (3.56) and (3.60) it follows that

~λ−(~a) − ~λ+(~a) = (n / (2e^{δ(m/n)^{1/3}})) ∏_{i=1}^m (δ/(n^{1/3}m^{2/3}))^{ai} / ai! = (n^{1/3} / (4e^{δ(m/n)^{1/3}})) · (δ²/m^{4/3}) ,   (3.61)

and that

~λ−(~a) ≥ (n^{2/3}m^{1/3} / (2e^δ)) ∏_{i=1}^m (δ/m)^{ai} / ai! = (n^{2/3} / (4e^δ)) · (δ²/m^{5/3}) .   (3.62)

By equations (3.61) and (3.62) we have that

(~λ−(~a) − ~λ+(~a))² / ~λ−(~a) ≤ (e^{δ − 2δ(m/n)^{1/3}} / (4m)) · δ² ≤ δ²/m .   (3.63)

By equation (3.63) and the fact that |A2| = m, we get

∑_{~a∈A2} (~λ−(~a) − ~λ+(~a))² / ~λ−(~a) ≤ m · (δ²/m) = δ² .   (3.64)

The lemma follows by applying Lemma 3.3.
Recall that for a subset I of N^m, poi(~λ(I)) denotes the multivariate Poisson distribution restricted to the coordinates of ~λ that are indexed by the vectors in I. We deal separately with Sx for 2 ≤ x < m/2 and for x ≥ m/2; our main efforts are with respect to the former, as the latter corresponds to very low probability events.

Lemma 3.14. For m ≥ 16, n ≥ cm ln m (where c is a sufficiently large constant) and for δ ≤ 1/16,

‖poi(~λ+(⋃_{x=2}^{m/2−1} Sx)) − poi(~λ−(⋃_{x=2}^{m/2−1} Sx))‖1 ≤ 32δ .   (3.65)
Proof. Let ~a be a vector in Sx. By the definition of Sx, every coordinate of ~a is 0 or 1. Therefore we can simplify equation (3.55) as follows: for each ~a ∈ ⋃_{x=2}^{m/2−1} Sx,

~λ+(~a) = (n^{2/3}m^{1/3} / (2e^δ)) · (δ/m)^x + (n / (2e^{δ(m/n)^{1/3}})) · (δ/(n^{1/3}m^{2/3}))^x .   (3.66)

By Lemma 3.10, for every ~a ∈ ⋃_{x=2}^{m/2−1} Sx, the size of I_M(~a) is at most (n/2) · (1/2^x + √(4x ln m / n)) and at least (n/2) · (1/2^x − 2x²/(2^x m) − √(4x ln m / n)). By equation (3.56) this implies that

~λ−(~a) = (n^{2/3}m^{1/3} / (2e^δ)) · (δ/m)^x + (n / (2e^{δ(m/n)^{1/3}})) · (1/2^x + η) · (2δ/(n^{1/3}m^{2/3}))^x ,   (3.67)

where

−( 2x²/(2^x m) + √(4x ln m / n) ) ≤ η ≤ √(4x ln m / n) ,

and thus

|η| ≤ √(x/m) · ( 2x²/(2^x √m) + √(4m ln m / n) ) .

By the facts that n ≥ cm ln m for some sufficiently large constant c, and that 2x²/(2^x √m) ≤ 1/2 for every 2 ≤ x < m/2 and m ≥ 16, we obtain that |η| ≤ √(x/m). So we have that

(~λ+(~a) − ~λ−(~a))² ≤ ( (n / (2e^{δ(m/n)^{1/3}})) · (2δ/(n^{1/3}m^{2/3}))^x )² · (x/m) ≤ (n²/4) · (4δ²/(n^{2/3}m^{4/3}))^x · (x/m) ,   (3.68)

and that

~λ−(~a) ≥ (n^{2/3}m^{1/3} / (2e^δ)) · (δ/m)^x .   (3.69)

Then we get, for δ ≤ 1/2, that

(~λ+(~a) − ~λ−(~a))² / ~λ−(~a) ≤ (e^δ n^{4/3} / (2m^{1/3})) · (4δ/(n^{2/3}m^{1/3}))^x · (x/m)   (3.70)
≤ (n^{4/3}/m^{1/3}) · (4δ/(n^{2/3}m^{1/3}))^x · (x/m)   (3.71)
≤ (n^{4/3}/m^{4/3}) · (4x^{1/x}δ/(n^{2/3}m^{1/3}))^x   (3.72)
≤ (n^{4/3}/m^{4/3}) · (8δ/(n^{2/3}m^{1/3}))^x .   (3.73)

Summing over all ~a ∈ ⋃_{x=2}^{m/2−1} Sx we get:

∑_{~a∈⋃_{x=2}^{m/2−1} Sx} (~λ−(~a) − ~λ+(~a))² / ~λ−(~a) ≤ ∑_{x=2}^∞ (n^{4/3}/m^{4/3}) · (8δ m^{2/3}/n^{2/3})^x   (3.74)
= ∑_{x=0}^∞ 64δ² · (8δ m^{2/3}/n^{2/3})^x   (3.75)
≤ 64δ²/(1 − 8δ)   (3.76)
≤ 128δ² ,   (3.77)

where in equation (3.75) we used the fact that n > m, and equation (3.77) holds for δ ≤ 1/16. The lemma follows by applying Lemma 3.3.
Lemma 3.15. For n ≥ m, m ≥ 12 and δ ≤ 1/4,

∑_{x≥m/2} ∑_{~a∈Sx} ‖poi(~λ+(~a)) − poi(~λ−(~a))‖1 ≤ 32δ³ .
Proof. We first observe that |Sx| ≤ m^x/x for every x ≥ 6. To see why this is true, observe that |Sx| equals the number of possibilities of arranging x balls in m bins, i. e.,

|Sx| = (m choose x) ≤ (em/x)^x ≤ m^x/x ,   (3.78)

where we have used the premise that m ≥ 12 and thus x ≥ 6. By equations (3.55) and (3.56) (and the fact that |x − y| ≤ max{x, y} for all positive real numbers x, y),

∑_{x≥m/2} ∑_{~a∈Sx} |~λ+(~a) − ~λ−(~a)| ≤ ∑_{x≥m/2} ∑_{~a∈Sx} (n/2) ∏_{i=1}^m (2δ/(n^{1/3}m^{2/3}))^{ai}   (3.79)
= ∑_{x≥m/2} ∑_{~a∈Sx} (n/2) (2δ/(n^{1/3}m^{2/3}))^{∑_{i=1}^m ai}   (3.80)
≤ ∑_{x=m/2}^∞ (m^x/x) · (n/2) · (2δ/(n^{1/3}m^{2/3}))^x   (3.81)
≤ ∑_{x=m/2}^∞ (2m^x/m) · (n/2) · (2δ/(n^{1/3}m^{2/3}))^x   (3.82)
= (n/m) ∑_{x=m/2}^∞ (2δ m^{1/3}/n^{1/3})^x   (3.83)
= 8δ³ ∑_{x=m/2−3}^∞ (2δ m^{1/3}/n^{1/3})^x   (3.84)
≤ 8δ³/(1 − 2δ)   (3.85)
≤ 16δ³ ,   (3.86)

where in equation (3.85) we used the fact that n ≥ m, and equation (3.86) holds for δ ≤ 1/4. The lemma follows by applying equation (3.1).
We finally turn to the contribution of ~a ∈ Ax such that x ≥ 3.

Lemma 3.16. For n ≥ m and δ ≤ 1/4,

∑_{x≥3} ∑_{~a∈Ax} ‖poi(~λ+(~a)) − poi(~λ−(~a))‖1 ≤ 16δ³ .   (3.87)
Proof. We first observe that |Ax| ≤ m^{x−1} for every x. To see why this is true, observe that |Ax| is at most the number of possibilities of arranging x − 1 balls, one of which is a “special” (“double”) ball, in m bins. By equations (3.55) and (3.56) (and the fact that |x − y| ≤ max{x, y} for all positive real numbers x, y),

∑_{x≥3} ∑_{~a∈Ax} |~λ+(~a) − ~λ−(~a)| ≤ ∑_{x≥3} ∑_{~a∈Ax} (n/2) ∏_{i=1}^m (2δ/(n^{1/3}m^{2/3}))^{ai}   (3.88)
= ∑_{x≥3} ∑_{~a∈Ax} (n/2) (2δ/(n^{1/3}m^{2/3}))^{∑_{i=1}^m ai}   (3.89)
≤ ∑_{x=3}^∞ m^{x−1} · (n/2) · (2δ/(n^{1/3}m^{2/3}))^x   (3.90)
= (n/(2m)) ∑_{x=3}^∞ (2δ m^{1/3}/n^{1/3})^x   (3.91)
= 4δ³ ∑_{x=0}^∞ (2δ m^{1/3}/n^{1/3})^x   (3.92)
≤ 4δ³/(1 − 2δ)   (3.93)
≤ 8δ³ ,   (3.94)

where in equation (3.93) we used the fact that n ≥ m, and equation (3.94) holds for δ ≤ 1/4. The lemma follows by applying equation (3.1).
We are now ready to finalize the proof of Theorem 3.1.
Proof of Theorem 3.1. Let D+ and D− be as defined in equations (3.48) and (3.49), respectively, and
recall that κ = δ · n2/3 /m2/3 (where δ will be set subsequently). By the definition of the distributions
in D+ and D− , the probability weight assigned to each element is at most 1/(n2/3 m1/3 ) = δ /(κ · m), as
required by Theorem 3.5. By Lemma 3.11, D− is (1/20)-far from Peq . Therefore, it remains to establish
that equation (3.29) holds for D+ and D− . Consider the following partition of Nm :
m/2−1
[
{~a}~a∈S1 , A2 , Sx , {~a}~a∈Sx≥m/2 Sx , {~a}~a∈Sx≥3 Ax , (3.95)
x=2
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 317
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
where {~a}~a∈T denotes the list of all singletons of elements in T . By Lemma 3.2 it follows that
poi(~λ + ) − poi(~λ − ) ≤ ∑ poi(~λ + (~a) − poi(~λ − (~a)) 1 (3.96)
1
~a∈S1
+ poi(~λ + (A2 ) − poi(~λ − (A2 )) (3.97)
1
m/2−1 m/2−1
+ poi(~λ + ( Sx )) − poi(~λ − (
[ [
Sx )) (3.98)
1
x=2 x=2
+ ∑ ∑ poi(~λ + (~a) − poi(~λ − (~a)) 1 (3.99)
x≥m/2 ~a∈Sx
+ ∑ ∑ poi(~λ + (~a) − poi(~λ − (~a)) . (3.100)
x≥3 ~a∈Ax 1
For δ < 1/16 we get by Lemmas 3.12–3.16 that
poi(~λ + ) − poi(~λ − ) ≤ 35δ + 48δ 3 , (3.101)
1
which is less than
31 352δ
−
60 5
for δ = 1/200. Hence, the theorem follows from Theorem 3.5.
3.4 A lower bound for testing independence
In this subsection we show that Theorem 3.1 implies a lower bound on testing independence of joint
distributions over [m] × [n]. A tester for independence gets samples from a distribution Q over [m] × [n]
and distinguishes between Q that is a product distribution Q1 × Q2 , where Q1 is a distribution over [m]
and Q2 is a distribution over [n] (i. e., Q(i, j) = Q1 (i) · Q2 ( j) for every (i, j) ∈ [m] × [n]) and Q that is
ε-far in the `1 -distance from any product distribution. In other words, if we denote by π1 Q the marginal
distribution according to Q of the first coordinate, i, and by π2 Q the marginal distribution of the second
coordinate, j, then Q has the property of independence if π1 Q and π2 Q are independent. With a slight
abuse of the terminology, we shall say in such a case that Q is independent.
In Lemma 3.18 we prove that the problem of testing equivalence of a collection of distributions in the
sampling model reduces to the problem of testing whether a pair of random variables are independent
given samples from their joint distribution. Due to this relation between the two problems we get that any
lower bound on the number of samples for the problem of equivalence implies a lower bound for testing
independence. Therefore, the lower bound in Theorem 3.1 holds for testing independence, as stated in
Corollary 3.19. In the proof of the lemma we shall use the following proposition.
Proposition 3.17 ([10]). Let p,q be distributions over [m] × [n]. If kp − qk1 ≤ ε/3 and q is independent,
then kp − π1 p × π2 pk1 ≤ ε.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 318
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Lemma 3.18. If there exists an algorithm T for testing whether a joint distribution Q over [m] × [n]
is independent using a sample of size s(m, n, ε), then there exists an algorithm T 0 for testing whether
D ∈ Peq in the unknown-weights sampling model using a sample of size s(m, n, ε/3).
If T is provided with (and uses) an explicit description of the marginal distribution π1 Q, then the
claim holds for T 0 in the known-weights sampling model.
s(m,n,ε/3)
Proof. Given a sample {(i` , j` )}`=1 generated according to D in the sampling model with a weight
vector ~w = (w1 , . . . , wm ), the algorithm T 0 simply runs T on the sample and returns the answer that T
gives. If ~w is known, then T 0 provides T with ~w (as the marginal distribution of i). If D1 , . . . , Dm are
identical and equal to some D∗ , then for each (i, j) ∈ [m] × [n] we have that the probability of getting (i, j)
in the sample is wi · D∗ ( j). That is, the joint distribution of the first and second coordinates is independent
and therefore T (and hence T 0 ) accepts with probability at least 2/3.
On the other hand, suppose that D is ε-far from Peq , that is, ∑m ∗
i=1 wi · kDi − D k1 > ε for every
distribution D∗ over [n]. In such a case, in particular we have that ∑m i=1 wi · Di − D 1 > ε, where D is
the distribution over [n] such that D( j) = ∑m i=1 w i · Di ( j). By Proposition 3.17, the joint distribution Q
over i and j (determined by the list D and the sampling process) is ε/3-far from independent, so T (and
hence T 0 ) rejects with probability greater than 2/3.
The next corollary follows directly from Lemma 3.18 and Theorem 3.1.
Corollary 3.19. Given a joint distribution Q over [m] × [n] it is impossible to test if Q is independent or
1/48-far from independent using o(n2/3 m1/3 ) samples.
4 A lower bound for testing equivalence in the uniform sampling model
In this section we prove the following theorem.
eq
Theorem 4.1. Testing the property Pm,n in the uniform sampling model for every ε ≤ 1/2 and m ≥ 64
requires Ω(n1/2 m1/2 ) samples.
We assume without loss of generality that n is even (or else, we set the probability weight of the
element n to 0 in all distributions considered, and work with n − 1 that is even). Define Hn to be the set
of all distributions over [n] that have probability 2/n on exactly half of the elements and 0 on the other
half. Define Hnm to be the set of all possible lists of m distributions from Hn . Define Umn to consist of only
a single list of m distributions each of which is identical to Un , where Un denotes the uniform distribution
eq
over [n]. Thus the single list in Umn belongs to Pm,n . On the other hand we show, in Lemma 4.2, that Hn
m
eq
contains mostly lists of distributions that are Ω(1)-far from Pm,n . However, we also show, in Lemma 4.3,
that any tester in the uniform sampling model that takes less than n1/2 m1/2 /6 samples cannot distinguish
between D that was uniformly drawn from Hnm and D = (Un , . . . ,Un ) ∈ Um n . Details follow.
√
Lemma 4.2. For every m ≥ 3, with probability at least (1 − 2/ m) over the choice of D ∼ UHnm , where
eq
UHnm denotes the uniform distribution over Hnm , we have that D is (1/2)-far from Pm,n .
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 319
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
√
Proof. We need to prove that with probability at least (1 − 2/ m) over the choice of D, for every
v = (v1 , . . . , vn ) ∈ Rn which corresponds to a distribution (i. e., v j ≥ 0 for every j ∈ [n] and ∑nj=1 v j = 1),
1 m 1
∑ kDi − vk1 > . (4.1)
m i=1 2
We shall actually prove a slightly more general statement, namely, that equation (4.1) holds for every
vector v ∈ Rn . We define the function med D : [n] → [0, 1] by med D ( j) = µ 1 (D1 ( j), . . . , Dm ( j)), where
2
µ 1 (x1 , . . . , xm ) denotes the median of x1 , . . . , xm (where if m is even, it is the value in position m/2 in
2
sorted non-decreasing order). The sum ∑m i=1 |xi − c| is minimized when c = µ 12 (x1 , . . . , xm ). Therefore,
for every D and every vector v ∈ Rn ,
m m
∑ Di − med D 1 ≤ ∑ kDi − vk1 . (4.2)
i=1 i=1
Recall that for every D = (D1 , . . . , Dm ) in Hnm , and for each (i, j) ∈ [m] × [n], we have that either
Di ( j) = 2/n, or Di ( j) = 0. Thus, med D ( j) = 0 when Di ( j) = 0 for at least half of the i’s in [m] and
med D ( j) = 2/n otherwise. We next show that for every (i, j) ∈ [m] × [n], the probability over D ∼ UHnm
that Di ( j) will have the same value as med D ( j) is at most a bit bigger than half. More precisely, we show
that for every (i, j) ∈ [m] × [n]:
h i 1 1
D
Pr Di ( j) 6= med ( j) ≥ 1− √ . (4.3)
D∼UHnm 2 m
Fix (i, j) ∈ [m] × [n], and consider selecting D uniformly at random from Hnm . Suppose we first determine
the values Di0 ( j) for i0 6= i, and set Di ( j) in the end. For each (i0 , j) the probability that Di0 ( j) = 0 is 1/2,
and the probability that Di0 ( j) = 2/n is 1/2. If at least m/2 of the outcomes are 0, or more than m/2 are
2/n, then the value of med D ( j) is already determined. Conditioned on this we have that the probability
that Di ( j) 6= med D ( j) is exactly 1/2. On the other hand, in the complementary event, i. e., when for odd
m there are (m − 1)/2 that are 0 and (m − 1)/2 that are 2/n, and for even m there are m/2 − 1 that are 0
and m/2 that are 2/n, it holds that med D ( j) = Di ( j). We thus bound the probability of this event. First
consider the case that m is even:
1 m 1 m m 1 m! 1
Pr Bin m, = − 1 < Pr Bin m, = = m · m = m m · m. (4.4)
2 2 2 2 2 2 2!2! 2
√ m
By Stirling’s approximation, m! = 2πm me eλm , where λm is a parameter that satisfies
1 1
< λm < ,
12m + 1 12m
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 320
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
thus,
√ 1
m! 1 2πm( me )m e 12m 1
m m · m < 1 · m (4.5)
2!2! 2 ( 2πm/2( m/2 m/2 e 12m/2+1 )2 2
p
e )
1 2
e 12m − 6m+1
= p (4.6)
πm/2
1
< p (4.7)
πm/2
1
< √ . (4.8)
m
where Inequalities (4.7) and (4.8) hold for m ≥ 3. In case m is odd, the probability (over the choice of
Di0 ( j) for i0 6= i) that med D ( j) is determined by Di ( j) is
1 m−1 1 m+1
Pr Bin m, = = Pr Bin m + 1, =
2 2 2 2
√
which is at most 1/ m + 1 by equation (4.8). Hence, equation (4.3) holds for all m and we obtain the
following bound on the expectation
" #
m m n h i
D
ED∼UHnm ∑ Di − med = ∑ ∑ ED∼UHnm Di ( j) − med D ( j) (4.9)
i=1 1 i=1 j=1
h i 2
= m·n· D j (i) 6= med D (i) ·
Pr (4.10)
D∼UHnm G= n
1 1 2
≥ m·n· 1− √ · (4.11)
2 m n
√
= m− m. (4.12)
while the maximum value is bounded as
m m n
∑ Di − med D = ∑ ∑ Di ( j) − med D ( j) (4.13)
i=1 1 i=1 j=1
n
m 2
≤ ∑ 2 ·n (4.14)
j=1
= m. (4.15)
Assume for the sake of contradiction that
" #
m
m 2
Pr
D∼UHnm
∑ Di − med D 1 ≤ 2 > √m . (4.16)
i=1
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 321
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
then by equation (4.15) we have,
" #
m
D
ED∼UHnm ∑ Di − med 1
(4.17)
i=1
" #
m
m m
≤ Pr
D∼UHnm
∑ Di − med D 1 ≤ 2 · 2 (4.18)
i=1
" #!
m
m
+ 1 − Pr ∑ Di − med D 1 ≤ 2 · m
D∼UHnm i=1
(4.19)
" #
m
m m
= m − Pr ∑ Di − med D ≤ · (4.20)
D∼UHnm i=1 1 2 2
√
< m− m , (4.21)
which contradicts equation (4.12).
Recall that for an element j ∈ [n] and a distribution Di , i ∈ [m], we let αi, j denote the number of
times the pair (i, j) appears in the sample (when the sample is selected in the uniform sampling model).
Thus (a1, j , . . . , am, j ) is the sample histogram of the element j. Since the sample points are selected
independently, a sample is simply the union of the histograms of the different elements, or equivalently, a
matrix M in Nm×n .
Lemma 4.3. Let U be the distribution of the histogram of q samples taken from the uniform distribution
over [m] × [n], and let H be the distribution of the histogram of q samples taken from a random list of
distributions in Hnm . Then,
4q2
kU − Hk1 ≤ . (4.22)
mn
Proof. For every matrix M ∈ Nm×n , let AM be the event of getting the histogram M i. e., M[i, j] = x if
element j is chosen exactly x times from distribution Di in the sample; For every ~x = (x1 , . . . , xm ) ∈ Nm ,
let B~x be the event of getting a histogram M with exactly xi samples from distribution Di i. e., such that
for every i ∈ [m], ∑ j∈[n] M[i, j] = xi ; Let C be the event of getting a histogram M such that there exists
(i, j) ∈ [m] × [n] such that M[i, j] ≥ 2; Let
V = {B~x : Pr B~x ∩C > 0}
H
(where C denotes the event complementary to C). Recall that if we take strictly less than n/2 samples
then conditioned on the event that there are no collisions in the sample, i. e., each element is drawn at
most once, a sample from Un and a sample from a random distribution in Hn are distributed exactly
the same. We extend this observation for samples from H and U as follows. For every B~x ∈ V , given
the occurrence of B~x ∩C, i. e., given the histogram projected on the first coordinate and given that there
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 322
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
were no collisions, samples from H and U are distributed the same. We shall use this fact to bound the
statistical distance between H and U. We next formalize this.
kU − Hk1 = ∑ Pr (AM ) − Pr (AM ) + ∑ Pr (AM ) − Pr (AM ) (4.23)
AM ⊆C U H U H
AM ⊆C
≤ Pr (C) + Pr (C) + ∑ Pr (AM ) − Pr (AM ) . (4.24)
U H U H
AM ⊆C
We start by bounding the third term in equation (4.24).
∑ PrU (AM ) − Pr
H
(AM ) = ∑ ∑ Pr (AM ) − Pr (AM ) (4.25)
AM ⊆C B~x AM ⊆B~x ∩C U H
= ∑ ∑ Pr (AM ) − Pr (AM ) (4.26)
B~x ∈V AM ⊆B~x ∩C U H
+ ∑ ∑ Pr (AM ) − Pr (AM ) . (4.27)
U H
B~x ∈V AM ⊆B~x ∩C
We next bound the expression in equation (4.26).
∑ ∑ Pr (AM ) − Pr (AM ) (4.28)
B~x ∈V AM ⊆B~x ∩C U H
= ∑ PrU (B~x ) ∑ Pr AM | B~x ∩C · Pr C | B~x − Pr C | B~x (4.29)
B~x ∈V U U H
AM ⊆B~x ∩C
= ∑ PrU (B~x ) PrU C | B~x − Pr
H
C | B~x (4.30)
B~x ∈V
= ∑ Pr (B~x ) 1 − Pr (C | B~x ) − 1 − Pr (C | B~x ) (4.31)
B~x ∈V U U H
= ∑ Pr (B~x ) PrU (C | B~x ) − Pr (C | B~x ) (4.32)
B~x ∈V U H
≤ Pr (C) + Pr (C) , (4.33)
U H
where in equation (4.29) we used the fact that for every B~x ∈ V, M ∈ Nm×n , PrU (B~x ) = PrH (B~x ) and
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 323
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
PrU AM | B~x ∩C = PrH AM | B~x ∩C . Turning to the expression in equation (4.27),
∑ ∑ Pr (AM ) − Pr (AM ) = ∑ ∑ Pr (AM ) (4.34)
U H U
B~x ∈V AM ⊆B~x ∩C B~x ∈V AM ⊆B~x ∩C
≤ ∑ PrU (B~x ) (4.35)
B~x ∈V
= ∑ Pr(B~x ) (4.36)
H
B~x ∈V
= ∑ Pr(B~x ∩C) (4.37)
H
B~x ∈V
≤ Pr(C) , (4.38)
H
where the first equality follows from the fact that B~x ∈ V , hence by the definition of V we get that
PrH (AM ) = 0. We thus obtain that kU − Hk1 ≤ 2 PrU (C) + 3 PrH (C). If we take q uniform independent
samples from [`], then by a union bound over the q samples, the probability to get a collision is at most
1 2 q−1
+ +···+
` ` `
which is q2 /(2`). Thus,
q2 q2 4q2
2 Pr (C) + 3 Pr (C) ≤ 2 · +3· = ,
U H 2mn mn mn
and the lemma follows.
eq
Proof of Theorem 4.1. Assume there is a tester, T , for the property Pm,n in the uniform sampling model,
which takes q ≤ m1/2 n1/2 /6 samples. By Lemma 4.2,
2 2 1
Pr [A accepts D] ≤ √ ·1+ 1− √ · (4.39)
D∼UHnm m m 3
1 4
= 1+ √ (4.40)
3 m
1
≤ , (4.41)
2
where the last inequality holds for m ≥ 64. By Lemma 4.3, for q ≤ m1/2 n1/2 /6,
1 1
kU − Hk1 ≤ ,
2 18
while by equation (4.41),
!
2 1 1
Pr [A accepts D] − Pr [A accepts D] ≥ − > .
D∼UHnm D∼UHnm 3 2 18
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 324
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
5 Algorithms for testing equivalence in the sampling model
In this section we state our two main theorems (Theorems 5.1 and 5.2) regarding testing Equivalence in
the sampling model. We prove Theorem 5.1 in this section. In Section 6 we prove a stronger version
of Theorem 5.2 (Theorem 6.11) as well as a stronger version of Theorem 5.1 (Theorem 6.13). We have
chosen to give the proof of Theorem 5.1, in addition to the proof of Theorem 6.13, because it is simpler
than the latter.
Theorem 5.1. Let D be a list of m distributions over [n]. It is possible to test whether D ∈ Peq in the
e 2/3 m1/3 + m) · poly(1/ε)).
unknown-weights sampling model using a sample of size O((n
Theorem 5.2. Let D be a list of m distributions over [n]. It is possible to test whether D ∈ Peq in the
e 1/2 m1/2 + n) · poly(1/ε)).
known-weights sampling model using a sample of size O((n
Thus, when the weight vector ~w is known, and in particular when all weights are equal (the uniform
sampling model), we get a combined upper bound of O(min{n
e 2/3 m1/3 + m, n1/2 m1/2 + n} · poly(1/ε)).
Namely, as long as n ≥ m, the complexity (in terms of n and m) grows as O(ne 2/3 m1/3 ), and when m ≥ n,
it grows as O(n
e 1/2 1/2
m ).
5.1 Proof of Theorem 5.1
By Lemma 3.18, in order to prove Theorem 5.1 it suffices to design an algorithm for testing independence
of a joint distribution (with the complexity stated in the theorem). Indeed, testing independence was
studied in [10]. However, there was a certain flaw in one of the claims on which their analysis built
(Theorem 15 in [10], which is attributed to [11]), and hence we fix the flaw next (building on [12], which
is the full version of [11]).
Given a sampling access to a pair of distributions p and q and bounds on their `∞ -norm b p and bq ,
respectively, the algorithm Bounded-`∞ -Closeness-Test (Algorithm 1) tests the equivalence of p and q.
The sample complexity of the algorithm depends on b p and bq , as described in the next theorem.
For a multiset of sample points F over a domain R and an element j ∈ R, let occ( j, F) de-
note the number of times that j appears in the sample F and define the collision count of F to be
def
coll(F) = ∑ j∈R occ(2j,F) .
Theorem 5.3. Let p and q be two distributions over the same finite domain R. Suppose that kpk∞ ≤ b p and
kqk∞ ≤ bq where bq ≥ b p . For every ε ≤ 1/4 , Algorithm Bounded-`∞ -Closeness-Test(p, q, b p , bq , |R|, ε)
has the following properties.
1. If kp − qk1 ≤ ε/(2|R|1/2 ), then the test accepts with probability at least 2/3.
2. If kp − qk1 > ε, then the test rejects with probability at least 2/3.
1/2
The algorithm takes O |R| · b p /ε 2 + |R|2 · bq · b p /ε 4 sample points from each distribution.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 325
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
Bounded-`∞ -Closeness-Test (sampling access to p, sampling access to q, b p , bq , |R|, ε > 0)
1/2
1. Take samples Fp1 and Fp2 from p, each of size t, where t = O |R| · b p /ε 2 + |R|2 · bq · b p /ε 4
2. Take samples Fq2 and Fq2 from q, each of size t
3. Let r p = coll(Fp1 ) be the number of collisions in Fp1
4. Let rq = coll(Fq1 ) be the number of collisions in Fq1
5. Let s p,q = ∑ j∈R (occ( j, Fp2 ) · occ( j, Fq2 )) be the number of collisions between Fp2 and Fq2
6. If rq > (7/4) 2t b p then REJECT
def def
2t def
7. Define δ = ε/|R|1/2 , r = t−1 (r p + rq ) and s = 2s p,q
8. If r − s > t 2 δ 2 /2 then REJECT, otherwise ACCEPT
Algorithm 1: The algorithm for testing `1 distance when `∞ is bounded.
Proof. Following the analysis of [11, Lemma 5], we have that
Exp[r − s] = t 2 kp − qk22 , (5.1)
and we have the following bounds on the variances of r p , rq and s (for some constant c):
Var[s] ≤ ct 2 ∑ p(`)q(`) + ct 3 ∑ (p(`)q(`)2 + p(`)2 q(`)) , (5.2)
`∈R `∈R
Var[r p ] ≤ ct 2 ∑ p(`)2 + ct 3 ∑ p(`)3 , (5.3)
`∈R `∈R
and
Var[rq ] ≤ ct 2 ∑ q(`)2 + ct 3 ∑ q(`)3 . (5.4)
`∈R `∈R
Using the bounds we have on the `∞ norms of p and q we get (possibly for a larger constant c):
Var[s] ≤ ct 2 kpk∞ + ct 3 (kpk∞ kqk22 + kpk2∞ ) ≤ ct 2 b p + ct 3 (b p kqk22 + b2p ) , (5.5)
Var[r p ] ≤ ct 2 kpk22 + ct 3 kpk∞ kpk22 ≤ ct 2 kpk∞ + ct 3 kpk2∞ ≤ ct 2 b p + ct 3 b2p , (5.6)
and
Var[rq ] ≤ ct 2 kqk22 + ct 3 kqk∞ kqk22 ≤ ct 2 kqk22 + ct 3 bq kqk22 . (5.7)
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 326
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
By equations (5.5) and (5.7), a tighter bound on kqk22 will imply a tighter bound on Var[s] and Var[rq ].
To this end, the check in Step 6 in the algorithm was added to the original `2 -Distance-Test of [11].
This check is beneficial in achieving a tighter bound on the sample complexity. First, prove that the
2
tester distinguishes with high constant probability between the case that
2 t
kqk2 > 2b p and the case that
kqk2 ≤ (3/2)b p by rejecting (with high probability) when rq > (7/4) 2 b p . Notice that by the triangle
inequality kp − qk2 ≥ kqk2 − kpk2 . Thus, if kqk22 > (3/2)b p and kpk22 ≤ b p then it follows that
1/2 1/2
p
kp − qk2 ≥ (3/2)b p − b p .
Therefore, by the fact that b p ≥ 1/|R|, we obtain that
p
kp − qk1 ≥ kp − qk2 ≥ (3/2) − 1 /|R|1/2
1/2 2
which is greater than ε/(2|R| ) for ε ≤ 1/4. Consider first the case that kqk2 > 2b p , so that Exp[rq ] >
t
2 2 b p . Then we can bound the probability that the tester accepts, that is, that rq ≤ (7/4) 2t b p, by the
probability that rq < (7/8)Exp[rq ]. In the case that kqk22 ≤ (3/2)b p , so that Exp[rq ] ≤ (3/2) 2t b p , we
can bound the probability that the tester rejects, that is, that rq > (7/4) 2t b p , by the probability that
rq > (7/6)Exp[rq ]. Then the probability to accept when kqk22 > 2b p and reject when kqk22 ≤ b p is upper
bounded by
Pr[|rq − Exp[rq ]| > Exp[rq ]/8 .
Now, using the upper bound on the variance of rq that we have (the first bound in equation (5.7)), the fact
that for every distribution q over R, kqk22 ≤ 1/|R| and Exp[rq ] = 2t kqk22 , we have that
64Var[rq ]
Pr[|rq − Exp[rq ]| > Exp[rq ]/8] ≤ (5.8)
Exp2 [rq ]
c · (t 2 kqk22 + t 3 kqk∞ kqk22 )
≤ (5.9)
t 4 kqk42
c ckqk∞
= 2 + (5.10)
t kqk2 tkqk22
2
c|R| c|R|kqk∞
≤ + . (5.11)
t2 t
To make this a small constant, we choose t so that
t = Ω |R|1/2 + |R|bq . (5.12)
Next, we prove that the tester distinguishes between the case that kp − qk2 > δ and kp − qk2 ≤ δ /2 by
rejecting when r − s > t 2 δ 2 /2. We have that Exp[r − s] = t 2 kp − qk22 . Chebyshev gives us that
Pr[|A − Exp[A]| > ρ] ≤ Var[A]/ρ 2 ,
and so, for the case kp − qk2 > δ (i. e., Exp[r − s] > t 2 δ 2 ) we have that
Pr[r − s < t 2 δ 2 /2] ≤ Pr[|(r − s) − Exp[r − s]| < t 2 δ 2 /2] (5.13)
4Var[r − s]
≤ , (5.14)
t 4δ 4
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 327
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
and similarly, for the case kp − qk2 ≤ δ /2 (i. e., Exp[r − s] ≤ t 2 δ 2 /4) we have that
Pr[r − s ≥ t 2 δ 2 /2] ≤ Pr[|(r − s) − Exp[r − s]| < t 2 δ 2 /4] (5.15)
16Var[r − s]
≤ . (5.16)
t 4δ 4
That is, we want
Var[r − s]
t 4δ 4
which is of the order of
Var[r − s] · |R|2
t 4ε 4
to be a small constant. If we use
4t 2
Var[r − s] = (Var[r p ] + Var[rq ]) + Var[s] ,
(t − 1)2
then we need to ensure that each of
Var[r p ] · |R|2 Var[rq ] · |R|2 Var[s] · |R|2
, and
t 4ε 4 t 4ε 4 t 4ε 4
is a small constant, which by equations (5.5), (5.6), (5.7), and the premise that kqk22 ≤ 2b p , holds when
1/2 2 2 4
t = Ω |R| · b p /ε + |R| · bq · b p /ε , (5.17)
since both b p , bq ≥ 1/|R|, the bound in equation (5.17) dominates the bound in equation (5.12). Hence,
the bound on the sample complexity follows.
As a corollary of Theorem 5.3 we obtain:
Theorem 5.4. Let Q be a distribution over [m] × [n] such that Q satisfies: kπ1 Qk∞ ≤ b1 , kπ2 Qk∞ ≤ b2
1/2 1/2
and b2 ≤ b1 . There is a test that takes O(nmb1 b2 /ε 2 + n2 m2 b1 b22 /ε 4 ) samples from Q, such that if Q
is independent, then the test accepts with probability at least 2/3 and if Q is ε-far from independent, then
the test rejects with probability at least 2/3.
Proof. By the premise of the theorem we have that kQk∞ ≤ b2 and that kπ1 Q × π2 Qk∞ ≤ b1 · b2 . Ap-
1/2 1/2
plying Theorem 5.3 we can test if Q is identical to π1 Q × π2 Q using sample of size O(nmb1 b2 /ε 2 +
n2 m2 b1 b22 /ε 4 ) from3 Q. If Q is independent, then Q equals π1 Q × π2 Q and the tester accepts with
probability at least 2/3. If Q is ε-far from independent, then in particular Q is ε-far from π1 Q × π2 Q and
the tester rejects with probability at least 2/3.
3 We obtain a sample from π Q × π Q by simply taking two independent samples from Q, (i , j ) and (i , j ) and considering
1 2 1 1 2 2
(i1 , j2 ) as a sample from π1 Q × π2 Q.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 328
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Applying Theorem 5.4 with b1 = 1/m, b2 = 1/n2/3 m1/3 , and combining that in the sample analysis of
TestLightIndependence [10], the following theorem is obtained:4
Theorem 5.5 ([10]). There is an algorithm that given a distribution Q over [m] × [n] and an ε > 0,
• If Q is independent then the test accepts with high probability.
• If Q is ε-far from independent then the test rejects with high probability.
e (n2/3 m1/3 + m)poly(ε −1 ) samples.
The algorithm uses O
Finally, Theorem 5.1 follows by combining Theorem 5.5 with Lemma 3.18.
6 Algorithms for tolerant testing of equivalence in the sampling model
Given a list of distributions D, a tolerant equivalence tester is guaranteed to accept, with high probability,
if the distributions in D are close (and not necessarily identical), and reject D, with high probability, if
the distributions in D are far. In this section we prove Theorems 6.11 and 6.13. Theorem 6.11 states
that there is a tolerant equivalence tester taking Õ(n1/2 m1/2 + n) samples in the known-weights sampling
model. Theorem 6.13 states that there is a tolerant equivalence tester taking Õ(n2/3 m1/3 + m) samples in
the unknown-weights sampling model. A tolerant equivalence tester is also a non-tolerant equivalence
tester, so Theorems 6.11 and 6.13 are stronger versions of Theorems 5.2 and 5.1, respectively.
6.1 An algorithm for tolerant testing of identity in the sampling model
Consider the problem where given sample access to a distribution p and an explicit description of a
distribution q, the algorithm should accept ,with high probability, if p and q are identical, and should
reject, with high probability, if p and q are far. This is called Identity Testing and is defined in [10]. If the
algorithm is guaranteed to accept p and q that are close, and not necessarily identical, we refer to it as a
tolerant identity test. We will use the tolerant identity test as a subroutine in the algorithms for tolerant
testing of equivalence.
We next present and prove Theorem 6.3, which states that there is a tolerant identity tester taking
√
Õ( n) samples. The theorem is a restatement of theorems in [49] and [10]. The specific tolerance of
Theorem 6.3 is somewhat complex and in order to state it we introduce the following new definitions.
Definition 6.1. For two parameters α, β ∈ (0, 1), we say that a distribution p is an (α, β )-multiplicative
approximation of a distribution q (over the same domain R) if the following holds.
• For every i ∈ R such that q(i) ≥ α we have that q(i) · (1 − β ) ≤ p(i) ≤ q(i) · (1 + β ).
• For every i ∈ R such that q(i) < α we have that p(i) < α · (1 + β ).
4 The procedure TestLightIndependence [10] is called by the TestIndependence [10] Algorithm. In Step (5) of TestLightIn-
dependence use Algorithm 1 to obtain the desired result.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 329
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
Definition 6.2. For α ∈ (0, 1), we say that a distribution p is an α-additive approximation of a distribution
q (over the same domain R) if for every i ∈ R, |p(i) − q(i)| ≤ α.
Theorem 6.3 (Adapted from [49], [10]). Given sample access to p, a black-box distribution over a
finite domain R, and q, an explicitly specified distribution over R, for every 0 < ε ≤ 1/3, algorithm
Tolerant-Identity-Test(p, q, n, ε) has the following properties.
1. If kp − qk1 > 13ε, the algorithm rejects with high constant probability.
2. If q is an (ε/n, ε/24)-multiplicative approximation of some q0 such that
72ε 2
p − q0 1 ≤ √ ,
` n
where ` = log(n/ε)/ log(1 + ε), the algorithm accepts with high constant probability (in particular,
if q is an (ε/n, ε/24)-multiplicative approximation of p or if
72ε 2
kp − qk1 ≤ √ ,
` n
the test accepts with high constant probability).
√
The algorithm takes Õ( n poly(ε −1 )) samples from p.
Let p be a distribution over some finite domain R, and let R0 be a subset of R such that p(R0 ) > 0
where p(R0 ) = ∑i∈R0 p(i). Denote by p|R0 the restriction of p to R0 , i. e., p|R0 is a distribution over R0
such that for every i ∈ R0 , p|R0 (i) = p(i)/p(R0 ). In the proof of Theorem 6.3 we shall use the following
definitions and lemmas.
Definition 6.4 ([10]). Given an explicit distribution p over R, Bucket(p, R, α, β ) is the partition
{R0 , . . . , R` } of R with ` = log(1/α)/ log(1 + β ), R0 = {i : p(i) ≤ α}, such that for all j in [`],
R j = i : α(1 + β ) j−1 < p(i) ≤ α(1 + β ) j .
(6.1)
Definition 6.5 ([10]). Given a distribution p over R, and a partition R = {R1 , . . . , R` } of R, the coarsening
phRi is the distribution over [`] with distribution phRi (i) = p(Ri ).
Theorem 6.6 ([10]). Let p be a black-box distribution over a finite domain R and let S be a sample set
from p. coll(S)/ |S| approximates kpk22 to within a factor of (1 ± ε), with probability at least 1 − δ ,
2 p
provided that |S| ≥ c |R|ε −2 log(1/δ ) for some sufficiently large constant c.
Lemma 6.7 ([10]). Let p,q be distributions over R and let R0 ⊆ R, then kp|R0 − q|R0 k1 ≤ 2kp − qk1 /p(R0 ).
Lemma 6.8 ([10]). For any distribution p over R, kpk22 − kUR k22 = kp −UR k22 .
Lemma 6.9 (Based on [10]). Let p,q be distributions over R and let R0 ⊆ R, then
∑ |p(i) − q(i)| ≤ |p(R0 ) − q(R0 )| + q(R0 )kp|R − q|R k1 .
0 0
i∈R0
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 330
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Proof.
p(i)(p(R0 ) − q(R0 )) p(i)q(R0 )
∑ |p(i) − q(i)| ≤ ∑ p(R0 )
+ ∑ p(R0 )
− q(i) (6.2)
i∈R0 i∈R0 i∈R0
p(i)q(R0 )
= |p(R0 ) − q(R0 )| + ∑ − q(i) (6.3)
i∈R0 p(R0 )
p(i) q(i)
= |p(R0 ) − q(R0 )| + ∑ q(R0 ) · − (6.4)
i∈R0 p(R ) q(R0 )
0
= |p(R0 ) − q(R0 )| + q(R0 ) · p|R0 − q|R0 1 , (6.5)
and the lemma is established.
Lemma 6.10. Let p, q be distributions over a finite domain R and let R0 ⊆ R be a subset of R such that
for every i ∈ R0 it holds that
q(i) ∈ p(i) · [1 − ε, 1 + ε] , (6.6)
Then for every i ∈ R0 ,
(1 − ε) (1 + ε)
q (i) ∈ p (i) ·
|R0 |R0 , (6.7)
(1 + ε) (1 − ε)
Proof. Equation (6.6) implies that q(R0 ) ∈ p(R0 ) [1 − ε, 1 + ε] and therefore
p(R0 )
1 1
∈ , .
q(R0 ) 1+ε 1−ε
Thus, we obtain that
q(i) p(i) (1 − ε) (1 + ε)
∈ · , ,
q(R0 ) p(R0 ) (1 + ε) (1 − ε)
and the lemma follows.
Proof of Theorem 6.3. Let E1 be the event that for every i in [`] we have that mi approximates kp|Ri k22 to
√
within a factor of (1 ± ε 2 ). By Theorem 6.6 and the union bound, if Si is such that |Si | ≥ c nε −4 log `
then E1 occurs with probability at least 8/9. Let E2 be the event that for every i in [`] we have that
|(|Si |/|S|) − p(Ri )| ≤ ε/(2`). By Hoeffding’s inequality E2 occurs with probability at least 8/9 for
|S| = Ω̃(`2 ε −2 ). Let E3 be the event that pehRi and qehRi are ε/(2`)-additive approximations of phRi and
qhRi , respectively. By taking Θ(ε −2 `2 log `) samples, E3 occurs with probability at least 8/9.
Let p and q be as described in Case 1, i. e., kp − qk1 > 13ε. Suppose the algorithm accepts p and q.
Conditioned on E1 ∩ E2 , this implies that for each partition Ri for which Steps 4a - 4b were performed,
which are those for which q(Ri ) ≥ ε/`, we have
(1 + ε 2 )2 1
kp|Ri k22 ≤ · ,
|Ri | 1 − ε2
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 331
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
Tolerant-Identity-Test (sampling access to p, explicit description of q, n, ε > 0)
def
1. R = {R0 , · · · , R` } = Bucket(q, n, ε/n, ε/24)
√
2. Let S be a set of Θ̃( nε −5 log n) samples from p
3. Let H be the set of all x such that q(x) > ε(1 + ε)/n
4. For each Ri ⊆ H such that q(Ri ) ≥ ε/`
√
(a) If |Si | < c nε −4 log `, where Si = S ∩ Ri and c is the constant from Theorem 6.6, then
REJECT
2 )2
(b) If mi > (1+ε |Si |
|Ri | , where mi = coll(Si )/ 2 , then REJECT
5. Take Θ(ε −2 ` log `) samples and obtain a ε/(4`)-additive approximations pehRi and qehRi of phRi
and qhRi , respectively
6. If k pehRi − qehRi k1 > 3ε/2 then REJECT
7. ACCEPT
Algorithm 2: The algorithm for tolerant identity testing.
2
which is at most 1+4ε
|Ri | for 0 < ε ≤ 1/3. Thus, by Lemma 6.8 it follows that
4ε 2
kp|Ri −U|Ri k22 = kp|Ri k22 − kU|Ri k22 ≤ . (6.8)
|Ri |
For every i ∈ [`],
q( j) q(Ri )/|Ri | 2 1 q(Ri ) 2 ε2
kq|Ri −U|Ri k22 = ∑ − ≤ ∑ ε ≤ , (6.9)
j∈Ri q(Ri ) q(Ri ) q(Ri )2 j∈Ri |Ri | |Ri |
where the second inequality follows from the bucketing definition. By the triangle inequality we obtain
from equations (6.8) and (6.9) that
9ε 2
kp|Ri − q|Ri k22 ≤
|Ri |
and thus kp|Ri − q|Ri k1 ≤ 3ε. We also have that the sum of q(Ri ) over all Ri for which Steps 4a - 4b
were not preformed is at most ` · (ε/`) + n · (ε(1 + ε)2 /n) < 4ε. For those Ri we use the trivial bound
kp|Ri − q|Ri k1 ≤ 2. Also, kphRi − qhRi k1 ≤ 2ε by Step 6. So by Lemma 6.9 we get that kp − qk1 ≤ 13ε
in contradiction to our assumption. Therefore, the test accepts p and q with probability at most 1/3 (the
bound on the probability of E¯1 ∪ E¯2 ∪ E¯3 ).
We next turn to proving the second item in the theorem. Suppose q is an (ε/n, (ε/24))-multiplicative
√
approximation of some q0 such that p is (72ε 2 /` n)-close to q0 . Conditioned on E2 , every Ri that enters
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 332
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Step 4a also passes this step, since otherwise we get, in contradiction to our assumption, that q(Ri ) ≥ ε/`
while p(Ri ) ≤ 2ε/(3`). From the bucketing definition we have that for every i ∈ [`] and for every x ∈ Ri ,
1
max(y ∈ Ri ) · ≤ q(x) ≤ min(y ∈ Ri ) · (1 + (ε/24)) ,
(1 + (ε/24))
implying that
q(Ri ) 1
q(x) ∈ · , (1 + (ε/24)) . (6.10)
|Ri | (1 + (ε/24))
Since q is an (ε/n, ε/24)-multiplicative approximation of q0 , we get by Lemma 6.10 that for every Ri ⊆ H
and every x ∈ H,
q0 (x)
q(x) (1 − (ε/24)) (1 + (ε/24))
∈ · , . (6.11)
q(Ri ) q0 (Ri ) (1 + (ε/24)) (1 − (ε/24))
Combining equations (6.10) and (6.11) we get that
q0 (Ri ) (1 − (ε/24)) (1 + (ε/24))2
0
q (x) ∈ · , , (6.12)
|Ri | (1 + (ε/24))2 (1 − (ε/24))
and thus for 0 < ε ≤ 1/2,
q0 (x)
(1 − (ε/2)) (1 + (ε/2))
∈ , . (6.13)
q0 (Ri ) |Ri | |Ri |
By equation (6.13) we obtain that for every Ri ⊆ H
kq0|Ri −U|Ri k2 ≤ ε/(2 |Ri |) .
p
(6.14)
For all subsets Ri ⊆ H with q(Ri ) ≥ ε/` we have that q0 (Ri ) ≥ ε/((1 + ε)`), combined with the fact that
72ε 2
kp − q0 k1 ≤ √
` n
we get by Lemma 6.7 (for sufficiently large n) that
√
kp|Ri − q0|Ri k1 ≤ ε/(2 n) . (6.15)
This implies that √
kp|Ri − q0|Ri k2 ≤ kp|Ri − q0|Ri k1 ≤ ε/(2 n) < ε/(2 |Ri |) .
p
(6.16)
By the triangle inequality we get that
kp|Ri −U|Ri k2 ≤ kp|Ri − q0|Ri k2 + kq0|Ri −U|Ri k2 ≤ ε/
p
|Ri | . (6.17)
Therefore, by Lemma 6.8 it follows that
kp|Ri k22 = kp|Ri −U|Ri k22 + kU|Ri k22 ≤ (1 + ε 2 )/|Ri | . (6.18)
Therefore, conditioned on E1 ∩ E2 all such subsets will pass Step 4b. Since q is ε/2-close to q0 , by the
triangle inequality p is ε-close to q and thus conditioned on E3 the algorithm will pass Step 6 as well.
Thus the algorithm accepts with probability at least 2/3.
√
Finally, the sample complexity is Õ( nε −5 ) from Step 2, which dominates the sample complexity of
Step 5.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 333
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
6.2 An algorithm for tolerant testing of equivalence in the known-weights sampling
model
In this section we prove Theorem 6.11. We note that in the proof of the theorem we essentially describe a
tolerant tester for the property of independence of two random variables.
Theorem 6.11. Let D be a list of m distributions over [n] and let ~w be a weight vector over [m]. Denote
by QD,~w the joint distribution over [m] × [n] such that QD,~w (i, j) = wi · Di ( j). There is a test that works
in the known-weights sampling model, which takes O((n e 1/2 m1/2 + n)poly(1/ε)) samples from D, and
whose output satisfies the following:
√
• If D is 24ε 2 (1 − 3ε)/(` n)-close to being in Peq , where ` = log(n/ε)/ log(1 + ε), or if QD,~w
is an (ε/n, ε/120)-multiplicative approximation of π1 QD,~w × π2 QD,~w , then the test accepts with
probability at least 2/3
• If D is 19ε-far from being in Peq , then the test rejects with probability at least 2/3.
In the proof of Theorem 6.11 we shall use the following lemma:
Lemma 6.12. Let Q be a joint distribution over [m] × [n]. Let Q e1 be a (α1 , β1 )-multiplicative approxima-
e2 be a (α2 , β2 )-multiplicative approximation of π2 Q. Denote by A1 the set of all i ∈ [m]
tion of π1 Q. Let Q
1
such that Q (i) ≥ α1 (1 + β1 ). Denote byA2 the set of all j ∈ [n] such that Q
e e2 ( j) ≥ α2 (1 + β2 ). For every
1 2
B1 ⊆ A1 and every B2 ⊆ A2 , Q e ×Q e
|B1 ×B2
is a 0, 2(β1 + β2 )/((1 − β1 ) · (1 − β2 )) -multiplicative
approximation of (π1 Q × π2 Q)|B1 ×B2 .
Proof. For every (i, j) ∈ B1 × B2 we have that
e1 (i) · Q
Q e2 ( j) ∈ π1 Q(i) · π2 Q( j) · [(1 − β1 ) · (1 − β2 ), (1 + β1 ) · (1 + β2 )] . (6.19)
From the facts that
(1 + β1 ) · (1 + β2 ) 2(β1 + β2 ) (1 − β1 ) · (1 − β2 ) 2(β1 + β2 )
= 1+ and > 1− ,
(1 − β1 ) · (1 − β2 ) (1 − β1 ) · (1 − β2 ) (1 + β1 ) · (1 + β2 ) (1 − β1 ) · (1 − β2 )
and from Lemma 6.10 the lemma follows.
Proof of Theorem 6.11. The test referred to in the statement of the theorem is Algorithm 3. Let E1 be
the event that Qe2 is an (ε/n, ε/120)-multiplicative approximation of π2 Q, as defined in Definition 6.1.
By applying Chernoff’s inequality and the union bound, E1 occurs with probability at least 8/9 (for a
sufficiently large constant in the Θ(·) notation for the sample size). By Lemma 6.12, conditioned on E1 ,
we have that
~w × Qe2
|[m]×H
is a (0, ε/24)-multiplicative approximation of
(π1 Q × π2 Q)|[m]×H .
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 334
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Equivalence-Known-Weights-Tolerant-Test (sampling access to a list of m distributions, D, over
[n], in the known-weights sampling model, 0 < ε ≤ 1/3)
1. Let Q denote QD,~w
2. Take Θ(ε −3 n log n) samples and obtain a (ε/n, ε/120)-multiplicative approximation, Q
e2 , of
π2 Q
e2 ( j) > ε(1 + ε)/n and let L be [n] \ H
3. Let H be the set of all j ∈ [n] such that Q
4. If Tolerant-Identity-Test with parameters: Q[m]×H , ~w × Q e2 , |H| · m, ε and confidence
|[m]×H
1/9, rejects then REJECT
def
5. I = {[m] × H, [m] × L}
6. Take Θ(ε −2 ) samples and obtain a (ε/2)-additive approximations Q
e1×2 and Q
hIi
ehIi of (π1 Q ×
π2 Q)hIi and QhIi , respectively
e1×2 − Q
7. If Q ehIi > 2ε then REJECT
hIi 1
8. ACCEPT
Algorithm 3: The algorithm for tolerant testing of equivalence in the known-weights sampling model.
Thus,
e2
~w × Q − (~w × π2 Q)|[m]×H ≤ ε/24 .
|[m]×H 1
Let E2 be the event that the application of Tolerant-Identity-Test returned a correct answer, as defined by
Theorem 6.3. We run the amplified version of Tolerant-Identity-Test, therefore the additional parameter,
which is the confidence parameter, is set to 1/9, i. e., E2 occurs with probability at least 8/9.
Let D be 19ε-far from being in Peq and assume the test accepts. Conditioned on E2 this implies that
Q|[m]×H − ~w × Q e2 ≤ 13ε .
|[m]×H 1
By the triangle inequality, we obtain that conditioned on E1 ∩ E2 ,
Q|[m]×H − (~w × π2 Q)|[m]×H ≤ ε/24 + 13ε < 14ε . (6.20)
1
Conditioned on E1 we have that Q([m] × L) ≤ 3ε, and therefore
Q([m] × L) · Q|[m]×L − (~w × π2 Q)|[m]×L ≤ 6ε . (6.21)
1
Let E3 be the event that Qe1×2 and Q ehIi are ε/2-additive approximations of (π1 Q × π2 Q)hIi and QhIi ,
hIi
respectively. By taking Θ(ε −2 ) samples, E3 occurs with probability at least 8/9. Conditioned on E3 , we
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 335
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
have that
(π1 Q × π2 Q)hIi − QhIi 1 ≤ 3ε . (6.22)
Combining equations (6.20) - (6.22), by Lemma 6.9, we have that
k(π1 Q × π2 Q) − Qk1 ≤ 3ε + 14ε + 6ε = 23ε . (6.23)
Hence D is 23ε-close to being in Peq , in contradiction to our assumption. It follows that the test accepts
with probability at most 1/3.
√
On the other hand, consider the case that either D is 24ε 2 (1 − 3ε)/(` n)-close to being in Peq , or
that π1 QD,~w × π2 QD,~w is an (ε/n, ε/120)-multiplicative approximation of QD,~w , and assume that the test
e2 by A and (~w × π2 Q) by B. In case the test rejects in
rejects. For the sake of simplicity, denote ~w × Q
Step (4) then conditioned on E2 , we get by Theorem 6.3 that A|[m]×H is not an (ε/n, ε/24)-multiplicative
approximation of any q0 such that
72ε 2
Q|[m]×H − q0 1 ≤ √ .
` n
Conditioned on E1 , we have that A|[m]×H is an (ε/n, ε/24)-multiplicative approximation of B|[m]×H . Thus,
conditioned on E1 ∩ E2 , we obtain that
72ε 2
Q|[m]×H − B|[m]×H 1 > √ .
` n
Based on the fact that Q([m] × H) = B([m] × H), it follows, conditioned E1 ∩ E2 , that
72ε 2 (1 − 3ε)
kQ − Bk1 > Q([m] × H) · Q|[m]×H − B|[m]×H 1 > √ .
` n
√
By Proposition 3.17 this implies that D is 24ε 2 (1 − 3ε)/(` n)-far from being in Peq . By setting
q0 = Q|[m]×H we also have that A|[m]×H is not an (ε/n, ε/24)-multiplicative approximation of Q|[m]×H .
Hence, there exists (i, j) ∈ [m] × H that satisfies either
A|[m]×H (i, j) > (1 + (ε/24))Q|[m]×H (i, j) (6.24)
or
A|[m]×H (i, j) < (1 − (ε/24))Q|[m]×H (i, j) . (6.25)
By Lemma 6.12, we get that A|[m]×H is a (0, ε/30)-multiplicative approximation of B|[m]×H . Therefore,
by equations (6.24) and (6.24), either it holds that
1 + (ε/30)
Q|[m]×H (i, j) < B (i, j) (6.26)
1 + (ε/24) |[m]×H
or that
1 − (ε/30)
Q|[m]×H (i, j) > B (i, j) . (6.27)
1 − (ε/24) |[m]×H
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 336
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
Since Q([m] × H) = B([m] × H), we obtain from equations (6.26) and (6.27) that either
1 + (ε/30) 1 − (ε/30)
Q(i, j) < B(i, j) or Q(i, j) > B(i, j) ,
1 + (ε/24) 1 − (ε/24)
which by a simple calculation implies that Q is not a (ε/n, ε/120)-multiplicative approximation of
~w × π2 Q.
Alternatively, in case the test rejects in Step 7 then by the triangle inequality we get that conditioned
on E3 , Q is ε-far from π1 Q × π2 Q. In both cases we get a contradiction to our assumption and therefore
the algorithm accepts D with probability at most 1/3 (which is the upper bound on the probability of
E¯1 ∪ E¯2 ∪ E¯3 ).
The sample complexity of Step 4 is bounded by Õ(n1/2 m1/2 poly(ε −1 )) so the overall sample com-
plexity is Õ((n1/2 m1/2 + n)poly(ε −1 )).
6.3 An algorithm for tolerant testing of equivalence in the unknown-weights sampling
model
In this section we prove the following theorem:
Theorem 6.13. Let D be a list of m distributions over [n]. It is possible to distinguish between the
√
case that D is 36ε 3 /(` n)-close to being in Peq , where ` = log(n/ε)/ log(1 + ε) and the case that D is
25ε-far from being in Peq in the unknown-weights sampling model using a sample of size Õ((n2/3 m1/3 +
m) · poly(1/ε)).
Proof of Theorem 6.13. The algorithm referred to in the statement of the theorem is Algorithm 4. We note
that we run the amplified version of Tolerant-Identity-Test and Bounded-`∞ -Closeness-Test and that
the additional parameter in the application of Tolerant-Identity-Test and Bounded-`∞ -Closeness-Test
e1 is an (ε/m, ε/250)-multiplicative approximation
is the confidence parameter. Let E1 be the event that Q
−3
of π1 Q. By taking a sample of size Θ(ε m log m), E1 occurs with probability at least 20/21. Let E2
be the event that Qe2 is an (ε/n2/3 m1/3 , ε/250)-multiplicative approximation of π2 Q. For a sample of
size Θ(ε −3 n2/3 m1/3 log n), we get, by Chernoff’s inequality, that E2 occurs with probability at least
20/21. By Lemma 6.12, for every 0 < ε ≤ 1/3, we get, condition on E1 ∩ E2 , that Q e1 × Q e2 is a
|H1 ×H2
(0, ε/24)-multiplicative approximation of (π1 Q × π2 Q)|H1 ×H2 . Thus, conditioned on E1 ∩ E2 , we have
that
e1 × Q
Q e2 − (π1 Q × π2 Q) ≤ ε/24 . (6.28)
|H1 ×H2
|H1 ×H2 1
Let E3 be the event that the application of Tolerant-Identity-Test returned a correct answer, as defined
by Theorem 6.3. E3 occurs with probability at least 20/21.
Let D be 25ε-far from being in Peq and assume the algorithm accepts. Then either Tolerant-
Identity-Test returns accept or γ < 3ε/2. Consider the case that Tolerant-Identity-Test returns accept.
Conditioned on E3 , by Theorem 6.3, we have that
Qe1 × Q
e2 − Q|H1 ×H2 ≤ 13ε .
|H1 ×H2 1
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 337
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
Equivalence-Unknown-Weights-Tolerant-Test (sampling access to a list of m distributions, D,
over [n], in the unknown-weights sampling model, 0 < ε ≤ 1/8)
1. Let Q denote QD,~w
2. Take Θ(ε −3 m log m) samples and obtain an (ε/m, ε/250)-multiplicative approximation Q
e1 of
π1 Q
def
3. R = {R0 , · · · , R` } = Bucket(Q
e1 , m, (1 + ε)ε/m, ε)
4. Let L1 = R0 and let H1 = [m] \ L1
5. Take Θ(ε −3 n2/3 m1/3 log n) samples and obtain an (ε/(n2/3 m1/3 ), ε/250)-multiplicative ap-
proximation Qe2 of π2 Q
e2 ( j) > ε(1 + ε)/(n2/3 m1/3 ) and let L2 = [n] \ H2
6. Let H2 be the set of all j ∈ [n] such that Q
7. Take Θ(ε −2 ) samples and let γ be the fraction of samples in H1 × H2 . If γ < 3ε/2 then skip
the next step.
e1 × Q
8. If Tolerant-Identity-Test with parameters: Q|H1 ×H2 , (Q e2 )|H ×H , |H1 | · |H2 | ,ε and confi-
1 2
dence 1/21 rejects then REJECT
9. Let S be a set of Θ̃(`2 ε −2 ) independent samples. For each Ri such that |Si |/|S| ≥ ε/`,
where Si = S ∩ (Ri × L2 ), if Bounded-`∞ -Closeness-Test with parameters: (π1 Q × π2 Q)|Ri ×L2 ,
Q|Ri ×L2 , 4`/(εn2/3 m1/3 |Ri |), 2`/(εn2/3 m1/3 ), |L2 | · |Ri |, ε and confidence 1/(21`), rejects then
REJECT
def
10. I = {H1 × H2 , L1 × H2 , R0 × L2 , · · · , R` × L2 }
11. Take Θ(ε −2 `2 log `) samples and obtain an ε/(2`)-additive approximations Q
e1×2 and Q
hIi
ehIi of
(π1 Q × π2 Q)hIi and QhIi , respectively
e1×2 − Q
12. If Q ehIi > 2ε then REJECT
hIi 1
13. ACCEPT
Algorithm 4: The algorithm for tolerant testing of equivalence in the unknown-weights sampling model.
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 338
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
By the triangle inequality and equation (6.28) we obtain that
(π1 Q × π2 Q)|H1 ×H2 − Q|H1 ×H2 ≤ 13ε + ε/24 < 14ε . (6.29)
1
Consider the case γ < 3ε/2. Let E4 be the event that |γ − Q(H1 × H2 )| ≤ ε/2. By taking Θ(ε −2 ) samples,
E4 occurs with probability at least 20/21. Then we have that
Q(H1 × H2 ) ≤ 2ε . (6.30)
Let E5 be the event that all applications of Bounded-`∞ -Closeness-Test returned a correct answer, as
defined by Theorem 5.3. By the union bound, E5 occurs with probability at least 20/21. Conditioned on
E5 , we obtain that every Ri that passes Step 9 satisfies the following
(π1 Q × π2 Q)|Ri ×L2 − Q|Ri ×L2 1 ≤ ε . (6.31)
Let E6 to be the event that for every i in [`] we have that |(|Si |/|S|) − Q(L2 × Ri )| ≤ ε/(2`). By Hoeffding’s
inequality E6 occurs with probability at least 20/21 for |S| = Ω̃(`2 ε −2 ). From the fact that for every Ri
that doesn’t enter Step 9 we have that |Si |/|S| < ε/`, we obtain, conditioned on E6 , that
Q(Ri × L2 ) ≤ 3ε/(2`) . (6.32)
e1×2 and Q
Let E7 be the event that Q ehIi are ε/(2`)-additive approximations of (π1 Q × π2 Q)hIi and QhIi ,
hIi
−2 2
respectively. By taking Θ(ε ` log `) samples, E7 occurs with probability at least 20/21. Since we
assume that the algorithm accepts D then, in particular, D passes Step 12. Therefore, conditioned on E7 ,
we have that
(π1 Q × π2 Q)hIi − QhIi 1 ≤ 3ε . (6.33)
Conditioned on E1 ∩ E2 , for 0 < ε ≤ 1/5 we have that
Q(L1 × H2 ) ≤ 3ε/2 . (6.34)
For every I ∈ I we have the following trivial bound
(π1 Q × π2 Q)|I − Q|I 1 ≤ 2 . (6.35)
Combining equations (6.29) - (6.35), by Lemma 6.9, we have that
k(π1 Q × π2 Q) − Qk1 ≤ 3ε + 14ε + 2ε + ` · 3ε/(2`) · 2 + 3ε/2 · 2 = 25ε . (6.36)
Therefore, D is 25ε-close to being in Peq in contradiction to our assumption. It follows that the algorithm
accepts D with probability at most 1/3.
√
On the other hand, let D be 36ε 3 /(` n)-close to being in Peq and assume the algorithm rejects.
Conditioned on E1 ∩ E2 , we have that (Q e1 × Q
e2 )|H ×H is a (0, ε/24)-multiplicative approximation of
1 2
(π1 Q × π2 Q)|H1 ×H2 . Therefore, conditioned on E1 ∩ E2 ∩ E3 ∩ E4 , if we reject in Step 8, then we obtain
by Theorem 6.3 that
ε2
Q|H1 ×H2 − π1 Q × π2 Q |H1 ×H2 1 > 72 · √ . (6.37)
` n
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 339
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
It follows, by Lemma 6.7, that
π1 Q(H1 ) · π2 Q(H2 ) ε2 36ε 3
kπ1 Q × π2 Q − Qk1 > · 72 · √ ≥ √ .
2 ` n ` n
If we reject in Step 9, then conditioned on E5 ∩ E6 , there is Ri such that Q(Ri × L2 ) ≥ ε/` in which the
following holds,
√
(π1 Q × π2 Q)|Ri ×L2 − Q|Ri ×L2 > ε/(2 n) . (6.38)
1
Thus, by Lemma 6.7,
Q(Ri × L2 ) √ √
kπ1 Q × π2 Q − Qk1 > · ε/(2 n) ≥ ε 2 /(4` n) .
2
If we reject in Step 12, then conditioned on E7 it follows that kπ1 Q × π2 Q − Qk1 > ε . Thus we get a
contradiction to our assumption (that the algorithm rejects), which implies that the algorithm accepts D
with probability at least 2/3. To achieve (1 − δ ) confidence, the amplified algorithm takes the majority
result of Θ(log 1/δ ) applications of the original algorithm. In addition, both algorithms are applied
on restricted domains (H1 × H2 in Tolerant-Identity-Test and Ri × L2 in Bounded-`∞ -Closeness-Test).
This affects the sample complexity only by a factor of poly(1/ε, log n). For every Ri that enters Step 9,
the number of required samples from the domain Ri × L2 in that step is bounded by
Õ((n2/3 · |Ri |1/2 /m1/6 + n2/3 · |Ri |/m2/3 ) · poly(1/ε)) .
Thus, since ` is logarithmic in n and 1/ε, the number of samples required by all the applications of
Bounded-`∞ -Closeness-Test is bounded by Õ(n2/3 m2/3 · poly(1/ε)). Therefore, the sample complexity
is Õ((n2/3 m1/3 + m) · poly(1/ε)) as required.
7 Testing (k, β )-clusterability in the query model
eq
In this section we consider an extension of the property Pm,n studied in the previous sections. Namely,
rather than asking whether all distributions in a list D are the same, we ask whether there exists a partition
of D into at most k lists, such that within each list all distributions are the the same (or close). That is, we
are interested in the following a clustering problem:
Definition 7.1. Let D be a list of m distributions over [n]. We say that D is (k, β )-clusterable if there
exists a partition of D to k lists ,{Di }ki=1 such that for every i ∈ [k] and every D, D0 ∈ Di , kD − D0 k1 ≤ β .
eq
In particular, for k = 1 and β = 0, we get the property Pm,n . We study testing (k, β )-clusterability
(for k ≥ 1) in the query model. The question for k > 1 in the (uniform) sampling model remains open.
We start by noting that if we allow a linear (or slightly higher) dependence on n, then it is possible,
for any ε and β (by adapting the algorithm we give below), to obtain a tester that distinguishes collections
that are (k, β )-clusterable from collections that are ε-far from being (k, β )-clusterable. The complexity of
this tester is Õ(n · k · poly(1/ε))). However, if we want a dependence on n that grows slower than n1−o(1) ,
then it is not possible to get such a result even for m = 2 (and k = 1). This is true since distinguishing
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 340
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
between the case that a pair of distributions are β -close and the case that they are β 0 -far for constant β
and β 0 requires n1−o(1) samples [47]. We also note that for β = 0 the dependence on n must be at least
Ω(n2/3 ) (for m = 2 and k = 1) [47]. Our algorithm works for β = 0 and slightly more generally, for
√
β = O(ε/ n), has no dependence on m, has almost linear dependence on k, and its dependence on n
grows like Õ(n2/3 ).
Theorem 7.2. Algorithm 5 Clusterability-Test is a testing algorithm for (k, β )-clusterability of a list of
distributions in the query model, which works for every ε > 8β n1/2 , and performs Õ(n2/3 · k · poly(1/ε))
sampling queries.
We build on the following theorem.
Theorem 7.3 ([11]). Given parameter δ , and sampling access to distributions p, q over [n], there is a
test, `1 -Distance-Test(p, q, ε, δ ), which takes O(ε −4 n2/3 log n log δ −1 ) samples from each distribution
and for which the following holds.
• If kp − qk1 ≤ ε/(4n1/2 ), then the test accepts with probability at least 1 − δ .
• If kp − qk1 > ε, then the test rejects with probability at least 1 − δ .
Algorithm 5 is an adaptation of the diameter-clustering tester of [3], which applies to clustering
vectors in Rd . While often clustering algorithms rely on a method of evaluating distances between the
objects that they cluster, the algorithm from [11] only distinguishes pairs of distributions that are very
close from those that are ε-far (in `1 distance). Still, this is enough information in conjunction with the
algorithm of [3] to construct a good distribution (k, b)-clusterability tester.
Clusterability-Test (query access to a list, D, of m distributions over [n] in the query model, k, β ,
ε > 0)
1. Pick the representative of the first cluster, rep1 , uniformly from D
2. For i := 1, . . . , k
(a) Uniformly and independently select a set, D0 , of 2 ln(6(k + 1))/ε distributions from D
(b) If there exists D ∈ D0 such that `1 -Distance-Test(D, rep` , β /2, ε/12(k + 1) ln(6(k + 1)))
rejects for every 1 ≤ ` ≤ i, then set repi+1 = D, otherwise ACCEPT
3. REJECT
Algorithm 5: The algorithm for testing clusterability.
Proof of Theorem 7.2. Assume all applications of `1 -Distance-Test returned a correct answer, as defined
by Theorem 7.3. By the union bound, this happens with probability at least 5/6. Let us refer to this
event as E1 . Conditioned on E1 , the clustering algorithm rejects only if it finds k + 1 distributions in D
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 341
R EUT L EVI , DANA RON , AND RONITT RUBINFELD
such that the `1 distance between every two of them is greater than (ε/2)/(4n1/2 ) ≥ β . Thus, if D is
(k, β )-clusterable, then it will be accepted with probability at least 5/6.
We thus turn to the case that D is ε-far from being (k, β )-clusterable. In this case we claim that
as long as there are t ≤ k representatives, rep1 , . . . , rept , the number of distributions Di ∈ D such that
kDi −rep` k1 > ε/2 for every 1 ≤ ` ≤ t, is at least εm/2. To verify this, assume for the sake of contradiction
that there are less than εm/2 such distributions. Then the sum of the distances of each of these distributions
to an arbitrary representative plus the sum over the other distributions of the distance between each
distribution and the closest representative is less than εm, implying that the collection is ε-close to being
(k, 0)-clusterable.
Since in each iteration of the while loop, there are less than k + 1 representative distributions, at least
εm/2 of the distributions in D are ε/2-far from any of the former representative distributions. Therefore,
conditioned on E1 , for every iteration of the while loop, the probability that a new representative is not
found is less than
1
(1 − ε/2)2 ln(6(k+1))/ε < eln(6(k+1)) = .
6(k + 1)
By applying the union bound, the algorithm rejects D with probability greater than 2/3. Since there are
O(log k/ε) iterations, and in each there is a single application of the `1 -distance test, by Theorem 7.3 the
total number of samples used is as stated.
Acknowledgements
We would like to thank the two anonymous referees for providing us with constructive comments.
References
[1] M ICHAŁ A DAMASZEK , A RTUR C ZUMAJ , AND C HRISTIAN S OHLER: Testing monotone con-
tinuous distributions on high-dimensional real cubes. In Proc. 21st Ann. ACM-SIAM Symp. on
Discrete Algorithms (SODA’10), pp. 56–65. ACM Press, 2010. Short version in Property Testing.
[ACM:1873601.1873607] 299
[2] N OGA A LON , A LEXANDR A NDONI , TALI K AUFMAN , K EVIN M ATULEF, RONITT RUBINFELD ,
AND N ING X IE: Testing k-wise and almost k-wise independence. In Proc. 39th STOC, pp. 496–505.
ACM Press, 2007. [doi:10.1145/1250790.1250863] 299
[3] N OGA A LON , S EANNIE DAR , M ICHAL PARNAS , AND DANA RON: Testing of cluster-
ing. SIAM J. Discrete Math., 16(3):393–417, 2003. Preliminary version in FOCS’00.
[doi:10.1137/S0895480102410973] 298, 341
[4] N OGA A LON , YOSSI M ATIAS , AND M ARIO S ZEGEDY: The space complexity of approximating
the frequency moments. J. Comput. System Sci., 58(1):137–147, 1999. Preliminary version in
STOC’96. [doi:10.1006/jcss.1997.1545] 296
T HEORY OF C OMPUTING, Volume 9 (8), 2013, pp. 295–347 342
T ESTING P ROPERTIES OF C OLLECTIONS OF D ISTRIBUTIONS
[5] Alexandr Andoni, Piotr Indyk, Krzysztof Onak, and Ronitt Rubinfeld: External sampling. In Proc. 36th Internat. Colloq. on Automata, Languages and Programming (ICALP’09), pp. 83–94. Springer, 2009. [doi:10.1007/978-3-642-02927-1_9] 299
[6] Khanh Do Ba, Huy L. Nguyen, Huy N. Nguyen, and Ronitt Rubinfeld: Sublinear time algorithms for Earth Mover’s Distance. Theory Comput. Syst., 48(2):428–442, 2011. Preprint available at arXiv. [doi:10.1007/s00224-010-9265-8] 299
[7] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan: Counting distinct elements in a data stream. In Proc. 6th Internat. Workshop on Randomization and Computation (RANDOM’02), pp. 1–10. Springer, 2002. [doi:10.1007/3-540-45726-7_1] 296
[8] Tuğkan Batu: Testing properties of distributions. Ph. D. thesis, Computer Science department, Cornell University, 2001. http://www.maths.lse.ac.uk/Personal/batu/papers/t.ps. [ACM:934334] 299
[9] Tuğkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld: The complexity of approximating the entropy. SIAM J. Comput., 35(1):132–150, 2005. Preliminary versions in CCC’02 and STOC’02. [doi:10.1137/S0097539702403645] 296, 299
[10] Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White: Testing random variables for independence and identity. In Proc. 42nd FOCS, pp. 442–451. IEEE Comp. Soc. Press, 2001. [doi:10.1109/SFCS.2001.959920] 296, 298, 299, 318, 325, 329, 330
[11] Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White: Testing that distributions are close. In Proc. 41st FOCS, pp. 259–269. IEEE Comp. Soc. Press, 2000. [doi:10.1109/SFCS.2000.892113] 296, 298, 310, 325, 326, 327, 341
[12] Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White: Testing closeness of discrete distributions. Technical report, 2010. Preliminary version in FOCS’00. Accepted for publication in JACM. [arXiv:1009.5397] 296, 297, 298, 299, 304, 325
[13] Tuğkan Batu, Ravi Kumar, and Ronitt Rubinfeld: Sublinear algorithms for testing monotone and unimodal distributions. In Proc. 36th STOC, pp. 381–390. ACM Press, 2004. [doi:10.1145/1007352.1007414] 299
[14] Vladimir Braverman and Rafail Ostrovsky: Measuring k-wise independence of streaming data. Technical report, 2008. [arXiv:0806.4790v1] 296
[15] Vladimir Braverman and Rafail Ostrovsky: Measuring independence of datasets. In Proc. 42nd STOC, pp. 271–280. ACM Press, 2010. [doi:10.1145/1806689.1806728] 296
[16] Vladimir Braverman and Rafail Ostrovsky: Zero-one frequency laws. In Proc. 42nd STOC, pp. 281–290. ACM Press, 2010. [doi:10.1145/1806689.1806729] 296
[17] Don Coppersmith and Ravi Kumar: An improved data stream algorithm for frequency moments. In Proc. 15th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA’04), pp. 151–156. ACM Press, 2004. [ACM:982815] 296
[18] Graham Cormode, Mayur Datar, Piotr Indyk, and S. Muthukrishnan: Comparing data streams using Hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3):529–540, 2003. Preliminary version in VLDB’02. [doi:10.1109/TKDE.2003.1198388] 296
[19] Imre Csiszár: Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967. 296
[20] Artur Czumaj and Christian Sohler: Testing expansion in bounded-degree graphs. Combinatorics, Probability & Computing, 19(5-6):693–709, 2010. Preliminary version in FOCS’07. [doi:10.1017/S096354831000012X] 296
[21] Joan Feigenbaum, Sampath Kannan, Martin J. Strauss, and Mahesh Viswanathan: An approximate L1-difference algorithm for massive data streams. SIAM J. Comput., 32(1):131–151, January 2003. Preliminary version in FOCS’99. [doi:10.1137/S0097539799361701] 296
[22] William Feller: An introduction to probability theory and its applications. Volume 1. Wiley, 3rd edition, 1967. 306
[23] Eldar Fischer and Arie Matsliah: Testing graph isomorphism. SIAM J. Comput., 38(1):207–225, 2008. Preliminary version in SODA’06. [doi:10.1137/070680795] 296
[24] Jessica H. Fong and Martin Strauss: An approximate Lp difference algorithm for massive data streams. Discrete Mathematics & Theoretical Computer Science, 4(2):301–322, 2001. DMTCS. Preliminary version in STACS’00. 296
[25] Oded Goldreich and Dana Ron: On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography, pp. 68–75. Springer, 2011. Preliminary version in ECCC. [doi:10.1007/978-3-642-22670-0_9] 296
[26] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian: Sublinear estimation of entropy and information distances. ACM Trans. Algorithms, 5(4):35, 2009. Preliminary version in SODA’06. [doi:10.1145/1597036.1597038] 296, 299
[27] Bernard Harris: The statistical estimation of entropy in the non-parametric case. Colloquia Mathematica Societatis János Bolyai, 16:323–355, 1975. DTIC Online. 296
[28] Piotr Indyk and Andrew McGregor: Declaring independence via the sketching of sketches. In Proc. 19th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA’08), pp. 737–745. ACM Press, 2008. [ACM:1347163] 296
[29] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai: Extracting correlations. In Proc. 50th FOCS, pp. 261–270. IEEE Comp. Soc. Press, 2009. [doi:10.1109/FOCS.2009.56] 296
[30] Satyen Kale and Comandur Seshadhri: An expansion tester for bounded degree graphs. SIAM J. Comput., 40(3):709–720, 2011. Preliminary versions in ICALP’08 and ECCC. [doi:10.1137/100802980] 296
[31] Donald Knuth: The Art of Computer Programming: Seminumerical Algorithms. Volume 2. Addison-Wesley, 1969. 296
[32] Reut Levi, Dana Ron, and Ronitt Rubinfeld: Testing properties of collections of distributions. In Proc. 2nd Symp. Innovations in Comput. Sci. (ICS’11), pp. 179–194, 2011. 295
[33] Shang-keng Ma: Calculation of entropy from data of motion. J. Statistical Physics, 26(2):221–240, 1981. [doi:10.1007/BF01013169] 296
[34] Dragoslav S. Mitrinović and Petar M. Vasić: Analytic Inequalities. Springer, 1970. 309
[35] Asaf Nachmias and Asaf Shapira: Testing the expansion of a graph. Inform. and Comput., 208(4):309–314, 2010. Preliminary version in ECCC. [doi:10.1016/j.ic.2009.09.002] 296
[36] Ilya Nemenman, William Bialek, and Rob de Ruyter van Steveninck: Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69(5):056111, 2004. Preprint available at arXiv. [doi:10.1103/PhysRevE.69.056111] 296
[37] Liam Paninski: Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003. [doi:10.1162/089976603321780272] 296
[38] Liam Paninski: Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inform. Theory, 50(9):2200–2203, 2004. [doi:10.1109/TIT.2004.833360] 296
[39] Liam Paninski: A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inform. Theory, 54(10):4750–4755, 2008. [doi:10.1109/TIT.2008.928987] 296
[40] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith: Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685–709, 2013. Preliminary version in RANDOM’07. [doi:10.1007/s00453-012-9618-6] 296
[41] Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith: Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput., 39(3):813–842, 2009. Preliminary version in FOCS’07. [doi:10.1137/070701649] 296, 301
[42] Bero Roos: On the rate of multivariate Poisson convergence. J. Multivar. Anal., 69(1):120–134, 1999. [doi:10.1006/jmva.1998.1789] 305, 306
[43] Ronitt Rubinfeld and Rocco A. Servedio: Testing monotone high-dimensional distributions. Random Structures & Algorithms, 34(1):24–44, 2009. Preliminary version in STOC’05. [doi:10.1002/rsa.20247] 299
[44] Ronitt Rubinfeld and Ning Xie: Robust characterizations of k-wise independence over product spaces and related testing results. Random Structures & Algorithms, 2012 (online). Preliminary version in ICALP’10. [doi:10.1002/rsa.20423] 299
[45] Steve P. Strong, Roland Koberle, Rob R. de Ruyter van Steveninck, and William Bialek: Entropy and information in neural spike trains. Phys. Rev. Lett., 80(1):197–200, 1998. [doi:10.1103/PhysRevLett.80.197] 296
[46] Wojciech Szpankowski: Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York, 2001. [ACM:517168] 301, 305
[47] Paul Valiant: Testing symmetric properties of distributions. Ph. D. thesis, CSAIL, MIT, 2008. DSpace@MIT. 296, 297, 298, 299, 301, 341
[48] Paul Valiant: Testing symmetric properties of distributions. SIAM J. Comput., 40(6):1927–1968, 2011. Preliminary version in STOC’08. [doi:10.1137/080734066] 296, 301, 305, 310
[49] Patrick White: Testing random variables for independence and identity. Unpublished manuscript, 2002. 329, 330
[50] David H. Wolpert and David R. Wolf: Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841–6854, 1995. [doi:10.1103/PhysRevE.52.6841] 296
[51] Kenji Yamanishi: Probably almost discriminative learning. Machine Learning, 18(1):23–50, 1995. Preliminary version in COLT’92. [doi:10.1007/BF00993820] 296
AUTHORS
Reut Levi
Ph. D. candidate
School of Computer Science, Tel-Aviv University
reuti.levi@gmail.com
Dana Ron
Professor
School of Computer Science, Tel-Aviv University
danar@eng.tau.ac.il
http://www.eng.tau.ac.il/~danar
Ronitt Rubinfeld
Professor
School of Computer Science, Tel-Aviv University and CSAIL, MIT
ronitt@csail.mit.edu
http://people.csail.mit.edu/ronitt/
ABOUT THE AUTHORS
Reut Levi received her M. Sc. from Tel-Aviv University and is now a Ph. D. candidate under the supervision of Dana Ron and Ronitt Rubinfeld. Her research focuses on testing properties of distributions.
Dana Ron received her Ph. D. from the Hebrew University, Jerusalem, in 1995. Between 1995 and 1997 she was an NSF Postdoc at MIT. During the academic year 1997-98 she was a science scholar at the Mary Ingraham Bunting Institute (Radcliffe College) and at MIT. Since 1998 she has been a faculty member at the Faculty of Engineering, Tel Aviv University. During the academic year 2003-04 she was a fellow at the Radcliffe Institute for Advanced Study, Harvard University. Her research focuses on sublinear approximation algorithms and in particular on property testing.
Ronitt Rubinfeld received her undergraduate degree from the University of Michigan and her Ph. D. from the University of California, Berkeley under the direction of Manuel Blum. She held postdoctoral positions at DIMACS and the Hebrew University of Jerusalem. After several years as a faculty member at Cornell University and NEC Research Institute, she is currently on the faculties of MIT and Tel Aviv University. She has been a recipient of the ONR Young Investigator award, NSF Career Award, Sloan Fellowship, Cornell Association for Computer Science Undergraduates Faculty of the Year award, and a Cornell College of Engineering Teaching award. She was an invited speaker at the International Congress of Mathematicians in 2006. Her research focuses on sub-linear time algorithms for big datasets.