Authors Jayadev Acharya, Cl{\'{e}}ment L. Canonne, Gautam Kamath,
License CC-BY-3.0
Theory of Computing, Volume 14 (19), 2018, pp. 1–46
www.theoryofcomputing.org
A Chasm Between Identity and Equivalence
Testing with Conditional Queries
Jayadev Acharya∗ Clément L. Canonne† Gautam Kamath‡
Received December 11, 2015; Revised February 14, 2018; Published December 19, 2018
Abstract: A recent model for property testing of probability distributions (Chakraborty et
al., ITCS 2013, Canonne et al., SICOMP 2015) enables tremendous savings in the sample
complexity of testing algorithms, by allowing them to condition the sampling on subsets of
the domain. In particular, Canonne, Ron, and Servedio (SICOMP 2015) showed that, in this
setting, testing identity of an unknown distribution D (i. e., whether D = D∗ for an explicitly
known D∗) can be done with a constant number of queries (i. e., samples), independent of the support size n—in contrast to the required Ω(√n) in the standard sampling model. However,
it was unclear whether the same stark contrast exists for the case of testing equivalence,
where both distributions are unknown. Indeed, while Canonne et al. established a poly(log n)-
query upper bound for equivalence testing, very recently brought down to Õ(log log n) by
Falahatgar et al. (COLT 2015), whether a dependence on the domain size n is necessary was
An extended abstract of this paper, with only some of the proofs and a subset of the results on non-adaptive algorithms,
appeared in the Proceedings of the 2015 Conference on Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques (APPROX/RANDOM) [3].
∗ Supported by a Cornell University Start Up, and NSF CRII-CIF-1657471. Part of this work was done when the author was
a postdoctoral researcher at MIT.
† Supported by a Motwani Postdoctoral Fellowship. Part of this work was performed while the author was a graduate student
at Columbia University, and supported by NSF grants CCF-1115703 and NSF CCF-1319788.
‡ This work was supported by ONR N00014-12-1-0999, and NSF grants CCF-0953960 (CAREER) and CCF-1101491.
ACM Classification: F.2, G.2, G.3
AMS Classification: 68Q32, 68W20, 68Q25, 68Q17
Key words and phrases: property testing, distribution testing, conditional sampling, lower bounds,
equivalence testing, uniformity testing, support size estimation
© 2018 Jayadev Acharya, Clément L. Canonne, and Gautam Kamath
Licensed under a Creative Commons Attribution License (CC-BY). DOI: 10.4086/toc.2018.v014a019
still open, and explicitly posed by Fischer at the Bertinoro Workshop on Sublinear Algorithms
(2014). In this article, we answer the question in the affirmative, showing that any testing algorithm for equivalence must make Ω(√(log log n)) queries in the conditional sampling model. Interestingly, this demonstrates a gap between identity and equivalence testing, absent in the standard sampling model (where both problems have sampling complexity n^{Θ(1)}).
We also obtain results on the query complexity of uniformity testing and support-size
estimation with conditional samples. In particular, we answer a question of Chakraborty et al. (ITCS 2013), showing that non-adaptive uniformity testing indeed requires Ω(log n) queries
in the conditional model. This is an exponential improvement on their previous lower bound
of Ω(log log n), and matches up to polynomial factors their poly(log n) upper bound. For the
problem of support-size estimation, we provide both adaptive and non-adaptive algorithms,
with query complexities poly(log log n) and poly(log n), respectively, and complement them
with a lower bound of Ω(log n) conditional queries for non-adaptive algorithms.
1 Introduction
“No, Virginia, there is no constant-query tester.”1
Understanding properties and characteristics of an unknown probability distribution is a fundamental
problem in statistics, and one that has been thoroughly studied. However, it is only since the work of
Goldreich and Ron [27] and Batu et al. [9] that the problem has been considered through the lens of
theoretical computer science, more particularly in the setting of property testing. In this framework, an
unknown “huge object”—here a probability distribution over a humongous domain—can be accessed only
by making a few local inspections, and the goal is to decide whether the object satisfies some prespecified
property. While most of the literature focuses on the large sample regime and studies error exponents and
rates of convergence, more recent testing algorithms look at these problems in the small-sample regime
focusing on the probability of errors and sample complexity. (We refer the reader to [25, 35, 36, 26] for
an introduction and surveys on the general field of property testing. Moreover, due to the specificities of
our model we will interchangeably use the terms sample and query complexity, referring to conditional
samples as “queries.”)
Over the subsequent decade, a flurry of work explored this new area, resulting in a better and often
complete understanding of a number of questions in property testing of distributions, or distribution testing
(see, e. g., [27, 8, 11, 33, 39, 4, 12, 38, 29, 5, 18, 45] or [13] for a survey). In many cases, these culminated
in provably sample-optimal algorithms, all of which required at least an n^{Ω(1)} dependence on the domain
size n in the sample complexity—a dependence which, while sublinear, can still be prohibitively large.
However, the standard setting of distribution testing, where one only obtains independent samples from an
unknown distribution D, does not encompass all scenarios one may encounter. In recent years, alternative
models have thus been proposed to capture more specific situations [28, 16, 14, 30, 15]. Among these
is the conditional oracle model [16, 14] which will be the focus of our work. In this setting, the testing
algorithm is given the ability to sample from conditional distributions: that is, to specify a subset S of
the domain and obtain samples from DS , the distribution induced by D on S (the formal definition of the
1 The curious reader is referred to [47].
model can be found in Definition 2.1). The hope is that allowing a richer set of queries to the unknown
underlying distributions might significantly reduce the number of samples the algorithms need, thereby
sidestepping the strong lower bounds that hold in the standard sampling model.
1.1 Motivation for the conditional model
A recent trend in testing and learning circumvents these impossibility results by providing a more flexible
type of queries than independent samples. One example can be found in the recent paradigm of active
learning [20, 40] (and its testing counterpart, active testing [7]), which modifies and generalizes the usual
unsupervised learning paradigm. In this setting, the algorithm is provided with unlabeled examples only,
but can then adaptively request the label of any of these examples. While not directly comparable to
these two frameworks (which are not applicable to the study of probability distributions), the conditional sampling model we shall work in shares some similarity in spirit. Specifically, it provides the algorithm
with some additional power, and the ability to perform (some type of) more powerful queries to the object
to be learned or tested.
The setting of conditional sampling is related to that of group testing, where the objective is to identify
a set of defective individuals among a large population, by querying whether suitably chosen subsets
contain at least one defective individual. Group testing has been a field of interest since the 1940s and has
remained an active area of research since (see, e. g., [21, 22, 32, 17]). The type of queries allowed in this
framework is reminiscent of conditional sampling, where one obtains samples conditioned on a subset.
This connection between group testing and conditional sampling is explored in greater detail in [2].
Of course, a crucial aspect of designing and studying these new models of learning and testing is to
understand how justified and natural they are, and argue that they do indeed capture natural situations.
In the case of distribution testing, the conditional access model does meet these criteria, as discussed
in [16] and [14]. Namely, besides the purely theoretical aspect of helping understand the key aspects and
limitations of the underlying statistical problems, this framework is characteristic of situations which arise
in natural and social sciences. At a very high level, any scenario where an experimenter or practitioner
is able to restrict the set of possible outcomes of an experiment or poll—e. g., in chemistry, where one
might control some factor such as the acidity of a solution; or in sociology by performing stratified
polling—provides this experimenter with the sort of access granted by the conditional model.
1.2 Background and previous work
We focus in this paper on proving lower bounds for testing two extremely natural properties of distribu-
tions, namely equivalence testing (“are these two distributions identical?”) and support-size estimation
(“how many different outcomes can actually be observed?”). Along the way, we use some of the techniques
we develop to obtain an upper bound on the query complexity of the latter. We state below an informal def-
inition of these two problems, along with closely related ones (uniformity and identity testing). Hereafter,
“oracle access” to a distribution D over [n] = {1, . . . , n} means access to samples generated independently
from D, and “far” is with respect to the total variation distance dTV(D1, D2) = sup_{S⊆[n]}(D1(S) − D2(S)) between probability distributions. Moreover, as is usual in property testing, in all the problems below we
allow the algorithms to be randomized. In the description of the results below, we will often consider the
distance parameter ε as a constant (and focus on the domain size n as the key parameter).
Uniformity testing: granted oracle access to D, decide whether D equals U[n] (the uniform distribution
on [n]) or is far from it.
Identity testing: granted oracle access to D and the full description of a fixed D∗ , decide whether D
equals D∗ or is far from it.
Equivalence (closeness) testing: granted independent oracle accesses to D1 , D2 (both unknown), decide
whether D1 and D2 are equal or far from each other.
Support-size estimation: granted oracle access to D, return a multiplicative approximation of the size
of the support2 supp(D) = { x : D(x) > 0 }.
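For concreteness, here is a classical collision-based uniformity tester in the standard sampling model (our own illustrative sketch, not an algorithm from this paper; the sample size and the 1.5/n threshold are arbitrary illustrative choices):

```python
import random
from collections import Counter

# Illustrative collision-based uniformity tester in the standard sampling
# model (our own sketch; constants are not tuned values from the literature).
def uniformity_tester(sample, n):
    """ACCEPT iff the empirical collision rate is close to 1/n, the
    collision probability achieved by the uniform distribution on [n]."""
    m = len(sample)
    counts = Counter(sample)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    pairs = m * (m - 1) // 2
    return collisions / pairs <= 1.5 / n  # True = ACCEPT, False = REJECT

rng = random.Random(42)
n = 100
uniform_sample = [rng.randrange(n) for _ in range(2000)]
half_sample = [rng.randrange(n // 2) for _ in range(2000)]  # far from uniform
```

The tester accepts when the empirical collision rate is near 1/n, since the uniform distribution minimizes collision probability; a distribution far from uniform (here, one supported on half the domain) produces noticeably more collisions.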
It is not difficult to see that each of the second and third problems generalizes the previous one, and therefore has query complexity at least as large. All of these tasks are known to require sample complexity n^{Ω(1)}
in the standard sampling model (SAMP); yet, as prior work [16, 14] shows, their complexity decreases
tremendously when one allows the more flexible type of access to the distribution(s) provided by a
conditional sampling oracle (COND). For the problems of uniformity testing and identity testing, the
sample complexity even becomes a constant provided the testing algorithm is allowed to be adaptive (i. e.,
when the next queries it makes can depend on the samples it previously obtained).
Testing uniformity and identity. The identity testing question is a generalization of uniformity testing,
where D∗ is taken to be the uniform distribution over [n]. The complexity of both tasks is well-understood in the sampling model; in particular, it is known that for both uniformity and identity testing Θ(√n/ε^2) samples are necessary and sufficient (see [27, 10, 33, 45] for the tight bounds on these problems).
The uniformity testing problem exemplifies the savings granted by conditional sampling—as Canonne, Ron, and Servedio [14] showed, in this setting only Õ(1/ε^2) adaptive queries3 are sufficient (and this is optimal, up to logarithmic factors). They further prove that identity testing has constant sample complexity as well, namely Õ(1/ε^4)—very recently improved to a near-optimal Õ(1/ε^2) by Falahatgar
et al. [24]. The power of the COND model is evident from the fact that a task requiring polynomially
many samples in the standard model can now be achieved with a number of samples independent of the
domain size n.
The aforementioned algorithms crucially leverage the ability to make adaptive conditional queries to
the probability distributions. Restricting the study to non-adaptive algorithms, Chakraborty et al. [16]
describe a poly(log n, 1/ε)-query non-adaptive tester for uniformity, showing that even without the full
power of conditional queries one can still get an exponential improvement over the standard sampling
setting. They also obtain an Ω(log log n) lower bound for this problem, and leave open the possibility
of improving this lower bound to a logarithmic dependence. We answer this question, establishing
in Theorem 1.3 that any non-adaptive uniformity tester must perform Ω(log n) conditional queries.
2 For this problem, it is typically assumed that all points in the support have probability mass at least Ω(1)/n, as without such a guarantee it becomes impossible to give any non-trivial estimate (consider for instance a distribution D such that D(i) ∝ 1/2^{in}).
3 Here and throughout, we use the notation Õ(f) to hide polylogarithmic dependencies on the argument, i. e., for expressions of the form O(f log^c f) (for some absolute constant c).
Testing equivalence. The equivalence testing problem has been extensively studied over the past decade, and its sample complexity is now known to be Θ(max(n^{2/3}/ε^{4/3}, √n/ε^2)) in the sampling model [10, 46, 18].
In the COND setting, Canonne, Ron, and Servedio showed that equivalence testing is possible with only poly(log n, 1/ε) queries. Concurrent to our work, Falahatgar et al. [24] brought this upper bound down to Õ((log log n)/ε^5), a doubly exponential improvement over the n^{Ω(1)} samples needed in
the standard sampling model. However, these results still left open the possibility of a constant query
complexity—given that both uniformity and identity testing admit constant-query testers, it is natural to
wonder where equivalence testing lies.4 This question was posed by Fischer at the Bertinoro Workshop
on Sublinear Algorithms 2014 [Sublinear.info, Problem 66]. We make decisive progress in answering
it, ruling out the possibility of any constant-query tester for equivalence. Along with the upper bound
of Falahatgar et al. [24], our results nearly settle the dependence on the domain size, showing that
(log log n)^{Θ(1)} samples are both necessary and sufficient.
Support-size estimation. Raskhodnikova et al. [34] showed that obtaining additive estimates of the
support size requires sample complexity almost linear in n. Subsequent work by Valiant and Valiant [44,
42] settles the question, establishing that Θ(n/ log n) samples are both necessary and sufficient. Note
that the proof of the Valiants’ lower bound translates to multiplicative approximations as well, as it
hinges on the hardness of distinguishing a distribution with support s ≤ n from a distribution with support
s + εn ≥ (1 + ε)s. To the best of our knowledge, the question of getting a multiplicative-factor estimate
of the support size of a distribution given conditional sampling access has not been previously considered.
We provide upper and lower bounds for both the adaptive and non-adaptive versions of this problem.
1.3 Our results
We make significant progress in each of the problems introduced in the previous section, yielding a better
understanding of their query complexities. We prove four results pertaining to the sample complexity of
equivalence testing, support-size estimation, and uniformity testing in the COND framework.
Our main result gives a lower bound on the sample complexity of testing equivalence with adaptive
queries under the COND model, resolving in the negative the question of whether constant-query
complexity was achievable [Sublinear.info, Problem 66].
Theorem 1.1 (Testing Equivalence). Any adaptive algorithm which, given COND access to unknown distributions D1, D2 on [n], distinguishes with probability at least 2/3 between (a) D1 = D2 and (b) dTV(D1, D2) ≥ 1/4, must have query complexity Ω(√(log log n)).
Combined with the recent Õ(log log n) upper bound of Falahatgar et al. [24], this almost settles the sample
complexity of this question. Furthermore, as the related task of identity testing can be performed with
a constant number of queries in the conditional sampling model, this demonstrates an intriguing and
intrinsic difference between the two problems. Our result also shows an interesting contrast with the
usual sampling model, where both identity and equivalence testing have polynomial sample complexity.
4 It is worth noting that an Ω(log^c n) lower bound was known for equivalence testing in a weaker version of the conditional oracle, PAIRCOND (where the tester’s queries are restricted to being either [n] or subsets of size 2 [14]).
Next, we establish a logarithmic lower bound on non-adaptive support-size estimation, for any (large
enough) constant factor. This improves on the result of Chakraborty et al. [16], which gave a doubly
logarithmic lower bound for constant factor support-size estimation.
Theorem 1.2 (Non-Adaptive Support-Size Estimation). Any non-adaptive algorithm which, given COND access to an unknown distribution D on [n], estimates the size of its support up to a factor γ ≥ √2 must have query complexity Ω((log n)/log^2 γ).
Moreover, the approach used to prove this theorem also implies an analogous lower bound on non-
adaptive uniformity testing in the conditional model, answering a conjecture of Chakraborty et al. [16].
Theorem 1.3 (Non-Adaptive Uniformity Testing). Any non-adaptive algorithm which, given COND
access to an unknown distribution D on [n] and parameter ε ∈ (0, 1/4), distinguishes with probability at
least 2/3 between (a) D = U[n] and (b) dTV (D, U[n] ) ≥ ε, must have query complexity Ω((log n)/ε).
We note that these results complement polylog(n)-query upper bounds on non-adaptive support-size
estimation and uniformity testing, the former of which we sketch in this paper, and the latter obtained by
Chakraborty et al. [16]. This shows that both of these problems have query complexity log^{Θ(1)} n in the
non-adaptive case.
Finally, we conclude with an upper bound for adaptive support-size estimation. Specifically, we provide an Õ(log log n)-query algorithm for support-size estimation. This shows that the question becomes doubly exponentially easier when conditional samples are allowed.
Theorem 1.4 (Adaptive Support-Size Estimation). Let τ > 0 be any constant. There exists an adaptive algorithm which, given COND access to an unknown distribution D on [n] (guaranteed to have probability mass at least τ/n on every element of its support) and accuracy parameter ε ∈ (0, 1), makes Õ((log log n)/ε^3) queries5 to the oracle and returns a value ω̃ such that the following holds. With probability at least 2/3, ω̃ ∈ [(1/(1 + ε)) · ω, (1 + ε) · ω], where ω = |supp(D)|.
1.3.1 Relation to the Ron-Tsur model
Recent work of Ron and Tsur [37] studies a model which is slightly different from—and more favorable to the algorithm than—ours. In their setting, the algorithm still performs queries consisting of a subset of the domain,
as in our case. However, the algorithm is also given the promise that the distribution is uniform on a
subset of the domain, and whenever a query set contains 0 probability mass the oracle explicitly indicates
this is the case. Their paper provides a number of results for support-size estimation in this model.
We point out two connections between our work and theirs. First, our Ω(log n) lower bound for
non-adaptive support-size estimation (Theorem 1.2) holds in the model of Ron and Tsur. Although lower
bounds in the conditional sampling setting do not apply directly to their model, our construction and
analysis do carry over, and provide a nearly tight answer to a question left unanswered in their paper.
Also, our Õ(log log n)-query algorithm for adaptive support-size estimation (Theorem 1.4) can be seen as
generalizing their result to the weaker conditional sampling model (most significantly, when we are not
given the promise that the distribution be uniform).
5 We remark that the constant in the Õ depends polynomially on 1/τ.
Problem | COND model | Standard model
--- | --- | ---
Testing equivalence | Õ((log log n)/ε^5) [24]; Ω(√(log log n)) (Theorem 1.1) | Θ(max(n^{2/3}/ε^{4/3}, √n/ε^2)) [18]
Estimating support size (adaptive) | Õ((log log n)/ε^3) (Theorem 1.4); Ω(√(log log n)) [16] (†) | Θ(n/log n) [42]
Estimating support size (non-adaptive) | O(poly(log n, 1/ε)) (Section 5.4); Ω(log n) (Theorem 1.2) | Θ(n/log n) [42]
Testing uniformity (non-adaptive) | Õ((log^5 n)/ε^6) [16]; Ω((log n)/ε) (Theorem 1.3) | Θ(√n/ε^2) [33]

Table 1: Summary of results. Note that the lower bound (†) can also be easily derived from our lower bound on testing equivalence.
1.4 Techniques and proof ideas
Lower bound on adaptive equivalence testing. In order to prove our main lower bound, Theorem 1.1,
we have to deal with one main conceptual issue: adaptivity. While the standard sampling model does
not, by definition, allow any choice on what the next query to the oracle should be, this is no longer the
case for COND algorithms. Quantifying the power that this grants an algorithm makes things much more
difficult. To handle this point, we follow the approach of Chakraborty et al. [16] and focus on a restricted
class of algorithms they introduce, called “core adaptive testers” (see Section 2.2 for a formal definition).
They show that this class of testers is equivalent to general algorithms for the purpose of testing a broad
class of properties, namely those which are invariant to any permutation of the domain. Using this
characterization, it remains for us to show that none of these structurally much simpler core testers can
distinguish whether they are given conditional access to (a) a pair of random identical distributions
(D1 , D1 ), or (b) two distributions (D1 , D2 ) drawn according to a similar process, which are far apart.
At a high level, our lower bound works by designing instances where the property can be tested if
and only if the support size is known to the algorithm. Our construction randomizes the support size by
embedding the instance into a polynomially larger domain. Since the algorithm is only allowed a small
number of queries, Yao’s Minimax Principle allows us to argue that, with high probability, a deterministic
algorithm is unable to “guess” the support size. This separates queries into several cases. First, in a sense
we make precise, it is somehow “predictable” whether or not a query will return an element previously
observed. If this occurs, it is similarly predictable which element the query will return. On the other hand,
if a fresh element is observed, the query set is either “too small” or “too large.” In the former case, the
query will entirely miss the support, and the sampling process is identical for both types of instance. In
the latter case, the query will hit a large portion of the support, and the amount of information gleaned
from a single sample is minimal.
At a lower level, this process itself is reminiscent of the “hard” instances underlying the lower
bound of Canonne, Ron, and Servedio [14] for testing identity (with a PAIRCOND oracle), with one
pivotal twist. As in their work, both D1 and D2 are uniform within each of ω(1) “buckets” whose size
grows exponentially and are grouped into “bucket-pairs.” Then, D2 is obtained from D1 by internally
redistributing the probability mass of each pair of buckets, so that the total mass of each pair is preserved
but each particular bucket has mass going up or down by a constant factor (see Section 3.1 for details of
the construction). However, we now add a final step, where in both D1 and D2 the resulting distribution’s
support is scaled by a random factor, effectively reducing it to a (randomly) negligible fraction of the
domain. Intuitively, this last modification has the role of “blinding” the testing algorithm. We argue
that unless its queries are on sets whose size somehow match (in a sense formalized in Section 3.2) this
random size of the support, the sequences of samples it will obtain under D1 and D2 are almost identically
distributed. The above discussion crucially hides many significant aspects and technical difficulties which
we address in Section 3. Moreover, we observe that the lower bound we obtain seems to be optimal with regard to our proof techniques (specifically, to the decision tree approach), and not an artifact of our
lower bound instances. Namely, there appear to be conceptual barriers to strengthening our result, which
would require new ideas.
Lower bound on non-adaptive support-size estimation. Turning to the (non-adaptive) lower bound
of Theorem 1.2, we define two families of distributions D1 and D2 , where an instance is either a draw
(D1 , D2 ) from D1 × D2 , or simply (D1 , D1 ). Any distribution in D2 has support size γ times that of its
corresponding distribution in D1 . Yet, we argue that no non-adaptive deterministic tester making too
few queries can distinguish between these two cases, as the tuple of samples it will obtain from D1 or
(the corresponding) D2 is almost identically distributed (where the randomness is over the choice of the
instance itself). To show this last point, we analyze separately the case of “small” queries (conditioning
on sets which turn out to be much smaller than the actual support size, and thus with high probability will
not even intersect it) and the “large” ones (where the query set S is so big compared to the support T that a
uniform sample from S ∩ T is essentially indistinguishable from a uniform sample from S). We conclude
the proof by invoking Yao’s Principle, carrying the lower bound back to the setting of non-adaptive
randomized testers.
Interestingly, this argument essentially gives us Theorem 1.3 “for free.” Indeed, the big-query-set case
above is handled by proving that the distribution of samples returned on those queries is indistinguishable,
both for D1 and D2 , from samples obtained from the actual uniform distribution. Considering again
the small-query-set case separately, this allows us to argue that a random distribution from (say) D1 is
indistinguishable from uniform.
Upper bound on support-size estimation. Our algorithm for estimating the support size within a
constant factor (Theorem 1.4) is simple in spirit, and follows a guess-and-check strategy. In more detail, it
first obtains a “reference point” outside the support, to check whether subsequent samples it may consider
belong to the support. Then, it attempts to find a rough upper bound on the size of the support, of the
j
form 22 (so that only log log n many options have to be considered); by using its reference point to check
if a uniform random subset of this size contains, as it should, at least one point from the support. Once
such an upper bound has been obtained using this double-exponential strategy, a refined bound is then
obtained via a binary search on the new range of values for the exponent, {2 j−1 , . . . , 2 j }. Not surprisingly,
our algorithm draws on similar ideas as in [37, 41], with some additional machinery to supplement
the differences in the models. Interestingly, as a side-effect, this upper bound shows our analysis of
Theorem 1.1 to be tight up to a quadratic improvement. Indeed, the lower bound construction we consider
(see Section 3.1) can be easily “defeated” if an estimate of the support size is known, and therefore cannot
yield better than an Ω(log log n) lower bound. Similarly, this further shows that the adaptive lower bound
for support-size estimation of Chakraborty et al. [16] is also tight up to a quadratic improvement.
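The guess-and-check phase described above can be sketched as follows (our own simplified illustration, not the paper's algorithm: the subset-intersection check stands in for the reference-point COND queries, and the binary-search refinement of the exponent is omitted):

```python
import random

def rough_support_upper_bound(D, n, trials=200, rng=random):
    """Our simplified sketch of the guess-and-check phase: try guesses of
    the form 2**(2**j) for |supp(D)|; for each guess, draw random subsets
    expected to intersect the support iff the guess is still too small,
    and stop at the first guess where they mostly miss."""
    support = {x for x in range(n) if D.get(x, 0) > 0}

    def random_subset_hits_support(size):
        # In the actual algorithm this test is done with COND queries,
        # using a reference point known to lie outside the support.
        return any(x in support for x in rng.sample(range(n), size))

    j = 0
    while 2 ** (2 ** j) <= n:
        guess = 2 ** (2 ** j)
        size = max(1, n // guess)  # expected hits: about |supp(D)| / guess
        hits = sum(random_subset_hits_support(size) for _ in range(trials))
        if hits < trials // 2:     # subsets mostly miss: guess exceeds |supp|
            return guess
        j += 1
    return n
```

Only log log n values of j are examined, and the returned guess overshoots the true support size by at most (roughly) a quadratic factor, which the binary-search phase of the actual algorithm then tightens.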
Organization. The rest of the paper describes details and proofs of the results mentioned in the above
discussion. In Section 2, we introduce the necessary definitions and some of the tools we shall use.
Section 3 covers our main result on adaptive equivalence testing, Theorem 1.1. In Section 4 we prove our
lower bounds for support-size estimation and uniformity testing, and Section 5 details our upper bounds
for support-size estimation. The corresponding sections may be read independently.
2 Preliminaries
2.1 Notation and sampling models
All throughout this paper, we denote by [n] the set {1, . . . , n}, and by log the logarithm in base 2. A
probability distribution over a (countable) domain [n] is a non-negative function D : [n] → [0, 1] such that
∑x∈[n] D(x) = 1. We denote by US the uniform distribution on a set S. Given a distribution D over [n]
and a set S ⊆ [n], we write D(S) for the total probability mass ∑x∈S D(x) assigned to S by D. Finally,
for S ⊆ [n] such that D(S) > 0, we denote by DS the conditional distribution of D restricted to S, that is
DS (x) = D(x)/D(S) for x ∈ S and DS (x) = 0 otherwise.
As is usual in distribution testing, in this paper the distance between two distributions D1 , D2 on [n]
will be the total variation distance.
dTV(D1, D2) := (1/2) ‖D1 − D2‖_1 = (1/2) ∑_{x∈[n]} |D1(x) − D2(x)| = max_{S⊆[n]} (D1(S) − D2(S)),   (2.1)
which takes value in [0, 1].
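As a quick illustration (ours, not from the paper), the middle expression in (2.1) can be evaluated directly for two small distributions:

```python
# Illustrative sketch (ours, not from the paper): the total variation
# distance dTV(D1, D2) = (1/2) * sum over x of |D1(x) - D2(x)|, for
# distributions represented as lists of probabilities over [n].

def total_variation(d1, d2):
    assert len(d1) == len(d2)
    return 0.5 * sum(abs(p - q) for p, q in zip(d1, d2))

uniform = [0.25, 0.25, 0.25, 0.25]   # the uniform distribution on [4]
skewed = [0.5, 0.5, 0.0, 0.0]        # supported on half the domain

# dTV(uniform, skewed) = (1/2)(0.25 + 0.25 + 0.25 + 0.25) = 0.5, which
# matches the maximum in (2.1), attained at S = {1, 2}.
```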
In this work, we focus on the setting of conditional access to the distribution, as introduced and
studied in [16, 14]. We reproduce below the corresponding definition of a conditional oracle, henceforth
referred to as COND.
Definition 2.1 (Conditional access model). Fix a distribution D over [n]. A COND oracle for D, denoted
CONDD , is defined as follows: the oracle takes as input a query set S ⊆ [n], chosen by the algorithm, that
has D(S) > 0. The oracle returns an element i ∈ S, where the probability that element i is returned is
DS (i) = D(i)/D(S), independently of all previous calls to the oracle.
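For intuition, the oracle of Definition 2.1 can be simulated as follows (our own sketch, with our own choice of representation; for the edge case D(S) = 0 it follows the uniform-fallback convention the authors adopt below):

```python
import random

# Illustrative simulation of a COND oracle (our own sketch, not code from
# the paper). D maps domain elements to probabilities; S is the query set.
def cond_query(D, S, rng=random):
    """Return i in S with probability D(i)/D(S), as in Definition 2.1.
    If D(S) = 0, return a uniformly random element of S (the convention
    adopted in the paper for this edge case)."""
    S = list(S)
    mass = sum(D.get(i, 0.0) for i in S)
    if mass == 0.0:
        return rng.choice(S)
    r = rng.random() * mass
    for i in S:
        r -= D.get(i, 0.0)
        if r <= 0.0:
            return i
    return S[-1]  # guard against floating-point round-off
```

For instance, with D uniform on {1, 2} and query set {1, 3}, every returned sample is 1, since the conditional distribution D_S puts all of its mass there.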
Note that as described above the behavior of CONDD (S) is undefined if D(S) = 0, i. e., the set S has
zero probability under D. Various definitional choices could be made to deal with this. These choices
do not make a significant difference in most situations, as adaptive algorithms can include in their next
5 Recall that a non-adaptive tester is an algorithm whose queries do not depend on the answers obtained from previous ones,
but only on its internal randomness. Equivalently, it is a tester that can commit “upfront” to all the queries it will make to the
oracle.
queries a sample previously obtained; while our lower bounds can be thought of as putting exponentially small probability mass on elements outside the support. For this reason, and for convenience, we shall
hereafter assume, following Chakraborty et al., that the oracle returns in this case a sample uniformly
distributed in S. Furthermore, as in [14, 16] we do not take the complexity of specifying the set S to the
oracle into account, and indeed allow arbitrary sets as queries.6
Finally, recall that a property P of distributions over [n] is a set consisting of all distributions that have the property. The distance from D to a property P, denoted dTV (D, P), is then defined as $\inf_{D' \in P} d_{\mathrm{TV}}(D, D')$.
We use the standard definition of testing algorithms for properties of distributions over [n], tailored for
the setting of conditional access to an unknown distribution.
Definition 2.2 (Property tester). Let P be a property of distributions over [n]. A t-query COND testing
algorithm for P is a randomized algorithm T which takes as input n, ε ∈ (0, 1], as well as access to
CONDD . After making at most t(ε, n) calls to the oracle, T either returns ACCEPT or REJECT, such
that the following holds:
• if D ∈ P, T returns ACCEPT with probability at least 2/3;
• if dTV (D, P) ≥ ε, T returns REJECT with probability at least 2/3.
We observe that the above definitions can be straightforwardly extended to the more general setting
of pairs of distributions, where given independent access to two oracles CONDD1 , CONDD2 the goal is to
test whether (D1 , D2 ) satisfies a property (now a set of pairs of distributions). This will be the case in Section 3, where we will consider equivalence testing, that is, the property $P_{\mathrm{eq}} = \{(D_1, D_2) : D_1 = D_2\}$.
2.2 Adaptive core testers
In order to deal with adaptivity in our lower bounds, we will use ideas introduced by Chakraborty et al. [16]. These ideas, for the case of label-invariant properties,7 allow one to narrow down the range of possible testers and focus on a restricted class of such algorithms called adaptive core testers. These core testers do not have access to the full information of the samples they draw, but instead only get to see the relations (inclusions, equalities) between the queries they make and the samples they get. Yet, Chakraborty et al. [16] show that any tester for a label-invariant property can be converted into a core tester with the same query complexity; thus, it is enough to prove lower bounds against this—seemingly—weaker class of algorithms.
We here rephrase the definitions of a core tester and the view they have of the interaction with the
oracle (the configuration of the samples), tailored to our setting.
Definition 2.3 (Atoms and partitions). Given a family $A = (A_1, \dots, A_t)$ of subsets of $[n]$, the atoms generated by A are the (at most) $2^t$ distinct sets of the form $\bigcap_{r=1}^{t} C_r$, where $C_r \in \{A_r, [n] \setminus A_r\}$. The family of all atoms, denoted $\mathrm{At}(A)$, is the partition generated by A.
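As a concrete illustration of Definition 2.3, the atoms can be enumerated by intersecting, for each choice vector, every $A_r$ or its complement; a minimal sketch (with our own naming):

```python
from itertools import product

def atoms(A, n):
    """Return the nonempty atoms generated by the family A of subsets of
    range(n): all distinct intersections of each A_r or its complement."""
    universe = set(range(n))
    result = []
    for choice in product([False, True], repeat=len(A)):
        atom = set(universe)
        for take, a in zip(choice, A):
            atom &= a if take else (universe - a)
        if atom:
            result.append(atom)
    return result

A = [{0, 1, 2}, {2, 3}]
parts = atoms(A, 5)
# The atoms form a partition of the domain: disjoint, covering everything.
assert sum(len(p) for p in parts) == 5
assert set().union(*parts) == set(range(5))
```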
6 We further observe that, besides the general CONDD oracle which allows these arbitrary query sets, the authors of [14] introduce two weaker variants of the conditional model (the “pair-cond” PAIRCONDD and “interval-cond” INTCONDD oracles) which restrict algorithms to “simple” queries.
7 Recall that a property is label-invariant (or symmetric) if it is closed under relabeling of the elements of the support.
More precisely, a property of distributions (resp. pairs of distributions) P is label-invariant if for any distribution D ∈ P (resp.
(D1 , D2 ) ∈ P) and permutation σ of [n], one has D ◦ σ ∈ P (resp. (D1 ◦ σ , D2 ◦ σ ) ∈ P).
This definition essentially captures “all sets (besides the $A_i$) about which something can be learned from querying the oracle on the sets of A.” Now, given such a sequence of queries $A = (A_1, \dots, A_t)$ and pairs of samples $s = ((s_1^{(1)}, s_1^{(2)}), \dots, (s_t^{(1)}, s_t^{(2)})) \in A_1^2 \times \cdots \times A_t^2$, we would like to summarize “all the label-invariant information available to an algorithm that obtains $((s_1^{(1)}, s_1^{(2)}), \dots, (s_t^{(1)}, s_t^{(2)}))$ upon querying $A_1, \dots, A_t$ for $D_1$ and $D_2$.” This calls for the following definition.
Definition 2.4 (t-configuration). Given $A = (A_1, \dots, A_t)$ and $s = ((s_j^{(1)}, s_j^{(2)}))_{1 \le j \le t}$ as above, the t-configuration of s consists of the $6t^2$ bits indicating, for all $1 \le i, j \le t$, whether

• $s_i^{(k)} = s_j^{(\ell)}$, for $k, \ell \in \{1, 2\}$; and (relations between samples)

• $s_i^{(k)} \in A_j$, for $k \in \{1, 2\}$. (relations between samples and query sets)

In other terms, it summarizes which is the unique atom $S_i \in \mathrm{At}(A)$ that contains $s_i^{(k)}$, and what collisions between samples have been observed.
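The t-configuration is purely mechanical to extract from the queries and samples; a small sketch under our own naming, indexing k ∈ {0, 1} instead of {1, 2}:

```python
def t_configuration(A, s):
    """Compute the t-configuration of Definition 2.4.
    A: list of t query sets; s: list of t sample pairs (s_i^(1), s_i^(2)).
    Returns a dict of 6*t^2 booleans, keyed by (i, j, kind, ...)."""
    t = len(A)
    config = {}
    for i in range(t):
        for j in range(t):
            # Four equality bits: s_i^(k) == s_j^(l) for k, l in {0, 1}.
            for k in (0, 1):
                for l in (0, 1):
                    config[(i, j, "eq", k, l)] = (s[i][k] == s[j][l])
            # Two membership bits: s_i^(k) in A_j for k in {0, 1}.
            for k in (0, 1):
                config[(i, j, "in", k)] = (s[i][k] in A[j])
    return config

A = [{0, 1}, {1, 2}]
s = [(0, 1), (1, 2)]
cfg = t_configuration(A, s)
assert len(cfg) == 6 * len(A) ** 2      # exactly 6*t^2 bits
assert cfg[(0, 1, "eq", 1, 0)]          # second sample of query 1 collides
assert cfg[(0, 0, "in", 0)]             # first sample of query 1 lies in A_1
```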
As mentioned before, the key idea is to argue that, without loss of generality, one can restrict one’s attention to algorithms that only have access to t-configurations, and that generate their queries in a specific (albeit adaptive) fashion.
Definition 2.5 (Core adaptive tester). A core adaptive distribution tester for pairs of distributions is an
algorithm T that acts as follows.
• In the i-th phase, based only on its own internal randomness and the configuration of the previous queries $A_1, \dots, A_{i-1}$ and samples obtained $(s_1^{(1)}, s_1^{(2)}), \dots, (s_{i-1}^{(1)}, s_{i-1}^{(2)})$—whose labels it does not actually know—T provides:

– a number $k_i^A$ for each $A \in \mathrm{At}(A_1, \dots, A_{i-1})$, between 0 and $\big|A \setminus \{s_j^{(1)}, s_j^{(2)}\}_{1 \le j \le i-1}\big|$. (How many fresh, not-already-seen elements of each particular atom A should be included in the next query.)

– sets $K_i^{(1)}, K_i^{(2)} \subseteq \{1, \dots, i-1\}$. (Which of the samples $s_1^{(k)}, \dots, s_{i-1}^{(k)}$ will be included in the next query. The labels of these samples are unknown, but they are indexed by the index of the query which returned them.)
• Based on these specifications, the next query $A_i$ is drawn (but not revealed to T) by

– drawing uniformly at random a set $\Lambda_i$ in
$$\left\{\, \Lambda \subseteq [n] \setminus \{s_j^{(1)}, s_j^{(2)}\}_{1 \le j \le i-1} \;:\; \forall A \in \mathrm{At}(A_1, \dots, A_{i-1}),\ |\Lambda \cap A| = k_i^A \,\right\}. \tag{2.2}$$
That is, among all sets, containing only “fresh” elements, whose intersection with each atom contains as many elements as T requires.
– adding the selected previous samples to this set:
$$\Gamma_i \stackrel{\mathrm{def}}{=} \left\{\, s_j^{(1)} : j \in K_i^{(1)} \,\right\} \cup \left\{\, s_j^{(2)} : j \in K_i^{(2)} \,\right\}; \qquad A_i \stackrel{\mathrm{def}}{=} \Lambda_i \cup \Gamma_i. \tag{2.3}$$
This results in a set $A_i$, not fully known to T beyond the samples it has already seen and decided to query again: the labels of the fresh elements are unknown, but the number of elements belonging to each atom is known.
• Samples $s_i^{(1)} \sim (D_1)_{A_i}$ and $s_i^{(2)} \sim (D_2)_{A_i}$ are drawn (but not disclosed to T). This defines the i-configuration of $A_1, \dots, A_i$ and $(s_1^{(1)}, s_1^{(2)}), \dots, (s_i^{(1)}, s_i^{(2)})$, which is revealed to T. Put differently, the algorithm only learns (i) to which of the $A_\ell$ the new sample belongs, and (ii) if it is one of the previous samples, in which stage(s) and for which of $D_1, D_2$ it has already seen it.

After $t = t(\varepsilon, n)$ such stages, T returns either ACCEPT or REJECT, based only on the configuration of $A_1, \dots, A_t$ and $(s_1^{(1)}, s_1^{(2)}), \dots, (s_t^{(1)}, s_t^{(2)})$ (which is all the information it ever had access to).
Note that, in particular, T knows neither the labels of the samples it receives nor the actual queries it makes: it knows everything about their sizes and the sizes of their intersections, but not the actual “identity” of the elements they contain.
2.3 On the use of Yao’s Principle in our lower bounds
We recall Yao’s Principle (e. g., see Chapter 2.2 of [31]), a technique which is ubiquitous in the analysis
of randomized algorithms. Consider a set S of instances of some problem: what this principle states is
that the worst-case expected cost of a randomized algorithm on instances in S is lower-bounded by the
expected cost of the best deterministic algorithm on an instance drawn randomly from S.
As an example, we apply it in a standard way in Section 4: instead of considering a randomized
algorithm working on a fixed instance, we instead analyze a deterministic algorithm working on a random
instance. (We note that, importantly, the randomness in the samples returned by the COND oracle is
“external” to this argument, and these samples behave identically in an application of Yao’s Principle.)
On the other hand, our application in Section 3 is slightly different, due to our use of adaptive
core testers. Once again, we focus on deterministic algorithms working on random instances, and the
randomness in the samples is external and therefore unaffected by Yao’s Principle. However, we stress
that the randomness in the choice of the set Λi is also external to the argument, and therefore unaffected—
similar to the randomness in the samples, the algorithm has no control here. Another way of thinking
about this randomness is via another step in the distribution over instances: after an instance (which is
a pair of distributions) is randomly chosen, we permute the labels on the elements of the distribution’s
domain uniformly at random. We note that since the property in question is label-invariant, this does not
affect its value. We can then use the model as stated in Section 2.2 for ease of analysis, observing that
this can be considered an application of the principle of deferred decisions (as in Chapter 3.5 of [31]).
2.4 Chernoff bounds for binomials and hypergeometrics
We will make extensive use of Chernoff-style bounds in this article. Recall that the Binomial(n, p)
distribution describes the distribution of the number of successes when we run n independent Bernoulli
trials, each with success probability p.
Lemma 2.6 (Chernoff Bound for Binomials). Let $X \sim \mathrm{Binomial}(n, p)$ and $\mu = \mathrm{E}[X] = np$. Then
$$\forall \delta \in (0, 1), \qquad \Pr[\,|X - \mu| \ge \delta\mu\,] \le 2\exp\left(-\frac{\delta^2\mu}{3}\right).$$
We will also need a similar Chernoff-style bound for the hypergeometric distribution. The
Hypergeometric(n, K, N) distribution describes the distribution of the number of successes when we
draw n times without replacement from a population of size N, in which K objects have the pertinent
feature (and thus count as successes). Note that if the drawing were done with replacement, and K/N = p,
then this would be equivalent to Binomial(n, p). Sampling without replacement introduces negative
correlation between the probability of each draw being successful. This type of negative correlation
generally “helps” with concentration, allowing one to prove similar concentration bounds (see, e. g.,
[19, 23], Theorem 1.17 of [6]).
Lemma 2.7 (Chernoff Bound for Hypergeometrics). Let $X \sim \mathrm{Hypergeometric}(n, K, N)$ and $\mu = \mathrm{E}[X] = nK/N$. Then
$$\forall \delta \in (0, 1), \qquad \Pr[\,|X - \mu| \ge \delta\mu\,] \le 2\exp\left(-\frac{\delta^2\mu}{3}\right).$$
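Both tail bounds are easy to sanity-check numerically; the sketch below (helper names and Monte Carlo parameters are ours) estimates the deviation probability by simulation and compares it to the bound $2\exp(-\delta^2\mu/3)$:

```python
import math
import random

def chernoff_bound(mu, delta):
    """The right-hand side 2*exp(-delta^2 * mu / 3) of Lemmas 2.6 and 2.7."""
    return 2.0 * math.exp(-delta * delta * mu / 3.0)

def empirical_tail_binomial(n, p, delta, trials=2000, seed=0):
    """Monte Carlo estimate of Pr[|X - mu| >= delta*mu], X ~ Binomial(n, p)."""
    rng = random.Random(seed)
    mu = n * p
    hits = sum(
        abs(sum(rng.random() < p for _ in range(n)) - mu) >= delta * mu
        for _ in range(trials)
    )
    return hits / trials

def empirical_tail_hypergeometric(n, K, N, delta, trials=1000, seed=1):
    """Same for X ~ Hypergeometric(n, K, N): n draws without replacement
    from N objects, K of which count as successes."""
    rng = random.Random(seed)
    mu = n * K / N
    population = [1] * K + [0] * (N - K)
    hits = sum(
        abs(sum(rng.sample(population, n)) - mu) >= delta * mu
        for _ in range(trials)
    )
    return hits / trials

# With mu = 100 and delta = 1/2 the bound is about 4.8e-4; the empirical
# frequencies should fall below it.
assert empirical_tail_binomial(500, 0.2, 0.5) <= chernoff_bound(100.0, 0.5)
assert empirical_tail_hypergeometric(500, 400, 2000, 0.5) <= chernoff_bound(100.0, 0.5)
```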
3 A lower bound for equivalence testing
We prove our main lower bound on the sample complexity of testing equivalence between unknown
distributions. We construct two priors Y and N over pairs of distributions (D1 , D2 ) over [n]. Y is a
distribution over pairs of distributions of the form (D, D), namely the case when the distributions are
identical. Similarly, N is a distribution over (D1 , D2 ) with dTV (D1 , D2 ) ≥ 1/4. We then show that no algorithm T making $O(\sqrt{\log\log n})$ queries to CONDD1 , CONDD2 can distinguish between a draw from Y and N with constant probability (over the choice of (D1 , D2 ), the randomness in the samples it obtains, and its internal randomness).
We describe the construction of Y and N in Section 3.1, and provide a detailed analysis in Section 3.2.
3.1 Construction
We now summarize how a pair of distributions is constructed under Y and N. (Each specific step will be described in more detail in the subsequent paragraphs.)

1. Effective Support

(a) Pick $k_b$ from the set $\{0, 1, \dots, \frac{1}{2}\log n\}$ at random.

(b) Let $b = 2^{k_b}$ and $m \stackrel{\mathrm{def}}{=} b \cdot n^{1/4}$.
2. Buckets

(a) Choose ρ and r such that $\sum_{i=1}^{2r} \rho^i = n^{1/4}$.

(b) Divide $\{1, \dots, m\}$ into intervals $B_1, \dots, B_{2r}$ with $|B_i| = b \cdot \rho^i$.
3. Distributions
(a) For each i ∈ [2r], assign probability mass 1/(2r) uniformly over Bi to generate distribution
D1 .
(b) For each i ∈ [r] independently, pick πi to be a Bernoulli trial with Pr(πi = 0) = 1/2; if πi = 0
then assign probability mass 1/(4r) and 3/(4r) over B2i−1 and B2i , respectively, else 3/(4r)
and 1/(4r), respectively. This generates a distribution D2 .
4. Support relabeling

(a) Pick a permutation σ ∈ Sn of the domain [n].

(b) Relabel the symbols of D1 and D2 according to σ.

5. Output: Output (D1 , D1 ) in the Y case, and (D1 , D2 ) in the N case.
Figure 1: A no-instance (D1 , D2 ) (before permutation). (The figure plots $D_j(i)$ against $i \in [n]$, showing the buckets $B_1, B_2, B_3, B_4, \dots$ over the effective support $[m]$.)
We now describe the various steps of the construction in greater detail.
Effective support. Both D1 and D2 , albeit distributions on [n], will have (common) sparse support. The support size is taken to be $m \stackrel{\mathrm{def}}{=} b \cdot n^{1/4}$. Note that, from the above definition, m is chosen uniformly at random from products of $n^{1/4}$ with powers of 2, resulting in values in $[n^{1/4}, n^{3/4}]$.
In this step, b acts as a random scaling factor. The objective of this random scaling is to induce uncertainty in the algorithm’s knowledge of the true support size of the distributions, and to prevent it from leveraging this information to test equivalence. Indeed, one can verify that if the construction were carried out for a single fixed value of b (so that all distributions had the same value of m), then one could distinguish the Y and N cases with only O(1) conditional queries. The test would (roughly) go as follows. Since |Bi | is known, one can choose a random subset S of the domain which (with high probability) has no intersection with Bi for i ≤ 2r − 2, an O(1)-size intersection with B2r−1 , and an O(ρ)-size intersection with B2r . Perform O(1) conditional queries on the set S, for both distributions. Given these queries, we can then identify which elements of S belong to B2r−1 or B2r —namely, those which occur at most once belong to B2r , and those which occur at least twice belong to B2r−1 . In a Y instance, in both distributions a 1/2 fraction of queries will belong to B2r−1 , whereas in an N instance, one distribution will have either a 1/4 or 3/4 fraction of queries in B2r−1 , allowing us to distinguish the two cases.
Buckets. Our construction is inspired by the lower bound of Canonne, Ron, and Servedio [14, Theorem 8] for the more restrictive PAIRCOND access model. We partition the support into 2r consecutive intervals (henceforth referred to as buckets) B1 , . . . , B2r , where the size of the i-th bucket is $b\rho^i$. We note that r and ρ will be chosen such that $\sum_{i=1}^{2r} b\rho^i = b n^{1/4} = m$, i. e., the buckets fill the effective support.
Distributions. We output a pair of distributions (D1 , D2 ). Each distribution that we construct is uniform within any particular bucket Bi . In particular, the first distribution assigns the same mass 1/(2r) to each bucket; therefore, points within Bi have the same probability mass $1/(2rb\rho^i)$. For the Y case, the second distribution is identical to the first. For the N case, we pair buckets into r consecutive bucket-pairs Π1 , . . . , Πr , with Πi = B2i−1 ∪ B2i . For the second distribution D2 , we consider the same buckets as D1 , but repartition the mass 1/r within each Πi . More precisely, in each pair, one of the buckets now gets total probability mass 1/(4r) while the other gets 3/(4r) (so that the probability of every point is either multiplied by 1/2 or by 3/2). The choice of which goes up and which goes down is made uniformly and independently at random for each bucket-pair, determined by the random choices of the πi .
Random relabeling. The final step of the construction randomly relabels the symbols, via a random injective map from [m] to [n]. This is done to ensure that no information about the individual symbol labels can be used by the algorithm for testing. For example, without this step the algorithm could query a few symbols from the first bucket and distinguish the Y and N cases. As mentioned in Section 2.3, for ease of analysis, the randomness in the choice of the permutation is, in some sense, deferred to the randomness in the choice of Λi during the algorithm’s execution.
Summary. A no-instance (D1 , D2 ) is thus defined by the following parameters: the support size m, the vector $(\pi_1, \dots, \pi_r) \in \{0,1\}^r$ (which only impacts D2 ), and the final permutation σ of the domain. A yes-instance (D1 , D1 ) follows an identical process; in this case, however, $\vec{\pi}$ has no influence on the final outcome.
See Figure 1 for an illustration of such a (D1 , D2 ) when σ is the identity permutation and thus the
distribution is supported over the first m natural numbers.
Values for ρ and r. By setting $r = \log n/(8\log\rho) + O(1)$, we have as desired $\sum_{i=1}^{2r} |B_i| = m$, and there is a factor $(1+o(1))\,n^{1/4}$ between the height of the first bucket B1 and that of the last, B2r . It remains to choose the parameter ρ itself; we shall take it to be $2^{\sqrt{\log n}}$, resulting in $r = \frac{1}{8}\sqrt{\log n} + O(1)$. (Note that for the sake of the exposition, we ignore technical details such as the rounding of parameters, e. g., bucket sizes; these can be easily taken care of at the price of cumbersome case analyses, and do not bring much to the argument.)
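The construction (before the relabeling step, which does not affect total variation distance) can be sketched as follows; the function names and the crude rounding choices are ours, and the value dTV (D1 , D2 ) = 1/4 for no-instances (Fact 3.1 below) serves as a sanity check:

```python
import math
import random

def build_instance(n, seed=0):
    """Build a no-instance pair (D1, D2) following Section 3.1, before the
    final relabeling.  Each distribution is represented bucket by bucket as
    a list of (num_elements, per_element_mass) pairs.  Rounding of the
    parameters is handled crudely, as the paper deliberately allows."""
    rng = random.Random(seed)
    log_n = math.log2(n)
    rho = 2 ** round(math.sqrt(log_n))               # rho = 2^{sqrt(log n)}
    r = max(1, round(log_n / (8 * math.log2(rho))))  # r ~ log n / (8 log rho)
    b = 2 ** rng.randrange(int(log_n / 2) + 1)       # b = 2^{k_b}
    sizes = [b * rho ** i for i in range(1, 2 * r + 1)]  # |B_i| = b rho^i
    # D1: mass 1/(2r) spread uniformly over each bucket.
    D1 = [(s, 1.0 / (2 * r * s)) for s in sizes]
    # D2: within each bucket-pair, masses become (1/(4r), 3/(4r)) or swapped.
    D2 = []
    for i in range(r):
        lo, hi = sizes[2 * i], sizes[2 * i + 1]
        if rng.random() < 0.5:                       # pi_i = 0
            D2 += [(lo, 1.0 / (4 * r * lo)), (hi, 3.0 / (4 * r * hi))]
        else:                                        # pi_i = 1
            D2 += [(lo, 3.0 / (4 * r * lo)), (hi, 1.0 / (4 * r * hi))]
    return D1, D2

def total_variation(D1, D2):
    """dTV = (1/2) sum_x |D1(x) - D2(x)|; the buckets of D1 and D2 align."""
    return 0.5 * sum(s * abs(p - q) for (s, p), (_, q) in zip(D1, D2))

D1, D2 = build_instance(2 ** 16)
assert abs(sum(s * p for s, p in D1) - 1.0) < 1e-9   # D1 sums to 1
assert abs(sum(s * p for s, p in D2) - 1.0) < 1e-9   # D2 sums to 1
assert abs(total_variation(D1, D2) - 0.25) < 1e-9    # dTV = 1/4
```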
3.2 Analysis
We now prove our main lower bound, by analyzing the behavior of core adaptive testers (as per Def-
inition 2.5) on the families Y and N from the previous section. In Section 3.2.1, we argue that, with
high probability, the sizes of the queries performed by the algorithm satisfy some specific properties.
Conditioned upon this event, in Section 3.2.2, we show that the algorithm will get similar information
from each query, whether it is running on a yes-instance or a no-instance.
Before moving to the heart of the argument, we state the following fact, straightforward from the
construction of our no-instances.
Fact 3.1. For any (D1 , D2 ) drawn from N, one has dTV (D1 , D2 ) = 1/4.
Moreover, as allowing more queries can only increase the probability of success, we hereafter focus on a core adaptive tester that performs exactly $q = \frac{1}{10}\sqrt{\log\log n}$ (adaptive) queries, and will show that it can only distinguish between yes- and no-instances with probability o(1).
3.2.1 Banning “bad queries”
As mentioned in Section 3.1, the draw of a yes- or no-instance involves a random scaling of the size of the support of the distributions, meant to “blind” the testing algorithm. Recall that a testing algorithm is specified by a decision tree which, at step i, specifies how many unseen elements from each atom to include in the query (the $\{k_i^A\}$) and which previously seen elements to include in the query (the sets $K_i^{(1)}, K_i^{(2)}$, as defined in Section 2.2), where the algorithm’s choice depends on the observed configuration at that time. Note that, using Yao’s Principle (as discussed in Section 2.3), these choices are deterministic for a given configuration—in particular, we can think of all the $\{k_i^A\}$ and $K_i^{(1)}, K_i^{(2)}$ in the decision tree as being fixed. In this section, we show that all $k_i^A$ values satisfy with high probability some particular conditions with respect to the choice of distribution, where the randomness is over the choice of the support size.
First, we recall an observation from [16], though we modify it slightly to apply to configurations on
pairs of distributions and we apply a slightly tighter analysis. This essentially limits the number of states
an algorithm could be in by a function of how many queries it makes.
Proposition 3.2. The number of nodes in a decision tree corresponding to a q-sample algorithm is at most $2^{6q^2+1}$.

Proof. As mentioned in Definition 2.4, an i-configuration can be described using $6i^2$ bits, resulting in at most $2^{6i^2}$ i-configurations. Since each i-configuration leads us to some node on the i-th level of the decision tree, the total number of nodes can be upper bounded by summing over the number of i-configurations for i ranging from 0 to q, giving us the desired bound.
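As a quick numerical sanity check of this counting argument (with a helper name of our choosing), one can verify that $\sum_{i=0}^{q} 2^{6i^2} \le 2^{6q^2+1}$ for small q:

```python
def node_bound(q):
    """Sum, over levels i = 0..q, of the number of possible i-configurations
    (at most 2^{6i^2} per level, as each configuration is 6i^2 bits)."""
    return sum(2 ** (6 * i * i) for i in range(q + 1))

# The geometric-like growth means the last level dominates, so the total
# is at most twice the top term, i.e., 2^{6q^2 + 1}.
for q in range(1, 8):
    assert node_bound(q) <= 2 ** (6 * q * q + 1)
```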
For the sake of the argument, we will introduce a few notions applying to the sizes of query sets: namely, the notions of a number being small, large, or stable, and of a vector being incomparable. Roughly speaking, a number is small if a uniformly random set of this size does not, in expectation, hit the largest bucket B2r —in other words, the set is likely to be disjoint from the support. On the other hand, it is large if we expect such a set to intersect many bucket-pairs (i. e., a significant fraction of the support). The definition of stable numbers is slightly more quantitative: a number β is stable if a random set of size β, for each bucket Bi , is either disjoint from Bi or has an intersection with Bi of size close to the expected value. In the latter case, we say the set concentrates over Bi . Finally, a vector of values $(\beta_j)$ is incomparable if the union of random sets $S_1, \dots, S_\ell$ of sizes $\beta_1, \dots, \beta_\ell$ contains (with high probability) an amount of mass $D\big(\bigcup_j S_j\big)$ which is either much smaller or much larger than the probability D(s) of any single element s.

We formalize these concepts in the definitions below. To motivate them, it will be useful to bear in mind that, from the construction described in Section 3.1, the expected intersection of a uniform random set of size β with a bucket Bi is of size $\beta b\rho^i/n$, while the expected probability mass from Bi it contains (under either D1 or D2 ) is $\beta/(2rn)$.
Definition 3.3. Let q be an integer, and let $\varphi = \Theta(q^{5/2})$. A number β is said to be small if $\beta < n/(b\rho^{2r})$; it is large (in relation to the integer q) if $\beta \ge n/(b\rho^{2r-2\varphi})$.

Note that the latter condition equivalently means that, in expectation, a set of large size will intersect at least ϕ + 1 bucket-pairs (as it hits an expected 2ϕ + 1 buckets, since $\beta\,|B_{2r-2\varphi}|/n \ge 1$). From the above definitions we get that, with high probability, a random set of any fixed size will in expectation either hit many buckets or none.

Proposition 3.4. A number is either small or large with probability $1 - O\!\left(\frac{\varphi\log\rho}{\log n}\right)$.
Proof. A number β is neither large nor small if
$$\frac{n}{\beta\rho^{2r}} \le b \le \frac{\rho^{2\varphi}\, n}{\beta\rho^{2r}}\,.$$
The ratio of the endpoints of the interval is $\rho^{2\varphi}$. Since $b = 2^{k_b}$, this implies that at most $\log(\rho^{2\varphi}) = 2\varphi\log\rho$ values of $k_b$ could result in a fixed number falling in this range. As there are Θ(log n) values for $k_b$, the proposition follows.
The next definition characterizes the sizes of query sets for which the expected intersection with any
bucket is either close to 0 (less than 1/α, for some threshold α), or very big (more than α). (It will be
helpful to keep in mind that we will eventually use this definition with α = poly(q).)
Definition 3.5. A number β is said to be α-stable (for α ≥ 1) if, for each j ∈ [2r],
$$\beta \notin \left[\frac{n}{\alpha b\rho^{j}},\ \frac{\alpha n}{b\rho^{j}}\right].$$
A vector of numbers is said to be α-stable if all numbers it contains are α-stable.
Proposition 3.6. A number is α-stable with probability $1 - O\!\left(\frac{r\log\alpha}{\log n}\right)$.

Proof. Fix some j ∈ [2r]. A number β does not satisfy the definition of α-stability for this j if
$$\frac{n}{\alpha\beta\rho^{j}} \le b \le \frac{\alpha n}{\beta\rho^{j}}\,.$$
Since $b = 2^{k_b}$, this implies that at most $2\log\alpha$ values of $k_b$ could result in a fixed number falling in this range. Noting that there are Θ(log n) values for $k_b$ and taking a union bound over all 2r values for j, the proposition follows.
The following definition characterizes the sizes of query sets which have a probability mass far from
the probability mass of any individual element. (For the sake of building intuition, the reader may replace
ν in the following by the parameter b of the distribution.)
Definition 3.7. A vector of numbers $(\beta_1, \dots, \beta_\ell)$ is said to be (α, τ)-incomparable with respect to ν (for τ ≥ 1) if the two following conditions hold.

• $(\beta_1, \dots, \beta_\ell)$ is α-stable.

• Let $\Delta_j$ be the minimum $\Delta \in \{0, \dots, 2r\}$ such that
$$\frac{\beta_j \nu \rho^{2r-\Delta}}{n} \le \frac{1}{\alpha}\,,$$
or 2r if no such ∆ exists. For all i ∈ [2r],
$$\frac{1}{2rn}\sum_{j=1}^{\ell} \beta_j \Delta_j \notin \left[\frac{1}{\tau\, 2r\nu\rho^{i}},\ \frac{\tau}{2r\nu\rho^{i}}\right].$$
Recall from the definition of α-stability of a number that a random set of this size either has essentially no intersection with a given bucket or “concentrates over it” (i. e., with high probability, the probability mass contained in the intersection with this bucket is very close to its expected value). The above definition roughly captures the following. For any j, $\Delta_j$ is the number of buckets over which a random set of size $\beta_j$ concentrates. The last condition asks that the total probability mass from D1 (or D2 ) enclosed in the union of ℓ random sets of sizes $\beta_1, \dots, \beta_\ell$ be a multiplicative factor of τ away from the individual probability weight $1/(2rb\rho^{i})$ of a single element from any of the 2r buckets.
Proposition 3.8. Given that a vector of numbers of length ℓ is α-stable, it is (α, q²)-incomparable with respect to b with probability at least
$$1 - O\!\left(\frac{r\log q}{\log n}\right).$$

Proof. Fix any vector $(\beta_1, \dots, \beta_\ell)$. By the definition above, for each value b such that $(\beta_1, \dots, \beta_\ell)$ is α-stable, we have
$$\beta_j \cdot \frac{\alpha\rho^{2r}}{n} \le \frac{\rho^{\Delta_j}}{b} < \beta_j \cdot \frac{\alpha\rho^{2r+1}}{n}\,, \qquad j \in [\ell],$$
or, equivalently,
$$\frac{\log\frac{\alpha\beta_j}{n}}{\log\rho} + 2r + \frac{\log b}{\log\rho} \le \Delta_j < \frac{\log\frac{\alpha\beta_j}{n}}{\log\rho} + 2r + \frac{\log b}{\log\rho} + 1\,, \qquad j \in [\ell].$$
Writing
$$\lambda_j \stackrel{\mathrm{def}}{=} \frac{\log\frac{\alpha\beta_j}{n}}{\log\rho} + 2r$$
for j ∈ [ℓ], we obtain that
$$\sum_{j=1}^{\ell} \beta_j \Delta_j b = b\sum_{j=1}^{\ell} \beta_j(\lambda_j + O(1)) + \frac{b\log b}{\log\rho}\sum_{j=1}^{\ell} \beta_j\,. \tag{3.1}$$
• If it is the case that
$$\log\rho \cdot \sum_{j=1}^{\ell} \beta_j(\lambda_j + O(1)) \ll \log b \cdot \sum_{j=1}^{\ell} \beta_j\,,$$
then, for any fixed i ∈ [2r], to meet the second item of the definition of incomparability we need
$$\sum_{j=1}^{\ell} \beta_j \Delta_j b \notin \left[\frac{n}{200q\rho^{i}},\ \frac{200qn}{\rho^{i}}\right].$$
With equation (3.1) above, this is essentially requiring that
$$b\log b \notin \left[\frac{n\log\rho}{2q^{2}\rho^{i}\sum_{j=1}^{\ell}\beta_j},\ \frac{2q^{2} n\log\rho}{\rho^{i}\sum_{j=1}^{\ell}\beta_j}\right].$$
Recalling that $b\log b = k_b 2^{k_b}$, this means that $O(\log q/\log\log q)$ values of $k_b$ are to be ruled out. (Observe that this is the number of possible “bad values” for b without the condition from the case distinction above; since we add an extra constraint on b, there are at most this many values to avoid.)
• Conversely, if $\log\rho \cdot \sum_{j=1}^{\ell} \beta_j(\lambda_j + O(1)) \gg \log b \cdot \sum_{j=1}^{\ell} \beta_j$, the requirement becomes
$$b \notin \left[\frac{n\log\rho}{2q^{2}\rho^{i}\sum_{j=1}^{\ell}\beta_j(\lambda_j + O(1))},\ \frac{2q^{2} n\log\rho}{\rho^{i}\sum_{j=1}^{\ell}\beta_j(\lambda_j + O(1))}\right],$$
ruling out this time O(log q) values for $k_b$.
• Finally, the two terms are comparable only if
$$\log b = \Theta\!\left(\log\rho \cdot \sum_{j=1}^{\ell}\beta_j(\lambda_j + O(1)) \cdot \Big(\sum_{j=1}^{\ell}\beta_j\Big)^{-1}\right);$$
given that $\log b = k_b$, this rules out this time O(1) values for $k_b$.
A union bound over the 2r possible values of i, and the fact that kb can take Θ(log n) values, complete the
proof.
We put these together to obtain the following lemma.
Lemma 3.9. With probability at least
$$1 - O\!\left(\frac{2^{6q^{2}+q}(r\log\alpha + \varphi\log\rho) + 2^{6q^{2}}(r\log q)}{\log n}\right),$$
the following holds for the decision tree corresponding to a q-query algorithm:

• the size of each atom is α-stable and either large or small;

• the size of each atom, after excluding elements we have previously observed,8 is α-stable and either large or small;

• for each i, the vector $(k_i^A)_{A \in \mathrm{At}(A_1, \dots, A_i)}$ is (α, q²)-incomparable (with respect to b).

Proof. From Proposition 3.2, there are at most $2^{6q^{2}+1}$ tree nodes, each of which contains one vector $(k_i^A)_A$, and at most $2^q$ atom sizes. The first point follows from Proposition 3.4 and Proposition 3.6, applying the union bound over all $2^{6q^{2}+1} \cdot 2 \cdot 2^{q}$ sizes, where we note that the additional factor of 2 comes from either including or excluding the old elements. The latter point follows from Proposition 3.8, applying the union bound over all $2^{6q^{2}+1}$ nodes in the tree (each containing a single $k_i^A$ vector).
3.2.2 Key lemma: Bounding the variation distance between decision trees
In this section, we prove a key lemma on the variation distance between the distribution on leaves of
any decision tree, when given access to either an instance from Y or N. This lemma will in turn directly
yield Theorem 1.1. Hereafter, we set the parameters α (the threshold for stability), ϕ (the parameter for smallness and largeness) and γ (an accuracy parameter for how well things concentrate over their expected value) as follows:9 $\alpha \stackrel{\mathrm{def}}{=} q^{7}$, $\varphi \stackrel{\mathrm{def}}{=} q^{5/2}$ and $\gamma \stackrel{\mathrm{def}}{=} 1/\varphi = q^{-5/2}$. (Recall further that $q = \frac{1}{10}\sqrt{\log\log n}$.)
Lemma 3.10. Conditioned on the events of Lemma 3.9, consider the distribution over leaves of any
decision tree corresponding to a q-query adaptive algorithm when the algorithm is given a yes-instance,
and when it is given a no-instance. These two distributions have total variation distance o(1).
Proof. This proof is by induction on 1 ≤ i ≤ q. We will have three inductive hypotheses, E 1 (t), E 2 (t),
and E 3 (t). Assuming all three hold for all t < i, we prove E 1 (i). Additionally assuming E 1 (i), we prove
E 2 (i) and E 3 (i).
8 More precisely, we mean to say that for each i ≤ q, for every atom A defined by the partition of $(A_1, \dots, A_i)$, the values $k_i^A$ and $\big|A \setminus \{s_1^{(1)}, s_1^{(2)}, \dots, s_{i-1}^{(1)}, s_{i-1}^{(2)}\}\big| - k_i^A$ are α-stable and either large or small.
9 This choice of parameters is not completely arbitrary: combined with the setting of q, r and ρ, they ensure a total bound of o(1) on the variation distance and the probability of “bad events,” as well as a (relative) simplicity and symmetry in the relevant quantities.

Roughly, the first inductive hypothesis states that the query sets behave similarly to how they would if we had picked a random set of that size. It also implies that whether or not we get an element we have seen before is “obvious” based on past observations and the size of the query we perform. The second states that we
never observe two distinct elements from the same bucket-pair. The third states that the next sample is
distributed similarly in either a yes-instance or a no-instance. Note that this distribution includes both
features which our algorithm can observe (i. e., the atom which the sample belongs to and if it collides
with a previously seen sample), as well as those which it can not (i. e., which bucket-pair the observed
sample belongs to). It is necessary to show the latter, since the bucket-pair a sample belongs to may
determine the outcome of future queries.
More precisely, the three inductive hypotheses are as follows:
• $E_1(i)$: In either a yes-instance or a no-instance, the following occurs. For an atom S in the partition generated by $A_1, \dots, A_i$, let $S' = S \setminus \{s_1^{(1)}, s_1^{(2)}, \dots, s_{i-1}^{(1)}, s_{i-1}^{(2)}\}$. For every such $S'$, let $\ell^{S'}$ be the largest index $\ell \in \{0, \dots, 2r\}$ such that $|S'|b\rho^{\ell}/n \le 1/\alpha$, or 0 if no such ℓ exists. We claim that $\ell^{S'} \in \{0, \dots, 2r - \varphi - 2\} \cup \{2r\}$, and say $S'$ is small if $\ell^{S'} = 2r$ and large otherwise. Additionally:

– for $j \le \ell^{S'}$, $S' \cap B_j = \emptyset$;

– for $j > \ell^{S'}$, $|S' \cap B_j|$ lies in $[1 - i\gamma,\ 1 + i\gamma] \cdot \frac{|S'|b\rho^{j}}{n}$.

Furthermore, let $p_1$ and $p_2$ be the probability mass contained in $\Lambda_i$ and $\Gamma_i$, respectively. Then
$$\frac{p_1}{p_1 + p_2} \le O\!\left(\frac{1}{q^{2}}\right) \quad\text{or}\quad \frac{p_2}{p_1 + p_2} \le O\!\left(\frac{1}{q^{2}}\right)$$
(that is, either almost all the probability mass comes from elements which we have not yet observed, or almost all of it comes from previously seen ones).

• $E_2(i)$: No two elements from the set $\{s_1^{(1)}, s_1^{(2)}, \dots, s_i^{(1)}, s_i^{(2)}\}$ belong to the same bucket-pair.

• $E_3(i)$: Let $T_i^{\mathrm{yes}}$ be the random variable representing the atoms and bucket-pairs10 containing $(s_i^{(1)}, s_i^{(2)})$, as well as which of the previous samples they intersect with, when the i-th query is performed on a yes-instance, and define $T_i^{\mathrm{no}}$ similarly for no-instances. Then
$$d_{\mathrm{TV}}\big(T_i^{\mathrm{yes}}, T_i^{\mathrm{no}}\big) \le O\!\left(\frac{1}{q^{2}} + \frac{1}{\rho} + \gamma + \frac{1}{\varphi}\right) = o(1)\,.$$
We will show that E1(i) holds with probability 1 − O(2^i exp(−2γ^2 α/3)) and E2(i) holds with probability 1 − O(i/ϕ). Let T^yes be the random variable representing the q-configuration and the bucket-pairs containing each of the observed samples in a yes-instance, and define T^no similarly for a no-instance. We note that this random variable determines which leaf of the decision tree we reach, which E3(q) bounds. We can then take a union bound over all i ∈ [q] to upper bound the probability that E1(i) and E2(i) do not hold, and use E3(i) and the coupling interpretation of total variation distance to upper bound the probability that Ti^yes and Ti^no ever differ. Any of these “failure” events happens with probability at most
O( 2^q exp(−2γ^2 α/3) + q^2/ϕ + 1/q + q/ρ + qγ + q/ϕ ) = o(1)
(from our choice of α, γ, ϕ). This upper bounds the total variation distance between T^yes and T^no, giving the desired result.
^10 If a sample si^(k) does not belong to any bucket (i. e., if the corresponding i-th query did not intersect the support), it is marked in Ti^yes with a “dummy label” to indicate so.
Theory of Computing, Volume 14 (19), 2018, pp. 1–46
Jayadev Acharya, Clément L. Canonne, and Gautam Kamath
We proceed with the inductive proofs of E 1 (i), E 2 (i), and E 3 (i), noting that the base cases hold
trivially for all three of these statements. Throughout this proof, recall that Λi is the set of unseen support
elements which we query, and Γi is the set of previously seen support elements which we query.
Lemma 3.11. Assume that E1(t), E2(t), E3(t) hold for all 1 ≤ t ≤ i − 1. Then E1(i) holds with probability at least
1 − O(2^i exp(−2γ^2 α/3)) = 1 − 2^{i − Ω(q^2)}.
Proof. We start with the first part of E1(i) (the statement prior to “Furthermore”). Let S (and the corresponding S′) be any atom as in E1(i). First, we note that ℓ^{S′} ∈ {0, . . . , 2r − ϕ − 2} ∪ {2r}, since we are conditioning on Lemma 3.9: |S′| is α-stable and either large or small, which enforces this condition.
Next, suppose S′ is contained in some other atom T generated by A1, . . . , A_{i−1}, and let
T′ = T \ {s1^(1), s1^(2), . . . , s_{i−1}^(1), s_{i−1}^(2)}.
Since |S′| ≤ |T′|, this implies that ℓ^{T′} ≤ ℓ^{S′}. We argue about |T′ ∩ Bj| for three regimes of j:
• The first case is j ≤ ℓ^{T′}. By the inductive hypothesis, |T′ ∩ Bj| = 0, so |S′ ∩ Bj| = 0 with probability 1.
• The next case is ℓ^{T′} < j ≤ ℓ^{S′}. Recall from the definition of a core adaptive tester that S′ will be chosen uniformly at random from all subsets of T′ of appropriate size. By the inductive hypothesis,
|T′ ∩ Bj| / |T′| ∈ [1 − (i − 1)γ, 1 + (i − 1)γ] · b ρ^j / n,
and therefore
E[|S′ ∩ Bj|] ∈ [1 − (i − 1)γ, 1 + (i − 1)γ] · |S′| b ρ^j / n, implying E[|S′ ∩ Bj|] ≤ 2 / (α ρ^{ℓ^{S′} − j}),
where the inequality is by the definition of ℓ^{S′} and the fact that (i − 1)γ ≤ 1. Using a Chernoff bound for hypergeometric random variables (Lemma 2.7), and writing μ := E[|S′ ∩ Bj|] for conciseness,
Pr[ |S′ ∩ Bj| ≥ 1 ] = Pr[ |S′ ∩ Bj| ≥ (1 + (1 − μ)/μ) · μ ] ≤ exp(−(1 − μ)^2 / (3μ)) ≤ exp(−(1/12) α ρ^{ℓ^{S′} − j}),
where the second inequality holds because μ ≤ 2/(α ρ^{ℓ^{S′} − j}) and (1 − μ)^2 ≥ 1/2 for n sufficiently large.
• The final case is j > ℓ^{S′}. As in the previous one,
E[|S′ ∩ Bj|] ∈ [1 − (i − 1)γ, 1 + (i − 1)γ] · |S′| b ρ^j / n, implying E[|S′ ∩ Bj|] ≥ α ρ^{j − ℓ^{S′} − 1} / 2,
where the inequality is by the definition of ℓ^{S′}, α-stability, and the fact that (i − 1)γ ≤ 1/2. Using again a Chernoff bound for hypergeometric random variables (Lemma 2.7),
Pr[ | |S′ ∩ Bj| − E[|S′ ∩ Bj|] | ≥ γ · |S′| b ρ^j / n ] ≤ Pr[ | |S′ ∩ Bj| − E[|S′ ∩ Bj|] | ≥ (γ/2) · E[|S′ ∩ Bj|] ] ≤ 2 exp(−(2γ)^2 E[|S′ ∩ Bj|] / 3) ≤ 2 exp(−(2/3) γ^2 α ρ^{j − ℓ^{S′} − 1}),
where the first inequality comes from 2(1 − (i − 1)γ) ≥ 1, the second is from the Chernoff bound, and the third follows from E[|S′ ∩ Bj|] ≥ α ρ^{j − ℓ^{S′} − 1} / 2.
Since we wish to prove the statement for all buckets Bj simultaneously, we take a union bound over all j. Recall that we want to bound the probability that S′ satisfies both conditions: for j ≤ ℓ^{S′}, |S′ ∩ Bj| = 0; and for j > ℓ^{S′}, |S′ ∩ Bj| lies in
[1 − iγ, 1 + iγ] · |S′| b ρ^j / n.
Using bounds from all three of these regimes of j, and a union bound, the probability that S′ does not satisfy the conditions of E1(i) is at most
∑_{j ≤ ℓ^{T′}} 0 + ∑_{ℓ^{T′} < j ≤ ℓ^{S′}} exp(−(1/12) α ρ^{ℓ^{S′} − j}) + ∑_{j > ℓ^{S′}} 2 exp(−(2/3) γ^2 α ρ^{j − ℓ^{S′} − 1}).
This probability is maximized when ℓ^{S′} = ℓ^{T′} = 0, in which case it is
∑_{j=1}^{2r} 2 exp(−(2/3) γ^2 α ρ^{j−1}) ≤ ∑_{j=1}^{∞} 2 exp(−(2/3) γ^2 α ρ^{j−1}) ≤ 3 exp(−(2/3) γ^2 α).
Taking a union bound over at most 2^i sets gives us the desired probability bound.
Finally, we prove the remainder of E1(i) (the statement following “Furthermore”); this will follow from the definition of incomparability (Definition 3.7).
• First, we focus on Γi. Suppose that Γi contains at least one element with positive probability mass (if not, the statement trivially holds). Let p′2 be the probability mass of the heaviest element in Γi. Since our inductive hypothesis implies that Γi has no two elements in the same bucket-pair, the maximum possible value for p2 is
p2 ≤ p′2 + 3p′2/ρ + 3p′2/ρ^3 + · · · ≤ p′2 + (3p′2/ρ) ∑_{k=0}^{∞} 1/ρ^{2k} = (1 + (3/ρ) · ρ^2/(ρ^2 − 1)) · p′2 ≤ (1 + o(1)) p′2.
Therefore, p2 ∈ [p′2, (1 + o(1)) p′2]. Supposing this heaviest element belongs to bucket j, we can say that
p2 ∈ [1/2, (3/2)(1 + o(1))] · 1/(2 r b ρ^j).
• Next, we focus on Λi. Consider some atom A, from which we selected k_A^i elements which have not been previously observed; call the set of these elements A′. In the first part of this proof, we showed that for each bucket Bk, either |A′ ∩ Bk| = 0 or |A′ ∩ Bk| ∈ [1 − iγ, 1 + iγ] · |A′| b ρ^k / n. In the latter case, noting that iγ ≤ 1/2 and that the probability of an individual element in Bk is within
[1, 3] · 1/(4 r b ρ^k),
the probability mass contained in A′ ∩ Bk belongs to
[1, 9] · |A′|/(8 r n).
Recalling the definition of ∆A as stated in Definition 3.7, as shown earlier in this proof, this non-empty intersection happens for exactly ∆A buckets. Therefore, the total probability mass in Λi is in the interval
[1/4, 9/4] · (1/(2 r n)) · ∑_{A ∈ At(A1, . . . , Ai)} k_A^i ∆A.
Recall that we are conditioning on Lemma 3.9, which states that the vector (k_A^i)_{A ∈ At(A1, . . . , Ai)} is (α, q^2)-incomparable with respect to b. Applying this definition to the bounds just obtained on the probability masses in Λi and Γi gives Lemma 3.11.
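As a quick numerical sanity check of the geometric-series step above (our own illustration, not part of the paper's argument), the factor 1 + (3/ρ) · ρ^2/(ρ^2 − 1) multiplying p′2 can be computed directly, and it indeed approaches 1 as ρ grows:

```python
# Sanity check (ours, not the paper's): the factor multiplying p2' in
# p2 <= (1 + (3/rho) * rho^2 / (rho^2 - 1)) * p2' tends to 1 as rho grows.

def mass_factor(rho: float, terms: int = 100) -> float:
    """Partial sum of 1 + 3/rho + 3/rho^3 + 3/rho^5 + ... (geometric tail)."""
    return 1.0 + sum(3.0 / rho ** (2 * k + 1) for k in range(terms))

def closed_form(rho: float) -> float:
    """Closed form 1 + (3/rho) * rho^2 / (rho^2 - 1) from the text."""
    return 1.0 + (3.0 / rho) * rho ** 2 / (rho ** 2 - 1.0)
```

Already at ρ = 1000 the factor is below 1.004, consistent with the (1 + o(1)) claim when ρ grows with q.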
Lemma 3.12. Assume that E 1 (t), E 2 (t), E 3 (t) hold for all 1 ≤ t ≤ i − 1, and additionally E 1 (i) holds.
Then E 2 (i) holds with probability at least 1 − O(i/ϕ).
Proof. We focus on si^(1). If si^(1) ∈ Γi, the conclusion is trivial, so suppose si^(1) ∈ Λi. From E1(i), no small atom intersects any of the buckets, so let us condition on the fact that si^(1) belongs to some large atom S. Since we want si^(1) to fall in a bucket-pair distinct from those of the 2(i − 1) + 1 other samples, there are at most 2i − 1 bucket-pairs which si^(1) should not land in. Using E1(i), the maximum probability mass contained in the intersection of these bucket-pairs and S is (1 + iγ)(2i − 1)|S|/rn. Similarly, using the definition of a large atom, the minimum probability mass contained in S is (1 − iγ)ϕ|S|/rn. Taking the ratio of these
two terms gives an upper bound on the probability of breaking this invariant, conditioned on landing in S, of O(i/ϕ), where we note that
(1 + iγ)/(1 − iγ) = O(1).
Since the choice of large atom was arbitrary, we can remove the conditioning. Taking a union bound for si^(1) and si^(2) gives the result.
Lemma 3.13. Assume that E 1 (t), E 2 (t), E 3 (t) hold for all 1 ≤ t ≤ i − 1, and additionally E 1 (i) holds.
Then E 3 (i) holds.
Proof. We fix some setting of the intersection history, i. e., the configuration and the bucket-pairs the
past elements belong to, and show that the results of the next query will behave similarly, whether the
instance is a yes-instance or a no-instance. We note that, since we are assuming the inductive hypotheses
(1)
hold, certain settings which violate these hypotheses are not allowed. We also note that si is distributed
def (2)
identically in both instances, so we focus on si = si for the remainder of this proof.
First, we condition that, based on the setting
of the past history, si will either come from Λi or Γi —this
event happens with probability 1 − O 1/q2 .
Proposition 3.14. In either a yes-instance or a no-instance, si will either come from Λi with probability 1 − O(1/q^2), or from Γi with probability 1 − O(1/q^2), where the choice of which one is deterministic based on the fixed configuration and choice for the bucket-pairs of previously seen elements.
Proof. This is simply a rephrasing of the portion of E 1 (i) following “Furthermore.”
We now bound the total variation distance between Ti^yes and Ti^no conditioned on this event. In the case when it does not hold, we trivially bound the total variation distance by 1, incurring a cost of O(1/q^2) to the total variation distance between the unconditioned variables. Since our target for this quantity was O(1/q^2 + 1/ρ + γ + 1/ϕ), it remains to show, in the conditioned space, that the total variation distance in either case is at most O(1/ρ + γ + 1/ϕ) = O(1/q^{5/2}). We break this into two cases, the first being when si comes from Γi. In this case, we incur a cost in total variation distance of O(1/ρ):
Proposition 3.15. In either a yes-instance or a no-instance, condition on si coming from Γi. Then one of the following holds:
• |Γi ∩ Bj| = 0 for all j ∈ [2r], in which case si is distributed uniformly at random among the elements of Γi;
• or |Γi ∩ Bj| ≠ 0 for some j ∈ [2r], in which case si will be equal to some s ∈ Γi with probability 1 − O(1/ρ), where the choice of s is deterministic based on the fixed configuration and choice for the bucket-pairs of previously seen elements.
Proof. The former case follows from the definition of the sampling model. For the latter case, let p be the probability mass of the heaviest element in Γi. Since our inductive hypothesis implies that Γi has no two elements in the same bucket-pair, the maximum possible value for the rest of the elements is
3p/ρ + 3p/ρ^3 + 3p/ρ^5 + · · · ≤ (3p/ρ) ∑_{k=0}^{∞} 1/ρ^{2k} = (3p/ρ) · ρ^2/(ρ^2 − 1) = O(p/ρ).
Since the ratio of this value and p is O(1/ρ), with probability 1 − O(1/ρ) the sample returned is the heaviest element in Γi.
Finally, we examine the case when si comes from Λi:
Proposition 3.16. Condition on si coming from Λi. Then either:
• |Λi ∩ Bj| = 0 for all j ∈ [2r], in which case dTV(Ti^yes, Ti^no) = 0;
• or |Λi ∩ Bj| ≠ 0 for some j ∈ [2r], in which case dTV(Ti^yes, Ti^no) ≤ O(γ + 1/ϕ) = O(1/q^{5/2}).
Proof. The former case follows from the definition of the sampling model—since Λi does not intersect
any of the buckets, the sample will be labeled as such. Furthermore, the sample returned will be drawn
uniformly at random from Λi , and the probability of each atom will be proportional to the cardinality of
its intersection with Λi , in both the yes and the no-instances.
We next turn to the latter case. Let X be the event that, whenever the intersection of Λi with some atom A is non-empty in an odd number of buckets, si does not come from the unpaired bucket.
Note that E 1 (i) and the definition of a large atom imply that an unpaired bucket can only occur if the
atom intersects at least ϕ bucket-pairs: conditioned on the sample coming from a particular atom, the
probability that it comes from the unpaired bucket is O(1/ϕ). Since the choice of A was arbitrary, we
may remove the conditioning, and note that Pr(X) = 1 − O(1/ϕ).
Since
dTV(Ti^yes, Ti^no) ≤ dTV(Ti^yes, Ti^no | X) Pr(X) + dTV(Ti^yes, Ti^no | X̄) Pr(X̄) ≤ dTV(Ti^yes, Ti^no | X) + O(1/ϕ),   (3.2)
it remains to show that dTV(Ti^yes, Ti^no | X) ≤ O(γ).
First, we focus on the distribution over atoms, conditioned on X. Let N^A be the number of bucket-pairs with which A intersects both buckets, i. e., conditioned on X, the sample could come from 2N^A buckets, and let
N := ∑_{A ∈ At(A1, . . . , Ai)} N^A.
By E1(i), the maximum amount of probability mass that can be assigned to atom A is
((1 + γ)|S|N^A/rn) / ((1 − γ)|S|N/rn),
and the minimum is
((1 − γ)|S|N^A/rn) / ((1 + γ)|S|N/rn),
so the total variation distance in the distribution incurred by this atom is at most O(γ N^A/N). Summing over all atoms, we get the desired result of O(γ).
Finally, we bound the distance on the distribution over bucket-pairs, again conditioned on X. By E1(i), only large atoms will contain non-zero probability mass, so condition on the sample coming from some large atom A. Let N^A be the number of bucket-pairs with which A intersects both buckets, i. e., conditioned on X, the sample could come from 2N^A buckets. Using E1(i), the maximum amount of probability mass that can be assigned to any intersecting bucket-pair is
(1 + γ)(|A|/rn) · ((1 − γ)(|A|/rn) N^A)^{−1},
and the minimum is
(1 − γ)(|A|/rn) · ((1 + γ)(|A|/rn) N^A)^{−1},
so the total variation distance in the distribution incurred by this bucket-pair is at most O(γ/N^A). Summing this difference over all N^A bucket-pairs, we get 2γ/(1 − γ^2) = O(γ). Since the choice of large atom A
was arbitrary, we can remove the conditioning on the choice of atom. The statement follows by applying
the union bound on the distribution over bucket-pairs and the distribution over atoms. This concludes the
proof of Proposition 3.16.
We note that in both cases, the cost in total variation distance which is incurred is O(1/ρ + γ + 1/ϕ),
which implies E 3 (i)—proving Lemma 3.13.
This concludes the proof of Lemma 3.10.
With Lemma 3.10 in hand, the proof of the main theorem is straightforward.
Proof of Theorem 1.1. Conditioned on Lemma 3.9, Lemma 3.10 implies that the total variation distance between the distributions over leaves in a yes-instance and in a no-instance is o(1). Since an algorithm's choice to accept or reject depends deterministically on which leaf is reached, this bounds the difference between the conditional probabilities of reaching a leaf which accepts. Since Lemma 3.9 occurs with probability 1 − o(1), the difference between the unconditional probabilities is also o(1).
4 Lower bounds for non-adaptive algorithms
In this section, we prove our lower bounds for non-adaptive support-size estimation and uniformity
testing, restated here.
Theorem 1.2 (Non-Adaptive Support-Size Estimation). Any non-adaptive algorithm which, given COND access to an unknown distribution D on [n], estimates the size of its support up to a factor γ ≥ √2 must have query complexity Ω((log n)/log^2 γ).
Theorem 1.3 (Non-Adaptive Uniformity Testing). Any non-adaptive algorithm which, given COND
access to an unknown distribution D on [n] and parameter ε ∈ (0, 1/4), distinguishes with probability at
least 2/3 between (a) D = U[n] and (b) dTV (D, U[n] ) ≥ ε, must have query complexity Ω((log n)/ε).
These two theorems follow from the same argument, which we outline below before turning to the proof itself. (Note that we will, in this section, establish Theorem 1.3 for constant ε, i. e., ε = 1/4, before explaining in Section 4.1 how to derive the general statement as a corollary.)
Structure of the proof. By Yao’s Minimax Principle, we consider deterministic tests and study their
performance over random distributions, chosen to be uniform over a random subset of carefully picked
size. The proof of Theorem 1.2 then proceeds in 3 steps: in Lemma 4.2, we first argue that all queries
made by the deterministic tester will (with high probability over the choice of the support size s) behave
very “nicely” with regard to s, i. e., not be concentrated around it. Then, we condition on this to bound
the total variation distance between the sequence of samples obtained in the two cases we “oppose,” a
random distribution from a family D1 and the corresponding one from a family D2. In Lemma 4.4 we show that the part of the total variation distance due to samples from the small queries is zero, except with probability o(1) over the choice of s. Similarly, Lemma 4.5 allows us to say (comparing both cases to a third “reference” case, an honest-to-goodness uniform distribution over the whole domain, and applying a triangle inequality) that the remaining part of the total variation distance due to samples from the big queries is small as well, except again with probability o(1). Combining these three lets us conclude by properties of the total variation distance, as (since the queries are non-adaptive) the distribution over the sequence of samples is a product distribution. (Moreover, applying Lemma 4.4 as a stand-alone enables us, with little additional work, to obtain Theorem 1.3, as our argument in particular implies that distributions from D1 are hard to distinguish from uniform.)
The families D1 and D2. Fix γ ≥ √2 as in Theorem 1.2; writing β := γ^2, we define the set
S := { β^k n^{1/4} : 0 ≤ k ≤ (log n)/(2 log β) } = { n^{1/4}, β n^{1/4}, β^2 n^{1/4}, . . . , n^{3/4} }.   (4.1)
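For concreteness, the set S of equation (4.1) can be generated directly (a sketch of ours with illustrative parameters; the function name is not from the paper):

```python
def support_sizes(n: int, gamma: float) -> list:
    """The set S = {beta^k * n^(1/4) : 0 <= k <= (log n)/(2 log beta)} of
    equation (4.1), with beta = gamma^2; it ranges from n^(1/4) to n^(3/4)."""
    beta = gamma ** 2
    sizes = []
    size = n ** 0.25
    while size <= n ** 0.75 * (1 + 1e-9):  # tolerance guards float rounding
        sizes.append(size)
        size *= beta
    return sizes
```

For example, n = 2^16 and γ = √2 (so β = 2) give S = {16, 32, 64, . . . , 4096}, ranging from n^{1/4} to n^{3/4} as claimed.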
A no-instance (D1 , D2 ) ∈ D1 × D2 is then obtained by the following process:
• Draw s uniformly at random from S.
• Pick a uniformly random set S1 ⊆ [n] of size s, and set D1 to be uniform on S1 .
• Pick a uniformly random set S2 ⊆ [n] of size β s, and set D2 to be uniform on S2 .
(Similarly, a yes-instance is obtained by first drawing a no-instance (D1 , D2 ), and discarding D2 to keep
only (D1 , D1 ) ∈ D1 × D1 .)
We will argue that no algorithm can distinguish with high probability between the cases (D1 , D2 ) ∼
D1 × D2 and (D1 , D2 ) ∼ D1 × D1 , by showing that in both cases D1 and D2 generate transcripts indis-
tinguishable from those the uniform distribution U[n] would yield. This will imply Theorem 1.2, as any
algorithm to estimate the support within a multiplicative γ would imply a distinguisher between instances
of the form (D1 , D1 ) and (D1 , D2 ) (indeed, the support sizes of D1 and D2 differ by a factor β = γ 2 ).
(As for Theorem 1.3, observe that any distribution D1 ∈ D1 has constant distance from the uniform
distribution on [n], so that a uniformity tester must be able to tell D1 apart from U[n] .)
Small and big query sets. Let T be any deterministic non-adaptive algorithm making
qT ≤ q = (1/80000) · (log n)/log^2 β
queries. Without loss of generality, we can assume T makes exactly q queries, and denote them by A1, . . . , Aq ⊆ [n]. Moreover, we let ai = |Ai| and (again without loss of generality) write a1 ≥ · · · ≥ aq.
As a preliminary observation, note that for any A ⊆ [n] and 0 ≤ s ≤ n we have
E_S[|S ∩ A|] = |A| s / n,
where the expectation is over a uniform choice of S among all (n choose s) subsets of size s. This observation will lead us to divide the query sets Ai into two groups, depending on the expected size of their intersection with the (random) support.
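The identity E_S[|S ∩ A|] = |A|s/n is just linearity of expectation (each point lies in S with probability s/n); a brute-force exact check over all size-s subsets, with small illustrative parameters of our choosing:

```python
from fractions import Fraction
from itertools import combinations

def expected_intersection(n: int, s: int, A: set) -> Fraction:
    """Exact E_S[|S ∩ A|] over a uniformly random size-s subset S of {0, ..., n-1}."""
    subsets = list(combinations(range(n), s))
    total = sum(len(A.intersection(S)) for S in subsets)
    return Fraction(total, len(subsets))
```

For n = 8, s = 3 and |A| = 3 this returns exactly 3 · 3/8 = 9/8.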
With this in mind, the following definition will be crucial to our proof. Intuitively, it captures the
distribution of sizes of intersection of various query sets with the randomly chosen set S.
Definition 4.1. Let s ≥ 1, and A = {a1, . . . , aq} be any set of q integers. For a real number t > 0, define
C_t(s) := |{ i ∈ [q] : ai s/n ∈ [β^{−t}, β^t] }|   (4.2)
to be the number of t-hit points of A (for s).
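Definition 4.1 translates directly into code (a sketch of ours; names and example values are illustrative):

```python
def C_t(t: float, s: int, sizes: list, n: int, beta: float) -> int:
    """Number of query sizes a_i with a_i * s / n inside [beta^-t, beta^t],
    following equation (4.2)."""
    low, high = beta ** (-t), beta ** t
    return sum(1 for a in sizes if low <= a * s / n <= high)
```

For instance, with n = 1024, s = 16, β = 2 and query sizes (512, 64, 8), the scaled values a_i s/n are 8, 1 and 0.125, so C_1 = 1 while C_3 = 3.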
The next result will be crucial to prove our lower bounds: it roughly states that if we consider the ai and
scale them by the random quantity s/n, then the distribution of the random variable generated has an
exponential tail with respect to t.
Lemma 4.2 (Hitting Lemma). Fix A as in the previous definition. If s is drawn uniformly at random from S, then with probability at least 99/100,
sup_{t>0} C_t(s)/t < 2/100.   (4.3)
Proof. Without loss of generality, assume that the ai are in decreasing order. We will work in the logarithmic domain: a number ai contributes to C_t(s) if and only if log s ∈ [log(n/ai) − t log β, log(n/ai) + t log β]. Indeed, we can restate the lemma in an additive form. Let A = {α1, . . . , αm} be any set of numbers in [0, log n]; these are defined as transformations of the ai: αi := log(n/ai). In the additive restatement, x will play the role of log s, or equivalently, s = 2^x. For a point x ∈ [0, log n], let ℓ_j^x and r_j^x denote the distance of x from the j-th point to its left and right, respectively, in the set A. More precisely, if α_γ ≤ x ≤ α_{γ+1}, then ℓ_j^x := x − α_{γ+1−j}. If we consider only points to the left of x, ℓ_j^x / log β is the least value of t such that C_t(2^x) = j. Therefore, if
t_j^x := min{ℓ_j^x, r_j^x} / log β,
then we are guaranteed that j ≤ C_{t_j^x}(2^x) ≤ 2j.
Observe that C_t(2^x) is a piecewise-constant function which is monotone non-decreasing in t. Therefore, the supremum of C_t(2^x)/t is attained at one of its discontinuities:
sup_{t>0} C_t(2^x)/t = max_{t : ∀δ>0, C_{t−δ}(2^x) < C_t(2^x)} C_t(2^x)/t.
We can in turn upper bound this by looking at the set of all t_j^x. Note that this may ignore some discontinuity points: for instance, suppose that ℓ_j^x < r_j^x; then r_j^x/log β will not be considered. However, we note that in this case C_{r_j^x/log β}(2^x) ≤ 2C_{ℓ_j^x/log β}(2^x) = 2C_{t_j^x}(2^x):
C_{r_j^x/log β}(2^x) / (r_j^x/log β) ≤ C_{r_j^x/log β}(2^x) / (ℓ_j^x/log β) ≤ 2C_{ℓ_j^x/log β}(2^x) / (ℓ_j^x/log β) = 2C_{t_j^x}(2^x) / t_j^x.
Therefore,
sup_{t>0} C_t(2^x)/t ≤ max_{{t_j^x : j ∈ [q]}} 2C_{t_j^x}(2^x)/t_j^x ≤ max_{{t_j^x : j ∈ [q]}} 4j/t_j^x = max_{{t_j^x : j ∈ [q]}} 4j log β / min{ℓ_j^x, r_j^x}.
We would like to upper bound this term by 2/100. Equivalently, we satisfy the conditions of the lemma if
min_j t_j^x/j ≥ 200 log β.
For a constant c > 0, let S_c be the set of all points x (where, recall, s = 2^x is selected according to the distribution over S) that violate this inequality for c:
min_j t_j^x/j ≤ c.
We would like to upper bound by 1/100 the probability that a randomly selected x violates this inequality for c = 200 log β. Equivalently, since x is selected uniformly at random among (log n)/(2 log β) different values, we would like to upper bound the size of S_{200 log β} by (log n)/(200 log β). We do this with the following claim.
Claim 4.3. |S_c| ≤ 2cq.
Substituting c = 200 log β and q = (log n)/(80000 log^2 β) will give the desired result.
Proof of Claim 4.3. We consider the set of points S_{c,ℓ} ⊆ S_c that satisfy ℓ_j^x/j < c for some j, and show that their measure is at most cq. An identical bound holds for the set of points of S_c for which r_j^x/j < c. Let S_{c,ℓ}^i ⊆ S_{c,ℓ} be the set of points in S_{c,ℓ} that satisfy min_j{ℓ_j^x/j} < c with respect to the set {α1, . . . , αi}. We will show by induction that |S_{c,ℓ}^i| < ci.
For the first point α1, the set S_{c,ℓ}^1 = [α1, α1 + c]. Suppose by induction that |S_{c,ℓ}^i| < ci. Let x_i be the right-most point in the set S_{c,ℓ}^i. Then it is clear that x_i > α_i; in fact, x_i ≥ α_i + c. Furthermore, either x_i = log n, or ℓ_j^{x_i}/j = c for some j. Moreover, we claim that [α_i, x_i] ⊆ S_{c,ℓ}^i. Indeed, for the same j for which ℓ_j^{x_i}/j ≤ c, all points in [α_i, x_i] satisfy the condition. If x_i = log n, then the result holds trivially. We therefore consider the point α_{i+1} and prove the inductive step for x_i < log n. There are two cases:
If α_{i+1} ≥ x_i: In this case, S_{c,ℓ}^{i+1} = S_{c,ℓ}^i ∪ [α_{i+1}, x_{i+1}]. We have to show that x_{i+1} ≤ α_{i+1} + c. Suppose to the contrary that x_{i+1} > α_{i+1} + c ≥ x_i + c. Then there is a point α_h, for h ≤ i, such that
(x_{i+1} − α_h)/(i + 2 − h) < c, and then (α_{i+1} + c − α_h)/(i + 2 − h) < c,
so that
(α_{i+1} − α_h)/(i + 1 − h) < c;
however, this implies that α_{i+1} ∈ S_{c,ℓ}^i, contradicting the assumption of this case.
If α_{i+1} < x_i: In this case, S_{c,ℓ}^{i+1} = S_{c,ℓ}^i ∪ [x_i, x_{i+1}]. We have to show that x_{i+1} ≤ x_i + c. Suppose on the contrary that x_{i+1} > x_i + c > α_{i+1} + c. Let h be the index such that
(x_{i+1} − α_h)/(i + 2 − h) < c, and therefore (x_i + c − α_h)/(i + 2 − h) < c,
implying
(x_i − α_h)/(i + 1 − h) < c,
contradicting that x_i is the rightmost point of S_{c,ℓ}^i.
This concludes the proof of the hitting lemma.
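A key observation in the proof, namely that C_t(2^x) is a non-decreasing step function whose supremum of C_t/t is attained at a discontinuity, can be illustrated as follows (a sketch of ours with β = 2 and hypothetical values of the ratios a_i s/n; the candidate points are t_i = |log_2(a_i s/n)|):

```python
import math

def hits(t: float, ratios: list) -> int:
    """C_t for precomputed ratios a_i * s / n, window [2^-t, 2^t] (beta = 2)."""
    return sum(1 for r in ratios if 2 ** (-t) <= r <= 2 ** t)

def sup_hits_over_t(ratios: list) -> float:
    """sup_{t > 0} C_t / t, evaluated just past each discontinuity |log2(ratio)|."""
    eps = 1e-9
    candidates = sorted({abs(math.log2(r)) for r in ratios if r > 0})
    return max(hits(t + eps, ratios) / (t + eps) for t in candidates if t > 0)

ratios = [8.0, 1.5, 0.25, 0.01]  # hypothetical values of a_i * s / n
```

A brute-force scan over a fine grid of t values never exceeds the value computed at the discontinuities, since C_t/t is decreasing on each constant piece of C_t.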
We proceed to show how to use this lemma to bound the contribution of the various types of queries to the distinguishability of D1 and D2. In particular, we will apply Lemma 4.2 to the set of query sizes {a1, . . . , aq}.
Recall that the ai are non-increasing. If a1 s/n ≤ 1, let q0 := 0; otherwise, define q0 as the largest integer such that a_{q0} s/n > 1. We partition the queries made by T in two: A1, . . . , A_{q0} are said to be big, while A_{q0+1}, . . . , Aq are small queries.
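The big/small split can be computed directly from the sorted query sizes (a sketch of ours; the handling of the all-small case follows the convention above):

```python
def big_query_count(sizes: list, s: int, n: int) -> int:
    """q0 such that sizes[:q0] are the big queries (a_i * s / n > 1) and
    sizes[q0:] are small. Assumes sizes sorted in non-increasing order."""
    q0 = 0
    for a in sizes:
        if a * s / n > 1:
            q0 += 1
        else:
            break
    return q0
```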
Lemma 4.4. With probability at least 1 − 2−10 , a random distribution from D1 or from D2 does not
intersect with any small query.
Proof. Let s be the random size drawn in the definition of the instances. We first claim that
E[|A_{q0+j} ∩ S|] ≤ β^{−50j}
for all j ≥ 1, where the expectation is over the uniform choice of the set S1 for D1. Indeed, suppose by contradiction that there is a j ≥ 1 such that
E[|A_{q0+j} ∩ S|] = a_{q0+j} s/n > β^{−50j}.
By definition of q0, for 1 ≤ i ≤ j,
1 ≥ a_{q0+i} s/n > β^{−50j}.
Therefore, the queries A_{q0}, A_{q0+1}, . . . , A_{q0+j} contribute to C_{50j}, and we obtain
C_{50j}/(50j) ≥ j/(50j) = 2/100,
contradicting Lemma 4.2. Thus, the expected intersection can be bounded as follows:
E[|(A_{q0+1} ∪ A_{q0+2} ∪ · · · ∪ Aq) ∩ S|] ≤ E[|A_{q0+1} ∩ S|] + E[|A_{q0+2} ∩ S|] + · · · + E[|Aq ∩ S|] ≤ β^{−50} + β^{−100} + · · · ≤ 2^{−12},
since β ≥ 2. From this, we obtain that the result holds for D1 by Markov's inequality. The same applies to D2, with probability of intersection at most 2^{−10}, proving the lemma.
We now turn our attention to the sets with large intersections. We will show that under D1 and D2, the results of querying the sets A1, . . . , A_{q0} are indistinguishable from simply picking elements uniformly from the sets A1, . . . , A_{q0}. More precisely, we establish the following.
Lemma 4.5. Let η* = 2^{−10} and ηs = 1/100, and fix ℓ ∈ {1, 2}. At least a 1 − ηs fraction of the tuples (s1, . . . , s_{q0}) ∈ A1 × A2 × · · · × A_{q0} satisfy
Pr_ℓ(s1, . . . , s_{q0}) ∈ [1 − 5η*, 1 + 5η*] · 1/(|A1| · · · |A_{q0}|),   (4.4)
where Pr_ℓ(s1, . . . , s_{q0}) denotes the probability that s1, . . . , s_{q0} are the results of the queries A1, . . . , A_{q0} under COND_{Dℓ}.
As this holds for most distributions in both D1 and D2, this implies that in either case the queries are indistinguishable from the product distribution over A1 × A2 × · · · × A_{q0} (which is the one induced by the same queries on the uniform distribution over [n]), with probability at least 1 − ηs − 5η*.
Proof of Lemma 4.5. From standard Chernoff-like concentration bounds for hypergeometric random variables (Lemma 2.7), we obtain the claim below.
Claim 4.6. Suppose A is a set of size a, and S is a uniformly chosen random set of size s. Then, for all η ∈ (0, 1], we have
Pr[ |A ∩ S| > (1 + η) · as/n ] < e^{−η^2 · as/(3n)} and Pr[ |A ∩ S| < (1 − η) · as/n ] < e^{−η^2 · as/(3n)}.
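Claim 4.6 can be checked exactly for small parameters by computing the hypergeometric tail with binomial coefficients (our own sanity check; the parameter values are illustrative):

```python
import math

def hypergeom_tail(n: int, a: int, s: int, threshold: float) -> float:
    """Exact Pr[|A ∩ S| > threshold] for |A| = a and a uniform size-s subset S of [n]."""
    total = math.comb(n, s)
    favorable = sum(
        math.comb(a, k) * math.comb(n - a, s - k)
        for k in range(max(0, s - (n - a)), min(a, s) + 1)
        if k > threshold
    )
    return favorable / total

n, a, s, eta = 1000, 100, 200, 0.5
mu = a * s / n                        # expected intersection size: here 20.0
tail = hypergeom_tail(n, a, s, (1 + eta) * mu)
bound = math.exp(-eta ** 2 * mu / 3)  # Chernoff-style bound from Claim 4.6
```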
We use this to show that indeed all the |Aj ∩ S| concentrate around their expected values for 1 ≤ j ≤ q0. First note that, as a consequence of Lemma 4.2, these expected values satisfy a_{q0−j} s/n ≥ β^{50(j+1)} for every 0 ≤ j ≤ q0 − 1 (with probability at least 99/100). Conditioning on this, we first invoke Claim 4.6 on Aj with η = 3β^{−20(j+1)}, and then apply a union bound to obtain
Pr[ ∃j ∈ [q0] s.t. |Aj ∩ S| ∉ [1 − 4β^{−20(j+1)}, 1 + 4β^{−20(j+1)}] · aj s/n ] < e^{−β^{10}},   (4.5)
i. e., with high probability all intersections simultaneously concentrate around their expected values. Note that since s is at most n^{3/4}, each Ai under consideration has size at least nβ^{50}/n^{3/4} > n^{1/4}. Therefore, the probability that a random selection of elements from A1, . . . , A_{q0} exhibits no collision is at least
∏_{i=1}^{q0} (|Ai| − q0)/|Ai| ≥ (1 − q0/n^{1/4})^{q0} ≥ 1 − (q0)^2/n^{1/4} > 1 − (log^2 n)/n^{1/4}.
We henceforth condition on this event.
Let N = (n choose s) be the number of possible outcomes for the set S. We write N0 ≥ N(1 − e^{−β^{10}}) for the number of such sets for which equation (4.5) holds. Let s_1^{q0} denote (s1, . . . , s_{q0}). For a set of distinct elements (s1, . . . , s_{q0}) ∈ A1 × · · · × A_{q0}, let N(s_1^{q0}) = (n−q0 choose s−q0) be the number of sets of size s that contain s1, . . . , s_{q0}, and let N0(s_1^{q0}) be the number of them that satisfy equation (4.5).
By Markov's inequality, with probability at least 1 − e^{−β^9}, for a randomly chosen s_1^{q0} we have N0(s_1^{q0})/N(s_1^{q0}) > 1 − e^{2−β^9}. For any such s_1^{q0},
Pr[s_1^{q0}] ≥ (N0(s_1^{q0})/N) · ∏_{i=1}^{q0} 1/|Ai ∩ S| ≥ (1 − e^{2−β^9}) · (s(s − 1) · · · (s − q0 + 1))/(n(n − 1) · · · (n − q0 + 1)) · (1 − 4β^{−19}) ∏_{i=1}^{q0} n/(ai s) ≥ (1 − 6β^{−19}) ∏_{i=1}^{q0} 1/ai,
for large n and as s ≥ n^{1/4}. Since the sum of the probabilities of all elements is at most 1, the other side of the inequality in Lemma 4.5 follows.
Proof of Theorem 1.2 and Theorem 1.3. Let T1 (resp. T2, TU) be the distribution over transcripts generated by the queries A1, . . . , Aq when given conditional access to the distribution D1 from a no-instance (resp. D2, resp. the uniform distribution U[n]); that is, a distribution over q-tuples in A1 × · · · × Aq. Since the queries are non-adaptive, we can break T1 (and similarly T2, TU) into T1^big × T1^small according to q0, and use Lemma 4.5 and Lemma 4.4 separately to obtain
dTV(T1, T2) ≤ ηs + η* + 2^{−10} < 1/50 and dTV(T1, TU) ≤ ηs + η* + 2^{−10} < 1/50
(for the latter, recalling that queries that do not intersect the support receive samples exactly uniformly distributed in the query set)—thus establishing both theorems.
4.1 On the dependence on ε in Theorem 1.3
We remark that Theorem 1.3, by establishing a lower bound of Ω(log n) queries for non-adaptive testing
of uniformity with constant distance parameter 1/4, immediately implies, by a standard argument, an
Ω((log n)/ε) lower bound for distance parameter ε ∈ (0, 1/4). In more detail, this is a consequence
of the following reduction: any q(n, ε)-query non-adaptive tester for uniformity T can be used, given
conditional access to some distribution D on [n], on the mixture distribution
def
Dε = 4εD + (1 − 4ε)U[n] , (4.6)
for which a conditional oracle can be easily simulated given a conditional oracle for D. Moreover,
answering q(n, ε) to Dε can be done with an expected 4εq(n, ε) conditional queries to D. As it is
immediate to see that dTV (Dε , U[n] ) = 4ε dTV (D, U[n] ), we get that T can be used to obtain a tester for non-
adaptive testing of uniformity with constant distance parameter 1/4, with query complexity O(εq(n, ε))
for every ε < 1/4 (converting the expected query complexity to a worst-case one is straightforward via
Markov’s inequality followed by success probability amplification by a constant number of repetitions).
Therefore, the lower bound of Theorem 1.3 implies that q(n, ε) = Ω((log n)/ε), as claimed. It is also
worth noting that the above argument does not yield an analogue statement for support-size estimation
via Theorem 1.2. Indeed, mixing the distribution D with the uniform distribution does not preserve the
support size in that case (nor the guarantee that every point of the support has probability mass at least
τ/n).
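For illustration, the identity dTV (Dε , U[n] ) = 4ε · dTV (D, U[n] ) driving this reduction is easy to check numerically. The short Python sketch below (purely illustrative, with an arbitrary randomly generated D) verifies it for the mixture defined in (4.6).

```python
import numpy as np

def tv_distance(p, q):
    # Total variation distance: half the L1 distance between the vectors.
    return 0.5 * np.abs(p - q).sum()

n, eps = 1000, 0.1
rng = np.random.default_rng(0)

D = rng.random(n)
D /= D.sum()                              # an arbitrary distribution on [n]
U = np.full(n, 1.0 / n)                   # the uniform distribution U_[n]
D_eps = 4 * eps * D + (1 - 4 * eps) * U   # the mixture from (4.6)

# The uniform parts cancel, so D_eps - U = 4*eps*(D - U):
assert np.isclose(tv_distance(D_eps, U), 4 * eps * tv_distance(D, U))
```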
5 An upper bound for support-size estimation
In this section, we prove our upper bound for constant-factor support-size estimation, reproduced below.
Theorem 1.4 (Adaptive Support-Size Estimation). Let τ > 0 be any constant. There exists an adaptive
algorithm which, given COND access to an unknown distribution D on [n] (guaranteed to have probability
mass at least τ/n on every element of its support) and accuracy parameter ε ∈ (0, 1), makes
Õ((log log n)/ε³)¹¹ queries to the oracle and returns a value ω̃ such that the following holds. With
probability at least 2/3, ω̃ ∈ [(1/(1 + ε)) · ω, (1 + ε) · ω], where ω = |supp(D)|.
Before describing and analyzing our algorithm, we shall need the following results, which we will
use as subroutines: the first one will help us detect when the support is already dense. The second,
assuming the support is sparse enough, will enable us to find an element with zero probability mass,
which can afterwards be used as a “reference” to verify whether any given element is inside or outside
the support. Finally, the last one will use such a reference point to check whether a candidate support size
σ is smaller than, or significantly bigger than, the actual support size.
Lemma 5.1. Given τ > 0 and COND access to a distribution D such that each support element has
probability at least τ/n, as well as parameters ε ∈ (0, 1/2), δ ∈ (0, 1), there exists an algorithm T EST S-
MALL S UPPORT (Algorithm 2) that makes
Õ( (1/(τε²) + 1/τ²) · log(1/δ) )
queries to the oracle, and satisfies the following. (i) If supp(D) ≥ (1 − ε/2)n, then it returns ACCEPT
with probability at least 1 − δ ; (ii) if supp(D) ≤ (1 − ε)n, then it returns REJECT with probability at
least 1 − δ .
11 We remark that the constant in the Õ depends polynomially on 1/τ.
A C HASM B ETWEEN I DENTITY AND E QUIVALENCE T ESTING WITH C ONDITIONAL Q UERIES
Lemma 5.2. Given COND access to a distribution D, an upper bound m < n on supp(D), as well as
parameter δ ∈ (0, 1), there exists an algorithm G ET N ON S UPPORT (Algorithm 3) that makes
Õ( log²(1/δ) · log^{−2}(n/m) )
queries to the oracle, and returns an element r ∈ [n] such that r ∉ supp(D) with probability at least 1 − δ.
Lemma 5.3. Given COND access to a distribution D, inputs σ ≥ 2 and r ∉ supp(D), as well as
parameters ε ∈ (0, 1/2), δ ∈ (0, 1), there exists an algorithm I S AT M OST S UPPORT S IZE (Algorithm 4)
that makes Õ((1/ε²) · log(1/δ)) queries to the oracle, and satisfies the following. The algorithm returns
either yes or no, and (i) if σ ≤ supp(D), then it returns yes with probability at least 1 − δ; (ii) if
σ > (1 + ε) supp(D), then it returns no with probability at least 1 − δ.
We defer the proofs of these 3 lemmata to the next subsections, and now turn to the proof of the
theorem.
Proof. The algorithm is given in Algorithm 1, and at a high level works as follows: it first checks whether
the support size is big (a 1 − O(ε) fraction of the domain), in which case it can already stop and return
a good estimate. If this is not the case, however, then the support is sparse enough to efficiently find
an element r outside the support, by taking a few uniform points, comparing and ordering them by
probability mass (and keeping the lightest). This element r can then be used as a reference point in a
(doubly exponential) search for a good estimate: for each guess ω̃, a random subset S of size roughly n/ω̃
is taken, a point x is drawn from DS , and x is compared to r to check if D(x) > 0. If so, then S intersects
the support, indicating that ω̃ does not (significantly) overshoot ω; repeating until this is no longer the
case results in an accurate estimate of ω.
Algorithm 1 E STIMATE S UPPORT^D
1: if T EST S MALL S UPPORT^D(ε, 1/10) returns ACCEPT then return ω̃ ← (1 − ε²)n
2: end if
3: Call G ET N ON S UPPORT^D((1 − ε/2)n, 1/10) to obtain a non-support reference point r.
4: for j from 0 to log_{1+ε} log_{1+ε} n do
5:    Set ω̃ ← (1 + ε)^{(1+ε)^j}.
6:    Call I S AT M OST S UPPORT S IZE^D(ω̃, r, ε, 1/(100(j+1)²)) to check if ω̃ is an upper bound on ω.
7:    if the call returned no then
8:        Perform a binary search over i ∈ {(1 + ε)^{j−1}, . . . , (1 + ε)^j} to find i∗, the smallest i ≥ 2 such that I S AT M OST S UPPORT S IZE^D((1 + ε)^i, r, ε, 1/(10(j+1))) returns no.
9:        return ω̃ ← (1 + ε)^{i∗−1}.
10:   end if
11: end for
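To make the search structure concrete, the following Python sketch (illustrative only) runs the doubly exponential search and the subsequent binary search of Algorithm 1 against an idealized oracle standing in for I S AT M OST S UPPORT S IZE; the oracle here simply answers in a way consistent with the guarantees of Lemma 5.3.

```python
import math

def estimate_support(omega, eps, n):
    """Skeleton of Algorithm 1's search, with an idealized oracle in place
    of IsAtMostSupportSize: it answers yes iff sigma <= (1 + eps) * omega
    (any answering rule consistent with Lemma 5.3 would do)."""
    def at_most(sigma):          # idealized IsAtMostSupportSize
        return sigma <= (1 + eps) * omega

    # Doubly exponential search over guesses (1+eps)^((1+eps)^j), j = 0, 1, ...
    j = 0
    while at_most((1 + eps) ** ((1 + eps) ** j)) and (1 + eps) ** ((1 + eps) ** j) < n:
        j += 1
    # Binary search over integer exponents i in ((1+eps)^(j-1), (1+eps)^j]
    # for the smallest i on which the oracle answers no.
    lo = int((1 + eps) ** (j - 1)) if j > 0 else 1
    hi = math.ceil((1 + eps) ** j)
    while lo < hi:
        mid = (lo + hi) // 2
        if at_most((1 + eps) ** mid):
            lo = mid + 1
        else:
            hi = mid
    return (1 + eps) ** (lo - 1)   # within a (1+eps) factor of omega

omega_hat = estimate_support(100, 0.2, 10 ** 6)
assert 100 < omega_hat <= 1.2 * 100   # a (1+eps)-accurate estimate
```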
In the rest of this section, we formalize and rigorously argue the above. Conditioning on each of the
calls to the subroutines T EST S MALL S UPPORT, G ET N ON S UPPORT and I S AT M OST S UPPORT S IZE being
correct (which overall happens except with probability at most
1/10 + 1/10 + Σ_{j=1}^{∞} 1/(100 j²) + 1/10 < 1/3
by a union bound), we show that the output ω̃ of E STIMATE S UPPORT is indeed within a factor (1 + ε) of
ω.
• If the test on Step 1 passes, then by Lemma 5.1 we must have supp(D) > (1 − ε)n. Thus, the
estimate we return is correct, as [(1 − ε)n, n] ⊆ [ω̃/(1 + ε), (1 + ε)ω̃].
• Otherwise, if it does not, then by Lemma 5.1 it must be the case that supp(D) < (1 − ε/2)n.
Therefore, if we reach Step 3 then (1 − ε/2)n is indeed an upper bound on ω, and G ET N ON S UPPORT
will return a point r ∉ supp(D) as expected. The analysis of the rest of the algorithm is straightforward:
from the guarantee of I S AT M OST S UPPORT S IZE, the binary search will be performed for the first index
j such that ω ∈ [(1 + ε)^{(1+ε)^{j−1}}, (1 + ε)^{(1+ε)^j}], and will be on a set of (1 + ε)^{j−1} values. Similarly, for
the value i∗ eventually obtained, it must be the case that (1 + ε)^{i∗} > ω (by contrapositive, as no was
returned by the subroutine) but (1 + ε)^{i∗−1} ≤ (1 + ε)ω (again, as the subroutine returned yes). But then,
ω̃ = (1 + ε)^{i∗−1} ∈ (ω/(1 + ε), (1 + ε)ω] as claimed.
Query complexity. The query complexity of our algorithm originates from the following different
steps:
• the call to T EST S MALL S UPPORT, which from Lemma 5.1 costs Õ(1/ε²) queries;
• the call to G ET N ON S UPPORT, on Step 3, which from the choice of the upper bound also costs
Õ(1/ε²) queries;
• the (at most) log_{1+ε} log_{1+ε} n = O((log log n)/ε) calls to I S AT M OST S UPPORT S IZE on Step 6.
Observing that the query complexity of I S AT M OST S UPPORT S IZE is only Õ((1/ε²) · log(1/δ)), and
from the choice of δ = 1/(100(j+1)²) at the j-th iteration, this step costs at most
Õ(1/ε²) · Σ_{j=1}^{log_{1+ε} log_{1+ε} n} O(log j²) = Õ((1/ε²) · log_{1+ε} log_{1+ε} n) = Õ((log log n)/ε³)
queries.
• Similarly, Step 8 results in at most j ≤ log log n calls to I S AT M OST S UPPORT S IZE with δ set to
1/(10(j+1)), again costing
Õ((1/ε²) · log j) = Õ((1/ε²) · log_{1+ε} log_{1+ε} n) = Õ((log log n)/ε³)
queries.
Gathering all terms, the overall query complexity is Õ((log log n)/ε³), as claimed.
5.1 Proof of Lemma 5.1
Hereafter, we assume without loss of generality that τ < 2: indeed, if τ ≥ 2 then the support is of size
at most n/2, and it suffices to return REJECT to meet the requirements of the lemma. We will rely on
the (easy) fact below, which ensures that any distribution with dense support and minimum non-zero
probability τ/n puts significant mass on “light” elements.
Fact 5.4. Fix any ε ∈ [0, 1). Assume D satisfies both supp(D) ≥ (1−ε)n and D(x) ≥ τ/n for x ∈ supp(D).
Then, setting Lε := { x ∈ [n] : D(x) ∈ [τ/n, 2/n] }, we have |Lε | ≥ (1/2 − ε)n and D(Lε ) ≥ (1/2 − ε)τ.
Proof. As the second claim follows directly from the first and the minimum mass of elements of Lε , it
suffices to prove that |Lε | ≥ (1/2 − ε)n. This follows from observing that
1 = D([n]) ≥ D([n] \ Lε ) ≥ (2/n) · (|supp(D)| − |Lε |) ≥ 2(1 − ε) − 2|Lε |/n
and rearranging the terms.
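The fact can also be checked numerically on a concrete example. In the Python sketch below (with illustrative constants), a quarter of the domain is “heavy” (mass above 2/n), the remaining support elements are light, and both conclusions of Fact 5.4 hold.

```python
import numpy as np

# Sanity check of Fact 5.4: support of size (1 - eps)*n, every support
# element has mass >= tau/n, and some elements are heavier than 2/n.
n, eps, tau = 10_000, 0.2, 0.4

D = np.zeros(n)
heavy = n // 4                       # n/4 heavy elements of mass 3/n each
light = int((1 - eps) * n) - heavy   # remaining support elements are light
D[:heavy] = 3.0 / n
D[heavy:heavy + light] = (1.0 - heavy * 3.0 / n) / light

assert np.isclose(D.sum(), 1.0)
assert D[D > 0].min() >= tau / n     # minimum-mass guarantee holds

# L_eps = { x : D(x) in [tau/n, 2/n] }
L = (D >= tau / n) & (D <= 2.0 / n)
assert L.sum() >= (0.5 - eps) * n           # |L_eps| >= (1/2 - eps) n
assert D[L].sum() >= (0.5 - eps) * tau      # D(L_eps) >= (1/2 - eps) tau
```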
Description and intuition. The algorithm (described in Algorithm 2) works as follows: it first takes
enough uniformly distributed samples s1 , . . . , sm so that (with high probability) the fraction of them
falling in the support is accurate enough to distinguish between the two cases. The issue is now to detect
those s_j which indeed are support elements; note that we do not care about underestimating this fraction
in case (b) (when the support is at most (1 − ε)n), but importantly do not want to underestimate it in case
(a) (when the support size is at least (1 − ε/2)n). To perform this detection, we take constantly many
samples according to D (which are therefore guaranteed to be in the support), use pairwise conditional
queries to sort them by increasing probability mass (up to approximation imprecision), and keep only
the lightest of them, t. In case (a), we know from Fact 5.4 that with high probability our t has mass in
[τ/n, 2/n], and will therefore be either much lighter than or comparable to any support element: this will
ensure that in case (a) we do detect all of the s_j that are in the support.
This also works in case (b), even though Fact 5.4 does not give us any guarantee on the mass of t.
Indeed, either t turns out to be light (and then the same argument ensures that our estimate of the number
of “support” elements s_j is good), or t is too heavy—and then our estimate will end up being smaller
than the true value. But this is fine, as the latter only means we will reject the distribution (as we should,
since we are in the small-support case).
Correctness. Let η be the fraction of the elements s_j that are in the support of the distribution. By a
multiplicative Chernoff bound and a suitable constant in our choice of m, we get that (i) if supp(D) ≥
(1 − ε/2)n, then Pr[ η < 1 − 3ε/4 ] ≤ 1/12, while (ii) if supp(D) ≤ (1 − ε)n, then Pr[ η ≥ 1 − 3ε/4 ] ≤ 1/12.
We hereafter condition on this (i. e., η being a good enough estimate). We also condition on all calls
to C OMPARE yielding results as specified, which by a union bound overall happens except with
probability 1/12 + 1/12 = 1/6, and break the rest of the analysis into two cases.
(a) Since the support size ω is in this case at least (1 − ε/2)n, from Fact 5.4 we get that
D(Lε/2 ) ≥ ((1 − ε)/2) · τ ≥ τ/4.
Algorithm 2 T EST S MALL S UPPORT^D
Require: COND access to D; accuracy parameter ε ∈ (0, 1/2), threshold τ > 0, probability of failure δ
1: Repeat the following O(log(1/δ)) times and return the majority vote.
2: loop
3:    Draw m := Θ(1/ε²) independent samples s1 , . . . , sm ∼ U[n] .
4:    Draw k := Θ(1/τ) independent samples t1 , . . . , tk ∼ D.
5:    for all 1 ≤ i < j ≤ k do    ▷ Order the t_j
6:        Call C OMPARE({t_i}, {t_j}, η = 1/2, K = 2, 1/(4k²)) to get a 2-approx. ρ of D(t_j)/D(t_i), High or Low.
7:        if C OMPARE returned High or a value ρ then
8:            Record t_i ≺ t_j
9:        else
10:           Record t_j ≺ t_i
11:       end if
12:   end for
13:   Set t to be (any of the) smallest elements t_j according to ≺.
14:   for all 1 ≤ j ≤ m do    ▷ Find the fraction of support elements among the s_j
15:       Call C OMPARE({t}, {s_j}, η = 1/2, K = 2/τ, 1/(4m)) to get either a value ρ, High or Low.
16:       if C OMPARE returned High or a value ρ ≥ 1/2 then
17:           Record s_j as “support.”
18:       end if
19:   end for
20:   if the number of elements s_j marked “support” is at least (1 − (3/4)ε)m then return ACCEPT
21:   else return REJECT
22:   end if
23: end loop
Therefore, except with probability at most (1 − τ/4)k < 1/12, at least one of the t j will belong to
Lε/2 . When this happens, and by the choice of parameters in the calls to C OMPARE, we get that
t ∈ Lε/2 ; that is D(t) ∈ [τ/n, 2/n]. But then the calls to the routine on Step 15 will always return
either a value (since t is “comparable” to all x ∈ Lε/2 —i. e., has probability within a factor 2/τ of
them) or High (possible for those s j that have weight greater than 2/n), unless s j has mass 0 (that
is, is not in the support). Therefore, the fraction of points marked as support is exactly η, which by
the foregoing discussion is at least 1 − 3ε/4: the algorithm returns ACCEPT at Step 20.
(b) Conversely, if ω ≤ (1 − ε)n, there will be a fraction 1 − η > 3ε/4 of the s_j having mass 0. However,
no matter what t is, it will still be in the support and therefore have D(t) ≥ τ/n: for these s_j, the
call to C OMPARE on Step 15 can thus only return Low. This means that fewer than (1 − (3/4)ε)m
points can be marked “support” among the s_j, and hence that the algorithm will return REJECT as it should.
Overall, the inner loop of the algorithm thus only fails with probability at most 1/12 + 1/6 + 1/12 = 1/3
(where the three events contributing to the union bound are (i) η failing to be a good estimate, (ii) the
calls to C OMPARE failing to yield results as claimed, and (iii) no t_j hitting Lε/2 in case (a)). Repeating
independently log(1/δ) times and taking the majority vote boosts the probability of success to 1 − δ.
Query complexity. The query complexity comes from the O(k²) calls on Step 6 (each costing O(log k)
queries) and the m calls on Step 15 (each costing O((1/τ) log m) queries). By the setting of m and k, and
because of the log(1/δ) repetitions, this results in an overall query complexity of
O( ((1/τ²) log(1/τ) + (1/(τε²)) log(1/ε)) · log(1/δ) ).
5.2 Proof of Lemma 5.2
As described in Algorithm 3, the subroutine is fairly simple: using its knowledge of an upper bound on
the support size, it takes enough uniformly distributed samples to have (with high probability) at least one
of them fall outside the support. Then, it uses the conditional oracle to “order” these samples according to
their probability mass, and returns the lightest of them—i. e., one with zero probability mass.
Algorithm 3 G ET N ON S UPPORT^D(m, δ)
Require: COND access to D; upper bound m on supp(D), probability of failure δ
Ensure: Returns r ∈ [n] such that, with probability at least 1 − δ, r ∉ supp(D)
1: Set k := ⌈log(2/δ)/log(n/m)⌉.
2: Draw independently k points s1 , . . . , sk ∼ U[n] .
3: for all 1 ≤ i < j ≤ k do
4:    Call C OMPARE({s_i}, {s_j}, η = 1/2, K = 2, δ/(2k²)) to get a 2-approx. ρ of D(s_j)/D(s_i), High or Low.
5:    if C OMPARE returned High or a value ρ then
6:        Record s_i ≺ s_j
7:    else
8:        Record s_j ≺ s_i
9:    end if
10: end for
11: return arg min≺ {s1 , . . . , sk }    ▷ Return (any) minimal element for ≺.
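The following Python sketch (hypothetical helper names) simulates this subroutine with an idealized C OMPARE, i. e., it orders the sampled points by their true probability mass; the real algorithm only obtains this ordering approximately through conditional queries.

```python
import math
import random

def get_non_support(probs, n, m, delta, rng):
    # k chosen so that (m/n)^k <= delta/2, as in Step 1 of Algorithm 3.
    k = max(1, math.ceil(math.log(2 / delta) / math.log(n / m)))
    samples = [rng.randrange(n) for _ in range(k)]
    # Idealized ordering: return (any) lightest sampled point; if some
    # sample missed the support, the minimum has probability mass 0.
    return min(samples, key=lambda x: probs[x])

n, support_size = 1000, 200
probs = [1 / support_size if x < support_size else 0.0 for x in range(n)]

# The returned point lies outside the support except with prob. <= delta.
failures = sum(
    probs[get_non_support(probs, n, m=support_size, delta=0.1,
                          rng=random.Random(seed))] > 0
    for seed in range(500))
assert failures <= 50   # empirically well within the delta = 0.1 bound
```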
Correctness. It is straightforward to see that provided at least one of the s_j falls outside the support and
all calls to C OMPARE behave as expected, the procedure returns one of the “lightest” elements
s_j, i. e., a non-support element. By a union bound, the latter holds with probability at least 1 − δ/2; as for
the former, since m is by assumption an upper bound on the support size, it holds with probability at least
1 − (m/n)^k ≥ 1 − δ/2 (from our setting of k). Overall, the procedure’s output is correct with probability
at least 1 − δ, as claimed.
Query complexity. The query complexity of G ET N ON S UPPORT is due to the O(k²) calls to
C OMPARE, and is therefore O(k² log(k/δ)) because of our setting for η and K, which is in turn
Õ( log²(1/δ) · log^{−2}(n/m) ). (In our case, we shall eventually take m = (1 − ε/2)n and δ = 1/10, thus
getting k = O(1/ε) and a query complexity of Õ(1/ε²).)
5.3 Proof of Lemma 5.3
Our final subroutine, described in Algorithm 4, essentially derives from the following observation: a
random set S (of expected size n/σ) obtained by including independently each element of the domain
with probability 1/σ will intersect the support on ω/σ points in expectation. What we can test given
our reference point r ∉ supp(D), however, is only whether S ∩ supp(D) = ∅. But this is enough, as by
repeating sufficiently many times (taking a random S and testing whether it intersects the support at all)
we can distinguish between the two cases we are interested in. Indeed, the expected fraction of times
S includes a support element in either case is known to the algorithm and differs by roughly Ω(ε), so
O(1/ε²) repetitions are enough to tell the two cases apart.
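The separation this relies on, namely that p = (1 − 1/σ)^ω is at most α when σ ≤ ω but exceeds α + τ when σ > (1 + ε)ω, can be checked numerically; the Python sketch below (with illustrative parameter choices) verifies it over a range of cases.

```python
# Numeric check of the gap exploited by IsAtMostSupportSize:
# p = (1 - 1/sigma)^omega is the probability that the random set S misses
# the support.  For sigma <= omega it is at most alpha = (1 - 1/sigma)^sigma,
# while for sigma > (1+eps)*omega it exceeds alpha^(1 - eps/2) = alpha + tau,
# where tau = alpha * (alpha^(-eps/2) - 1).
eps = 0.1
for omega in (10, 100, 10_000):
    for sigma in (2, omega // 2 + 1, omega):           # case sigma <= omega
        alpha = (1 - 1 / sigma) ** sigma
        p = (1 - 1 / sigma) ** omega
        assert p <= alpha + 1e-12
    for mult in (1.5, 2.0, 10.0):                      # case sigma > (1+eps)*omega
        sigma = int(mult * (1 + eps) * omega) + 1
        alpha = (1 - 1 / sigma) ** sigma
        tau = alpha * (alpha ** (-eps / 2) - 1)
        p = (1 - 1 / sigma) ** omega
        assert p > alpha + tau
```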
Algorithm 4 I S AT M OST S UPPORT S IZE^D(σ, r, ε, δ)
Require: COND access to D; size σ ≥ 2, non-support element r, accuracy ε, probability of failure δ
Ensure: Returns, with probability at least 1 − δ, yes if σ ≤ |supp(D)| and no if σ > (1 + ε)|supp(D)|.
1: Set α ← (1 − 1/σ)^σ ∈ [1/4, e^{−1}], τ ← α(α^{−ε/2} − 1) = Θ(ε).
2: Repeat the following O(log(1/δ)) times and return the majority vote.
3: loop
4:    for m = O(1/τ²) times do
5:        Draw a subset S ⊆ [n] by including independently each x ∈ [n] with probability 1/σ.
6:        Draw x ∼ DS .
7:        Call C OMPARE({x}, {r}, η = 1/2, K = 1, 1/(100m)).    ▷ Low if S ∩ supp(D) ≠ ∅; ρ ∈ [1/2, 2) o.w.
8:        Record yes if C OMPARE returned Low, no otherwise.
9:    end for
10:   return no if at least m(α + τ/2) “no”’s were recorded, yes otherwise.    ▷ Thresholding.
11: end loop
Correctness. We condition on all calls to C OMPARE being correct: by a union bound, this overall
happens with probability at least 99/100. We shall consider the two cases σ ≤ ω and σ > (1 + ε)ω, and
focus on the probability p of recording no on Step 8 (i. e., of S missing the support) in any fixed one of
the m iterations. In both cases, p is exactly (1 − 1/σ)^ω.
• If σ ≤ ω, then we have p ≤ (1 − 1/σ)^σ = α.
• If σ > (1 + ε)ω, then p > (1 − 1/σ)^{σ/(1+ε)} > (1 − 1/σ)^{σ(1−ε/2)} = α^{1−ε/2}.
As α ∈ [1/4, e^{−1}], the difference between the two bounds is τ = α(α^{−ε/2} − 1) = Θ(ε). Thus, repeating
the atomic test of Step 4 O(1/τ²) times before thresholding at Step 10 yields the right answer with
constant probability, which is then brought to 1 − δ by the outer repetition and majority vote.
Query complexity. Each call to C OMPARE at Step 7 costs O(log m) queries, and is in total repeated
O(m log(1/δ)) times. By the setting of m and τ, the overall query complexity is therefore
O( (1/ε²) · log(1/ε) · log(1/δ) ).
5.4 A non-adaptive upper bound
In this section, we sketch how similar—yet less involved—ideas can be used to derive a non-adaptive
upper bound for support-size estimation. For simplicity, we describe the algorithm for 2-approximation:
adapting it to general (1 + ε)-approximation is straightforward.
The high-level idea is to perform a simple binary search (instead of the doubly exponential search
from the preceding section) to identify the greatest lower bound on the support size of the form k = 2^j.
For each guess k ∈ {2, 4, 8, . . . , n}, we pick uniformly at random a set S ⊆ [n] of cardinality k, and check
whether DS is uniform using the non-adaptive tester of Chakraborty et al. [16, Theorem 4.1.2]. If DS is
found to be uniform for all values of k, we return n as our estimate (as the distribution is close to uniform
on [n]); otherwise, we return n/k, for the smallest k on which DS was found to be far from uniform.
Indeed, DS can only be far from uniform if S contains points from the support of D, which intuitively
only happens once k = Ω(n/ω).
To be more precise, the algorithm proceeds as follows, where τ > 0 is an absolute constant.
for all k ∈ {2, 4, . . . , n} do
Set a counter ck ← 0.
for m = O(log log n) times do
Pick uniformly at random a set S ⊆ [n] of k elements.
Test (non-adaptively) uniformity of DS on S, with the tester of [16].
if the tester rejects then increment ck .
end if
end for
if ck > τ · m then return ω̃ ← n/k.
end if
end for
return ω̃ ← n.
The query complexity is easily seen to be poly(log n), from the Õ(log n) calls to the poly(log n)-query
tester of [16, Theorem 4.1.2]. As for correctness, it follows from the fact that for any set S with mass
D(S) > 0 which contains at least an η fraction of points outside the support, it holds that DS is η-far from US.
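A toy Python simulation of this scheme (illustrative only: the uniformity tester of [16] is replaced by an idealized tester that computes the distance exactly, and the constants are arbitrary choices) recovers the support size up to a constant factor.

```python
import random

def far_from_uniform(D, S, thresh=0.1):
    """Idealized stand-in for the non-adaptive uniformity tester of [16]:
    reports True iff D restricted to S is thresh-far from uniform on S
    (the real tester only guarantees this with high probability, at
    poly(log n) query cost)."""
    mass = sum(D[x] for x in S)
    if mass == 0:
        return False                      # D_S is uniform by convention
    tv = 0.5 * sum(abs(D[x] / mass - 1 / len(S)) for x in S)
    return tv > thresh

def estimate_support_nonadaptive(D, n, rng, m=40):
    k = 2
    while k <= n:
        rejections = sum(far_from_uniform(D, rng.sample(range(n), k))
                         for _ in range(m))
        if rejections > m // 2:           # threshold tau*m with tau = 1/2
            return n // k
        k *= 2
    return n

n, omega = 4096, 64
D = [1 / omega if x < omega else 0.0 for x in range(n)]
est = estimate_support_nonadaptive(D, n, random.Random(7))
assert omega / 2 <= est <= 2 * omega      # constant-factor estimate
```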
Acknowledgments. Clément Canonne would like to thank Dana Ron and Rocco Servedio for the many
helpful discussions and remarks that influenced the lower bound construction of Section 3. The authors
are grateful to László Babai, Robert Krauthgamer, and the anonymous reviewers for their helpful and
detailed comments.
References
[1] List of Open Problems in Sublinear Algorithms: Problem 66. Bertinoro Workshop on Sublinear
Algorithms 2014 (suggested by Eldar Fischer). Available at http://sublinear.info/66. 5
[2] JAYADEV ACHARYA , C LÉMENT L. C ANONNE , AND G AUTAM K AMATH: Adaptive estimation
in weighted group testing. In IEEE Internat. Symp. Information Theory (ISIT’15), pp. 2116–2120,
2015. [doi:10.1109/ISIT.2015.7282829] 3
[3] JAYADEV ACHARYA , C LÉMENT L. C ANONNE , AND G AUTAM K AMATH: A chasm between
Identity and Equivalence Testing with conditional queries. In Proc. 19th Internat. Workshop on
Randomization and Computation (RANDOM’15), volume 40, pp. 449–466. Schloss Dagstuhl, 2015.
[doi:10.4230/LIPIcs.APPROX-RANDOM.2015.449, arXiv:1411.7346, ECCC:TR14-156] 1
[4] JAYADEV ACHARYA , H IRAKENDU DAS , A SHKAN JAFARPOUR , A LON O RLITSKY, AND
S HENGJUN PAN: Competitive closeness testing. In Proc. 24th Ann. Conf. on Learning The-
ory (COLT’11), volume 19 of JMLR Proceedings, pp. 47–68. JMLR.org, 2011. Accessible at JMLR.
2
[5] JAYADEV ACHARYA , H IRAKENDU DAS , A SHKAN JAFARPOUR , A LON O RLITSKY, S HENGJUN
PAN , AND A NANDA T HEERTHA S URESH: Competitive classification and closeness testing. In
Proc. 25th Ann. Conf. on Learning Theory (COLT’12), pp. 22.1–22.18, 2012. Accessible at JMLR.
2
[6] A NNE AUGER AND B ENJAMIN D OERR: Theory of Randomized Search Heuristics: Foundations
and Recent Developments. Series on theoretical computer science. World Scientific, 2011. 13
[7] M ARIA -F LORINA BALCAN , E RIC B LAIS , AVRIM B LUM , AND L IU YANG: Active property testing.
In Proc. 53rd FOCS, pp. 21–30. IEEE Comp. Soc. Press, 2012. [doi:10.1109/FOCS.2012.64] 3
[8] T U ĞKAN BATU , E LDAR F ISCHER , L ANCE F ORTNOW, R AVI K UMAR , RONITT RUBINFELD , AND
PATRICK W HITE: Testing random variables for independence and identity. In Proc. 42nd FOCS,
pp. 442–451. IEEE Comp. Soc. Press, 2001. [doi:10.1109/SFCS.2001.959920] 2
[9] T U ĞKAN BATU , L ANCE F ORTNOW, RONITT RUBINFELD , WARREN D. S MITH , AND PATRICK
W HITE: Testing that distributions are close. In Proc. 41st FOCS, pp. 189–197. IEEE Comp. Soc.
Press, 2000. This is a preliminary version of [10]. [doi:10.1109/SFCS.2000.892113] 2, 42
[10] T U ĞKAN BATU , L ANCE F ORTNOW, RONITT RUBINFELD , WARREN D. S MITH , AND PATRICK
W HITE: Testing closeness of discrete distributions. 2010. This is a long version of [9].
[arXiv:1009.5397] 4, 5, 42
[11] T U ĞKAN BATU , R AVI K UMAR , AND RONITT RUBINFELD: Sublinear algorithms for testing
monotone and unimodal distributions. In Proc. 36th STOC, pp. 381–390. ACM Press, 2004.
[doi:10.1145/1007352.1007414] 2
[12] A RNAB B HATTACHARYYA , E LDAR F ISCHER , RONITT RUBINFELD , AND PAUL VALIANT: Testing
monotonicity of distributions over general partial orders. In Proc. 2nd Conf. on Innovations in
Theoret. Comput. Sci. (ITCS’11), pp. 239–252, 2011. [ECCC:TR10-027] 2
[13] C LÉMENT L. C ANONNE: A survey on distribution testing: Your data is Big, but is it Blue? Electron.
Colloq. on Comput. Complexity (ECCC), April 2015. [ECCC:TR15-063] 2
[14] C LÉMENT L. C ANONNE , DANA RON , AND ROCCO A. S ERVEDIO: Testing probability distribu-
tions using conditional samples. SIAM J. Comput., 44(3):540–616, 2015. Preliminary version in
SODA’14. [doi:10.1137/130945508, arXiv:1211.2664, ECCC:TR12-155] 2, 3, 4, 5, 7, 9, 10, 15
[15] C LÉMENT L. C ANONNE AND RONITT RUBINFELD: Testing probability distributions underlying
aggregated data. In Proc. 41st Internat. Colloq. on Automata, Languages and Programming
(ICALP’14), pp. 283–295. Springer, 2014. [doi:10.1007/978-3-662-43948-7_24, arXiv:1402.3835,
ECCC:TR14-021] 2
[16] S OURAV C HAKRABORTY, E LDAR F ISCHER , YONATAN G OLDHIRSH , AND A RIE M ATSLIAH: On
the power of conditional samples in distribution testing. SIAM J. Comput., 45(4):1261–1296, 2016.
Preliminary version in ITCS’13. [doi:10.1137/140964199, arXiv:1210.8338, ECCC:TR12-154] 2,
3, 4, 6, 7, 9, 10, 16, 41
[17] C HUN L AM C HAN , S IDHARTH JAGGI , V ENKATESH S ALIGRAMA , AND S AMAR AGNIHOTRI:
Non-adaptive group testing: Explicit bounds and novel algorithms. IEEE Trans. Inform. The-
ory, 60(5):3019–3035, 2014. Preliminary version in ISIT’12. [doi:10.1109/TIT.2014.2310477,
arXiv:1202.0206] 3
[18] S IU -O N C HAN , I LIAS D IAKONIKOLAS , G REGORY VALIANT, AND PAUL VALIANT: Optimal al-
gorithms for testing closeness of discrete distributions. In Proc. 25th Ann. ACM-SIAM Symp. on Dis-
crete Algorithms (SODA’14), pp. 1193–1203. ACM Press, 2014. [doi:10.1137/1.9781611973402.88,
arXiv:1308.3946] 2, 5, 7
[19] VAŠEK C HVÁTAL: The tail of the hypergeometric distribution. Discrete Mathematics, 25(3):285 –
287, 1979. [doi:10.1016/0012-365X(79)90084-0] 13
[20] S ANJOY DASGUPTA: Analysis of a greedy active learning strategy. In Advances in Neural Informa-
tion Processing Systems 17, pp. 337–344. MIT Press, 2004–2005. Available at NIPS. 3
[21] ROBERT D ORFMAN: The detection of defective members of large populations. Ann. Math. Statist.,
14(4):436–440, 1943. [doi:10.1214/aoms/1177731363] 3
[22] D INGZHU D U AND F RANK K. H WANG: Combinatorial Group Testing and Its Applications.
Applied Mathematics. World Scientific, 2000. 3
[23] D EVDATT P. D UBHASHI AND D ESH R ANJAN: Balls and Bins: A study in negative de-
pendence. Random Structures Algorithms, 13(2):99–124, 1996. [doi:10.1002/(SICI)1098-
2418(199809)13:2<99::AID-RSA1>3.0.CO;2-M] 13
[24] M OEIN FALAHATGAR , A SHKAN JAFARPOUR , A LON O RLITSKY, V ENKATADHEERAJ P ICHAP -
ATHI , AND A NANDA T HEERTHA S URESH: Faster algorithms for testing under conditional sampling.
In Proc. 28th Ann. Conf. on Learning Theory (COLT’15), JMLR Proceedings, pp. 607–636, 2015.
[arXiv:1504.04103] 4, 5, 7
[25] E LDAR F ISCHER: The art of uninformed decisions: A primer to property testing. Bulletin of the
European Association for Theoretical Computer Science, 75:97–126, 2001. 2
[26] O DED G OLDREICH, editor. Property Testing: Current Research and Surveys. Springer, 2010.
LNCS 6390. [doi:10.1007/978-3-642-16367-8] 2
[27] O DED G OLDREICH AND DANA RON: On testing expansion in bounded-degree graphs. Technical
report, Electron. Colloq. on Comput. Complexity (ECCC), 2000. [ECCC:TR00-020] 2, 4
[28] S UDIPTO G UHA , A NDREW M C G REGOR , AND S URESH V ENKATASUBRAMANIAN: Sublinear
estimation of entropy and information distances. ACM Trans. Algorithms, 2009. Preliminary version
in SODA’06. [doi:10.1145/1597036.1597038, arXiv:cs/0508122] 2
[29] P IOTR I NDYK , R EUT L EVI , AND RONITT RUBINFELD: Approximating and testing k-histogram
distributions in sub-linear time. In Proc. 31st Symp. on Principles of Database Systems (PODS’12),
pp. 15–22, 2012. [doi:10.1145/2213556.2213561, ECCC:TR11-171] 2
[30] R EUT L EVI , DANA RON , AND RONITT RUBINFELD: Testing properties of collections of
distributions. Theory of Computing, 9(8):295–347, 2013. Preliminary version in ICS’11.
[doi:10.4086/toc.2013.v009a008, ECCC:TR10-157] 2
[31] R AJEEV M OTWANI AND P RABHAKAR R AGHAVAN: Randomized Algorithms. Cambridge Univer-
sity Press, New York, NY, USA, 1995. 12
[32] H UNG Q. N GO AND D ING -Z HU D U: A Survey on Combinatorial Group Testing Algorithms with
Applications to DNA Library Screening. In DIMACS Ser. in Discr. Math. and Theoret. Comput. Sci.,
number 2, 2000. 3
[33] L IAM PANINSKI: A coincidence-based test for uniformity given very sparsely sampled discrete
data. IEEE Trans. Inform. Theory, 54(10):4750–4755, 2008. [doi:10.1109/TIT.2008.928987] 2, 4, 7
[34] S OFYA R ASKHODNIKOVA , DANA RON , A MIR S HPILKA , AND A DAM S MITH: Strong lower
bounds for approximating distributions support size and the distinct elements problem. SIAM J.
Comput., 39(3):813–842, 2009. Preliminary version in FOCS’07. [doi:10.1137/070701649] 5
[35] DANA RON: Property Testing: A Learning Theory Perspective. Foundations and Trends in Machine
Learning, 1(3):307–402, 2008. Preliminary version in COLT’07. [doi:10.1561/2200000004] 2
[36] DANA RON: Algorithmic and analysis techniques in property testing. Found. Trends Theor. Comput.
Sci., 5:73–205, 2010. [doi:10.1561/0400000029] 2
[37] DANA RON AND G ILAD T SUR: The power of an example: Hidden set size approximation using
group queries and conditional sampling. ACM Trans. Comput. Theory, 8(4):15:1–15:19, 2016.
[doi:10.1145/2930657, arXiv:1404.5568] 6, 8
[38] RONITT RUBINFELD: Taming Big Probability Distributions. XRDS, 19(1):24–28, 2012.
[doi:10.1145/2331042.2331052] 2
[39] RONITT RUBINFELD AND ROCCO A. S ERVEDIO: Testing monotone high-dimensional distributions.
Random Structures Algorithms, 34(1):24–44, January 2009. Preliminary version in STOC’05.
[doi:10.1002/rsa.20247] 2
[40] B URR S ETTLES: Active learning literature survey. CS Technical Reports 1648, University of
Wisconsin–Madison, 2009. Available at MINDS@UW. 3
[41] L ARRY J. S TOCKMEYER: On approximation algorithms for #P. SIAM J. Comput., 14(4):849–861,
1985. [doi:10.1137/0214060] 8
[42] G REGORY VALIANT AND PAUL VALIANT: A CLT and tight lower bounds for estimating entropy.
Electron. Colloq. on Comput. Complexity (ECCC), 2010. [ECCC:TR10-179] 5, 7, 45
[43] G REGORY VALIANT AND PAUL VALIANT: Estimating the unseen: A sublinear-sample canonical
estimator of distributions. Electron. Colloq. on Comput. Complexity (ECCC), 2010. [ECCC:TR10-
180] 45
[44] G REGORY VALIANT AND PAUL VALIANT: The power of linear estimators. In Proc. 52nd FOCS,
pp. 403–412, October 2011. See also [42] and [43]. [doi:10.1109/FOCS.2011.81] 5
[45] G REGORY VALIANT AND PAUL VALIANT: An automatic inequality prover and instance opti-
mal identity testing. SIAM J. Comput., 46(1):429–455, 2017. Preliminary version in FOCS’14.
[doi:10.1137/151002526] 2, 4
[46] PAUL VALIANT: Testing symmetric properties of distributions. SIAM J. Comput., 40(6):1927–1968,
2011. Preliminary version in STOC’08. [doi:10.1137/080734066, ECCC:TR07-135] 5
[47] W IKIPEDIA: Yes, Virginia, there is a Santa Claus — Wikipedia, The Free Encyclopedia,
2017. https://en.wikipedia.org/w/index.php?title=Yes,_Virginia,_there_is_a_
Santa_Claus&oldid=805832621 [Online; accessed 05-November-2017]. 2
AUTHORS
Jayadev Acharya
Assistant professor
Cornell University
acharya cornell edu
http://people.ece.cornell.edu/acharya/
Clément L. Canonne
Postdoctoral researcher
Stanford University
ccanonne cs stanford edu
http://www.cs.columbia.edu/~ccanonne/
Gautam Kamath
Microsoft Research Fellow
Simons Institute for the Theory of Computing
Berkeley, CA
g csail mit edu
http://www.gautamkamath.com/
ABOUT THE AUTHORS
JAYADEV ACHARYA is an assistant professor at Cornell University, where he joined after a
postdoc at MIT. He holds a Ph. D. from UC San Diego, where he was advised by Alon
Orlitsky. His research interests are in algorithms, information theory, machine learning,
and statistics. His awards include a Jack Wolf Student paper award at ISIT’10, and a
Best Paper Honorable Mention at ICML’17.
C LÉMENT L. C ANONNE is a Motwani Postdoctoral Fellow at Stanford University. He grad-
uated from Columbia University in 2017, where he was advised by Rocco Servedio. His
research focuses on the fields of property testing and sublinear algorithms; specifically,
on understanding the strengths and limitations of the standard models in property and
distribution testing, as well as in related areas. He also really likes elephants.
G AUTAM K AMATH is a Microsoft Research Fellow at the Simons Institute for the Theory of
Computing. He graduated from MIT in 2018, where he was advised by Constantinos
Daskalakis. His research focuses on algorithms for statistical tasks (such as distribution
testing and parameter estimation), particularly with considerations which arise in modern
settings for data analysis (including high dimensions, robustness, and privacy). He was
awarded the inaugural STOC Best Student Presentation award at STOC 2012.