Authors: Dana Glasner, Rocco A. Servedio
License: CC-BY-3.0
Theory of Computing, Volume 5 (2009), pp. 191–218
www.theoryofcomputing.org
Distribution-Free Testing Lower Bounds for
Basic Boolean Functions
Dana Glasner∗ Rocco A. Servedio†
Received: September 18, 2008; published: October 17, 2009.
Abstract: In the distribution-free property testing model, the distance between functions is
measured with respect to an arbitrary and unknown probability distribution D over the input
domain. We consider distribution-free testing of several basic Boolean function classes over
{0,1}^n, namely monotone conjunctions, general conjunctions, decision lists, and linear
threshold functions. We prove that for each of these function classes, Ω((n/log n)^{1/5})
oracle calls are required for any distribution-free testing algorithm. Since each of these
function classes is known to be distribution-free properly learnable (and hence testable)
using Θ(n) oracle calls, our lower bounds are polynomially related to the best possible.
ACM Classification: F.2.2, G.3
AMS Classification: 68Q99, 68W20
Key words and phrases: property testing, distribution-free testing, decision list, conjunction, linear
threshold function
1 Introduction
The field of property testing deals with algorithms that decide whether an input object is in a certain
class or is far from being in the class after reading only a small fraction of the object. Property testing
was formally introduced in [22] after significant prior work [5, 4] and has evolved into a rich field of
study (see [3, 8, 11, 20, 21] for some surveys). A standard approach in property testing is to view the
input to the testing algorithm as a function over some finite domain; the testing algorithm is required
to distinguish functions that are in a certain class C from functions that are ε-far from being in class C.
∗ Supported in part by an FFSEAS Presidential Fellowship.
† Supported in part by NSF award CCF-0347282, by NSF award CCF-0523664, and by a Sloan Foundation Fellowship.
2009 Dana Glasner and Rocco A. Servedio
Licensed under a Creative Commons Attribution License DOI: 10.4086/toc.2009.v005a010
In the most commonly considered property testing scenario, a function f is ε-far from class C if
f disagrees with every function g that is in class C on at least an ε fraction of the points in the input
domain; equivalently, the distance between functions f and g is measured with respect to the uniform
distribution over the domain. The testing algorithm “reads” f by adaptively querying a black-box oracle
for f at points x of the algorithm’s choosing (such oracle calls are often referred to as “membership
queries” in computational learning theory). The main goal in designing property testing algorithms is to
use as few queries as possible to distinguish the two types of functions; ideally the number of queries
should depend only on ε and should be independent of the size of f ’s domain.
In recent years there has been considerable work in the standard “uniform distribution” property
testing scenario on testing various natural Boolean function classes. Some classes for which uniform
distribution testing results have been obtained are monotone functions [7, 10, 12]; Boolean literals,
monotone conjunctions, general conjunctions and s-term monotone DNFs [19]; J-juntas [9]; parity func-
tions (which are equivalent to degree-1 polynomials) [5]; degree-d polynomials [2]; decision lists, s-term
DNFs, size-s decision trees and s-sparse polynomials [6]; and linear threshold functions [18].
Distribution-free property testing A natural generalization of the notion of property testing is to
consider a broader notion of the distance between functions. Given a probability distribution D over the
domain, we may define the distance between f and g as the probability that an input x drawn from D
satisfies f(x) ≠ g(x); the “standard” notion of property testing described above corresponds to the case
where D is the uniform distribution. Distribution-free property testing is the study of property testers
in a setting where distance is measured with respect to a fixed but unknown and arbitrary probability
distribution D. Since the distribution D is unknown, in this scenario the testing algorithm is allowed to
draw random samples from D in addition to querying a black-box oracle for the value of the function.
Distribution-free property testing is well-motivated by very similar models in computational learning
theory (namely the model of distribution-free PAC learning with membership queries, which is closely
related to the well-studied model of exact learning from equivalence and membership queries), and by
the fact that in various settings the uniform distribution may not be the best way to measure distances.
Distribution-free property testing has been considered by several authors [1, 13, 16, 14, 15]; we briefly
describe some of the most relevant prior work below.
Goldreich et al. [13] introduced the model of distribution-free property testing, and observed that
any proper distribution-free PAC learning algorithm (a learning algorithm for a class of functions that
always outputs a hypothesis function that itself belongs to the class) can be used to directly obtain a
distribution-free property testing algorithm. They also showed that several graph properties that have
testing algorithms with query complexity independent of input size in the uniform-distribution model
(such as bipartiteness, k-colorability, ρ-clique, ρ-cut and ρ-bisection) do not have distribution-free test-
ing algorithms with query complexity independent of input size. In contrast, Halevy and Kushilevitz [14]
gave a distribution-free algorithm for testing connectivity in sparse graphs that has poly(1/ε) query com-
plexity independent of input size.
A range of positive and negative results have been established for distribution-free testing of Boolean
functions over {0,1}^n. Halevy and Kushilevitz [15] showed that any distribution-free monotonicity testing
algorithm over {0,1}^n must make 2^{Ω(n)} queries; this is in contrast with the uniform distribution
setting, where monotonicity testing algorithms are known that have query complexity poly(n, 1/ε) [7,
10, 12]. On the other hand, [16] showed that for several important function classes over {0,1}^n such
as juntas, parities, low-degree polynomials and Boolean literals, there exist distribution-free testing al-
gorithms with query complexity poly(1/ε) independent of n; these distribution-free results match the
query bounds of uniform distribution testing algorithms for these classes.
To sum up, the current landscape of distribution-free property testing is intriguingly varied. For
some testing problems (juntas, parities, Boolean literals, low-degree polynomials, connectivity in sparse
graphs) the complexity of distribution-free testing is known to be essentially the same as the complexity
of uniform-distribution testing; but for other natural testing problems (monotonicity, bipartiteness, k-
colorability, ρ-clique, ρ-cut, ρ-bisection), distribution-free testing provably requires many more queries
than uniform-distribution testing.
This work Our work is motivated by the fact that for many Boolean function classes over {0, 1}n that
are of fundamental interest, a very large gap exists between the query complexities of the best known
distribution-free property testing algorithms (which typically follow trivially from learning algorithms
and have query complexity Ω(n)) and the best known uniform distribution property testing algorithms
(which typically have query complexity poly(1/ε) independent of n). A natural goal is to try to close
this gap, either by developing efficient distribution-free testing algorithms or by proving lower bounds
for distribution-free testing for these classes.
We study distribution-free testability of several fundamental classes of Boolean functions that have
been previously considered in the uniform distribution testing framework, and have been extensively
studied in various distribution-free learning models. More precisely, we consider the following classes
(in order of increasing generality): monotone conjunctions, arbitrary conjunctions, decision lists, and
linear threshold functions. Each of these four classes is known to be testable in the uniform distribution
setting using poly(1/ε) queries, independent of n (see [19] for monotone and general conjunctions, [6]
for decision lists, and [18] for linear threshold functions). On the other hand, for each of these classes
the most efficient known distribution-free testing algorithm is simply to use a proper learning algorithm.
Using the fact that each of these classes has Vapnik-Chervonenkis dimension Θ(n), standard results
in learning theory yield well-known algorithms that use O(n/ε) random examples and no membership
queries (see, e. g., Chapter 3 of [17]), and known results also imply that any learning algorithm must
make Ω(n) oracle calls (see [23]).
Our main results are strong distribution-free lower bounds for testing each of these four function
classes. As our first main result, we prove:
Theorem 1.1. Let T be any algorithm which, given oracle access to an unknown function f : {0,1}^n →
{0,1} and (sampling) oracle access to an unknown distribution D over {0,1}^n, outputs “yes” with
probability at least 2/3 if f is a monotone conjunction, and outputs “no” with probability at least 2/3 if
f is 1/6-far from every decision list with respect to D. Then T must make Ω((n/log n)^{1/5}) oracle calls
in total.
Since every monotone conjunction and every general conjunction can be expressed as a decision list, we have the
following corollary:
Corollary 1.2. Distribution-free testing of monotone conjunctions, conjunctions, and decision lists all
require Ω((n/log n)^{1/5}) queries.
Additionally, for the case of linear threshold functions we have:
Theorem 1.3. Let T be any algorithm which, given oracle access to an unknown function f : {0,1}^n →
{0,1} and (sampling) oracle access to an unknown distribution D over {0,1}^n, tests whether f is a
linear threshold function or is 1/4-far from every linear threshold function with respect to D. Then T
must make Ω((n/log n)^{1/5}) oracle calls in total.
These results show that for these function classes, distribution-free testing is nearly as difficult (from
a query perspective) as distribution-free learning, and is much more difficult than uniform-distribution
testing. We remind the reader that unlike several models of learning, testing is not monotone with
respect to inclusion: in other words, if C0 is a subclass of C, an efficient algorithm for testing C does
not immediately imply an efficient algorithm for testing C0 . As an example of this, the empty class
consisting of no functions is trivially testable, and the “complete” class consisting of all functions is also
trivially testable, but many intermediate classes are not trivially testable.
Our Approach For simplicity, we discuss here only the construction for the lower bound for monotone
conjunctions. The basic idea is to construct two distributions YES and NO over pairs (h, D) where h
is a Boolean function and D is a distribution over the boolean cube such that for every pair (g, Dg ) in
the support of YES, the function g is a monotone conjunction and for every pair ( f , D f ) in the support
of NO, the function f is far from every monotone conjunction with respect to the distribution D f .
We then show that any algorithm that makes O((n/log n)^{1/5}) queries (with a sufficiently small leading
constant) can only distinguish between a draw from the YES-distribution and a draw from the NO-distribution
with probability at most 1/4. Since a distribution-free tester must distinguish between the two with
probability at least 2/3, this implies that any distribution-free tester must make Ω((n/log n)^{1/5}) queries.
For both the YES and NO distributions, a draw from the distribution is obtained by first randomly
choosing m triples of n-bit strings (a1 , b1 , c1 ), . . . , (am , bm , cm ). The 2m strings a1 , b1 , a2 , b2 , . . . , am , bm
each have (disjoint) sets of ℓ bits set to 0 and n − ℓ bits set to 1; each string ci is the bitwise AND of ai
and bi . (See the description of distribution H in Section 3.1.1 for a more detailed description.) A draw
of (g, Dg ) from YES is constructed in such a way that g(ai ) = 0, g(bi ) = 1 and g(ci ) = 0 for each i; the
distribution Dg puts weight 2/(3m) on each bi point and weight 1/(3m) on each ci point. We define g
over the rest of the points in the boolean cube so that g is a monotone conjunction. (See Section 3.1.2
for details.)
In contrast, a draw of ( f , D f ) from NO is constructed such that f (ai ) = 1, f (bi ) = 1 but f (ci ) = 0,
and the distribution D f puts 1/(3m) weight on each ai , bi , or ci point. (See Section 3.1.3 for details.)
As noted in [19], any monotone conjunction h must satisfy h(x) ∧ h(y) = h(x ∧ y) (where x ∧ y denotes
the bitwise AND of x and y) for all x, y ∈ {0,1}^n, and so no monotone conjunction h can satisfy h(ai) =
h(bi ) = 1 but h(ci ) = 0. Thus, every monotone conjunction must differ from f on at least one of ai , bi ,
ci , for each of the m triples, and f is at least 1/3-far from every monotone conjunction.
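This closure property is mechanical to verify. The sketch below (helper names are ours, not the paper's) checks h(x) ∧ h(y) = h(x ∧ y) exhaustively for one small monotone conjunction:

```python
from itertools import product

def monotone_conj(S):
    """The monotone conjunction AND_{i in S} x_i over bit tuples."""
    return lambda x: int(all(x[i] == 1 for i in S))

def bitwise_and(x, y):
    return tuple(a & b for a, b in zip(x, y))

n = 6
h = monotone_conj([0, 2, 5])  # h = x1 AND x3 AND x6 (0-indexed bits)

# h(x) AND h(y) == h(x AND y) for every pair of points in {0,1}^6
closure_holds = all(
    h(x) & h(y) == h(bitwise_and(x, y))
    for x in product((0, 1), repeat=n)
    for y in product((0, 1), repeat=n)
)
```

The same check fails for a no-function of the construction above, which is exactly what makes it far from every monotone conjunction.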
We first show that with only random samples from the distribution, a tester cannot distinguish be-
tween a draw from the YES and NO distribution, with high probability. This is fairly easy and is argued
in Section 3.3.1. Intuitively, in the NO case the ai and bi points cannot be told apart, so in both the YES
case and the NO case 1/3 of the draws “look like” ci -type points and 2/3 of the draws “look like” the
other type(s).
The bulk of our work is to additionally show that queries “do not help” the tester. While there are
various technical complications that need to be dealt with in the formal proof, the high-level idea is that
an algorithm can only distinguish between a draw from the YES-distribution and NO-distribution if it
manages to query certain types of points called “witnesses” (see Section 3.3). We show that witnesses
are very hard to find when allowed only o((n/log n)^{1/5}) queries since, intuitively, a witness can only be
found by random guessing (see the proof of Lemma 3.11).
Organization After giving preliminaries in Section 2, in Section 3 we present our construction of
“yes” and “no” (function, distribution) pairs that are used in the lower bound for monotone conjunctions,
conjunctions and decision lists as well as the proof of the lower bound. In Section 4 we describe a variant
of the construction for linear threshold functions, and use it to prove the lower bound for linear threshold
functions.
2 Preliminaries
Throughout the paper we deal with Boolean functions over n input variables.
Definition 2.1. Let D be a probability distribution over {0,1}^n. Given Boolean functions f, g : {0,1}^n →
{0,1}, the distance between f and g with respect to D is defined by dist_D(f, g) = Pr_{x∼D}[f(x) ≠ g(x)].
If C is a class of Boolean functions over {0,1}^n, we define the distance between f and C with respect
to D to be dist_D(f, C) = min_{g∈C} dist_D(f, g).
We say that f is ε-far from C with respect to D if dist_D(f, C) ≥ ε.
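As a concrete illustration, this distance can be estimated empirically by sampling from D. The following minimal sketch (all names are our own, and the uniform distribution stands in for D) does this for two simple functions:

```python
import random

def estimate_dist(f, g, sample_from_D, trials=20000, seed=0):
    """Monte Carlo estimate of dist_D(f, g) = Pr_{x ~ D}[f(x) != g(x)]."""
    rng = random.Random(seed)
    disagreements = sum(f(x) != g(x)
                        for x in (sample_from_D(rng) for _ in range(trials)))
    return disagreements / trials

n = 4
f = lambda x: x[0] & x[1]   # the conjunction x1 AND x2 (0-indexed bits)
g = lambda x: x[0]          # the single variable x1
uniform = lambda rng: tuple(rng.randint(0, 1) for _ in range(n))

# Under the uniform distribution f and g disagree exactly when x1 = 1 and
# x2 = 0, i.e., on a 1/4 fraction of inputs, so the estimate should be near 0.25.
est = estimate_dist(f, g, uniform)
```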
Now we can define the notion of a distribution-free tester for a class of functions C:
Definition 2.2. A distribution-free tester for class C is a probabilistic oracle machine T which takes as
input a distance parameter ε > 0 and is given access to
• a black-box oracle to a fixed (but unknown and arbitrary) function h : {0, 1}n → {0, 1} (when
invoked with input x, the oracle returns the value h(x)); and
• a sampling oracle for a fixed (but unknown and arbitrary) distribution D over {0, 1}n (each time
it is invoked this oracle returns a pair (x, h(x)) where x is independently drawn from D).
T must satisfy the following two conditions: for any h : {0, 1}n → {0, 1} and any distribution D,
• if h belongs to C, then Pr[T h,D = Accept] ≥ 32 ; and
• if h is ε-far from C w.r.t. D, then Pr[T h,D = Accept] ≤ 31 .
This definition allows the tester to be adaptive and to have two-sided error; this is of course the
strongest version for proving lower bounds.
Notation and Terminology For a string x ∈ {0,1}^n we write x_i to denote the ith bit of x. For x, y ∈
{0,1}^n we write x ∧ y to denote the n-bit string z which is the bitwise AND of x and y, i.e., z_i = x_i ∧ y_i
for all i. The string x ∨ y is defined similarly to be the bitwise OR of x and y.
Recall that the total variation distance, or statistical distance, between two random variables X and
Y that take values in a finite set S is

d_TV(X, Y) = (1/2) ∑_{ζ ∈ S} |Pr[X = ζ] − Pr[Y = ζ]| .
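For variables with small finite support this quantity can be computed directly; a short sketch with a helper name of our own:

```python
def total_variation(p, q):
    """d_TV between two distributions given as {outcome: probability} dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in support)

# Two coins: fair versus 3/4-biased.
fair = {"H": 0.5, "T": 0.5}
biased = {"H": 0.75, "T": 0.25}
d = total_variation(fair, biased)  # 0.5 * (0.25 + 0.25) = 0.25
```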
The classes we consider For completeness we define here all the classes of functions that we will
consider: these are (in order of increasing generality) monotone conjunctions, general conjunctions,
decision lists, and linear threshold functions. We note that each of these function classes is quite basic
and natural and has been studied intensively in fields such as computational learning theory.
Definition 2.3. The class MCONJ consists of all monotone conjunctions of any subset of Boolean
variables from x1 , . . . , xn , i. e., all ANDs of (unnegated) Boolean variables.
Definition 2.4. The class CONJ consists of all conjunctions of any subset of Boolean literals over {0, 1}n
(a literal is a Boolean variable or the negation of a variable).
Definition 2.5. A decision list L of length k over the Boolean variables x_1, …, x_n is defined by a list of k
pairs and a bit, (ℓ_1, β_1), (ℓ_2, β_2), …, (ℓ_k, β_k), β_{k+1}, where each ℓ_i is a Boolean literal and each β_i is either
0 or 1. Given any x ∈ {0,1}^n, the value of L(x) is β_i if i is the smallest index such that ℓ_i is made true by
x; if no ℓ_i is true then L(x) = β_{k+1}. Let DL denote the class of all decision lists of arbitrary length k ≥ 0
over {0,1}^n.
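A decision list is evaluated by scanning for the first satisfied literal; the sketch below uses a hypothetical literal encoding of our own choosing:

```python
def eval_decision_list(pairs, default_bit, x):
    """Evaluate a decision list (l_1, b_1), ..., (l_k, b_k), b_{k+1} on x.

    A literal is encoded as (var_index, is_positive): it is satisfied by x
    when x[var_index] == 1 if is_positive, and when x[var_index] == 0 otherwise.
    """
    for (var, is_positive), beta in pairs:
        satisfied = (x[var] == 1) if is_positive else (x[var] == 0)
        if satisfied:
            return beta
    return default_bit

# The list (x1, 1), (not-x2, 0) with default bit 1 (0-indexed variables):
L = [((0, True), 1), ((1, False), 0)]
```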
Definition 2.6. A linear threshold function is defined by a list of n + 1 real values w1 , . . . , wn , θ . The
value of the function on input x ∈ {0, 1}n is 1 if w1 x1 + · · · + wn xn ≥ θ and is 0 if w1 x1 + · · · + wn xn < θ .
We write LTF to denote the class of all linear threshold functions over {0, 1}n .
It is well known and easy to see that MCONJ ⊊ CONJ ⊊ DL ⊊ LTF.
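The first and last of these inclusions can be made concrete: a monotone conjunction over the variables indexed by S is the linear threshold function with unit weights on S and threshold |S|, and also the decision list that outputs 0 on the first variable of S set to 0. A small sketch (encodings ours):

```python
from itertools import product

n = 5
S = (0, 2, 3)  # the monotone conjunction x1 AND x3 AND x4 (0-indexed bits)

def conj(x):
    return int(all(x[i] == 1 for i in S))

def as_ltf(x):
    # weights w_i = 1 for i in S and 0 otherwise, threshold theta = |S|
    return int(sum(x[i] for i in S) >= len(S))

def as_decision_list(x):
    # pairs (not-x_i, 0) for each i in S, default bit 1
    for i in S:
        if x[i] == 0:
            return 0
    return 1
```

Exhaustive comparison over {0,1}^5 confirms the three representations agree everywhere.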
3 The lower bound for monotone conjunctions
In this section we will prove the following theorem:
Theorem 3.1. Let q = (1/20)(n/log n)^{1/5}. There exist two distributions, YES and NO, over pairs (h, D)
where h : {0,1}^n → {0,1} is a Boolean function and D is a distribution over the domain {0,1}^n that
have the following properties:
1. For every pair (g, Dg ) in the support of YES, the function g is a monotone conjunction.
2. For every pair ( f , D f ) in the support of NO, the function f is 1/6-far from DL with respect to D f .
Moreover, for any probabilistic oracle algorithm T that, given a pair (h, D), makes at most q black-
box queries to h and samples D at most q times, we have
|Pr_{(g,Dg)∼YES}[T^{g,Dg} = Accept] − Pr_{(f,Df)∼NO}[T^{f,Df} = Accept]| ≤ 1/4 .
Note that in the above theorem each probability is taken over the draw of the (function,distribution)
pair from the appropriate distribution YES or NO, over the random draws from the distribution D f or
Dg , and over any internal randomness of algorithm T .
A simple corollary of Theorem 3.1 gives us our main result:
Corollary 3.2. Distribution-free testing of monotone conjunctions, conjunctions, and decision lists all
require Ω((n/log n)^{1/5}) queries.
Proof. Since MCONJ ⊂ CONJ ⊂ DL, properties (1) and (2) imply that any distribution-free tester for
MCONJ, CONJ or DL that is run with distance parameter ε = 1/6 must accept a random pair (g, Dg )
drawn from YES with probability at least 2/3, and must accept a random pair ( f , D f ) drawn from
NO with probability at most 1/3. Therefore, by Theorem 3.1, any such tester must make more than
q = Ω((n/log n)^{1/5}) queries.
3.1 The two distributions
We now define the two distributions, YES and NO, and prove that these distributions have properties (1)
and (2) required by Theorem 3.1. Our constructions are parameterized by three values `, m and s. As we
will see the optimal setting (up to multiplicative constants) of these parameters for our purposes is
ℓ = n^{2/5}(log n)^{3/5}, m = (n/log n)^{2/5}, s = log n. (3.1)
To keep the different roles of these parameters clear in our exposition we will present our constructions
and analyses in terms of “ℓ,” “m,” and “s” as much as possible and only plug in the values from (3.1)
toward the end of our analysis.
We first define the distribution H over tuples

(R, A_1, B_1, …, A_m, B_m, α(1), …, α(m)) ,

where A_1, B_1, …, A_m, B_m are disjoint ℓ-element subsets of [n], the set R equals ∪_i (A_i ∪ B_i), and α(i) is an
element of A_i for each i. It is useful to define this distribution since a draw from both the YES and NO
distributions begins with a draw from H.
3.1.1 The H distribution
A draw from the distribution H over (R, A_1, B_1, …, A_m, B_m, α(1), …, α(m)) tuples is obtained as follows:
• Let R ⊂ [n] be a set of size 2ℓm selected uniformly at random. Randomly partition the set R into
2m subsets A_1, B_1, …, A_m, B_m, each of size ℓ.
• For each i = 1, …, m let α(i) be an element chosen uniformly at random from the set A_i; we say
that α(i) is a representative of A_i.
Given a draw (R, A_1, B_1, …, A_m, B_m, α(1), …, α(m)) from H, we define the strings a_1, b_1, c_1, …, a_m, b_m,
c_m in the following way: Let a_i ∈ {0,1}^n be the string whose jth bit is 0 iff j ∈ A_i. The string b_i is defined
similarly. The string c_i is defined to be a_i ∧ b_i. Similarly we define the set C_i = A_i ∪ B_i. We sometimes
refer to a_i, b_i, c_i as the “points of the ith block.”
Additionally, we define the conjunction g_1 to be the conjunction of all variables in [n] \ R.
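The draw from H is straightforward to express in code. The following sketch (function and field names are ours) samples the blocks and materializes a_i, b_i, c_i as bit tuples:

```python
import random

def draw_from_H(n, m, ell, rng):
    """One draw from H: disjoint ell-subsets A_1, B_1, ..., A_m, B_m of [n],
    a representative alpha(i) in each A_i, and the block points a_i, b_i, c_i."""
    R = rng.sample(range(n), 2 * ell * m)  # random set R of size 2*ell*m, in random order
    blocks = []
    for i in range(m):
        chunk = R[2 * ell * i : 2 * ell * (i + 1)]
        A, B = set(chunk[:ell]), set(chunk[ell:])
        alpha = rng.choice(sorted(A))
        a = tuple(0 if j in A else 1 for j in range(n))
        b = tuple(0 if j in B else 1 for j in range(n))
        c = tuple(x & y for x, y in zip(a, b))
        blocks.append({"A": A, "B": B, "alpha": alpha, "a": a, "b": b, "c": c})
    return blocks

# a toy draw, far below the parameter settings of (3.1)
blocks = draw_from_H(n=40, m=3, ell=4, rng=random.Random(0))
```

Note that each a_i and b_i has exactly ℓ zeros while each c_i has 2ℓ zeros, matching the description above.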
3.1.2 The YES distribution
A draw from the distribution YES over (g, Dg ) pairs is obtained as follows:
• Make a draw from the H distribution to obtain (R, A1 , B1 , . . . , Am , Bm , α1 , . . . , αm ).
• The distribution Dg puts weight 2/(3m) on bi and puts weight 1/(3m) on ci for each i = 1, . . . , m.
• Define the function g2 as follows: g2 is a conjunction of length m formed by taking g2(x) =
x_{α(1)} ∧ ⋯ ∧ x_{α(m)}, i.e., g2 is an AND of the representatives from each of A_1, …, A_m.
• The function g is defined as: g = g1 ∧ g2 .
It is clear that for every (g, Dg) in the support of YES, the function g is a monotone conjunction that
contains exactly n − 2ℓm + m variables, so Property (1) of Theorem 3.1 indeed holds. We also note that
for each i we have g(a_i) = g(c_i) = 0 and g(b_i) = 1; see Figure 1.
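The construction of g can be made concrete with a small self-contained sketch (parameters shrunk far below those in (3.1), and all names ours) that also exercises the claimed labels:

```python
import random

def build_yes_function(n, blocks):
    """g = g1 AND g2: g1 is the AND of all variables outside R, and g2 is
    the AND of the representatives x_{alpha(i)}."""
    R = set().union(*(blk["A"] | blk["B"] for blk in blocks))
    outside_R = [j for j in range(n) if j not in R]
    reps = [blk["alpha"] for blk in blocks]
    return lambda x: int(all(x[j] for j in outside_R) and all(x[j] for j in reps))

# a tiny H-style draw with n = 12, m = 2, ell = 2
rng = random.Random(1)
R = rng.sample(range(12), 8)
blocks = []
for i in range(2):
    A, B = set(R[4 * i : 4 * i + 2]), set(R[4 * i + 2 : 4 * i + 4])
    a = tuple(0 if j in A else 1 for j in range(12))
    b = tuple(0 if j in B else 1 for j in range(12))
    blocks.append({"A": A, "B": B, "alpha": rng.choice(sorted(A)),
                   "a": a, "b": b, "c": tuple(x & y for x, y in zip(a, b))})
g = build_yes_function(12, blocks)
```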
[Figure 1 omitted: schematic of how g (left) and f (right) label c_i and the points above it.]
Figure 1: The left figure shows how a yes-function g labels c_i and the points above it (including a_i and
b_i). Bold print indicates the label that g assigns. Note that every point above b_i is labeled 1 by g, and
points above a_i are labeled according to x_{α(i)}. The right figure shows how a no-function f labels c_i and
the points above it (including a_i and b_i). Again, bold print indicates the label that f assigns. Note that
every point above b_i is labeled 1 by f, and points above a_i with fewer than s 0’s are labeled according to
x_{α(i)}. The i-special points for block i (see Section 3.2) are shaded and are labeled 1 by f.
3.1.3 The NO distribution
A draw from the distribution NO of ( f , D f ) pairs is obtained as follows:
• Make a draw from the H distribution to obtain (R, A1 , B1 , . . . , Am , Bm , α1 , . . . , αm ).
• The distribution Df is uniform over the 3m points a_1, b_1, c_1, …, a_m, b_m, c_m.
• Define the function f′ as follows: f′(x) = 0 if there exists some i ∈ [m] such that both of the following
conditions hold:
– x_{α(i)} = 0, and
– {fewer than s of the elements j ∈ A_i have x_j = 0} or {x_j = 0 for some j ∈ B_i};
otherwise f′(x) = 1.
• The final function f is defined as f = g_1 ∧ f′.
See Figure 1 for a pictorial representation of the yes-function g and the no-function f .
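Analogously, a self-contained sketch of the no-function f (again with toy parameters satisfying ℓ > s, and names of our own) lets one check the labels f(a_i) = f(b_i) = 1 and f(c_i) = 0:

```python
import random

def build_no_function(n, blocks, s):
    """f = g1 AND f': f'(x) = 0 iff some i has x_{alpha(i)} = 0 together with
    (fewer than s zeros among A_i, or some zero among B_i)."""
    R = set().union(*(blk["A"] | blk["B"] for blk in blocks))
    outside_R = [j for j in range(n) if j not in R]
    def f(x):
        if not all(x[j] for j in outside_R):  # g1 rejects
            return 0
        for blk in blocks:
            if x[blk["alpha"]] == 0:
                zeros_in_A = sum(1 for j in blk["A"] if x[j] == 0)
                if zeros_in_A < s or any(x[j] == 0 for j in blk["B"]):
                    return 0
        return 1
    return f

# tiny draw: n = 20, m = 2, ell = 4, s = 2 (so ell > s, as in the paper's regime)
rng = random.Random(2)
R = rng.sample(range(20), 16)
blocks = []
for i in range(2):
    A, B = set(R[8 * i : 8 * i + 4]), set(R[8 * i + 4 : 8 * i + 8])
    a = tuple(0 if j in A else 1 for j in range(20))
    b = tuple(0 if j in B else 1 for j in range(20))
    blocks.append({"A": A, "B": B, "alpha": rng.choice(sorted(A)),
                   "a": a, "b": b, "c": tuple(x & y for x, y in zip(a, b))})
f = build_no_function(20, blocks, s=2)
```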
With the definition of NO in place, we now establish Property (2) of Theorem 3.1:
Lemma 3.3. For any pair ( f , D f ) in the support of NO and any decision list h, the function f is at least
1/6-far from h w.r.t. D f .
Proof. Fix any (f, Df) in the support of NO and any decision list
h = (ℓ_1, β_1), (ℓ_2, β_2), …, (ℓ_k, β_k), β_{k+1} .
We will show that at least one of the six points a1 , b1 , c1 , a2 , b2 , c2 is labeled differently by h and f .
Grouping all m blocks into pairs and applying the same argument to each pair gives the lemma.
Let first(a_1) be the index of the first literal, ℓ_{first(a_1)}, in h that is satisfied by the point a_1, so the
value h(a_1) equals β_{first(a_1)}. Define first(b_1), first(c_1), first(a_2), first(b_2), and first(c_2) similarly. We will
assume that h and f agree on all six points, i.e., that β_{first(a_1)} = β_{first(b_1)} = β_{first(a_2)} = β_{first(b_2)} = 1 and
β_{first(c_1)} = β_{first(c_2)} = 0, and derive a contradiction.
We may suppose w.l.o.g. that first(a1 ) = min{first(a1 ), first(b1 ), first(a2 ), first(b2 )}. We now con-
sider two cases depending on whether or not first(c1 ) < first(a1 ). (Note that first(a1 ) cannot equal
first(c1 ) since f (a1 ) = 1 but f (c1 ) = 0.)
Suppose first that first(c_1) < first(a_1). No matter what literal ℓ_{first(c_1)} is, since c_1 satisfies ℓ_{first(c_1)}, at
least one of a_1, b_1 must satisfy it as well. But this means that min{first(a_1), first(b_1)} ≤ first(c_1), which
is impossible since first(c_1) < first(a_1) and first(a_1) ≤ min{first(a_1), first(b_1)}.
Now suppose that first(a_1) < first(c_1); then it must be the case that ℓ_{first(a_1)} is a literal “x_j” for
some j ∈ B_1. (The only other possibilities consistent with a_1 satisfying ℓ_{first(a_1)} are that it is “x̄_j” for some
j ∈ A_1 or “x_j” for some j ∈ [n] \ C_1; in either case c_1 would satisfy ℓ_{first(a_1)} as well, which would imply
that f(c_1) = 1, and this does not hold.) Since f(c_2) = 0 and (c_2)_j = 1, it must be the case that
first(c_2) < first(a_1). But no matter what literal ℓ_{first(c_2)} is, since c_2 satisfies it, at least one of a_2 and b_2
must satisfy it as well. This means that
min{first(a_2), first(b_2)} ≤ first(c_2) < first(a_1) ≤ min{first(a_2), first(b_2)} ,
which is a contradiction.
In fact, we have a slightly stronger result for the case of testing monotone conjunctions. For any
( f , D f ) drawn from NO, we have f (ai ) = f (bi ) = 1 and f (ci ) = 0 for each i = 1, . . . , m. It is noted
in [19] (and is easy to check) that any monotone conjunction h must satisfy h(x) ∧ h(y) = h(x ∧ y) for
all x, y ∈ {0, 1}n , and thus must satisfy h(ci ) = h(ai ) ∧ h(bi ). Thus any monotone conjunction h must
disagree with f on at least one of ai , bi , ci for all i, and consequently f is 1/3-far from any monotone
conjunction with respect to D f .
Thus we have established properties (1) and (2) stated at the beginning of this section.
3.2 The idea
In this subsection we attempt to explain the main ideas of the proof of Theorem 3.1. Before we give the
main ideas we first provide some preliminary intuition and useful terminology.
It is easy to see that in both the yes-case (when the (function, distribution) pair is drawn from YES)
and the no-case (when the (function, distribution) pair is drawn from NO), any black-box query that sets
any variable in [n] \ R to 0 will get a 0 response. To give some preliminary intuition for our construction,
let us explain here the role that the large conjunction g1 (over n − 2ℓm variables) plays in both the YES
and NO cases. The idea is that because of g1 , a testing algorithm that has obtained strings z1 , . . . , zq from
the distribution D will “gain nothing” by querying any string x that has any bit xi set to 0 that was set
to 1 in all of z1 , . . . , zq . This is because such a variable xi will with very high probability (over a random
choice of ( f , D f ) from NO or a random choice of (g, Dg ) from YES) be contained in [n] \ R, so in both
the “yes” and “no” cases the query will yield an answer of 0 with very high probability. Consequently
there is no point in making such a query in the first place. (We give a rigorous version of this argument
in Section 3.3.2.)
The following terminology will be useful: we say that an input x ∈ {0,1}^n is i-special if (at least
s elements j ∈ A_i have x_j = 0) and (x_j = 1 for all j ∈ B_i). Thus an equivalent way to define f′ is that
f′(x) = g2(x) unless g2(x) = 0 (because some x_{α(i)} = 0) and x is i-special for each i such that x_{α(i)} = 0;
in this case f′(x) = 1.
Here is some high-level intuition for the proof. If T could find ai , bi and ci for some i then T would
know which case it is in (yes versus no), because h(ai ) ∧ h(bi ) = h(ci ) if and only if T is in the yes-case.
Since T can only make q ≪ √m draws from D, the birthday paradox tells us that with high probability
the random sample that T draws contains at most one of a_i, b_i and c_i for each i. The c_i-type points (with
n − 2ℓ ones) are labeled negative in both the yes- and no-cases, so these “look the same” to T in both
cases. Other than the c_i-type points, in the yes-case the distribution Dg puts weight only on the b_i-type
points (with n − ℓ ones), which are positively labeled, and in the no-case the distribution Df puts weight
only on the a_i- and b_i-type points (with n − ℓ ones), which are also positively labeled. So the draws with
n − ℓ ones “look the same” to T as well in both cases. So with high probability T cannot distinguish
between yes-pairs and no-pairs on the basis of the first q random draws alone. (Corollary 3.5 formalizes
this intuition.)
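The birthday-paradox step can be sanity-checked numerically; the sketch below (a helper of our own) tracks only which block each of q uniform draws lands in and estimates the probability that some block is hit twice:

```python
import random

def block_collision_probability(m, q, trials=20000, seed=0):
    """Estimate Pr[some block receives two or more of q draws], where each
    draw lands in a uniformly random one of the m blocks."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        seen = [rng.randrange(m) for _ in range(q)]
        if len(set(seen)) < q:
            collisions += 1
    return collisions / trials

# With q far below sqrt(m), collisions are rare (roughly q^2 / (2m)):
p = block_collision_probability(m=10000, q=10)
```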
Of course, though, T can also make q queries. Perhaps T can identify a triple (ai , bi , ci ) through
these queries, or perhaps T can otherwise determine which case it is in even without finding a triple?
The crux of the proof is to show that in fact queries actually cannot help T much; we now sketch the
idea.
Consider a single fixed block i ∈ [m]. If none of ai , bi or ci are drawn in the initial sample, then by
the argument at the start of this subsection, the tester will get no useful information about which case it
is in from this block. By the birthday paradox we can assume that at most one of ai , bi and ci is drawn
in the initial sample; we consider the three cases in turn.
If bi is drawn, then again by the argument at the start of this subsection all query points will have all
the bits in Ai set to 1; such queries will “look the same” in both the yes- and no- cases as far as the ith
block is concerned.
If ai is drawn (so we are in the no-case), then by the same argument all query points will have all the
bits in Bi set to 1. Using the definition of f′, as far as the ith block is concerned with high probability it
will “look like” the initial ai point was a bi -point from the yes-case. This is because the only way the
tester can tell that it is in the no-case is if it manages to query a point which has fewer than s bits from
Ai set to 0, but the representative α(i) is one of those bits. Such points are hard to find since α(i) is
randomly selected from Ai . (See the “a-witness” case in the proof of Lemma 3.11.)
Finally, suppose that ci is drawn. The only way a tester can distinguish between the yes- and no-
cases is by finding an i-special point (or determining that no such point exists), but to find such a point
it must make a query with at least s 0’s in Ci , all of which lie in Ai . This is hard to do since the tester
does not know how the elements of Ci are divided into the sets Ai and Bi . (See the “c-witness” case in
the proof of Lemma 3.11.)
3.3 Proof of Theorem 3.1
Fix any probabilistic oracle algorithm T that makes at most q black-box queries to h and samples D
at most q times. Without loss of generality we may assume that T first makes exactly q draws from
distribution D, and then makes exactly q (adaptive) queries to the black-box oracle for h.
It will be convenient for us to assume that algorithm T is actually given “extra information” on
certain draws from the distribution D. More precisely, we suppose that each time T calls the oracle for
D,
• If a “ci-type” labeled example (ci, h(ci)) is generated by the oracle, algorithm T receives the triple
(ci, h(ci), α(i)) (recall that α(i) is the index of the variable from Ci that belongs to the conjunction
g2);
• If a “non-ci-type” labeled example (x, h(x)) is generated by the oracle, where x ≠ ci for all i =
1, . . . , m, algorithm T receives the triple (x, h(x), 0). (Thus there is no “extra information” given
on non-ci points.)
It is clear that proving Theorem 3.1 for an arbitrary algorithm T that receives this extra information
establishes the original theorem as stated (for algorithms that do not receive the extra information).
Following [15], we now define a knowledge sequence to precisely capture the notion of “what an
algorithm learns from its queries.” A knowledge sequence is a sequence of elements corresponding to
the interactions that an algorithm has with each of the two oracles. The first q elements of a knowledge
sequence are triples as described above; each corresponds to an independent draw from the distribution
D. The remaining elements of the knowledge sequence are input-output pairs corresponding to the
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 191–218 201
DANA G LASNER AND ROCCO A. S ERVEDIO
algorithm’s calls to the black-box oracle for h. (Recall that these later oracle calls are adaptive, i. e., each
query point can depend on the answers received from previous oracle calls.)
Notation. For any oracle algorithm ALG, let $P^{ALG}_{yes}$ denote the distribution over knowledge sequences
induced by running ALG on a pair (g, Dg) randomly drawn from YES. Similarly, let $P^{ALG}_{no}$ denote the
distribution over knowledge sequences induced by running ALG on a pair (f, Df) randomly drawn from
NO. Where there is no risk of confusion we will sometimes write $P^{ALG}_{yes}$ to denote a random variable
drawn from this distribution, and similarly for $P^{ALG}_{no}$.
For $0 \le i \le q$ we write $P^{ALG}_{yes,i}$ to denote the length-$(q+i)$ prefix of a knowledge sequence drawn from
$P^{ALG}_{yes}$, and similarly for $P^{ALG}_{no,i}$.
We will prove Theorem 3.1 by showing that the statistical distance $d_{TV}(P^{T}_{yes}, P^{T}_{no})$ between the distribu-
tions $P^{T}_{yes}$ and $P^{T}_{no}$ is at most 1/4.
3.3.1 Most sequences of draws are “clean” in both the yes- and no- cases
The main result of this subsection is Corollary 3.5; intuitively, this corollary shows that given only q
draws from the distribution and no black-box queries, it is impossible to distinguish between the yes-
and no- cases with high accuracy. This is achieved via the notion of a “clean” sequence of draws from
the distribution, which we now explain.
Let S = x1, . . . , xq be a sequence of q examples drawn from distribution D, where D is either Df
for some (f, Df) ∼ NO or Dg for some (g, Dg) ∼ YES. In either case there is a corresponding set of
points a1, b1, c1, . . . , am, bm, cm as described in Section 3.1, corresponding to the initial draw from the
H distribution. We say that S is clean if S does not hit any block 1, . . . , m twice, i. e., if the number of
different blocks i = 1, . . . , m for which S contains some point ai, bi or ci is exactly q.
We note that strictly speaking, cleanliness is not a property of a sequence S drawn from Df or Dg,
but rather a property of the way the sequence S is generated (since it depends on the random choice of
the ai, bi, ci, i. e., on the draw from H). With this understood, for conciseness we shall speak of cleanliness
simply as a property of a sequence S of draws from Df or Dg. We further note that while a draw from
$P^{T}_{no,0}$ is, strictly speaking, a sequence of q triples (as described at the beginning of Section 3.3), without
risk of confusion we will simply write “$S \sim P^{T}_{no,0}$” to indicate the corresponding sequence of draws from
Df (and likewise we write “$S \sim P^{T}_{yes,0}$” to indicate the corresponding sequence of draws from Dg).
With this notion and these conventions in place, we have the following claim and its easy corollary:

Claim 3.4. We have
$\Pr_{S \sim P^{T}_{yes,0}}[S \text{ is clean}] = \Pr_{S \sim P^{T}_{no,0}}[S \text{ is clean}] \ge 1 - q^2/m .$
Furthermore, the conditional random variables
$(S \mid S \text{ is clean})_{S \sim P^{T}_{yes,0}}$ and $(S \mid S \text{ is clean})_{S \sim P^{T}_{no,0}}$
are identically distributed.
Proof. We first show that
$\Pr_{S \sim P^{T}_{yes,0}}[S \text{ is clean}] = \Pr_{S \sim P^{T}_{no,0}}[S \text{ is clean}] \ge 1 - q^2/m .$
Actually, we will prove something stronger, which is that regardless of which (g, Dg) is drawn from
YES, the conditional probability of getting a clean sequence is at least $1 - q^2/m$. Fix any (g, Dg) in
the support of YES, and consider the outcomes of $S \sim P^{T}_{yes,0}$ corresponding to this (g, Dg) being drawn
from YES. Since each independent draw from Dg hits each block 1, . . . , m with probability 1/m, the
probability that $S \sim P^{T}_{yes,0}$ is clean is
$\prod_{i=1}^{q}\left(1 - \frac{i-1}{m}\right) \ge 1 - q^2/m .$
The same argument shows that $\Pr_{S \sim P^{T}_{no,0}}[S \text{ is clean}]$ also equals $\prod_{i=1}^{q}\left(1 - \frac{i-1}{m}\right)$.
Now we show that the conditional random variables are identically distributed. It is not difficult to
see that for any $0 \le j \le q-1$, given any particular length-j prefix of a draw from $P^{T}_{yes,0}$, conditioned on
the draw from $P^{T}_{yes,0}$ being clean, the (j + 1)-st element of the draw from $P^{T}_{yes,0}$ has
• a 2/3 chance of being a triple (x, 1, 0) where x ∈ {0, 1}^n has ℓ zeros and the locations of the ℓ
zeros are selected uniformly at random (without replacement) from the set of those bit positions
that had value 1 in all j of the previous draws;
• a 1/3 chance of being a triple (x, 0, α) where x has 2ℓ zeros, the locations of the 2ℓ zeros are
selected uniformly at random (without replacement) from the same set of bit positions described
above, and α is an index drawn uniformly at random from the indices of the 2ℓ zeros in x.
It is also not difficult to see that given any particular length-j prefix of a draw from $P^{T}_{no,0}$, conditioned
on the draw from $P^{T}_{no,0}$ being clean, the (j + 1)-st element of the draw from $P^{T}_{no,0}$ is distributed in the
exact same way.
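The product bound in the proof above is elementary; a short Python sketch (with illustrative parameters of our own choosing, not the settings of (3.1)) confirms that $\prod_{i=1}^{q}(1 - (i-1)/m) \ge 1 - q^2/m$:

```python
from fractions import Fraction

def clean_prob(q, m):
    """Probability that q independent draws hit q distinct blocks out of m,
    when each draw hits each block with probability 1/m."""
    p = Fraction(1)
    for i in range(1, q + 1):
        p *= 1 - Fraction(i - 1, m)
    return p

# The union-bound estimate used in Claim 3.4: the product is at least 1 - q^2/m.
for q, m in [(3, 10), (5, 100), (10, 1000), (20, 500)]:
    assert clean_prob(q, m) >= 1 - Fraction(q * q, m)
```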
Corollary 3.5. The statistical distance $d_{TV}(P^{T}_{yes,0}, P^{T}_{no,0})$ is at most $q^2/m$.

Proof. We can express the statistical distance between $P^{T}_{yes,0}$ and $P^{T}_{no,0}$ as
$\frac{1}{2} \sum_{\zeta} \Big| \Pr_{S \sim P^{T}_{yes,0}}[(S = \zeta)\ \&\ (S \text{ is clean})] + \Pr_{S \sim P^{T}_{yes,0}}[(S = \zeta)\ \&\ (S \text{ not clean})] - \Pr_{S \sim P^{T}_{no,0}}[(S = \zeta)\ \&\ (S \text{ is clean})] - \Pr_{S \sim P^{T}_{no,0}}[(S = \zeta)\ \&\ (S \text{ not clean})] \Big| .$
By parts (1) and (2) of the claim, we have that
$\Pr_{S \sim P^{T}_{yes,0}}[(S = \zeta)\ \&\ (S \text{ is clean})] = \Pr_{S \sim P^{T}_{no,0}}[(S = \zeta)\ \&\ (S \text{ is clean})]$
for all ζ. Thus we can re-express the statistical distance as
$\frac{1}{2} \sum_{\zeta} \Big| \Pr_{S \sim P^{T}_{yes,0}}[(S = \zeta)\ \&\ (S \text{ not clean})] - \Pr_{S \sim P^{T}_{no,0}}[(S = \zeta)\ \&\ (S \text{ not clean})] \Big| .$
This is at most
$\frac{1}{2} \Big( \Pr_{S \sim P^{T}_{yes,0}}[S \text{ not clean}] + \Pr_{S \sim P^{T}_{no,0}}[S \text{ not clean}] \Big) ,$
which is at most $q^2/m$ by part (1) of the claim.
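The inequality driving this proof — two distributions that agree on every “clean” outcome have variation distance at most half the total “not clean” mass — can be sanity-checked on a toy pair of distributions (an illustration of ours, not part of the paper's argument):

```python
# Toy numerical check of the bound used in Corollary 3.5: if P and Q
# assign identical probability to every outcome in a "clean" set, then
# d_TV(P, Q) <= (P[not clean] + Q[not clean]) / 2.
P = {"c1": 0.3, "c2": 0.4, "bad_p": 0.3}
Q = {"c1": 0.3, "c2": 0.4, "bad_q1": 0.2, "bad_q2": 0.1}
clean = {"c1", "c2"}

outcomes = set(P) | set(Q)
d_tv = 0.5 * sum(abs(P.get(z, 0.0) - Q.get(z, 0.0)) for z in outcomes)
bound = 0.5 * (sum(p for z, p in P.items() if z not in clean)
               + sum(p for z, p in Q.items() if z not in clean))
assert d_tv <= bound + 1e-9
```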
3.3.2 Eliminating foolhardy queries
Let $T^1$ denote a modified version of algorithm T which works as follows: like T, it starts out by making
q draws from the distribution. Let Q be the set of all indices i such that all q draws from the distribution
have the ith bit set to 1. We note that for both the yes-distribution and the no-distribution, we have that
[n] \ R is contained in Q. Moreover, if i ∈ [m] is such that none of {ai, bi, ci} occur among the q draws,
then Ci is also contained in Q.
We say that any query string x ∈ {0, 1}^n that has $x_j = 0$ for some j ∈ Q is foolhardy. After making
its q draws from D, algorithm $T^1$ simulates algorithm T for q black-box queries, except that for any
foolhardy query that T makes, $T^1$ “fakes” the query in the following sense: it does not actually make
the query but instead proceeds as T would proceed if it made the query and received the response 0.
Our goal in this subsection is to show that in both the yes- and no- cases, the executions of T and $T^1$
are statistically close. (Intuitively, this means that we can w.l.o.g. assume that the testing algorithm T
does not make any foolhardy queries.) To analyze algorithm $T^1$ it will be useful to consider some other
algorithms that are intermediate between T and $T^1$, which we now describe.
For each value $1 \le k \le q$, let $U_k$ denote the algorithm which works as follows: $U_k$ first makes q draws
from the distribution D, then simulates algorithm T for k queries, except that for each of the first k − 1
queries that T makes, if the query is foolhardy then $U_k$ “fakes” the query as described above. Let $U_k'$
denote the algorithm which works exactly like $U_k$, except that if the kth query made by $U_k$ is foolhardy
then $U_k'$ fakes that query as well. We have the following:
Lemma 3.6. For all k ∈ [q], the statistical distance
$d_{TV}\big( (P^{U_k}_{yes} \mid P^{U_k}_{yes,0} \text{ is clean}),\ (P^{U_k'}_{yes} \mid P^{U_k'}_{yes,0} \text{ is clean}) \big)$
is at most $2\ell m/n$, and similarly
$d_{TV}\big( (P^{U_k}_{no} \mid P^{U_k}_{no,0} \text{ is clean}),\ (P^{U_k'}_{no} \mid P^{U_k'}_{no,0} \text{ is clean}) \big)$
is also at most $2\ell m/n$.
Proof. We consider the yes-case; the no-case follows by an essentially identical argument.
The executions of $U_k$ and $U_k'$ are identically distributed unless the kth query string (which we denote
z) is foolhardy and the black-box function g has g(z) = 1. Consequently the variation distance
$d_{TV}\big( (P^{U_k}_{yes} \mid P^{U_k}_{yes,0} \text{ is clean}),\ (P^{U_k'}_{yes} \mid P^{U_k'}_{yes,0} \text{ is clean}) \big)$
is at most
$\Pr[(z \text{ is foolhardy})\ \&\ (g(z) = 1)] \le \Pr[g(z) = 1 \mid z \text{ is foolhardy}] ,$
where the probabilities are taken over a random draw of (g, Dg) from YES conditioned on (g, Dg) being
consistent with the q draws from the distribution and with the first k − 1 queries, and with the q draws
from the distribution being clean.
Since the first k − 1 queries do not involve any variables in Q (because foolhardy queries are faked
for the first k − 1 queries) and our analysis will only concern variables in Q, to analyze this conditional
probability it is enough to consider (g, Dg) drawn from YES conditioned on (g, Dg) being consistent
with the q draws from the distribution and with these q draws being clean. Suppose that these draws
from the distribution yield r distinct ci-type points (each with 2ℓ zeros) and q − r distinct bi-type points
(each with ℓ zeros). Then after these draws, the algorithm “knows” (r + q)ℓ elements of R. Let Z denote
the set of these (r + q)ℓ elements of R. The set R also contains 2ℓm − (r + q)ℓ other “unknown” variables
from among the |Q| = n − (r + q)ℓ variables in [n] \ Z.
We would like to find the probability, over random (g, Dg) drawn from YES consistent with the
draws from the distribution, that g(z) = 1 given that z is foolhardy. Since z is foolhardy there must be
at least one index j ∈ [n] \ Z such that $z_j = 0$. So the desired probability is at most the probability that
j belongs to R, since if j ∉ R the conjunction g1 will evaluate to 0 on z. For a random (g, Dg) that
is consistent with the draws from the distribution, the remaining 2ℓm − (r + q)ℓ elements of R \ Z are
chosen randomly from the n − (r + q)ℓ elements of [n] \ Z. Consequently the probability that j belongs
to R \ Z is
$\frac{2\ell m - (r+q)\ell}{n - (r+q)\ell} \le \frac{2\ell m}{n} ,$
and the lemma is proved.
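The final ratio inequality holds whenever $2\ell m \le n$ (which the parameter settings (3.1) guarantee, since $\ell m = n^{4/5}(\log n)^{1/5} \ll n$). A quick exact-arithmetic check with toy parameters of our own choosing:

```python
from fractions import Fraction

def check(n, l, m, q):
    # The "known" set Z has (r+q)*l coordinates of R revealed by the clean
    # draws; the inequality (2lm - |Z|)/(n - |Z|) <= 2lm/n needs 2lm <= n.
    assert 2 * l * m <= n
    for r in range(q + 1):
        known = (r + q) * l
        lhs = Fraction(2 * l * m - known, n - known)
        assert lhs <= Fraction(2 * l * m, n)

check(n=10_000, l=10, m=20, q=5)
check(n=1_000_000, l=100, m=50, q=12)
```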
Now a hybrid argument using Lemma 3.6 lets us bound the statistical distance between the execu-
tions of T and $T^1$.

Lemma 3.7. The statistical distance $d_{TV}(P^{T}_{yes}, P^{T^1}_{yes})$ is at most $2\ell mq/n + q^2/m$, and the same bound
holds for $d_{TV}(P^{T}_{no}, P^{T^1}_{no})$.
Proof. We prove the yes-case; the no-case follows by an identical argument.
By Claim 3.4, at the cost of $q^2/m$ in $d_{TV}(P^{T}_{yes}, P^{T^1}_{yes})$ we may assume that the draws from the distribu-
tion are clean. So we henceforth in the proof always condition on the draws from the distribution being
clean, and we will bound $d_{TV}(P^{T}_{yes}, P^{T^1}_{yes})$ by $2\ell mq/n$ under this conditioning on each argument to $d_{TV}$.
We use induction on i to show that $d_{TV}(P^{T}_{yes,i}, P^{T^1}_{yes,i})$ is at most $2\ell mi/n$ for all i. Once we have this,
taking i = q and recalling that $P^{T}_{yes,q} = P^{T}_{yes}$ and $P^{T^1}_{yes,q} = P^{T^1}_{yes}$ gives the desired bound.
The base case i = 0 is clear since in this case no black-box queries are made by either T or $T^1$.
For the induction step we assume that $d_{TV}(P^{T}_{yes,i}, P^{T^1}_{yes,i}) \le 2\ell mi/n$, and we will show that
$d_{TV}(P^{T}_{yes,i+1}, P^{T^1}_{yes,i+1}) \le \frac{2\ell m(i+1)}{n} .$
We first note that the random variables $P^{T^1}_{yes,i+1}$ and $P^{U_{i+1}'}_{yes}$ are identically distributed, i. e., they have
statistical distance zero. Lemma 3.6 now implies that $d_{TV}(P^{T^1}_{yes,i+1}, P^{U_{i+1}}_{yes})$ is at most $2\ell m/n$. Since
$d_{TV}(P^{T}_{yes,i+1}, P^{T^1}_{yes,i+1}) \le d_{TV}(P^{T}_{yes,i+1}, P^{U_{i+1}}_{yes}) + d_{TV}(P^{U_{i+1}}_{yes}, P^{T^1}_{yes,i+1}) ,$
it is enough to bound $d_{TV}(P^{U_{i+1}}_{yes}, P^{T}_{yes,i+1})$ by $2\ell mi/n$. But since the first q + i elements of the sequence
$P^{U_{i+1}}_{yes}$ are distributed according to $P^{T^1}_{yes,i}$ (and the last element is obtained by performing the (i + 1)-st query
of T), we have that $d_{TV}(P^{U_{i+1}}_{yes}, P^{T}_{yes,i+1}) = d_{TV}(P^{T^1}_{yes,i}, P^{T}_{yes,i})$, and by the induction hypothesis this is at
most $2\ell mi/n$. This concludes the proof.
3.3.3 Bounding the probability of finding a witness
Let $T^2$ denote an algorithm that is a variant of $T^1$, modified as follows. $T^2$ simulates $T^1$ except that
$T^2$ does not actually make queries on non-foolhardy strings. Instead $T^2$ simulates the answers to those
queries “in the obvious way” that they should be answered if the target function were a yes-function and
hence all of the draws from D that yielded strings with ℓ zeros were in fact bi-type points. More precisely,
assume that there are r distinct ci-type points in the initial sequence of q draws from the distribution.
Since for each ci-type point the algorithm is given α(i), the algorithm “knows” r variables $x_{\alpha(i)}$ that are
in the conjunction. The algorithm $T^2$ simulates an answer to a non-foolhardy query x ∈ {0, 1}^n with 0
if any of the r variables $x_{\alpha(i)}$ are set to 0 in x, and with 1 otherwise. Note that consequently $T^2$ does not
actually make any black-box queries at all.
In this subsection we will show that in both the yes- and no- cases, the executions of $T^1$ and $T^2$
are statistically close; once we have this it is not difficult to complete the proof of Theorem 3.1. In the
yes-case these distributions are in fact identical (Lemma 3.8), but in the no-case these distributions are
not identical; we will argue that they are close using properties of the function f′ from Section 3.1.3.
We first address the easier yes-case:

Lemma 3.8. The statistical distance $d_{TV}(P^{T^1}_{yes}, P^{T^2}_{yes})$ is zero.
Proof. We argue that for every query, the two algorithms $T^1$ and $T^2$ receive (or simulate) the same answer.
Fix any $1 \le i \le q$ and let z denote the ith query made by T.
If z is a foolhardy query then both $T^1$ and $T^2$ simulate an answer of 0. So suppose that z is not a
foolhardy query. Then any 0’s that z contains must be in positions coming from points that were sampled in
the first stage. Consequently the only variables set to 0 in z that can belong to the conjunction g are the
$x_{\alpha(i)}$ variables from the Ci sets corresponding to the ci points among the draws. All the other variables that
were assigned value 0 in some string drawn from the distribution are not in the conjunction,
so setting them to 0 or 1 will not affect the value of g(z). Therefore, the value g(z) that $T^1$ receives in
response to the query string z is 0 if any of the $x_{\alpha(i)}$ variables are set to 0 in z, and is 1 otherwise. This is
exactly how $T^2$ simulates the answer to a non-foolhardy query string z as well.
To handle the no-case, we introduce the notion of a “witness” that the black-box function is a no-function.

Definition 3.9. We say that a knowledge sequence contains a witness for (f, Df) if elements q + 1, q +
2, . . . of the sequence (i. e., the black-box queries) contain either of the following:
1. A point z ∈ {0, 1}^n such that for some $1 \le i \le m$ for which ai was sampled in the first q draws, the
bit $z_{\alpha(i)}$ is 0 and fewer than s of the elements j ∈ Ai have $z_j = 0$. We refer to such a point as an
a-witness for block i.
2. A point z ∈ {0, 1}^n such that for some $1 \le i \le m$ for which ci was sampled in the first q draws, z
is i-special. We refer to such a point as a c-witness for block i.
The following lemma implies that it is enough to bound the probability that $P^{T^1}_{no}$ contains a witness:

Lemma 3.10. The statistical distance
$d_{TV}\big( (P^{T^1}_{no} \mid P^{T^1}_{no} \text{ does not contain a witness and } P^{T^1}_{no,0} \text{ is clean}),\ (P^{T^2}_{no} \mid P^{T^2}_{no} \text{ does not contain a witness and } P^{T^2}_{no,0} \text{ is clean}) \big)$
is zero.
Proof. Claim 3.4 implies that $(P^{T^1}_{no,0} \mid P^{T^1}_{no,0} \text{ is clean})$ and $(P^{T^2}_{no,0} \mid P^{T^2}_{no,0} \text{ is clean})$ are identically dis-
tributed. We show that if there is no witness, then for every query the algorithms $T^1$ and $T^2$ receive (or
simulate) the same answer as each other; this gives the lemma. Fix any $1 \le i \le q$ and let z denote the ith
query.
If z is a foolhardy query, then both $T^1$ and $T^2$ simulate an answer to z with 0. So suppose that z is
not a foolhardy query and not a witness. Then any 0’s that z contains must be in positions from points
that were sampled in the first stage.
First suppose that one of the $x_{\alpha(i)}$ variables from some ci that was sampled is set to 0 in z. Since z is
not a witness, either z has fewer than s zeros from Ai or some variable from Bi is set to 0 in z. So in this
case we have f(z) = g2(z) = 0.
Now suppose that none of the $x_{\alpha(i)}$ variables from the ci’s that were sampled are set to 0 in z. If no
variable $x_{\alpha(i)}$ from any ai that was sampled is set to 0 in z, then clearly f(z) = g(z) = 1. If any variable
$x_{\alpha(i)}$ from an ai that was sampled is set to 0 in z, then since z is not a witness there must be at least s
elements of Ai set to 0 and every element of Bi set to 1 for each such $x_{\alpha(i)}$. Therefore, f(z) = 1.
Thus the value f(z) that $T^1$ receives on query z is 0 if any of the $x_{\alpha(i)}$ variables from the ci’s that
were sampled is set to 0 in z, and the value f(z) is 1 otherwise. This is exactly how $T^2$ simulates its
answer to the query z as well.
Let us consider a sequence of algorithms that hybridize between $T^1$ and $T^2$, similar to the previous
section. For each value $1 \le k \le q$, let $V_k$ denote the algorithm which works as follows: $V_k$ first makes
q draws from the distribution D, then simulates algorithm $T^1$ for k queries, except that each of the first
k − 1 queries is faked (foolhardy queries are faked as described in the previous subsection, and non-
foolhardy queries are faked as described at the start of this subsection). Thus algorithm $V_k$ actually
makes at most one query to the black-box oracle, the kth one (if this is a foolhardy query then this one
is faked as well). Let $V_k'$ denote the algorithm which works exactly like $V_k$, except that if the kth query
made by $V_k$ is non-foolhardy then $V_k'$ fakes that query as well, as described at the start of this subsection.
Lemma 3.11. For each value $1 \le k \le q$, the statistical distance
$d_{TV}\big( (P^{V_k}_{no} \mid P^{V_k}_{no,0} \text{ is clean}),\ (P^{V_k'}_{no} \mid P^{V_k'}_{no,0} \text{ is clean}) \big)$
is at most $\max\{qs/\ell,\ q/2^s\}$. (This equals $qs/\ell$ provided that $s 2^s \ge \ell$, as is the case in our parameter settings
given by (3.1).)
Proof. By Lemma 3.10, the executions of $V_k$ and $V_k'$ are identically distributed unless the kth query string
(which we denote z) is a witness for (f, Df). Since neither $V_k$ nor $V_k'$ makes any black-box query prior to
z, the variation distance between $P^{V_k}_{no}$ and $P^{V_k'}_{no}$ is at most Pr[z is a witness], where the probability is taken
over a random draw of (f, Df) from NO conditioned on (f, Df) being consistent with the q draws from
the distribution and with those first q draws being clean. We bound the probability that z is a witness by
considering both possibilities for z (an a-witness or a c-witness) in turn.
• We first bound the probability that z is an a-witness. So fix some i ∈ [m] and let us suppose that
ai was sampled in the first stage of the algorithm. We will bound the probability that z is an a-
witness for block i; once we have done this, a union bound over the (at most q) blocks such that
ai is sampled in the first stage gives a bound on the overall probability that z is an a-witness.
Fix any possible outcome for z. In order for z to be an a-witness for block i, it must be the
case that fewer than s of the ℓ elements in Ai are set to 0 in z, but the bit $z_{\alpha(i)}$ is set to 0. For a
random choice of (f, Df) as described above, since we are conditioning on the q draws from the
distribution being clean, the only information that these q draws reveal about the index α(i) is that
it is some member of the set Ai. Consequently for a random (f, Df) as described above, each bit
in Ai is equally likely to be chosen as α(i), so the probability that α(i) is chosen to be one of the
at most s bits in Ai that are set to 0 in z is at most s/ℓ. Consequently the probability that z is an
a-witness for block i is at most s/ℓ, and a union bound gives that the overall probability that z is
an a-witness is at most qs/ℓ.
• Now we bound the probability that z is a c-witness. Fix some i ∈ [m] and let us suppose that ci
was sampled in the first stage of the algorithm. We will bound the probability that z is a c-witness
for block i and then use a union bound as above.
Fix any possible outcome for z; let r denote the number of 0’s that z has in the bit positions in Ci.
In order for z to be a c-witness for block i it must be the case that z is i-special, i. e., r ≥ s and all r
of these 0’s in fact belong to Ai. For a random choice of (f, Df) conditioned on being consistent
with the q samples from the distribution and with those q samples being clean, the distribution
over possible choices of Ai is uniform over all $\binom{2\ell}{\ell}$ possibilities for selecting a size-ℓ subset of Ci.
Consequently the probability that all r 0’s belong to Ai is at most
$\frac{\binom{2\ell-r}{\ell-r}}{\binom{2\ell}{\ell}} = \frac{\ell(\ell-1)\cdots(\ell-r+1)}{2\ell(2\ell-1)\cdots(2\ell-r+1)} < \frac{1}{2^r} \le \frac{1}{2^s} .$
So the probability that z is a c-witness for block i is at most $1/2^s$, and by a union bound the overall
probability that z is a c-witness is at most $q/2^s$.
So the overall probability that z is a witness is at most $\max\{qs/\ell,\ q/2^s\}$. Using (3.1) we have that the
maximum is $qs/\ell$, and the lemma is proved.
Now, similarly to Section 3.3.2, a hybrid argument using Lemma 3.11 lets us bound the statistical
distance between the executions of $T^1$ and $T^2$. The proof of the following lemma is entirely similar to
that of Lemma 3.7, so we omit it.

Lemma 3.12. The statistical distance $d_{TV}(P^{T^1}_{no}, P^{T^2}_{no})$ is at most $q^2 s/\ell + q^2/m$.
3.3.4 Putting the pieces together
At this stage, we have that $T^2$ is an algorithm that only makes draws from the distribution and makes
no queries. It follows that the statistical distance $d_{TV}(P^{T^2}_{yes}, P^{T^2}_{no})$ is at most $d_{TV}(P^{T}_{yes,0}, P^{T}_{no,0})$. So we can
bound $d_{TV}(P^{T}_{yes}, P^{T}_{no})$ as follows (we write d in place of $d_{TV}$ for brevity):
$d_{TV}(P^{T}_{yes}, P^{T}_{no}) \le d(P^{T}_{yes}, P^{T^1}_{yes}) + d(P^{T^1}_{yes}, P^{T^2}_{yes}) + d(P^{T^2}_{yes}, P^{T^2}_{no}) + d(P^{T^2}_{no}, P^{T^1}_{no}) + d(P^{T^1}_{no}, P^{T}_{no})$
$\le d(P^{T}_{yes}, P^{T^1}_{yes}) + d(P^{T^1}_{yes}, P^{T^2}_{yes}) + d(P^{T}_{yes,0}, P^{T}_{no,0}) + d(P^{T^2}_{no}, P^{T^1}_{no}) + d(P^{T^1}_{no}, P^{T}_{no})$
$\le 4q^2/m + 4q\ell m/n + q^2 s/\ell ,$
where the final bound follows by combining Corollary 3.5, Lemma 3.7, Lemma 3.8 and Lemma 3.12.
Recalling the parameter settings $\ell = n^{2/5}(\log n)^{3/5}$, $m = (n/\log n)^{2/5}$, and $s = \log n$ from (3.1) and the
fact that $q = (1/20)(n/\log n)^{1/5}$, this bound is less than 1/4. This concludes the proof of Theorem 3.1.
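With these settings the powers of n and log n cancel exactly: the three terms are 1/100, 1/5 and 1/400, so the bound is 0.2125 < 1/4 for every n. A short numerical check of this arithmetic (the base of the logarithm does not matter, since it cancels in each term):

```python
from math import log

def final_bound(n):
    # Parameter settings (3.1) and the value of q from Theorem 3.1.
    l = n ** 0.4 * log(n) ** 0.6
    m = (n / log(n)) ** 0.4
    s = log(n)
    q = (n / log(n)) ** 0.2 / 20
    return 4 * q * q / m + 4 * q * l * m / n + q * q * s / l

# 4q^2/m = 1/100, 4q*l*m/n = 1/5, q^2*s/l = 1/400, summing to 0.2125.
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    b = final_bound(n)
    assert abs(b - 0.2125) < 1e-9
    assert b < 0.25
```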
4 A lower bound for linear threshold functions
The basic approach is similar to that of Section 3, and indeed several ingredients from the earlier proof
can be directly reused; we focus our discussion on the points where the approaches differ. We shall
prove the following:
Theorem 4.1. Let $q = (1/20)(n/\log n)^{1/5}$. There exist two distributions, YES and NO, over pairs (h, D),
where h : {0, 1}^n → {0, 1} is a Boolean function and D is a distribution over the domain {0, 1}^n, that
have the following properties:
1. For every pair (g, Dg) in the support of YES, the function g is a linear threshold function.
2. For every pair (f, Df) in the support of NO, the function f is 1/4-far from LTF with respect to
Df.
Moreover, for any probabilistic oracle algorithm T that, given a pair (h, D), makes at most q black-
box queries to h and samples D at most q times, we have
$\Big| \Pr_{(g,D_g)\sim YES}[T^{g,D_g} = \mathrm{Accept}] - \Pr_{(f,D_f)\sim NO}[T^{f,D_f} = \mathrm{Accept}] \Big| \le \frac{1}{4} .$
Since any distribution-free tester for LTF that is run with distance parameter ε = 1/4 must accept
a random pair (g, Dg ) drawn from YES with probability at least 2/3, and must accept a random pair
( f , D f ) drawn from NO with probability at most 1/3, we have the following corollary of Theorem 4.1,
which gives us our main result:
Corollary 4.2. Distribution-free testing of linear threshold functions requires $\Omega\big((n/\log n)^{1/5}\big)$ queries.
4.1 The two distributions for linear threshold functions
The construction from Section 3 is not suited for a lower bound on LTF (observe that for any (f, Df) in
the support of NO the function f is 0-far from the linear threshold function $x_1 + \cdots + x_n \ge n - 3\ell/2$
with respect to Df), so we need a different approach.
In the rest of this section we define two distributions YES and NO over pairs (h, D) and prove that
these distributions have properties (1) and (2) described in Theorem 4.1.
In Section 4.2 we use these distributions to prove a lower bound for LTF.
Before giving the precise construction, here is a very rough first intuition for how it works. Recall
that in the earlier construction, we relied on the fact that no monotone conjunction h can satisfy h(1, 0) =
h(0, 1) = 1 but h(0, 0) = 0. For linear threshold functions, we will instead rely on the fact that no linear
threshold function can satisfy h(0, 0) = h(1, 1) = 0 but h(1, 0) = h(0, 1) = 1.
4.1.1 The YES distribution
As in Section 3 our constructions are parameterized by values ℓ, m and s that are set according to
Equation (3.1) in Section 3. A draw from the distribution YES over (g, Dg) pairs is obtained as follows:
• As before, make a draw from the H distribution to obtain (R, A1, B1, . . . , Am, Bm, α(1), . . . , α(m)).
• The distribution Dg puts 1/4 weight on the point 1^n, and puts weight 1/(2m) on bi and 1/(4m) on
ci for all i = 1, . . . , m.
• The function g is defined as follows:
g(x) = 1 if and only if all three of the following conditions hold:
1. $x_j = 1$ for all j ∈ ([n] \ R);
2. $x_j = 1$ for all j = α(1), . . . , α(m); and
3. $\sum_{i=1}^{m} \sum_{k \in C_i,\, k \ne \alpha(i)} x_k \le m(2\ell - 1) - s$.
It is straightforward to show that an equivalent way to define g is as follows: g(x) equals 1 if
u(x) ≥ θ and equals 0 if u(x) < θ, where
$u(x) = 10n^2 \sum_{j \in [n] \setminus R} x_j + 5n \sum_{i=1}^{m} x_{\alpha(i)} - \sum_{i=1}^{m} \sum_{k \in C_i,\, k \ne \alpha(i)} x_k ,$ (4.1)
$\theta = 10n^2 (n - 2\ell m) + 5nm - m(2\ell - 1) + s .$ (4.2)
Fix any (g, Dg) in the support of YES. It is clear that g is a linear threshold function. It is straight-
forward to check that
$u(1^n) = 10n^2 (n - 2\ell m) + 5nm - m(2\ell - 1) ,$
$u(c_i) = 10n^2 (n - 2\ell m) + 5n(m - 1) - (m - 1)(2\ell - 1) ,$ and
$u(b_i) = 10n^2 (n - 2\ell m) + 5nm - m(2\ell - 1) + \ell ,$
and consequently we have $g(1^n) = g(c_i) = 0$, $g(b_i) = 1$.
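These values can be confirmed on a small concrete instance. The sketch below (toy parameters of our own choosing, not the settings of (3.1); it only needs ℓ > s) builds one draw of the structure (R, Ai, Bi, α(i)) and evaluates the threshold representation (4.1)–(4.2) on 1^n, bi and ci:

```python
# Toy instance checking g(1^n) = g(c_i) = 0 and g(b_i) = 1 for the
# threshold representation (4.1)-(4.2). Parameters are illustrative.
n, l, m, s = 60, 3, 4, 2
# Blocks C_i are disjoint 2l-element sets; A_i is the first half,
# B_i the second; alpha(i) is an element of A_i.
C = [list(range(2 * l * i, 2 * l * (i + 1))) for i in range(m)]
A = [c[:l] for c in C]
B = [c[l:] for c in C]
alpha = [a[0] for a in A]
R = [j for c in C for j in c]

theta = 10 * n**2 * (n - 2 * l * m) + 5 * n * m - m * (2 * l - 1) + s

def u(x):
    return (10 * n**2 * sum(x[j] for j in range(n) if j not in R)
            + 5 * n * sum(x[alpha[i]] for i in range(m))
            - sum(x[k] for i in range(m) for k in C[i] if k != alpha[i]))

def g(x):
    return int(u(x) >= theta)

ones = [1] * n
b1 = [0 if j in B[0] else 1 for j in range(n)]   # b_i: zeros on B_i
c1 = [0 if j in C[0] else 1 for j in range(n)]   # c_i: zeros on C_i
assert g(ones) == 0 and g(c1) == 0 and g(b1) == 1
```

Indeed u(1^n) = θ − s < θ, u(b_i) = θ + ℓ − s ≥ θ, and u(c_i) falls well below θ because of the missing 5n term.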
4.1.2 The NO distribution
A draw from the distribution NO of ( f , D f ) pairs is obtained as follows:
• As in the yes-case, make a draw from the H distribution to obtain
(R, A1, B1, . . . , Am, Bm, α(1), . . . , α(m)).
• The distribution Df puts weight 1/4 on 1^n, puts 1/4 weight uniformly over the m points c1, . . . , cm,
and puts 1/2 weight uniformly over the 2m points a1, b1, . . . , am, bm.
• The function f is defined as follows. Given input x, let J(x) ⊆ [m] be the set of those i such that
x is i-special as defined in Section 3.2 (i. e., the ith block of x has no zeros in Bi but has at least s
zeros in Ai).
f(x) = 1 if and only if all three of the following conditions hold:
1. $x_j = 1$ for all j ∈ ([n] \ R);
2. for all i such that x is not i-special, $x_{\alpha(i)} = 1$; and
3. $|J(x)|(\ell - 1) + \sum_{i \in J(x)} \sum_{k \in A_i} x_k + \sum_{i \in [m] \setminus J(x)} \sum_{k \in C_i,\, k \ne \alpha(i)} x_k \le m(2\ell - 1) - s$.
It is straightforward to show that an equivalent way to define f is as follows: f(x) equals 1 if
v(x) ≥ θ and equals 0 if v(x) < θ. Here θ is defined as in (4.2) and v(x) is defined as follows:
$v(x) = 10n^2 \sum_{j \in [n] \setminus R} x_j + 5n \Big( |J(x)| + \sum_{i \in [m] \setminus J(x)} x_{\alpha(i)} \Big) - |J(x)|(\ell - 1) - \sum_{i \in J(x)} \sum_{k \in A_i} x_k - \sum_{i \in [m] \setminus J(x)} \sum_{k \in C_i,\, k \ne \alpha(i)} x_k .$ (4.3)
Here is some intuition for the definition of f . Suppose that testing algorithm T manages to query an
input string x which has x j = 1 for all j ∈ Bi but also has at least s bits in Ai set to 0. Then as we will
see, it must be the case that the algorithm actually drew the point ai in its sample from D f . So in order
for T to be “fooled” into thinking that the function is a YES function, we want the contribution from the
bits of Ai and Bi for this input to “look like” the function is a YES function for which the point ai that
was drawn from D f is actually a point bi drawn from Dg . This is the rationale behind the definition of
f ; instead of computing u(x) and comparing it with θ , we compute v(x), which reverses the role of Ai
and Bi bits on those blocks.
It is easy to see that in both the yes-case and the no-case, any black-box query that sets any variable
in [n] \ R to 0 will give a 0 response. As in the earlier construction, intuitively this will let us assume that
any testing algorithm that has obtained strings z1 , . . . , zq from the distribution D never queries any string
x that has any bit xi set to 0 that was set to 1 in all of z1 , . . . , zq .
Finally, it is easy to check that for any (f, Df) drawn from NO, we have
$v(1^n) = 10n^2 (n - 2\ell m) + 5nm - m(2\ell - 1) ,$
$v(c_i) = 10n^2 (n - 2\ell m) + 5n(m - 1) - (m - 1)(2\ell - 1) ,$ and
$v(a_i) = v(b_i) = 10n^2 (n - 2\ell m) + 5nm - m(2\ell - 1) + \ell .$
(Note that these values on 1^n, ci and bi are the same as the values that the corresponding function u(x)
would take in the yes-case.) Thus we have f(1^n) = f(ci) = 0 and f(ai) = f(bi) = 1 for each i = 1, . . . , m.
It is easy to see that any linear threshold function must disagree with f on at least one of the four points
1^n, ai, bi, ci for each i. Consequently f is at least 1/4-far from any linear threshold function with respect
to Df.
Thus we have established Properties (1) and (2) as stated in Theorem 4.1.
4.2 Proof of Theorem 4.1
As in Section 3.3, let T be any fixed oracle algorithm that makes exactly q draws from the distribution and then makes exactly q black-box queries. All notation such as P^T_{no,0} and the like is defined analogously to the definitions given in Section 3.3. We again assume that T is given "extra information" as described earlier when it draws c_i-type examples from the distribution (so the testing algorithm is assumed to actually receive triples instead of pairs, as in Section 3.3). Our definitions of a knowledge sequence and of a "clean" sequence of draws are the same as before.
The following easy claim is an analogue of Claim 3.4:
Claim 4.3. We have Pr[P^T_{yes,0} is clean] = Pr[P^T_{no,0} is clean] ≥ 1 − q²/m. Furthermore, the conditional random variables (P^T_{yes,0} | P^T_{yes,0} is clean) and (P^T_{no,0} | P^T_{no,0} is clean) are identically distributed.
Proof. We first show that Pr[P^T_{yes,0} is clean] = Pr[P^T_{no,0} is clean] ≥ 1 − q²/m. Fix any (g, D_g) in the support of YES, and consider the outcomes of P^T_{yes,0} corresponding to this (g, D_g) being drawn from YES. Since each independent draw from D_g hits each block 1, . . . , m with probability 3/4m, the probability that P^T_{yes,0} is clean is

∏_{i=1}^{q} (1 − 3(i − 1)/(4m)) ≥ 1 − 3q²/(4m) ≥ 1 − q²/m.
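The displayed chain of inequalities is easy to spot-check numerically; the parameter pairs below are arbitrary small values chosen so all factors stay positive.

```python
from math import prod

def clean_prob_lower_bounds(q, m):
    # Returns the exact product and the two successive lower bounds:
    # prod_{i=1}^{q} (1 - 3(i-1)/(4m)) >= 1 - 3q^2/(4m) >= 1 - q^2/m.
    p = prod(1 - 3*(i - 1)/(4*m) for i in range(1, q + 1))
    return p, 1 - 3*q**2/(4*m), 1 - q**2/m

for q, m in [(3, 100), (5, 1000), (10, 10**6)]:
    p, b1, b2 = clean_prob_lower_bounds(q, m)
    assert p >= b1 >= b2
```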
The same argument shows that Pr[P^T_{no,0} is clean] also equals ∏_{i=1}^{q} (1 − 3(i − 1)/(4m)).
Now we show that the conditional random variables are identically distributed. It is not difficult to see that for any 0 ≤ j ≤ q − 1, given any particular length-j prefix of P^T_{yes,0}, conditioned on P^T_{yes,0} being clean, the (j + 1)-st element of P^T_{yes,0} has

• a 1/4 chance of being the triple (1^n, 0, 0);
• a 1/2 chance of being a triple (x, 1, 0) where x ∈ {0, 1}^n has ℓ zeros and the locations of the ℓ zeros are selected uniformly at random (without replacement) from the set of those bit positions that had value 1 in all j of the previous draws;
• a 1/4 chance of being a triple (x, 0, α) where x has 2ℓ zeros, the locations of the 2ℓ zeros are selected uniformly at random (without replacement) from the same set of bit positions described above, and α is an index drawn uniformly at random from the indices of the 2ℓ zeros in x.

It is also not difficult to see that given any particular length-j prefix of P^T_{no,0}, conditioned on P^T_{no,0} being clean, the (j + 1)-st element of P^T_{no,0} is distributed in exactly the same way. This proves the claim.
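A toy sampler for the three cases above may make the claim concrete. The list-of-bits representation of strings, the encoding of "no extra index" as 0 in the third coordinate, and the parameter values in the test are illustrative assumptions, not the paper's exact encoding.

```python
import random

def next_triple(n, ell, ones_positions, rng):
    # Draws the (j+1)-st knowledge-sequence element, conditioned on cleanness,
    # according to the three cases above. ones_positions: the bit positions
    # that had value 1 in all previous draws.
    r = rng.random()
    if r < 0.25:
        return ([1] * n, 0, 0)                       # the triple (1^n, 0, 0)
    x = [1] * n
    if r < 0.75:
        for p in rng.sample(sorted(ones_positions), ell):
            x[p] = 0                                 # ell zeros, label 1
        return (x, 1, 0)
    zeros = rng.sample(sorted(ones_positions), 2 * ell)
    for p in zeros:
        x[p] = 0                                     # 2*ell zeros, label 0
    return (x, 0, rng.choice(zeros))                 # alpha: a random zero index
```

Since the same three-way description applies to the no-case, sampling from either conditional distribution is literally the same procedure, which is the content of the claim.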
The proof of the following corollary is identical to the proof of Corollary 3.5:
Corollary 4.4. The statistical distance d_TV(P^T_{yes,0}, P^T_{no,0}) is at most q²/m.
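For finite distributions, the statistical (total variation) distance used in this corollary and throughout the argument can be computed directly. This is a generic helper, not part of the construction; distributions are represented here as dicts from outcomes to probabilities.

```python
def d_tv(p, q):
    # Total variation distance: half the L1 distance between the
    # probability mass functions, over the union of the two supports.
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```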
Foolhardy queries can be handled just as before. Let the algorithms T^1, U_k and U_k′ be defined precisely as in Section 3.3.2. The arguments of Subsection 3.3.2 immediately yield:
Lemma 4.5. For all k ∈ [q], the statistical distance

d_TV((P^{U_k}_{yes} | P^{U_k}_{yes,0} is clean), (P^{U_k′}_{yes} | P^{U_k′}_{yes,0} is clean))

is at most 2ℓm/n, and similarly d_TV((P^{U_k}_{no} | P^{U_k}_{no,0} is clean), (P^{U_k′}_{no} | P^{U_k′}_{no,0} is clean)) is also at most 2ℓm/n.
Lemma 4.6. The statistical distance d_TV(P^T_{yes}, P^{T^1}_{yes}) is at most 2ℓmq/n + q²/m, and the same bound holds for d_TV(P^T_{no}, P^{T^1}_{no}).
As we now describe, the details of how non-foolhardy queries are handled are different from Section 3.3.3.
Let T^2 denote an algorithm that is a variant of T^1, modified as follows. T^2 simulates T^1 except that T^2 does not actually make queries on non-foolhardy strings; instead T^2 simulates the answers to those queries "in the obvious way" that they should be answered if the target function were a yes-function and hence all of the draws from D that yielded strings with ℓ zeros were in fact b_i-type points. More precisely, assume that there are r distinct c_i-type points in the initial sequence of q draws from the distribution. Since for each c_i-type point the algorithm is given α(i), the algorithm "knows" r variables x_{α(i)}. The algorithm T^2 simulates an answer to a non-foolhardy query z ∈ {0, 1}^n as follows: T^2 computes

u′(z) = 10n²(n − 2ℓm) + 5n(m − |I′|) − m(2ℓ − 1) + |K \ I′|
where:
• K is the set of all variables x_i set to 0 in z, and
• I′ is the set of "known" x_{α(i)} variables that are set to 0 in z

and answers 1 if u′(z) ≥ θ, and answers 0 otherwise.
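T^2's simulated answering rule can be sketched as follows. Here known_alphas stands for the set of α(i) indices revealed by c_i-type draws, and the threshold θ is left as a parameter since (4.2) is not reproduced in this section; all concrete values in the usage below are hypothetical.

```python
def simulated_answer(z, known_alphas, n, m, ell, theta):
    # K: positions set to 0 in z; I': the "known" alpha positions among them.
    K = {j for j, bit in enumerate(z) if bit == 0}
    I_prime = K & set(known_alphas)
    # u'(z) as defined in the text.
    u_prime = (10 * n**2 * (n - 2*ell*m)
               + 5 * n * (m - len(I_prime))
               - m * (2*ell - 1)
               + len(K - I_prime))
    return 1 if u_prime >= theta else 0
```

For instance, with θ chosen (hypothetically) just above u′(1^n), the all-ones query is answered 0, a single "unknown" zero pushes the answer to 1, and a single zero at a known α(i) position drops u′ by 5n and is answered 0.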
Lemma 4.7. The statistical distance d_TV(P^{T^1}_{yes}, P^{T^2}_{yes}) is zero.
Proof. We argue that T^1 and T^2 answer all queries in exactly the same way. Fix any 1 ≤ i ≤ q and let z denote the i-th query made by T.

If z is a foolhardy query then both T^1 and T^2 answer z with 0. So suppose that z is not a foolhardy query. By inspection of (4.1), we can reexpress u(z) as

u(z) = 10n²(n − 2ℓm) + 5n(m − |I|) − m(2ℓ − 1) + |K \ I|

where:

• K is the set of all variables x_i set to 0 in z, and
• I is the set of x_{α(i)} variables that are set to 0 in z

and g(z) equals 1 if u(z) ≥ θ and equals 0 if u(z) < θ.

Since z is not foolhardy, the only zeros in z must be in positions from points that were sampled in the first stage. Consequently the only x_{α(i)} variables in I are the x_{α(i)} variables set to 0 in z from the C_i sets corresponding to the c_i points in the draws (these are the "known" x_{α(i)} variables). So I = I′ and K \ I = K \ I′.

Therefore u(z) = u′(z), and hence T^1's response is 0 if u′(z) < θ, and is 1 otherwise. This is exactly how T^2 answers non-foolhardy queries as well.
We define witnesses in exactly the same way as before:
Definition 4.8. We say that a knowledge sequence contains a witness for (f, D_f) if elements q + 1, . . . , 2q of the sequence (the black-box queries) contain either of the following:

1. A point z ∈ {0, 1}^n such that for some 1 ≤ i ≤ m for which a_i was sampled in the first q draws, the bit z_{α(i)} is 0 but fewer than s of the elements j ∈ A_i have z_j = 0. We refer to such a point as an a-witness for block i.
2. A point z ∈ {0, 1}^n such that for some 1 ≤ i ≤ m for which c_i was sampled in the first q draws, z is i-special. We refer to such a point as a c-witness for block i.
The following lemma is analogous to Lemma 3.10; it makes essential use of the way our no-functions
f are defined in Section 4.1.2.
Lemma 4.9. The statistical distance

d_TV((P^{T^1}_{no} | P^{T^1}_{no} does not contain a witness and P^{T^1}_{no,0} is clean), (P^{T^2}_{no} | P^{T^2}_{no} does not contain a witness and P^{T^2}_{no,0} is clean))

is zero.
Proof. As in the proof of Lemma 3.10, we show that if there is no witness then T^1 and T^2 answer all queries in exactly the same way. Fix any 1 ≤ i ≤ q and let z denote the i-th query.

If z is a foolhardy query then both T^1 and T^2 answer z with 0. So suppose that z is not a foolhardy query and not a witness. By inspection of (4.3), for any non-foolhardy point z we can express v(z) as

v(z) = 10n²(n − 2ℓm) + 5n(m − |L|) − m(2ℓ − 1) + |K \ L|

where

• K is the set of all variables x_i set to 0 in z, and
• L is the set of x_{α(i)} variables set to 0 in z such that i ∉ J(z) (i.e., z is not i-special).

Recall that

u′(z) = 10n²(n − 2ℓm) + 5n(m − |I′|) − m(2ℓ − 1) + |K \ I′|

where

• K is the set of all variables x_i set to 0 in z, and
• I′ is the set of "known" x_{α(i)} variables that are set to 0 in z.

We will show that if z is not a witness and not foolhardy then L = I′. This implies that v(z) = u′(z), so T^1 and T^2 will respond to such queries in exactly the same way.

First we show that I′ ⊆ L. Fix any x_{α(i)} that belongs to I′; such a variable is set to 0 in z and is "known," so c_i must have been sampled in the first stage. Since z is not a witness, z must not be i-special, i.e., i ∉ J(z); so x_{α(i)} must belong to L.

Next we show that L ⊆ I′. Fix any x_{α(i)} that belongs to L; such a variable is set to 0 in z and i ∉ J(z). This means z is not i-special, so z either has a zero from B_i or fewer than s zeros from A_i. Since the initial draws from D were clean and z is not foolhardy, it cannot be the case that a_i was drawn in the sample, for if it were drawn then z could not have a zero from B_i and also could not have fewer than s zeros from A_i (since if it had fewer than s zeros from A_i then z would be an a-witness for block i, which contradicts the fact that z is not a witness). Since a_i was not drawn in the sample but the bit α(i) is set to 0 in z and z is not foolhardy, it must be the case that c_i was drawn in the sample. But this means that x_{α(i)} is "known," and consequently x_{α(i)} belongs to I′.

So we have shown that I′ = L, and the lemma is proved.
From this point on the rest of the argument from Section 3.3 can be used without modification. We define hybrid algorithms V_k, V_k′ exactly as in Section 3.3, and the exact proof of Lemma 3.11 now yields:
Lemma 4.10. For each value 1 ≤ k ≤ q, the statistical distance

d_TV((P^{V_k}_{no} | P^{V_k}_{no,0} is clean), (P^{V_k′}_{no} | P^{V_k′}_{no,0} is clean))

is at most max{qs/ℓ, 2q/s} = qs/ℓ.
Exactly as in Section 3.3, we obtain:
Lemma 4.11. The statistical distance d_TV(P^{T^1}_{no}, P^{T^2}_{no}) is at most q²s/ℓ + q²/m.
With all the pieces in place, the arguments from Section 3.3.4 go through unchanged to complete the
proof of Theorem 4.1, and we are done.
5 Conclusion
We have given poly(n)-query lower bounds for distribution-free testing of various natural classes of Boolean functions. One obvious goal for future work is to decrease or, ideally, eliminate the polynomial-factor gap which remains between our lower bounds and the best known upper bounds for these distribution-free testing problems. In particular, it would be quite interesting if improved distribution-free testing algorithms could be obtained which make fewer queries than the currently known testing algorithms (which are straightforward transformations of learning algorithms, as described in the Introduction).
Another natural goal for future work is to extend our techniques to obtain distribution-free lower
bounds for richer classes of Boolean functions such as size-s decision trees, s-term DNF formulas, etc.
These and several similar classes are known to be testable in the standard uniform distribution model
with query complexity that depends on s but is independent of n [6]. In contrast, we suspect that for distribution-free testing, in analogy with the results of this paper, the query complexity must depend (perhaps polynomially) on n.
Acknowledgements
We thank the anonymous referees for their detailed comments and suggestions which significantly improved the presentation of the paper.
References
[1] N. Ailon and B. Chazelle: Information theory in property testing and monotonicity testing in higher dimension. Inform. and Comput., 204(11):1704–1717, 2006. [doi:10.1016/j.ic.2006.06.001]. 192

[2] N. Alon, T. Kaufman, M. Krivelevich, S. Litsyn, and D. Ron: Testing Reed-Muller codes. IEEE Trans. Inform. Theory, 51(11):4032–4039, 2005. [doi:10.1109/TIT.2005.856958]. 192

[3] N. Alon and A. Shapira: Homomorphisms in graph property testing – a survey. In Topics in Discrete Mathematics, volume 26 of Algorithms and Combinatorics, pp. 281–313. Springer, 2006. 191

[4] L. Babai, L. Fortnow, and C. Lund: Non-deterministic exponential time has two-prover interactive protocols. Comput. Complexity, 1(1):3–40, 1991. [doi:10.1007/BF01200056]. 191

[5] M. Blum, M. Luby, and R. Rubinfeld: Self-testing/correcting with applications to numerical problems. J. Comput. System Sci., 47(3):549–595, 1993. [doi:10.1016/0022-0000(93)90044-W]. 191, 192

[6] I. Diakonikolas, H. Lee, K. Matulef, K. Onak, R. Rubinfeld, R. Servedio, and A. Wan: Testing for concise representations. In Proc. 48th FOCS, pp. 549–558. IEEE Comp. Soc. Press, 2007. [FOCS:10.1109/FOCS.2007.32]. 192, 193, 216

[7] Y. Dodis, O. Goldreich, E. Lehman, S. Raskhodnikova, D. Ron, and A. Samorodnitsky: Improved testing algorithms for monotonicity. In Proc. 3rd Intern. Workshop on Randomization and Approximation Techniques in Comp. Sci. (RANDOM'99), volume 1671 of Lecture Notes in Computer Science, pp. 97–108. Springer, 1999. [doi:10.1007/b72324]. 192, 193

[8] E. Fischer: The art of uninformed decisions: A primer to property testing. Bull. Eur. Assoc. Theor. Comput. Sci., 75:97–126, 2001. 191

[9] E. Fischer, G. Kindler, D. Ron, S. Safra, and A. Samorodnitsky: Testing juntas. J. Comput. System Sci., 68(4):753–787, 2004. [doi:10.1016/j.jcss.2003.11.004]. 192

[10] E. Fischer, E. Lehman, I. Newman, S. Raskhodnikova, R. Rubinfeld, and A. Samorodnitsky: Monotonicity testing over general poset domains. In Proc. 34th STOC, pp. 474–483, 2002. [STOC:10.1145/509907.509977]. 192, 193

[11] O. Goldreich: Combinatorial property testing – a survey. In Randomization Methods in Algorithm Design, pp. 45–60. AMS-DIMACS, 1998. 191

[12] O. Goldreich, S. Goldwasser, E. Lehman, D. Ron, and A. Samorodnitsky: Testing monotonicity. Combinatorica, 20(3):301–337, 2000. [doi:10.1007/s004930070011]. 192, 193

[13] O. Goldreich, S. Goldwasser, and D. Ron: Property testing and its connection to learning and approximation. J. ACM, 45:653–750, 1998. [JACM:285055.285060]. 192

[14] S. Halevy and E. Kushilevitz: Distribution-free connectivity testing for sparse graphs. Algorithmica, 51(1):24–48, 2004. [doi:10.1007/s00453-007-9054-1]. 192

[15] S. Halevy and E. Kushilevitz: A lower bound for distribution-free monotonicity testing. In Proc. 9th Intern. Workshop on Randomization and Approximation Techniques in Comp. Sci. (RANDOM'05), volume 3624 of Lecture Notes in Computer Science, pp. 330–341. Springer, 2005. [doi:10.1007/11538462_28]. 192, 201

[16] S. Halevy and E. Kushilevitz: Distribution-free property testing. SIAM J. Comput., 37(4):1107–1138, 2007. [SICOMP:10.1137/050645804]. 192, 193

[17] M. Kearns and U. Vazirani: An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994. 193

[18] K. Matulef, R. O'Donnell, R. Rubinfeld, and R. Servedio: Testing halfspaces. In Proc. 19th Annual ACM-SIAM Symp. Discrete Algorithms (SODA), pp. 256–264. ACM Press, 2009. [ACM:1496770.1496799]. 192, 193

[19] M. Parnas, D. Ron, and A. Samorodnitsky: Testing basic Boolean formulae. SIAM J. Discrete Math., 16(1):20–46, 2002. [doi:10.1137/S0895480101407444]. 192, 193, 194, 200

[20] D. Ron: Property testing (a tutorial). In Handbook of Randomized Computing, volume 2, pp. 597–649. Kluwer, 2000. 191

[21] R. Rubinfeld: Sublinear time algorithms. In Proc. Intern. Congress of Mathematicians, volume 3, pp. 1095–1110. European Math. Soc., 2006. 191

[22] R. Rubinfeld and M. Sudan: Robust characterizations of polynomials with applications to program testing. SIAM J. Comput., 25(2):252–271, 1996. [doi:10.1137/S0097539793255151]. 191

[23] G. Turán: Lower bounds for PAC learning with queries. In Proc. 6th Ann. Conf. Comput. Learning Theory, pp. 384–391. ACM Press, 1993. [ACM:10.1145/168304.168382]. 193
AUTHORS

Dana Glasner
Department of Computer Science
Columbia University, New York, NY
dglasner@cs.columbia.edu

Rocco A. Servedio
Department of Computer Science
Columbia University, New York, NY
rocco@cs.columbia.edu
http://www.cs.columbia.edu/~rocco
ABOUT THE AUTHORS

DANA GLASNER is a Ph.D. candidate in the Department of Computer Science at Columbia University, supervised by Tal Malkin and Rocco Servedio. Her research interests include property testing and cryptography. She grew up in Pennsylvania and enjoys life in New York City.

ROCCO A. SERVEDIO is an associate professor in the Department of Computer Science at Columbia University. He graduated from Harvard University in 2001, where his Ph.D. was supervised by Leslie Valiant. He is interested in computational learning theory, computational complexity, and property testing, with the study of Boolean functions as an underlying theme tying these topics together. He enjoys spending time with his family and hopes to drink a glass of port with Herman Melville in the afterlife.