THEORY OF COMPUTING, Volume 9 (17), 2013, pp. 587–592
www.theoryofcomputing.org

NOTE
SPECIAL ISSUE: ANALYSIS OF BOOLEAN FUNCTIONS

A Monotone Function Given By a Low-Depth Decision Tree That Is Not an Approximate Junta

Daniel M. Kane*

Received April 23, 2012; Revised September 18, 2012; Published June 10, 2013
Abstract: We present a family of monotone functions f_d : {0,1}^n → {0,1} so that f_d can be computed by a depth-d decision tree and so that f_d disagrees with any k-junta on a constant fraction of inputs for any k = exp(o(√d)). This gives a negative answer to a problem circulated independently by Elad Verbin and by Rocco Servedio and Li-Yang Tan.
ACM Classification: F.2.3
AMS Classification: 06E30
Key words and phrases: Boolean functions, monotone functions, decision tree
1 Introduction
In [3], O’Donnell and Servedio show that any monotone function given by a depth-d decision tree can
be learned to constant accuracy from random samples in poly(n, 2d ) time. The impact of this result is
somewhat lessened by an apparent lack of interesting monotone functions given by low-depth decision
trees. In particular, Elad Verbin as well as Rocco Servedio and Li-Yang Tan independently suggested in
2010 that all such functions may essentially depend on few variables [1, page 10].
* Supported by an NSF postdoctoral fellowship.
© 2013 Daniel M. Kane
Licensed under a Creative Commons Attribution License (CC-BY) DOI: 10.4086/toc.2013.v009a017
Conjecture 1.1. For every ε > 0 and every monotone function f : {0,1}^n → {0,1} given by a depth-d decision tree, there is a k-junta g, for k = poly_ε(d), so that f and g agree on all but an ε-fraction of inputs.
In this note, we disprove the above conjecture; in particular, we provide an example of a monotone low-depth function that is not well approximated by any small junta. Specifically, we prove:
Theorem 1.2. There exists a constant ε > 0 so that for every positive integer d, there exists a k = exp(Ω(√d)) and a monotone function f : {0,1}^n → {0,1} given by a depth-d decision tree, so that for every k-junta g, f and g disagree on at least an ε-fraction of inputs.
In fact it is known that the bound on k in Theorem 1.2 is tight up to the constant in the exponent. In particular, it is shown in [3] that any monotone function given by a depth-d decision tree has total influence I(f) = O(√d). We combine this with the main result of [2], which says that any Boolean function f can be ε-approximated by a k-junta for k = exp(O(I(f)/ε)). Combining these results we find that:

Observation 1.3. If f is a monotone function given by a depth-d decision tree, and if ε > 0, then there is a k-junta g that agrees with f on all but an ε-fraction of inputs for k = exp(O(√d/ε)).
The function we construct to show Theorem 1.2 will combine ideas from two previous constructions,
the monotone addressing function and Talagrand’s function.
The monotone addressing function is defined by

    f(x_1, ..., x_{d-1}, y_0, ..., y_{2^{d-1}-1}) =
        1                      if ∑ x_i > ⌊(d-1)/2⌋,
        y_{x_1 x_2 ... x_{d-1}}    if ∑ x_i = ⌊(d-1)/2⌋,
        0                      if ∑ x_i < ⌊(d-1)/2⌋,

where the subscript x_1 x_2 ... x_{d-1} is read as the binary representation of an index in {0, ..., 2^{d-1} - 1}.
This is an example of a monotone function given by a depth-d decision tree that depends on exponentially
many variables, and thus provides us with a good starting point. The monotone addressing function fails
to provide a counter-example to Conjecture 1.1, though, since it agrees with the majority function except on a set of measure O(1/√d).
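To make the definition concrete, here is a small Python sketch of the monotone addressing function. This is our own illustration, not code from the paper; the function name and the zero-indexed list representation of the address and data bits are our conventions.

```python
def monotone_addressing(x, y):
    """Monotone addressing function.

    x: list of d-1 address bits (0/1); y: list of 2^(d-1) data bits (0/1).
    Returns 1 above the middle slice, 0 below it, and the addressed data
    bit on the slice where the address weight equals floor((d-1)/2).
    """
    assert len(y) == 2 ** len(x)
    threshold = len(x) // 2              # floor((d-1)/2)
    weight = sum(x)
    if weight > threshold:
        return 1
    if weight < threshold:
        return 0
    # On the middle slice, read the address bits as a binary index into y.
    index = int("".join(map(str, x)), 2)
    return y[index]
```

A decision tree computing this function queries the d − 1 address bits and then at most one data bit, so it has depth d, while the function depends on all 2^{d-1} data bits.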
Given the bound on the total sensitivity of a low-depth monotone function, we know that any f
satisfying the conditions of Theorem 1.2 must not only have near the maximum possible total influence
for a low-depth monotone function, but also must not be approximable by a function with much lower
total influence. Because of this restriction, our construction will look somewhat similar to a construction
of Talagrand in [4]. In particular, Talagrand constructs a monotone function f on {0,1}^d so that on a constant fraction of inputs, f has sensitivity (i. e., the number of coordinates such that changing the input at that coordinate would change the output of f) Ω(√d). Since, as is easily seen, the average sensitivity
over all inputs is equal to the total influence, this is as large as possible. On the other hand, this condition
tells us that f retains large average sensitivity even after ignoring any ε-fraction of inputs for sufficiently
small constant ε. Talagrand’s function fails to provide a counter-example to Conjecture 1.1 on its own,
because it is already a d-junta.
2 The construction
In order to define the function f with the properties specified by Theorem 1.2, we first introduce some background notation. We let d, t and m be integers with t = Θ(√d) and m = Θ(2^t). We furthermore assume that 2^{-t}·m is sufficiently small given the value of t/√d. We let S = (S_1, ..., S_m) be a random sequence of sets, where the S_i are chosen independently and uniformly from the set of subsets of {1, 2, ..., d−1} of size exactly t. Given this S, we define the function T_S on {0,1}^{d−1} as follows:

    T_S(x_1, ..., x_{d−1}) = {1 ≤ i ≤ m : x_j = 1 for all j ∈ S_i}.
We will hereafter write T for T_S, suppressing the explicit dependence on S, and abbreviate (x_1, ..., x_{d−1}) by x.
We finally define f as

    f_S(x_1, ..., x_{d−1}, y_1, ..., y_m) =
        1    if |T(x)| ≥ 2,
        0    if |T(x)| = 0,
        y_i  if T(x) = {i}.
Again, we will often suppress the dependence of f on S. It is clear that f is monotone. Furthermore, f is given by a depth-d decision tree, since after fixing the values of the x_i, the value of f depends on at most one more coordinate. In the next section, we show that f cannot be approximated by any k-junta for small k.
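As a concrete illustration, the construction can be written in Python as follows. This is our own sketch, not code from the paper; the names sample_S, T and f are ours, and x and y are padded with a dummy entry at position 0 so that list indices match the paper's one-based subscripts.

```python
import random

def sample_S(d, t, m, rng):
    """S = (S_1, ..., S_m): independent, uniformly random size-t subsets of {1, ..., d-1}."""
    return [frozenset(rng.sample(range(1, d), t)) for _ in range(m)]

def T(S, x):
    """T_S(x) = { i : x_j = 1 for all j in S_i }, with i ranging over {1, ..., m}."""
    return {i for i, Si in enumerate(S, start=1) if all(x[j] == 1 for j in Si)}

def f(S, x, y):
    """f_S(x, y) = 1 if |T(x)| >= 2, 0 if T(x) is empty, and y_i if T(x) = {i}."""
    Tx = T(S, x)
    if len(Tx) >= 2:
        return 1
    if not Tx:
        return 0
    (i,) = Tx                      # the unique index in T(x)
    return y[i]
```

After the x_j are fixed, f reads at most one y_i, which is exactly why f is computed by a depth-d decision tree.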
Note that Talagrand's function is given (for appropriately chosen S) by

    G(x_1, ..., x_{d−1}) =
        1  if |T(x)| ≥ 1,
        0  if |T(x)| = 0.
3 Approximation bounds
Theorem 1.2 will follow from the following proposition:
Proposition 3.1. There exists an ε > 0 so that for f_S defined as above, with constant probability over the choice of S, f is not ε-approximated by any k-junta for k = o(2^t).
A key step in our proof will be to show that with constant probability f actually depends on one of the y_i.
Lemma 3.2. With T as above,

    Pr_{S,x}(|T_S(x)| = 1) = Ω(1).

Proof. We will show the further claim that

    E[|T_S(x)|(2 − |T_S(x)|)] = Ω(1).    (1)
Since the term in the expectation is positive only if |T| = 1, this will complete our proof. We note that

    E[|T_S(x)|] = ∑_{i=1}^m Pr(i ∈ T_S(x))
                = ∑_{i=1}^m Pr(x_j = 1 for all j ∈ S_i)
                = m·2^{-t}.
On the other hand, we have that

    E[|T_S(x)|(|T_S(x)| − 1)] = ∑_{i≠j} Pr(i, j ∈ T_S(x))
                              = ∑_{i≠j} Pr(i ∈ T_S(x)) · Pr(j ∈ T_S(x) | i ∈ T_S(x))
                              = ∑_{i≠j} 2^{-t} · Pr(x_ℓ = 1 for all ℓ ∈ S_j | x_ℓ = 1 for all ℓ ∈ S_i).
To compute this conditional probability we let S_j = {a_1, ..., a_t}, where the a_i are picked randomly from {1, 2, ..., d−1} without replacement. We compute it as the product

    ∏_{k=1}^t Pr(x_{a_k} = 1 | x_{a_1} = ... = x_{a_{k−1}} = 1 and x_ℓ = 1 for all ℓ ∈ S_i).
These probabilities are approximated by first fixing the values of S_i and a_1, ..., a_{k−1}. After additionally fixing the value of a_k, the probability in question becomes 1 if a_k ∈ S_i and 1/2 otherwise. Thus the probability that x_{a_r} = 1 is

    (1 + Pr(a_r ∈ S_i))/2 = (1 + |S_i \ {a_1, ..., a_{r−1}}|/(d − r))/2 = 1/2 + O(t/d).
Hence the probability that j ∈ T_S(x) given that i ∈ T_S(x) is

    (1/2 + O(t/d))^t = 2^{-t} exp(O(t²/d)).
Therefore, we have that

    E[|T_S(x)|(|T_S(x)| − 1)] = ∑_{i≠j} 2^{-2t} exp(O(t²/d)) ≤ (2^{-t}m)² exp(O(t²/d)),

and hence

    E[|T_S(x)|(2 − |T_S(x)|)] = E[|T_S(x)|] − E[|T_S(x)|(|T_S(x)| − 1)]
                              ≥ (2^{-t}m) − (2^{-t}m)² exp(O(t²/d))
                              = (2^{-t}m)(1 − (2^{-t}m) exp(O(t²/d))).

As long as 2^{-t}m is bounded below by a constant and above by exp(−O(t²/d))/2, this is Ω(1).
We are now ready to prove Proposition 3.1. By Lemma 3.2, with constant probability over S we have Pr_x(|T(x)| = 1) = Ω(1). For such S, we claim that f has the desired property. In particular we claim the following:

Lemma 3.3. If f is as above and g is a k-junta, then

    Pr(f(x, y) ≠ g(x, y)) ≥ (Pr_x(|T(x)| = 1) − k·2^{-t}) / 2.
Proof. This follows from the simple observation that if, after fixing the value of x, we have that T = {i} where g does not depend on y_i, then Pr_y(f(x, y) ≠ g(x, y)) = 1/2. This is because after further conditioning on the values of all y_j for j ≠ i, g becomes a constant function (by assumption) and f takes the values 0 and 1 each with probability 1/2. Therefore we have that

    Pr(f(x, y) ≠ g(x, y)) ≥ Pr(T(x) = {i} for some i, and g does not depend on y_i) / 2
                          = (Pr(|T(x)| = 1) − Pr(T(x) = {i} for some i, and g depends on y_i)) / 2
                          = (Pr(|T(x)| = 1) − ∑_{i : g depends on y_i} Pr(T(x) = {i})) / 2
                          ≥ (Pr(|T(x)| = 1) − ∑_{i : g depends on y_i} Pr(i ∈ T(x))) / 2
                          = (Pr(|T(x)| = 1) − ∑_{i : g depends on y_i} 2^{-t}) / 2
                          ≥ (Pr_x(|T(x)| = 1) − k·2^{-t}) / 2.
Proposition 3.1 and Theorem 1.2 now follow immediately.
Acknowledgements
I would like to thank Ryan O’Donnell for making me aware of this problem, and for his help with finding
appropriate references for this paper. This research was done with the support of an NSF postdoctoral
fellowship.
References
[1] Open Problems in Analysis of Boolean Functions. Compiled for the Simons Symposium, February 5–11, 2012. Technical report, 2012. [arXiv:1204.6447]

[2] Ehud Friedgut: Boolean functions with low average sensitivity depend on few coordinates. Combinatorica, 18(1):27–35, 1998. [doi:10.1007/PL00009809]
[3] Ryan O'Donnell and Rocco A. Servedio: Learning monotone decision trees in polynomial time. SIAM J. Comput., 37(3):827–844, 2007. Preliminary version in CCC'06. [doi:10.1137/060669309]

[4] Michel Talagrand: How much are increasing sets positively correlated? Combinatorica, 16(2):243–258, 1996. [doi:10.1007/BF01844850]
AUTHOR
Daniel M. Kane
Stanford University Department of Mathematics, Stanford, CA
dankane math stanford edu
http://math.stanford.edu/~dankane/
ABOUT THE AUTHOR
DANIEL KANE graduated from the mathematics department at Harvard in 2011; his advisor was Barry Mazur. His thesis work covered a number of topics, mostly in the areas of theoretical computer science and analytic number theory. Although his interests are diverse, his recent work in computer science has tended to focus on polynomial threshold functions and derandomization. He is currently working as a postdoc in the mathematics department of Stanford University.
Daniel’s early interest in mathematics was strongly influenced by his father, Jonathan Kane,
who was himself a professor of mathematics and computer science at the University of
Wisconsin-Whitewater.