Authors: Jeffrey C. Jackson, Rocco A. Servedio
License: CC-BY-ND-2.0
Theory of Computing, Volume 2 (2006), pp. 147–172
http://theoryofcomputing.org
On Learning Random DNF Formulas Under
the Uniform Distribution
Jeffrey C. Jackson Rocco A. Servedio
Received: March 21, 2006; published: September 19, 2006.
Abstract: We study the average-case learnability of DNF formulas in the model of learn-
ing from uniformly distributed random examples. We define a natural model of random
monotone DNF formulas and give an efficient algorithm which with high probability can
learn, for any fixed constant γ > 0, a random t-term monotone DNF for any t = O(n^{2−γ}).
We also define a model of random non-monotone DNF and give an efficient algorithm
which with high probability can learn a random t-term DNF for any t = O(n^{3/2−γ}). These
are the first known algorithms that can learn a broad class of polynomial-size DNF in a
reasonable average-case model of learning from random examples.
ACM Classification: I.2.6, F.2.2, G.1.2, G.3
AMS Classification: 68Q32, 68W20, 68W25, 60C05
Key words and phrases: computational learning theory, uniform-distribution learning, PAC learning,
DNF formulas, monotone DNF
1 Introduction
1.1 Motivation and background
A disjunctive normal form formula, or DNF, is an AND of ORs of Boolean literals. A question that has
been open since Valiant’s initial paper on computational learning theory [26] is whether or not efficient
algorithms exist for learning polynomial size DNF formulas in variants of the PAC (Probably Approx-
imately Correct) learning model introduced by Valiant. Roughly speaking, in these models a learning
Authors retain copyright to their work and grant Theory of Computing unlimited rights
to publish the work electronically and in hard copy. Use of the work is permitted as
long as the author(s) and the journal are properly acknowledged. For the detailed
copyright statement, see http://theoryofcomputing.org/copyright.html.
© 2006 Jeffrey C. Jackson and Rocco A. Servedio DOI: 10.4086/toc.2006.v002a008
algorithm is required to generate a high-accuracy (error rate at most ε) hypothesis with high probabil-
ity (the algorithm must fail to generate such a hypothesis with probability at most δ ); we give a more
detailed explanation of our learning scenario in Section 2. The only positive result for learning general
DNF formulas in such frameworks to date is the Harmonic Sieve [12]. The Sieve is a membership-query
algorithm (i.e. it requires black-box query access to the unknown function f that is being learned) that
efficiently PAC learns DNF when the error rate is defined with respect to the uniform distribution over
the space of all possible n-bit example strings (and certain related distributions). The approximating hy-
pothesis produced by the Sieve is not itself represented as a DNF; thus, the Sieve is an improper learning
algorithm.
There has been little progress on polynomial-time algorithms for learning arbitrary DNF since the
discovery of the Sieve. There are two obvious relaxations of the uniform distribution membership query
model that can be pursued. The first is to learn with respect to arbitrary distributions using membership
queries; in this setting, the learning algorithm is given black-box (membership query) access to the
unknown function f , and also access to a source of random labeled examples (x, f (x)) where each
example x is independently drawn from a fixed probability distribution which is arbitrary and not known
to the learning algorithm. The learner must generate a high-accuracy hypothesis with respect to this
unknown distribution. Given standard cryptographic assumptions, it is known that learning DNF in
this framework is essentially as difficult as learning DNF with respect to arbitrary distributions without
membership queries [4].
The second obvious relaxation is to learn with respect to the uniform distribution without member-
ship queries. However, there are substantial known obstacles to learning DNF in the model of uniform
distribution without membership queries. In particular, no algorithm which can be recast as a Statistical Query algorithm can learn arbitrary polynomial-size DNF under the uniform distribution in n^{o(log n)}
time [8]. (Roughly speaking, a Statistical Query algorithm is an algorithm which is only allowed to
obtain statistical estimates of properties of the distribution over labeled example pairs (x, f (x)); such an
algorithm does not have access to actual labeled examples (x, f (x)). See [17] for a detailed description
of the Statistical Query model.) Since nearly all non-membership learning algorithms can be recast as
Statistical Query algorithms [17], a major conceptual shift seems necessary to obtain an algorithm for
efficiently learning arbitrary DNF formulas from uniform examples alone.
An apparently simpler question is whether monotone DNF formulas, which contain only un-negated
variables, can be learned efficiently. Angluin showed that monotone DNF can be properly learned with
respect to arbitrary distributions using membership queries [3]. It has also long been known that with
respect to arbitrary distributions without membership queries, monotone DNF are no easier to learn than
arbitrary DNF [19]. This leaves the following enticing question (posed in [16, 7, 6]): are monotone DNF
efficiently learnable from uniform examples alone?
In 1990, Verbeurgt [27] gave an algorithm that can properly learn any poly(n)-size (arbitrary) DNF
from uniform examples in time n^{O(log n)}. More recently, the algorithm of [25] learns any 2^{√log n}-term
monotone DNF in poly(n) time. However, despite significant interest in the problem, no algorithm faster
than that of [27] is known for learning arbitrary poly(n)-size monotone DNF from uniform examples,
and no known hardness result precludes such an algorithm (the Statistical Query result of [8] is at its
heart a hardness result for low-degree parity functions, and thus does not apply to monotone DNF).
Since worst-case versions of several DNF learning problems have remained stubbornly open for a
decade or more, it is natural to ask about DNF learning from an average-case perspective, i.e., about
learning random DNF formulas. In fact, this question has been considered before: Aizenstein and
Pitt [1] were the first to ask whether random DNF formulas are efficiently learnable. They proposed
a model of random DNF in which each of the t terms is selected independently at random from all
possible terms, and gave a membership and equivalence query algorithm which with high probability
learns a random DNF generated in this way. (See [1] or [3] for a description of the membership and
equivalence query learning framework.) However, as noted in [1], a limitation of this model is that with
very high probability all terms will have length Ω(n). The learning algorithm itself becomes quite simple
in this situation. Thus, while this is a “natural” average-case DNF model, from a learning perspective
it is not a particularly interesting model. To address this deficiency, they also proposed another natural
average-case model which is parameterized by the expected length k of each term as well as the number
of independent terms t, but left open the question of whether or not random DNF can be efficiently
learned in such a model.
1.2 Our results
We consider an average-case DNF model very similar to the latter Aizenstein and Pitt model, although
we simplify slightly by assuming that k represents a fixed term length rather than an expected length. We
show that, in the model of learning from uniform random examples only, random monotone DNF are
properly and efficiently learnable for many interesting values of k and t. In particular, for t = O(n^{2−γ})
where γ > 0, and for k = log t, our algorithm can achieve any error rate ε > 0 in poly(n, 1/ε) time with
high probability (over both the selection of the target DNF and the selection of examples). In addition,
we obtain slightly weaker results for arbitrary DNF: our algorithm can properly and efficiently learn
random t-term DNF for t such that t = O(n^{3/2−γ}). This algorithm cannot achieve arbitrarily small error
but can achieve error ε = o(1) for any t = ω(1). For detailed result statements see Theorems 3.13
and 4.11.
While our results would clearly be stronger if they held for any t = poly(n) rather than the specific
polynomials given, they are a marked advance over the previous state of affairs for DNF learning. (Recall
that in the standard worst-case model, poly(n)-time uniform-distribution learning of t(n)-term DNF for
any t(n) = ω(1) is an open problem with an associated cash prize [5].)
At this point a word or two is in order to clarify the relationship between the random DNF model we
consider and the models of random CNF formulas that are often studied in the context of the Boolean
satisfiability problem. In the study of random k-CNF formulas, k is often taken to be a fixed constant
such as 3. In contrast with the satisfiability problem, in the learning arena taking k to be a fixed constant
such as 3 is not an interesting choice, since it is well known that k-CNF (or equivalently, DNF formulas
in which every term is of length at most k) can be easily learned with respect to any distribution in time
n^{O(k)} [26]. Intuitively, the “interesting” values of k are different for the satisfiability problem and the
learning problem because in the satisfiability problem the interesting cases occur when there are only a
small number of satisfying assignments, whereas in the learning framework the interesting cases occur
when the target DNFs are roughly balanced between satisfying and unsatisfying assignments. (From
a learning perspective balanced functions are generally more interesting than unbalanced functions,
since a constant function is trivially a good approximator to a highly unbalanced function.) Thus, for
the learning problem, taking k = log t is a natural choice when learning with respect to the uniform
distribution. (We actually allow a somewhat more general choice of k, as is described in detail in the
paper.)
Our results shed some light on which cases are not hard to learn in the worst-case uniform distri-
bution model. While “hard” cases were previously known for arbitrary DNF [5], our findings may be
particularly helpful in guiding future research on monotone DNF. In particular, our algorithm learns any
monotone t = O(n^{2−γ})-term DNF which (i) is near-balanced, (ii) has every term uniquely satisfied with
reasonably high probability, (iii) has every pair of terms jointly satisfied with much smaller probability,
and (iv) has no variable appearing in significantly more than a 1/√t fraction of the t terms (this is made
precise in Lemma 3.9). So in order to be “hard,” a monotone DNF must violate one or more of these
criteria.
Our algorithms work in two stages: they first identify pairs of variables which co-occur in some
term of the target DNF, and then use these pairs to reconstruct terms via a specialized clique-finding
algorithm. (This is why our results do not extend to random DNF with more than n^{2−γ} terms; for such
formulas the variable co-occurrence graph is with high probability dense or even complete, so we cannot
reconstruct terms from co-occurrence information.) For monotone DNF we can with high probability
determine for every pair of variables whether or not the pair co-occurs in some term. For non-monotone
DNF, with high probability we can identify most pairs of variables which co-occur in some term; as we
show, this enables us to learn to fairly (but not arbitrarily) high accuracy.
We give preliminaries in Section 2. Sections 3 and 4 contain our results for monotone and non-
monotone DNF respectively. Section 5 concludes.
A preliminary version of this work appeared in the proceedings of RANDOM 2005 [15]. The current
version of the paper gives a more thorough exposition and includes many proofs that were omitted from
the conference version due to space limitations.
2 Preliminaries
We first describe our models of random monotone and non-monotone DNF. Let M_n^{t,k} be the probability
distribution over monotone t-term DNF induced by the following random process: each term is independently
and uniformly chosen at random from all (n choose k) monotone ANDs of size exactly k over variables
v_1, ..., v_n. For non-monotone DNF, we write D_n^{t,k} to denote the following distribution over t-term DNF:
first a monotone DNF is selected from M_n^{t,k}, and then each occurrence of each variable in each term is
independently negated with probability 1/2. (Equivalently, a draw from D_n^{t,k} is done by independently
and uniformly selecting t terms from the set of all terms of length exactly k.)
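As a concrete illustration of these two distributions, here is a small Python sketch (ours, not part of the paper; all names are illustrative) that draws from M_n^{t,k} and D_n^{t,k} and evaluates the resulting formula on an assignment:

```python
import random

def sample_monotone_dnf(n, t, k, rng=random):
    # A draw from M_n^{t,k}: each term is a uniformly random k-subset of the
    # n variables, chosen independently; literal (v, 1) means "v un-negated".
    return [frozenset((v, 1) for v in rng.sample(range(n), k)) for _ in range(t)]

def sample_dnf(n, t, k, rng=random):
    # A draw from D_n^{t,k}: negate each variable occurrence of a draw from
    # M_n^{t,k} independently with probability 1/2 (sign 0 means negated).
    return [frozenset((v, rng.randrange(2)) for (v, _) in term)
            for term in sample_monotone_dnf(n, t, k, rng)]

def eval_dnf(terms, x):
    # literal (v, 1) wants x[v] == 1; literal (v, 0) wants x[v] == 0
    return any(all(x[v] == sign for (v, sign) in term) for term in terms)
```

A monotone draw is always satisfied by the all-ones assignment and falsified by the all-zeros assignment, which gives a quick sanity check on the sampler.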
Given a Boolean function φ : {0,1}^n → {0,1}, we write Pr[φ] to denote Pr_{x∼U_n}[φ(x) = 1], where U_n
denotes the uniform distribution over {0,1}^n. We write log to denote log_2 and ln to denote natural log.
2.1 Tail bounds
We use the following:
Chernoff bound (see [2, Theorem A.12]): Let B(t, p) denote the binomial distribution with parameter
p, i.e. a draw from B(t, p) is a sum of t independent p-biased 0/1 Bernoulli trials. Then for β > 1,

Pr_{S∼B(t,p)}[S ≥ βpt] ≤ (e^{β−1} β^{−β})^{pt} < (e/β)^{βpt}.
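As a sanity check on this bound (our addition, not part of the paper), one can compare the exact binomial upper tail with the (e/β)^{βpt} form for concrete parameter values:

```python
import math

def chernoff_bound(t, p, beta):
    # the (e/beta)^(beta*p*t) form of the multiplicative Chernoff bound
    return (math.e / beta) ** (beta * p * t)

def exact_upper_tail(t, p, threshold):
    # exact Pr[S >= threshold] for S ~ B(t, p), by direct summation
    lo = math.ceil(threshold)
    return sum(math.comb(t, s) * p**s * (1 - p) ** (t - s)
               for s in range(lo, t + 1))
```

For example, with t = 100, p = 0.1, and β = 4, the exact tail Pr[S ≥ 40] is far below the Chernoff estimate, as expected.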
The following bound will also be useful:
McDiarmid bound [24]: Let X_1, ..., X_m be independent random variables taking values in a set Ω. Let
F : Ω^m → R be such that for all i ∈ [m] we have

|F(x_1, ..., x_m) − F(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_m)| ≤ c_i

for all x_1, ..., x_m and x_i' in Ω. Let µ = E[F(X_1, ..., X_m)]. Then for all τ > 0,

Pr[|F(X_1, ..., X_m) − µ| > τ] < exp(−τ^2/(c_1^2 + · · · + c_m^2)).
2.2 The learning model
In the uniform distribution learning model which we consider, the learner is given a source of labeled
examples (x, f (x)) where each x is uniformly drawn from {0, 1}n and f is the unknown function to be
learned. The goal of the learner is to efficiently construct a hypothesis h which with high probability
(over the choice of labeled examples used for learning) has low error relative to f under the uniform
distribution, i.e. Pr_{x∼U_n}[h(x) ≠ f(x)] ≤ ε with probability 1 − δ. This model has been intensively studied
in learning theory, see e.g. [11, 10, 13, 21, 22, 25, 27]. In our average case framework, the target function
f will be drawn randomly from either M_n^{t,k} or D_n^{t,k}, and (as in [14]) our goal is to construct a low-error
hypothesis h for f with high probability over both the random examples used for learning and the random
draw of f .
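The error criterion above is itself easy to estimate empirically. A minimal sketch (ours; the function name is illustrative) that Monte-Carlo estimates Pr_{x∼U_n}[h(x) ≠ f(x)] from m uniform examples:

```python
import random

def estimate_error(h, f, n, m=5000, rng=random):
    # Monte Carlo estimate of Pr_{x ~ U_n}[h(x) != f(x)]; h and f map a
    # 0/1 assignment (a list) to a Boolean.
    disagree = 0
    for _ in range(m):
        x = [rng.randrange(2) for _ in range(n)]
        disagree += h(x) != f(x)
    return disagree / m
```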
3 Learning random monotone DNF
3.1 Interesting parameter settings
Consider a random draw of f from M_n^{t,k}. It is intuitively clear that if t is too large relative to k then a
random f ∈ M_n^{t,k} will likely have Pr[f] ≈ 1; similarly, if t is too small relative to k then a random f ∈ M_n^{t,k}
will likely have Pr[f] ≈ 0. Such cases are not very interesting from a learning perspective since a trivial
algorithm can learn to high accuracy. We are thus led to the following definition:
Definition 3.1. A pair of values (k, t) is said to be monotone α-interesting if

α ≤ E_{f∈M_n^{t,k}}[Pr[f]] ≤ 1 − α.
Throughout the paper we will assume that 0 < α ≤ .09 is a fixed constant independent of n and that
t ≤ p(n), where p(·) is a fixed polynomial (and we will also make assumptions about the degree of p).
The following lemma gives necessary conditions for (k, t) to be monotone α-interesting. (As Lemma 3.2
indicates, we may always think of k as being roughly log t.)
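The claim that k ≈ log t gives a near-balanced formula can be checked empirically. The following sketch (ours, with illustrative parameter values) estimates E_{f∼M_n^{t,k}}[Pr[f]] by sampling random formulas and uniform assignments; for k = log t the estimate lands well away from both 0 and 1:

```python
import random

def estimate_expected_pr(n, t, k, n_formulas=30, n_points=400, rng=random):
    # Estimate E_{f ~ M_n^{t,k}}[Pr[f]]: sample random monotone formulas
    # and, for each, estimate Pr[f] from uniform random assignments.
    total = 0.0
    for _ in range(n_formulas):
        terms = [rng.sample(range(n), k) for _ in range(t)]
        sat = sum(
            any(all(x[v] for v in T) for T in terms)
            for x in ([rng.randrange(2) for _ in range(n)]
                      for _ in range(n_points))
        )
        total += sat / n_points
    return total / n_formulas
```

With t = 16 and k = log_2 t = 4, each term is satisfied with probability roughly 1/16, so Pr[f] is roughly 1 − (1 − 1/16)^{16} ≈ 0.6 for moderate n.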
Lemma 3.2. For n sufficiently large, if (k, t) is monotone α-interesting then α·2^k ≤ t ≤ 2^{k+1} ln(2/α).
Proof. One side is easy: if t < α·2^k then each of the t terms of f is satisfied by a uniform random example
with probability at most α/t, and consequently Pr[f(x) = 1] ≤ α. Note that by our assumptions on t and
α we thus have that k = O(log n) for any monotone α-interesting pair (k, t).
We now show that if t > 2^{k+1} ln(2/α), then

E_{f∈M_n^{t,k}}[Pr[f]] > 1 − α.

Let us write |x| to denote x_1 + · · · + x_n for x ∈ {0,1}^n. It is easy to see that Pr[f(x) = 1], viewed as a
random variable over the choice of f ∈ M_n^{t,k}, depends only on the value of |x|. We have

E_{f∈M_n^{t,k}}[Pr[f]] = Σ_{r=0}^{n} E_{f∈M_n^{t,k}}[Pr[f(x) = 1 | |x| = r]] · Pr[|x| = r].
A standard tail bound on the binomial distribution (which can be derived, e.g., from the results in [24])
implies that

Pr_{x∈U_n}[|x| ≤ n/2 − √(n log(2/α))] < α/2.

Thus it suffices to show that for any x with |x| ≥ n/2 − √(n log(2/α)), we have

Pr_{f∈M_n^{t,k}}[f(x) = 1] ≥ 1 − α/2.
Fix an x ∈ {0,1}^n with |x| = w ≥ n/2 − √(n log(2/α)). Let T_1 be a random monotone term of length
k. We have

Pr_{T_1}[T_1(x) = 1] = w(w − 1) · · · (w − k + 1) / (n(n − 1) · · · (n − k + 1)) ≥ 1/2^{k+1},
where the inequality holds for sufficiently large n using the fact that k = O(log n) and α = Θ(1). Since
the terms of f are chosen independently, this implies that

Pr_f[f(x) = 0] ≤ (1 − 1/2^{k+1})^t ≤ exp(−t/2^{k+1}).

If t/2^{k+1} > ln(2/α) then this bound is at most α/2.
DNF expressions with either a constant number of terms or a constant number of variables per term
have long been known to be efficiently learnable [26] (this holds for non-monotone as well as monotone
DNF, and in fact holds for learning with respect to arbitrary distributions, not only uniform). So we
will assume throughout that both t and k are ω(1); many of our probability bounds are only meaningful
given such an assumption, and some of our lemmas explicitly depend on t and/or k being larger than a
certain (small) constant. While this assumption is sufficient for our purposes, we note briefly that in fact
a stronger assumption can be made concerning t. If t grows very slowly relative to n, say t = O(n^{1/4}),
then with high probability a random f drawn from M_n^{t,k} will have the property that every variable in f
appears in exactly one term. Such a read-once DNF, even if it is non-monotone, is learnable with respect
to the uniform distribution [18]. Thus, we can actually think of t as growing reasonably quickly with n.
3.2 Properties of random monotone DNF
Throughout the rest of Section 3 we assume that α > 0 is fixed and (k,t) is a monotone α-interesting
pair where t = O(n^{2−γ}) for some γ > 0. In this section we develop some useful lemmas regarding M_n^{t,k}.
We first prove the following lemma, which will be useful in subsequent proofs. This lemma does
not require that f be drawn from M_n^{t,k}.
Lemma 3.3. Any monotone DNF f with t ≥ 2 terms, each of size k, satisfies Pr[f̄] ≥ α^3/4, where f̄ denotes the negation of f.
Proof. We write T_1, T_2, ..., T_t to denote the terms of f. We have

Pr[f̄] = Pr[T̄_1 ∧ T̄_2 ∧ · · · ∧ T̄_t] = Pr[T̄_1 | T̄_2 ∧ · · · ∧ T̄_t] · Pr[T̄_2 | T̄_3 ∧ · · · ∧ T̄_t] · · · Pr[T̄_{t−1} | T̄_t] · Pr[T̄_t]

≥ ∏_{i=1}^{t} Pr[T̄_i]     (3.1)

= (1 − 1/2^k)^t ≥ (1 − 1/2^k)^{2^{k+1} ln(2/α)} ≥ (1/4)^{2 ln(2/α)} = (α/2)^{2 ln 4} ≥ α^3/4.

The first inequality (3.1) holds since Pr[f(x) = 1 | g(x) = 1] ≥ Pr[f(x) = 1] for any monotone Boolean
functions f, g on {0,1}^n (see e.g. Corollary 7, p. 149 of [9]), and equivalently Pr[f̄(x) = 1 | ḡ(x) = 1] ≥ Pr[f̄(x) = 1]. The second inequality holds by Lemma 3.2.
The third inequality holds since (1 − 1/x)^x ≥ 1/4 for all x ≥ 2, and the fourth follows from the restriction
α ≤ .09.
Let f^i denote the projected function obtained from f by first removing term T_i from the monotone
DNF for f and then restricting all of the variables which were present in term T_i to 1. For ℓ ≠ i we
write T_ℓ^i to denote the term obtained by setting all variables in T_i to 1 in T_ℓ, i.e. T_ℓ^i is the term in f^i
corresponding to T_ℓ. Note that if T_ℓ^i ≢ T_ℓ then T_ℓ^i is smaller than T_ℓ.
The following lemma shows that each variable appears in a limited number of terms and that therefore
not too many terms T_ℓ^i in f^i are smaller than their corresponding terms T_ℓ in f. In this and later
lemmas, “n sufficiently large” means that n is larger than a constant which depends on α but not on k or
t.
Lemma 3.4. Let

δ_many := n · (e k t^{3/2} log t / (n 2^{k−1} α^2))^{2^{k−1} α^2/(√t log t)}.

For n sufficiently large, with probability at least 1 − δ_many over the draw of f from M_n^{t,k}, both of the
following conditions hold:

• Every variable v_j, 1 ≤ j ≤ n, appears in at most 2^{k−1} α^2/(√t log t) terms of f; and

• For all 1 ≤ i ≤ t, at most k 2^{k−1} α^2/(√t log t) terms T_ℓ^i with ℓ ≠ i in the projection f^i are smaller
than the corresponding terms T_ℓ in f.
Note that since (k, t) is a monotone α-interesting pair and t = O(n^{2−γ}) for some fixed γ > 0, for
sufficiently large n this probability bound is non-trivial.
Proof of Lemma 3.4. We first prove that with high probability every variable appears in a limited number
of terms. Fix any variable v_j. For each term T_ℓ we have that v_j occurs in T_ℓ with probability k/n. Since
the terms are chosen independently, the number of occurrences of v_j is binomially distributed according
to B(t, p) with p = k/n. Taking β = n 2^{k−1} α^2/(k t^{3/2} log t) in the Chernoff bound (which is greater than
1 for sufficiently large n), the probability that v_j appears in βpt = 2^{k−1} α^2/(√t log t) or more terms is at
most

(e k t^{3/2} log t / (n 2^{k−1} α^2))^{2^{k−1} α^2/(√t log t)}.

The lemma follows by the union bound over the n variables v_j.
For the bound on the number of terms in f^i smaller than those in f, simply note that if every variable
appears in at most 2^{k−1} α^2/(√t log t) terms then, since there are k variables in term T_i, at most
k 2^{k−1} α^2/(√t log t) terms T_ℓ^i with ℓ ≠ i in f^i are smaller than the corresponding terms T_ℓ in f.
The next lemma shows that there is probably little overlap between any pair of terms in f:

Lemma 3.5. Let δ_shared := t^2 (k^2/n)^{log log t}. With probability at least 1 − δ_shared over the random draw of
f from M_n^{t,k}, for all 1 ≤ i < j ≤ t no set of log log t or more variables belongs to two distinct terms T_i and
T_j in f.
Proof. We are interested in upper bounding the probability p_i that log log t or more of the variables in
a fixed term T_i belonging to f also appear in some other term T_ℓ of f, for any ℓ ≠ i. First, a simple
counting argument shows that the probability that a fixed set of log log t variables appears in a set of k
variables randomly chosen from among n variables is at most (k/n)^{log log t}. Since there are (k choose log log t) ways
to choose a fixed set of log log t variables from term T_i, we have

p_i ≤ (t − 1) · (k choose log log t) · (k/n)^{log log t}.

The lemma follows by the union bound over the t probabilities p_i.
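The counting argument in this proof can be compared against the exact hypergeometric overlap probability for two independent random terms. A small sketch (ours, not part of the paper):

```python
import math

def overlap_tail(n, k, s):
    # Exact Pr[|A ∩ B| >= s] for two independent uniform k-subsets A, B
    # of an n-element ground set (the shared-variables event of Lemma 3.5).
    return sum(
        math.comb(k, j) * math.comb(n - k, k - j) for j in range(s, k + 1)
    ) / math.comb(n, k)

def counting_bound(n, k, s):
    # the C(k, s) * (k/n)^s bound from the proof, for a single pair of terms
    return math.comb(k, s) * (k / n) ** s
```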
Using the preceding lemmas, we can show that for f drawn from M_n^{t,k}, with high probability each
term is “uniquely satisfied” by a noticeable fraction of assignments. More precisely, we have:

Lemma 3.6. Let δ_usat := δ_many + δ_shared. For n sufficiently large and k ≥ 5, with probability at least
1 − δ_usat over the random draw of f from M_n^{t,k}, f is such that for all i = 1, ..., t we have

Pr_x[T_i is satisfied by x but no other T_j is satisfied by x] ≥ α^3/2^{k+3}.
Proof. Given an f drawn according to M_n^{t,k} and given any term T_i in f, we are interested in the probability
over uniformly drawn instances that T_i is satisfied and T_ℓ is not satisfied for all ℓ ≠ i. Let T̄_{ℓ≠i}
represent the formula that is satisfied by an assignment x if and only if all of the T_ℓ with ℓ ≠ i are not
satisfied by x. We want a lower bound on

Pr[T_i ∧ T̄_{ℓ≠i}] = Pr[T̄_{ℓ≠i} | T_i] · Pr[T_i].

Since Pr[T_i] = 1/2^k, what remains is to show that with very high probability over the random draw of f,
Pr[T̄_{ℓ≠i} | T_i] is bounded below by α^3/8 for all T_i. That is, we need to show that Pr[f̄^i] ≥ α^3/8 with very
high probability.
We have that all of the following statements hold with probability at least 1 − δ_usat, for every 1 ≤ i ≤ t,
for a random f from M_n^{t,k}:

1. Pr[f̄^i] ≥ ∏_{ℓ: ℓ≠i} Pr[T̄_ℓ^i]: this follows from Equation (3.1) in the proof of Lemma 3.3.

2. ∏_{ℓ: T_ℓ^i ≡ T_ℓ} Pr[T̄_ℓ^i] > α^3/4. This holds because the terms in this product are a subset of the terms in
Equation (3.1) (in the proof of Lemma 3.3).

3. At most k 2^{k−1} α^2/(√t log t) terms T_ℓ with ℓ ≠ i are smaller in f^i than they are in f (by Lemma 3.4).

4. No term in f^i has fewer than k − log log t variables (by Lemma 3.5).

These conditions together imply that

Pr[f̄^i] ≥ (α^3/4) · (1 − log t/2^k)^{k 2^{k−1} α^2/(√t log t)}.

Note that kα^2/(2√t) ≤ 1/2 for all k ≥ 5, since for such k we have k^2 α^4 ≤ α k^2 < α 2^k ≤ t. Thus, since
(1 − 1/x)^x ≥ 1/4 for all x ≥ 2, we have that Pr[f̄^i] ≥ (α^3/4) · (1/4)^{kα^2/(2√t)} ≥ α^3/8.
On the other hand, we can upper bound the probability that two terms of a random DNF f will be
satisfied simultaneously:
Lemma 3.7. With probability at least 1 − δ_shared over the random draw of f from M_n^{t,k}, for all 1 ≤ i <
j ≤ t we have Pr[T_i ∧ T_j] ≤ log t/2^{2k}.
Proof. By Lemma 3.5, with probability at least 1 − δ_shared, f is such that for all 1 ≤ i < j ≤ t, terms T_i
and T_j share at most log log t variables. Thus for each pair of terms, a specific set of at least 2k − log log t
variables must be simultaneously set to 1 in an instance in order for both terms to be satisfied.
3.3 Identifying co-occurring variables
We now show how to identify pairs of variables that co-occur in some term of f . First, some notation.
Given a monotone DNF f over variables v1 , . . . , vn , define DNF formulas g∗∗ , g1∗ , g∗1 , and g11 over
variables v3 , . . . , vn as follows:
• g∗∗ is the disjunction of the terms in f that contain neither v1 nor v2 ;
• g1∗ is the disjunction of the terms in f that contain v1 but not v2 (but with v1 removed from each
of these terms);
• g∗1 is defined similarly as the disjunction of the terms in f that contain v2 but not v1 (but with v2
removed from each of these terms);
• g11 is the disjunction of the terms in f that contain both v1 and v2 (with both variables removed
from each term).
We thus have f = g∗∗ ∨ (v1 g1∗ ) ∨ (v2 g∗1 ) ∨ (v1 v2 g11 ). Note that any of g∗∗ , g1∗ , g∗1 , g11 may be an empty
disjunction which is identically false.
We can empirically estimate each of the following using uniform random examples (x, f(x)):

p00 := Pr_x[g∗∗] = Pr_{x∈U_n}[f(x) = 1 | x_1 = x_2 = 0]
p01 := Pr_x[g∗∗ ∨ g∗1] = Pr_{x∈U_n}[f(x) = 1 | x_1 = 0, x_2 = 1]
p10 := Pr_x[g∗∗ ∨ g1∗] = Pr_{x∈U_n}[f(x) = 1 | x_1 = 1, x_2 = 0]
p11 := Pr_x[g∗∗ ∨ g∗1 ∨ g1∗ ∨ g11] = Pr_{x∈U_n}[f(x) = 1 | x_1 = 1, x_2 = 1].
It is clear that g11 is nonempty if and only if v_1 and v_2 co-occur in some term of f; thus we would
ideally like to obtain Pr_{x∈U_n}[g11]. While we cannot obtain this probability from p00, p01, p10, and p11,
the following lemma shows that we can estimate a related quantity:
Lemma 3.8. Let P denote p11 − p10 − p01 + p00. Then P = Pr[g11 ∧ ḡ1∗ ∧ ḡ∗1 ∧ ḡ∗∗] − Pr[g1∗ ∧ g∗1 ∧ ḡ∗∗].
Proof. P gets a net contribution of 0 from those x which belong to g∗∗ (since each such x is added twice
and subtracted twice in P). We proceed to analyze the contributions to P from the remaining 8 subsets
of the events g11, g1∗, and g∗1:

• P gets a net contribution of 0 from those x which are in g1∗ ∧ ḡ∗1 ∧ ḡ∗∗, since each such x is counted
in p11 and p10 but not in p01 or p00. Similarly, P gets a net contribution of 0 from those x which
are in g∗1 ∧ ḡ1∗ ∧ ḡ∗∗.

• P gets a net contribution of Pr[g11 ∧ ḡ1∗ ∧ ḡ∗1 ∧ ḡ∗∗] since each such x is counted only in p11.

• P gets a net contribution of −Pr[g1∗ ∧ g∗1 ∧ ḡ∗∗] since each such x is counted in p01, p10, and p11.
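The identity of Lemma 3.8 can be verified by exhaustive enumeration on a small example. In the sketch below (ours; the example formulas are hypothetical), v_1 and v_2 correspond to indices 0 and 1, and a monotone term is a set of variable indices:

```python
import itertools

def sat(term, x):
    # a monotone term (set of variable indices) is satisfied when all are 1
    return all(x[v] for v in term)

def lemma38_sides(terms, n):
    # Compute both sides of Lemma 3.8 for a monotone DNF by exhaustive
    # enumeration of {0,1}^n; returns (P, RHS) which the lemma says are equal.
    p = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    rhs = 0
    for x in itertools.product((0, 1), repeat=n):
        g11 = any(sat(T - {0, 1}, x) for T in terms if {0, 1} <= T)
        g1s = any(sat(T - {0}, x) for T in terms if 0 in T and 1 not in T)
        gs1 = any(sat(T - {1}, x) for T in terms if 1 in T and 0 not in T)
        gss = any(sat(T, x) for T in terms if 0 not in T and 1 not in T)
        p[x[0], x[1]] += any(sat(T, x) for T in terms)
        rhs += ((g11 and not g1s and not gs1 and not gss)
                - (g1s and gs1 and not gss))
    half = 2 ** (n - 2)  # assignments per (x_1, x_2) pattern
    P = (p[1, 1] - p[1, 0] - p[0, 1] + p[0, 0]) / half
    return P, rhs / 2 ** n
```

When v_1 and v_2 never co-occur, g11 is empty and the identity forces P ≤ 0, matching part (i) of Lemma 3.9.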
More generally, let Pi j be defined as P but with vi , xi , v j , and x j substituted for v1 , x1 , v2 , and
x2 , respectively, throughout the definitions of the g’s and p’s above. The reader familiar with Boolean
Fourier analysis will readily recognize that Pi j is a scaled (by a factor of −2) version of the second-
order Fourier coefficient of f corresponding to the pair of variables (vi , v j ). (This coefficient is equal to
2 Prx∈Un [ f (x) = xi ⊕ x j ] − 1; see [23] for a nice overview of Boolean Fourier analysis in the context of
uniform-distribution learning.) The following lemma shows that, for most random choices of f , for all
1 ≤ i, j ≤ n, the value of Pi j is a good indicator of whether or not vi and v j co-occur in some term of f :
Lemma 3.9. For n sufficiently large and t ≥ 16, with probability at least 1 − δ_usat − δ_shared − δ_many
over the random draw of f from M_n^{t,k}, we have that for all 1 ≤ i, j ≤ n: (i) if v_i and v_j do not co-occur in
some term of f then P_ij ≤ 0; (ii) if v_i and v_j do co-occur in some term of f then P_ij ≥ α^4/(16t).
Proof. Part (i) holds for any monotone DNF by Lemma 3.8. For (ii), we first note that with probability
at least 1 − δ_usat − δ_shared − δ_many, a randomly chosen f has all the following properties:

1. Each term in f is uniquely satisfied with probability at least α^3/2^{k+3} (by Lemma 3.6);

2. Each pair of terms T_i and T_j in f is simultaneously satisfied with probability at most log t/2^{2k} (by
Lemma 3.7); and

3. Each variable in f appears in at most 2^{k−1} α^2/(√t log t) terms (by Lemma 3.4).
We call such an f well-behaved. For the sequel, assume that f is well-behaved, and also assume without
loss of generality that i = 1 and j = 2. We consider separately the two probabilities

ρ1 = Pr[g11 ∧ ḡ1∗ ∧ ḡ∗1 ∧ ḡ∗∗] and ρ2 = Pr[g1∗ ∧ g∗1 ∧ ḡ∗∗]

whose difference defines P12 = P. By property (1) above, ρ1 ≥ α^3/2^{k+3}, since each instance x that
uniquely satisfies a term T_j in f containing both v_1 and v_2 also satisfies g11 while falsifying all of g1∗,
g∗1, and g∗∗. Since (k, t) is monotone α-interesting, this implies that ρ1 ≥ α^4/8t. On the other hand,
clearly ρ2 ≤ Pr[g1∗ ∧ g∗1]. By property (2) above, for any pair of terms consisting of one term from
g1∗ and the other from g∗1, the probability that both terms are satisfied is at most log t/2^{2k}. Since each
of g1∗ and g∗1 contains at most 2^{k−1} α^2/(√t log t) terms by property (3), by a union bound we have
ρ2 ≤ α^4/(4t log t), and the lemma follows:

ρ1 − ρ2 ≥ α^4/8t − α^4/(4t log t) ≥ α^4/16t

given the assumption that t ≥ 16.
Thus, our algorithm for finding all of the co-occurring pairs of a randomly chosen monotone DNF
consists of estimating P_ij for each of the n(n − 1)/2 pairs (i, j) so that, with probability at least 1 − δ,
all of our estimates are within an additive factor of α^4/(32t) of their true values. Recalling that
each P_ij is a scaled version of a second-order Fourier coefficient, by the standard Hoeffding bound a
uniform random sample of size O(t^2 ln(n^2/δ)/α^8) suffices to estimate all of the P_ij's to the specified
tolerance with overall probability at least 1 − δ. We thus have the following theorem for monotone α-
interesting (k, t) with t = O(n^{2−γ}):
Theorem 3.10. For n sufficiently large and any δ > 0, with probability at least 1 − δ_usat − δ_shared −
δ_many − δ over the choice of f from M_n^{t,k} and the choice of random examples, our algorithm runs in
O(n^2 t^2 log(n/δ)) time and identifies exactly those pairs (v_i, v_j) which co-occur in some term of f.
3.4 Forming a hypothesis from pairs of co-occurring variables
Here we show how to construct an accurate DNF hypothesis for a random f drawn from M_n^{t,k}.
Identifying k-cliques. By Theorem 3.10, with high probability we have complete information about
which pairs of variables (vi , v j ) co-occur in some term of f . We thus may consider the graph G with
vertices v1 , . . . , vn and edges for precisely those pairs of variables (vi , v j ) which co-occur in some term
of f . This graph is a union of t randomly chosen k-cliques from {v1 , . . . , vn } which correspond to the t
terms in f ; we will call these the f -cliques of G. Ideally, if we could identify exactly the t f -cliques of
G, then we could exactly reconstruct f . While we do not know how to accomplish this, we do show how
to find a set of k-cliques corresponding to a set of terms whose union closely approximates f .
Specifically, we will show how to efficiently identify (with high probability over the choice of f and
random examples of f ) a set of k-cliques in G that contains as a subset the set of all of the f -cliques in
G. Once these k-cliques have been identified, as we show later it is easy to construct an accurate DNF
hypothesis for f .
The following lemma shows that with high probability over the choice of f, each pair (v_i, v_j) co-occurs in at most a constant number of terms:
Lemma 3.11. Let δ_C := (tk²/n²)^C (δ_C is a function of C as well as of t, k, and n) and fix 1 ≤ i < j ≤ n. For any C ≥ 0 and all sufficiently large n, we have

$$\Pr_{f \in M^{t,k}_n}\bigl[v_i \text{ and } v_j \text{ co-occur in more than } C \text{ terms of } f\bigr] \;\le\; \delta_C\,.$$
Proof. For any fixed r ∈ {1, . . . , t} we have that

$$\Pr[v_i \text{ and } v_j \text{ co-occur in term } T_r] = \frac{k(k-1)}{n(n-1)} \le \frac{k^2}{n^2}\,.$$

Since these events are independent for all r, the probability that there is any collection of C terms such that v_i and v_j co-occur in all C of these terms is at most

$$\binom{t}{C}\cdot\left(\frac{k^2}{n^2}\right)^{C} \le \left(\frac{tk^2}{n^2}\right)^{C}.$$
By Lemma 3.11 we know that, for any given pair (v_i, v_j) of variables, with probability at least 1 − δ_C there are at most Ck other variables v_ℓ such that (v_i, v_j, v_ℓ) all co-occur in some term of f. Suppose that we can efficiently (with high probability) identify the set S_{ij} of all such variables v_ℓ. Then we can perform an exhaustive search over all (k − 2)-element subsets S′ of S_{ij} in at most $\binom{Ck}{k} \le (eC)^k = n^{O(\log C)}$ time, and can identify all of the sets S′ such that S′ ∪ {v_i, v_j} is a clique of size k in G that includes both v_i and v_j. Repeating this over all pairs of variables (v_i, v_j), we can with high probability identify a set G_k of k-cliques in G such that G_k contains all of the f-cliques.
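The exhaustive search over (k − 2)-element subsets of S_{ij} can be sketched as follows; this is a self-contained toy, and the graph representation and helper names are ours, not the paper's.

```python
from itertools import combinations

def cliques_containing_pair(edges, S_ij, i, j, k):
    """Enumerate all k-cliques {v_i, v_j} + S' with S' a (k-2)-subset of S_ij.
    `edges` is a set of frozensets {a, b}; S_ij is the candidate neighbour set."""
    adjacent = lambda a, b: frozenset((a, b)) in edges
    found = []
    for S_prime in combinations(sorted(S_ij), k - 2):
        nodes = (i, j) + S_prime
        # keep S' only if every pair inside {v_i, v_j} + S' is an edge
        if all(adjacent(a, b) for a, b in combinations(nodes, 2)):
            found.append(frozenset(nodes))
    return found

# Toy graph: one 4-clique {0,1,2,3} plus a stray edge (0,4).
edges = {frozenset(e) for e in [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3),(0,4)]}
```

With S_{01} = {2, 3, 4} and k = 4, only {0, 1, 2, 3} survives the pairwise-edge check.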
Thus, to identify Gk , it remains only to show that for every pair of variables vi and v j , we can
determine the set Si j of those variables v` that co-occur in at least one term with both vi and v j . Assume
that f is such that all pairs of variables co-occur in at most C terms, and let T be a set of variables of
cardinality at most C having the following properties:
• In the projection f_{T←0} of f in which all of the variables of T are fixed to 0, v_i and v_j do not co-occur in any term; and

• For every set T′ ⊂ T such that |T′| = |T| − 1, v_i and v_j do co-occur in f_{T′←0}.
Then T is clearly a subset of S_{ij}. Furthermore, if we can identify all such sets T, then their union will be S_{ij}. There are only O(n^C) possible sets to consider, so our problem now reduces to the following: given a set T of at most C variables, determine whether v_i and v_j co-occur in f_{T←0}.
The proof of Lemma 3.9 shows that f is well-behaved with probability at least 1 − δ_usat − δ_shared − δ_many over the choice of f. Furthermore, if f is well-behaved then it is easy to see that for every |T| ≤ C, f_{T←0} is also well-behaved, since f_{T←0} is just f with O(√t) terms removed (by Lemma 3.4). That is, removing terms from f can only make it more likely that the remaining terms are uniquely satisfied, does not change the bound on the probability of a pair of remaining terms being satisfied, and can only decrease the bound on the number of remaining terms in which a remaining variable can appear. Furthermore, Lemma 3.8 holds for any monotone DNF f. Therefore, if f is well-behaved then the proof of Lemma 3.9 also shows that for every |T| ≤ C, the P_{ij}'s of f_{T←0} can be used to identify the co-occurring pairs of variables within f_{T←0}. It remains to show that we can efficiently simulate a uniform example oracle for f_{T←0} so that these P_{ij}'s can be accurately estimated.
In fact, for a given set T, we can simulate a uniform example oracle for f_{T←0} by filtering the examples from the uniform oracle for f so that only examples setting the variables in T to 0 are accepted. Since |T| ≤ C, the filter accepts with constant probability at least 1/2^C. A Chernoff argument shows that if all P_{ij}'s are estimated using a single sample of size 2^{C+12} t² ln(2(C + 2)n^C/δ)/α⁸ (filtered appropriately when needed) then all of the estimates will have the desired accuracy with probability at least 1 − δ.
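The rejection-filtering simulation of the oracle for f_{T←0} can be sketched as follows; the helper names and the toy oracle are ours, for illustration only.

```python
import random

def filtered_examples(draw_example, T, m):
    """Simulate a uniform example oracle for f_{T<-0}: keep only draws that
    already set every variable in T to 0 (rejection filtering).
    `draw_example` returns (x, f(x)); a draw is accepted w.p. 2^{-|T|}."""
    out = []
    while len(out) < m:
        x, y = draw_example()
        if all(x[v] == 0 for v in T):
            out.append((x, y))
    return out

# Toy oracle over n = 6 variables for f = x0 x1 OR x2 x3.
random.seed(1)
def oracle():
    x = tuple(random.randint(0, 1) for _ in range(6))
    return x, int((x[0] and x[1]) or (x[2] and x[3]))

sample = filtered_examples(oracle, T={2, 3}, m=500)
```

With T = {2, 3} fixed to 0, every accepted example is labeled by the restriction x0 x1, exactly as f_{T←0} prescribes.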
In somewhat more detail, the G_k-finding algorithm can be written as:

• Given: α, γ, C, δ.

• (Note that f is well-behaved and has the "each pair co-occurs in at most C terms" property with probability at least 1 − δ_usat − δ_shared − δ_many − δ_C. So assume this of f for the remainder of the algorithm.)

• Draw a set S of O(t² log(n/δ)) examples of f.

• For 1 ≤ i < j ≤ n (fewer than n² iterations):
  – Estimate P_{ij} over S (O(|S|) time).
  – Add (v_i, v_j) to the set of co-occurring pairs if the estimated P_{ij} exceeds the threshold α⁴/(32t).

• For each |T| ≤ C (at most n^C iterations):
  – For each co-occurring pair (v_i, v_j) disjoint from T (fewer than tk² iterations):
    (1) Estimate P_{ij} over the (T←0)-filtered S (O(|S|) time).
    (2) For each subset T′ ⊂ T with |T′| = |T| − 1 (at most C iterations):
        (i) Estimate P_{ij} over the (T′←0)-filtered S (O(|S|) time).
    ∗ Add T to S_{ij} if it passes all threshold tests, i.e. the estimate from (1) is at least α⁴/(32t) and each estimate from (2)(i) is at most α⁴/(32t) (O(C) time).

• For each co-occurring pair (v_i, v_j) (fewer than tk² iterations):
  – For each (k − 2)-size subset U of S_{ij} (n^{O(log C)} iterations):
    ∗ Test whether the union of {v_i, v_j} and U is a clique (O(k²) time).
The time bound for this algorithm is then

$$O\bigl(|S| + n^2|S| + n^C t k^2 C |S| + t k^2 n^{O(\log C)} k^2\bigr)\,,$$

which is dominated by the third term if C ≥ 2.
The sample complexity |S| is derived as follows. We need a sample large enough to

• succeed for all n² tests for co-occurring pairs (over the full sample), and

• succeed for all n^C(C + 1) tests over filtered examples.

The total number of tests if C ≥ 2 is bounded by O((C + 2)n^C). Recalling that our estimates need to be accurate to within an additive α⁴/(32t), we see that if all tests are run over samples of size m = 2^{11} t² ln(2(C + 2)n^C/δ)/α⁸ then, by Hoeffding and the union bound, all tests succeed with probability at least 1 − δ/2.
We want |S| large enough so that all (C + 1)n^C filtered samples will be of size m with probability 1 − δ/2. If a filter accepts with probability p, then over a sample of size 2m/p the probability that fewer than m examples are accepted is at most e^{−m/4} by Chernoff. Using the m given in the previous paragraph and the union bound, it can be seen that choosing |S| = 2m/p gives us the desired probability of success over all tests.
Thus, since we are using p = 1/2^C in the filtering, the final time bound of the algorithm becomes (for arbitrary C ≥ 2) O((2n)^C t³k² log(Cn^C/δ)). This gives us the following:
Theorem 3.12. For n sufficiently large, any δ > 0, and any fixed C ≥ 2, with probability at least 1 − δ_usat − δ_shared − δ_many − δ_C − δ over the random draw of f from M^{t,k}_n and the choice of random examples, a set G_k containing all of the f-cliques of G can be identified in time n^{O(C)} t³ k² log(n/δ).
The main learning result for monotone DNF. From G_k we construct in the obvious way a list T′₁, . . . , T′_N (with N = |G_k| = O(n^C)) of length-k monotone terms that contains all t true terms T₁, . . . , T_t of f. Now note that the target function f is simply an OR of some subset of these N "variables" T′₁, . . . , T′_N, so the standard elimination algorithm for PAC learning disjunctions (under any distribution) can be used to PAC learn the target function. The algorithm requires O((1/ε) log(1/δ) + N/ε) examples and runs in time linear in its sample size; see e.g. Chapters 1 and 2 of [20].
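The elimination algorithm referenced above can be sketched as follows; this is a minimal illustration over a toy formula, with terms represented as sets of variable indices (a representation of ours, not the paper's).

```python
import itertools

def learn_disjunction(candidate_terms, examples):
    """Standard elimination algorithm for learning an OR of 'variables'
    (here: candidate monotone terms, each a frozenset of variable indices):
    discard every term that fires on a negative example, keep the rest."""
    satisfies = lambda term, x: all(x[v] for v in term)
    alive = list(candidate_terms)
    for x, y in examples:
        if y == 0:
            alive = [t for t in alive if not satisfies(t, x)]
    return lambda x: any(satisfies(t, x) for t in alive), alive

# Toy: true f = x0 x1 OR x2 x3 hidden among four candidate terms.
candidates = [frozenset({0, 1}), frozenset({2, 3}),
              frozenset({0, 2}), frozenset({1, 3})]
f = lambda x: int((x[0] and x[1]) or (x[2] and x[3]))
data = [(x, f(x)) for x in itertools.product((0, 1), repeat=4)]
h, alive = learn_disjunction(candidates, data)
```

True terms of f can never fire on a negative example, so they are never eliminated; here the two spurious candidates are removed and h agrees with f everywhere.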
Call the above described entire learning algorithm A. In summary, we have proved the following:
Theorem 3.13. Fix γ, α > 0 and C ≥ 2. Let (k, t) be a monotone α-interesting pair. For any ε > 0, δ > 0, and t = O(n^{2−γ}), algorithm A will with probability at least 1 − δ_usat − δ_shared − δ_many − δ_C − δ (over the random choice of DNF from M^{t,k}_n and the randomness of the example oracle) produce a hypothesis h that ε-approximates the target with respect to the uniform distribution. Algorithm A runs in time polynomial in n, log(1/δ), and 1/ε.
4 Non-monotone DNF
4.1 Interesting parameter settings
As with M^{t,k}_n, we are interested in pairs (k, t) for which E_{f∈D^{t,k}_n}[Pr[f]] is between α and 1 − α:

Definition 4.1. For α > 0, the pair (k, t) is said to be α-interesting if α ≤ E_{f∈D^{t,k}_n}[Pr[f]] ≤ 1 − α.
For any fixed x ∈ {0, 1}^n we have

$$\Pr_{f \in D^{t,k}_n}[f(x) = 0] = \left(1 - \frac{1}{2^k}\right)^{t}, \quad\text{and thus}\quad \mathbf{E}_{f \in D^{t,k}_n}[\Pr[f]] = 1 - \left(1 - \frac{1}{2^k}\right)^{t}$$

by linearity of expectation; this formula will be useful later.
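The closed form above is easy to evaluate numerically; a small helper (ours, for illustration) for checking whether a given (k, t) is α-interesting:

```python
def expected_density(k, t):
    """E_{f ~ D^{t,k}_n}[Pr_x[f(x) = 1]] = 1 - (1 - 2^{-k})^t: at any fixed
    input x, each of the t independent random terms misses x w.p. 1 - 2^{-k}."""
    return 1.0 - (1.0 - 2.0 ** (-k)) ** t

def is_alpha_interesting(k, t, alpha):
    return alpha <= expected_density(k, t) <= 1.0 - alpha
```

For instance, with k = 10 the expected density crosses 1/2 near t ≈ 2^k ln 2 ≈ 710, so (10, 710) is 1/4-interesting while (10, 1) is not.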
Throughout the rest of Section 4 we assume that α > 0 is fixed and (k,t) is an α-interesting pair
where t = O(n3/2−γ ) for some γ > 0.
4.2 Properties of random DNF
In this section we develop analogues of Lemmas 3.6 and 3.7 for D^{t,k}_n. The D^{t,k}_n analogue of Lemma 3.7 follows directly from the proof of Lemma 3.7, and we have:

Lemma 4.2. With probability at least 1 − δ_shared over the random draw of f from D^{t,k}_n, for all 1 ≤ i < j ≤ n, Pr[T_i ∧ T_j] ≤ log t/2^{2k}.
In the following lemma we use McDiarmid's bound to prove a D^{t,k}_n version of Lemma 3.6:

Lemma 4.3. Let

$$\delta'_{\mathrm{usat}} := t\left[(t-1)\left(\frac{k^2}{n}\right)^{\log\log t} + \exp\!\left(\frac{-\alpha^2 t}{16\ln^2(2/\alpha)\log^2 t}\right)\right].$$

With probability at least 1 − δ′_usat, a random f drawn from D^{t,k}_n is such that for each i = 1, . . . , t, we have

$$P_i \equiv \Pr_x[T_i \text{ is satisfied by } x \text{ but no other } T_j \text{ is satisfied by } x] \;\ge\; \frac{\alpha}{2^{k+1}}\,.$$
Proof. We show that P₁ ≥ α/2^{k+1} with probability at least 1 − δ′_usat/t; the lemma follows by a union bound. We first show that E_{f∈D^{t,k}_n}[P₁] ≥ α/2^k. For any fixed x ∈ T₁, we have

$$\Pr[\overline{T_2}(x) \wedge \cdots \wedge \overline{T_t}(x)] = (1 - 2^{-k})^{t-1} > (1 - 2^{-k})^{t} \ge \alpha$$

where the last inequality holds since (k, t) is α-interesting. Since a 2^{−k} fraction of all x ∈ {0, 1}^n belong to T₁, by linearity of expectation we have E_{f∈D^{t,k}_n}[P₁] ≥ α/2^k.
Now we show that with high probability the deviation of P₁ from its expected value is low. Given any fixed length-k term T₁, let Ω denote the set of all length-k terms T which satisfy Pr[T₁ ∧ T] ≤ (log t)/2^{2k}. By reasoning as in the proof of Lemma 4.2, with probability at least 1 − (t − 1)(k²/n)^{log log t} each of T₂, . . . , T_t belongs to Ω, so we henceforth assume that this is in fact the case, i.e. we condition on the event {T₂, . . . , T_t} ⊂ Ω. Note that under this conditioning we have that each of T₂, . . . , T_t is selected uniformly and independently from Ω. Note also that this conditioning can change the value of P₁ (a probability) by at most (t − 1)(k²/n)^{log log t} < α/2^{k+2}, so under this conditioning we have E[P₁] ≥ (3/4)·(α/2^k).
We now use McDiarmid's inequality where the random variables are the randomly selected terms T₂, . . . , T_t from Ω and F(T₂, . . . , T_t) denotes P₁, i.e.

$$F(T_2, \ldots, T_t) = \Pr_x[T_1 \text{ is satisfied by } x \text{ but no } T_j \text{ with } j \ge 2 \text{ is satisfied by } x]\,.$$

Since each T_j belongs to Ω, we have

$$|F(T_2, \ldots, T_t) - F(T_2, \ldots, T_{j-1}, T_j', T_{j+1}, \ldots, T_t)| \;\le\; c_j = \frac{\log t}{2^{2k}}$$

for all j = 2, . . . , t. Taking τ = (1/4)·(α/2^k), McDiarmid's inequality implies that Pr[P₁ < α/2^{k+1}] is at most

$$\exp\!\left(\frac{-\alpha^2/(16 \cdot 2^{2k})}{(t-1)\bigl(\log t/2^{2k}\bigr)^2}\right) = \exp\!\left(\frac{-\alpha^2\, 2^{2k}}{16(t-1)\log^2 t}\right) \le \exp\!\left(\frac{-\alpha^2\, 2^{2k}}{16\, t \log^2 t}\right) \le \exp\!\left(\frac{-\alpha^2 t}{16\ln^2(2/\alpha)\log^2 t}\right)$$

where the last inequality holds since (k, t) is α-interesting. Combining all the failure probabilities, the lemma is proved.
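For reference (our restatement, not part of the original argument), the bounded-differences inequality invoked here, and again in Lemma 4.7, is:

```latex
% One-sided bounded-differences (McDiarmid) inequality: if changing any
% single coordinate T_i changes F by at most c_i, then for all tau > 0
\Pr\bigl[F \le \mathbf{E}[F] - \tau\bigr]
  \;\le\; \exp\!\left(\frac{-2\tau^{2}}{\sum_{i} c_{i}^{2}}\right).
% The proof above applies it with tau = alpha/2^{k+2} and
% c_j = (log t)/2^{2k} for j = 2, ..., t, keeping only the weaker
% exponent -tau^2 / sum_i c_i^2.
```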
4.3 Identifying (most pairs of) co-occurring variables
Recall that in Section 3.3 we partitioned the terms of our monotone DNF into four disjoint groups depending on which subset of {v₁, v₂} was present in each term. In the non-monotone case, we will partition the terms of f into nine disjoint groups depending on whether each of v₁, v₂ is unnegated, negated, or absent:

$$f = g_{**} \vee (v_1 g_{1*}) \vee (\overline{v_1}\, g_{0*}) \vee (v_2 g_{*1}) \vee (v_1 v_2 g_{11}) \vee (\overline{v_1} v_2 g_{01}) \vee (\overline{v_2}\, g_{*0}) \vee (v_1 \overline{v_2} g_{10}) \vee (\overline{v_1}\,\overline{v_2} g_{00})$$

Thus g_{**} contains those terms of f which contain neither v₁ nor v₂ in any form; g_{0*} contains the terms of f which contain the negated literal v̄₁ but not v₂ in any form (with v̄₁ removed from each term); g_{*1} contains the terms of f which contain v₂ but not v₁ in any form (with v₂ removed from each term); and so on. Each g_{·,·} is thus a DNF (possibly empty) over literals formed from v₃, . . . , v_n.
For all four possible values of (a, b) ∈ {0, 1}², we can empirically estimate

$$p_{ab} := \Pr_x[g_{**} \vee g_{a*} \vee g_{*b} \vee g_{ab}] = \Pr_x[f(x) = 1 \mid x_1 = a,\ x_2 = b]\,.$$
It is easy to see that Pr[g_{11}] is either 0 or else at least 4/2^k, depending on whether g_{11} is empty or not. Ideally we would like to be able to accurately estimate each of Pr[g_{00}], Pr[g_{01}], Pr[g_{10}], and Pr[g_{11}]; if we could do this then we would have complete information about which pairs of literals involving variables v₁ and v₂ co-occur in terms of f. Unfortunately, the probabilities Pr[g_{00}], Pr[g_{01}], Pr[g_{10}], and Pr[g_{11}] cannot in general be obtained from p_{00}, p_{01}, p_{10}, and p_{11}. However, we will show that we can efficiently obtain some partial information which enables us to learn to fairly high accuracy.

As before, our approach is to accurately estimate the quantity P = p_{11} − p_{10} − p_{01} + p_{00}. We have the following two lemmas:
Lemma 4.4. If all four of g_{00}, g_{01}, g_{10}, and g_{11} are empty, then P equals

$$\Pr[g_{1*} \wedge g_{*0} \wedge (\text{no other } g_{\cdot,\cdot})] + \Pr[g_{0*} \wedge g_{*1} \wedge (\text{no other } g_{\cdot,\cdot})] - \Pr[g_{1*} \wedge g_{*1} \wedge (\text{no other } g_{\cdot,\cdot})] - \Pr[g_{0*} \wedge g_{*0} \wedge (\text{no other } g_{\cdot,\cdot})]\,. \tag{4.1}$$
Proof. Since all four of g00 , g01 , g10 , and g11 are empty we need only consider the five events g∗∗ , g∗0 ,
g0∗ , g∗1 , and g1∗ . We now analyze the contribution to P from each possible subset of these 5 events:
• P gets a net contribution of 0 from those x which belong to g∗,∗ (and to any other subset of the
remaining four events) since each such x is counted in each of p00 , p01 , p10 , and p11 . It remains
to consider all 16 subsets of the four events g∗0 , g0∗ , g∗1 , and g1∗ .
• P gets a net contribution of 0 from those x which are in at least 3 of the four events g∗0 , g0∗ , g∗1 ,
and g1∗ since each such x is counted in each of p00 , p01 , p10 , and p11 . P also gets a net contribution
of 0 from those x which are in exactly one of the four events g∗0 , g0∗ , g∗1 , and g1∗ . It remains to
consider those x which are in exactly two of the four events g1∗ , g0∗ , g∗1 , and g∗0 .
• P gets a net contribution of 0 from those x which are in g1∗ and g0∗ and no other events, since each
such x is counted in each of p00 , p01 , p10 , and p11 . The same is true for those x which are in g∗1
and g∗0 and no other events.
• P gets a net contribution of
− Pr[g1∗ ∧ g∗1 ∧ (no other g·,· occurs)]
from those x which are in g1∗ and g∗1 and no other event. Similarly, P gets a net contribution of
− Pr[g0∗ ∧ g∗0 ∧ (no other g·,· occurs)]
from those x which are in g0∗ and g∗0 and no other event. P gets a net contribution of
Pr[g1∗ ∧ g∗0 ∧ (no other g·,· occurs)]
from those x which are in g1∗ and g∗0 and no other event, and gets a net contribution of
Pr[g0∗ ∧ g∗1 ∧ (no other g·,· occurs)]
from those x which are in g0∗ and g∗1 and no other event.
Lemma 4.5. If exactly one of g00 , g01 , g10 and g11 is nonempty (say g11 ), then P equals (4.1) plus
Pr[g11 ∧ g1∗ ∧ g∗0 ∧ (no other g·,· )] + Pr[g11 ∧ g0∗ ∧ g∗1 ∧ (no other g·,· )]
− Pr[g11 ∧ g1∗ ∧ g∗1 ∧ (no other g·,· )] − Pr[g11 ∧ g0∗ ∧ g∗0 ∧ (no other g·,· )]
+ Pr[g11 ∧ g0∗ ∧ (no other g·,· )] + Pr[g11 ∧ g∗0 ∧ (no other g·,· )] + Pr[g11 ∧ (no other g·,· )] .
Proof. We suppose that g₁₁ is nonempty. We wish to analyze the contribution to P from all 64 subsets of the six events g∗∗, g1∗, g0∗, g∗1, g∗0, and g₁₁. From Lemma 4.4 we know that the contribution from the 32 subsets which do not include g₁₁ is (4.1), so only a few cases remain:
• P gets a net contribution of 0 from those x which are in g11 and in g∗∗ and in any other subset of
events (each such x is counted in each of p11 , p01 , p10 , and p00 ). Similarly, P gets a contribution
of 0 from those x which are in g11 and in at least three of g1∗ , g0∗ , g∗1 , and g∗0 . So it remains only
to analyze the contribution from subsets which contain g11 , contain at most two of g1∗ , g0∗ , g∗1 ,
g∗0 , and contain nothing else.
• An analysis similar to that of Lemma 4.4 shows that P gets a net contribution of
Pr[g11 ∧ g1∗ ∧ g∗0 ∧ (no other g·,· )] + Pr[g11 ∧ g0∗ ∧ g∗1 ∧ (no other g·,· )]
− Pr[g11 ∧ g1∗ ∧ g∗1 ∧ (no other g·,· )] − Pr[g11 ∧ g0∗ ∧ g∗0 ∧ (no other g·,· )]
from those x which are in g11 , in exactly two of {g1∗ , g0∗ , g∗1 , g∗0 }, and in no other events. So
it remains only to consider subsets which contain g11 and at most one of g1∗ , g0∗ , g∗1 , g∗0 , and
nothing else.
• P gets a contribution of 0 from x which are in g11 and g1∗ and in nothing else; likewise from x
which are in g11 and g∗1 and in nothing else. P gets a contribution of
Pr[g11 ∧ g0∗ ∧ (no other g·,· )]
from x which are in g11 and g0∗ and in nothing else, and a contribution of
Pr[g11 ∧ g∗0 ∧ (no other g·,· )]
from x which are in g11 and g∗0 and in nothing else.
• P gets a net contribution of Pr[g11 ∧ ( no other g·,· )] from those x which are in g11 and in no other
event.
Using the above two lemmas we can show that the value of P is a good indicator for distinguishing
between all four of g00 , g01 , g10 , and g11 being empty versus exactly one of them being nonempty:
Lemma 4.6. For n sufficiently large and t ≥ 4, with probability at least 1 − δ′_usat − δ_shared − δ_many over a random draw of f from D^{t,k}_n, we have that: (i) if v₁ and v₂ do not co-occur in any term of f then P ≤ α²/(8t); (ii) if v₁ and v₂ do co-occur in some term of f and exactly one of g_{00}, g_{01}, g_{10}, and g_{11} is nonempty, then P ≥ 3α²/(16t).
Proof. With probability at least 1 − δ′_usat − δ_shared − δ_many a randomly chosen f from D^{t,k}_n will have all of the following properties:

1. Each term in f is uniquely satisfied with probability at least α/2^{k+1} (by Lemma 4.3);

2. Each variable in f appears in at most 2^{k−1}α²/(√t log t) terms (by Lemma 3.4); and

3. Each pair of terms T_i and T_j in f are both satisfied with probability at most log t/2^{2k} (by Lemma 4.2).
For the sequel assume that we have such an f. We first prove (i) by showing that P, as represented by (4.1) of Lemma 4.4, is at most α⁴/(t log t). By property 3 above, for any pair of terms consisting of one term from g_{1*} and the other from g_{*0}, the probability that both terms are satisfied is at most log t/2^{2k}. Since each of g_{1*} and g_{*0} contains at most 2^{k−1}α²/(√t log t) terms by property 2, a union bound gives

$$\Pr[g_{1*} \wedge g_{*0} \wedge (\text{no other } g_{\cdot,\cdot})] \le \Pr[g_{1*} \wedge g_{*0}] \le \frac{\alpha^4}{4t\log t}\,.$$

A similar bound holds for Pr[g_{0*} ∧ g_{*1} ∧ (no other g_{·,·})], which is the only other positive summand in (4.1), so P is certainly at most α⁴/(t log t). This is at most α²/(8t) since α ≤ 1/2 and t ≥ 4.
We now prove (ii). By an argument similar to the above, the first six summands (not including (4.1)) in the expression of Lemma 4.5, namely Pr[g_{11} ∧ g_{1*} ∧ g_{*0} ∧ (no other g_{·,·})] through Pr[g_{11} ∧ g_{*0} ∧ (no other g_{·,·})], are each at most α⁴/(4t log t) in magnitude. Now observe that each instance x that uniquely satisfies a term T_j of f containing both v₁ unnegated and v₂ unnegated must satisfy g_{11} and no other g_{·,·}. Thus under the conditions of (ii) the last summand in Lemma 4.5, namely Pr[g_{11} ∧ (no other g_{·,·})], is at least α/2^{k+1} by property 1 above, so we have that in case (ii) P is at least

$$\frac{\alpha}{2^{k+1}} - \frac{5}{2}\cdot\frac{\alpha^4}{t\log t}\,.$$

(Here the 5α⁴/(2t log t) comes from the ten summands, four from (4.1) and six from the first six summands of Lemma 4.5, each of which contributes at most α⁴/(4t log t) in magnitude.) Since (k, t) is α-interesting we have t/2^k ≥ α, and from this and the constant bounds on α and t it is easily shown that

$$\frac{\alpha}{2^{k+1}} \ge \frac{\alpha^2}{2t} \quad\text{and}\quad \frac{5}{2}\cdot\frac{\alpha^4}{t\log t} \le \frac{5\alpha^2}{16t}\,,$$

from which the lemma follows.
It is clear that an analogue of Lemma 4.6 holds for any pair of variables v_i, v_j in place of v₁, v₂. Thus, for each pair of variables v_i, v_j, if we decide whether v_i and v_j co-occur (negated or otherwise) in any term on the basis of whether P_{ij} is large or small, we will err only if two or more of g_{00}, g_{01}, g_{10}, and g_{11} are nonempty.

We now show that for f ∈ D^{t,k}_n, with very high probability there are not too many pairs of variables (v_i, v_j) which co-occur (with any sign pattern) in at least two terms of f. Note that this immediately bounds the number of pairs (v_i, v_j) which have two or more of the corresponding g_{00}, g_{01}, g_{10}, and g_{11} nonempty.
Lemma 4.7. Let d > 0 and f ∈ D^{t,k}_n. The probability that more than (d + 1)t²k⁴/n² pairs of variables (v_i, v_j) each co-occur in two or more terms of f is at most exp(−d²t³k⁴/n⁴).
Proof. We use McDiarmid's inequality, where the random variables are the terms T₁, . . . , T_t chosen independently from the set of all possible terms of length k, and F(T₁, . . . , T_t) denotes the number of pairs of variables (v_i, v_j) that co-occur in at least two terms. For each ℓ = 1, . . . , t we have

$$\Pr[T_\ell \text{ contains both } v_1 \text{ and } v_2] \le \frac{k^2}{n^2}\,,$$

so by a union bound we have

$$\Pr[f \text{ contains at least two terms which contain both } v_1 \text{ and } v_2 \text{ in any form}] \le \frac{t^2 k^4}{n^4}\,.$$

By linearity of expectation we have µ = E[F] ≤ t²k⁴/n². Since each term involves at most k² pairs of co-occurring variables, we have

$$|F(T_1, \ldots, T_t) - F(T_1, \ldots, T_{i-1}, T_i', T_{i+1}, \ldots, T_t)| \le c_i = k^2\,.$$

We thus have by McDiarmid's inequality that Pr[F ≥ t²k⁴/n² + τ] ≤ exp(−τ²/(tk⁴)). Taking τ = dt²k⁴/n², we have Pr[F ≥ (d + 1)t²k⁴/n²] ≤ exp(−d²t³k⁴/n⁴).
Taking d = n²/(t^{5/4}k⁴) in the above lemma (note that d > 1 for n sufficiently large since t^{5/4} = O(n^{15/8})), we have (d + 1)t²k⁴/n² ≤ 2t^{3/4} and the failure probability is at most exp(−√t/k⁴) (we henceforth write δ_co-occur to denote this quantity exp(−√t/k⁴)). The results of this section (together with a standard analysis of the error in estimating each P_{ij}) thus yield:
Theorem 4.8. For n sufficiently large and for any δ > 0, with probability at least 1 − δ_co-occur − δ′_usat − δ_shared − δ_many − δ over the random draw of f from D^{t,k}_n and the choice of random examples, the above algorithm runs in O(n²t² log(n/δ)) time and outputs a list of pairs of variables (v_i, v_j) such that: (i) if (v_i, v_j) is in the list then v_i and v_j co-occur in some term of f; and (ii) at most N₀ = 2t^{3/4} pairs of variables (v_i, v_j) which do co-occur in f are not on the list.
4.4 Reconstructing an accurate DNF hypothesis
It remains to construct a good hypothesis for the target DNF from a list of pairwise co-occurrence relationships as provided by Theorem 4.8. As in the monotone case, we consider the graph G with vertices v₁, . . . , v_n and edges for precisely those pairs of variables (v_i, v_j) which co-occur (with any sign pattern) in some term of f. As before this graph is a union of t randomly chosen k-cliques S₁, . . . , S_t which correspond to the t terms in f, and as before we would like to find a set of k-cliques in G that contains the t k-cliques corresponding to the terms of f as a subset. However, there are two differences now: the first is that instead of having the true graph G, we have access only to a graph G′ which is formed from G by deleting some set of at most N₀ = 2t^{3/4} edges. The second difference is that the final hypothesis must take the signs of the literals in each term into account. To handle these two differences, we use a different reconstruction procedure than we used for monotone DNF in Section 3.4; this reconstruction procedure works only for t = O(n^{3/2−γ}) where γ > 0.
We first show how to identify (with high probability over the choice of f) the set of all k-cliques in G′; clearly, the k-cliques corresponding to terms in f are a subset of this set. We then show how to form a DNF hypothesis from the set of all k-cliques in G′.

We now describe an algorithm which, for t = O(n^{3/2−γ}) with γ > 0, with high probability runs in polynomial time and identifies all the k-cliques in G′ which contain the edge (v₁, v₂). Since G′ has at most tk² edges, running the algorithm at most tk² times on all edges in G′ will give us with high probability all the k-cliques in G′. The algorithm is:
• Let ∆ be the set of vertices v_j such that v₁, v₂, and v_j form a triangle in G′. Run a brute-force algorithm to find all (k − 2)-cliques in the subgraph induced by ∆ (this is the subgraph of G′ whose vertices are the vertices of ∆, and whose edges are the edges that are present in G′ between vertices of ∆).
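The step above can be sketched as follows, with an adjacency-set representation and helper names that are ours, not the paper's:

```python
from itertools import combinations

def k_cliques_through_edge(adj, v1, v2, k):
    """Find all k-cliques of the graph containing edge (v1, v2): restrict to
    the triangle set Delta = common neighbours of v1 and v2, then brute-force
    all (k-2)-subsets of Delta (feasible when |Delta| = O(k), as in Lemma 4.9)."""
    delta = adj[v1] & adj[v2]
    cliques = []
    for subset in combinations(sorted(delta), k - 2):
        if all(b in adj[a] for a, b in combinations(subset, 2)):
            cliques.append(frozenset((v1, v2) + subset))
    return cliques

# Toy graph: 4-clique {0,1,2,3} plus triangle {0,1,4} (4 not joined to 2, 3).
edge_list = [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3),(0,4),(1,4)]
adj = {v: set() for v in range(5)}
for a, b in edge_list:
    adj[a].add(b); adj[b].add(a)
```

Here the search through edge (0, 1) with k = 4 recovers exactly the clique {0, 1, 2, 3}, and through edge (0, 4) with k = 3 recovers the triangle {0, 1, 4}.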
It is clear that the algorithm finds every k-clique which contains the edge (v₁, v₂). To bound the algorithm's running time, it suffices to give a high-probability bound on the size of ∆ in the graph G (clearly ∆ only shrinks in passing from G to G′). The following lemma gives such a bound:
Lemma 4.9. Let G be a random graph as described above. For any t = O(n^{3/2−γ}) and any C > 0 we have that with probability 1 − O(log^{6C} n / n^{2γC}) the size of ∆ in G is at most Ck.
Proof. In order for v₁, v₂, and v_j to form a triangle in G, it must be the case that either (i) some clique S_i contains {1, 2, j}; or (ii) there is some pair of cliques S_a, S_b with 2 ∉ S_a and {1, j} ⊂ S_a, and 1 ∉ S_b and {2, j} ⊂ S_b.

For (i), we have from Lemma 3.11 that v₁ and v₂ co-occur in more than C terms with probability at most (tk²/n²)^C. Since each term in which v₁ and v₂ co-occur contributes at most k − 2 vertices v_j to condition (i), the probability that more than C(k − 2) vertices v_j satisfy condition (i) is at most (tk²/n²)^C = O(1/n^{C/2}).
For (ii), let A be the set of those indices a ∈ {1, . . . , t} such that 2 ∉ S_a and 1 ∈ S_a, and let S_A be ∪_{a∈A} S_a. Similarly let B be the set of indices b such that 1 ∉ S_b and 2 ∈ S_b, and let S_B be ∪_{b∈B} S_b. It is clear that A and B are disjoint. For each ℓ = 1, . . . , t we have that ℓ ∈ A independently with probability at most p = k/n, so E[|A|] ≤ tk/n. We now consider two cases:
Case 1: t ≤ n/log n. In this case we may take β = n log n/(tk) in the Chernoff bound, and we have that Pr[|A| ≥ βpt] equals

$$\Pr[|A| \ge \log n] \;\le\; \left(\frac{e}{\beta}\right)^{\beta p t} \le \left(\frac{ek}{\log^2 n}\right)^{\log n} = \left(\frac{e}{\Omega(\log n)}\right)^{\log n} = \frac{1}{n^{\omega(1)}}\,.$$

The same bound clearly holds for B. Note that in Case 1 we thus have |S_A|, |S_B| ≤ k log n with probability 1 − 1/n^{ω(1)}.
Case 2: t > n/log n. In this case we may take β = log n in the Chernoff bound and we obtain

$$\Pr[|A| \ge \beta p t] = \Pr\left[|A| \ge \frac{tk\log n}{n}\right] \le \left(\frac{e}{\log n}\right)^{kt(\log n)/n} < \left(\frac{e}{\log n}\right)^{k} = \frac{1}{n^{\omega(1)}}$$

where the last inequality holds since k = Ω(log n) (since t > n/log n and (k, t) is α-interesting). In Case 2 we thus have |S_A|, |S_B| ≤ (tk² log n)/n with probability 1 − 1/n^{ω(1)}.
Let S′_A denote S_A − {1} and S′_B denote S_B − {2}. Since A and B are disjoint, it is easily seen that conditioned on S′_A being of some particular size s′_A, all $\binom{n-2}{s'_A}$ of the s′_A-element subsets of {3, . . . , n} are equally likely for S′_A. Likewise, conditioned on S′_B being of size s′_B, all $\binom{n-2}{s'_B}$ of the s′_B-element subsets of {3, . . . , n} are equally likely for S′_B. Thus, the probability that |S′_A ∩ S′_B| ≥ C is at most

$$\binom{s'_B}{C}\left(\frac{s'_A}{n-2}\right)^{C} \le \left(\frac{s'_A s'_B}{n-2}\right)^{C} \le \left(\frac{2\, s'_A s'_B}{n}\right)^{C} \tag{4.2}$$

(since the expression on the left is an upper bound on the probability that some collection of C elements in S′_B all coincide with elements of S′_A).
In Case 1 (t ≤ n/log n) we may assume that s′_A, s′_B are each at most k log n (recall from above that this holds with probability 1 − n^{−ω(1)}), and thus (4.2) is at most [(2k² log² n)/n]^C. In Case 2 (t > n/log n) we may assume that s′_A, s′_B ≤ (tk² log n)/n (here too from above we have that this holds with probability 1 − n^{−ω(1)}) and thus (4.2) is at most

$$\left(\frac{2t^2 k^4 \log^2 n}{n^3}\right)^{C} = O\!\left(\frac{\log^{6C} n}{n^{2\gamma C}}\right).$$
Thus, all in all, we have that except with probability O(1/n^{C/2}) event (i) contributes at most C(k − 2) vertices v_j such that {1, 2, j} forms a triangle, and except with probability O(log^{6C} n/n^{2γC}) event (ii) contributes at most C vertices v_j such that {1, 2, j} forms a triangle. This proves the lemma.
By Lemma 4.9, doing a brute-force search which finds all (k − 2)-cliques in the graph induced by ∆ takes at most $\binom{Ck}{k} \le \left(\frac{eCk}{k}\right)^k = (eC)^{O(\log n)} = n^{O(\log C)}$ time steps. Thus we can efficiently, with high probability, identify all the k-cliques in G′. How many of the "true" cliques S₁, . . . , S_t in G are not present as k-cliques in G′? By Lemma 3.11, with probability at least 1 − t²(tk²/n²)^C each edge (v_i, v_j) participates in at most C cliques from S₁, . . . , S_t. Since G′ is missing at most N₀ edges of G, with probability at least 1 − t²(tk²/n²)^C the set of all k-cliques in G′ is missing at most CN₀ "true" cliques from S₁, . . . , S_t.
Summarizing the results of this section so far, we have:
Theorem 4.10. Fix C ≥ 2. Given a DNF formula f drawn from D^{t,k}_n and a list of pairs of co-occurring variables as described in Theorem 4.8, with probability at least 1 − 1/n^{Ω(C)} the above procedure runs in n^{O(log C)} time and constructs a list Z₁, . . . , Z_{N′} (where N′ = n^{O(log C)}) of k-cliques which contains all but at most CN₀ of the cliques S₁, . . . , S_t.
We construct a hypothesis DNF from the list Z₁, . . . , Z_{N′} of candidate k-cliques as follows: for each Z_i we form all 2^k possible terms which could have given rise to Z_i (corresponding to all 2^k sign patterns on the k variables in Z_i). We then test each of these 2^k N′ potential terms against a sample of M randomly drawn negative examples and discard any term which outputs 1 on some negative example; the final hypothesis h is the OR of all surviving terms. Any candidate term T′ which has

$$\Pr_{x \in U_n}[T'(x) = 1 \wedge f(x) = 0] \;\ge\; \frac{\varepsilon}{2^{k+1} N'}$$

will survive this test with probability at most exp(−εM/(2^{k+1}N′)). Taking ε = 1/2^k and

$$M = \frac{2^{k+1} N' \log^2 n}{\varepsilon}$$

we have that with probability 1 − 1/n^{ω(1)} each term in the final hypothesis contributes at most ε/(2^{k+1}N′) toward the false-positive rate of h, so with high probability the false-positive rate of h is at most ε = 1/2^k.
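The sign-pattern enumeration and negative-example filtering can be sketched as follows; this is a toy illustration, and the term representation and helper names are ours.

```python
from itertools import product

def terms_from_cliques(cliques, neg_examples):
    """For each k-clique, form all 2^k sign patterns over its variables and
    keep only the terms that never fire on a negative example."""
    def satisfies(term, x):            # term: tuple of (variable, required bit)
        return all(x[v] == bit for v, bit in term)
    surviving = []
    for clique in cliques:
        vs = sorted(clique)
        for bits in product((0, 1), repeat=len(vs)):
            term = tuple(zip(vs, bits))
            if not any(satisfies(term, x) for x in neg_examples):
                surviving.append(term)
    return surviving

# Toy target over n = 2 variables: f = x0 AND (NOT x1), clique {0, 1}.
f = lambda x: int(x[0] and not x[1])
negs = [x for x in product((0, 1), repeat=2) if f(x) == 0]
terms = terms_from_cliques([{0, 1}], negs)
```

Of the 2² = 4 sign patterns over {x0, x1}, only the true one (x0 unnegated, x1 negated) survives filtering against the negative examples.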
The false-negative rate of h is at most 1/2^k times the number of terms in f which are missing from h. Since the above algorithm clearly will not discard any term of f (such a term will never cause a false-positive mistake), we need only bound the number of terms in f which are not among our 2^k N′ candidates. With probability at least 1 − δ_clique := 1 − t²/$\binom{n}{k}$, each true clique S₁, . . . , S_t in G gives rise to exactly one term of f (the only way this can fail is if two terms consist of literals over the exact same set of k variables, and the probability that this occurs is at most t²/$\binom{n}{k}$), so Theorem 4.10 implies that h is missing at most CN₀ terms of f. Thus the false-negative rate is at most

$$\frac{CN_0}{2^k} \le \frac{2Ct^{3/4}}{2^k} = \frac{1}{\Omega(t^{1/4})}\,.$$
All in all, the following is our main learning result for non-monotone DNF:

Theorem 4.11. Fix γ, α > 0 and C ≥ 2. Let (k, t) be an α-interesting pair. For f randomly chosen from D^{t,k}_n, with probability at least 1 − δ_co-occur − δ′_usat − δ_shared − δ_many − δ_clique − 1/n^{Ω(C)} the above algorithm runs in Õ(n²t² + n^{O(log C)}) time and outputs a hypothesis h whose error rate relative to f under the uniform distribution is at most 1/Ω(t^{1/4}).
It can be verified from the definitions of the various δ's that for any $t = \omega(1)$ as a function of $n$, the failure probability is $o(1)$ and the accuracy is $1 - o(1)$.
5 Future work
We can currently only learn random DNF formulas with $o(n^{3/2})$ terms ($o(n^2)$ terms for monotone DNF); can stronger results be obtained which hold for all polynomial-size DNF? A natural approach for learning $n^c$-term DNF might be to first try to identify all $c'$-tuples of variables which co-occur in a term, where $c'$ is some constant larger than $c$. Also, our current results for DNF with $t = \omega(1)$ terms let us learn to accuracy $1 - o(1)$, but we cannot yet achieve an arbitrary inverse-polynomial error rate for non-monotone DNF. Finally, another interesting direction is to explore other natural models of random DNF formulas, perhaps by allowing some variation among term sizes or dependencies between terms.
Acknowledgement. Avrim Blum suggested to one of us (JCJ) the basic strategy that learning monotone DNF with respect to uniform might be reducible to finding the co-occurring pairs of variables in the target function. We thank the anonymous referees for helpful suggestions and corrections. This material is based upon work supported by the National Science Foundation under Grants CCR-0209064 (JCJ) and CCF-0347282 (RAS).
References
[1] H. Aizenstein and L. Pitt: On the learnability of disjunctive normal form formulas. Machine Learning, 19:183–208, 1995. [ML:n226835168336578]. 1.1

[2] Noga Alon and Joel H. Spencer: The Probabilistic Method. John Wiley and Sons, 2000. 2.1

[3] D. Angluin: Queries and concept learning. Machine Learning, 2(4):319–342, 1988. [ML:u228266621966h58]. 1.1

[4] Dana Angluin and Michael Kharitonov: When won't membership queries help? Journal of Computer and System Sciences, 50(2):336–355, 1995. [JCSS:10.1006/jcss.1995.1026]. 1.1

[5] A. Blum: Learning a function of r relevant variables (open problem). In Proc. 16th Ann. Conf. on Computational Learning Theory (COLT'03), volume 2777 of Lecture Notes in Computer Science, pp. 731–733. Springer, 2003. [COLT:fxdg79cvndr6n05r]. 1.2

[6] A. Blum: Machine learning: a tour through some favorite results, directions, and open problems. FOCS 2003 tutorial slides, available at http://www-2.cs.cmu.edu/~avrim/Talks/FOCS03/tutorial.ppt, 2003. 1.1

[7] A. Blum, C. Burch, and J. Langford: On learning monotone boolean functions. In Proc. 39th FOCS, pp. 408–415. IEEE Computer Society Press, 1998. [FOCS:10.1109/SFCS.1998.743491]. 1.1

[8] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich: Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proc. 26th STOC, pp. 253–262. ACM Press, 1994. [STOC:195058.195147]. 1.1

[9] B. Bollobás: Combinatorics: Set Systems, Hypergraphs, Families of Vectors and Combinatorial Probability. Cambridge University Press, 1986. 3.2

[10] M. Golea, M. Marchand, and T. Hancock: On learning µ-perceptron networks on the uniform distribution. Neural Networks, 9(1):67–82, 1994. [NeuralNet:10.1016/0893-6080(95)00009-7]. 2.2

[11] T. Hancock: Learning kµ decision trees on the uniform distribution. In Proc. 6th Ann. Conf. on Computational Learning Theory (COLT'93), pp. 352–360. ACM Press, 1993. [ACM:168304.168374]. 2.2

[12] J. Jackson: An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414–440, 1997. [JCSS:10.1006/jcss.1997.1533]. 1.1

[13] J. Jackson, A. Klivans, and R. Servedio: Learnability beyond AC^0. In Proc. 34th STOC, pp. 776–784. ACM Press, 2002. [STOC:509907.510018]. 2.2
[14] J. Jackson and R. Servedio: Learning random log-depth decision trees under the uniform distribution. In Proc. 16th Ann. Conf. on Computational Learning Theory (COLT'03) and 7th Kernel Workshop, volume 2777 of Lecture Notes in Computer Science, pp. 610–624. Springer, 2003. [doi:10.1007/b12006, COLT:31wxjv7lb9l5]. 2.2

[15] J. Jackson and R. Servedio: On learning random DNF formulas under the uniform distribution. In Proc. 9th Internat. Workshop on Randomization and Computation (RANDOM'05), volume 3624 of Lecture Notes in Computer Science, pp. 342–353. Springer, 2005. [RANDOM:2y5933y326xhbgar]. 1.2

[16] J. Jackson and C. Tamon: Fourier Analysis in Machine Learning. ICML/COLT 1997 tutorial slides, available at http://learningtheory.org/resources.html, 1997. 1.1

[17] M. Kearns: Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998. [JACM:293347.293351]. 1.1

[18] M. Kearns, M. Li, L. Pitt, and L. Valiant: On the learnability of Boolean formulae. In Proc. 19th STOC, pp. 285–295. ACM Press, 1987. [STOC:28395.28426]. 3.1

[19] M. Kearns, M. Li, L. Pitt, and L. Valiant: Recent results on Boolean concept learning. In Proc. 4th Internat. Workshop on Machine Learning, pp. 337–352. Morgan Kaufmann, 1987. 1.1

[20] M. Kearns and U. Vazirani: An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994. 3.4

[21] A. Klivans, R. O'Donnell, and R. Servedio: Learning intersections and thresholds of halfspaces. In Proc. 43rd FOCS, pp. 177–186. IEEE Computer Society Press, 2002. [FOCS:10.1109/SFCS.2002.1181894]. 2.2

[22] L. Kučera, A. Marchetti-Spaccamela, and M. Protassi: On learning monotone DNF formulae under uniform distributions. Information and Computation, 110:84–95, 1994. [IandC:10.1006/inco.1994.1024]. 2.2

[23] Y. Mansour: Learning Boolean functions via the Fourier transform, pp. 391–424. Kluwer Academic Publishers, 1994. 3.3

[24] C. McDiarmid: On the method of bounded differences. In Surveys in Combinatorics 1989, pp. 148–188. London Mathematical Society Lecture Notes, 1989. 2.1, 3.1

[25] R. Servedio: On learning monotone DNF under product distributions. In Proc. 14th Ann. Conf. on Computational Learning Theory (COLT'01), volume 2111 of Lecture Notes in Computer Science, pp. 558–573. Springer, 2001. [COLT:3j42gw4570jb08yt]. 1.1, 2.2

[26] L. Valiant: A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. [CACM:1968.1972]. 1.1, 1.2, 3.1
[27] K. Verbeurgt: Learning DNF under the uniform distribution in quasi-polynomial time. In Proc. 3rd Ann. Workshop on Computational Learning Theory (COLT'90), pp. 314–326. Morgan Kaufmann, 1990. [ACM:92571.92657]. 1.1, 2.2
AUTHORS
Jeffrey C. Jackson
Department of Mathematics and Computer Science
Duquesne University
Pittsburgh, PA 15282, USA
jacksonj duq edu
http://www.mathcs.duq.edu/~jackson
Rocco A. Servedio
Department of Computer Science
Columbia University
New York, NY 10027, USA
rocco cs columbia edu
http://www.cs.columbia.edu/~rocco
ABOUT THE AUTHORS
J EFFREY C. JACKSON has a distinctive educational background, having received his B. S.
from Oral Roberts University and his Ph. D. from Carnegie Mellon, where Merrick Furst
was his advisor. He has been a member of the faculty of Duquesne University since
1995, where he is currently chair of the Department of Mathematics and Computer
Science. Jeff has also been a software engineer and manager in both the aerospace
and dot-com industries and is the author of the textbook Web Technologies: A Computer
Science Perspective. He is the proud father of four children (think about his last name
for a moment and you’ll know why he and his wife didn’t stop at three).
ROCCO A. S ERVEDIO received his B. S., M. S. and Ph. D. from Harvard University, where
his Ph. D. was supervised by Leslie Valiant. For a change of pace, he then held an
NSF postdoc at Harvard University, where he was supervised by Leslie Valiant. Since
2003 he has been an assistant professor at Columbia University in the Department of
Computer Science. He is interested in computational learning theory and computational
complexity, and has received the NSF CAREER award and a Sloan Foundation Fellowship.
He enjoys spending time with his family and hopes to have dinner with Herman Melville
in the afterlife.