Authors Dana Dachman-Soled, Homin K. Lee, Tal Malkin, Rocco A. Servedio, Andrew Wan, Hoeteck Wee,
License CC-BY-3.0
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282
www.theoryofcomputing.org
Optimal Cryptographic Hardness of
Learning Monotone Functions∗
Dana Dachman-Soled ‡ † Homin K. Lee ‡ Tal Malkin §
Rocco A. Servedio ¶ Andrew Wan ‡ Hoeteck Wee
Received: October 9, 2008; published: December 12, 2009.
Abstract: Over the years a range of positive algorithmic results have been obtained for
learning various classes of monotone Boolean functions from uniformly distributed random
examples. Prior to our work, however, the only negative result for learning monotone
functions in this model has been an information-theoretic lower bound showing that certain
√
super-polynomial-size monotone circuits cannot be learned to accuracy 1/2 + ω(log n/ n)
(Blum, Burch, and Langford, FOCS’98). This is in contrast with the situation for non-
monotone functions, where a wide range of cryptographic hardness results establish that
various “simple” classes of polynomial-size circuits are not learnable by polynomial-time
algorithms.
In this paper we establish cryptographic hardness results for learning various “simple”
classes of monotone circuits, thus giving a computational analogue of the information-
∗ A extended abstract of this paper appeared in the Proceedings of ICALP 2008 [10]. The results and proofs in Section 4
are new to this article.
† Supported in part by an FFSEAS Presidential Fellowship.
‡ Supported in part by NSF Cybertrust CNS-0716245.
§ Supported by NSF CAREER award CCF-03-047839 and NSF award SBE-0245014.
¶ Supported by DARPA grant HR0011-07-1-0012, NSF CAREER award CCF-0347282, NSF award CCF-0523664, and a
Sloan Foundation Fellowship.
ACM Classification: F.1.3
AMS Classification: 68Q17
Key words and phrases: computational learning theory, hardness amplification, lower bounds, cryp-
tography, analysis of Boolean functions
2009 Dana Dachman-Soled, Homin K. Lee, Tal Malkin, Rocco A. Servedio, Andrew Wan, and Hoeteck Wee
Licensed under a Creative Commons Attribution License DOI: 10.4086/toc.2009.v005a013
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
theoretic hardness results of Blum et al. mentioned above. Some of our results show the
cryptographic hardness of learning polynomial-size monotone circuits to accuracy only
√
slightly greater than 1/2 + 1/ n, which is close to the optimal accuracy bound by posi-
tive results of Blum et al. Other results show that under a plausible cryptographic hardness
assumption, a class of constant-depth, sub-polynomial-size circuits computing monotone
functions is hard to learn. This result is close to optimal in terms of the circuit-size param-
eter by known positive results as well (Servedio, Information and Computation 2004). Our
main tool is a complexity-theoretic approach to hardness amplification via noise sensitivity
of monotone functions that was pioneered by O’Donnell (JCSS 2004).
1 Introduction
More than two decades ago Valiant introduced the Probably Approximately Correct (PAC) model of
learning Boolean functions from random examples [34]. Since that time a great deal of research ef-
fort has been expended on trying to understand the inherent abilities and limitations of computationally
efficient learning algorithms. This paper addresses a discrepancy between known positive and nega-
tive results for uniform distribution learning by establishing strong computational hardness results for
learning various classes of monotone functions.
1.1 Background and motivation
In the uniform distribution PAC learning model, a learning algorithm is given access to a source of inde-
pendent random examples (x, f (x)) where each x is drawn uniformly from the n-dimensional Boolean
cube and f is the unknown Boolean function to be learned. The goal of the learner is to construct a high-
accuracy hypothesis function h, i. e., one that satisfies Pr[ f (x) 6= h(x)] ≤ ε, where the probability is with
respect to the uniform distribution and ε is an error parameter given to the learning algorithm. Algo-
rithms and hardness results in this framework have interesting connections with topics such as discrete
Fourier analysis [23], circuit complexity [22], noise sensitivity and influence of variables in Boolean
functions [16, 4, 21, 29], coding theory [11], privacy [8, 17], and cryptography [7, 20]. For these rea-
sons, and because the model is natural and elegant in its own right, the uniform distribution learning
model has been intensively studied for almost two decades.
Monotonicity makes learning easier For many classes of functions, uniform distribution learning al-
gorithms have been devised that substantially improve on a naive exponential-time approach to learning
via brute-force search. However, despite intensive efforts, researchers have not yet obtained poly(n)-
time learning algorithms in this model for various simple classes of functions. Interestingly, in many of
these cases restricting the class of functions to the corresponding class of monotone functions has led to
more efficient—sometimes poly(n)-time—algorithms. We list some examples:
1. A simple algorithm learns monotone O(log n)-juntas1 to perfect accuracy in poly(n) time, and
a more complex algorithm [9] learns monotone Õ(log2 (n))-juntas to any constant accuracy in
1 A k-junta f (x , . . . , x ) is a Boolean function that only depends on k of the n input variables.
1 n
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 258
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
poly(n) time. In contrast, the fastest known algorithm for learning arbitrary k-juntas runs in time
n.704k [25].
2. The fastest known uniform distribution learning algorithm for the general class of s-term DNF, due
to Verbeurgt [36], runs in time nO(log s) to learn to any constant accuracy. In contrast,√[31] gives
an algorithm that runs in time sO(log s) for s-term monotone DNF. Thus, the class of 2O( log n) -term
monotone DNF √ can be learned to any constant accuracy in poly(n) time, but no such result is
known for 2O( log n) -term general DNF.
3. The fastest known algorithm for learning poly(n)-size general decision trees to constant accuracy
takes nO(log n) time (following from [36]), but poly(n)-size decision trees that compute monotone
functions can be learned to any constant accuracy in poly(n) time [29].
4. No poly(n)-time algorithm can learn the general class of all Boolean functions on {0, 1}n to accu-
racy better than 1/2 + poly(n)/2n , but a simple polynomial-time algorithm can learn the class of
√
all monotone Boolean functions to accuracy
√ 1/2+Ω(1/ n) [6]. We note also that the result of [9]
mentioned above follows from a 2Õ( n) -time algorithm for learning arbitrary monotone functions
on n variables to constant accuracy. (It is easy to see that no comparable algorithm can exist for
learning arbitrary Boolean functions to constant accuracy.)
Cryptography and hardness of learning Essentially all known representation-independent hardness
of learning results (i. e., results that apply to learning algorithms that do not have any restrictions on the
syntactic form of the hypotheses they output) rely on some cryptographic assumption, or an assumption
that easily implies a cryptographic primitive. For example, under the assumption that certain subset
sum problems are hard on average, Kharitonov [20] showed that the class AC1 of logarithmic-depth,
polynomial-size Boolean circuits (circuits with AND, OR, and NOT gates) is hard to learn under the uni-
ε
form distribution. Subsequently Kharitonov [19] showed that if factoring Blum integers is 2n -hard for
some fixed ε > 0, then even the class AC0 of constant-depth, polynomial-size Boolean circuits similarly
cannot be learned in polynomial time under the uniform distribution. In later work, Naor and Rein-
gold [26] gave constructions of pseudorandom functions with very low circuit complexity. Their results
imply that if factoring Blum integers is super-polynomially hard, then even depth-5 TC0 circuits cannot
be learned in polynomial time under the uniform distribution. (TC0 circuits are Boolean circuits that
can also use MAJ gates. The value of a MAJ gate is one if at least half of its inputs are one, and zero
otherwise.) We note that all of these hardness results apply even to algorithms that may make black-box
“membership queries” to obtain the value f (x) for inputs x of their choosing.
Monotonicity versus cryptography? Given that cryptography precludes efficient learning while
monotonicity seems to make efficient learning easier, it is natural to investigate how these phenom-
ena interact. One could argue that prior to the current work there was something of a mismatch between
known positive and negative results for uniform-distribution learning: as described above, a fairly broad
range of polynomial-time learning results have been obtained for various classes of monotone functions,
but there were no corresponding computational hardness results for monotone functions. Can all mono-
tone Boolean functions computed by polynomial-size circuits be learned to 99% accuracy in polynomial
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 259
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
time from uniform random examples? As far as we are aware, prior to our work answers were not
known even to such seemingly basic questions about learning monotone functions as this one. This gap
in understanding motivated the research presented in this paper (which, as we describe below, lets us
answer “no” to the above question in a strong sense).
1.2 Our results and techniques: cryptography trumps monotonicity
We present several different constructions of “simple” (polynomial-time computable) monotone Boolean
functions and prove that these functions are hard to learn under the uniform distribution, even if mem-
bership queries are allowed. We now describe our main results, followed by a high-level description of
how we obtain them.
Blum, Burch, and Langford [6] showed that arbitrary monotone functions cannot be learned to ac-
√
curacy better than 1/2 + O(log n/ n) by any algorithm that makes poly(n) many membership queries.
This is an information-theoretic bound that is proved using randomly generated monotone DNF formulas
of size (roughly) nlog n that are not polynomial-time computable. A natural goal is to obtain computa-
tional lower bounds for learning polynomial-time computable monotone functions that match, or nearly
√
match, this level of hardness (which is close to optimal by the (1/2 + Ω(1/ n))-accuracy algorithm of
Blum et al. described above). We prove near-optimal hardness for learning polynomial-size monotone
Boolean circuits (circuits with AND and OR gates):
Theorem 1.1 (informal statement). If one-way functions exist, then there is a class of poly(n)-size
monotone Boolean circuits that cannot be learned to accuracy 1/2 + 1/n1/2−ε for any fixed ε > 0.
Our approach yields even stronger lower bounds if we make stronger assumptions:
• Assuming the existence of sub-exponential one-way functions, we improve the bound on the ac-
√
curacy to 1/2 + polylog(n)/ n.
• Assuming the hardness of factoring Blum integers, our hard-to-learn functions may be computed
in monotone NC1 .
ε
• Assuming that Blum integers are 2n -hard to factor on average (which is the same hardness as-
sumption used in Kharitonov’s work [19]), we obtain a lower bound for learning constant-depth
circuits of sub-polynomial size that almost matches the positive result from [31]. More precisely,
we show that for any (sufficiently large) constant d, the class of monotone functions computed
O(1)/(d+1)
by depth-d Boolean circuits of size 2(log n) cannot be learned to accuracy 51% under the
uniform distribution in poly(n) time. In contrast, the positive result of [31] shows that mono-
1/(d+1) )
tone functions computed by depth-d Boolean circuits of size 2O((log n) can be learned to any
constant accuracy in poly(n) time.
These results are summarized in Figure 1.
Proof techniques A natural first approach is to try to replace the random nlog n -term monotone DNFs
constructed in [6] by pseudorandom DNFs of polynomial size. We were not able to do this directly;
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 260
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Hardness assumption Complexity of f Accuracy bound Ref.
1 ω(log n)
none random nlog n -term monotone DNF 2 + n1/2 [6]
1 1
OWF (poly) poly-size monotone circuits 2 + n1/2−ε Thm. 1.1
α 1 poly(log n)
OWF (2n ) poly-size monotone circuits 2+ n1/2
Thm. 2.9
1 1
factoring BI (poly) monotone NC1 -circuits 2 + n1/2−ε Thm. 3.2
nα O(1)/(d+1) 1
factoring BI (2 ) depth-d, size 2(log n) Boolean circuits 2 + o(1) Thm. 3.5
Figure 1: Summary of known hardness results for learning monotone Boolean functions. The meaning
of each row is as follows: under the stated hardness assumption, there is a class of monotone functions
computed by circuits of the stated complexity that no poly(n)-time membership query algorithm can
learn to the stated accuracy. In the first column, OWF and BI denote one-way functions and Blum
α α
Integers respectively, and “poly” and “2n ” mean that the problems are intractable for poly(n)- and 2n -
time algorithms, respectively (for some fixed α > 0). Recall that the poly(n)-time algorithm of [6] for
learning monotone functions implies that the best possible accuracy bound for monotone functions is
1/2 + Ω(1)/n1/2 .
indeed, as we discuss in Section 5, constructing such DNFs seems closely related to an open prob-
lem of Goldreich, Goldwasser, and Nussboim [13]. However, it turns out that a closely related ap-
proach does yield some results along the desired lines; in Section 4 we present and analyze a sim-
ple variant of the information-theoretic construction from [6] and then show how to replace random
choice by pseudorandom in this the variant. Because our variant gives a weaker quantitative bound on
the information-theoretic hardness of learning than [6], this gives a construction of polynomial-time-
computable monotone functions that, assuming the existence of one-way functions, cannot be learned
to accuracy 1/2 + 1/polylog(n) under the uniform distribution. While this answers the question posed
√
above (even with “51%” in place of “99%”), the 1/polylog(n) factor is rather far from the O(log n/ n)
factor that one might hope for as described above.
In Section 2 we use a different construction to obtain much stronger quantitative results; another
advantage of this second construction is that it enables us to show hardness of learning monotone circuits
rather than just circuits computing monotone functions. We start with the simple observation that using
standard tools it is easy to construct polynomial-size monotone circuits computing “slice” functions that
are pseudorandom on the middle layer of the Boolean cube {0, 1}n . Such functions are easily seen to be
√
mildly hard to learn, i. e., hard to learn to accuracy 1 − Ω(1/ n). We then use the elegant machinery of
hardness amplification of monotone functions pioneered by O’Donnell [28] to amplify the hardness of
this construction to near-optimal levels (as summarized in rows 2–4 of Figure 1). We obtain our result
for constant-depth, sub-polynomial-size circuits (row 5 of Figure 1) by augmenting this approach with
an argument that, at a high level, is similar to one used in [2], by “scaling down” and modifying our
hard-to-learn functions using a variant of Nepomnjaščiı̆’s theorem [27].
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 261
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
1.3 Preliminaries
We consider Boolean functions of the form f : {0, 1}n → {0, 1}. We view {0, 1}n as endowed with the
natural partial order: x ≤ y if and only if xi ≤ yi for all i = 1, . . . , n. A Boolean function f is monotone if
x ≤ y implies f (x) ≤ f (y).
Our work uses various standard definitions from the fields of circuit complexity, learning, and cryp-
tographic pseudorandomness; and for completeness we recall this material below.
Learning As described earlier, all of our hardness results apply even to learning algorithms that may
make membership queries, i. e., black-box queries to an oracle that gives the label f (x) of any example
x ∈ {0, 1}n on which it is queried. It is clear that for learning with respect to the uniform distribution,
having membership query access to the target function f is at least as powerful as being given uniform
random examples labeled according to x, as the learner can simply generate uniform random strings for
herself and query the oracle to simulate a random example oracle.
The goal of the learning algorithm is to construct a hypothesis h so that Prx [h(x) 6= f (x)] is small,
where the probability is taken over the uniform distribution. We shall only consider learning algorithms
that are allowed to run in poly(n) time, so the learning algorithm L may be viewed as a probabilistic
polynomial-time oracle machine that is given black-box access to the function f and attempts to output
a hypothesis h with small error relative to f .
We establish that a class C of functions is hard to learn by showing that for a uniform random f ∈ C,
the expected error of any poly(n)-time learning algorithm L is close to 1/2 when run with f as the target
function. Thus we bound the quantity
Pr n L f (1n ) → h; h(x) = f (x)
(1.1)
f ∈C,x∈{0,1}
where the probability is also taken over any internal randomization of the learning algorithm L. We
say that a class C is hard to learn to accuracy 1/2 + ε(n) if, for every poly(n)-time membership query
learning algorithm L (i. e., probabilistic polynomial-time oracle algorithm), we have that the above quan-
tity (1.1) is smaller than 1/2 + ε(n) for all sufficiently large n. As noted in [6], it is straightforward to
transform a lower bound of this sort into a lower bound for the usual ε, δ formulation of PAC learning.
Circuit complexity We shall consider various classes of circuits computing Boolean functions, in-
cluding the classes NC1 (polynomial-size, logarithmic-depth, bounded fan-in Boolean circuits, AC0
(polynomial-size, constant-depth, unbounded fan-in Boolean circuits), and TC0 (polynomial-size, con-
stant-depth unbounded fan-in Boolean circuits with MAJ gates).
A circuit is said to be monotone if it is composed entirely of AND/OR gates with no negations. Every
monotone circuit computes a monotone Boolean function, but of course non-monotone circuits may
compute monotone functions as well. The famous result of [30] shows that there are natural monotone
Boolean functions (such as the perfect matching function) that can be computed by polynomial-size
circuits but cannot be computed by quasi-polynomial-size monotone circuits, and Tardos [32] observed
that the separation can be increased to an exponential gap.
Thus, in general, it is a stronger result to show that a function can be computed by a small monotone
circuit than to show that it is monotone and can be computed by a small circuit.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 262
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Pseudorandom functions Pseudorandom functions [12] are the main cryptographic primitive that
underlie our constructions. Fix k(n) ≤ n, and let G be a family of functions g : {0, 1}k(n) → {0, 1}
each of which is computable by a circuit of size poly(k(n)). We say that G is a t(n)-secure pseudorandom
function family if the following condition holds: for any probabilistic t(n)-time oracle algorithm A, we
have
0
Pr [Ag (1n ) outputs 1] − 0Pr 0 [Ag (1n ) outputs 1] ≤ 1/t(n)
g∈G g ∈G
k(n)
where G0 is the class of all 22 functions from {0, 1}k(n) to {0, 1} (so the second probability above is
taken over the choice of a truly random function g0 ). Note that the purported distinguisher A has oracle
access to a function on k(n) bits but is allowed to run in time t(n).
It is well known that a pseudorandom function family that is t(n)-secure for all polynomials t(n)
can be constructed from any one-way function [12, 14]. We shall use the following folklore quantitative
variant that relates the hardness of the one-way function to the security of the resulting pseudorandom
function:
Proposition 1.2. Fix t(n) ≥ poly(n) and suppose there exist one-way functions that are hard to invert
on average for t(n)-time adversaries. Then there
exists a constant, 0 < c < 1, such that for any k(n) ≤ n,
there is a pseudorandom family G of functions g : {0, 1} k(n) → {0, 1} that is (t(k(n)))c -secure.
2 Lower bounds via hardness amplification of monotone functions
In this section we prove our main hardness results, summarized in Figure 1, for learning various classes
of monotone functions under the uniform distribution with membership queries.
Let us start with a high-level explanation of the overall idea. Inspired by the work on hardness
amplification within NP initiated by O’Donnell [28, 33, 15], we study constructions of the form
f (x1 , . . . , xm ) = C f 0 (x1 ), . . . , f 0 (xm )
where C is a Boolean “combining function” with low noise stability (we give precise definitions later)
that is both efficiently computable and monotone. Recall that O’Donnell showed that if f 0 is weakly
hard to compute and the combining function C has low noise stability, then f is very hard to compute.
This result holds for general (not necessarily monotone) functions C, and thus generalizes Yao’s XOR
lemma, which addresses the case where C is the XOR of m bits (and hence has the lowest noise stability
of all Boolean functions [28]).
Roughly speaking, we establish an analogue of O’Donnell’s result for learning. Our analogue, given
in Section 2.2, essentially states that for certain well-structured2 functions f 0 that are hard to learn to high
accuracy, if C has low noise stability then f is hard to learn to accuracy even slightly better than 1/2. As
our ultimate goal is to establish that “simple” classes of monotone functions are hard to learn, we shall
use this result with combining functions C that are computed by “simple” monotone Boolean circuits. In
order for the overall function f to be monotone and efficiently computable, we need the initial f 0 to be
well-structured, monotone, efficiently computable, and hard to learn to high accuracy. Such functions
2 As will be clear from the proof, we require that f 0 be balanced and have a “hard-core set.”
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 263
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
are easily obtained by a slight extension of an observation of Kearns, Li, and Valiant [18]. They noted
0 k
√ slice f of a random Boolean function on {0, 1} is hard to learn to accuracy greater than0
that the middle
1 − Θ 1/ k [6, 18]; by taking the middle slice of a pseudorandom function instead, we obtain an f
with the desired properties. In fact, by a result of Berkowitz [5] (see also [35, 3]), this slice function is
computable by a polynomial-size monotone circuit, so the overall hard-to-learn functions we construct
are computed by polynomial-size monotone circuits.
Organization
In Section 2.2 we adapt the analysis from [28, 33, 15] to reduce the problem of constructing hard-
to-learn monotone Boolean functions to constructing monotone combining functions C with low noise
stability. In Section 2.3 we show how constructions and analyses in [28, 24] can be used to obtain a “sim-
ple” monotone combining function with low noise stability. In Section 2.4 we establish Theorems 2.8
and 2.9 (lines 2 and 3 of Figure 1) by making different assumptions about the hardness of the initial
pseudorandom functions. Finally, in Section 3 we establish Theorems 3.2 and 3.5 by making specific
number theoretic assumptions (namely, the hardness of factoring Blum integers) to obtain hard-to-learn
monotone Boolean functions that can be computed by very simple circuits.
2.1 Preliminaries
Functions Let C : {0, 1}m → {0, 1} and f 0 : {0, 1}k → {0, 1} be Boolean functions. We write C ◦ f 0⊗m
to denote the Boolean function over ({0, 1}k )m given by
C ◦ f 0⊗m (x) = C f 0 (x1 ), . . . , f 0 (xm ) ,
where x = (x1 , . . . , xm ) .
For g : {0, 1}k → {0, 1}, we write slice(g) to denote the “middle slice” function:
1
if |x| > bk/2c,
slice(g)(x) = g(x) if |x| = bk/2c,
0 if |x| < bk/2c,
where |x| denotes the number of ones in the string x. It is immediate that slice(g) is a monotone Boolean
function for any g.
Bias and noise stability Following the analysis in [28, 33, 15], we shall study the bias and noise
stability of various Boolean functions. Specifically, we adopt the following notations and definitions
from [15]. The bias of a 0-1 random variable X is defined to be
def
Bias[X] = Pr[X = 0] − Pr[X = 1] .
Recall that a probabilistic Boolean function h on {0, 1}k is a probability distribution over Boolean func-
tions on {0, 1}k (so for each input x, the output h(x) is a 0-1 random variable). The expected bias of a
probabilistic Boolean function h is
def
ExpBias[h] = Ex [Bias[h(x)]] .
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 264
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Let C : {0, 1}m → {0, 1} be a Boolean function and 0 ≤ δ ≤ 1/2. The noise stability of C at noise rate
δ , denoted NoiseStabδ [C], is defined to be
def
NoiseStabδ [C] = Ex,ν [C(x) ⊕C(x ⊕ ν)] = 2 · Pr [C(x) = C(x ⊕ η)] − 1
x,η
where x ∈ {0, 1}m is uniform random, η ∈ {0, 1}m is a vector whose bits are each independently 1 with
probability δ , and ⊕ denotes bitwise XOR.
2.2 Hardness amplification for learning
Throughout this subsection we write m for m(n) and k for k(n). We shall establish the following:
Lemma 2.1. Let C : {0, 1}m → {0, 1} be a polynomial-time computable function. Let G0 be the family
k
of all 22 functions from {0, 1}k to {0, 1}, where n = mk and k = ω(log n). Then the class
C = { f = C ◦ slice(g)⊗m | g ∈ G0 }
of Boolean functions over {0, 1}n is hard to learn to accuracy
1 1q 1
+ NoiseStabΘ(1/√k) [C] + ω(1) .
2 2 n
This easily yields Corollary 2.3, which is an analogue of Lemma 2.1 with pseudorandom rather than
truly random functions, and which we use to obtain our main hardness of learning results.
Proof of Lemma 2.1. Let k, m be such that mk = n, and let C : {0, 1}m → {0, 1} be a Boolean combining
function. We prove the lemma by establishing an upper bound on the probability
h i
f n
0
Pr n
L (1 ) → h; h(x) = f (x) (2.1)
g∈G ,x∈{0,1}
where L is an arbitrary probabilistic polynomial-time oracle machine (running in time poly(n) on input
def
1n ) that is given oracle access to f = C ◦ slice(g)⊗m and outputs some hypothesis h : {0, 1}n → {0, 1}.
We first observe that because C is computed by a uniform family of circuits of size poly(m) ≤
poly(n), it is easy for a poly(n)-time machine to simulate oracle access to f if it is given oracle access
to g. So, the probability in (2.1) is at most
h i
g n ⊗m
0
Pr n
L (1 ) → h; h(x) = (C ◦ slice(g) )(x) . (2.2)
g∈G , x∈{0,1}
To analyze the above probability, suppose that in the course of its execution L never queries g on
any of the inputs x1 , . . . , xm ∈ {0, 1}k , where x = (x1 , . . . , xm ). Then the a posteriori distribution of
g(x1 ), . . . , g(xm ) (for uniform random g ∈ G0 ), given the responses to the queries of L that it received
from g, is identical to the distribution of g0 (x1 ), . . . , g0 (xm ), where g0 is an independent uniform draw
from G0 : both distributions are uniform random over {0, 1}m . (Intuitively, this just means that if L never
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 265
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
queries the random function g on any of x1 , . . . , xm , then giving L oracle access to g does not help it pre-
dict the value of f on x = (x1 , . . . , xm ).) As L runs in poly(n) time, for any fixed x1 , . . . , xm the probability
that L queried g on any of x1 , . . . , xm is at most m · poly(n)/2k . Hence (2.2) is bounded by
h i m · poly(n)
g n 0 ⊗m
Pr L (1 ) → h; h(x) = (C ◦ slice(g ) )(x) + . (2.3)
g,g0 ∈G , x∈{0,1}n
0 2k
The first summand in (2.3) is the probability that L correctly predicts the value C ◦ slice(g0 )⊗m (x), given
oracle access to g, where g and g0 are independently random functions and x is uniform over {0, 1}n .
It is clear that the best possible strategy for L is to use a maximum likelihood algorithm, i. e., predict
according to the function h that, for any fixed input x, outputs 1 if and only if the random variable
(C ◦ slice(g0 )⊗m )(x) (which we emphasize has randomness over the choice of g0 ) is biased towards 1.
The expected accuracy of this h is precisely
1 1
+ ExpBias[C ◦ slice(g0 )⊗m ] . (2.4)
2 2
Now fix
1
def k 1
δ = k =Θ √
2 bk/2c k
to be the fraction of inputs in the “middle slice” of {0, 1}k . We observe that the probabilistic function
slice(g0 ) (where g0 is truly random) is “δ -random” in the sense of Definition 3.1 of [15], meaning that it
is balanced, truly random on inputs in the middle slice, and deterministic on all other inputs. This means
that we may apply the following technical lemma (Lemma 3.7 from [15], see also [28]):
Lemma 2.2. Let h : {0, 1}n → {0, 1} be a function that is δ -random. Then
ExpBias[C ◦ h⊗m ] ≤ NoiseStabδ [C] .
p
Applying this lemma to the function slice(g0 ) we obtain
ExpBias[C ◦ slice(g0 )⊗m ] ≤ NoiseStabδ [C] .
p
(2.5)
Combining (2.3), (2.4) and (2.5) and recalling that k = ω(log n), we obtain Lemma 2.1.
Corollary 2.3. Let C : {0, 1}m → {0, 1} be a polynomial-time computable function. Let G be a pseu-
dorandom family of functions from {0, 1}k to {0, 1} that are secure against poly(n)-time adversaries,
where n = mk and k = ω(log n). Then the class
C = f = C ◦ slice(g)⊗m | g ∈ G
of Boolean functions over {0, 1}n is hard to learn to accuracy
1 1q 1
+ NoiseStabΘ(1/√k) [C] + ω(1) .
2 2 n
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 266
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Proof. The corollary follows from the fact that (2.2) must differ from its pseudorandom counterpart,
h i
Pr n Lg (1n ) → h; h(x) = (C ◦ slice(g)⊗m )(x) , (2.6)
g∈G, x∈{0,1}
by less than any fixed 1/ poly(n). Otherwise, we would easily obtain a poly(n)-time distinguisher that,
given oracle access to g, runs L to obtain a hypothesis h and checks whether
h(x) = (C ◦ slice(g)⊗m )(x)
for a random x to determine whether g is drawn from G or G0 .
By instantiating Corollary 2.3 with a “simple” monotone function C having low noise stability, we
obtain strong hardness results for learning simple monotone functions. We exhibit such a function C in
the next section.
2.3 A simple monotone combining function with low noise stability
In this section we combine known results of [28, 24] to obtain:
Proposition 2.4. Given a value k, let m = 3` d2d for `, d satisfying 3` ≤ k6 < 3`+1 and d ≤ O(k.35 ).
Then there exists a monotone function C : {0, 1}m → {0, 1} computed by a uniform family of poly(m)-
size, log(m)-depth monotone circuits such that
6
k log m
NoiseStabΘ(1/√k) [C] = O . (2.7)
m
.35
Note that in this proposition we may have m as large as 2Θ(k ) but not larger. O’Donnell [28]
established the lower bound 2
log m
NoiseStab √ [C] = Ω
Θ 1/ k m
for every monotone m-variable function C, so the above upper bound is fairly close to the best possible
Θ(1)
(within a polylog(m) factor if m = 2k ).
Following [28, 15], we use the “recursive majority of 3” function and the “tribes” function in our
construction. We require the following results on noise stability:
`
Lemma 2.5 ([28]). Let Rec-Maj-3` : {0, 1}3 → {0, 1} be defined as follows: for x = (x1 , x2 , x3 ) where
`−1
each xi ∈ {0, 1}3 ,
def
Rec-Maj-3` (x) = Maj Rec-Maj-3`−1 (x1 ), Rec-Maj-3`−1 (x2 ), Rec-Maj-3`−1 (x3 ) .
Then for ` ≥ log1.1 (1/δ ), we have
NoiseStabδ [Rec-Maj-3` ] ≤ δ −1.1 (3` )−.15 .
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 267
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
d
Lemma 2.6 ([24]). Let Tribesd : {0, 1}d2 → {0, 1} denote the “tribes” function on d2d variables, i. e.,
the read-once 2d -term monotone d-DNF
def
Tribesd (x1 , . . . , xd2d ) = (x1 ∧ · · · ∧ xd ) ∨ (xd+1 ∧ · · · ∧ x2d ) ∨ · · · .
Then if η = O(1/d), we have
ηd 2 1
NoiseStab 1−η [Tribesd ] = O = O .
2 d2d 2d
Lemma 2.7 ([28]). If h is a balanced Boolean function and ψ : {0, 1}r → {0, 1} is arbitrary, then for
any δ we have
NoiseStabδ [ψ ◦ h⊗r ] = NoiseStab 1 NoiseStabδ [h] [ψ] .
2− 2
⊗d2d
Proof of Proposition 2.4. We take C to be Tribesd ◦ Rec-Maj-3` . Given that Rec-Maj-3` is balanced,
by Lemma 2.7 we have
NoiseStabδ [C] = NoiseStab 1 NoiseStabδ [Rec-Maj-3` ] [Tribesd ] .
2− 2
√
Setting δ = Θ 1/ k and recalling that 3` ≤ k6 , we have ` ≥ log1.1 (1/δ ) so we may apply Lemma 2.5
to obtain
√ 1.1
−.15
k6 = O k−.35 .
√
NoiseStabΘ(1/ k) [Rec-Maj-3` ] ≤ Θ k
As O(k−.35 ) ≤ O(1/d), we may apply Lemma 2.6 with the previous inequalities to obtain
1
NoiseStabΘ(1/√k) [C] = O d .
2
The bound (2.7) follows from a rearrangement of the bounds on k, m, d and `. It is easy to see that C
can be computed by monotone circuits of depth O(`) = O(log m) and size poly(m). This completes the
proof.
2.4 Nearly optimal hardness of learning polynomial-size monotone circuits
of k, klet m = 3 d2 for `, d as in Proposition 2.4. Let G be a pseudorandom family of
Given a value ` d
functions g : {0, 1} → {0, 1} secure against poly(n)-time adversaries, where n = mk. Given that we
have k = ω(log n), we may apply Corollary 2.3 with the combining function from Proposition 2.4 and
conclude that the class C = {C ◦ slice(g)⊗m | g ∈ G} is hard to learn to accuracy
3√ 3.5 √
1 k log m 1 k log n
+O √ + o(1/n) ≤ + O √ . (2.8)
2 m 2 n
We claim that the functions in C can, in fact, be computed by poly(n)-size monotone circuits. This
follows from a result of Berkowitz [5] that states that if a k-variable slice function is computed by a
Boolean circuit of size s and depth d, then it is also computed by a monotone Boolean circuit with
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 268
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
MAJ gates of size O(s + k) and depth d + 1. Combining these monotone circuits for slice(g) with the
monotone circuit for C, we obtain a poly(n)-size monotone circuit for each function in C.
By making various different assumptions on the hardness of one-way functions, Proposition 1.2 lets
us obtain different quantitative relationships between k (the input length for the pseudorandom functions)
and n (the running time of the adversaries against which they are secure), and thus different quantitative
hardness results from (2.8) above:
Theorem 2.8. Suppose that standard one-way functions exist. Then for any constant ε > 0 there is a
class C of poly(n)-size monotone circuits that is hard to learn to accuracy 1/2 + 1/n1/2−ε .
Proof. If poly(n)-hard one-way functions exist then we may take k = nc in Proposition 1.2 for an arbi-
trarily small constant c; this corresponds to taking d = γ log k for γ a large constant in Proposition 2.4.
The claimed bound on (2.8) easily follows. (We note that while not every n is of the required form
mk = 3` d2d k, it is not difficult to see that this and our subsequent theorems hold for all (sufficiently
large) input lengths n by padding the hard-to-learn functions.)
α
Theorem 2.9. Suppose that sub-exponentially hard (2n for some fixed α > 0) one-way functions
exist. Then there is a class C of poly(n)-size monotone circuits that is hard to learn to accuracy
√
1/2 + polylog(n)/ n.
Proof. As above, but now we take k = logγ n for some sufficiently large constant γ (i. e., d = c log k for
a small constant c).
3 Hardness of learning simple circuits
In this section we obtain hardness results for learning very simple classes of circuits computing mono-
tone functions under a concrete hardness assumption for a specific computational problem, namely fac-
toring Blum integers. Naor and Reingold [26] showed that if factoring Blum integers is computationally
hard then there is a pseudorandom function family, which we denote G? , that is computable in TC0 .
From this it easily follows that the functions {slice(g) | g ∈ G? } are also computable in TC0 .
We now observe that the result of Berkowitz [5] mentioned earlier for converting slice circuits into
monotone circuits applies not only to Boolean circuits, but also to TC0 circuits.
This means that the functions in {slice(g) | g ∈ G? } are in fact computable in monotone TC0 , i. e., by
polynomial-size, constant-depth circuits composed only of AND/OR/MAJ gates. As the majority func-
tion can be computed by polynomial-size, O(log n)-depth monotone Boolean circuits, (see, e. g., [1]), the
functions in {slice(g) | g ∈ G? } are computable by O(log n)-depth monotone Boolean circuits. Finally,
using the parameters in Theorem 2.8 we have a combining function C that is a O(log n)-depth poly-size
monotone Boolean circuit, which implies the following lemma:
Lemma 3.1. Let C be the monotone combining function from Proposition 2.4 and G? be a family of pseu-
dorandom functions computable in TC0 . Then every function in {C ◦ slice(g)⊗m | g ∈ G? } is computable
in monotone NC1 .
This directly yields a hardness result for learning monotone NC1 circuits (the fourth line of Figure 1):
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 269
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
Theorem 3.2. If factoring Blum integers is hard on average for any poly(n)-time algorithm, then for any
constant ε > 0 there is a class C of poly(n)-size monotone NC1 circuits that is hard to learn to accuracy
1/2 + 1/n1/2−ε .
Now we show that under a stronger but still plausible assumption on the hardness of factoring Blum
integers, we get a hardness result for learning a class of constant-depth monotone circuits that is very
close to a class known to be learnable to any constant accuracy in poly(n) time. Suppose that n-bit Blum
α
integers are 2n -hard to factor on average for some fixed α > 0 (which is the same hardness assumption
α/2
that was earlier used by Kharitonov [19]). This means there exist 2n -secure pseudorandom functions
that are computable in TC0 . Using such a family of functions in place of G? in the construction for the
preceding theorem and fixing ε = 1/3, we obtain:
Lemma 3.3. Assume that Blum integers are 2n -hard to factor on average. Then there is a class C
α
of poly(n)-size monotone NC1 circuits that is hard for any 2n -time algorithm to learn to accuracy
α/20
1/2 + 1/n1/6 .
Now we “scale down” this class C as follows. Let n0 be such that n0 = (log n)κ for a suitable constant
κ > 20/α, and let us substitute n0 for n in the construction of the previous lemma; we call the resulting
class of functions C0 . In terms of n, the functions in C0 (which are functions over {0, 1}n that only
depend on the first n0 variables) are computed by (log n)O(κ) -size, O(log log n)-depth monotone circuits
whose inputs are the first (log n)κ variables in x1 , . . . , xn . We moreover have that C0 is hard for any
0 α/20 κα/20
2(n ) = 2(log n) = ω(poly(n))-time algorithm to learn to some accuracy
1 1 1
+ 0 1/6 = + o(1) .
2 (n ) 2
We now recall the following variant of Nepomnjaščiı̆’s theorem that is implicit in [2].
Lemma 3.4. For every language L ∈ NL, for all sufficiently large constant d there are AC0d circuits of
O(1)/(d+1)
size 2n that recognize L.
As every function in C0 can be computed in NC1 , which is contained in NL, combining Lemma 3.4
with the paragraph that proceeds it, we obtain the following theorem (final line of Figure 1):
Theorem 3.5. Suppose that Blum integers are sub-exponentially hard to factor on average. Then there
is a class C of monotone functions that is hard for any poly(n)-time algorithm to learn to accuracy 1/2 +
O(1)/(d+1)
o(1) and that, for all sufficiently large constant d, are computed by AC0d circuits of size 2(log n) .
This final hardness result is of interest because it is known that constant-depth circuits of only slightly
smaller size can be learned to any constant accuracy in poly(n) time under the uniform distribution
(without needing membership queries):
1/(d+1) )
Theorem 3.6 ([31] Corollary 2). For all d ≥ 2, the class of AC0d circuits of size 2O((log n) that
compute monotone functions can be learned to any constant accuracy 1 − ε in poly(n)-time.
Theorem 3.5 is thus nearly optimal in terms of the size of the constant-depth circuits for which it
establishes hardness of learning.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 270
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
4 A computational analogue of the Blum-Burch-Langford lower bound
In this section we first present a simple variant of the lower bound construction in [6], obtaining an
information-theoretic lower bound on the learnability of the general class of all monotone Boolean
functions. The quantitative bound our variant achieves is weaker than that of [6], but has the advantage
that it can be easily derandomized. Indeed, as mentioned in Section 5 (and further discussed below),
our construction uses a certain probability distribution over monotone DNFs, such that a typical random
input x satisfies only poly(n) many “candidate terms” (which are terms that may be present in a random
DNF drawn from our distribution). By selecting terms for inclusion in the DNF in a pseudorandom
rather than truly random way, we obtain a class of poly(n)-size monotone circuits that is hard to learn to
accuracy 1/2 + 1/polylog(n) (assuming one-way functions exist).
Below we start with an overview of why it is difficult to obtain a computational analogue of the exact
construction of [6] using the pseudorandom approach sketched above, and the idea behind our variant,
which overcomes this difficulty. We then provide our information theoretic construction and analysis,
followed by its computational analogue.
4.1 Idea
Recall the information-theoretic lower bound from [6]. It works by defining a distribution Ps over mono-
tone functions of the form {0, 1}n → {0, 1} as follows. (Here s is a numerical parameter which should
be thought of as the number of membership queries that a learning algorithm is allowed to make.) Take
t = log(3sn). A draw from Ps is obtained by randomly including each length-t monotone term in the DNF
independently with probability p0 , where p0 is chosen so that the function is expected to be balanced on
“typical inputs” (more precisely, on inputs with exactly n/2 ones). The naive idea for derandomizing
this construction is to simply use a pseudorandom function with bias p0 to determine whether each pos-
sible term of size t should be included or excluded in the DNF. However, there is a problem with this
approach: we do not know an efficient way to determine whether a typical example x (with, say, n/2
n/2
ones) has any of its t candidate terms (each of which is pseudorandomly present/not present in f )
actually present in f , so we do not know how to evaluate f on a typical input x in less than n/2
t time.
We get around this difficulty by instead considering a new distribution of random monotone DNFs.
In our construction we will again use a random function with bias p to determine whether each possible
term of length t is present in the DNF. However, in our construction, a typical example x will have only
a polynomial number of candidate terms that could be satisfied, and thus it is possible to check all of
them and evaluate the function in poly(n) time.
The main difficulty of this approach is to ensure that although a typical example has only a poly-
nomial number of candidate terms, the function is still hard to learn in polynomial time. We achieve
this by partitioning the variables into blocks of size k and viewing each block as a “super-variable”
(corresponding to the AND of all k variables in the block). We then construct the DNF by randomly
choosing length-t terms over these super-variables. It is not difficult to see that with this approach, we
can equivalently view our problem as learning a t-DNF f with terms chosen as above, where each of the
n/k variables is drawn from a product distribution with bias 1/2k . By fine-tuning the parameters that
determine t (the size of each term of the DNF) and k (the size of the partitions), we are able to achieve
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 271
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
an information-theoretic lower bound showing that this distribution over monotone functions is hard to
learn to accuracy 1/2 + o(1).
4.2 Construction
Let us partition the variables x1 , . . . , xn into m = n/k blocks B1 , . . . , Bm of k variables each. Let Xi denote
the conjunction of all k variables in Bi (X1 , . . . , Xm are the super-variables). The following is a description
of our distribution P over monotone functions. A function f is drawn from P as follows (we fix the values
of k,t later):
• Construct a monotone DNF f1 as follows: each possible conjunction of t super-variables chosen
from {X1 , . . . , Xm } is placed in the target function f1 independently with probability p, where p is
defined as the solution to:
m/2k 1
(1 − p)( t ) = . (4.1)
2
Note that for a uniform choice of x ∈ {0, 1}n , we expect m/2k ones in the corresponding “super-
assignment” X = (X1 , . . . , Xm ), and any superassignment with this many ones will be satisfied
k
by m/2t many terms. Thus p is chosen such that a “typical” example X, with m/2k ones, has
probability 1/2 of being labeled positive under f1 .
• Let
(
f1 (x) if the number of supervariables satisfied in x is at most m/2k + (m/2k )2/3 ,
f (x) =
1 otherwise.
Note that because of the final step of the construction, the function f is not actually a DNF (though
it is a monotone function). Intuitively, the final step is there because if too many supervariables were
satisfied in x, there could be too many (more than poly(n)) candidate terms to check, and we would not
be able to evaluate f1 efficiently. We will show later that the probability that the number of supervariables
k 1/3
satisfied in x is greater than m/2k + (m/2k )2/3 is at most 2e−(m/2 ) /3 = 1/nω(1) , and thus the function
f is 1/nω(1) -close to f1 ; so hardness of learning results established for the random DNFs f1 carry over
to the actual functions f . For most of our discussion we shall refer to P as a distribution over DNFs,
meaning the functions f1 .
4.3 Information-theoretic lower bound
As discussed previously, we view the distribution P defined above as a distribution over DNFs of terms of
size t over the supervariables. Each possible combination of t supervariables appears in f1 independently
with probability p and the supervariables are drawn from a product distribution that is 1 with probability
1/2k and 0 with probability 1 − 1/2k . We first observe that learning f over the supervariables drawn
from the product distribution is equivalent to learning the original function over the original variables.
This is because if we are given the original membership query oracle for n-bit examples we can simulate
answers to membership queries on m-bit “supervariable” examples and vice versa. Thus we henceforth
analyze the product distribution.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 272
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
We follow the proof technique of [6]. To simplify our analysis, we consider an “augmented” oracle,
as in [6]. Given a query X, with ones in positions indexed by the set SX , the oracle will return the first
conjunct in lexicographic order that appears in the target function and is satisfied by X. Additionally,
the oracle returns 1 if X is positive and 0 if X is negative. (So instead of just giving a single bit as its
response, if the example is a positive one the oracle tells the learner the lexicographically first term in
the target DNF that is satisfied.) Clearly, lower bounds for this augmented oracle imply the same bounds
for the standard oracle.
We are interested in analyzing Ps , the conditional distribution over functions drawn from the initial
distribution P that are consistent with the information learned by A in the first s queries. We can think of
Ps as a vector Vs of mt elements, one for each possible conjunct of size t. Initially, each element of the
vector contains p, the probability that the conjunct is in the target function. When a query is made, the
oracle examines one by one the entries that satisfy X. For each entry having value p, we can think of the
oracle as flipping a coin and replacing the entry by 0 with probability 1 − p and by 1 with probability
p. After s queries, Vs will contain some entries set to 0, some set to 1 and the rest set to p. Because Vs
describes the conditional distribution Ps given the queries made so far, the Bayes-optimal prediction for
an example X is simply to answer 1 if Vs (X) ≥ 1/2 and 0 otherwise.
We now analyze Vs (X), the conditional probability over functions drawn from P that are consistent
with the first s queries that a random example, X, drawn from the distribution, evaluates to 1, given
the answers to the first s queries. We will show that for s = poly(n), for X drawn from the product
distribution on {0, 1}m , with probability at least 1 − 1/nω(1) the value Vs (X) lies in 1/2 ± 1/log n. This
is easily seen to give a lower bound of the type we require.
Following [6], we first observe that after s queries there can be at most s entries set to one in the
vector Vs . We shall also use the following lemma from [6]:
Lemma 4.1 ([6]). After s queries, with probability 1 − e−s/4 , there are at most 2s/p zeros in Vs .
We thus may henceforth assume that there are at most 2s/p zeros in Vs .
We now establish the following, which is an analogue (tailored to our setting) of Claim 3 of [6]:
Lemma 4.2. For any vector Vs of size mt with at most s entries set to 1, at most 2s/p entries set to
0, and the remaining entries set to p, for a random example X (drawn from {0, 1}m according to the
1/2k -biased product distribution), we have that with probability at least 1 − ε1 , the quantity Vs (X) lies
in the range
√
k k 1/3
(m/2 −(m/2 )
)− 2sp2ktn k k 1/3
m/2 +(m/2 )
≤ Vs (X) ≤ 1 − (1 − p)( ).
t
1 − (1 − p) t (4.2)
Here √
2 n k 1/3
ε1 = s · + 1 2−kt + 2e−(m/2 ) /3 . (4.3)
p
Proof. Let X be a random example drawn from the 1/2k -biased product distribution over {0, 1}m , and
consider the following 3 events:
• None of the 1-entries in Vs are satisfied by X.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 273
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
There are at most s 1-entries in Vs and the probability that any one is satisfied by X is 2−kt .
Therefore the probability that some 1-entry is satisfied by X is at most s2−kt and the probability
that none of the 1-entries in Vs are satisfied by X is at least 1 − s2−kt .
√
• At most (2s n/p)2−kt of the 0-entries in Vs are satisfied by X.
Because there are at most 2s/p entries set to 0 in Vs , the expected number of 0-entries in Vs satisfied
by X is at most (2s/p)2−kt . By Markov’s inequality, the probability that the actual number exceeds
√ √
this by a n factor is at most 1/ n.
• The number of ones in X lies in the range m/2k ± (m/2k )2/3 .
Using a multiplicative Chernoff bound, we have that this occurs with probability at least 1 −
k 1/3
2e−(m/2 ) /3 . Note that for any X in this range, f (X) = f1 (X). So, conditioning on this event
occurring, we can assume that f (X) = f1 (X).
Therefore, the probability that all 3 of the above events occurs is at least 1 − ε1 where
√
2 n k 1/3
ε1 = s · + 1 2−kt + 2e−(m/2 ) /3 .
p
Given that these events all occur, we show that Vs (X) lies in the desired range. We follow the approach
of [6].
For the lower bound, Vs (X) is minimized when X has as few ones as possible and when as many of
the 0-entries in Vs are satisfied by X as possible. So Vs (X) is at least
√
k k 2/3
(m/2 −(m/2
t
)
)− 2sp2ktn
Vs (X) ≥ 1 − (1 − p) .
For the upper bound, Vs (X) is maximized when X has as many ones as possible and as few zeros as
possible. So, Vs (X) is at most
m/2k +(m/2k )2/3
V (X) ≤ 1 − (1 − p)(
s t ),
which completes the proof.
Now let us choose values for k and t. What are our goals in setting these parameters? First off, we
k
want m/2t to be at most poly(n) (so that there are at most poly(n) candidate terms to be checked for a
“typical” input). Moreover, for any s = poly(n) we want both sides of (4.2) to be close to 1/2 (so the
accuracy of any s-query learning algorithm is indeed close to 1/2 on typical inputs), and we want ε1 to
be small (so almost all inputs are “typical”). With this motivation, we set k = Θ(log n) to be such that
√
m/2k (recall, m = n/k) equals log6 n, and we set t = log n. This means
6
m/2k √
log n
= √ ≤ 26 log(log n) log n n .
t log n
Now (4.1) gives p 1/n; together with k = Θ(log n), for any s = poly(n) we have ε1 = 1/nω(1) .
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 274
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Now we analyze (4.2). First the lower bound:
m/2k −(m/2k )2/3 2s√n
( )− p2kt
Vs (X) ≥ 1 − (1 − p) t
3s√n
m/2k −(m/2k )2/3
≥ 1 − (1 − p) ( t ) e p2kt
m/2k −(m/2k )2/3
= 1 − (1 − p)( t ) 1 + 1/nω(1)
h m/2k −(m/2k )2/3 m/2k i
= 1 − 2−( t )/( t ) · 1 + 1/nω(1) .
(In the last step we are using the definition of p from (4.1).) Let us bound the exponent:
!t
m/2k −(m/2k )2/3
m/2 k − (m/2k )2/3 − t
t
m/2k
≥
m/2k
t
6 √ √log n
log n − log4 n − log n
=
log6 n
√
6 4 log n
log n − 2 log n
≥
log6 n
√log n
2
= 1−
log2 n
2
≥ 1− .
log1.5 n
So
h 1.5
i 1 1
Vs (X) ≥ 1 − 2−(1−2/ log n) · (1 + 1/nω(1) ) ≥ − .
2 log n
Now for the upper bound:
k
m/2 +(m/2 ) k 2/3 k k 2/3 k
Vs (x) ≤ 1 − (1 − p)( t ) = 1 − 2−(m/2 +(m/2
t
)
)/(m/2t ) .
Again bounding the exponent:
m/2k +(m/2k )2/3
log6√n+log4 n
t log n
=
m/2k log6 n
√
t log n
6 4 √log n
log n + log n
≤ √
log6 n − log n
√log n
2 log4 n
≤ 1+ √
log6 n − log n
4
≤ 1+ .
log1.5 n
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 275
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
So
− 1+ 4
log1.5 n
1 1
Vs (X) ≤ 1 − 2 ≤ + .
2 log n
The above analysis has thus established the following.
Lemma 4.3. Let L be any poly(n)-time learning algorithm. If L is run with a target function that
is a random draw f from the distribution P described above, then for all but a 1/nω(1) fraction of
inputs x ∈ {0, 1}n , the probability that h(x) = f (x) (where h is the hypothesis output by L) is at most
1/2 + 1/log n.
It is easy to see that by slightly modifying the values of t and k in the above construction, it is actually
possible to replace 1/log n with any 1/polylog n in the above lemma.
4.4 Computational lower bound
To obtain a computational analogue of Lemma 4.3, we make a pseudorandom choice of terms in a draw
of f1 from P.
Recall that the construction of P placed each possible term (conjunction of t supervariables) in the
target function with probability p, as defined in (4.1). We first consider a distribution that uses uniform
bits to approximate the probability p. This can be done by approximating log(p−1 ) with poly(n) bits,
associating each term with independent uniform poly(n) bits chosen this way, and including that term
in the target function if all bits are set to 0. It is easy to see that the resulting construction yields a
probability distribution that is statistically close to P, and we denote it by Pstat .
Now, using a pseudorandom function rather than a truly random (uniform) one for the source of
uniform bits will yield a distribution, which we denote by PPSR . Similar arguments to those we give
elsewhere in the paper show that a poly(n) time adversary cannot distinguish the resulting construction
from the original one (or else a distinguisher could be constructed for the pseudorandom function).
To complete the argument, we observe that every function f in the support of PPSR can be evaluated
with a poly(n)-size circuit. It is obviously easy to count the number of supervariables that are satisfied
in an input x, so we need only argue that the function f1 can be computed efficiently on a “typical”
input x that has “few” supervariables satisfied. But by construction, such an input will satisfy only
poly(n) candidate terms of the monotone DNF f1 and thus a poly(n)-size circuit can check each of these
candidate terms separately (by making a call to the pseudorandom function for each candidate term to
determine whether it is present or absent). Thus, as a corollary of Lemma 4.3, we can establish the main
result of this section:
Theorem 4.4. Suppose that standard one-way functions exist. Then there is a class C of poly(n)-size
monotone circuits that is hard to learn to accuracy 1/2 + 1/polylog(n).
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 276
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
5 Discussion and future work
An obvious goal for future work is to establish even sharper quantitative bounds on the cryptographic
hardness of learning monotone functions. Blum, Burch, and Langford [6] obtained their
1 ω(log n)
+ √
2 n
information-theoretic lower bound by considering random monotone DNF that are constructed by in-
dependently including each of the logn n possible terms of length log n in the target function. Can we
match this hardness with a class of polynomial-size circuits?
As mentioned in Section 1, it is natural to consider a pseudorandom variant of the construction in [6]
in which a pseudorandom rather than truly random function is used to decide whether or not to include
n
each of the log n candidate terms. However, we have not been able to show that a function f constructed
in this way can be computed by a poly(n)-size circuit. Intuitively, the problem is that for an input x with
(typically) n/2 bits set to 1, to evaluate f we must check the pseudorandom function’s value on all of
n/2
the log n potential “candidate terms” of length log n that x satisfies. Indeed, the question of obtaining
an efficient implementation of these “huge pseudorandom monotone DNF” has a similar flavor to Open
Problem 5.4 of [13]. In that question the goal is to construct pseudorandom functions that support
“subcube queries” that give the parity of the function’s values over all inputs in a specified subcube
of {0, 1}n . In our scenario we are interested in functions f that are pseudorandom only over the logn n
inputs with precisely log n ones (these inputs correspond to the “candidate terms” of the monotone DNF)
and are zero everywhere else, and we only need to support “monotone subcube queries” (i. e., given an
input x, we want to know whether f (y) = 1 for any y ≤ x).
References
[1] M. A JTAI , J. KOML ÓS , AND E. S ZEMER ÉDI: An O(n log n) sorting network. Combinatorica,
3(1):1–19, 1983. [doi:10.1007/BF02579338]. 269
[2] E. A LLENDER , L. H ELLERSTEIN , P. M C C ABE , T. P ITASSI , AND M. S AKS: Minimizing DNF
formulas and ACd0 circuits given a truth table. In Proc. 21st IEEE Conf. Comput. Complex. (CCC),
pp. 237–251. IEEE Comp. Soc. Press, 2006. [CCC:10.1109/CCC.2006.27]. 261, 270
[3] R. B EALS , T. N ISHINO , AND K. TANAKA: On the complexity of negation-limited Boolean net-
works. SIAM J. Comput., 27(5):1334–1347, 1998. [SICOMP:10.1137/S0097539794275136]. 264
[4] I. B ENJAMINI , G. K ALAI , AND O. S CHRAMM: Noise sensitivity of Boolean functions and appli-
cations to percolation. Publ. Math. Inst. Hautes Etudes Sci., 90:5–43, 1999. 258
[5] S. J. B ERKOWITZ: On some relationships between monotone and non-monotone circuit complex-
ity. Technical report, Technical Report, University of Toronto, 1982. 264, 268, 269
[6] A. B LUM , C. B URCH , AND J. L ANGFORD: On learning monotone Boolean functions. In Proc.
39th FOCS, pp. 408–415. IEEE Comp. Soc. Press, 1998. [FOCS:10.1109/SFCS.1998.743491].
259, 260, 261, 262, 264, 271, 273, 274, 277
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 277
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
[7] A. B LUM , M. F URST, M. K EARNS , AND R. L IPTON: Cryptographic primitives based on hard
learning problems. In Proc. Advances in Cryptology – CRYPTO’93, number 773 in LNCS, pp.
278–291. Springer, 1993. [doi:10.1007/3-540-48329-2]. 258
[8] A. B LUM , K. L IGETT, AND A. ROTH: A learning theory perspective on data privacy: New hope
for releasing useful databases privately. In Proc. 40th STOC, pp. 609–618. ACM Press, 2008.
[STOC:10.1145/1374376.1374464]. 258
[9] N. B SHOUTY AND C. TAMON: On the Fourier spectrum of monotone functions. J. ACM,
43(4):747–770, 1996. [JACM:10.1145/234533.234564]. 258, 259
[10] D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE:
Optimal cryptographic hardness of learning monotone functions. In Proc. 35th Intern. Colloq. Au-
tomata, Languages and Programming, Part I, pp. 36–47. Springer-Verlag, 2008. [doi:10.1007/978-
3-540-70575-8 4]. 257
[11] V. F ELDMAN , P. G OPALAN , S. K HOT, AND A. P ONNUSWAMI: New results for learning noisy
parities and halfspaces. In Proc. 47th FOCS, pp. 563–576. IEEE Comp. Soc. Press, 2006.
[FOCS:10.1109/FOCS.2006.51]. 258
[12] O. G OLDREICH , S. G OLDWASSER , AND S. M ICALI: How to construct random functions. J.
ACM, 33(4):792–807, 1986. [JACM:10.1145/6490.6503]. 263
[13] O. G OLDREICH , S. G OLDWASSER , AND A. N USSBOIM: On the implementation of
huge random objects. In Proc. 44th FOCS, pp. 68–79. IEEE Comp. Soc. Press, 2003.
[FOCS:10.1109/SFCS.2003.1238182]. 261, 277
[14] J. H ÅSTAD , R. I MPAGLIAZZO , L. L EVIN , AND M. L UBY: A pseudorandom gen-
erator from any one-way function. SIAM J. Comput., 28(4):1364–1396, 1999.
[SICOMP:10.1137/S0097539793244708]. 263
[15] A. H EALY, S. VADHAN , AND E. V IOLA: Using nondeterminism to amplify hardness. SIAM J.
Comput., 35(4):903–931, 2006. [SICOMP:10.1137/S0097539705447281]. 263, 264, 266, 267
[16] J. K AHN , G. K ALAI , AND N. L INIAL: The influence of variables on Boolean functions. In Proc.
29th FOCS, pp. 68–80. IEEE Comp. Soc. Press, 1988. [FOCS:10.1109/SFCS.1988.21923]. 258
[17] S. K ASIVISWANATHAN , H. K. L EE , K. N ISSIM , S. R ASKHODNIKOVA , AND A. S MITH: What
can we learn privately? In Proc. 49th FOCS, pp. 531–540. IEEE Comp. Soc. Press, 2008.
[FOCS:10.1109/FOCS.2008.27]. 258
[18] M. J. K EARNS , M. L I , AND L. G. VALIANT: Learning Boolean formulas. J. ACM, 41(6):1298–
1328, 1994. [JACM:10.1145/195613.195656]. 264
[19] M. K HARITONOV: Cryptographic hardness of distribution-specific learning. In Proc. 25th FOCS,
pp. 372–381. IEEE Comp. Soc. Press, 1993. [FOCS:10.1109/SFCS.1993.366850]. 259, 260, 270
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 278
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
[20] M. K HARITONOV: Cryptographic lower bounds for learnability of Boolean functions on the uni-
form distribution. J. Comput. System Sci., 50(3):600–610, 1995. [JCSS:10.1006/jcss.1995.1046].
258, 259
[21] A. K LIVANS , R. O’D ONNELL , AND R. S ERVEDIO: Learning intersections and thresholds of
halfspaces. J. Comput. System Sci., 68(4):808–840, 2004. [JCSS:10.1016/j.jcss.2003.11.002]. 258
[22] N. L INIAL , Y. M ANSOUR , AND N. N ISAN: Constant depth circuits, Fourier transform and learn-
ability. J. ACM, 40(3):607–620, 1993. [JACM:10.1145/174130.174138]. 258
[23] Y. M ANSOUR: Learning Boolean functions via the Fourier transform. In Theoretical Advances in
Neural Computation and Learning, pp. 391–424. Kluwer Academic Publishers, 1994. 258
[24] E. M OSSEL AND R. O’D ONNELL: On the noise sensitivity of monotone functions. Random
Structures Algorithms, 23(3):333–350, 2003. [Wiley:10.1002/rsa.10097]. 264, 267, 268
[25] E. M OSSEL , R. O’D ONNELL , AND R. S ERVEDIO: Learning functions of k relevant variables. J.
Comput. System Sci., 69(3):421–434, 2004. [JCSS:10.1016/j.jcss.2004.04.002]. 259
[26] M. NAOR AND O. R EINGOLD: Number-theoretic constructions of efficient pseudo-random func-
tions. J. ACM, 51(2):231–262, 2004. [JACM:10.1145/972639.972643]. 259, 269
[27] V. A. N EPOMNJA Š ČI Ĭ: Rudimentary predicates and Turing computations (in Russian). Dokl.
Akad. Nauk SSSR, 195:282–284, 1970. English translation in Soviet Mathematics Doklady, Vol. 11,
pp. 1642–1645, 1970. Math. Reviews MR02811611 (43#7326). 261
[28] R. O’D ONNELL: Hardness amplification within NP. J. Comput. System Sci., 69(1):68–94, 2004.
[JCSS:10.1016/j.jcss.2004.01.001]. 261, 263, 264, 266, 267, 268
[29] R. O’D ONNELL AND R. S ERVEDIO: Learning monotone decision trees in polynomial time. SIAM
J. Comput., 37(3):827–844, 2007. [SICOMP:10.1137/060669309]. 258, 259
[30] A. R AZBOROV: Lower bounds on the monotone network complexity of the logical permanent.
Mat. Zametki, 37:887–900, 1985. English translation in Math. Notes of the Academy of Sciences
of the USSR, Vol 37, pp. 485-493, 1985. 262
[31] R. S ERVEDIO: On learning monotone DNF under product distributions. Inform. Comput.,
193(1):57–74, 2004. 259, 260, 270
[32] É. TARDOS: The gap between monotone and non-monotone circuit complexity is exponential.
Combinatorica, 8(1):141–142, 1988. [10.1007BF02122563]. 262
[33] L. T REVISAN: List decoding using the XOR lemma. In Proc. 44th FOCS, pp. 126–135. IEEE
Comp. Soc. Press, 2003. [FOCS:10.1109/SFCS.2003.1238187]. 263, 264
[34] L. G. VALIANT: A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[ACM:10.1145/1968.1972]. 258
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 279
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
[35] L. G. VALIANT: Negation is powerless for Boolean slice functions. SIAM J. Comput., 15:531–535,
1986. [SICOMP:10.1137/0215037]. 264
[36] K. V ERBEURGT: Learning DNF under the uniform distribution in quasi-polynomial time. In Proc.
3rd Ann. Workshop Comput. Learn. Theory, pp. 314–326. Morgan Kaufman, 1990. 259
AUTHORS
Dana Dachman-Soled [About the author]
Department of Computer Science
Columbia University, New York, NY
dglasner cs columbia edu
Homin K. Lee [About the author]
Department of Computer Science
University of Texas at Austin
homin cs utexas edu
http://www.cs.utexas.edu/∼homin
Tal Malkin [About the author]
Assistant Professor
Department of Computer Science
Columbia University, New York, NY
tal cs columbia edu
http://www.cs.columbia.edu/∼tal
Rocco A. Servedio [About the author]
Associate Professor
Columbia University, New York, NY
rocco cs columbia edu
http://www.cs.columbia.edu/∼rocco
Andrew Wan [About the author]
Department of Computer Science
Columbia University, New York, NY
atw12 columbia edu
http://www.cs.columbia.edu/∼atw12
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 280
O PTIMAL C RYPTOGRAPHIC H ARDNESS OF L EARNING M ONOTONE F UNCTIONS
Hoeteck Wee [About the author]
Assistant Professor
Department of Computer Science
Queens College, City University of New York
hoeteck cs qc cuny edu
http://www.cs.qc.edu/∼hoeteck/
ABOUT THE AUTHORS
DANA DACHMAN -S OLED (a. k. a. DANA G LASNER) is a Ph. D. candidate in the Depart-
ment of Computer Science at Columbia University, supervised by Tal Malkin and Rocco
Servedio. She received her undergraduate degree in Computer Science from Yeshiva
University. Dana’s research interests include property testing and cryptography. She
enjoys life in New York City with her husband and their 17-month-old3 son.
H OMIN K. L EE is a postdoctoral researcher under the direction of Adam Klivans at the Uni-
versity of Texas at Austin4 . He received his Ph. D. in 2009 from Columbia University
where he was advised by Tal Malkin and Rocco A. Servedio. His thesis was titled On the
Learnability of Monotone Functions, and his research interests include computational
learning theory and the analysis of Boolean function. He is often hungry.
TAL M ALKIN received her Ph. D. in Computer Science from MIT in 2000, under the su-
pervision of Shafi Goldwasser. After spending three years at AT&T Labs Research, she
joined the Department of Computer Science at Columbia University, where she heads
the Cryptography Lab. Her research interests are in cryptography, security, complexity
theory, and related areas, with foundations of cryptography being closest to her heart.
She enjoys living in New York City and spending time with (and without) her husband,
two sons, and cat. In theory she loves theater, ice hockey, editing the journal “Theory of
Computing,” and sleep, but in practice she doesn’t get to enjoy these activities too often.
ROCCO A. S ERVEDIO is an associate professor in the Department of Computer Science at
Columbia University. He graduated from Harvard University in 2001 where his Ph. D.
was supervised by Leslie Valiant. He is interested in computational learning theory,
computational complexity, and other topics. He enjoys spending time with his family
and hopes to drink a quart of stout with Herman Melville in the afterlife.
3 At the time of submission.
4 The author performed this work as a Ph. D. candidate at Columbia University.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 281
D. DACHMAN -S OLED , H. K. L EE , T. M ALKIN , R. A. S ERVEDIO , A. WAN , AND H. W EE
A NDREW WAN is a Ph. D. candidate at Columbia University, advised by Tal Malkin and
Rocco Servedio. His interests include complexity theory, cryptography, and compu-
tational learning theory. Before graduate school, he was a student of philosophy at
Columbia University and enjoyed playing the piano, the trumpet, and the accordion.
Although he still enjoys playing music, the PAC model rarely affords him the time.
H OETECK W EE is an assistant professor at Queens College, CUNY5 . He received his Ph. D.
from UC Berkeley under the supervision of Luca Trevisan and his B. S. from MIT. He
was previously a postdoc at Columbia University and a visiting student at Tsinghua
University and IPAM. Hoeteck currently lives in Manhattan close to the cafés in order
to cut down on his commute. He’s working to convince more people that “black box is
the new black.”
5 This work was performed while the author was a postdoc at Columbia University.
T HEORY OF C OMPUTING, Volume 5 (2009), pp. 257–282 282