                                 THEORY OF COMPUTING, Volume 2 (2006), pp. 185–206
                                             http://theoryofcomputing.org




                         Learning Restricted Models of
                              Arithmetic Circuits
                                       Adam R. Klivans∗                                 Amir Shpilka
                                     Received: August 31, 2005; published: September 28, 2006.




        Abstract: We present a polynomial time algorithm for learning a large class of algebraic
        models of computation. We show that any arithmetic circuit whose partial derivatives in-
        duce a low-dimensional vector space is exactly learnable from membership and equivalence
        queries. As a consequence, we obtain polynomial-time algorithms for learning restricted
        algebraic branching programs as well as noncommutative set-multilinear arithmetic formu-
        lae. In addition, we observe that the algorithms of Bergadano et al. (1996) and Beimel et
        al. (2000) can be used to learn depth-3 set-multilinear arithmetic circuits. Previously only
        versions of depth-2 arithmetic circuits were known to be learnable in polynomial time. Our
        learning algorithms can be viewed as solving a generalization of the well known polyno-
        mial interpolation problem where the unknown polynomial has a succinct representation.
        We can learn representations of polynomials encoding exponentially many monomials. Our
        techniques combine a careful algebraic analysis of the partial derivatives of arithmetic cir-
        cuits with “multiplicity automata” learning algorithms due to Bergadano et al. (1997) and
        Beimel et al. (2000).

   ∗ This research was supported by an NSF Mathematical Sciences Postdoctoral Fellowship.



ACM Classification: I.2.6, F.2.2
AMS Classification: 68Q32
Key words and phrases: learning, exact learning, arithmetic circuits, partial derivatives, multiplicity
automata

Authors retain copyright to their work and grant Theory of Computing unlimited rights
to publish the work electronically and in hard copy. Use of the work is permitted as
long as the author(s) and the journal are properly acknowledged. For the detailed
copyright statement, see http://theoryofcomputing.org/copyright.html.

© 2006 Adam R. Klivans and Amir Shpilka                                                         DOI: 10.4086/toc.2006.v002a010

1     Introduction
Let p be an unknown multivariate polynomial over a fixed field. Given random input/output pairs cho-
sen from some distribution D, can a computationally bounded learner output a hypothesis which will
correctly approximate p with respect to future random examples chosen from D? This problem, known
as the multivariate polynomial learning problem, continues to be a fundamental area of research in
computational learning theory. If the learner is allowed to query the unknown polynomial at points of
his choosing (instead of receiving random examples) and is required to output the exact polynomial p,
then this problem is precisely the well-known polynomial interpolation problem. Both the learning and
the interpolation problem have received a great deal of attention from the theoretical computer science
community. In a learning context, multivariate polynomials are expressive structures for encoding in-
formation (sometimes referred to as the “algebraic” analogue of DNF formulae (see e. g. [4])) while
polynomial interpolation has been studied in numerous contexts and has important applications in com-
plexity theory, among other fields [2, 34].
    Previous research on this problem has focused on giving algorithms whose running time is polyno-
mial in the number of terms or monomials of the unknown polynomial. This is a natural way to measure
the complexity of learning and interpolating polynomials when the unknown polynomial is viewed in
the usual “sum of monomials” representation. That is to say, given that the polynomial p = ∑_{i=1}^t m_i is the
sum of t monomials, we may wish to output a list of these monomials (and their respective coefficients),
hence using at least t time steps simply to write down the list of coefficients. Several researchers have
developed powerful interpolation and learning algorithms for a variety of contexts which achieve time
bounds polynomial in all the relevant parameters, including t (see for example [4, 11, 16, 20, 23, 31]).

1.1   Arithmetic circuits
In this paper we are concerned with learning succinct representations of polynomials via alternate al-
gebraic models of computation, most notably arithmetic circuits. An arithmetic circuit syntactically
represents a multivariate polynomial in the obvious way: a multiplication (addition) gate outputs the
product (sum) of the polynomials on its inputs. The input wires to the circuit correspond to the input
variables of the polynomial and thus the output of the circuit computes some polynomial of the input
variables. We measure the size of an arithmetic circuit as the number of gates. For example, the standard
“sum of monomials” representation of a polynomial p = ∑_{i=1}^t α_i x_{i_1} · · · x_{i_n} (α_i is an arbitrary field
element) corresponds precisely to a depth-2 arithmetic circuit with a single sum gate at the root and t
product gates feeding into the sum gate (each product gate multiplies some subset of the input variables).
To rephrase previous results on learning and interpolating polynomials in terms of arithmetic circuits,
we could say that depth-2 arithmetic circuits with a sum gate at the root are learnable in time polynomial
in the size of the circuit.
    Moving beyond the standard “sum of products” representation, we consider the complexity of learn-
ing higher depth arithmetic circuits. It is easy to see that there exist polynomial size depth-3 (or even
depth-2 with a product gate at the root) arithmetic circuits capable of computing polynomials with
exponentially many monomials. For example, let {L_{i,j}}, 1 ≤ i, j ≤ n, be a family of n² distinct linear forms
over n variables. Then ∑_{i=1}^n ∏_{j=1}^n L_{i,j} is a polynomial size depth-3 arithmetic circuit but cannot be written
as a sum of polynomially many monomials. Although arithmetic circuits have received a great deal of


attention in complexity theory and, more recently, derandomization, the best known result for learning
arithmetic circuits in a representation other than the depth-2 sum of products representation is due to
Beimel et al. [4], who show that depth-2 arithmetic circuits with a product gate of fan-in O(log n) at
the root and addition gates¹ of unbounded fan-in at the bottom level are learnable in polynomial time,
and that circuits that compute polynomials of the form ∑_i ∏_j p_{i,j}(x_j) (where each p_{i,j} is a univariate
polynomial of polynomial degree) can be learned in polynomial time.²
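The monomial blowup described above is easy to observe experimentally. The sketch below (our illustration, not from the paper) multiplies n copies of the single linear form x_1 + · · · + x_n — an even simpler depth-3 circuit of size O(n) — and counts the distinct monomials in the expanded polynomial, which is C(2n−1, n) and hence exponential in n.

```python
from collections import defaultdict
from math import comb

def mul(p, q):
    """Multiply two polynomials stored as {exponent-tuple: coefficient} dicts."""
    r = defaultdict(int)
    for ea, ca in p.items():
        for eb, cb in q.items():
            r[tuple(a + b for a, b in zip(ea, eb))] += ca * cb
    return dict(r)

def linear_form(coeffs):
    """The linear form sum_j coeffs[j] * x_j as a polynomial dict."""
    n = len(coeffs)
    return {tuple(int(k == j) for k in range(n)): c
            for j, c in enumerate(coeffs) if c != 0}

n = 4
f = {(0,) * n: 1}                      # the constant polynomial 1
for _ in range(n):                     # one product gate of fan-in n
    f = mul(f, linear_form([1] * n))

print(len(f), comb(2 * n - 1, n))      # 35 35
```

Already for n = 4 the circuit has only a handful of gates while the expanded polynomial has 35 monomials; the gap widens exponentially with n.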

1.2     Our results
We learn various models of algebraic computation capable of encoding exponentially many monomials
in their input size. Our algorithms work with respect to any distribution and require membership query
access to the concept being learned. More specifically we show that any class of polynomial size arith-
metic circuits whose partial derivatives induce a vector space of dimension polynomial in the circuit
size is learnable in polynomial time. This characterization generalizes the work of Beimel et al. [4] and
yields the following results:

      • An algorithm for learning general depth-3 arithmetic circuits with m product gates, each of fan-in
        at most d, in time polynomial in m, 2^d, and n, the number of variables.

      • The first polynomial time algorithm for learning polynomial size noncommutative formulae
        computing polynomials over a fixed partition of the variables (note that we place no depth
        restriction on the formula).

      • The first polynomial time algorithm for learning polynomial size read once, oblivious algebraic
        branching programs.

    As an easy consequence of our results we observe a polynomial time algorithm for learning the
class of depth-3 set-multilinear circuits: polynomials p = ∑_{i=1}^m ∏_{j=1}^n L_{i,j}(X_j) where each L_{i,j} is a linear
form and the X_j’s are a partition of the input variables. We note that this result also follows as an easy
corollary from the work of [4]. Finally we show that, with respect to known techniques, it is hard to
learn polynomial size depth-3 homogeneous arithmetic circuits in polynomial time.3 This indicates that
our algebraic techniques give a fairly tight characterization of the learnability of arithmetic circuits with
respect to current algorithms.

1.3     Our techniques
We use as a starting point the work on multiplicity automata due to Beimel et al. [4]. A multiplicity
automaton is a nondeterministic finite automaton where each transition edge has a weight from the
underlying field (for a precise definition see Section 2). On input x, the output f (x) is equal to the sum,
over all accepting paths of the automaton on input x, of the product of the edge weights on that accepting path.
   1 Beimel et al. actually allow addition gates to sum powers of the input variables, rather than just summing variables.
   2 The latter class of circuit can be viewed as a restricted version of depth-3 circuits where the addition gates at the bottom
can only sum powers of a certain variable.
   3 A depth-3 circuit p = ∑_{i=1}^m ∏_{j=1}^{d_i} L_{i,j}(x_1, . . . , x_n) is homogeneous if in each L_{i,j} the free term is zero.



In [7, 4], the authors, building on work due to [27], show that multiplicity automata can be learned in
polynomial time and that these multiplicity automata can compute polynomials in their standard sum
of products representation (actually, as mentioned earlier, they can learn any polynomial p of the form
p = ∑_{i=1}^n ∏_{j=1}^m p_{i,j}(x_j) where p_{i,j}(x_j) is a univariate polynomial of polynomial degree). Their analysis
centers on the Hankel matrix of a multiplicity automaton (see Section 2 for a definition).
    We give a new characterization of learnability in terms of partial derivatives. In particular we show
that any polynomial whose partial derivatives induce a low dimensional vector space has a low rank
Hankel matrix. We conclude that any arithmetic circuit or branching program whose partial derivatives
form a low dimensional vector space can be computed by a polynomial size multiplicity automaton and
is amenable to the learning algorithms developed in [7, 4]. As such, we output a multiplicity automaton
as our learner’s hypothesis.
    Our next task is to show which circuit classes have partial derivatives that induce low dimensional
vector spaces. At this point we build on work due to Nisan [25] and Nisan and Wigderson [26] (see
also [33]), who analyzed the partial derivatives of certain arithmetic circuit classes in the context of
proving lower bounds, and we show that a large class of algebraic models have “well-behaved” partial
derivatives. For example we show that the dimension of the vector space of partial derivatives induced
by a set-multilinear depth-3 arithmetic circuit is polynomial in the size of the circuit.
    Our results suggest that partial derivatives are a powerful tool for learning multivariate polynomials,
as we are able to generalize all previous work in this area and give new results for learning interesting
algebraic models. Additionally, we can show there are depth-3 polynomial-size, homogeneous, arith-
metic circuits whose partial derivatives induce a vector space of superpolynomial dimension. We feel
this motivates the problem of learning depth-3 homogeneous, polynomial-size arithmetic circuits, as
such a result would require significantly new techniques. We are hopeful that our characterizations in-
volving partial derivatives will further inspire complexity theorists to use their techniques for developing
learning algorithms.

1.4   The relationship to lower bounds
In the case of learning Boolean functions, the ability to prove lower bounds against a class of Boolean
circuits usually coincides with the ability to give strong learning algorithms for those circuits. For
example the well known lower bounds of Håstad [19] against constant depth Boolean circuits are used
heavily in the learning algorithm due to Linial, Mansour, and Nisan [24]. Jackson et al. [21] have shown
that constant depth circuits with a majority gate, one of the strongest circuit classes for which we can
prove lower bounds (see [3]), also admit nontrivial learning algorithms. Furthermore Jackson et al. [21]
show that we will not be able to learn more complicated Boolean circuits unless certain cryptographic
assumptions are false.
    Our work furthers this relationship in the algebraic setting. The models of algebraic computation
we can learn correspond to a large subset of the models of algebraic computation for which strong
lower bounds are known. For example Nisan [25] gives exponential lower bounds for noncommutative
formulae. Nisan and Wigderson [26] prove exponential lower bounds for depth-3 set-multilinear circuits.
Moreover, in both papers the authors prove lower bounds by considering the partial derivatives spanned
by the circuit and the function computed by it, a method similar to ours. Over finite fields there are
exponential lower bounds for depth-3 circuits [15, 17]; however, no exponential lower bounds are known


for general depth-3 arithmetic circuits over infinite fields (see [33]). As in the Boolean case, we exploit
many of the insights from the lower bound literature to prove the correctness of our learning algorithms.
    A preliminary version of this paper appeared in COLT 2003 [22].


1.5    Organization
In Section 2 we review relevant learning results for multiplicity automata as well as state some basic
facts from algebraic complexity. In Section 3 we prove our main theorem, characterizing the learnability
of arithmetic circuits via their partial derivatives. In Sections 4, 5, and 6 we state our main learning
results for various arithmetic circuits and algebraic branching programs.


2     Preliminaries
We denote by F the underlying field, and by char(F) the characteristic of F. When studying a
polynomial f we either assume that char(F) = 0 or that the degree of each variable in f is smaller than
char(F).


2.1    The learning model
We will work in the model of exact learning from membership and equivalence queries, first introduced
by Angluin [1]. In this model a learner begins with some candidate hypothesis h for an unknown concept
f and is allowed access to both a membership query oracle and an equivalence query oracle. The
membership query oracle takes as input x and outputs f (x). The equivalence query oracle takes as
input the learner’s current hypothesis h and outputs a counterexample, namely an input y such that
h(y) ≠ f (y); if no such counterexample exists then we say that the learner has exactly learned f . We
assume that making a membership or an equivalence query of length k takes time k. We say that a
concept f is exactly learnable in time t if there exists an exact learner for f whose running time is
bounded by t. A concept class is considered to be exactly learnable in polynomial time if for every f in
the concept class there exists an exact learner for f running in time polynomial in the size of the smallest
description of f . Known transformations imply that if a concept class is exactly learnable in polynomial
time then it is also learnable in Valiant’s PAC model in polynomial time with membership queries.
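To make the query protocol concrete, here is a toy exact-learning loop over the Boolean domain (the memorizing learner and the brute-force oracles are our own illustration, not an algorithm from the paper): the learner repeatedly asks an equivalence query and patches its hypothesis at each counterexample using a membership query.

```python
import itertools

n = 3
def f(x):                          # the unknown concept, hidden from the learner
    return x[0] ^ (x[1] & x[2])

def membership_query(x):
    return f(x)

def equivalence_query(h):
    """Return a counterexample y with h(y) != f(y), or None if h equals f."""
    for y in itertools.product((0, 1), repeat=n):
        if h(y) != f(y):
            return y
    return None

table = {}                         # learner's state: memorized corrections
def h(x):
    return table.get(x, 0)         # hypothesis: default 0, patched pointwise

queries = 0
while (y := equivalence_query(h)) is not None:
    table[y] = membership_query(y) # fix the hypothesis at the counterexample
    queries += 1

print("exactly learned after", queries, "equivalence queries")
```

Such a memorizing learner needs time proportional to the truth table; the point of the multiplicity-automata results below is that for concepts with low rank Hankel matrices the hypothesis can instead be a small automaton.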


2.2    Multiplicity automata
A multiplicity automaton is a nondeterministic automaton where each transition edge is assigned a
weight, and the output of the automaton for any input x is the sum over all accepting paths of x of
the products of the weights on each path.

Definition 2.1. Let Σ be an alphabet. A multiplicity automaton A of size r over Σ consists of a vector
γ̄ ∈ F^r (i. e. γ̄ = (γ_1, . . . , γ_r)) and a set of matrices {µ_σ}_{σ∈Σ}, where each µ_σ is an r × r matrix over F. The
output of A on input x = (x_1, . . . , x_n) ∈ Σ^n is defined to be the inner product of (∏_{i=1}^n µ_{x_i})_1 and γ̄, where
(∏_{i=1}^n µ_{x_i})_1 equals the first row of the matrix⁴ ∏_{i=1}^n µ_{x_i}. In other words the output is the first coordinate
of the vector (∏_{i=1}^n µ_{x_i}) · γ̄.

     Intuitively each matrix µσ corresponds to the transition matrix of the automaton for symbol σ ∈ Σ.
Iterative matrix multiplication keeps track of the weighted sum of paths from state i to state j for all
i, j ≤ r. The first row of the iterated product corresponds to transition values starting from the initial
state and γ̄ determines the acceptance criteria.
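Definition 2.1 can be implemented in a few lines. The size-2 automaton below over Σ = {0, 1} (our example, not the paper's) computes f(x) = the number of 1s in x: the iterated product is [[1, k], [0, 1]] where k is the count, and γ̄ = (0, 1) reads k off the first row.

```python
import numpy as np

gamma = np.array([0, 1])
mu = {0: np.array([[1, 0], [0, 1]]),   # transition matrix for symbol 0
      1: np.array([[1, 1], [0, 1]])}   # transition matrix for symbol 1

def automaton_output(x):
    """First coordinate of (mu_{x_n} ... mu_{x_1}) @ gamma, per Definition 2.1."""
    prod = np.eye(2, dtype=int)
    for sym in reversed(x):
        prod = prod @ mu[sym]
    return int((prod @ gamma)[0])

print(automaton_output([1, 0, 1, 1]))  # 3
```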
     Next we define the Hankel matrix of a function:

Definition 2.2. Let Σ be an alphabet and f : Σ^n → F. Fix an ordering of all strings in Σ^{≤n}. We construct
a matrix H whose rows and columns are indexed by strings in Σ^{≤n} in the following way. For x ∈ Σ^d and
y ∈ Σ^{n−d}, for some 0 ≤ d ≤ n, let the (x, y) entry of H be equal to f (x ◦ y). For any other pair of strings
(x, y) such that |x| + |y| ≠ n let H_{x,y} = 0. The resulting matrix H is called the Hankel matrix of f for
strings of length n. We define H_k to be the k-th “block” of H, i. e. H_k is the submatrix defined by all rows
of H indexed by strings of length exactly k and all columns of H indexed by strings of length exactly
n − k.

    The following key fact relates the rank of the Hankel matrix of a function f for strings of length n
with the size of the smallest multiplicity automaton computing f on inputs of length n:

Theorem 2.3 ([13, 14, 4]). Let f : Σ^n → F. Then the rank of the Hankel matrix of f (over F) is equal to
the size of the smallest multiplicity automaton computing f on inputs of length n.
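Theorem 2.3 can be checked numerically on a tiny example (the function below is our choice): build the Hankel matrix of f(x_1, x_2) = x_1 · x_2 for strings of length n = 2 over Σ = {0, 1} and compute its rank. The three nonzero blocks H_0, H_1, H_2 each contribute rank 1, so the smallest multiplicity automaton for f has size 3.

```python
import itertools
import numpy as np

n = 2
def f(word):
    return word[0] * word[1]           # f(x1, x2) = x1 AND x2

strings = [w for k in range(n + 1)
           for w in itertools.product((0, 1), repeat=k)]

H = np.zeros((len(strings), len(strings)))
for i, x in enumerate(strings):
    for j, y in enumerate(strings):
        if len(x) + len(y) == n:       # entries with |x| + |y| != n stay 0
            H[i, j] = f(x + y)

print(int(np.linalg.matrix_rank(H)))  # 3
```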

    Previous learning results have computed the rank of the Hankel matrices of particular polynomials
yielding a bound on the size of their multiplicity automata. In fact, Beimel et al. [4], improving on [27],
learn functions computed by multiplicity automata by iteratively learning their corresponding Hankel
matrices:

Theorem 2.4 ([4]). For f : Σ^n → F, let r be the rank of the Hankel matrix of f for strings of length
n. Then there exists an exact learning algorithm for f running in time polynomial in n, r, and |Σ|.
Furthermore the final hypothesis output by the learning algorithm is a multiplicity automaton of size r
over the alphabet Σ. Moreover, if for every variable x_i the degree of f as a polynomial in x_i (over F) is at
most d, then the running time of the learning algorithm is polynomial in n, r, and d.

    Our main technical contribution is to show that the rank of a function’s Hankel matrix is bounded by
(and in most cases equal to) the dimension of the vector space of a function’s partial derivatives. Thus
we reduce the problem of learning a polynomial to bounding the dimension of the vector space of its
partial derivatives.


2.3   Set-multilinear polynomials
In this paper we will work primarily with polynomials that respect a fixed partition of the input variables:

   4 We denote µ_{x_n} · µ_{x_{n−1}} · . . . · µ_{x_1} with ∏_{i=1}^n µ_{x_i}.




Definition 2.5. Let X = ∪̇_{i=1}^d X_i be a partition of the variables into d sets. A polynomial over the variables
X is called set-multilinear if every monomial m is of the form y_1 · y_2 · · · y_d where each y_i is some variable
from X_i. Thus, any set-multilinear f is also homogeneous and multilinear of degree d.
    We will sometimes use the notation f (X_1, . . . , X_d) to denote that f is set-multilinear with respect to
the partition X = ∪̇_{i=1}^d X_i.
Example 2.6. Let X = (x_{i,j})_{1≤i,j≤d} be a d × d matrix. Let X_i = {x_{i,1}, . . . , x_{i,d}} be the i-th row of X.
Clearly X = ∪̇_{i=1}^d X_i. Note that both the determinant and the permanent of X are set-multilinear
polynomials with respect to this partition.
    Another example is the class of depth-3 set-multilinear circuits, first defined by Nisan and
Wigderson [26], which compute only set-multilinear polynomials.⁵ To see this note that any polynomial
computed by a depth-3 set-multilinear circuit is of the form p = ∑_{i=1}^n ∏_{j=1}^m L_{i,j}(X_j) where each L_{i,j} is a
linear form and the X_j’s are a partition of the input variables. In later sections we will show that certain
algebraic branching programs also compute set-multilinear polynomials and will therefore be amenable
to our learning techniques.
    Another notation that we use is the following:
Definition 2.7. Let X = ∪̇_{i=1}^d X_i. For any 1 ≤ k ≤ d define

                                     SM[X_1, ..., X_k] = { M | M = ∏_{i=1}^k x_i , x_i ∈ X_i } .

      Thus SM[X1 , ..., Xk ] is the set of all set-multilinear monomials of degree k.
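For instance, for the row partition of Example 2.6, SM[X_1, ..., X_k] can be enumerated directly (the variable names below are our own encoding):

```python
from itertools import product

d, k = 3, 2
# X[i] is the set X_{i+1}: row i+1 of a d x d matrix of variables.
X = [[f"x{i},{j}" for j in range(1, d + 1)] for i in range(1, d + 1)]

# One variable from each of X1, ..., Xk: exactly the monomials of SM[X1, ..., Xk].
SM = list(product(*X[:k]))
print(len(SM))   # d**k = 9
```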

2.4     Partial derivatives
In this subsection we introduce some notation for computing partial derivatives.
Definition 2.8. Let M[x1 , . . . , xn ] be the set of monomials in the variables x1 , . . . , xn . Let Md [x1 , . . . , xn ]
be the set of monomials of degree at most d in x1 , . . . , xn .
Example 2.9. M_2[x_1, x_2] = {1, x_1, x_2, x_1^2, x_1 · x_2, x_2^2}.
Definition 2.10. Let d = ∑_{i=1}^k d_i. For a function f (x_1, ..., x_n) and a monomial M = ∏_{i=1}^k x_i^{d_i} let

                                         ∂f/∂M = (1/M!) · ∂^d f / (∏_{i=1}^k (∂x_i)^{d_i}) ,

where M! = ∏_{i=1}^k (d_i!).
    Recall that in the case that F is finite we only consider polynomials in which the degree of each variable
is smaller than the characteristic of F. In particular we will only consider partial derivatives with respect
to monomials in which each variable has degree smaller than char(F).
   5 In the original paper [26] these circuits are called multilinear circuits, but in recent works [28, 29, 30] they are referred to

as set-multilinear circuits.



Example 2.11. Let f (x_1, x_2, x_3) = x_1^2 x_2 + x_3 and M(x_1, x_2, x_3) = x_1 x_2. We have that

                                            ∂f/∂M = 2x_1 ,    M! = 1 .
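With the 1/M! normalization of Definition 2.10, each term's coefficient is simply multiplied by a product of binomial coefficients, since a!/(a − e)! · 1/e! = C(a, e). A small sketch on exponent-dict polynomials (our representation, not the paper's) reproduces Example 2.11:

```python
from math import comb

def partial(f, M):
    """Normalized partial derivative (1/M!) * d^|M| f / dM (Definition 2.10)."""
    g = {}
    for exps, c in f.items():
        if all(a >= e for a, e in zip(exps, M)):
            mult = 1
            for a, e in zip(exps, M):
                mult *= comb(a, e)      # a!/(a-e)! divided by e!
            new = tuple(a - e for a, e in zip(exps, M))
            g[new] = g.get(new, 0) + c * mult
    return {e: c for e, c in g.items() if c != 0}

# Example 2.11: f = x1^2*x2 + x3 and M = x1*x2 give df/dM = 2*x1.
f = {(2, 1, 0): 1, (0, 0, 1): 1}
M = (1, 1, 0)
print(partial(f, M))   # {(1, 0, 0): 2}
```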

Definition 2.12. For a function f (x_1, . . . , x_n) and k ≤ n let

                                ∂_k( f ) = { ∂f/∂M | M ∈ M[x_1, ..., x_k] } .

Also define
                                        rank_k( f ) = dim (span (∂_k( f ))) .

    Note that in ∂_k( f ) we only consider partial derivatives with respect to the first k variables.

Example 2.13. Let X = (x_{i,j})_{1≤i,j≤3} be a 3 × 3 matrix. Let f (X) = Det(X) (the determinant of X).
Order the variables so that x_{i,j} < x_{i′,j′} if i < i′, or i = i′ and j < j′. Then

      ∂_6( f ) = {x_{2,2}x_{3,3} − x_{2,3}x_{3,2}, x_{2,1}x_{3,3} − x_{2,3}x_{3,1}, x_{2,1}x_{3,2} − x_{2,2}x_{3,1}, x_{3,1}, x_{3,2}, x_{3,3}} .

Thus, rank_6( f ) = 6.

    For set-multilinear polynomials we need a slightly different definition (although we use the same
notation).
Definition 2.14. Let X = ∪̇_{i=1}^d X_i. For a set-multilinear polynomial f (X_1, . . . , X_d) and k ≤ d let

                     ∂_k( f ) = { ∂f/∂M | M ∈ SM[X_1, ..., X_i] for some 1 ≤ i ≤ k } .

We define s-rank_k( f ) = dim (span (∂_k( f ))).

    Note that in particular we only consider partial derivatives with respect to monomials of the form
∏_{i=1}^{k′} x_i where x_i ∈ X_i and k′ ≤ k. We will never consider partial derivatives with respect to the monomial
x_1 · x_3 (again, x_i ∈ X_i).

Example 2.15. Let X be a 3 × 3 matrix (as in Example 2.6 with d = 3). Let f = Det(X) =
Det(X_1, X_2, X_3) be the determinant of X, where X_1 = {x_{1,1}, x_{1,2}, x_{1,3}}, X_2 = {x_{2,1}, x_{2,2}, x_{2,3}}, and
X_3 = {x_{3,1}, x_{3,2}, x_{3,3}}. Then ∂_2( f ) = {x_{3,1}, x_{3,2}, x_{3,3}}. Thus, s-rank_2( f ) = 3.

   Note the difference from Example 2.13 where we ignored the fact that the determinant is a set-
multilinear polynomial.


3    Characterizing learnability via partial derivatives
In this section, we present our main criterion for establishing the learnability of both arithmetic circuits
and algebraic branching programs. We prove that any polynomial whose partial derivatives span a low
dimensional vector space has a low rank Hankel matrix. To relate the rank of the Hankel matrix of a
polynomial f to its partial derivatives we will need the following multivariate version of Taylor’s theorem:
Fact 3.1. Let X = {x_1, . . . , x_n} and let f (X) be a degree d polynomial. Let ρ = (ρ_1 ◦ ρ_2) be an assignment
to the variables, where ρ_1 is an assignment to the first k variables and ρ_2 is an assignment to the last n − k
variables. For a monomial M define M(ρ) to be the value of M on the assignment ρ. Then

                  f (ρ) = f (ρ_1 ◦ ρ_2) =  ∑_{M ∈ M_d[x_1,...,x_k]}  M(ρ_1) · ∂f/∂M (0⃗ ◦ ρ_2) .

Proof. Because of the linearity of the partial derivative operator it is enough to prove the claim for
the case that f is a monomial. Let f (x_1, . . . , x_n) = ∏_{i=1}^n x_i^{d_i}, where ∑ d_i ≤ d. Consider a monomial
M ∈ M_d[x_1, . . . , x_k] given by M = ∏_{i=1}^k x_i^{e_i}, where ∑ e_i ≤ d. Notice that if there is some 1 ≤ i ≤ k with
e_i > d_i then ∂f/∂M = 0. Also notice that if for some 1 ≤ i ≤ k we have that e_i < d_i then ∂f/∂M (0⃗ ◦ ρ_2) = 0,
because the assignment to x_i is zero. In particular the only contribution to the sum will come from the
partial derivative with respect to M_0 = ∏_{i=1}^k x_i^{d_i}, which gives ∂f/∂M_0 = ∏_{i=k+1}^n x_i^{d_i}. In particular

   ∑_{M ∈ M_d[x_1,...,x_k]} M(ρ_1) · ∂f/∂M (0⃗ ◦ ρ_2) = M_0(ρ_1) · ∂f/∂M_0 (0⃗ ◦ ρ_2) = ∏_{i=1}^k x_i^{d_i}(ρ_1) · ∏_{i=k+1}^n x_i^{d_i}(ρ_2) = f (ρ) .
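Fact 3.1 is easy to sanity-check numerically. The sketch below (our own, with an arbitrary small polynomial) sums M(ρ_1) · (∂f/∂M)(0⃗ ◦ ρ_2) over all monomials M in the first k variables, using the normalized derivative of Definition 2.10, and compares against a direct evaluation of f.

```python
from math import comb
from itertools import product

def partial(f, M):
    """Normalized partial derivative (1/M!) * d^|M| f / dM (Definition 2.10)."""
    g = {}
    for exps, c in f.items():
        if all(a >= e for a, e in zip(exps, M)):
            mult = 1
            for a, e in zip(exps, M):
                mult *= comb(a, e)     # a!/(a-e)! divided by e!
            new = tuple(a - e for a, e in zip(exps, M))
            g[new] = g.get(new, 0) + c * mult
    return g

def power(point, exps):
    v = 1
    for p, e in zip(point, exps):
        v *= p ** e
    return v

def evaluate(f, point):
    return sum(c * power(point, exps) for exps, c in f.items())

# f = x1^2*x2 + 3*x1*x3 + x3^2: degree d = 3 in n = 3 variables; split at k = 2.
f = {(2, 1, 0): 1, (1, 0, 1): 3, (0, 0, 2): 1}
n, d, k = 3, 3, 2
rho = (2, 5, 7)
rho1, rho2 = rho[:k], rho[k:]

rhs = 0
for e in product(range(d + 1), repeat=k):      # all monomials M in x1, ..., xk
    if sum(e) <= d:
        M = e + (0,) * (n - k)
        rhs += power(rho1, e) * evaluate(partial(f, M), (0,) * k + rho2)

print(evaluate(f, rho), rhs)   # 111 111
```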




Now we can state the main technical theorem of the paper:
Theorem 3.2. Let f (x_1, . . . , x_n) be a degree d polynomial. Then for every k ≤ n,

                                          rank(H_k( f )) ≤ rank_k( f ) .

If f is multilinear then equality holds.
Proof. We will define two matrices V_{d,k} and E_k such that rank(E_k) ≤ rank_k( f ) and H_k = V_{d,k} · E_k.

Construction of E_k (Evaluation Matrix): We index the rows of E_k by the set of monomials M_d[x_1, ..., x_k]
(in lexicographical order) and the columns by elements of F^{n−k} (in lexicographical order). The (M, ρ)
entry of E_k is equal to
                                           (E_k)_{M,ρ} = ∂f/∂M (0⃗ ◦ ρ) ,
where 0⃗ is a length k vector and ρ is in F^{n−k}. This is equal to the value of the partial derivative of f with
respect to M at the point 0⃗ ◦ ρ. When k = 0, the matrix has only one row (the partial derivative of order
zero is the polynomial itself), in which the ρ-th position is equal to f (ρ). The following is a standard
fact from linear algebra:


Claim 3.3. rank(E_k) ≤ rank_k( f ), and equality holds if f is multilinear.

Proof. Row M of E_k is the evaluation of ∂f/∂M on all inputs of the form 0⃗ ◦ ρ, where 0⃗ is a length k vector
and ρ is of length n − k. Hence each row corresponds to part of the “truth table” of a particular
partial derivative of f in which the assignment to the first k variables is zero. Clearly if a set of partial
derivatives is linearly dependent then so are the corresponding rows. Thus rank(E_k) ≤ rank_k( f ). When
f is multilinear, all of the variables of M disappear from the resulting polynomial, and we actually get
that the rows of E_k represent the entire truth table of the corresponding partial derivatives of f and hence
rank(E_k) = rank_k( f ).
Construction of Vd,k (Generalized Vandermonde Matrix): The rows of Vd,k are indexed by elements of
Fk (in lexicographical order) and the columns are indexed by the set of monomials Md [x1 , ..., xk ] (again
in lexicographical order). The (ρ, M) entry of Vd,k is equal to M(ρ). When k = 0 the matrix contains
only one column, whose entries are equal to 1. We note that the column rank of Vk is full (similarly to
the usual Vandermonde matrix).
    Consider the matrix product V_{d,k} · E_k. Notice that its (ρ_1, ρ_2) entry is equal to

                  (V_{d,k} · E_k)_{ρ_1 ◦ ρ_2} = ∑_{M ∈ M_d[x_1,...,x_k]} M(ρ_1) · (∂f/∂M)(~0 ◦ ρ_2) ,

which by Fact 3.1 equals f(ρ_1 ◦ ρ_2). Thus V_{d,k} · E_k = H_k. In particular rank(H_k) ≤ rank(E_k) ≤ rank_k(f).
When f is multilinear we have, as before, rank(E_k) = rank_k(f), and as V_{d,k} has full column rank
it follows that rank(H_k) = rank(E_k) = rank_k(f).
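The identity H_k = V_{d,k} · E_k can be checked mechanically on a toy instance. The sketch below is an illustration only (not part of the paper): it fixes a small multilinear f, takes M_d[x_1, x_2] to be the multilinear monomials in x_1, x_2, and restricts ρ_1 and ρ_2 to {0,1}-vectors so that all three matrices are finite — assumptions made solely for this example.

```python
# Sketch: verify V_{d,k} * E_k = H_k for a small multilinear polynomial.
# rho_1, rho_2 are restricted to {0,1}-vectors and M_d[x1,x2] is taken
# to be the multilinear monomials in x1, x2 (illustrative assumptions).
from itertools import product
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = x1*x2 + x1*x3 + x3            # multilinear, n = 3, first k = 2 variables

monoms = [[], [x1], [x2], [x1, x2]]        # stand-in for M_d[x1, x2]
rho1s = list(product((0, 1), repeat=2))     # row indices of V and H_k
rho2s = [(0,), (1,)]                        # column indices of E_k and H_k

# V: the (rho_1, M) entry is M(rho_1)
V = sp.Matrix([[sp.Mul(*[{x1: r[0], x2: r[1]}[v] for v in M]) if M else 1
                for M in monoms] for r in rho1s])
# E_k: the (M, rho_2) entry is (df/dM)(0, 0, rho_2)
E = sp.Matrix([[(sp.diff(f, *M) if M else f).subs({x1: 0, x2: 0, x3: r[0]})
                for r in rho2s] for M in monoms])
# H_k: the (rho_1, rho_2) entry is f(rho_1 ∘ rho_2)
H = sp.Matrix([[f.subs({x1: r1[0], x2: r1[1], x3: r2[0]})
                for r2 in rho2s] for r1 in rho1s])

assert V * E == H
print(H)
```

The assertion passing is exactly the statement "V_{d,k} · E_k = H_k" for this instance.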
    By summing over all values of k we obtain

Corollary 3.4. Let f(x_1, ..., x_n) be a polynomial. Then

                                          dim(H(f)) ≤ ∑_{k=0}^{n} rank_k(f) .

If f is multilinear then

                                          dim(H(f)) = ∑_{k=0}^{n} rank_k(f) .
    Now we consider set-multilinear polynomials. We must be careful here to take into account partial
derivatives with respect to monomials M that are not in SM[X_1, ..., X_i] for any i. Below, we show that
rows in E_k corresponding to such M's are zero.
    Let f(X_1, ..., X_d) be a set-multilinear polynomial with respect to X = ∪̇ X_i. We order the variables of
X as follows: first we set X_1 < X_2 < ... < X_d, then we order the variables in each X_i in some linear order.
Consider the (M, ρ) entry of E_k. Notice that if M ∉ ∪_{i=1}^{d} SM[X_1, ..., X_i] then (∂f/∂M)(~0 ◦ ρ) = 0. Indeed,
assume that the first k variables cover the sets X_1, ..., X_i, as well as some of the variables in the set X_{i+1}.
Since we substitute 0 for the first k variables, M must contain a variable from each of X_1, ..., X_i
(otherwise, because f is set-multilinear, the entire M-th row of E_k is zero). We also note that M cannot
contain two variables from the same set X_j (as again this would imply that the M-th row is zero). In particular,
for the M-th row to be nonzero we must have M ∈ SM[X_1, ..., X_i] for that i. As a corollary
we get


Corollary 3.5. Let f be a set-multilinear polynomial with an ordering of the variables as above. For
each 1 ≤ k ≤ n let i_k be defined by

                                    |X_1 ∪ ... ∪ X_{i_k}| ≤ k < |X_1 ∪ ... ∪ X_{i_k+1}| .

Then rank(E_k) ≤ s-rank_{i_k}(f).

    This corollary implies the following version of Corollary 3.4 for set-multilinear polynomials.

Theorem 3.6. Let X = ∪̇_{i=1}^{d} X_i, with |X| = n. Let f(X_1, ..., X_d) be a set-multilinear polynomial. Then

                                         dim(H(f)) ≤ n ∑_{k=0}^{d} s-rank_k(f) .

Proof. According to Theorem 3.2 we have that dim(H(f)) = ∑_{k=0}^{n} rank(E_k). By Corollary 3.5 we get
that rank(E_k) ≤ s-rank_{i_k}(f), where i_k is such that

                                    |X_1 ∪ ... ∪ X_{i_k}| ≤ k < |X_1 ∪ ... ∪ X_{i_k+1}| .

In particular we get that

                  dim(H(f)) = ∑_{k=0}^{n} rank(E_k) ≤(∗) ∑_{i=0}^{d} |X_{i+1}| · s-rank_i(f) ≤ n ∑_{i=0}^{d} s-rank_i(f) ,

where inequality (∗) follows from the observation that for every k such that |X_1 ∪ ... ∪ X_i| ≤ k < |X_1 ∪ ... ∪ X_{i+1}|
it holds that i_k = i, and so there are |X_{i+1}| such k's.


4    Learning depth-3 arithmetic circuits

In this section we learn depth-3 arithmetic circuits. The results that we obtain also follow from the works
of [6, 4]; however, we reprove them in order to demonstrate the usefulness of our techniques. We begin
by defining the model:

Definition 4.1. A depth-3 arithmetic circuit is a layered graph with 3 levels and unbounded fan-in. At
the top we have either a sum gate or a product gate. A depth-3 arithmetic circuit C with a sum (product)
gate at the top is called a ΣΠΣ (ΠΣΠ) circuit and has the following structure:

                                                C = ∑_{i=1}^{m} ∏_{j=1}^{d} L_{i,j}(X)

where each L_{i,j} is a linear function in the input variables X = x_1, ..., x_n and m is the number of multipli-
cation gates. The size of the circuit is the number of gates, in this case O(md).
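As a concrete illustration (not part of the paper), a ΣΠΣ circuit can be stored as an array of linear-form coefficients and evaluated directly from the expression above; the coefficient layout chosen here is an assumption of this sketch.

```python
# Sketch: a ΣΠΣ circuit as an m x d array of linear forms, evaluated as
# C(x) = sum_i prod_j (A[i,j] . x + b[i,j]). The layout is illustrative.
import numpy as np

def eval_sps(A, b, x):
    """A: (m, d, n) coefficients of the linear forms L_{i,j};
    b: (m, d) free terms; x: (n,) input point."""
    return float(np.prod(A @ x + b, axis=1).sum())

# Two product gates of degree 2 over n = 2 variables:
# C = (x0 + x1)(x0 - x1) + (2*x0)(x1 + 1)
A = np.array([[[1, 1], [1, -1]],
              [[2, 0], [0, 1]]], dtype=float)
b = np.array([[0.0, 0.0], [0.0, 1.0]])
x = np.array([3.0, 2.0])
print(eval_sps(A, b, x))   # (5)(1) + (6)(3) = 23.0
```

Note how m and d bound the size exactly as in the definition: the array has m·d linear forms.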


    A ΣΠΣ circuit is a homogeneous circuit if all the linear forms are homogeneous (i. e. the free term
is zero) and all the product gates have the same fan-in (or degree). In other words, every gate of the
circuit computes a homogeneous polynomial. We will also be interested in set-multilinear depth-3
circuits. To define this sub-model we need to impose a partition on the variables:

Definition 4.2. A ΣΠΣ circuit is called set-multilinear with respect to X = ∪̇_{i=1}^{d} X_i if every linear function
computed at the bottom is a homogeneous linear form in one of the sets X_i, and each multiplication gate
multiplies d homogeneous linear forms L_1, ..., L_d where every L_i is over a distinct set of variables X_i.
That is to say, a depth-3 set-multilinear circuit C has the following structure:

                                              C = ∑_{i=1}^{m} ∏_{j=1}^{d} L_{i,j}(X_j)

where each L_{i,j} is a homogeneous linear form.

    We now give an algorithm for learning set-multilinear depth-3 circuits. The algorithm is based on
the following lemma that characterizes the dimension of a set-multilinear circuit’s partial derivatives:

Lemma 4.3. If a polynomial f is computed by a set-multilinear depth-3 circuit with m product gates,
then for every 1 ≤ k ≤ d,

                                          s-rank_k(f) ≤ km .

Proof. First notice that for every product gate

                                               P = ∏_{i=1}^{d} L_i(X_i)

we have s-rank_k(P) ≤ k. Indeed, let 1 ≤ r ≤ k. Then for any monomial M ∈ SM[X_1, ..., X_r] we have that

                                          ∂P/∂M = α_M · ∏_{i=r+1}^{d} L_i(X_i)

for some constant α_M depending on M and P. Thus, as we vary over all r between 1 and k, we obtain at
most k distinct partial derivatives up to scalar multiples. The proof of the lemma now follows from the
linearity of the partial derivative operator.
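The argument can be checked on a toy gate. The sketch below (an illustration with a made-up product gate P, d = 3 and k = 2, not the paper's code) computes the derivatives with respect to monomials in SM[X_1, ..., X_r] for r ≤ k and confirms that their span has dimension at most k.

```python
# Sketch: derivatives of one set-multilinear product gate P with respect
# to monomials in SM[X1,...,Xr], r <= k, span a space of dimension <= k.
# Here d = 3, k = 2; the gate P is a made-up example.
from itertools import product
import sympy as sp

x1, x2, y1, y2, z1, z2 = sp.symbols('x1 x2 y1 y2 z1 z2')
X = [[x1, x2], [y1, y2], [z1, z2]]
P = (x1 + 2*x2) * (y1 - y2) * (3*z1 + z2)

def span_dim(polys, vars_):
    # dimension of span(polys) = rank of their coefficient matrix
    dicts = [sp.Poly(p, *vars_).as_dict() for p in polys]
    monoms = sorted({m for d in dicts for m in d})
    return sp.Matrix([[d.get(m, 0) for m in monoms] for d in dicts]).rank()

k = 2
derivs = [sp.diff(P, *combo)
          for r in range(1, k + 1) for combo in product(*X[:r])]
allv = [v for grp in X for v in grp]
print(span_dim(derivs, allv))   # at most k = 2
```

For each r, all derivatives are scalar multiples of ∏_{i>r} L_i, so each r contributes at most one dimension.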

Applying Lemma 4.3, Theorem 3.6, and Theorem 2.4 we obtain the following learning result:

Theorem 4.4. Let C be a set-multilinear depth-3 circuit with m product gates over n variables with
coefficients from a field F. Then C is learnable in time polynomial in m and n.

We note that this result also follows immediately from the results of [6, 4].


4.1   Learning general depth 3 circuits
We now give our learning algorithm for general depth-3 arithmetic circuits. Unlike the algorithm in the
set-multilinear case, this algorithm runs in time exponential in the degree of the circuit (and polynomial
in the other parameters). Thus we can learn in subexponential time any depth-3 circuit of sublinear
degree. The running time of the algorithm is determined by the following lemma:
Lemma 4.5. Let f : F^n → F be a polynomial over a variable set X of size n computed by a depth-3
circuit with m product gates, each of degree at most d. Then for every 1 ≤ k ≤ d,

                                        rank_k(f) ≤ m · ∑_{i=1}^{k} (d choose i) .

Proof. The proof is similar to the case of set-multilinear depth-3 circuits. Notice that for every product
gate

                                                P = ∏_{i=1}^{d} L_i(X)

we have rank_k(P) ≤ ∑_{i=1}^{k} (d choose i). Indeed, for any monomial M of degree r we have that

                           ∂P/∂M ∈ span{ ∏_{i∈T} L_i(X) : T ⊂ [d], |T| = d − r } .

Since there are at most m product gates we obtain the claimed bound.

Applying the above lemma with Theorem 3.2 and Theorem 2.4 we get the following learning result (that
was also obtained in [6]):
Theorem 4.6. Let f : Fn → F be computed by a depth-3 arithmetic circuit with m product gates each of
fan in at most d. Then f is learnable in time polynomial in n, 2d , and m.

4.2   Discussion

The fact that the rank of f is bounded by the number of product gates is unique to set-multilinear
depth-3 circuits. For example, consider the following depth-2 ΠΣ circuit:

                                       f(z, x_1, ..., x_n) = ∏_{i=1}^{n} (z + x_i) .

For every ordering of the variables, the dimension of the span of the partial derivatives of f (and hence
the rank of the Hankel matrix of f) is exponential in n; this follows from the observation that the
coefficient of z^d is the elementary symmetric polynomial of degree n − d, whose partial derivatives span
a space of dimension 2^{Ω(n−d)} (see [33]). Thus it is no surprise that Beimel et al. [4] only considered
depth-2 ΠΣ circuits where the product gate at the root has fan-in at most O(log n); larger fan-in would
correspond to Hankel matrices of superpolynomial dimension and thus would not be learnable by
multiplicity automata techniques.
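For small n this blow-up is easy to observe directly. The sketch below (illustrative code, not from the paper) computes the dimension of the span of the partial derivatives of ∏(z + x_i) with respect to all multilinear monomials in the x-variables; each derivative is a product ∏_{i∈T}(z + x_i) over a distinct subset T, and these are linearly independent.

```python
# Sketch: the partial derivatives of f = prod(z + x_i) with respect to
# multilinear monomials in the x-variables span a space of dimension 2^n.
import sympy as sp
from itertools import combinations

def deriv_span_dim(n):
    z = sp.Symbol('z')
    xs = sp.symbols(f'x1:{n+1}')
    f = sp.Mul(*[z + x for x in xs])
    derivs = [sp.diff(f, *S) if S else f
              for r in range(n + 1) for S in combinations(xs, r)]
    dicts = [sp.Poly(p, z, *xs).as_dict() for p in derivs]
    monoms = sorted({m for d in dicts for m in d})
    return sp.Matrix([[d.get(m, 0) for m in monoms] for d in dicts]).rank()

for n in (2, 3, 4):
    print(n, deriv_span_dim(n))   # 4, 8, 16: dimension 2^n
```

Independence follows from a triangularity argument: each derivative ∏_{i∈T}(z + x_i) contains the multilinear monomial ∏_{i∈T} x_i, which appears in no derivative for a strictly smaller T.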


    To show the limits of current learning techniques we point out that the following homogeneous
depth-3 arithmetic circuit

                                       C_0 = ∏_{i=1}^{n} (z + x_i) + ∏_{i=1}^{n} (v + u_i)

is irreducible and has exponentially many linearly independent partial derivatives. As its degree is
n, we can only learn it in time exponential in n. We leave open the problem of learning homogeneous
depth-3 arithmetic circuits (as well as the more difficult problem of learning general depth-3 arithmetic
circuits) of superlogarithmic degree.


5    Learning classes of algebraic branching programs
Algebraic and Boolean branching programs have been intensely studied by complexity theorists and
have been particularly fruitful for proving lower bounds. Considerably less is known in the learning
scenario — Bshouty et al. [12] and Bergadano et al. [5] have shown some partial progress for learning
restricted width Boolean branching programs. In this section we will show how to learn any polynomial
size algebraic branching program that is both read once and oblivious. As such we will be able to
show that multiplicity automata are essentially equivalent to read once, oblivious algebraic branching
programs, a characterization that may be of independent interest. We begin with a general definition of
algebraic branching programs:

Definition 5.1. An algebraic branching program (ABP), first defined by Nisan [25], is a directed acyclic
graph with one vertex of in-degree zero, called the source, and one vertex of out-degree zero, called
the sink. The vertices of the graph are partitioned into levels numbered 0, ..., d. Edges are labeled
with homogeneous linear forms in the input variables and may only connect vertices of level i to
vertices of level i + 1. The source is the only vertex at level 0 and the sink is the only vertex at level d.
Finally, the size of the ABP is the number of vertices in the graph.

    The polynomial computed by an ABP is the sum, over all directed paths from the source to the sink,
of the products of the linear functions labeling the edges of the path. It is clear that an ABP with
d + 1 levels computes a homogeneous polynomial of degree d.
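The sum-over-paths semantics has a convenient matrix form: if the edges between consecutive levels are arranged in a matrix of linear-form values, the product of these matrices accumulates exactly the sum over all source-to-sink paths. A numerical sketch under that encoding (the encoding itself is an assumption of this example, not the paper's notation):

```python
# Sketch: evaluate an ABP by replacing each pair of adjacent levels with
# a matrix whose (u, v) entry is the value of the linear form on edge
# u -> v (0 if absent); the matrix product sums over all paths.
import numpy as np

def eval_abp(layers, x):
    """layers[i]: (w_i, w_{i+1}, n) coefficients of the homogeneous
    linear forms on edges between level i and level i + 1; x: (n,)."""
    v = np.ones((1, 1))
    for L in layers:
        v = v @ (L @ x)
    return float(v[0, 0])

# Source -> two middle vertices -> sink, computing x0*x2 + x1*x3:
layers = [np.array([[[1, 0, 0, 0], [0, 1, 0, 0]]], dtype=float),
          np.array([[[0, 0, 1, 0]], [[0, 0, 0, 1]]], dtype=float)]
print(eval_abp(layers, np.array([1.0, 2.0, 3.0, 4.0])))   # 1*3 + 2*4 = 11.0
```

With d + 1 levels each matrix entry is homogeneous linear, so the product is homogeneous of degree d, matching the remark above.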

   In this section we will show how to learn a natural restriction of an algebraic branching program as
mentioned above: the read once, oblivious algebraic branching program or ROAB.
Definition 5.2. Let X = ∪̇_{i=1}^{d} X_i be a partition of the input variables into d disjoint sets. An ABP is
oblivious if at every level i only one set of variables X_j appears. An ABP is a ROAB (a read-once,
oblivious algebraic branching program) if it is oblivious and every set of variables X_j appears in at
most one level.

    We are interested in learning ROABs with respect to the partition X = ∪̇_{i=1}^{d} X_i in which the vari-
ables in X_i appear on edges from level i to level i + 1. In this section we measure the complexity of a
polynomial in terms of its smallest ROAB:


Definition 5.3. For a polynomial f we define B(f) to be the size of the smallest ABP computing f. For a
set-multilinear polynomial f we define OB(f) to be the size of the smallest ROAB computing f.

    The main theorem of this section shows that, for a set-multilinear polynomial, the size of its smallest
ROAB equals the dimension of the vector space induced by its partial derivatives:

Theorem 5.4. For a set-multilinear polynomial f(X_1, ..., X_d) we have that

                                          ∑_{k=1}^{d} s-rank_k(f) = OB(f) .

To prove Theorem 5.4 we will need the following theorem, which is implicit in Nisan [25]:

Theorem 5.5 ([25]). Let f(X_1, ..., X_d) be a set-multilinear polynomial. For each 0 ≤ k ≤ d define a
matrix M_k(f) as follows:

    • Each row is labeled with a monomial M_1 ∈ SM[X_1, ..., X_k].

    • Each column is labeled with a monomial M_2 ∈ SM[X_{k+1}, ..., X_d].

    If k = 0 then M_1 = 1, and if k = d then M_2 = 1. The (M_1, M_2) entry of M_k(f) is equal to the coefficient
of the monomial M_1 · M_2 in f. We have that

                                        OB(f) = ∑_{k=0}^{d} rank(M_k(f)) .

Proof of Theorem 5.4. We will show that rank(M_k(f)) = s-rank_k(f), which, combined with Theo-
rem 5.5, completes the proof. Consider a row of M_k(f) corresponding to some monomial M ∈
SM[X_1, ..., X_k]. Since f is a set-multilinear polynomial it follows that ∂f/∂M equals ∑_t α_t M_t, where
each α_t is an element of the field and M_t ∈ SM[X_{k+1}, ..., X_d] for all t. Notice, however, that row M
of M_k(f) is precisely the row vector (α_1, α_2, ...). Hence row M of M_k(f) equals the vector of co-
efficients of the partial derivative of f, viewed as a set-multilinear polynomial in X_{k+1}, ..., X_d. It is a
standard fact from linear algebra that the dimension of the vector space spanned by a set of polynomials
is equal to the rank of the matrix of their coefficients.
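Nisan's characterization can be verified on a toy polynomial. The sketch below (an illustration; the instantiation of SM[...] is specific to this made-up example) computes ∑_k rank(M_k(f)) for f = x_1 y_1 + x_2 y_2 and recovers 4, the size of the obvious ROAB with a source, two middle vertices, and a sink.

```python
# Sketch: compute the sum of ranks of Nisan's matrices M_k(f) for a
# small set-multilinear f, as in Theorem 5.5. Example instance only.
import sympy as sp
from itertools import product

def nisan_rank_sum(f, X):
    allv = [v for grp in X for v in grp]
    poly = sp.Poly(sp.expand(f), *allv)
    total = 0
    for k in range(len(X) + 1):
        # rows: monomials in SM[X_1..X_k]; columns: SM[X_{k+1}..X_d];
        # the empty product (k = 0 or k = d) is the monomial 1
        rows = [sp.Mul(*c) for c in product(*X[:k])]
        cols = [sp.Mul(*c) for c in product(*X[k:])]
        Mk = sp.Matrix([[poly.coeff_monomial(r*c) for c in cols]
                        for r in rows])
        total += Mk.rank()
    return total

x1, x2, y1, y2 = sp.symbols('x1 x2 y1 y2')
f = x1*y1 + x2*y2
print(nisan_rank_sum(f, [[x1, x2], [y1, y2]]))  # 1 + 2 + 1 = 4 = OB(f)
```

Here M_1(f) is the 2×2 identity (rank 2), while M_0 and M_2 each have rank 1.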


    Combining Theorem 5.4 and Theorem 3.6 we see that any polynomial-size ROAB obeying the above
partition is computed by a polynomial-size multiplicity automaton. Applying the learning algorithm of
Beimel et al. [4] we obtain

Theorem 5.6. Let X = ∪̇_{i=1}^{d} X_i. Let f(X_1, ..., X_d) be a set-multilinear polynomial that is computed by a
ROAB of size m. Then f is learnable in time polynomial in m and |X|.

    Notice that ROABs can be thought of as the arithmetic generalization of OBDDs (Ordered Binary
Decision Diagrams, also known as oblivious read-once branching programs), a model for which
Bergadano et al. [5] gave a learning algorithm based on multiplicity automata.


5.1   Equivalence of ROABs and multiplicity automata
We can now prove that ROABs are essentially equivalent to multiplicity automata. Since our learning
algorithm outputs as a hypothesis a multiplicity automaton, Theorem 5.6 implies that every ROAB of
size m in n variables is computed by a multiplicity automaton of size polynomial in m and n. We cannot
show that every multiplicity automaton is computed by a ROAB, but we can show that every multiplicity
automaton is computed by a ROAB in which each edge may be labeled with a higher-degree polynomial.

Definition 5.7. Define a ROAB of degree d to be a ROAB where every edge is labeled with a polynomial
of degree d.

Lemma 5.8. Let f be any polynomial over n variables computed by an algebraic multiplicity automaton
of size r. Assume also that the degree of each variable in f is bounded by d. Then f can be computed by
a ROAB with n + 2 levels, of size nr + 2 and degree d.

Proof. Let S ⊆ Σ be a subset of the alphabet of size d + 1. Let f be computed by a multiplicity automaton
A of size r consisting of the set of matrices {μ_σ}_{σ∈Σ} and the vector ~γ ∈ F^r. Construct a matrix T where
the (i, j) entry of T is a degree-d univariate polynomial, T_{i,j}, interpolating the (i, j) entry of μ_σ for every
σ ∈ S. That is, T_{i,j}(σ) is the (i, j) entry of μ_σ (for σ ∈ S). Consider a ROAB with n + 2 levels, each
of size r, where every level 1 ≤ i ≤ n has a copy of the r states of A (in particular we enumerate the
vertices in each level with {1, ..., r}). Connect every vertex at level k, for 1 ≤ k ≤ n − 1, to every vertex
at level k + 1. For the j-th vertex in level k and the i-th vertex in level k + 1, we label edge (j, i) with
the polynomial in the (i, j) entry of T, with x_k as its variable (i. e. the label is T_{i,j}(x_k)). Connect every
vertex in level n to the sink and label edge (i, sink) with the polynomial T_{1,i}(x_n) (recall that the output
of a multiplicity automaton is the inner product of ~γ with the first row of the product of the μ_σ's). Also
connect the source to every one of the r vertices in the first level and label the edge to vertex i with
γ_i. It is clear that this ROAB computes a polynomial of degree at most d in each variable, and that for
every input from S^n the output of the ROAB agrees with f. Therefore, by the following version of the
Schwartz–Zippel lemma [32, 36] we get that this ROAB computes f, as required.
Lemma 5.9 ([32, 36]). Let f , g : Fn → F be two n-variate polynomials over F. Assume that the degree
of each variable in f and g is at most d. Let S ⊆ F be a set of size d + 1. If for every x ∈ Sn we have that
f (x) = g(x) then f = g.
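The interpolation step in the proof of Lemma 5.8 can be sketched concretely. In the code below the automaton matrices μ_σ are made-up stand-ins, and sympy's `interpolate` supplies the degree-d polynomials T_{i,j} with T_{i,j}(σ) equal to the (i, j) entry of μ_σ for σ ∈ S.

```python
# Sketch of the interpolation in Lemma 5.8: build a matrix T of
# degree-<=d univariate polynomials agreeing with mu_sigma on S.
# The matrices below are hypothetical, chosen only for illustration.
import sympy as sp

x = sp.Symbol('x')
S = [0, 1, 2]                           # |S| = d + 1 with d = 2
mu = {0: sp.Matrix([[1, 0], [2, 1]]),
      1: sp.Matrix([[0, 1], [3, 1]]),
      2: sp.Matrix([[1, 1], [5, 0]])}

T = sp.Matrix(2, 2, lambda i, j: sp.interpolate(
        [(s, mu[s][i, j]) for s in S], x))

assert all(T.subs(x, s) == mu[s] for s in S)
print(T[1, 0])   # degree-2 polynomial through (0,2), (1,3), (2,5)
```

Substituting x_k into T then yields the edge labels between levels k and k + 1.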




6     Learning noncommutative formulæ
In this section we show how to learn another type of arithmetic circuits: polynomial size noncommu-
tative formulae. A noncommutative formula is an arithmetic formula where multiplication does not
necessarily commute; i.e. different orderings of inputs to a product gate result in different outputs. Intu-
itively this restriction makes it difficult for a formula to use the power of cancellation. This may seem to
be a strange restriction, but it is very natural in the context of function computation where an ordering
is enforced on groups of variables. For example, the product of k matrices M_1, ..., M_k, where matrix
M_i uses variables from a set X_i, can be viewed as a set-multilinear noncommutative polynomial over an
ordering of the variables X = ∪̇ X_i (changing the order of the matrices results in a different output).
In addition, many of the known algorithms for computing polynomials are noncommutative by nature.
For example, the well-known algorithm for the above-mentioned iterated matrix multiplication can be
viewed as a noncommutative set-multilinear circuit. Similarly, Ryser's algorithm for computing the
permanent (see, e. g. [35]) can be viewed as a noncommutative set-multilinear formula.
    Nisan proved the first lower bounds for noncommutative formulae in [25]; here we will give the first
learning algorithm for set-multilinear polynomials computed by noncommutative formulae. Previously
only algorithms for learning read-once arithmetic formulae were known (see e. g. [18, 10, 9, 8]). We
begin with a general definition of arithmetic formulae:

Definition 6.1. An arithmetic formula is a tree whose edges are directed towards the root. The leaves
of the tree are labeled with input variables. Every inner vertex is labeled with one of the arithmetic
operations {+, ×}. Every edge is labeled with a constant from the field in which we are working. The
size of the formula is defined to be the number of vertices.

      An arithmetic formula computes a polynomial in the obvious manner. We now define non-
commutative formulae. Roughly, a formula is noncommutative if for any two input variables x_i and
x_j we have x_i x_j − x_j x_i ≠ 0. More formally, let F{x_1, ..., x_n} be the polynomial ring over the field F in the
noncommuting variables x_1, ..., x_n. That is, in F{x_1, ..., x_n} the formal expressions x_{i_1} · x_{i_2} · ... · x_{i_k} and
x_{j_1} · x_{j_2} · ... · x_{j_l} are equal if and only if k = l and i_m = j_m for all m (whereas in the commutative ring of
polynomials a monomial remains the same even if we permute its variables, e. g. x_1 · x_2 = x_2 · x_1).
A noncommutative arithmetic formula is an arithmetic formula in which multiplications are done in the
ring F{x_1, ..., x_n}. As two polynomials in this ring do not necessarily commute, we have to distinguish in
every multiplication gate between the left son and the right son. For a polynomial f, let F(f) be the size
of the smallest noncommutative formula computing f. When considering noncommutative formulae
we are interested in syntactic computations; e. g. given the polynomial x_1 · x_2 we want the formula to
output this exact polynomial and not the polynomial x_2 · x_1, even though they are semantically equal when
considering assignments from a field. In particular, the formula (x_1 − x_2) · (x_1 + x_2) does not compute the
polynomial x_1^2 − x_2^2.
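The distinction between syntactic and semantic equality can be seen concretely with noncommutative symbols (an illustration, not part of the paper):

```python
# Sketch: in F{x1, x2}, monomials with different variable orders are
# distinct, so (x1 - x2)(x1 + x2) != x1^2 - x2^2 as formal polynomials.
import sympy as sp

x1, x2 = sp.symbols('x1 x2', commutative=False)
lhs = sp.expand((x1 - x2) * (x1 + x2))
print(lhs)                         # contains the cross terms x1*x2 and -x2*x1
diff = sp.expand(lhs - (x1**2 - x2**2))
print(diff == 0)                   # False: x1*x2 - x2*x1 does not vanish
```

Over a field, any commutative assignment makes `diff` evaluate to zero, which is exactly the semantic-versus-syntactic gap described above.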
      Note that every polynomial can be computed by a non-commutative formula, and that given a non-
commutative formula we can evaluate it over a commutative domain.
      In [25] Nisan proved exponential lower bounds on the size of noncommutative formulae computing
the permanent and the determinant. An important ingredient of Nisan's result is the following lemma
relating noncommutative formula size to algebraic branching program size:

Lemma 6.2 ([25]). Let f (X1 , . . . , Xd ) be a set-multilinear polynomial. Then

                                               B( f ) ≤ d(F( f ) + 1) .

    Using this we can give the following relationship between noncommutative formulae and ROABs:

Theorem 6.3. Let f(X_1, ..., X_d) be a set-multilinear polynomial computed by a noncommutative formula
of size m. Then f is computed by a ROAB of size d(m + 1).


Proof. Applying Lemma 6.2 we see that f is computed by an algebraic branching program B of size
d(m + 1). We will show that B is also computed by a ROAB of size d(m + 1), by constructing a ROAB
with d + 1 levels in which the variables in X_k label the edges that go from level k − 1 to level k.
      Consider the set of edges in B from level i − 1 to level i. Assume that two different sets of variables
appear on these edges, say X_i and X_j. Then the output of B will contain a monomial of the form
Y x_j Z, where x_j ∈ X_j, Y is a product of variables appearing in levels below i − 1 in B, and Z is a product
of variables appearing in levels above i in B. Note however that f is a set-multilinear polynomial;
in particular, in each monomial of f the variables from X_j appear as the j-th multiplicand, so
no monomial of the form Y x_j Z appears in f. Thus, the coefficient of any monomial Y x_j Z must be zero.
As such, we can substitute the constant 0 for all of the variables of X_j appearing on these edges and obtain
an oblivious branching program B′ computing the same polynomial as B. B′ can be made read-once in
a similar fashion. At the end we get a ROAB with d + 1 levels in which the variables from X_i label the
edges from level i − 1 to level i.

    Combining Theorem 6.3 with Theorem 5.6 we obtain
Theorem 6.4. Let f(X_1, ..., X_d) be a set-multilinear polynomial over X = ∪̇_{i=1}^{d} X_i that is computable
by a noncommutative formula of size m with coefficients from a field F. Then f is learnable in time
polynomial in |X| and m.


7    Acknowledgments
We thank Ran Raz for many helpful discussions in all stages of this work. We also thank Eli Ben-Sasson
for important conversations at an early stage of this research. We thank the anonymous referees for their
valuable comments, and for bringing [6] to our attention.


References
 [1] * D. A NGLUIN: Queries and concept learning.                  Machine Learning, 2:319–342, 1988.
     [ML:l147k68714mhg8m5]. 2.1

 [2] * S. A RORA , C. L UND , R. M OTWANI , M. S UDAN , AND M. S ZEGEDY: Proof verifica-
     tion and the hardness of approximation problems. Journal of the ACM, 45(3):501–555, 1998.
     [JACM:278298.278306]. 1

 [3] * J. A SPNES , R. B EIGEL , M. F URST, AND S. RUDICH: The expressive power of voting polyno-
     mials. In Proc. 23rd STOC, pp. 402–409. ACM Press, 1991. [STOC:103418.103461]. 1.4

 [4] * A. B EIMEL , F. B ERGADANO , N. H. B SHOUTY, E. K USHILEVITZ , AND S. VARRICCHIO:
     Learning functions represented as multiplicity automata. Journal of the ACM, 47(3):506–530,
     2000. [JACM:337244.337257]. 1, 1.1, 1.2, 1.3, 2.3, 2.2, 2.4, 4, 4, 4.2, 5

 [5] * F. B ERGADANO , N. H. B SHOUTY, C. TAMON , AND S. VARRICCHIO: On learning branching
     programs and small depth circuits. In Proc. of the 3rd European Conf. on Computational Learning


     Theory (EuroCOLT’97), volume 1208 of LNCS, pp. 150–161, 1997. [LNCS:73001q2141150g25].
     5, 5

 [6] * F. B ERGADANO , N. H. B SHOUTY, AND S. VARRICCHIO: Learning multivariate polynomials
     from substitution and equivalence queries. Electronic Colloquium on Computational Complexity,
     3(8), 1996. [ECCC:TR96-008]. 4, 4, 4.1, 7

 [7] * F. B ERGADANO AND S. VARRICCHIO: Learning behaviors of automata from multi-
     plicity and equivalence queries. SIAM Journal on Computing, 25(6):1268–1280, 1996.
     [SICOMP:10.1137/S009753979326091X]. 1.3

 [8] * D. B SHOUTY AND N. H. B SHOUTY: On interpolating arithmetic read-once formu-
     las with exponentiation. Journal of Computer and System Sciences, 56(1):112–124, 1998.
     [JCSS:10.1006/jcss.1997.1550]. 6

 [9] * N. H. B SHOUTY, T. R. H ANCOCK , AND L. H ELLERSTEIN:        Learning arith-
     metic read-once formulas.    SIAM Journal on Computing, 24(4):706–735, 1995.
     [SICOMP:10.1137/S009753979223664X]. 6

[10] * N. H. B SHOUTY, T. R. H ANCOCK , AND L. H ELLERSTEIN: Learning boolean read-once for-
     mulas over generalized bases. Journal of Computer and System Sciences, 50(3):521–542, 1995.
     [JCSS:10.1006/jcss.1995.1042]. 6

[11] * N. H. B SHOUTY AND Y. M ANSOUR: Simple learning algorithms for decision trees
     and multivariate polynomials. SIAM Journal on Computing, 31(6):1909–1925, 2002.
     [SICOMP:10.1137/S009753979732058X]. 1

[12] * N. H. B SHOUTY, C. TAMON , AND D. K. W ILSON: On learning width two branching
     programs.   Information Processing Letters, 65(4):217–222, 1998. [IPL:10.1016/S0020-
     0190(97)00204-4]. 5

[13] * J. W. C ARLYLE AND A. PAZ: Realizations by stochastic finite automata. Journal of Computer
     and System Sciences, 5(1):26–40, 1971. 2.3

[14] * M. F LIESS: Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197–224,
     1974. 2.3

[15] * D. G RIGORIEV AND M. K ARPINSKI: An exponential lower bound for depth 3 arithmetic cir-
     cuits. In Proc. 30th STOC, pp. 577–582. ACM Press, 1998. [STOC:276698.276872]. 1.4

[16] * D. G RIGORIEV, M. K ARPINSKI , AND M. F. S INGER: Computational complex-
     ity of sparse rational interpolation. SIAM Journal on Computing, 23(1):1–11, 1994.
     [SICOMP:10.1137/S0097539791194069]. 1

[17] * D. G RIGORIEV AND A. A. R AZBOROV: Exponential complexity lower bounds for depth 3
     arithmetic circuits in algebras of functions over finite fields. In Proc. 39th FOCS, pp. 269–278.
     IEEE Computer Society Press, 1998. [FOCS:10.1109/SFCS.1998.743456]. 1.4


[18] * T. R. H ANCOCK AND L. H ELLERSTEIN: Learning read-once formulas over fields and extended
     bases. In Proc. of the 4th Ann. Conf. on Computational Learning Theory (COLT ’91), pp. 326–336.
     Morgan Kaufmann, 1991. [ACM:114836.114867]. 6

[19] * J. H ÅSTAD: Almost optimal lower bounds for small depth circuits. In Proc. 18th STOC, pp.
     6–20. ACM Press, 1986. [STOC:12130.12132]. 1.4

[20] * M. A. H UANG AND A. J. R AO: Interpolation of sparse multivariate polynomials
     over large finite fields with applications. Journal of Algorithms, 33(2):204–228, 1999.
     [JAlg:10.1006/jagm.1999.1045]. 1

[21] * J. C. JACKSON , A. R. K LIVANS , AND R. A. S ERVEDIO: Learnability beyond AC0 . In Proc.
     34th STOC, pp. 776–784. ACM Press, 2002. [STOC:509907.510018]. 1.4

[22] * A. K LIVANS AND A. S HPILKA: Learning arithmetic circuits via partial derivatives.
     In Proc. of the 16th Ann. Conf. on Learning Theory (COLT ’03), pp. 463–476, 2003.
     [COLT:48b02anqvmv32a6j]. 1.4

[23] * A. R. Klivans and D. Spielman: Randomness efficient identity testing of multivariate
     polynomials. In Proc. 33rd STOC, pp. 216–223. ACM Press, 2001. [STOC:380752.380801]. 1

[24] * N. Linial, Y. Mansour, and N. Nisan: Constant depth circuits, Fourier transform and
     learnability. Journal of the ACM, 40(3):607–620, 1993. [JACM:174130.174138]. 1.4

[25] * N. Nisan: Lower bounds for non-commutative computation. In Proc. 23rd STOC, pp. 410–418,
     1991. [STOC:103418.103462]. 1.3, 1.4, 5.1, 5, 5.5, 6, 6, 6.2

[26] * N. Nisan and A. Wigderson: Lower bounds on arithmetic circuits via partial derivatives.
     Computational Complexity, 6(3):217–234, 1997. [CC:v34728p847187762]. 1.3, 1.4, 2.3, 5

[27] * H. Ohnishi, H. Seki, and T. Kasami: A polynomial time learning algorithm for recognizable
     series. IEICE Transactions on Information and Systems, E77-D(10):1077–1085, 1994. 1.3, 2.2

[28] * R. Raz: Multi-linear formulas for permanent and determinant are of super-polynomial size. In
     Proc. 36th STOC, pp. 633–641. ACM Press, 2004. [STOC:1007352.1007353]. 5

[29] * R. Raz: Separation of multilinear circuit and formula size. Theory of Computing, 2(6):121–135,
     2006. Preliminary version appeared in FOCS ’04, pp. 344–351. [ToC:v002/a006]. 5

[30] * R. Raz and A. Shpilka: Deterministic polynomial identity testing in non-commutative
     models. Computational Complexity, 14(1):1–19, 2005. [CC:p24h4777l51112j8]. 5

[31] * R. E. Schapire and L. M. Sellie: Learning sparse multivariate polynomials over a field
     with queries and counterexamples. Journal of Computer and System Sciences, 52(2):201–213,
     1996. [JCSS:10.1006/jcss.1996.0017]. 1

[32] * J. T. Schwartz: Fast probabilistic algorithms for verification of polynomial identities.
     Journal of the ACM, 27(4):701–717, 1980. [JACM:322217.322225]. 5.1, 5.9


[33] * A. Shpilka and A. Wigderson: Depth-3 arithmetic circuits over fields of characteristic zero.
     Computational Complexity, 10(1):1–27, 2001. [CC:p8hryxqwkmfr9cm0]. 1.3, 1.4, 4.2

[34] * M. Sudan, L. Trevisan, and S. P. Vadhan: Pseudorandom generators without the XOR lemma.
     Journal of Computer and System Sciences, 62(2):236–266, 2001. [JCSS:10.1006/jcss.2000.1730]. 1

[35] * J. H. van Lint and R. M. Wilson: A Course in Combinatorics. Cambridge University Press,
     2001. 6

[36] * R. Zippel: Probabilistic algorithms for sparse polynomials. In Proc. Intern. Symp. on
     Symbolic and Algebraic Manipulation, volume 72 of Lecture Notes in Computer Science, pp.
     216–226. Springer, 1979. [LNCS:y1157233175643jq]. 5.1, 5.9


AUTHORS

      Adam R. Klivans
      Department of Computer Science
      The University of Texas at Austin
      Austin, TX 78712-1188
      klivans cs utexas edu
      http://www.cs.utexas.edu/~klivans/


      Amir Shpilka
      Department of Computer Science
      The Technion
      Haifa, 32000
      Israel
      shpilka cs technion ac il
      http://www.cs.technion.ac.il/~shpilka/


ABOUT THE AUTHORS

      ADAM R. KLIVANS received his B. S. and M. S. from Carnegie Mellon University and his
         Ph. D. from MIT, where Dan Spielman was his advisor. He then held an NSF
         Mathematical Sciences Postdoctoral Fellowship at Harvard under the guidance of
         Leslie Valiant. After spending six months at the Toyota Technological Institute at
         Chicago as a visiting professor, he became an assistant professor in the Department
         of Computer Science at the University of Texas at Austin. He is frequently confused
         with Adam Kalai.





AMIR SHPILKA was born in 1972 in Israel and obtained his Ph. D. in Computer
   Science and Mathematics from the Hebrew University of Jerusalem in 2001, under the
   supervision of Avi Wigderson. Since 2005 he has been a CS faculty member at the
   Technion. He is married to Carmit and has two children. His research interests lie in
   Complexity Theory, mainly in Arithmetic Circuit Complexity. When not working or
   enjoying his family he likes to read and play chess.



