Quantum Fan-out is Powerful

Authors Peter H{\o}yer, Robert {\v S}palek,
Plaintext
                                  T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103
                                              http://theoryofcomputing.org




                          Quantum Fan-out is Powerful
                                           Peter Høyer∗                             Robert Špalek†
                                      Received: October 26, 2004; published: August 3, 2005.




        Abstract: We demonstrate that the unbounded fan-out gate is very powerful. Constant-
        depth polynomial-size quantum circuits with bounded fan-in and unbounded fan-out over a
        fixed basis (denoted by QNC0f ) can approximate with polynomially small error the follow-
        ing gates: parity, mod[q], And, Or, majority, threshold[t], exact[t], and Counting. Classi-
        cally, we need logarithmic depth even if we can use unbounded fan-in gates. If we allow
        arbitrary one-qubit gates instead of a fixed basis, then these circuits can also be made exact
        in log-star depth. Sorting, arithmetic operations, phase estimation, and the quantum Fourier
        transform with arbitrary moduli can also be approximated in constant depth.

ACM Classification: F.2.1, F.2.2
AMS Classification: 68Q15, 81P68
Key words and phrases: quantum computing, quantum circuits, fan-out, quantum Fourier transform,
constant depth circuits, threshold circuits

1     Introduction
In this paper, we study the power of shallow quantum circuits. Long quantum computations encounter
various problems with decoherence, hence we want to speed them up as much as possible. We can
exploit the following two types of parallelism:

    1. Gates on different qubits can be applied at the same time.
    ∗ Supported by Canada’s NSERC and the Canadian Institute for Advanced Research (CIAR).
    † Work conducted in part while at Vrije Universiteit, Amsterdam. Partially supported by EU fifth framework project QAIP,

IST-1999-11234 and RESQ, IST-2001-37559.


Authors retain copyright to their work and grant Theory of Computing unlimited rights
to publish the work electronically and in hard copy. Use of the work is permitted as
long as the author(s) and the journal are properly acknowledged. For the detailed
copyright statement, see http://theoryofcomputing.org/copyright.html.

c 2005 Peter Høyer and Robert Špalek                                                           DOI: 10.4086/toc.2005.v001a005
                                     P ETER H ØYER , ROBERT Š PALEK

   2. Commuting gates can be applied to the same qubits at the same time.
    The first approach is just the classical parallel computation. The second approach only makes sense
when the gates applied on the same qubits commute, i.e. AB = BA, otherwise the outcome would be am-
biguous. Being able to do this is a strong assumption, however there are models of quantum computers,
in which it is physically feasible: ion-trap computers [4] and bulk-spin resonance (NMR) [9]. The basic
idea is that if two quantum gates commute, so do their Hamiltonians and therefore we can apply their
joint operation by performing both evolutions at the same time. This type of research started after the
Mølmer–Sørensen paper [15]. Recently, a Hamiltonian implementing the fan-out gate (which is crucial
for all our simulations) has been proposed by Fenner [8].
    In our paper, we investigate how much the power of quantum computation would increase if we
allow such commuting gates. The computation in the stronger model must be efficient, therefore we
do not require the ability to perform any set of commuting gates. This is in accordance with standard
quantum computation, where we also allow only some gates. We choose a representative, the so-called
unbounded fan-out gate, which is a sequence of controlled-not gates sharing one control qubit. We
call it fan-out, because if all target qubits are zero, then the gate copies the classical source bit into n
copies. We show that fan-out is in some sense universal for all sets of commuting gates. In particular,
the joint operation of any set of commuting gates (that can be easily diagonalised) can be simulated by
a constant-depth quantum circuit using just one-qubit and fan-out gates. To achieve this, we generalise
the parallelisation method of [17, 10] and adapt it to the constant-depth setting.
    We state our results in terms of circuit complexity classes. Classically, the main classes computed
by constant-depth, polynomial-size circuits are:
      NC0 with Not and bounded fan-in gates: And, Or,
      AC0 with Not and unbounded fan-in gates: And, Or,
      TC0 with Not and unbounded fan-in gates: And, Or, threshold[t] for all t,
      AC0 [q] with Not and unbounded fan-in gates: And, Or, mod[q],
      ACC0 = q AC0 [q].
               S


The zero in the exponent means constant depth, in general NCk means (logk n)-depth circuits. Several
separations between these classes are known. Razborov [18] proved that TC0 is strictly more powerful
than ACC0 . Using algebraic methods, Smolensky [21] proved that AC0 [q] 6= AC0 [q0 ], where q, q0 are
powers of distinct primes. In other words, threshold gates cannot be simulated by constant-depth circuits
with unbounded fan-in Or gates, and mod[q] gates do not simulate each other.
    The main quantum circuit classes corresponding to the classical classes are QNC0 , QAC0 , QTC0 ,
and QACC0 . We use subscript ‘f’ to indicate circuits where we allow the fan-out gate (e.g. QNC0f ).
Classically, fan-out (copying the result of one gate into inputs of other gates) is taken for granted. Sur-
prisingly, in contrast to the classical case, some of the quantum circuit classes are the same. Moore [16]
proved that parity is equivalent to fan-out, i.e. QAC0f = QAC0 [2]. Green et al. [10] proved that allowing
mod[q] gates with different moduli always leads to the same quantum classes, i.e. QACC0 = QAC0 [q]
for every integer q ≥ 2.
    In this paper, we extend these results and show that even exact[t] gates (which output 1 if the input
is of Hamming weight t, and 0 otherwise) can be approximated with polynomially small error by fan-
out and single qubit gates in constant depth. Our simulations have polynomially small error. Since

                         T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                              82
                                     Q UANTUM FAN - OUT IS P OWERFUL

exact[t] gates can simulate And, Or, threshold[t], and mod[q] gates, we conclude that the bounded-error
versions of the following classes are equal: B-QNC0f = B-QAC0f = B-QTC0f . The exact[t] gate can be
approximated in constant depth thanks to the parallelisation method. However, the simulation is not so
straightforward as for mod[q] in [10] and it works only with high probability.
    We then introduce a so-called Or-reduction that converts n input bits x into log n output bits y and
preserves the Or function, i.e. x is nonzero if and only y is. We show how to implement it exactly in
constant depth and use it to achieve exact computation of Or and exact[t] in log-star depth. (Circuits of
log-star depth are defined in Section 5.) We also apply the Or-reduction to decrease the size of most of
our circuits.
    Our results concerning the threshold[t] gate have several interesting implications. Siu et al. [20]
proved that sorting and integer arithmetic (addition and multiplication of n integers, and division with
remainder) are computable by constant-depth threshold circuits. It follows that all of them can be ap-
proximated in B-QNC0f .
    The last contribution of our paper concerns the quantum Fourier Transform (QFT). Cleve and Wa-
trous [5] published an elegant log-depth quantum circuit that approximates the QFT. By optimising their
methods to use the fan-out gate, we can approximate the QFT in constant depth with polynomially small
error. First, we develop a circuit for the QFT with respect to a power-of-2 modulus, and then, using
a technique of [11], we show that the QFT with respect to arbitrary moduli can be approximated too.
Hence the QFT is in B-QNC0f . The QFT has many applications, one of which is the phase estimation of
an unknown quantum state.
    Shor’s original algorithm for factoring [19] uses the QFT and the modular exponentiation. Cleve
and Watrous [5] have shown that it can be adapted to use modular multiplication of n integers. Since
we prove that both the QFT and arithmetic operations are in B-QNC0f , polynomial-time bounded-error
algorithms with oracle B-QNC0f can factorise numbers and compute discrete logarithms. We can make
the following conclusions: First, if B-QNC0f can be simulated by a BPP machine, then factoring can
be done in polynomial time by bounded-error Turing machines. Second, since it unlikely that BQP =
B-QNC0f , factoring and discrete logarithms are likely not the hardest things quantum computers can do.


2    Quantum circuits with unbounded fan-out
Quantum circuits resemble classical reversible circuits. A quantum circuit is a sequence of quantum
gates ordered into layers. The gates are consecutively applied in accordance with the order of the layers.
Gates in one layer can be applied in parallel. The size of a gate is the number of affected qubits. The
depth of a circuit is the number of layers and the size is the total size of all its gates. A circuit can solve
problems of a fixed input size, so we define families of circuits containing one circuit for every input
size. We consider only uniform families, whose description can be generated by a log-space Turing
machine.
     A quantum gate is a unitary operator applied to some subset of qubits. We usually use gates from
a fixed universal basis (Hadamard gate, rotation by an irrational multiple of π, and the controlled-not
gate) that can approximate any quantum gate with good precision [1]. The qubits are divided into 2
groups: Input/output qubits contain the description of the input at the beginning and they are measured
in the computational basis at the end. Ancilla qubits are initialised to |0i at the beginning and the circuits

                          T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                83
                                      P ETER H ØYER , ROBERT Š PALEK

                                                    H                       H       H           H
                                                    H                       H       H           H
                         =                   =                                  =
                                                    H                       H       H           H
                    2                               H                       H       H           H


                                Figure 1: Equivalence of parity and fan-out


usually clean them at the end, so that the output qubits are in a pure state and the ancillas may be reused.
   Since unitary evolution is reversible, every operation can be undone. Running the computation
backward is called uncomputation and is often used for cleaning ancilla qubits.


2.1     Definition of quantum gates
Quantum circuits cannot use a naive quantum fan-out gate mapping every quantum superposition

                                          |φ i|0i . . . |0i → |φ i . . . |φ i

due to the no-cloning theorem [23]. Such a gate is not linear, let alone unitary. Instead, our fan-out
gate copies only classical bits and the effect on superpositions is determined by linearity. It acts as a
controlled-not-. . . -not gate, i.e. it is an unbounded sequence of controlled-not gates sharing one control
qubit. Parity is a natural counterpart of fan-out. It is an unbounded sequence of controlled-not gates
sharing one target qubit.

Definition 2.1. The fan-out gate maps |y1 i . . . |yn i|xi → |y1 ⊕ xi . . . |yn ⊕ xi|xi, where x ⊕ y = (x + y)
mod 2. The parity gate maps |x1 i . . . |xn i|yi → |x1 i . . . |xn i|y ⊕ (x1 ⊕ · · · ⊕ xn )i.

Example 2.2. As used in [16], parity  and fan-out can simulate each other in constant depth. The
                               1    1
Hadamard gate is H = √12                 and it holds that H 2 = I. If a controlled-not gate is preceded
                               1 −1
and succeeded by Hadamard gates on both qubits, it just turns around. Since parity is a sequence of
controlled-not gates, we can turn around all of them in parallel. The circuit is shown in Figure 1.

      In this paper, we investigate the circuit complexity of, among others, these gates:

Definition 2.3. Let x = x1 . . . xn and let |x| denote the Hamming weight of x. The following (n + 1)-qubit
gates map |xi|yi → |xi|y ⊕ g(x)i, where g(x) = 1 iff

                  |x| > 0: Or,               |x| = n: And (Toffoli),            |x| ≥ 2n : majority,
            |x| mod q = 0: mod[q],            |x| ≥ t: threshold[t],             |x| = t: exact[t],

A counting gate is any gate that maps |xi|0m i → |xi| |x| i for m = dlog(n + 1)e.

                           T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                              84
                                    Q UANTUM FAN - OUT IS P OWERFUL


                                                                P
                                              =
                                         U           A    B     C


                     Figure 2: Implementing an arbitrary controlled one qubit gate

2.2   Quantum circuit classes
Definition 2.4. QNCf (d(n)) contains operators computed exactly (i.e. without error) by uniform fami-
lies of quantum circuits with fan-out of depth O(d(n)), polynomial size, and over a fixed basis. QNCkf =
QNCf (logk n). R-QNCkf contains operators approximated with one-sided, and B-QNCkf with two-sided,
polynomially small error.
Remark 2.5. The circuits below are over a fixed universal basis, unless explicitly mentioned otherwise.
Some of our circuits need arbitrary one-qubit gates to be exact. For simplicity, we sometimes include
several fixed-size gates (e.g. the binary Or gate and controlled one-qubit gates) in our set of basis gates.
This inclusion does not influence the asymptotic depth of our circuits, since every s-qubit quantum    gate
                                                                                          3  s
                                                                                               
can be decomposed into a sequence of one-qubit and controlled-not gates of length O s 4 [2].
    For every one-qubit gate U, there exist one-qubit gates A, B,C and a rotation P = Rz (α) such that
the controlled gate U is computed by the constant-depth circuit shown in Figure 2 [2, Lemma 5.1]. If a
qubit controls more one-qubit gates, then we can still use this method in constant depth. We just replace
the controlled-not gate by the fan-out gate and the rotations P are multiplied.


3     Parallelisation method
In this section, we describe a general parallelisation method for achieving very shallow circuits. We then
apply it to the rotation by Hamming weight and the rotation by value, and show how to compute them
in constant depth.

3.1   General method
The unbounded fan-out gate is universal for commuting gates in the following sense: Using fan-out,
gates can be applied to the same qubits at the same time whenever (1) they commute, (2) we know the
basis in which they all are diagonal, and (3) we can efficiently change into the basis. The method reduces
the depth, but may in general require the use of ancilla qubits.
Lemma 3.1. [13, Theorem 1.3.19] For every set of pairwise commuting unitary gates, there exists an
orthogonal basis in which all the gates are diagonal.
Theorem 3.2. [17, 10] Let {Ui }ni=1 be pairwise commuting gates on k qubits. Gate Ui is controlled
by qubit |xi i. Let T be a gate changing the basis according to Lemma 3.1. There exists a quantum
circuit with fan-out computing U = ∏ni=1 Uixi having depth maxni=1 depth(Ui ) + 4 · depth(T ) + 2, size
∑ni=1 size(Ui ) + (2n + 2) · size(T ) + 2n, and using (n − 1)k ancillas.

                         T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                              85
                                         P ETER H ØYER , ROBERT Š PALEK



             n      ...



             k        T      T†    U1     T     T†    U2         T   ...    T†    Un   T    T†

                                   V1                 V2                          Vn


                          Figure 3: A serial circuit with interpolated basis changes

                                   n          ...

                                   k      T                V1              T†
                                   |0i                      V2                   |0i
                               n
                                                     ...
                                   |0i                          Vn               |0i


                 Figure 4: A parallelised circuit performing U = T † (∏ni=1 Vixi )T = ∏ni=1 Uixi


Proof. Consider a circuit that applies all Ui sequentially. Put T T † = I between Ui and Ui+1 . The circuit
is shown in Figure 3. Take Vi = T †Ui T as new gates. They are diagonal in the computational basis,
hence they just impose some phase shifts. Multiple phase shifts on entangled states multiply, so can be
applied in parallel. We use fan-out gates twice: first to create n entangled copies of target qubits and
then to destroy the entanglement. The final circuit with the desired parameters is shown in Figure 4.

Example 3.3. As used in [16], it is simple to prove that mod[q] ∈ QNC0f . Each input qubit controls
one increment modulo q on a counter initialised to 0. At the end, we obtain |x| mod q. The modular
increments commute and thus can be parallelised. Since q is fixed, changing the basis and the increment
can both be done in constant depth.

3.2   Rotation by Hamming weight and value
In this paper, we often use a rotation by Hamming weight Rz (ϕ|x|) and a rotation by value Rz (ϕx),
where Rz (α) is one-qubit rotation around the z-axis by angle α: Rz (α) = |0ih0| + eiα |1ih1|. They can
both be computed in constant depth.
Lemma 3.4. For every angle ϕ, there exist constant-depth, linear-size quantum circuits with fan-out
computing Rz (ϕ|x|) and Rz (ϕx) on input x = xn−1 . . . x1 x0 .
Proof. The left circuit in Figure 5 shows how to compute the rotation by Hamming weight. Each input
qubit controls Rz (ϕ) on the target qubit, hence the total angle is ϕ|x|. These controlled rotations are

                           T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                           86
                                             Q UANTUM FAN - OUT IS P OWERFUL


      |x0 i                                        |x0 i                |x0 i                       |x0 i
      |x1 i                                        |x1 i                |x1 i                       |x1 i
                  ...                                                               ...
      |xn−1 i                                      |xn−1 i              |xn−1 i                     |xn−1 i
      |0i+|1i
        √                                          |0i+eiϕ|x| |1i       |0i+|1i                     |0i+eiϕx |1i
                                                       √
          2                    Rz (ϕ)                    2
                                                                          √
                                                                            2
                                                                                          Rz (ϕ)        √
                                                                                                          2


         |0i                   Rz (ϕ)                                     |0i             Rz (2ϕ)

              ancillas          ...                                      ancillas           ...

         |0i                   Rz (ϕ)                                     |0i         Rz (2n−1 ϕ)



                                      Figure 5: Rotation by Hamming weight and value


parallelised using the parallelisation method. The right circuit shows the rotation
                                                                                by value. It is similar
to the rotation by Hamming weight, only the input qubit |x j i controls Rz ϕ2 j , hence the total angle is
ϕ ∑n−1    j
     j=0 2 x j = ϕx.

Remark 3.5. The construction uses rotations Rz (ϕ) for arbitrary ϕ ∈ R. However, we are only allowed
to use a fixed set of one-qubit gates. It is easy to see that every rotation can be approximated with poly-
nomially small error by Rz (θ q) = (Rz (θ ))q , where sin θ = 53 and q is a polynomially large integer [1].
These q rotations commute, so can be applied in parallel and the depth is preserved. The approximation
can be kept down to polynomially small error while increasing the size of the circuit only polynomially.


4      Constant-depth approximate circuits
4.1     Or gate
It is easy to see that the rotation by Hamming weight of a string y of length m with angle ϕ = 2π   m can be
                                              m                                m
used to distinguish the zero string y = 0 from strings with approximately 2 ones. We, however, want
to distinguish the zero string from all nonzero strings. It turns out that if we compute m = O(n log n)
rotations by Hamming weight of the input x with angles distributed evenly around the circle, we obtain
a string y that is either zero (for x = 0n ), or has expected Hamming weight m2 (for x 6= 0n ). By combining
these two results, we can approximate the Or gate and, with a minor modification, also the exact[t] gate
in constant depth.
     Let w ∈ N0 and let ϕ be an angle. Define a notation for the following one-qubit state:

                                                                    1 + eiϕw       1 − eiϕw
                               |µ ϕw i = (H · Rz (ϕw) · H) |0i =             |0i +          |1i.               (4.1)
                                                                       2              2
                         |x|
By Lemma 3.4, |µ ϕ i can be computed in constant depth and linear size.

                                 T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                87
                                             P ETER H ØYER , ROBERT Š PALEK

                           0                                                             1
Theorem 4.1. Or ∈ R-QNC  f . In particular, Or can be approximated with one-sided error n in constant
                  2
depth and size O n log n .

Proof. Let n denote the size of the input x. Let m = a · n, where a will be chosen later. For all
                                                          |x|
k ∈ {0, 1, . . . , m − 1}, compute in parallel |yk i = |µ ϕk i for angle ϕk = 2π
                                                                              m k. If |yk i is measured in the
computational basis, the expected value of the outcome Yk ∈ {0, 1} is
                                         2
                          1 − eiϕk |x|                       eiϕk |x| + e−iϕk |x| − 2   1 − cos(ϕk |x|)
               E[Yk ] =                      = e−iϕk |x| ·                            =                 .
                              2                                          4                    2

If all these m qubits |yi are measured, the expected Hamming weight of all Y ’s is
                              "     #
                                m−1
                                          m 1 m−1
                                                                
                                                         2πk          0 if |x| = 0,
                 E[|Y |] = E ∑ Yk = − ∑ cos                  |x| =    m
                                k=0       2 2 k=0         m           2    if |x| 6= 0.

The qubits |yi are actually not measured, but their Hamming weight |y| controls another rotation on a
                                            |y|
new ancilla qubit |zi. So compute |zi = |µ 2π/m i. Let Z be the outcome after |zi is measured. If |y| = 0,
then Z = 0 with certainty. If |y| − m2 ≤ √mn , then

                                                  2                               2π
                                             2π
                                                        1 + cos 2π        1 − cos √
                                                                      
                            1 + ei m |y|                           |y|
                                                                                        
                                                                 m                 n    1
                 P[Z = 0] =                           =                 ≤            =O    .
                                2                              2               2        n

Assume that |x| =6 0. We want to upper-bound the probability of the bad event thatm|Y | is not close to m2 .
                                                                                                     1
Since 0 ≤ Yk ≤ 1, we can use Hoeffding’s Lemma 4.2 below and obtain P |Y | − 2 ≥ εm ≤ ε 2 m . Fix
                                                                                                   2
a = log n and ε = √1n . Now, P |y| − m2 ≥ √mn ≤ 2m/n   1
                                                          = 21a = n1 . The probability that we observe the
                                                

incorrect result Z = 0 is at most the sum of the probabilities of the two bad events, i.e. O n1 . Hence
                                                                                                

                                                
                                                   1  if |x| = 0,
                                    P[Z = 0] =
                                                   O 1n     if |x| 6= 0.

The circuit has constant depth and size O(mn) = O n2 log n . It is outlined in Figure 6. The figure is
                                                                

slightly simplified: unimportant qubits and uncomputation of ancillas are omitted.

Lemma 4.2 (Hoeffding [12]). If Y1 , . . . ,Ym are independent random variables bounded by ak ≤ Yk ≤ bk ,
then, for all ε > 0,

                                                             −2m2 ε 2
                   P [|S − E[S]| ≥ εm] ≤ 2 exp                              ,   where S = ∑m
                                                                                           i=k Yk .
                                                         ∑m
                                                          k=1 (bk − ak )
                                                                        2


Remark 4.3. Since the outcome z of the circuit in Figure 6 is a classical bit, we can save it in an ancilla
qubit by applying a controlled-not gate and clean |yi by uncomputation. It remains to prove that the
intermediate qubits |yi need not be measured, in order to be able to uncompute them. We show above
that the output qubit is a good approximation of the logical Or, provided |yi is immediately measured.

                          T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                88
                                                 Q UANTUM FAN - OUT IS P OWERFUL

                                                                           |yi                   |zi
                                                                 |x|                    |y|
                                |x1 . . . xn i             | µ 2π 0 i                | µ 2π i
                                                                 m                       m


                                                                 |x|
                                |00 . . . 0i               | µ 2π 1 i
                                                                 m


                                                            ...

                                                             |x|
                                |00 . . . 0i              | µ 2π (m−1) i
                                                                 m




                                  Figure 6: Constant depth circuit approximating Or


By the principle of deferred measurement, we can use controlled quantum operations and measure |yi
at the end. However, the output bit is close to a classical bit (the distance depends on the error of the
computation), thus it is only slightly entangled with |yi, and hence it does not matter whether |yi is
measured.

Definition 4.4. Let log(k) x denote the k-times iterated logarithm log log . . . log x. The log-star function,
log∗ x, is the maximum number of iterations k such that log(k) x exists and is real.1

Remark 4.5. If we require error n1c , we create c copies and compute the exact Or of them by a binary
tree of Or gates. The tree has depth log c = O(1). In Section 6.1, we show how to approximate Or in
constant depth and size O(n log(k) n) for any constant k. In Section 6.2, we show how to compute Or
exactly in log-star depth and linear size.

4.2     Exact[t] and threshold[t] gates
Theorem 4.6. exact[t] ∈ R-QNC0f .

Proof. We slightly modify the circuit for Or. As outlined in Figure 7, by adding the rotation Rz (−ϕt) to
                                                                   |x|−t           |x|
the rotation by Hamming weight in the first layer, we obtain |µ ϕ i instead of |µ ϕ i. The second layer
stays the same. If the output qubit |zi is measured, then
                                                
                                                   1  if |x| = t,
                                    P[Z = 0] =
                                                   O n1    if |x| 6= t.

We obtain an approximation of the exact[t] gate with one-sided polynomially small error.

Remark 4.7. Other gates are computed from the exact[t] gate by standard methods. For example,

                                             3
                                                     exact[t +1], . . . , exact[n]. The
threshold[t] can be computed as the parity of exact[t],
                                                                                       0
                                                                                         depth stays constant
and the size is just n-times bigger, i.e. O n log n , hence threshold[t] ∈ B-QNCf . In Section 6.3, we
show how to approximate exact[t], threshold[t], and counting in constant depth and size O(n log n).
   1 The log-star of the estimated number of atoms in the universe is 5.         Consequently, for the computational problems we
consider in this paper, the log-star is in practice at most 5.


                                T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                           89
                                             P ETER H ØYER , ROBERT Š PALEK

                                   |x0 i                                   |x0 i
                                   |x1 i                                   |x1 i
                                                ...
                                   |xn−1 i                                 |xn−1 i
                                                                               |x|−t
                                       |0i       H      Rz (ϕ)      H      |µϕ         i


                                       |0i              Rz (ϕ)

                                                ...

                                       |0i              Rz (ϕ)
                                                                     added
                                                                     rotation
                                       |0i            Rz (–ϕt)



                          Figure 7: Rotation by Hamming weight with added rotation


4.3     Arithmetic operations
Using threshold gates, one can do arithmetic operations in constant depth. The following circuits take as
part of the input an ancilla register in state |0i and output the result of the computation in that register.

Theorem 4.8. The following functions are in B-QNC0f : addition and multiplication of n integers, divi-
sion of integers with remainder, and sorting of n integers.

Proof. By [20], these functions are computed by constant-depth,2 polynomial-size threshold circuits. A
threshold circuit is built of weighted threshold gates. It is simple to prove that the weighted threshold
gate (with polynomially large integer weights) also is in B-QNC0f . One only needs to rotate the phase of
the quantum state in Lemma 3.4 by integer multiples of the basic angle.

      In the following section, we require a reversible version of modular addition.

Definition 4.9. Let q be an n-bit integer and x1 , . . . , xm ∈ Zq . The reversible addition gate maps addm :
|qi|x1 i . . . |xm i → |qi|x1 i . . . |xm−1 i|yi, where y = (∑m
                                                              i=1 xi ) mod q.

Lemma 4.10. addm ∈ B-QNC0f .

Proof. By Theorem 4.8, y = (∑m     i=1 xi ) mod q can be approximated in constant depth and polynomial
size. The result is, however, stored into ancilla qubits. Hence we have to erase xm , which we may achieve
by first negating the contents in y by |yi → | − yi, computing the sum w = y + ∑m−1 i=1 xi in a fresh ancilla,
do a bitwise control-not of w into xm , uncompute w, and finally re-negate y. We then swap the ancillas
|yi with the erased qubits in |xm i.
  2 The depths are really small, from 2 to 5.



                             T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                            90
                                         Q UANTUM FAN - OUT IS P OWERFUL

4.4     Quantum Fourier transform
The QFT is a very powerful tool used in several quantum algorithms, e.g. factoring of integers and
computing the discrete logarithm [19].
Definition 4.11. The quantum Fourier transform with respect to modulus q performs the Fourier trans-
form on the quantum amplitudes of the state, i.e. it maps

                                                  1 q−1 xy
                              Fq : |xi → |ψx i = √ ∑ ω |yi, where ω = e2πi/q ,                                   (4.2)
                                                   q y=0

for x ∈ {0, 1, . . . , q − 1} and it behaves arbitrarily on the other states.

4.4.1    QFT with a power-of-2 modulus
Let q = 2n . Coppersmith has shown in [6] how to compute the QFT in quadratic depth, quadratic size,
and without ancillas. The depth has further been improved to linear [folklore]. Cleve and Watrous
have shown in [5] that the QFT can be approximated with error ε in depth O log n + log log ε1 and
                                                                                                     

size O n log εn . They also show that if only gates acting on a constant number of qubits are allowed (in
particular, the fan-out gate is not allowed), logarithmic depth is necessary. We show that the approximate
circuit for the QFT from [5] can be compressed to constant depth, if we allow the fan-out gate.
Theorem 4.12. QFT ∈ B-QNC0f .
Proof. The operator F2n : |xi → |ψx i can be computed by composing:

          1.    Fourier state construction (QFS):             |xi|0i . . . |0i        → |xi|ψx i|0i . . . |0i
          2.    Copying Fourier state (COPY):                 |xi|ψx i|0i . . . |0i   → |xi|ψx i . . . |ψx i
          3.    Uncomputing phase estimation (QFP):           |ψx i . . . |ψx i|xi    → |ψx i . . . |ψx i|0i
          4.    Uncomputing COPY:                             |ψx i . . . |ψx i|0i    → |ψx i|0i . . . |0i

The following lemmas show that each of these individual operators is in B-QNC0f .

Lemma 4.13. QFS ∈ QNC0f .
                                                                          2πir
Proof. QFS maps |xi|0i → |xi|ψx i. Define a shortcut |ρr i = |0i+e√2 |1i . It is simple to prove that |ψx i =
|ρx/21 i|ρx/22 i . . . |ρx/2n i.
                                     n                n   n
                                1 2 −1         1 2 −1 O
                               √ ∑ ω xy |yi = √ ∑
                                                             n−k
                 |ψx i =                                 ω x2 yn−k |yn−k i
                                 n
                                2 y=0           n
                                               2 y=0 k=1
                                    n 1                       n                  k   n
                                1 O                              |0i + e2πix/2 |1i O
                                       ∑      2n−k x b
                                                             O
                         =     √           (ω       )  |bi =            √         =     |ρx/2k i.
                                2n k=1 b=0                   k=1          2         k=1

                                                                                            2π
                                                                                                   |0i+|1i
The n qubits |ρx/2k i can be computed from x in parallel as follows: |ρx/2k i = Rz          2 k x     √
                                                                                                        2
                                                                                                            is computed
by the rotation by value (Lemma 3.4) in constant depth and linear size.


                             T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                     91
                                        P ETER H ØYER , ROBERT Š PALEK

                                                        |ρ0.1 i
                                                   5              3
                                                   4π             4π
                                                                   |ρx/2k i
                                        |ρ0.11 i                    |ρ0.01 i

                                                   7              1
                                                   4π             4π
                                                        |ρ0.0 i


                              Figure 8: Measurement of |ρx/2k i in a random basis


Lemma 4.14. COPY ∈ B-QNC0f .

Proof. COPY maps |ψx i|0i . . . |0i → |ψx i . . . |ψx i. Take the reversible addition gate modulo 2n mapping
(add22n)|yi|xi = |yi|(x + y) mod 2n i. It is simple to prove that add−1 |ψy i|ψx i = |ψx+y i|ψx i.
                                    n
                                1 2 −1 ly+kx               1
              |ψy i|ψx i =       n  ∑
                                2 l,k=0
                                        ω    |li|ki →add−1 n ∑ ω ly+kx |li|k − li
                                                          2 l,k
                                1                           1
                                2n ∑
                          =            ω ly+(m+l)x |li|mi = n ∑ ω l(x+y)+mx |li|mi = |ψx+y i|ψx i.
                                   l,m                     2  l,m


Hence add−1 |ψ0 i|ψx i = |ψx i|ψx i. The state |ψ0 i = H ⊗n |0n i is easy to prepare in constant depth. Fur-
thermore, (addm    −1
                2n) |ψ0 i . . . |ψ0 i|ψx i = |ψx i . . . |ψx i|ψx i, because the addition of m − 1 numbers into
one register is equivalent to m − 1 consecutive additions of one number. Each such a reversible addi-
tion copies |ψx i into 1 register. Note that the addm        2n gate performs all these additions in parallel. By
Lemma 4.10, the reversible addition gate is in B-QNC0f .

Lemma 4.15. QFP ∈ B-QNC0f .

Proof. QFP maps |ψx i . . . |ψx i|0i → |ψx i . . . |ψx i|xi. By Cleve and Watrous [5, Section 3.3], we can
compute x with probability at least 1 − ε from O log εn copies of |ψx i in depth O log n + log log ε1 and
size O n log εn . Use ε = poly(n)    1
                                          . It is simple to convert their circuit into constant depth, provided we
have fan-out. The details are sketched below.
     The input consists of m = O log εn copies of |ψx i = |ρx/21 i|ρx/22 i . . . |ρx/2n i. Measure each |ρx/2k i
                                                 
m                                                     m
 2 times in the basis {|ρ0.01 i, |ρ0.11 i} and 2 times in the Hadamard basis {|ρ0.00 i, |ρ0.10 i}. The state
             1         2πi(0.x     ...x x  )
|ρx/2k i = √2 (|0i + e         k−1     1 0 ) lies on the middle circle of the Bloch sphere; it is shown in Figure 8.


If |ρx/2k i is in the white region, then the measurement in the first basis tells whether xk−1 = 0 or 1 with
probability at least 34 . If |ρx/2k i is in the shaded region, then the measurement in the Hadamard basis
tells whether xk−1 = xk−2 or ¬xk−2 (denoted by P, N) with probability at least 43 .
     For each k, perform the majority vote and obtain the correct answer zk ∈ {0, 1, P, N} with error
probability at most 21m = εn . The probability of having any error is at most n times bigger, i.e. at most ε.
Compute xn−1 . . . x1 x0 from zn−1 . . . z1 z0 in constant depth. The bit xk is computed as follows:

                           T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                    92
                                     Q UANTUM FAN - OUT IS P OWERFUL

   1. If zk zk−1 . . . zl+1 ∈ {P, N} and zl ∈ {0, 1}, compute the parity of the number of N’s and add it to zl
      (assuming z−1 = 0), otherwise return 0.
   2. Check and compute all prefixes l in parallel and take the logical Or of the results.
All the gates used (fan-out, parity, And, Or, majority) are in B-QNC0f .

4.4.2   QFT with an arbitrary modulus
Let q 6= 2n . Cleve and Watrous have shown in [5] that the QFT can be approximated with error ε in depth
O (log log q)(log log ε ) and size poly(log q + log ε1 ). We show that their circuit can also be compressed
                       1
                         

into constant depth, if we use the fan-out gate. The relation between quantum Fourier transforms with
different moduli was described in [11].
Remark 4.16. We actually implement a slightly more general operation, when q is not a fixed constant,
but an n-bit input number. This generalised QFT maps |qi|xi → |qi|ψx i. The register |qi is implicitly
included in all operations. We will henceforth omit it and the generalised operations are denoted simply
by QFTq , QFSq , COPYm  q , and QFPq .

Theorem 4.17. QFTq ∈ B-QNC0f .
Proof. Let |dummyq,x i denote an unspecified quantum state depending on two parameters q, x. The
operator Fq0 : |xi → |ψx i|dummyq,0 i can be computed by composing:

            1.   QFSq :                        |xi   → |xi|ψx i|dummyq,x i
                                                                                           ⊗m
            2.   COPYm+1q :                          → |xi|ψx i|dummyq,x i |ψx i|dummyq,0 i
                                                                           ⊗m
            3.   Uncomputing QFSq :                  → |xi |ψx i|dummyq,0 i
                                                                         ⊗m
            4.   Uncomputing QFPq :                  → |ψx i|dummyq,0 i
            5.   Uncomputing COPYm  q:               → |ψx i|dummyq,0 i,

where empty registers are omitted for clarity. The state |dummyq,0 i is not entangled with |xi and hence
it can be traced out. We obtain the quantum Fourier transform Fq . The following lemmas show that each
of these individual operators is in B-QNC0f .

Lemma 4.18. QFSq ∈ B-QNC0f .
Proof. QFSq maps |xi|0i → |xi|ψx i|dummyq,x i for some “garbage” state |dummyq,x i. We will show that
QFSq is well approximated by a QFS with a power-of-2 modulus of the magnitude q3 . Let n = dlog qe.
Take N = 3n and extend x by leading zeroes into N bits. Using Lemma 4.13, perform QFS2N and obtain
                      N      2πi
the state |xi √1 N ∑2y=0−1 e 2N xy |yi.
               2
    Set u = b2N /qc and apply integer division by u to the second register, i.e. map |yi → |y1 i|y2 i, where
y1 = by/uc ∈ {0, 1, . . . , q} and y2 = y mod u. This can be done reversibly in constant depth by a few
applications of Theorem 4.8 using the method from Lemma 4.10. The quantum state can be written as
                          N
                    1 2 −1 2πiN xy                 1 q−1 u−1 2πiN x(y1 u+y2 )
                   √   ∑
                    2N y=0
                           e 2     |y 1 i|y 2 i = √    ∑ ∑ e2
                                                   2N y1 =0 y2 =0
                                                                              |y1 i|y2 i + |wi ,


                          T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                               93
                                         P ETER H ØYER , ROBERT Š PALEK

                            2πi
                            x(qu+z)
where |wi = √1 N ∑v−1
                   z=0 e
                         2N         |qi|zi and v = 2N mod u = 2N − qu = 2N mod q < q. The sum has
               2
been rearranged using y = y1 u + y2 . Now, k |wi k= 2vN = O(2−n ) is exponentially small and so it can
                                                      p

be neglected. Decompose the quantum state into the tensor product

                                   1 q−1 2πi ( q u)xy             q u−1 2πiN xy2
                                                             r
                                  √ ∑ e q 2N 1 |y1 i ⊗
                                                                 2N y∑
                                                                          e2     |y2 i .
                                    q y1 =0                          2 =0


                                   N                   N
Now, u is exponentially close to 2q , because 2qN u = 2 2−v = 1 − O 2−2n . Since xyq1 = O(2n ), the replace-
                                                                        
                                                         N

ment of 2quN by 1 in the exponent causes only exponentially small error O(2−n ). Hence the quantum state
is exponentially close to

                             1 q−1 2πi xy  1 u−1 2πi xz
                            √ ∑ e q |yi ⊗ √ ∑ e 2N |zi = |ψx i|dummyq,x i .
                              q y=0         u z=0

The “garbage” state |dummyq,x i arises as a byproduct of the higher precision 3n-bit arithmetic. We
clean it up later by uncomputing QFSq after copying |ψx i; see the proof of Theorem 4.17. It actually
gets replaced by |dummyq,0 i = √1u ∑u−1
                                    z=0 |zi, which does not depend on x and it thus causes no harm. We
have approximated QFSq in constant depth.
                         0
Lemma 4.19. COPYm
                q ∈ B-QNCf .

Proof. COPYm                                                          ⊗(m−1) . The proof is similar to the proof of
                   q maps |ψx i|0i . . . |0i → |ψx i(|ψx i|dummyq,0 i)
Lemma 4.14. First, prepare m − 1 states |ψ0 i|dummyq,0 i by applying QFSq to |0i|0i (Lemma 4.18).
Second, use the inverse of the reversible addition modulo q to map (addm                −1 : |ψ i . . . |ψ i|ψ i →
                                                                                      q)        0         0   x
|ψx i . . . |ψx i|ψx i (Lemma 4.10).

Lemma 4.20. QFPq ∈ B-QNC0f .

Proof. QFPq maps |ψx i . . . |ψx i|0i → |ψx i . . . |ψx i|xi. We use an idea similar to the proof of Lemma 4.18.
Let n = dlog qe and N = 3n. Extend |ψx i by leading zeroes to N bits and apply F2†N to them (Theo-
rem 4.12). We obtain many copies of the state
                                                                               !
                                                         2N −1 q−1 −2πi
                                                   1
                                                          ∑ ∑ e 2N zy+ q xy |zi .
                                                                           2πi
                           F2†N (|0i|ψx i) = p
                                                   2N q z=0 y=0
                                                                                           N
The exponent can be rewritten to 2πi( qx − 2zN ) · y. Intuitively, if |z − 2N qx | ≤ 28q , then | qx − 2zN | ≤ 8q
                                                                                                                1
                                                                                                                  , the
                                                          π
absolute value of the angle in the exponent is at most 4 for every y ∈ {0, 1, . . . , q − 1}, and the amplitudes
sum up constructively. If z is not close to 2N qx , then the amplitudes interfere destructively. The quantum
state has most of its amplitude on the good z’s. So we compute reversibly by division with remainder an
estimate x0 = b 2zqN + 21 c. A detailed analyzis shows that P[x0 = x] ≥ 12 + δ for some constant δ [5, 11].
Here we do not present the details, because our goal is the compression of the circuit from [5] into
constant depth.

                            T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                      94
                                        Q UANTUM FAN - OUT IS P OWERFUL

    We transform all m = O log εn input quantum states |ψx i into m independent estimates |x0 i. We
                                    

then estimate all bits of x one-by-one from these m estimates by majority gates. Each bit of x is wrong
                                      n
with probability at most 2−m = 2− log ε = εn . The probability of having an error among the n bits of x is
thus at most ε. Finally, save the estimation of x in the target register and uncompute the divisions and
the quantum Fourier transforms. With probability at least 1 − ε, the mapping QFPq has been performed.
          1
Use ε = polyn .

4.5     Quantum phase estimation
The method of computing QFT2n can be also used for phase estimation.

Theorem 4.21. Given a gate Sx : |yi|φ i → |yiRz 2πx
                                                                  
                                                              2n y |φ i for basis states |yi, where x ∈ Z2 is unknown,
                                                                                                           n
                                                                                                    n
we can determine          x with  probability  at least   1 − ε in constant  depth,  size O   n log ε , and  using the Sx
                n
                   
gate O n log ε times.

Proof. Obtain an estimate of x by applying the QFP to O log εn copies of the quantum state |ψx i =
                                                                               

|ρx/21 i|ρx/22 i . . . |ρx/2n i. Each |ρx/2k i can be computed by one application of Sx to |2n−k i |0i+|1i √
                                                                                                             2
                                                                                                                , because
                 2πx |0i+|1i            2πx   n−k   |0i+|1i
                                                 
|ρx/2k i = Rz 2k            √
                             2
                                 = Rz 2n 2            √
                                                        2
                                                            .


5      Exact circuits of small depth
In the previous section, we have shown how to approximate the exact[t] gate in constant depth. In this
section, we show how to compute it exactly in log-star depth. The circuits in this section use arbitrary
one-qubit gates instead of a fixed basis, otherwise they would not be exact.

Lemma 5.1. The function Or on n qubits can be reduced exactly to Or on m = dlog(n + 1)e qubits in
constant depth and size O(n log n).

Proof. We use a technique similar to the proof of Theorem 4.1. Recall the quantum state |µ ϕw i defined
                                                                                               |x|
by Equation (4.1) on page 87. For k ∈ {1, 2, . . . , m}, compute in parallel |yk i = |µ ϕk i for angle ϕk = 2π
                                                                                                            2k
                                                                                                               .
Let |yi = |y1 y2 . . . ym i.

      • If |x| = 0, then hy|0m i = 1, because |yk i = |0i for each k.

               6 0, then hy|0m i = 0, because at least one qubit yk is one with certainty. Take the unique
      • If |x| =
        decomposition of |x| into a product of a power of 2 and an odd number: |x| = 2a (2b + 1) for
        a, b ∈ N0 . Then
                                                          2π   a
                               1 − eiϕa+1 |x| 1 − ei 2a+1 2 (2b+1) 1 − eiπ(2b+1) 1 − eiπ
                   h1|ya+1 i =               =                    =             =        =1 .
                                    2                   2                2          2

It follows that x is non-zero if and only if y is. Hence the original problem is exactly reduced to a
problem of logarithmic size.


                            T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                        95
                                    P ETER H ØYER , ROBERT Š PALEK

Theorem 5.2. exact[t] ∈ QAC0f .

Proof. Using the methods from Theorem 4.6 and Lemma 5.1, exact[t] can also be reduced to Or of log-
arithmic size. The reduction has constant depth and size O(n log n). Hence exact[t] is QNC0f -reducible
to Or, or simply exact[t] ∈ QAC0f , because QAC0f includes both QNC0f and the Or gate.

Theorem 5.3. exact[t] ∈ QNCf (log∗ n), i.e. exact[t] can be computed exactly in log-star depth and size
O(n log n).

Proof. Apply the reduction used in Lemma 5.1 in total (log∗ n)-times, until the input size is at most 2.
Compute and save the outcome, and clean ancillas by uncomputation. The circuit size is O(n log n).


6     Circuits of small size
In this section, we decrease the size of some circuits. We allow the use of arbitrary one-qubit gates
instead of a fixed basis.


6.1   Constant depth approximation of Or
In this section, we apply the reduction from Lemma 5.1 repeatedly to shrink the circuit for Or. We first
reduce the size of the circuit to O(n log n). We then develop a recurrent method that reduces the size
even further. Let us define a useful notation.

Definition 6.1. Let x = x1 x2 . . . xn . By Or-reduction n → m with error ε we mean a quantum circuit
mapping |xi|0m i → |xi|ϕi such that, if |x| = 0, then |ϕi = |0m i and, if |x| 6= 0, then h0m |ϕi ≤ ε.

   The Or-reduction preserves the logical Or of qubits, i.e. |x| = 0 iff |ϕ| = 0 with high probability.
Theorem 4.1 provides an Or-reduction n → 1 with error 1n , constant depth, and size n2 log n. Lemma 5.1
provides an Or-reduction n → log n with error 0, constant depth, and size n log n.

Lemma 6.2. There is an Or-reduction n → 1 with error n1 , constant depth, and size n log n.
                                 √                    √
Proof. Divide the input into lognn blocks of size n log n. First, reduce each block by Lemma 5.1
                                                                        √                         √
to 21 log n + log log n = O(log n) qubits in constant depth and size n log2 n. In total, we obtain n
new qubits in size n log n. Second, compute the logical Or by Theorem 4.1 in constant depth, size
√ 2       √
  n log n = O(n log n), and error √1n . To amplify the error to 1n , repeat the computation twice and
return 1 if any of them returns 1 (the error is one-sided). The circuit size is doubled.

     The circuit size can be reduced to O(n log(d) n) for any constant number d of iterations of the log-
arithm. The trick is to divide input qubits into small blocks and perform the reduction step on each of
them. The number of variables is reduced by a small factor and we can thus afford to apply a circuit of
a slightly bigger size. It we repeat this reduction step d times, we obtain the desired circuit.

                        T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                            96
                                    Q UANTUM FAN - OUT IS P OWERFUL

Theorem 6.3. There exist constants c1 , c2 such that for every d ∈ N, there is an Or-reduction n → 1 with
error n1 , depth c1 d, and size c2 dn log(d) n.
Proof. By induction on d: we have already verified the case d = 1 in Lemma 6.2. For the induction
step: Divide n input qubits into n/ log(d−1) n blocks of log(d−1) n qubits. Using Lemma 5.1, reduce each
block to log(d) n qubits in constant depth and size c2 log(d−1) n · log(d) n. Total size is c2 n log(d) n. We
           n
obtain (d−1)    log(d) n = o(n) new qubits. Using the induction hypothesis, compute their logical Or in
       log    n                                        
                                                   (d)
depth c1 (d − 1) and size c2 (d − 1)      n
                                         (d−1) log     n · log(d−1) o(n) ≤ c2 (d − 1)n log(d) n. Together, it
                                       log    n
takes depth c1 d and size c2 dn log(d) n.
    The only approximate step is the application of Lemma 6.2 for d = 1. It is applied on logn n log(d) n
variables, hence the error is O(log n/n). It can be amplified to 1n by running the computation twice.

6.2   Log-star depth computation of Or
Our best constant-depth circuit for Or is described by Theorem 6.3. It is approximate and it has slightly
super-linear size. In this section, we show that we can achieve an exact circuit of linear size if we
relax the restriction of constant depth. We consider d in Theorem 6.3 a slowly growing function of n
instead of a constant. Now we can use an Or-reduction better than Lemma 6.2. Theorem 5.3 provides
an Or-reduction n → 1 with error 0, log-star depth, and size n log n.
Lemma 6.4. There exist constants c1 , c2 such that for every d ∈ N, there is an Or-reduction n → 1 with
error 0, depth c1 d + log∗ n, and size c2 dn log(d) n.
Proof. The same as of Theorem 6.3, but use the Or-reduction from Theorem 5.3 instead of Lemma 6.2
in the last layer (for d = 1). The size stays roughly the same, the circuit becomes exact, and the depth is
increased by an additional term of log∗ n.

Theorem 6.5. There is an Or-reduction n → 1 with error 0, log-star depth, and linear size.
Proof. Divide the input into logn∗ n blocks of size log∗ n. Compute the logical Or of each block by a
balanced binary tree of depth log(log∗ n) < log∗ n and in linear size. UsingLemma 6.4 with d = log∗ n,
                                                                                                    ∗
compute the logical Or of logn∗ n new qubits in log-star depth and size O log∗ n · logn∗ n · log(log n) n =
O(n).

6.3   Approximation of counting and threshold[t]
In this section, we use the QFT for the parallelisation of increments. This allows us to approximate the
Hamming weight of the input in smaller size O(n log n).
Definition 6.6. The increment gate maps Incrn : |xi → |(x + 1) mod 2n i.
Lemma 6.7. The increment gate is diagonal in the Fourier basis and its diagonal form is in QNC0 .
                       n
Proof. Let ω = e2πi/2 and let |xi be any computational basis state. It is simple to prove the following
two equations:

                           T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                             97
                                          P ETER H ØYER , ROBERT Š PALEK

                                                      n
    1. Incrn = F2†n Dn F2n for diagonal Dn = ∑2y=0
                                                 −1 y
                                                   ω |yihy|.
                                   −1 xy   n        −1 (x+1)y              n

                   †           ∑2y=0
                                †
                                     ω |yi      ∑2y=0 ω       |yi
                 F DF|xi = F D     √       = F†       √           = |(x + 1) mod 2n i .
                                     2 n                2n


    2. D = Rz (π) ⊗ Rz (π/2) ⊗ · · · ⊗ Rz π/2n−1 .
                                                

                                               n                                    n
                                               O          n−k x                     O            k
                       D|xi = ω x |xi =              ω2           n−k
                                                                        |xn−k i =         (e2πi/2 )xn−k |xn−k i =
                                               k=1                                  k=1
                                 n
                                 O               
                                          Rz 2π/2k |xn−k i = Rz (π) ⊗ · · · ⊗ Rz π/2n−1 |xi.
                                                                                       
                            =
                                    k=1



We conclude that Incr = F † DF, and that D is a tensor product of one-qubit operators.

Remark 6.8. The addition of a fixed integer b is as hard as the increment. By Lemma 6.7, Incrb =
F † Db F and (Rz (ϕ))b = Rz (ϕb), hence the diagonal version of the addition of b is also in QNC0 .
Theorem 6.9. Counting can be approximated in constant depth and size O(n log n).
Proof. Compute the Hamming weight of the input. Each input qubit controls one increment on an
m-qubit counter initialised to 0, where m = dlog(n + 1)e. The increments Incrm are parallelised (Theo-
rem 3.2 and Lemma 6.7), so we apply the quantum Fourier transform F2m twice (Theorem 4.12) and the
n constant-depth controlled Dm gates in parallel. The size is O(poly(m) + nm) = O(n log n).


Remark 6.10. threshold[t] is equal to the most significant qubit of the counter if we align it to a power
of 2 by adding a fixed integer 2m − t. exact[t] can be computed by comparing the counter with t.


7     Concluding remarks
7.1    Comparison with randomised circuits
Let us compare our results for quantum circuits with similar results for classical randomised circuits.
We consider randomised circuits with bounded fan-in of Or and And gates, and unbounded fan-out and
parity (similar to the quantum model). Classical lower bounds are folklore and we attach the proofs for
the convenience of the reader in Appendix A.

                         Gate                                           Randomised            Quantum
                         Or and threshold[t] exactly                    Θ(log n)              O(log∗ n)
                         mod[q] exactly                                 Θ(log n)              Θ(1)
                         Or with error 1n                               Θ(log log n)          Θ(1)
                         threshold[t] with error 1n                     Ω(log log n)          Θ(1)


                          T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                        98
                                         Q UANTUM FAN - OUT IS P OWERFUL

7.2     Relations of quantum circuit classes
We have shown that B-QNC0f = B-QAC0f = B-QACC0 = B-QTC0f (Theorem 4.6). If we allow arbitrary
one-qubit gates, then also QTC0f = QAC0f ⊆ QNCf (log∗ n) (Theorem 5.2 and Theorem 5.3). Several
open problems of [10] have thus been solved. Only little is known about classes that do not include the
fan-out gate. For example, we do not know whether TC0 ⊆ QTC0 , we only know that TC0 ⊆ QTC0f .
It is simple to prove that parity is in TC0 . Take the logical Or of exact[1], exact[3], exact[5], . . . , and
compute exact[k] from threshold[k] and threshold[k + 1]. However, this method needs fan-out to copy
the input bits and hence it is not in QTC0 .
     Fang et al. proved [7] a lower bound for fan-out. In particular, they showed that logarithmic depth
is needed to approximate parity using only a constant number of ancillas. Unfortunately, their method
breaks down with more than a linear number of ancillas and it cannot be extended to other unbounded
fan-in gates such as majority or threshold[t].


7.3     Upper bounds for B-QNC0f
Shor’s original factoring algorithm [19] uses modular exponentiation and the quantum Fourier transform
modulo 2n followed by a polynomial-time deterministic algorithm. The modular exponentiation ax can
                                                                               n−1                   k
be replaced by multiplication of some subset of numbers a, a2 , a4 , . . . , a2 [5]. The n numbers a2 can
be quickly precomputed classically.
     Since both multiplication of n numbers (Theorem 4.8) and the QFT (Theorem 4.12) are in B-QNC0f ,
there is a polynomial-time bounded-error classical algorithm with oracle B-QNC0f factoring numbers,
i.e. factoring ∈ RP[B-QNC0f ]. If B-QNC0f ⊆ BPP,3 then factoring ∈ RP[BPP] ⊆ BPP[BPP] = BPP. Dis-
crete logarithms can be computed in a similar way using modular exponentiation and the quantum
Fourier transform modulo general q [19]. Since QFTq ∈ B-QNC0f (Theorem 4.17), we conclude that
also discrete-log ∈ RP[B-QNC0f ].


7.4     Open problems
We propose the following open problems on computational aspects of multi-qubit gates:

      i. Is there a constant-depth exact circuit for Or?

   ii. Is there a constant-depth linear-size circuit for Or?

  iii. Are there exact circuits with a fixed basis?

   iv. Can we simulate unbounded fan-out in constant depth using unbounded fan-in gates, e.g. thresh-
       old[t] or exact[t]?
   3 In this context, B-QNC0 denotes the set of languages decided with bounded error by constant-depth quantum circuits with
                           f
fan-out.


                             T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                          99
                                       P ETER H ØYER , ROBERT Š PALEK

A     Lower bounds on classical circuits
Using the polynomial method [3], we prove several lower bounds on the depths of deterministic circuits.
We consider circuits with fan-in of Or and And gates at most 2, and unbounded fan-out and parity, the
same as in the quantum model.
    Basically, the value of each bit computed by a circuit can be computed by a multi-linear polynomial
(over the field Z2 ) in the input bits. We are interested in the degree of such a polynomial; by proving a
lower bound on the degree, we also lower-bound the depth of the circuit. It is simple to prove that the
polynomial computing a Boolean function is unique.
    Each input bit xk ∈ {0, 1} is computed by the polynomial xk of degree 1. The Not gate computes the
polynomial 1− p(x), where p(x) is the polynomial computing its argument, and the degree is unchanged.
The And gate computes the polynomial p1 (x) · p2 (x) and the two degrees are summed. The parity gate
computes the polynomial (p1 (x) + · · · + pk (x)) mod 2 of degree equal to the maximum degree among
the arguments.
Lemma A.1. The output of a circuit of depth d has degree at most 2d .
Proof. By induction: by adding a new layer, we can at most double the degree when using the And
gate.

    And of n bits is computed by a (unique) polynomial x1 x2 . . . xn of degree n. Hence every circuit
computing And has depth at least log n. It is simple to prove by contradiction that also Or, threshold[t],
and exact[t] have full degree n. Smolensky has proved a much stronger result [21], which implies that
also the degree of mod[q] for q > 2 is n.

   Randomised circuits have access to random bits and may produce the result with a small error. Some
functions are computed in smaller depth in this model.
Lemma A.2. Or can be computed with one-sided error 21 by a randomised circuit of depth 2. The error
can be decreased to 1n in additional depth log log n.
Proof. Take n random bits and output the parity x1 r1 ⊕ x2 r2 ⊕ · · · ⊕ xn rn . If |x| = 0, then the circuit
always outputs 0. If |x| > 0, then the probability that the parity is odd is equal to 21 . If we perform
the computation (log n)-times using independent random bits, we decrease the probability of error to
( 12 )log n = 1n . This can be done in additional depth log log n by a balanced binary tree of Or gates.

    By Yao’s principle [24], if we have a randomised circuit with error less than 2−n , then there exists
an assignment of random bits such that the result is always correct. That is there exists a deterministic
circuit of the same shape. Hence also randomised circuits computing the logical Or with exponentially
small error have depth at least log n.
Lemma A.3. Every circuit computing Or with error 1n has depth at least log log n.
Proof. Assume the converse: there exists a circuit of depth d < log log n with error 1n . By computing
                                                                                n
the logical Or independently logn n -times, we can reduce the error to ( n1 ) log n = 2−n . This can be done in
additional depth log logn n = log n − log log n. The total depth of this circuit is log n − log log n + d < log n.
However, by Yao’s principle, the depth has to be at least log n.

                          T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                                  100
                                  Q UANTUM FAN - OUT IS P OWERFUL

Acknowledgements
We thank Harry Buhrman, Hartmut Klauck, and Hein Röhrig at CWI in Amsterdam, and Fred Green
at Clark University in Worcester for plenty of helpful discussions, and Ronald de Wolf at CWI for help
with writing the paper. We thank Klaus Mölmer and Brian King for discussions on physical implemen-
tations of multi-qubit gates. We are grateful to Schloss Dagstuhl, Germany, for providing an excellent
environment, where part of this work was carried out. We thank the anonymous reviewers for their
valuable comments.


References
 [1] * L. M. A DLEMAN , J. D E M ARRAIS , AND M. A. H UANG: Quantum computability. SIAM Journal
     on Computing, 26(5):1524–1540, 1997. [SICOMP:29363]. 2, 3.5

 [2] * A. BARENCO , C. B ENNETT, R. C LEVE , D. P. D I V INCENZO , N. M ARGOLUS , P. S HOR ,
     T. S LEATOR , J. A. S MOLIN , AND H. W EINFURTER: Elementary gates for quantum computa-
     tion. Physical Review A, 52:3457–3467, 1995. [PRA:10.1103/PhysRevA.52.3457, arXiv:quant-
     ph/9503016]. 2.5

 [3] * R. B EIGEL: The polynomial method in circuit complexity. In Proc. of 8th IEEE Structure in
     Complexity Theory Conf., pp. 82–95, 1993. [1993.336538]. A

 [4] * J. I. C IRAC AND P. Z OLLER: Quantum computations with cold trapped ions. Phys. Rev. Lett.,
     74:4091–4094, 1995. [PRL:10.1103/PhysRevLett.74.4091]. 1

 [5] * R. C LEVE AND J. WATROUS: Fast parallel circuits for the quantum Fourier transform. In Proc.
     of 41st IEEE FOCS, pp. 526–536, 2000. [FOCS:10.1109/SFCS.2000.892140]. 1, 4.4.1, 4.4.1,
     4.4.2, 4.4.2, 7.3

 [6] * D. C OPPERSMITH: An approximate Fourier transform useful in quantum factoring. IBM tech-
     nical report RC19642, quant-ph/0201067, 1994. [arXiv:quant-ph/0201067]. 4.4.1

 [7] * M. FANG , S. F ENNER , F. G REEN , S. H OMER , AND Y. Z HANG: Quantum lower bounds for
     fanout. 2003. [arXiv:quant-ph/0312208]. 7.2

 [8] * S. A. F ENNER: Implementing the fanout gate by a Hamiltonian.            2003.   [arXiv:quant-
     ph/0309163]. 1

 [9] * N. G ERSHENFELD AND I. C HUANG: Bulk spin resonance quantum computation. Science,
     275:350–356, 1997. [doi:10.1126/science.275.5298.350]. 1

[10] * F. G REEN , S. H OMER , C. M OORE , AND C. P OLLETT: Counting, fanout, and the complex-
     ity of quantum ACC. Quantum Information and Computation, 2(1):35–65, 2002. [arXiv:quant-
     ph/0106017]. 1, 3.2, 7.2

                       T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                         101
                                   P ETER H ØYER , ROBERT Š PALEK

[11] * L. H ALES AND S. H ALLGREN: Quantum Fourier sampling simplified. In Proc. of 31st ACM
     STOC, pp. 330–338, 1999. [STOC:301250.301336]. 1, 4.4.2, 4.4.2

[12] * W. H OEFFDING: Probability inequalities for sums of bounded random variables. J. Amer. Statist.
     Assoc., 58:13–30, 1963. 4.2

[13] * R. A. H ORN AND C. R. J OHNSON: Matrix Analysis. Cambridge University Press, 1985. 3.1

[14] * P. H ØYER AND R. Š PALEK: Quantum circuits with unbounded fan-out. In Proc. of 20th STACS,
     pp. 234–246, 2003. LNCS 2607. [STACS:80j4ju67n25kcf06].

[15] * K. M ØLMER AND A. S ØRENSEN: Multiparticle entanglement of hot trapped ions. Phys. Rev.
     Lett., 82:1835–1838, 1999. [PRL:10.1103/PhysRevLett.82.1835]. 1

[16] * C. M OORE: Quantum circuits: Fanout, parity, and counting. 1999. [arXiv:quant-ph/9903046].
     1, 2.2, 3.3

[17] * C. M OORE AND M. N ILSSON: Parallel quantum computation and quantum codes. SIAM Journal
     on Computing, 31(3):799–815, 2002. [SICOMP:35505, arXiv:quant-ph/9808027]. 1, 3.2

[18] * A. A. R AZBOROV: Lower bounds for the size of circuits of bounded depth with basis {&, ⊕}.
     Math. Notes Acad. Sci. USSR, 41(4):333–338, 1987. 1

[19] * P. W. S HOR: Algorithms for quantum computation: discrete logarithms and factoring. In Proc.
     of 35th IEEE FOCS, pp. 124–134, 1994. [FOCS:10.1109/SFCS.1994.365700]. 1, 4.4, 7.3

[20] * K.-Y. S IU , J. B RUCK , T. K AILATH , AND T. H OFMEISTER: Depth efficient neural networks for
     division and related problems. IEEE Transactions on Information Theory, 39(3):946–956, 1993.
     [TIT:10.1109/18.256501]. 1, 4.3

[21] * R. S MOLENSKY: Algebraic methods in the theory of lower bounds for Boolean circuit complex-
     ity. In Proc. of 19th ACM STOC, pp. 77–82, 1987. [STOC:28395.28404]. 1, A

[22] * R. Š PALEK: Quantum circuits with unbounded fan-out. Master’s thesis, Faculty of Sciences,
     Vrije Universiteit, Amsterdam, 2002. Shorter version and improved results in quant-ph/0208043.
     [arXiv:quant-ph/0208043].

[23] * W. K. W OOTTERS AND W. H. Z UREK: A single quantum cannot be cloned. Nature, 299:802–
     803, 1982. [doi:10.1038/299802a0]. 2.1

[24] * A. C-C. YAO: Probabilistic computations: Toward a unified measure of complexity. In Proc. of
     18th IEEE FOCS, pp. 222–227, 1977. A



                       T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                         102
                                Q UANTUM FAN - OUT IS P OWERFUL

AUTHORS

   Peter Høyer
   professor
   Department of Computer Science, University of Calgary
   Alberta, Canada
   hoyer cpsc ucalgary ca
   http://www.cpsc.ucalgary.ca/~hoyer/


   Robert Špalek
   graduate student
   Centrum voor Wiskunde en Informatica
   Amsterdam, The Netherlands
   sr cwi nl
   http://www.ucw.cz/~robert/


ABOUT THE AUTHORS

   P ETER H ØYER received his Ph.D. in Computer Science from the University of Southern
       Denmark, Odense Campus, under the supervision of Joan Boyar and Gilles Brassard.
       His research interests include algorithmics, data structures, and quantum information.
       He is currently working on having the algorithms of the present article to be included as
       DLL’s in the final release of Vista, formerly known as Longhorn.


   ROBERT Š PALEK received his Masters Degrees in Computer Science from Charles Uni-
     versity, Prague and Vrije Universiteit, Amsterdam. He is currently a graduate student
     at CWI, advised by Harry Buhrman. His research interests include quantum comput-
     ing, computational complexity, algorithms, data structures, and search engines. He also
     enjoys climbing, salsa, photography, travelling to distant countries, and playing guitar.




                     T HEORY OF C OMPUTING, Volume 1 (2005), pp. 81–103                            103