Authors Yuval Dagan, Yuval Filmus, Hamed Hatami, Yaqiao Li,
License CC-BY-3.0
THEORY OF COMPUTING, Volume 14 (6), 2018, pp. 1–73
www.theoryofcomputing.org
SPECIAL ISSUE: CCC 2017
Trading Information Complexity for Error
Yuval Dagan Yuval Filmus∗ Hamed Hatami† Yaqiao Li
Received June 2, 2017; Revised April 16, 2018; Published May 15, 2018
Abstract: We consider the standard two-party communication model. The central problem
studied in this article is how much one can save in information complexity by allowing an
error of ε.
• For arbitrary functions, we obtain lower bounds and upper bounds indicating a gain that is of order Ω(h(ε)) and O(h(√ε)), respectively. Here h denotes the binary entropy function.
• We analyze the case of the two-bit AND function in detail to show that for this function
the gain is Θ(h(ε)). This answers a question of Braverman et al. (STOC’13).
• We obtain sharp bounds for the set disjointness function of order n. For the case of the distributional error, we introduce a new protocol that achieves a gain of Θ(√h(ε)) provided that n is sufficiently large. We apply these results to answer another question of Braverman et al. regarding the randomized communication complexity of the set disjointness function.
A conference version of this paper appeared in the Proceedings of the 32nd IEEE Conference on Computational Complexity,
2017 [13].
∗ This research was supported by the Israel Science Foundation (grant no. 1337/16).
† Supported by NSERC.
ACM Classification: F.1.3
AMS Classification: 68Q17, 68W20
Key words and phrases: communication complexity, information complexity
© 2018 Yuval Dagan, Yuval Filmus, Hamed Hatami and Yaqiao Li
Licensed under a Creative Commons Attribution License (CC-BY) DOI: 10.4086/toc.2018.v014a006
• Answering a question of Braverman (STOC’12), we apply our analysis of the set disjointness function to establish a gap between the two different notions of the prior-free information cost. In the light of Braverman (STOC’12), this implies that the amortized randomized communication complexity is not necessarily equal to the amortized distributional communication complexity with respect to the hardest distribution.
As a consequence, we show that the ε-error randomized communication complexity of
the set disjointness function of order n is n[CDISJ − Θ(h(ε))] + o(n), where CDISJ ≈ 0.4827
is the constant found by Braverman et al. (STOC’13).
Contents
1 Introduction
1.1 Techniques
1.1.1 Stability for the buzzer protocol
1.1.2 The buzzer protocol as a random walk
1.1.3 Product parametrization
1.1.4 Protocol completion
1.1.5 Black-box modification
1.1.6 Computing set disjointness with error
2 Preliminaries
2.1 Notation and basic estimates
2.2 Communication complexity
2.3 Information complexity
2.4 The continuity of information complexity
2.5 Communication protocols as random walks on ∆(X × Y)
3 Main results
3.1 Information complexity with point-wise error
3.2 Information complexity with distributional error
3.3 Information complexity of the AND function with error
3.4 Set disjointness function with error
3.5 Prior-free information cost
3.6 A characterization of trivial measures
4 Proofs for general functions
4.1 Information complexity with point-wise error
4.1.1 Proof of Theorem 3.2
4.1.2 Proof of Theorem 3.5
4.1.3 Proof of Proposition 3.4
4.2 Information complexity with distributional error
4.3 Non-distributional prior-free information cost
4.4 A characterization of trivial measures
5 Parametrization of all distributions as product distributions
6 The analysis of the AND function
6.1 Stability results
6.2 Lower bound on the information complexity of ICµ(AND, ε)
7 The set disjointness function with error
7.1 Proof of Theorem 3.11
7.2 A protocol for Set-Disjointness
8 Open problems and concluding remarks
1 Introduction
Communication complexity studies the amount of communication needed to compute a function whose
inputs are spread among several parties. It has many applications to different areas of complexity theory
and beyond, mostly as a technical tool used for proving lower bounds. Traditionally, communication
complexity has been studied through a combinatorial lens. Recently, information complexity, the approach
to study communication complexity via information theory, has successfully been used to resolve some
of the major problems of this field [11, 2, 3]. While communication complexity is concerned with
minimizing the amount of communication required for two players to evaluate a function, information
complexity is concerned with the amount of information that the communicated bits reveal about the
players’ inputs.
The study of information complexity is motivated by fundamental questions regarding compressing
communication [3, 7, 4, 16] that extend the seminal work of Shannon [29] to the setting where interaction
is allowed. Moreover, it has important applications to communication complexity, and in particular to
the study of the direct-sum problem [2, 3, 8, 9, 11, 14, 17, 18, 19, 20, 22]. For example,
the only known direct-sum result for general randomized communication complexity is proven via
information-theoretic techniques in [3].
One of the most spectacular applications of information complexity, due to Braverman et al. [5], is
determining the exact first order asymptotics of the communication complexity of set disjointness. Our
goal in this paper is to determine the rate of growth of the second-order term (see Equation (1.1) below).
Set disjointness, introduced in [31], is one of the most important functions in communication complexity,
and as a result it has been studied extensively in the past four decades (see the surveys [12, 30] and the
references therein). In this communication problem, which is denoted by DISJn , Alice and Bob each
receives a subset of {1, . . . , n} and they wish to determine whether their sets are disjoint. The goal, first
studied in [1], is to determine the asymptotic rate of growth of the randomized communication complexity
Rε (DISJn ) of set disjointness, defined as the smallest number of bits exchanged by the two players in a
protocol which computes the function correctly with probability at least 1 − ε on every input. The correct asymptotic bound Rε(DISJn) = Θ(n) was first proved by Kalyanasundaram and Schnitger [21]. Razborov [27] later gave a shorter proof, but despite several decades of research in this area, all known proofs of this fact remain intricate. It was thus a great breakthrough when Braverman et al. determined
the exact constant in the asymptotics of Rε (DISJn ) as ε → 0 by employing several recent results from
the area of information complexity. They proved that, as the error parameter ε tends to 0, the quantity lim_{n→∞} Rε(DISJn)/n tends to a constant CDISJ ≈ 0.4827.
Our main result determines the asymptotic rate of growth of the error term CDISJ − Rε(DISJn)/n as ε → 0: we show that for all ε > 0 and all sufficiently large n (depending on ε) we have

    CDISJ − Rε(DISJn)/n = Θ(h(ε)).    (1.1)

Here the constants implied by the Θ notation are absolute.
As in the work of Braverman et al., we obtain our result by analyzing the information complexity
of the 2-bit AND function (in which each player gets one bit). Roughly speaking, the information
complexity ICµ ( f , ε) of a function f with respect to a distribution µ on the inputs is the minimal amount
of information that the players need to leak in any protocol that computes f correctly with probability at
least 1 − ε on every input.1 The asymptotic estimate on Rε (DISJn ) follows by analyzing IC0 (AND, ε) :=
min ICµ (AND, ε), where the minimum is taken over all distributions µ such that µ(1, 1) = 0. Specifically,
we prove the following bound:
CDISJ − IC0 (AND, ε) = Θ(h(ε)). (1.2)
The upper bound on IC0 (AND, ε) is attained by a protocol having one-sided error, which is only allowed
to make a mistake on the input (1, 1). It follows from a black-box modification of the optimal protocol
for AND found by Braverman et al. The lower bound is significantly harder, requiring several new ideas
which could have wider applicability. We sketch these ideas later on in the introduction.
It is natural to ask whether a bound of the form (1.2) holds for arbitrary functions f . Braverman et
al. [5] considered this question in the context of distributional information complexity.2 The distributional
information complexity ICµ ( f , µ, ε) of a function f with respect to a distribution µ on the inputs is
the minimal amount of information that the players need to leak in any protocol that computes f
correctly with probability at least 1 − ε when the inputs are drawn according to µ. They showed that
ICµ(f, µ, 0) − ICµ(f, µ, ε) ≤ C(f, µ) h(ε^{1/8}), where C(f, µ) denotes a positive constant which depends
only on f and µ. We significantly improve this lower bound on ICµ ( f , µ, ε), and obtain the first non-trivial
upper and lower bounds for general functions:
    C1(f, µ) h(ε) ≤ ICµ(f, µ, 0) − ICµ(f, µ, ε) ≤ C2(f, µ) h(√ε),    (1.3)

    C1(f, µ) h(ε) ≤ ICµ(f, 0) − ICµ(f, ε) ≤ C2(f, µ) h(√ε),    (1.4)

for positive constants C1(f, µ) and C2(f, µ).
Our results hold in both the non-distributional and distributional settings, as well as in the prior-free
settings explained below. The upper bounds on ICµ ( f , µ, ε) and ICµ ( f , ε) use the same black-box
technique used to prove the upper bound in (1.2). The lower bounds use protocol completion, a new
technique which also figures in the proof of the lower bound in (1.2).
In classical communication complexity, the distributional setting arises from an application of Yao’s
minimax principle: Rε ( f ) is the maximum over µ of the communication complexity of deterministic
protocols which compute f correctly with probability at least 1 − ε when the inputs are drawn according
to µ. This connection suggests searching for an analog of Rε ( f ) in the setting of information complexity.
Braverman [4] defined two such notions of prior-free information complexity: IC(f, ε) = max_µ ICµ(f, ε) and ICD(f, ε) = max_µ ICµ(f, µ, ε). Using the minimax theorem, he showed that the two notions coincide
when ε = 0. He conjectured that the two notions coincide for all ε, but he could only prove the following
bound, for 0 < α < 1:
    ICD(f, ε) ≤ IC(f, ε) ≤ ICD(f, αε)/(1 − α).    (1.5)
1 There are two different ways to measure information leakage. The usual notion, internal information complexity, measures
how much each player learns about the other player’s input. External information complexity, studied in this paper only in
passing, measures how much an external observer learns about the players’ input.
2 Information complexity and distributional information complexity are often confused in the literature. One reason might be
that they are the same in the zero-error prior-free setting, as shown by Braverman [4] and explained further below.
We separate the two notions of prior-free information complexity, thus showing that this tradeoff is
essentially optimal for set disjointness:
    lim sup_{n→∞} ICD(DISJn, ε)/n ≤ CDISJ − Ω(√h(ε)),    (1.6)

    IC(DISJn, ε)/n ≥ CDISJ − O(h(ε)).    (1.7)
The upper bound on ICD(DISJn, ε) follows from a new protocol for set disjointness which is asymptotically optimal in the distributional prior-free setting, while the lower bound on IC(DISJn, ε) follows from the proof of (1.1).
Since information complexity is amortized communication complexity, we can also state our separation in terms of communication complexity. Let R^m_ε(f^m) denote the randomized communication complexity of computing m copies of f with an error of at most ε on each of the m inputs. Similarly, let D^{µ,m}_ε(f^m) denote the corresponding distributional notion, where the error is measured when the inputs are drawn according to µ. Braverman [4] showed that

    IC(f, ε) = lim_{m→∞} R^m_ε(f^m)/m   and   ICD(f, ε) = lim_{m→∞} max_µ D^{µ,m}_ε(f^m)/m,

and so our separation of IC(DISJn, ε) and ICD(DISJn, ε) also separates

    max_µ D^{µ,m}_ε(DISJn^m)   and   R^m_ε(DISJn^m).
Finally, given a function f we characterize all measures µ such that ICµ ( f , 0) = 0. We also prove
a few results about external information complexity ICext (which we do not define here). Given a function f we characterize all measures µ such that ICext_µ(f, 0) = 0. We also show that the upper bound ICµ(f, ε) ≤ ICµ(f, 0) − Ω(h(ε)) fails for external information complexity:

    ICext_µ(XOR, ε) ≥ ICext_µ(XOR, 0) − 3ε,

where the distribution µ is given by µ(0, 0) = µ(1, 1) = 1/2.
1.1 Techniques
1.1.1 Stability for the buzzer protocol
At the heart of the lower bound IC0 (AND, ε) ≥ CDISJ − O(h(ε)) lies a stability result for almost-optimal
protocols for AND.
Braverman et al. [5] gave an optimal protocol for the AND function, which they call the buzzer
protocol. They also showed that this protocol is essentially the unique optimal protocol for the AND
function. We prove a stability version of this result: any ε-error protocol for AND whose information
cost is close to that of the buzzer protocol must be similar to the buzzer protocol.
There are many possible notions of similarity, and ours (for reasons that will become clear below)
focuses on the leaf distribution of the protocol, which is the distribution of the terminal point of the
protocol. Our stability result roughly states that any ε-error protocol for AND whose information cost is
close to that of the buzzer protocol must have a leaf distribution which shares some similarity with the
leaf distribution of the buzzer protocol.
We prove our stability result by strengthening the technique of local concavity constraints introduced by Braverman et al. On the way, we also simplify their arguments by replacing the discrete second derivatives they use with continuous counterparts.
1.1.2 The buzzer protocol as a random walk
One of our main insights is an alternative description of the buzzer protocol as a random walk.
As part of their analysis of the AND function, Braverman et al. [5] introduced a new perspective
on communication protocols, viewing a communication protocol as a random walk on the space of
distributions. Given an initial distribution over the inputs, they associate with each node in the protocol
tree the a posteriori distribution of the inputs, which is the distribution of the inputs given that the protocol
arrives at the node. Instead of walking down the protocol tree, we can think of the protocol as a random
walk on these a posteriori input distributions.
Such a random walk has two nice properties: firstly, it is a martingale. Intuitively, this means that
if the a posteriori distribution on the inputs is µ at some point in time, then the expected a posteriori
distribution after one more step and conditioned on all the history of the protocol is µ. Secondly, one can
define a potential function on these distributions such that the information complexity of the protocol is
the difference between the potential at the input distribution and the expected potential at the a posteriori
distribution where the protocol terminates.
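The martingale property can be checked numerically for a single step. In the toy sketch below (the prior and the noisy channel are our own illustrative choices, not from the paper), Alice sends a noisy copy of her bit, and the expected a posteriori distribution over the two possible messages equals the prior.

```python
# Numerical check of the martingale property of a posteriori distributions
# for a single (hypothetical) noisy bit sent by Alice.

mu = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}   # prior on inputs

def pr_b_given_x(b, x):
    return 0.8 if b == x else 0.2         # Alice sends a noisy copy B of X

pr_b, posteriors = {}, {}
for b in (0, 1):
    pr_b[b] = sum(mu[xy] * pr_b_given_x(b, xy[0]) for xy in mu)
    posteriors[b] = {xy: mu[xy] * pr_b_given_x(b, xy[0]) / pr_b[b] for xy in mu}

# Martingale: the expected a posteriori distribution equals the prior.
for xy in mu:
    expected = sum(pr_b[b] * posteriors[b][xy] for b in (0, 1))
    assert abs(expected - mu[xy]) < 1e-12
```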
Braverman et al. describe the buzzer protocol as a continuous time protocol which ends abruptly when
one of the players buzzes. We give an alternative description of the buzzer protocol, as a random walk on
the space of distributions. Consider the case in which the input distribution µ is a product distribution
given by Pr[X = 1] = p and Pr[Y = 1] = q, where X,Y are the input bits of Alice and Bob, respectively;
we denote this distribution succinctly by (p, q). The buzzer protocol is the limit ε → 0 of a random walk
which starts at (p, q), and at each step moves either vertically or horizontally depending on the current
distribution (a, b): if a ≥ b it moves to (a, b + ε) or to (a, b − ε), with probability 1/2 each, and if a < b
it moves to (a + ε, b) or to (a − ε, b), with probability 1/2 each. In both cases we clip the protocol to
[0, 1]2 . The random walk terminates when a = 0 or b = 0, in which case it outputs 0, and when a = b = 1,
in which case it outputs 1.
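The discretized walk is easy to simulate; the sketch below is our own illustration, with a concrete step size 1/n and starting point. Tracking the position in integer units of the step size, one coordinate moves by ±1 with probability 1/2 each, so the product ab is a martingale; since it terminates at 0 or 1, the discrete walk outputs 1 with probability pq, which the simulation reproduces empirically. (In this integer version the walk never leaves [0, n], so the clipping is vacuous.)

```python
import random

def buzzer_walk(p, q, n=20, rng=random):
    """Discretized buzzer protocol: a random walk on product distributions
    (a, b) = (i/n, j/n), tracked in integer units of the step size 1/n."""
    i, j = round(p * n), round(q * n)
    while True:
        if i == 0 or j == 0:
            return 0                      # a = 0 or b = 0: output 0
        if i == n and j == n:
            return 1                      # a = b = 1: output 1
        step = 1 if rng.random() < 0.5 else -1
        if i >= j:                        # a >= b: move b
            j += step
        else:                             # a < b: move a
            i += step

random.seed(0)
p, q = 0.6, 0.7
freq = sum(buzzer_walk(p, q) for _ in range(5000)) / 5000
# Since (i/n)*(j/n) is a martingale, Pr[output 1] = p*q = 0.42.
print(freq)
```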
Our description of the buzzer protocol has two main advantages over the original one. First, the
a posteriori distribution varies continuously in our protocol. In contrast, in the original description the
a posteriori distribution “collapses” when one of the players presses the buzzer. Second, our protocol is
the same for all distributions, whereas the original buzzer protocol has an additional symmetrization step
to handle asymmetric initial distributions. Both of these properties simplify our analysis.
1.1.3 Product parametrization
Our most important technical innovation is a way of analyzing non-product distributions as if they were
product distributions. Since product distributions are often much easier to analyze, we believe this idea
could have many further applications, which we hope to explore in future work.
If the input of a protocol is distributed as a product distribution, all the a posteriori input distributions
are also product distributions, hence they can be characterized by the marginals corresponding to the
distributions on the inputs of Alice and Bob. Additionally, any bit sent by a player changes only the marginal of the a posteriori distribution corresponding to that player's input, while the marginal corresponding to the other player's input does not change. This is not the case for non-product input distributions: in particular, a bit sent by a player can also change the other marginal.
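The product case is easy to verify concretely. In the toy sketch below (our own example), the prior is a product distribution and Alice sends a noisy copy of her bit: the a posteriori distribution remains a product distribution, and Bob's marginal is untouched.

```python
# Toy check: under a product prior, a bit sent by Alice keeps the posterior
# a product distribution and leaves Bob's marginal untouched.
p, q = 0.6, 0.7                           # Pr[X = 1] = p, Pr[Y = 1] = q
mu = {(x, y): (p if x else 1 - p) * (q if y else 1 - q)
      for x in (0, 1) for y in (0, 1)}

def pr_b_given_x(b, x):                   # Alice sends a noisy copy of X
    return 0.8 if b == x else 0.2

for b in (0, 1):
    z = sum(mu[xy] * pr_b_given_x(b, xy[0]) for xy in mu)
    post = {xy: mu[xy] * pr_b_given_x(b, xy[0]) / z for xy in mu}
    px = post[(1, 0)] + post[(1, 1)]      # Alice's posterior marginal
    py = post[(0, 1)] + post[(1, 1)]      # Bob's posterior marginal
    for (x, y) in post:                   # posterior is still a product
        prod = (px if x else 1 - px) * (py if y else 1 - py)
        assert abs(post[(x, y)] - prod) < 1e-12
    assert abs(py - q) < 1e-12            # Bob's marginal did not move
```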
We show how one can view a protocol with a non-product input distribution as a random walk over
the set of product distributions (each product distribution in this random walk can be extracted from the
corresponding a posteriori input distributions). This random walk retains the nice properties described
above and in Section 1.1.2: it is a martingale where any bit sent by a player changes only the marginal
of the a posteriori distribution corresponding to their input. Furthermore, the information complexity
of the protocol can be explained in terms of a potential difference between the potential on the initial
distribution of the random walk and the expected potential when it terminates.
1.1.4 Protocol completion
We prove the lower bounds on ICµ ( f , ε) and on IC0 (AND, ε) using the technique of protocol completion.
Given an ε-error protocol for f , we complete it to a zero-error protocol for f in a natural way: when
the protocol terminates at a posterior distribution ν (which is the distribution of the inputs given the
transcript of the protocol and the initial distribution µ), we run a zero-error protocol for f which is
information-efficient for the distribution ν. Using the buzzer protocol, we give a protocol for f whose information cost is O(h(√α)), where 1 − α is the probability of the most probable output given ν. Since E[α] = ε, this shows that we can complete the given ε-error protocol to a zero-error protocol for f at a cost of O(h(√ε)) in the information cost, implying the bound ICµ(f, ε) + O(h(√ε)) ≥ ICµ(f, 0).
For the case f = AND, we are able to improve on this result, tightening the gap from O(h(√ε)) to O(h(ε)), using the stability result for the buzzer protocol. The product parametrization allows us to consider the posterior distribution ν as a product distribution (a, b). If max(a, b) = Ω(1) then the buzzer protocol has information cost O(h(α)) rather than just O(h(√α)) (recall that 1 − α is the probability of the most probable output given ν). Suppose now that we are given an ε-error protocol π for AND. Our goal is to prove that ICµ(π) ≥ ICµ(AND, 0) − Cµ h(ε) for some Cµ > 0 (here ICµ(π) is the information cost of π). We can complete π to a zero-error protocol π′ at a cost of O(h(√ε)). We can assume that ICµ(π′) ≤ ICµ(AND, 0) − Cµ h(ε) + O(h(√ε)), and so π′ is an almost-optimal protocol for AND. Our stability result shows that a random leaf (a, b) of π′ satisfies max(a, b) ≥ cµ with high probability, for some cµ > 0. It follows that the same holds for π, and so the cost of completion is only O(h(ε)).
1.1.5 Black-box modification
We prove the upper bounds on ICµ ( f , ε) and (as a special case) on IC0 (AND, ε) using a simple black-box
argument, which modifies an optimal zero-error protocol to a slightly more information-efficient ε-error
protocol. Given a zero-error protocol π for f , one way to create an ε-error protocol for f is to run π with
probability 1 − ε, and output some constant value with probability ε. However, this only saves O(ε) bits
of information. Our modification is different: we identify a player P and two inputs z0, z1, and run the following protocol π′:
• With probability ε (sampled privately by P), if the input of P is z1 then P changes its input to z0 .
• The players run π on their possibly modified inputs.
This is also an ε-error protocol, and for a suitable choice of the parameters, it turns out that it saves
Ω(h(ε)) bits of information compared to π.
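The modification can be sketched in code. The interface below is hypothetical and omits the information-theoretic choice of the player P and the inputs z0, z1; it only demonstrates the error behavior. For AND, letting a player replace 1 by 0 yields the one-sided error exploited later.

```python
import random

def modified_protocol(pi, player, z1, z0, eps, x, y, rng=random):
    """Black-box eps-error modification of a zero-error protocol pi
    (hypothetical interface: pi is any function of the two inputs)."""
    if player == "Alice" and x == z1 and rng.random() < eps:
        x = z0                            # P privately rewrites z1 -> z0
    if player == "Bob" and y == z1 and rng.random() < eps:
        y = z0
    return pi(x, y)

# For AND, letting Bob replace 1 by 0 gives one-sided error: the output
# can only change from 1 to 0, and only on the input (1, 1).
AND = lambda x, y: x & y
random.seed(1)
errs = sum(modified_protocol(AND, "Bob", 1, 0, 0.1, 1, 1) != 1
           for _ in range(10000)) / 10000
print(errs)  # empirical error on (1, 1), close to eps = 0.1
```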
When the input distribution µ has full support, it is easy to choose the parameters, by finding two inputs (x0, y0), (x1, y1) which differ on a single coordinate such that f(x0, y0) ≠ f(x1, y1). Such a choice might not exist when µ doesn't have full support, and instead we rely on a rather delicate binary search argument on the set of transcripts.
We can apply this argument to the AND function, showing that ICµ (AND, ε) ≤ CDISJ − Ω(h(ε)).
However, when using this result to obtain a protocol for set disjointness, we encounter a difficulty: in order
to obtain an ε-error protocol for DISJn , it seems at first that we need a protocol for AND having error
ε/n. This would result in a saving of O(h(ε/n)) rather than O(h(ε)) per coordinate. A similar difficulty
was encountered by Molinaro et al. [26] in a similar context, and they overcame it using protocols that
abort. In our case there is a simpler solution: we consider ε-error protocols for AND which only make
one-sided error, outputting 0 when the correct answer is 1 (the black-box argument can be modified to
produce such protocols). If we apply such a protocol coordinatewise to compute the intersection of X, Y, then we always compute the intersection correctly when X, Y are disjoint, and we mistakenly compute the intersection to be empty when X, Y are not disjoint with probability at most ε^{|X∩Y|} ≤ ε. The resulting protocol thus computes set disjointness correctly with probability at least 1 − ε on every input.
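A sketch of this coordinatewise reduction (our own illustrative code, with the information-efficient AND subprotocol abstracted as a biased coin flip) confirms the error behavior: disjoint inputs are never misclassified, and non-disjoint inputs are misclassified only with probability ε^{|X∩Y|}.

```python
import random

def one_sided_and(x, y, eps, rng):
    # One-sided eps-error AND: on (1, 1) it may wrongly answer 0 (w.p. eps);
    # on every other input it answers 0, which is always correct.
    return int(x == 1 and y == 1 and rng.random() >= eps)

def disjoint(X, Y, eps, rng=random):
    """Declare the sets disjoint iff no coordinatewise AND fires."""
    return all(one_sided_and(x, y, eps, rng) == 0 for x, y in zip(X, Y))

random.seed(2)
assert disjoint([1, 0, 1, 1, 0], [0, 1, 0, 0, 1], 0.1)  # truly disjoint: no error
# Intersection of size 2: error only if both intersecting coordinates err.
X2, Y2 = [1, 1, 0], [1, 1, 0]
err = sum(disjoint(X2, Y2, 0.1) for _ in range(50000)) / 50000
print(err)  # empirically close to 0.1**2 = 0.01
```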
1.1.6 Computing set disjointness with error
The lower bound IC0 (AND, ε) ≥ CDISJ − O(h(ε)) implies a similar lower bound on the information
complexity of set disjointness: IC(DISJn , ε)/n ≥ CDISJ − O(h(ε)). In contrast, we can save more than
h(ε) in the distributional prior-free setting:
    lim sup_{n→∞} ICD(DISJn, ε)/n ≤ CDISJ − Ω(√h(ε)).    (1.8)
A minimax argument of Braverman [4] shows that this bound is tight. We prove this upper bound on
ICD (DISJn , ε) using a new protocol for set disjointness. Given a distribution µ, we describe a protocol π
which has error ε with respect to µ, whose information cost satisfies
    ICµ(π) ≤ n[CDISJ − Ω(√h(ε))] + O(log n).    (1.9)
Let p be the probability that the input sets X, Y are not disjoint when (X, Y) ∼ µ. The protocol proceeds as
follows:
• Using public randomness, Alice and Bob sample a permutation σ of {1, . . . , n}.
• For i = 1, . . . , n, Alice and Bob run a protocol for AND on X_{σ(i)}, Y_{σ(i)} which has one-sided error ε/2p with respect to the conditional distribution of X_{σ(i)}, Y_{σ(i)}, declaring X, Y to be not disjoint (and halting the protocol) if the AND protocol answers X_{σ(i)} = Y_{σ(i)} = 1.
• Declare X,Y to be disjoint.
The protocol only makes an error when the inputs are not disjoint, and in that case it makes an error with probability (ε/2p)^{|X∩Y|} ≤ ε/2p. Since the inputs are non-disjoint with probability p, the overall error probability is ε/2 < ε. A tricky but standard argument shows that this protocol saves roughly Ω(n√h(ε)) bits of information.
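The structure of the protocol can be summarized in the following skeleton (our own code; the information-efficient AND subprotocol is again abstracted as a one-sided coin flip, and p is assumed to be known to both players):

```python
import random

def one_sided_and(x, y, err, rng):
    # Stand-in for a one-sided err-error AND protocol: on input (1, 1) it may
    # wrongly answer 0 (w.p. err); on every other input it answers 0, correctly.
    return int(x == 1 and y == 1 and rng.random() >= err)

def disj_protocol(X, Y, eps, p, rng=random):
    """Skeleton of the distributional protocol; p is the prior probability
    (under mu) that the input sets intersect, assumed known to the players."""
    sigma = list(range(len(X)))
    rng.shuffle(sigma)                    # public randomness: a random order
    for i in sigma:
        if one_sided_and(X[i], Y[i], eps / (2 * p), rng):
            return False                  # halt: sets declared not disjoint
    return True                           # sets declared disjoint

random.seed(3)
# Disjoint inputs are classified correctly on every run.
assert disj_protocol([1, 0, 1], [0, 1, 0], eps=0.1, p=0.5)
```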
2 Preliminaries
In this section we introduce some basic notation and facts, and review the necessary background for the
paper.
2.1 Notation and basic estimates
We typically denote random variables by capital letters (e.g., A, B, C, Π). For the sake of brevity, we shall
write A1 . . . An to denote the random variable (A1 , . . . , An ) and not the product of the Ai ’s. We use [n] to
denote the set {1, . . . , n}, and supp µ to denote the support of a measure µ.
For a finite set Ω, we denote by ∆(Ω) the set of all discrete probability distributions on Ω. For µ, ν ∈ ∆(Ω), we denote their total variation distance by

    |µ − ν| := (1/2) ∑_{x∈Ω} |µ(x) − ν(x)|.    (2.1)
For every ε ∈ [0, 1], h(ε) = −ε log ε − (1 − ε) log(1 − ε) denotes the binary entropy, where here and
throughout the paper log(·) is in base 2, and 0 log 0 = 0.
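Both quantities are straightforward to compute; the helper names in the sketch below are ours.

```python
from math import log2

def h(eps):
    """Binary entropy, with the convention 0*log(0) = 0."""
    if eps in (0, 1):
        return 0.0
    return -eps * log2(eps) - (1 - eps) * log2(1 - eps)

def tv(mu, nu):
    """Total variation distance between distributions given as dicts."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in keys)

print(h(0.5))                  # 1.0: a fair coin carries one bit of entropy
print(tv({0: 1.0}, {1: 1.0}))  # 1.0: disjoint supports, maximal distance
```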
2.2 Communication complexity
The notion of two-party communication complexity was introduced by Yao [31] in 1979. In this model
there are two players (with unlimited computational power), often called Alice and Bob, who wish to
collaboratively perform a task such as computing a given function f : X × Y → Z. Alice receives an
input x ∈ X and Bob receives y ∈ Y. Neither of them knows the other player’s input, and they wish to
communicate in accordance with an agreed-upon protocol π to compute f (x, y). The protocol π specifies
as a function of (only) the transmitted bits whether the communication is over, and if not, who sends the
next bit. Furthermore π specifies what the next bit must be as a function of the transmitted bits, and the
input of the player who sends the bit. We will assume that when the protocol terminates Alice and Bob
agree on a value as the output of the protocol. We denote this value by π(x, y). The communication cost
of π is the total number of bits transmitted on the worst case input. The transcript of an execution of π is
a string Π consisting of a list of all the transmitted bits during the execution of the protocol. As protocols
are defined using protocol trees, transcripts are in one-to-one correspondence with the leaves of this tree.
In the randomized communication model, the players might have access to a shared random string
(public randomness), and their own private random strings (private randomness). These random strings
are independent, but they can have any desired distributions individually. In the randomized model the
transcript also includes the public random string in addition to the transmitted bits. Similar to the case of
deterministic protocols, the communication cost is the total number of bits transmitted on the worst case
input and random strings. The average communication cost of the protocol is the expected number of bits
transmitted on the worst case input.
For a function f : X × Y → Z and a parameter ε > 0, we denote by Rε ( f ) the communication cost of
the best randomized protocol that computes the value of f (x, y) correctly with probability at least 1 − ε
for every (x, y).
2.3 Information complexity
The setting is the same as in communication complexity, where Alice and Bob (having infinite computational power) wish to mutually compute a function f : X × Y → Z. To be able to measure information, we also need to assume that there is a prior distribution µ on X × Y.
For the purpose of communication complexity, once we allow public randomness, it makes no
difference whether we permit the players to have private random strings or not. This is because the
private random strings can be simulated by parts of the public random string. However, for information
complexity, it is crucial to permit private randomness, and once we allow private randomness, public
randomness becomes inessential. Indeed, one of the players can use her private randomness to generate
the public random string, and then transmit it to the other player. Although this might have very large
communication cost, it has no information cost, as it does not reveal any information about the players’
inputs.
Probably the most natural way to define the information cost of a protocol is to consider the amount
of information that is revealed about the inputs X and Y to an external observer who sees the transmitted
bits and the public randomness. This is called the external information cost and is formally defined as the
mutual information between XY and the transcript of the protocol (recall that the transcript also contains
the public random string). While this notion is interesting and useful, it turns out there is a different way
of defining the information cost that enjoys certain desirable properties that the external information cost
lacks. This is called the internal information cost, or just the information cost for short, and is equal to the
amount of information that Alice and Bob learn about each other’s inputs from the communication. Note
that Bob knows Y , the public randomness R, and his own private randomness RB , and thus what he learns
about X from the communication can be measured by the conditional mutual information I(X; Π | Y RRB ).
Similarly, what Alice learns about Y from the communication can be measured by I(Y ; Π | XRRA ) where
RA is Alice's private random string. It is not difficult to see [3] that conditioning on the public and
private randomness does not affect these quantities. In other words, I(X; Π | Y RRB ) = I(X; Π | Y ) and
I(Y ; Π | XRRA ) = I(Y ; Π | X). We summarize these in the following definition.
Definition 2.1. The internal information cost and the external information cost of a protocol π with
respect to a distribution µ on inputs from X × Y are defined as
ICµ (π) = I(Π; X | Y ) + I(Π; Y | X), (2.2)

and

ICext_µ (π) = I(Π; XY ), (2.3)
respectively, where Π = ΠXY is the transcript of the protocol when it is executed on XY ∼ µ.
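To make Definition 2.1 concrete, here is a minimal sketch of ours (not a computation from the paper) for the trivial one-round protocol in which Alice sends her input bit, so that the transcript is Π = X. Internally Bob learns H(X | Y) bits and Alice learns nothing, while an external observer learns H(X) bits; the uniform prior below is an arbitrary illustrative choice.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Prior on (X, Y): rows indexed by x, columns by y.
mu = np.array([[0.25, 0.25],
               [0.25, 0.25]])

# Protocol: Alice sends her input bit, so the transcript is Pi = X.
# Internal cost: I(Pi; X | Y) + I(Pi; Y | X) = H(X | Y) + 0.
H_X_given_Y = sum(mu[:, y].sum() * H(mu[:, y] / mu[:, y].sum())
                  for y in range(2))
ic_internal = H_X_given_Y          # Bob learns X; Alice learns nothing new.

# External cost: I(Pi; XY) = H(X), since the observer sees Pi = X.
ic_external = H(mu.sum(axis=1))

print(ic_internal, ic_external)    # both equal 1 bit under the uniform prior
```

For this protocol the two costs coincide only because the prior is uniform and product; in general the internal cost is at most the external one.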
We will be interested in certain communication tasks. Let [ f , ε] denote the task of computing the
value of f (x, y) correctly with probability at least 1 − ε for every (x, y). Thus a protocol π performs this
task if
Pr[π(x, y) ≠ f (x, y)] ≤ ε, ∀ (x, y) ∈ X × Y. (2.4)
Given another distribution ν on X × Y, let [ f , ν, ε] denote the task of computing the value of f (x, y)
correctly with probability at least 1 − ε if the input (x, y) is sampled from the distribution ν. A protocol π
performs this task if
Pr_{(x,y)∼ν} [π(x, y) ≠ f (x, y)] ≤ ε. (2.5)
Note that a protocol π performs [ f , 0] if it computes f correctly on every input, while performing [ f , ν, 0]
means computing f correctly on the inputs that belong to the support of ν.
We will also need a one-sided version of the task [ f , ε]. Let [ f , ε, z1 → z0 ] denote the task of computing
the value of f (x, y) correctly with probability at least 1 − ε for every (x, y), allowing the protocol to err
only if it outputs z0 instead of z1 . Thus a protocol π performs this task if it performs the task [ f , ε], and
additionally
π(x, y) ≠ f (x, y) =⇒ f (x, y) = z1 and π(x, y) = z0 . (2.6)
The information complexity of a communication task T with respect to a measure µ is defined as
ICµ (T ) = inf_{π : π performs T} ICµ (π). (2.7)
It is essential here that we use infimum rather than minimum as there are tasks for which there is no
protocol that achieves ICµ (T ) while there is a sequence of protocols whose information cost converges to
ICµ (T ). The external information complexity of a communication task T is defined similarly. We will
abbreviate ICµ ( f , ε) = ICµ ([ f , ε]), ICµ ( f , ν, ε) = ICµ ([ f , ν, ε]), etc. It is important to note that when µ
does not have full support, ICµ ( f , µ, 0) can be strictly smaller than ICµ ( f , 0).
Remark 2.2 (A warning regarding notation). In the literature of information complexity it is common to
use “ICµ ( f , ε)” to denote the distributional error case, i. e., what we denote by ICµ ( f , µ, ε). Unfortunately
this has become a source of confusion in the past, as sometimes “ICµ ( f , ε)” is used to denote
both the distributional-error and the point-wise-error cases. To avoid ambiguity we distinguish the two
cases by using the different notations ICµ ( f , µ, ε) and ICµ ( f , ε).
Similar to the fact that the maximal distributional communication complexity over all measures equals
the public coin randomized communication complexity (see, e. g., [23, Section 3.4]), below we prove a
lemma that establishes a similar relation between ICµ ( f , ν, ε) and ICµ ( f , ε).
Lemma 2.3. ICµ ( f , ε) = maxν ICµ ( f , ν, ε) holds for all ε ≥ 0.
Note that the maximum exists due to continuity of ICµ ( f , ν, ε) with respect to ν, a fact that is
discussed later in Section 2.4. (For ε = 0 one can take any full-support ν.)
Proof. We only need to show ICµ ( f , ε) ≤ maxν ICµ ( f , ν, ε) as the other direction is obvious. The proof
is an application of von Neumann’s minimax theorem.
Pick a small δ > 0, and let
Cδ = {π : ICµ (π) ≤ ICµ ( f , ε) − δ }.
Although Cδ is an infinite set, we can approximate it by a finite set by considering only the protocols with
bounded communication cost that use only a bounded number of unbiased random bits. This process
does not affect the validity of the proof, and hence the minimax theorem is still applicable.
Consider a two-player zero-sum game in which Alice chooses a protocol π ∈ Cδ and Bob chooses an
input (x, y) ∈ X × Y, and define the utility for Alice to be Pr[π(x, y) = f (x, y)]. Note that a mixed strategy
for Alice is still just a protocol, and a mixed strategy for Bob corresponds to a probability measure on
X × Y. By our definition of Cδ and the minimax theorem, we have
min_ν max_π E_{(x,y)∼ν} Pr[π(x, y) = f (x, y)] = max_π min_ν E_{(x,y)∼ν} Pr[π(x, y) = f (x, y)] = 1 − ε − t(δ ) < 1 − ε, (2.8)
where t(δ ) > 0 is a positive quantity. This means that there exists a measure ν*_δ such that for all π ∈ Cδ ,

E_{(x,y)∼ν*_δ} Pr[π(x, y) ≠ f (x, y)] > ε.
Letting δ → 0 gives maxν ICµ ( f , ν, ε) ≥ ICµ ( f , ε) as desired.
Finally let us recall the two definitions of the prior-free notions of information complexity introduced
in [4]. The max-distributional information complexity of a function f : X × Y → Z is defined as
ICD ( f , ε) = max_µ ICµ ( f , µ, ε). (2.9)
The information complexity of f with error ε is defined as
IC( f , ε) = inf_π max_µ ICµ (π), (2.10)
where the infimum is over all protocols π that perform the task [ f , ε]. It is possible [4] to use a minimax
argument and the concavity of ICµ (π) with respect to µ to show that
IC( f , ε) = inf_π max_µ ICµ (π) = max_µ inf_π ICµ (π) = max_µ ICµ ( f , ε) = max_{µ,ν} ICµ ( f , ν, ε), (2.11)
where the last equality follows from Lemma 2.3.
2.4 The continuity of information complexity
It is shown in [6, Lemma 4.4] that for every communication task T , ICµ (T ) is uniformly continuous with
respect to µ. More precisely, for every two measures µ1 and µ2 with |µ1 − µ2 | ≤ δ (in total variation
distance), we have

| ICµ1 (T ) − ICµ2 (T )| ≤ 2 log(|X × Y|)δ + 2h(2δ ). (2.12)
The information complexity functions ICµ ( f , ε) and ICµ ( f , ν, ε) are both continuous with respect
to ε. The following simple lemma from [4] proves continuity for ε ∈ (0, 1]. The continuity at 0 is more
complicated and is proven in [5]. (See also Theorem 3.5 and Theorem 3.6 below.)
T HEORY OF C OMPUTING, Volume 14 (6), 2018, pp. 1–73 13
Y UVAL DAGAN , Y UVAL F ILMUS , H AMED H ATAMI AND YAQIAO L I
Lemma 2.4. [4] For every f : X × Y → Z, ε2 > ε1 > 0 and measures µ, ν on X × Y, we have
ICµ ( f , ν, ε1 ) − ICµ ( f , ν, ε2 ) ≤ (1 − ε1 /ε2 ) log |X × Y|, (2.13)
and
ICµ ( f , ε1 ) − ICµ ( f , ε2 ) ≤ (1 − ε1 /ε2 ) log |X × Y|. (2.14)
Proof. Consider a protocol π with information cost I, and error ε2 > 0. Here we can consider the
distributional error as in (2.13) or the point-wise error as in (2.14). Set δ = 1 − ε1 /ε2 , and let τ be the
protocol that with probability 1 − δ runs π, and with probability δ Alice and Bob exchange their inputs
and compute f (x, y) correctly. The lemma follows, as the new protocol has error at most (1 − δ )ε2 = ε1 ,
and information cost at most I + δ log |X × Y|.
Note that ICµ ( f , µ, 0) is not always continuous with respect to µ. For example, let the matrices

µε = [ (1−ε)/3, (1−ε)/3 ; (1−ε)/3, ε ],   µ = lim_{ε→0} µε = [ 1/3, 1/3 ; 1/3, 0 ] (2.15)
represent distributions on {0, 1}2 . Here the entry at the i-th row and j-th column corresponds to the
measure of the point (i − 1, j − 1) ∈ {0, 1}2 . Now for the 2-bit AND function, we have ICµ (AND, µ, 0) =
0, while ICµε (AND, µε , 0) = ICµε (AND, 0) as µε has full support. Thus
lim_{ε→0} ICµε (AND, µε , 0) = lim_{ε→0} ICµε (AND, 0) = ICµ (AND, 0), (2.16)
which is known to be bounded away from 0.
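The discontinuity can be seen concretely. The following sketch of ours (not a computation from the paper) evaluates the distributional error, under µε and under the limit measure µ, of the zero-information protocol that always outputs 0: under µ it performs [AND, µ, 0] for free, while under the full-support µε it errs with probability exactly ε.

```python
import numpy as np

def dist_error_const0(mu):
    """Distributional error of the zero-cost protocol that always outputs 0:
    it errs exactly on the mass that mu places at (1, 1)."""
    return float(mu[1, 1])

eps = 0.01
mu_eps = np.array([[(1 - eps) / 3, (1 - eps) / 3],
                   [(1 - eps) / 3, eps]])
mu_lim = np.array([[1/3, 1/3],
                   [1/3, 0.0]])

print(dist_error_const0(mu_lim))   # 0.0: zero error on the support of mu
print(dist_error_const0(mu_eps))   # eps: the free protocol fails [AND, mu_eps, 0]
```

So IC_{µε}(AND, µε, 0) must be attained by genuinely communicating protocols, which is why it stays bounded away from 0 as ε → 0.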
Finally, note that Lemma 2.4 also implies the continuity of ICµ ( f , ν, ε) with respect to ν when ε > 0.
Indeed, if |ν1 − ν2 | ≤ δ ≤ ε, then a protocol that has distributional error ε with respect to ν2 will have
error at most ε + δ and at least ε − δ with respect to ν1 . Thus
ICµ ( f , ν1 , ε + δ ) ≤ ICµ ( f , ν2 , ε) ≤ ICµ ( f , ν1 , ε − δ ), (2.17)
which establishes the desired continuity. A similar example to (2.15) shows that ICµ ( f , ν, 0) is not
necessarily continuous with respect to ν.
2.5 Communication protocols as random walks on ∆(X × Y)
Recall that ∆(X × Y) denotes the set of probability distributions on X × Y. Consider a protocol π and
a prior distribution µ on the set of inputs X × Y. Suppose that in the first round Alice sends a random
signal B to Bob. We can interpret this as a random update of the prior distribution µ to a new distribution
µ0 = µ|B=0 or µ1 = µ|B=1 depending on the value of B. It is not difficult to see that µb (x, y) = pb (x)µ(x, y)
for b = 0, 1, where pb (x) = Pr[B = b | x]/Pr[B = b]. In other words, µb is obtained by multiplying the
rows of µ by non-negative numbers. From the law of total expectation,
µ = E_B [µ | B] = Pr[B = 0]µ0 + Pr[B = 1]µ1 . (2.18)
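The row-scaling update and the no-drift identity (2.18) are easy to verify numerically. In this sketch of ours, the prior and the conditional signal probabilities Pr[B = 0 | x] are arbitrary illustrative choices.

```python
import numpy as np

# Prior on (X, Y): rows indexed by Alice's input x.
mu = np.array([[0.4, 0.1],
               [0.2, 0.3]])

# Alice's signal B: Pr[B = 0 | x] depends only on her input x.
pr_b0_given_x = np.array([0.9, 0.3])

pr_b0 = float(pr_b0_given_x @ mu.sum(axis=1))            # Pr[B = 0]
mu0 = (pr_b0_given_x / pr_b0)[:, None] * mu              # rows of mu scaled
mu1 = ((1 - pr_b0_given_x) / (1 - pr_b0))[:, None] * mu

# Each posterior is a probability distribution, and (2.18) holds: no drift.
assert np.allclose(mu0.sum(), 1) and np.allclose(mu1.sum(), 1)
assert np.allclose(pr_b0 * mu0 + (1 - pr_b0) * mu1, mu)
```

A signal from Bob would scale the columns of µ instead of the rows, with pb depending on y.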
Similarly if Bob is sending a message, then µb is obtained by multiplying the columns of µ by the
numbers pb (y) = Pr[B = b | y]/Pr[B = b]. That is µb (x, y) = µ(x, y)pb (y).
The opposite direction is also true: given a distribution µ, distributions µ0 , µ1 , and 0 ≤ p0 , p1 ≤ 1
such that
• p0 + p1 = 1,
• µ0 and µ1 are obtained from µ by scaling its rows,
• µ = p0 µ0 + p1 µ1 ,
one can define a random bit B that can be sent by Alice such that µb is µ conditioned on B = b for
b ∈ {0, 1}, and pb = Pr[B = b]. A similar statement holds for the case where µ0 and µ1 are obtained from
µ by scaling its columns and B is a signal that will be sent by Bob.
Therefore, we can think of a protocol as a random walk on ∆(X × Y) that starts at µ, and every time
that a player sends a message, it moves to a new distribution. Equation (2.18) implies that this random
walk is without drift.
Let Π denote the transcript of the protocol. Note that when the protocol terminates, the random walk
stops at µΠ := µ|Π . Since Π itself is a random variable, µΠ is a random variable that takes values in
∆(X × Y). Interestingly, both the internal and external information costs of the protocol depend only on
the distribution of µΠ (this is a distribution on the set ∆(X × Y), which itself is a set of distributions) [10].
It does not matter how different the steps of two protocols are: as long as they both yield the same
distribution on ∆(X × Y), they have the same internal and external information costs. Consequently, one
can directly work with this random walk, instead of working with the actual protocols.
In order to study the relation between the information complexity and the distribution of µΠ , define the
concealed information and external concealed information of a protocol π with respect to µ, respectively,
as
CIµ (π) = H(X | ΠY ) + H(Y | ΠX) = H(X | Y ) + H(Y | X) − ICµ (π), (2.19)

and

CIext_µ (π) = H(XY | Π) = H(XY ) − ICext_µ (π). (2.20)
With this definition it is easy to see that the information cost of a protocol π with transcript Π only
depends on the distribution of µΠ . Indeed
CIµ (π) = H_{XY∼µ}(X | ΠY ) + H_{XY∼µ}(Y | ΠX) = E_Π H_{XY∼µΠ}(X | Y ) + E_Π H_{XY∼µΠ}(Y | X). (2.21)
Another nice property of concealed information is that if π0 and π1 are the two branches of the protocol π
corresponding respectively to B = 0 and B = 1 where B is the first bit sent, then
CIµ (π) = Pr[B = 0] CIµ|B=0 (π0 ) + Pr[B = 1] CIµ|B=1 (π1 ). (2.22)
Thus, the expected value of CI is preserved throughout the execution of the protocol. Similar results hold
for CIext_µ (π).
3 Main results
In this section, we state and discuss our main results in full detail. Simpler proofs are presented in this
section, but the proofs of the more involved results are postponed to later sections.
We will use the following simple estimate:
x ∈ [0, 1/2] =⇒ x log(1/x) ≤ h(x) ≤ 2x log(1/x), (3.1)
which holds since in that range −x log x ≥ −(1 − x) log(1 − x).
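The estimate (3.1) is easy to verify on a grid; a minimal check of ours, using base-2 logarithms as throughout the paper:

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# Check x*log(1/x) <= h(x) <= 2x*log(1/x) over a fine grid of (0, 1/2].
for i in range(1, 5001):
    x = i / 10000
    lo = x * math.log2(1 / x)
    assert lo <= h(x) <= 2 * lo + 1e-12
```

The lower bound is just the first term of h, and the upper bound amounts to the stated inequality −x log x ≥ −(1 − x) log(1 − x) on [0, 1/2].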
Denote

h̄(x) = h(min(x, 1/2)). (3.2)

It satisfies h̄(x) ≥ h(x) and x ≤ h̄(x). It is easy to see that h is concave; therefore h̄ is also concave, as it
is piecewise differentiable with a non-increasing derivative. Additionally, h̄(0) = h(0) = 0. We will next
show how to utilize these two properties of h and h̄: for any concave function g : R+ → R for which
g(0) = 0, and for any x > 0 and 0 < q < 1, it holds that

g(qx) ≥ qg(x) + (1 − q)g(0) = qg(x). (3.3)
This implies the subadditivity of g: for all a1 , a2 > 0, g(a1 + a2 ) ≤ g(a1 ) + g(a2 ), since

g(ai ) ≥ g(a1 + a2 ) · ai /(a1 + a2 )

for i = 1, 2.
3.1 Information complexity with point-wise error
Consider a communication problem f : X × Y → Z, and a distribution µ. How close can ICµ ( f , ε) be to
ICµ ( f , 0)? A simple argument shows that ICµ ( f , ε) ≤ ICµ ( f , 0) − Ω(ε).
Proposition 3.1. Let f : X × Y → Z, and let µ be a measure on X × Y. Denoting c = ICµ ( f , 0), we have
ICµ ( f , ε) ≤ (1 − ε) ICµ ( f , 0) = ICµ ( f , 0) − cε. (3.4)
Proof. Let π be a zero-error protocol for f . Consider a protocol π′ in which Alice and Bob use their
public randomness to run the protocol π with probability 1 − ε, or to terminate with an arbitrary output
with probability ε. Let Π and Π′ be, respectively, the transcripts of π and π′ on the random input (X,Y ).
We have
I(X; Π′ | Y ) = H(X | Y ) − H(X | Π′Y ) = H(X | Y ) − εH(X | Y ) − (1 − ε)H(X | ΠY ) = (1 − ε)I(X; Π | Y ). (3.5)

The same holds for I(Y ; Π′ | X), and the statement follows.
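The computation in (3.5) can be replicated numerically for a toy protocol. In this sketch of ours, π is the one-round protocol in which Alice sends her bit, and the public coin deciding whether to abort is modeled as part of the transcript of π′; the prior and ε are arbitrary illustrative choices.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I_X_Pi_given_Y(joint):
    """I(X; T | Y) from a joint table joint[x, y, t]."""
    total = 0.0
    for y in range(joint.shape[1]):
        p_y = joint[:, y, :].sum()
        cond = joint[:, y, :] / p_y          # joint law of (X, T) given Y = y
        total += p_y * (H(cond.sum(axis=1)) + H(cond.sum(axis=0)) - H(cond))
    return total

mu = np.array([[0.4, 0.1],
               [0.2, 0.3]])
eps = 0.2

# pi: Alice always sends X (transcripts 0/1); pi': a public coin aborts with
# probability eps (transcript 2), otherwise pi runs.
joint_pi = np.zeros((2, 2, 3))
joint_pip = np.zeros((2, 2, 3))
for x in range(2):
    for y in range(2):
        joint_pi[x, y, x] = mu[x, y]
        joint_pip[x, y, x] = (1 - eps) * mu[x, y]
        joint_pip[x, y, 2] = eps * mu[x, y]

assert np.isclose(I_X_Pi_given_Y(joint_pip),
                  (1 - eps) * I_X_Pi_given_Y(joint_pi))
```

The aborted branch reveals nothing about X given Y, which is exactly where the factor (1 − ε) comes from.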
Our first main theorem shows that this trivial bound can be improved to
ICµ ( f , ε) ≤ ICµ ( f , 0) − Ω(h(ε)) .
Theorem 3.2. Consider a function f : X × Y → Z and a probability measure µ on X × Y such that
ICµ ( f , 0) > 0. There exist positive constants τ, ε0 , depending on f and µ (and thus on |X|, |Y|, |Z|), such
that for every ε ≤ ε0 ,
ICµ ( f , ε) ≤ ICµ ( f , 0) − τh(ε). (3.6)
Moreover:
Non-constant case: Suppose that f (a) ≠ f (b) for two points a, b in the support of µ that lie in the same
row or column. Then one can take τ = µ(a)^2 µ(b)/32, and ε0 depends only on min(µ(a), µ(b))
and |X × Y|.
AND case: Let x0 , x1 ∈ X and y0 , y1 ∈ Y. Suppose that f (x0 y0 ) = f (x0 y1 ) = f (x1 y0 ) = z0 and f (x1 y1 ) =
z1 ≠ z0 , and that x0 y0 , x0 y1 , x1 y0 ∈ supp µ. Then one can take

τ = (µ(x0 y0 )^2 / 64) · min(µ(x0 y1 ), µ(x1 y0 )),

and ε0 depends only on |X × Y| and the minimum of µ(x0 y0 ), µ(x0 y1 ), µ(x1 y0 ).
Proof. See Section 4.1.1.
Remark 3.3. We prove Theorem 3.2 by taking a zero-error protocol for f , and turning it into an ε-error
protocol that has an Ω(h(ε)) gain in the information cost over the original protocol. The high-level idea
is that one of the players checks her input, and if it is equal to a certain value x1 , then with probability
ε she changes it to a different value x0 . This obviously creates an error of at most ε. In the Non-constant case
of Theorem 3.2, the points a and b are used to determine x0 and x1 , and in the AND case, the same x0 and
x1 as they are described in the statement of the theorem can be used. Note that this modification can only
create errors that erroneously output f (x0 , y) instead of f (x1 , y) for some values of y. This allows us to
obtain a one-sided error for many functions. We shall use this later in Corollary 3.9 to obtain an upper
bound on the information complexity of the AND function when only one-sided error is allowed.
Despite the simplicity of the idea described in the above remark, the proof is rather involved, and
uses some of our other results, such as the characterization of internal-trivial measures. The heart of the proof
is of course showing the existence of appropriate values of x0 and x1 that can lead to the desired gain of
Ω(h(ε)).
Let XOR denote the 2-bit XOR function. The next result shows that the analogue of Theorem 3.2
does not hold for the external information complexity.
Proposition 3.4. Let µ be the distribution defined as

µ = [ 1/2, 0 ; 0, 1/2 ]. (3.7)

Then ICext_µ (XOR, ε) ≥ ICext_µ (XOR, 0) − 3ε.
Proof. See Section 4.1.3.
For the lower bound we prove the following result.
Theorem 3.5. For all f , µ, ε, we have

ICµ ( f , ε) ≥ ICµ ( f , 0) − 4|X||Y| h(√ε). (3.8)
Proof. See Section 4.1.2.
Theorem 3.5 is obtained by taking an ε-error protocol and completing it to a zero-error protocol.
Here Alice and Bob first run the protocol that performs [ f , ε], but when this protocol terminates, instead
of returning the output, they continue their interaction to verify that the value that they have obtained is
correct. We will be able to show that these additional interactions can be performed at a small information
cost, and thus the total information cost of the new protocol is not going to be much larger than that of
the original protocol. This method, which we call protocol completion, is also used in the proofs of other
results, such as Theorem 3.7.
Finally, let us remark that we do not know whether the bound in Theorem 3.5 is tight. In fact, we are
not aware of any examples of f and µ that refute the possibility that there exists C( f , µ) > 0 such that
ICµ ( f , ε) ≥ ICµ ( f , 0) − C( f , µ)h(ε).
3.2 Information complexity with distributional error
In Section 3.1 we considered the amount of gain one can obtain by allowing point-wise error. Next we
turn to distributional error. How much can one gain in information cost by allowing a distributional error
of ε? Small modifications in the proofs of Theorem 3.2 and Theorem 3.5 imply the following bounds.
Theorem 3.6. Let µ be a probability measure on X × Y, and let f : X × Y → Z satisfy ICµ ( f , µ, 0) > 0.
We have

ICµ ( f , µ, 0) − 4|X||Y| h(√(ε/α)) ≤ ICµ ( f , µ, ε) ≤ ICµ ( f , µ, 0) − (α^2 /4) h(εα/4) + 3ε log |X × Y|, (3.9)

where α = min_{xy∈supp µ} µ(x, y).
Proof. See Section 4.2.
It is also possible to prove the upper bound of Theorem 3.6 using a different approach by “truncating”
a zero-error protocol. Unfortunately this approach requires some assumptions on the support of µ.
Nevertheless we sketch this proof, as the idea seems to be new, and it might have other applications.
Let ∆0 ⊆ ∆(X × Y) be the set of all measures ν such that ICν ( f , ν, ε) = 0. Consider a protocol π that
performs [ f , µ, 0]. First we simulate π with another protocol π′ such that no signal of π′ jumps from
outside of ∆0 to the interior of ∆0 . In other words, if some partial transcript t satisfies µt ∉ ∆0 , then when
the next signal B is sent, µtB is either still outside of ∆0 or it is on the boundary ∂ ∆0 . The simulation can
be done in a perfect manner, so that if Π and Π′ denote, respectively, the transcripts of π and π′, then µΠ′
has the same distribution as µΠ . The new protocol π′ might not have bounded communication,
but it terminates with probability 1. We refer the reader to [15, Signal Simulation Lemma] and [5,
Claim 7.14] for more details on such simulations.
We will truncate π′ in the following manner to obtain a new protocol π0 that performs [ f , µ, ε].
Whenever the corresponding random walk of π′ reaches a distribution ν on the boundary ∂ ∆0 , the
two players stop the random walk, and use ICν ( f , ν, ε) = 0 to output a value that creates a distributional
error of at most ε with respect to ν at no information cost. Obviously the distributional error of the
protocol π0 is at most ε. To analyze its information cost, denote the transcript of π0 by P, and note that P
is a partial transcript for π′. Let π′_P be the continuation of π′ when one starts at this partial transcript. It is
not difficult to see that

ICµ (π) = ICµ (π′) = ICµ (π0 ) + E_P [IC_{µP} (π′_P )].

Since π′ performs [ f , µ, 0], the tail protocol π′_P must perform [ f , µP , 0]. Hence, in order to finish the
proof, it suffices to show that ICν ( f , ν, 0) = Ω(h(ε)) for every ν ∈ ∂ ∆0 , as this would imply the desired
ICµ (π) ≥ ICµ (π0 ) + Ω(h(ε)). This can be proven with some work when µ has full support; however, it
is not true for general measures. For example, consider the AND function, and let µ be the distribution on
{0, 1}^2 defined as µ(0, 0) = 1 − 2ε and µ(1, 0) = µ(1, 1) = ε. Note that although µ is on the boundary of
∆0 , we have ICµ (AND, µ, 0) ≤ 2ε. Indeed, since µ(0, 1) = 0, Bob knows the correct output with
probability 1 by looking at his own input Y , and so if he sends his bit to Alice, they will both know the correct
output. This has information cost at most H(Y | X) = Pr[X = 1]H(Y | X = 1) = 2ε.
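The final estimate can be reproduced directly; in this sketch of ours, Bob announces Y, and the only uncertainty his announcement resolves sits on the event X = 1 (mass 2ε), where Y is uniform.

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

eps = 0.05
# mu(0,0) = 1 - 2*eps, mu(1,0) = mu(1,1) = eps, mu(0,1) = 0.
# Given X = 0, Y is surely 0; given X = 1 (probability 2*eps), Y is uniform.
# So the cost of Bob sending Y is at most
#   H(Y | X) = Pr[X = 1] * h(1/2) = 2*eps.
H_Y_given_X = (1 - 2 * eps) * h(0.0) + (2 * eps) * h(0.5)
assert math.isclose(H_Y_given_X, 2 * eps)
print(H_Y_given_X)
```

This is exactly the o(h(ε)) cost that defeats the Ω(h(ε)) lower bound one would need at this boundary measure.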
3.3 Information complexity of the AND function with error
Building upon the previous works of Ma and Ishwar [24, 25], Braverman et al. [5] developed a method
for proving the optimality of information complexity and applied it to determine the internal and external
information complexity of the two-bit AND function. They introduced a “continuous-time” protocol
for this task, and proved that it has optimal internal and external information cost for any underlying
distribution. Although this protocol is not a conventional communication protocol as it has access to a
continuous clock, it can be approximated by conventional communication protocols through dividing
the time into finitely many discrete units. Then in [5, Problem 1.1] they considered the case where error
is allowed, and conjectured a gain of IC(AND) − IC(AND, ε) = Θ(h(ε)). In this section, we conduct
a thorough analysis of the information complexity of the AND function when error is permitted, and
among other results, prove the aforementioned conjecture.
Applying our general bounds from Section 3.1 and Section 3.2 (i. e., Theorems 3.2, 3.5, and 3.6) we
already obtain the following, for positive functions C1 (µ) and C2 (µ).

(i). For every distribution µ satisfying ICµ (AND, 0) > 0, we have

ICµ (AND, 0) − C1 (µ)h(√ε) ≤ ICµ (AND, ε) ≤ ICµ (AND, 0) − C2 (µ)h(ε). (3.10)

(ii). For every distribution µ satisfying ICµ (AND, µ, 0) > 0, we have

ICµ (AND, µ, 0) − C1 (µ)h(√ε) ≤ ICµ (AND, µ, ε) ≤ ICµ (AND, µ, 0) − C2 (µ)h(ε). (3.11)
We show that under some conditions on the support of µ, the above lower bounds can be improved to
match the upper bounds.
Theorem 3.7. The following hold:
(i). For every distribution µ that has full support, except perhaps at (1, 1), we have

ICµ (AND, 0) − ICµ (AND, ε) ≤ C(µ)h(ε), (3.12)

where C(µ) is bounded whenever µ(0, 0), µ(0, 1), µ(1, 0) are bounded away from 0.
(ii). For every distribution µ of full support, we have
ICµ (AND, µ, 0) − ICµ (AND, µ, ε) ≤ C(µ)h(ε), (3.13)
where C(µ) is a positive function.
Note that for every distribution µ of full support, we have
ICµ (AND, µ, 0) = ICµ (AND, 0) > 0,
and
ICµ (AND, ε/α) ≤ ICµ (AND, µ, ε) ≤ ICµ (AND, ε),

where α = min_{xy} µ(xy). Thus Theorem 3.7 (ii) follows from (i).
From a technical point of view, Theorem 3.7 is perhaps our most involved result in this article, and its
proof occupies the bulk of Section 6. The first idea that facilitates the proof substantially shows that it is
possible to parametrize the space of the distributions ∆(X × Y) so that the changes that occur in the prior
distribution by the players' interactions can be captured by product measures. This idea, which is discussed
in detail in Section 5, allows us to first prove the lower bound of Theorem 3.7 for product measures,
and then make minor adjustments to adapt it to non-product distributions. The second component of the
proof is a stability result. Recall from Section 2.5 that the information cost of every protocol π depends
only on its “leaf distribution,” i. e., the distribution of µΠ , where Π is the transcript of π, or equivalently
of µ_ℓ , where ℓ is a random leaf of the protocol tree. Our stability result, Theorem 6.2, shows that the leaf
distribution of every almost optimal protocol π for [AND, 0] shares certain similarities with that of the
buzzer protocol. Note that since π does not make any errors, by the end of the protocol, either both players
know that the input is (1, 1), or one of them has revealed that her input is 0. Theorem 6.2 formalizes
the intuition that in this latter case, the other player must not have revealed that his input is very likely
to be 0. This is achieved by defining a potential function that depends only on the distribution of
µΠ and proving that it is bounded by the so-called information wastage ICµ (π) − ICµ (AND, 0). With
these results in hand, in order to complete the lower bound of Theorem 3.7, we start with a protocol π
performing [AND, ε] with almost optimal information complexity. First we show that π can be completed
to a protocol that performs [AND, 0] at a small additional information cost, though possibly larger than the
desired O(h(ε)). Then we apply the stability result to deduce certain properties for the leaf distribution
of π. This will imply that one indeed needs only an additional cost of O(h(ε)) to extend π to a protocol
that solves [AND, 0].
Braverman et al. [5] showed that IC(AND, 0) = maxµ ICµ (AND, 0) is attained on a distribution
having full support. This enables us to derive the following corollary on prior-free information complexity.
Corollary 3.8. The following hold:
(i). C1 h(ε) ≤ IC(AND, 0) − IC(AND, ε) ≤ C2 h(ε);
(ii). C1 h(ε) ≤ IC(AND, 0) − ICD (AND, ε) ≤ C2 h(ε);
where C1 and C2 are positive constants.
Proof. The measure µ that maximizes ICµ (AND, 0) has full support [5], and thus
IC(AND, 0) = ICµ (AND, 0) = ICµ (AND, µ, 0).
By Theorem 3.7 (ii),
IC(AND, ε) ≥ ICD (AND, ε) ≥ ICµ (AND, µ, ε) ≥ ICµ (AND, µ, 0) − C2 h(ε) = IC(AND, 0) − C2 h(ε). (3.14)
Moreover by a general upper bound that we prove later in Theorem 3.15, we have
ICD (AND, ε) ≤ IC(AND, ε) ≤ IC(AND, 0) −C1 h(ε). (3.15)
Both items in the corollary follow.
Since the difficult distributions for the set disjointness function are the ones in which the inputs
typically have small or no intersection, the distributions for the AND function that assign very
small or zero mass to the point (1, 1) are of particular importance. Let
ICδ (AND, ε, 1 → 0) = sup_{µ : µ(1,1)≤δ} ICµ (AND, ε, 1 → 0). (3.16)
The following corollary is used in Section 3.4 to analyze the information complexity of the set disjointness
problem.
Corollary 3.9. There exist positive constants C1 ,C2 and C3 such that
(i). C1 h(ε) ≤ IC0 (AND, 0) − IC0 (AND, ε) ≤ C2 h(ε);
(ii). C1 h(ε) ≤ IC0 (AND, 0) − IC0 (AND, ε, 1 → 0) ≤ C2 h(ε);
(iii). for every ε, δ > 0,
ICδ (AND, ε, 1 → 0) ≤ IC0 (AND, 0) −C1 h(ε) +C3 h(δ ). (3.17)
Proof. Assume that ε ≤ 1/2. Let µ be the distribution maximizing ICµ (AND, 0) under the constraint
µ(1, 1) = 0; this measure, which is described in [5], has full support except for µ(1, 1) = 0. Thus, by
Theorem 3.7 (i),
IC0 (AND, ε) ≥ ICµ (AND, ε) ≥ ICµ (AND, 0) −C2 h(ε) = IC0 (AND, 0) −C2 h(ε). (3.18)
Consequently, since IC0 (AND, ε) ≤ IC0 (AND, ε, 1 → 0), both (i) and (ii) will follow if we prove
IC0 (AND, ε, 1 → 0) ≤ IC0 (AND, 0) − C1 h(ε). To prove this, we would like to apply the AND case
of Theorem 3.2, however to be able to obtain a uniform upper bound on IC0 (AND, ε, 1 → 0), we need
to have a uniform lower bound on the probabilities µ(0, 0), µ(0, 1), µ(1, 0). Let α > 0 be a constant to be
determined later, and consider any distribution µ with µ(1, 1) = 0 and µ(a) < α for some input a ≠ (1, 1). Pick
b ∈ {0, 1}^2 \ {a, (1, 1)}, and obtain the distribution µ′ from µ by transferring all the probability mass
on a to b. That is, µ′(b) = µ(a) + µ(b) and µ′(a) = 0, and otherwise µ and µ′ are identical. Obviously
|µ − µ′| ≤ α. Now (2.12) implies
ICµ (AND, ε, 1 → 0) ≤ ICµ (AND, 0) ≤ ICµ′ (AND, 0) + 4α + 2h(2α) = 4α + 2h(2α) ≤ 4h(2α), (3.19)

where we used the fact that ICµ′ (AND, 0) = 0, as supp µ′ contains only two points. Setting α = 0.001,
for example, yields ICµ (AND, 0) ≤ 4h(2α) < 0.1 < IC0 (AND, 0) ≈ 0.4827. It remains to prove the
statement for the distributions µ with µ(0, 0), µ(0, 1), µ(1, 0) ≥ α. In this case Theorem 3.2 (see
Remark 3.3 regarding the one-sidedness) implies that there exists a constant C1 > 0 such that
ICµ (AND, ε, 1 → 0) ≤ IC0 (AND, 0) − C1 h(ε). This finishes the proof of (i) and (ii).
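The numeric claim in the proof, that 4h(2α) < 0.1 < IC0(AND, 0) for α = 0.001, is easy to confirm; the value ≈ 0.4827 for IC0(AND, 0) is quoted from [5].

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

alpha = 0.001
bound = 4 * h(2 * alpha)          # the right-hand side of (3.19)
assert bound < 0.1 < 0.4827       # 0.4827 is IC0(AND, 0), quoted from [5]
print(bound)
```

So distributions with a point of mass below 0.001 are nowhere near maximizing, and the uniform constant C1 can be extracted from the remaining, α-bounded case.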
To prove (iii), consider an arbitrary distribution µ with µ(1, 1) ≤ δ , and let µ′ be the distribution that
is obtained from µ by moving the probability mass on (1, 1) to a different point, so that µ′(1, 1) = 0 and
|µ − µ′| ≤ δ . Similarly to (3.19), we obtain

ICµ (AND, ε, 1 → 0) ≤ ICµ′ (AND, ε, 1 → 0) + 4h(2δ ) ≤ IC0 (AND, ε, 1 → 0) + 4h(2δ ),
and thus (iii) follows from (ii).
3.4 Set disjointness function with error
In this section we focus on the set disjointness function. First, it is not hard to obtain the following
result.
Corollary 3.10. There exists a constant C > 0 such that the following holds:
IC(DISJn , ε) ≥ n [IC0 (AND, 0) − Ch(ε)]. (3.20)
Proof sketch. By the argument that proves the additivity of information complexity (see, e. g., [7]),
one can prove that IC(DISJn , ε) ≥ n IC0 (AND, ε). Then apply Corollary 3.9. The essential idea is the
following. Consider a distribution µ on {0, 1}2 with µ(1, 1) = 0, and let (a, b) ∈ {0, 1}2 be an input for
the AND function. Let XY ∈ {0, 1}^n × {0, 1}^n be such that for some randomly selected J ∈ {1, . . . , n} we
have (X_J , Y_J ) = (a, b), and for i ∈ {1, . . . , n} \ {J}, the pairs (Xi , Yi ) are i.i.d. random variables, each with
distribution µ. Since µ(1, 1) = 0, we have DISJn (X,Y ) = 1 − AND(a, b) with probability 1. Thus one
can take a protocol π for DISJn and use it to solve AND(a, b) correctly for every (a, b). By sampling XY
in a clever way, using both public and private randomness, one can guarantee that the information cost of
the new protocol that solves AND(a, b) will be the information cost of π divided by n.
As a result one also obtains that Rε (DISJn ) ≥ n[IC0 (AND, 0) − Ω(h(ε))]. It turns out that by using
techniques from [5] and [4], one can prove the following upper bound.
Theorem 3.11. For the set disjointness function DISJn on inputs of length n, we have

limsup_{n→∞} Rε (DISJn )/n ≤ IC0 (AND, 0) − Ch(ε), (3.21)

for some constant C > 0.
Proof. See Section 7.1.
We conjecture that in fact the exact constant is given by IC0 (AND, ε, 1 → 0). In other words:
Conjecture 3.12. For the set disjointness function DISJn on inputs of length n, we have
Rε (DISJn ) = n IC0 (AND, ε, 1 → 0) + o(n). (3.22)
Braverman [4] proved that for all 0 < α < 1 and for all functions f ,
ICD ( f , ε) ≥ (1 − α) IC( f , ε/α). (3.23)
When f = DISJn , Corollary 3.10 gives

ICD (DISJn , ε)/n ≥ (1 − α)(IC0 (AND, 0) − Θ(h(ε/α))) ≥ IC0 (AND, 0) − Θ(α + h(ε/α)). (3.24)

Substituting α = √(ε log(1/ε)) yields

ICD (DISJn , ε)/n ≥ IC0 (AND, 0) − Θ(√h(ε)). (3.25)
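The substitution can be sanity-checked numerically: with α = √(ε log(1/ε)), the loss Θ(α + h(ε/α)) and the claimed Θ(√h(ε)) stay within constant factors of each other as ε → 0 (a sketch of ours, using base-2 logarithms).

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for k in range(3, 12):
    eps = 10.0 ** (-k)
    alpha = math.sqrt(eps * math.log2(1 / eps))
    lhs = alpha + h(eps / alpha)      # the Theta(alpha + h(eps/alpha)) loss
    rhs = math.sqrt(h(eps))           # the claimed Theta(sqrt(h(eps)))
    assert lhs < 3 * rhs              # the ratio stays bounded as eps -> 0
```

The other direction is immediate from (3.1): √h(ε) ≤ √(2ε log(1/ε)) = √2 · α ≤ √2 · (α + h(ε/α)).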
In Theorem 3.13 below, which is one of our main contributions, we show that this bound is sharp. The
proof relies on introducing a new protocol for the set disjointness problem and analyzing its information
cost.
Theorem 3.13. For the set disjointness function DISJn on inputs of length n, we have

ICD (DISJn , ε) ≤ n [IC0 (AND, 0) − C1 √h(ε)] + C2 log n, (3.26)

for constants C1 , C2 > 0.
Proof. See Section 7.2.
3.5 Prior-free information cost
Theorem 3.13 shows that for α = √(ε log(1/ε)) = Θ(√h(ε)) and sufficiently large n, we have

ICD (DISJn , ε)/(1 − Θ(α)) = IC(DISJn , ε/α) < IC(DISJn , ε), (3.27)
and thus proves a separation between the distributional and non-distributional prior-free information complexities. As we discussed in the introduction, this has the important implication that the amortized randomized communication complexity is not necessarily equal to the amortized distributional communication complexity with respect to the hardest distribution. More precisely, there are examples for which
\[
\max_\mu D^{\mu,n}_\varepsilon(f^n) \ne R^n_\varepsilon(f^n).
\]
Next we turn to proving general lower bounds and upper bounds for the prior-free information
complexity. Theorem 3.5 immediately implies a lower bound for non-distributional prior-free information
complexity.
Corollary 3.14 (corollary of Theorem 3.5). For every function f , we have
\[
\mathrm{IC}(f,\varepsilon) \ge \mathrm{IC}(f,0) - 4|\mathcal X\times\mathcal Y|\,h(\sqrt{\varepsilon}). \tag{3.28}
\]
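As a quick numeric illustration (not from the paper), the h(√ε) slack in the lower bound is much larger than the h(ε) gain appearing in the upper bounds, which is why the two bounds leave a gap:

```python
import math

def h(p):
    # binary entropy in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for eps in [1e-2, 1e-4, 1e-6]:
    # h(sqrt(eps)) dominates h(eps) ever more strongly as eps -> 0
    print(eps, h(eps), h(math.sqrt(eps)))
```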
Unless µ satisfies certain conditions, Theorem 3.2 does not provide an upper bound on ICµ(f, ε) that is uniform in µ, so we cannot apply it directly to bound IC(f, ε). However, we get around this problem by proving that the “difficult distributions” satisfy these conditions, and hence we obtain the desired upper bound.
Theorem 3.15. If f : X × Y → Z is non-constant, then
IC( f , ε) ≤ IC( f , 0) −C( f )h(ε), (3.29)
where C( f ) > 0.
Proof. See Section 4.3.
The same upper bound and lower bound hold for ICD ( f , ε).
Theorem 3.16. If f : X × Y → Z is non-constant, then
\[
\mathrm{IC}^D(f,0) - C_1(f)\,h(\sqrt{\varepsilon}) \le \mathrm{IC}^D(f,\varepsilon) \le \mathrm{IC}^D(f,0) - C_2(f)\,h(\varepsilon), \tag{3.30}
\]
where C1 ( f ),C2 ( f ) > 0.
Proof. It is shown in [4] that ICD ( f , 0) = IC( f , 0), and thus the upper bound follows from Theorem 3.15
as ICD ( f , ε) ≤ IC( f , ε).
To prove the lower bound, choose a measure µ that maximizes ICµ(f, µ, 0), and let
\[
\alpha = \min_{xy\in\operatorname{supp}\mu}\mu(x,y).
\]
Applying Theorem 3.6, we get
\[
\mathrm{IC}^D(f,\varepsilon) \ge \mathrm{IC}_\mu(f,\mu,\varepsilon) \ge \mathrm{IC}_\mu(f,\mu,0) - 4|\mathcal X||\mathcal Y|\,h\bigl(\sqrt{\varepsilon/\alpha}\bigr) = \mathrm{IC}^D(f,0) - C_1(f)\,h(\sqrt{\varepsilon}). \tag{3.31}
\]
3.6 A characterization of trivial measures
We start with a few definitions. Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y. We say that µ is external-trivial if IC^ext_µ(f, µ, 0) = 0. We say that µ is strongly external-trivial if there exists a protocol π computing f correctly on all inputs satisfying IC^ext_µ(π) = 0. We say that µ is structurally external-trivial if f is constant on SA × SB, where SA is the support of the marginal of µ on Alice's input and SB is the support of the marginal of µ on Bob's input.
Similarly, we say that µ is internal-trivial if ICµ(f, µ, 0) = 0. We say that µ is strongly internal-trivial if there exists a protocol π computing f correctly on all inputs satisfying ICµ(π) = 0. We say that µ is structurally internal-trivial if the marginals of µ can be partitioned as SA = ⋃i Xi and SB = ⋃i Yi so that the support of µ is contained in ⋃i Xi × Yi, and f is constant on each Xi × Yi.
Theorem 3.17 below shows that all our definitions of internal triviality are equivalent. In particular, if
ICµ ( f , 0) = 0, then the infimum in the definition of ICµ is achieved by a finite protocol.
Theorem 3.17. Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y. The
distribution µ is internal-trivial iff it is strongly internal-trivial iff it is structurally internal-trivial.
Proof. See Section 4.4.
In order to prove Theorem 3.17, we first obtain a characterization of measures that are not structurally
internal-trivial, by defining a graph Gµ on the support of every distribution µ on X × Y.
Definition 3.18. Let G be the graph whose vertex set is X × Y, where two vertices are connected if they agree on one of their coordinates. That is, xy and xy′ are connected for every x ∈ X and y ≠ y′ ∈ Y, and xy and x′y are connected for every x ≠ x′ ∈ X and y ∈ Y. In short, G is the Cartesian product of the complete graphs KX and KY. Let Gµ be the subgraph of G induced by the support of µ. For every connected component C of Gµ, define
\[
C_A = \{x \in \mathcal X : xy \in C \text{ for some } y \in \mathcal Y\}, \qquad
C_B = \{y \in \mathcal Y : xy \in C \text{ for some } x \in \mathcal X\}.
\]
The following lemma shows that if µ is not structurally internal-trivial, then there exists a connected
component C of Gµ such that f is not constant on CA ×CB . We will use this fact later in Section 4.1.1 in
the proof of Theorem 3.2.
Lemma 3.19. Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y. Then the
distribution µ is structurally internal-trivial iff for every connected component C of Gµ , the function f is
constant on CA ×CB .
Proof. Suppose first that µ is structurally internal-trivial. Thus there exist partitions SA = ⋃i Xi and SB = ⋃i Yi such that the support of µ is contained in ⋃i Xi × Yi and f is constant on Xi × Yi for each i. Any connected component C of Gµ must lie in some Xi × Yi, hence CA × CB ⊆ Xi × Yi and f is constant on CA × CB for every connected component C.
Conversely, suppose that for every connected component C of Gµ, the function f is constant on CA × CB. If C and C′ are two different connected components, then CA and C′A are disjoint: otherwise, if (say) xy ∈ C and xy′ ∈ C′, then xy is connected to xy′ and so C = C′. Thus {CA : C a connected component of Gµ} partitions the support of X. Similarly, {CB : C a connected component of Gµ} partitions the support of Y. These partitions serve as a witness that µ is structurally internal-trivial.
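The criterion of Lemma 3.19 is easy to test computationally. The sketch below (hypothetical helper code, not from the paper) builds the connected components of Gµ on a given support, using the adjacency of Definition 3.18, and checks whether f is constant on CA × CB for every component:

```python
from itertools import combinations

def components(support):
    # two support points are adjacent iff they share a row or a column
    points = list(support)
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    for p, q in combinations(points, 2):
        if p[0] == q[0] or p[1] == q[1]:
            parent[find(p)] = find(q)
    comps = {}
    for p in points:
        comps.setdefault(find(p), []).append(p)
    return list(comps.values())

def structurally_internal_trivial(f, support):
    # Lemma 3.19: trivial iff f is constant on C_A x C_B for every component C
    for comp in components(support):
        rows = {x for x, _ in comp}
        cols = {y for _, y in comp}
        if len({f(x, y) for x in rows for y in cols}) > 1:
            return False
    return True

AND = lambda x, y: x & y
# supp = {00, 01, 10}: a single component with C_A x C_B = {0,1}^2,
# on which AND is not constant, so the measure is not trivial
print(structurally_internal_trivial(AND, {(0, 0), (0, 1), (1, 0)}))  # False
# supp = {00, 11}: two singleton components, AND constant on each
print(structurally_internal_trivial(AND, {(0, 0), (1, 1)}))          # True
```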
Finally we note that the analogue of Theorem 3.17 holds for the external case as well.
Theorem 3.20. Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y. The
distribution µ is external-trivial iff it is strongly external-trivial iff it is structurally external-trivial.
Proof. See Section 4.4.
4 Proofs for general functions
In this section we present the proofs of the main results on general functions presented in Section 3.
4.1 Information complexity with point-wise error
4.1.1 Proof of Theorem 3.2
We discuss some notation before the proof. Consider a protocol π. For an input xy, let Πxy denote the
random variable corresponding to the transcript of π when it is executed on the input xy. Let Π denote
the random variable for transcripts of π, whose distribution is given as
\[
\Pr[\Pi = t] = \mathop{\mathbb{E}}_{xy} \Pr[\Pi_{xy} = t] = \sum_{xy} \Pr[xy]\,\Pr[\Pi_{xy} = t], \tag{4.1}
\]
where Pr[Πxy = t] = Pr[Π = t | XY = xy]. As usual we abbreviate Pr[xy] = Pr[XY = xy], and Pr[x | y] =
Pr[X = x | Y = y], and so on.
The next lemma shows that, under some conditions, if we modify a protocol π into a new protocol π′ according to Figure 1, then the information cost drops significantly.
On input XY:
• Alice privately samples a Bernoulli random variable B with parameter ε.
• If X = x1 and B = 1, Alice sets X′ = x0; otherwise she sets X′ = X.
• The players run π on X′Y.
Figure 1: The protocol π′ is obtained from a protocol π using x0, x1 ∈ X.
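A minimal numeric sketch (assumed helper code, not from the paper) of the substitution in Figure 1: on input x1y, the transcript of π′ is distributed as the (ε, 1−ε)-mixture of π's transcript distributions on x0y and x1y, which is exactly the identity used later in (4.20).

```python
def transcript_dist_prime(dist_on_x0, dist_on_x1, eps):
    # dist_on_xi: dict transcript -> Pr[Pi = t | input x_i*y]
    # on x1, pi' runs pi on x0 with prob. eps and on x1 with prob. 1 - eps
    ts = set(dist_on_x0) | set(dist_on_x1)
    return {t: eps * dist_on_x0.get(t, 0.0) + (1 - eps) * dist_on_x1.get(t, 0.0)
            for t in ts}

p0 = {"t1": 0.9, "t2": 0.1}   # hypothetical transcript distribution on x0*y
p1 = {"t1": 0.2, "t2": 0.8}   # hypothetical transcript distribution on x1*y
mix = transcript_dist_prime(p0, p1, eps=0.25)
print(mix)  # t1: 0.25*0.9 + 0.75*0.2 = 0.375, t2: 0.625
```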
Lemma 4.1. Let µ be a distribution on X × Y, and π be a protocol with input set X × Y. Suppose there
is a set L of transcripts of π that satisfies, for some C1 ∈ [0, 1],
(1) Pr[Π ∈ L] ≥ C1 ;
and there are x0 y, x1 y, both in the support of µ, and C2 ∈ (0, 1], δ ∈ [0, 1] with C2 > 2δ , such that for
every t ∈ L,
(2) Pr[XY = x0 y | Π = t] ≥ C2 ;
(3) Pr[XY = x1 y | Π = t] ≤ δ .
Let K = log |X × Y|. Then for sufficiently small ε > 0 (depending on µ, C2, δ), the protocol π′ defined in Figure 1 satisfies
\[
\mathrm{IC}_\mu(\pi') \le \mathrm{IC}_\mu(\pi) - C_1 C_2\, h\!\left(\frac{\varepsilon}{2}\min\Bigl\{1,\ C_2\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\}\right) + 3\varepsilon K + h(\delta/C_2). \tag{4.2}
\]
Explicitly, the upper bound holds as long as
\[
\varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\,\delta/C_2 \le 1/2.
\]
Intuitively, this condition says that π has a set of transcripts L that occur with significant probability, and every transcript in L probabilistically differentiates between x0y and x1y. In other words, if we see a transcript in L, then we know that the input was much more likely to be x0y than x1y. One point to note here is that we require the two points x0y and x1y to be in the same column. By symmetry, if there are two points in the same row satisfying the same properties, then the claim of Lemma 4.1 also holds.
Proof. Consider the protocols π and π′ as described in Figure 1. Note that Π_{X′Y} is the transcript of π′. We shorthand Π′ = Π_{X′Y}. The information cost of π′ is given by
\[
\mathrm{IC}_\mu(\pi') = I(X;\Pi'\mid Y) + I(Y;\Pi'\mid X) = H(X\mid Y) + H(Y\mid X) - H(X\mid \Pi' Y) - H(Y\mid \Pi' X), \tag{4.3}
\]
while
\[
\mathrm{IC}_\mu(\pi) = I(X;\Pi\mid Y) + I(Y;\Pi\mid X) = H(X\mid Y) + H(Y\mid X) - H(X\mid \Pi Y) - H(Y\mid \Pi X). \tag{4.4}
\]
Hence
\[
\mathrm{IC}_\mu(\pi) - \mathrm{IC}_\mu(\pi') = H(X\mid \Pi' Y) - H(X\mid \Pi Y) + H(Y\mid \Pi' X) - H(Y\mid \Pi X). \tag{4.5}
\]
Note that
\[
H(Y\mid \Pi' X) \ge H(Y\mid \Pi' X B) \ge (1-\varepsilon)\, H(Y\mid \Pi' X, (B=0)) = (1-\varepsilon)\, H(Y\mid \Pi X) \ge H(Y\mid \Pi X) - \varepsilon K. \tag{4.6}
\]
Similarly, for every y ∈ Y and every possible transcript t, we have
\[
H(X\mid \Pi' Y = ty) \ge H(X\mid \Pi Y = ty) - \varepsilon K. \tag{4.7}
\]
We will show that for Y = y and every transcript t ∈ L,
\[
H(X\mid \Pi' Y = ty) \ge H(X\mid \Pi Y = ty) + h\!\left(\frac{\varepsilon}{2}\min\Bigl\{1,\ C_2\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\}\right) - h(\delta/C_2) - \varepsilon K. \tag{4.8}
\]
Note that Condition (2) implies that for t ∈ L,
Pr[ΠY = ty] ≥ Pr[ΠXY = tx0 y] = Pr[XY = x0 y | Π = t] Pr[Π = t] ≥ C2 Pr[Π = t]. (4.9)
Hence
Pr[Π ∈ L,Y = y] ≥ C2 Pr[Π ∈ L] ≥ C1C2 . (4.10)
This together with (4.7) and (4.8) would show that
\begin{align*}
H(X\mid\Pi' Y) &= \sum_t\sum_{y\in\mathcal Y}\Pr[\Pi' Y = ty]\,H(X\mid\Pi' Y = ty) \ge \sum_t\sum_{y\in\mathcal Y}(1-\varepsilon)\Pr[\Pi Y = ty]\,H(X\mid\Pi' Y = ty)\\
&\ge \sum_t\sum_{y\in\mathcal Y}\Pr[\Pi Y = ty]\,H(X\mid\Pi Y = ty)\\
&\qquad + \Pr[\Pi\in L, Y=y]\left(h\!\left(\frac{\varepsilon}{2}\min\Bigl\{1,\ C_2\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\}\right) - h(\delta/C_2) - 2\varepsilon K\right)\\
&\ge H(X\mid\Pi Y) + C_1 C_2\, h\!\left(\frac{\varepsilon}{2}\min\Bigl\{1,\ C_2\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\}\right) - 2\varepsilon K - h(\delta/C_2).
\end{align*}
Applying (4.6) would immediately give the claimed bound.
Our aim, then, is to show (4.8). From now on we consider exclusively t ∈ L.
The idea is to consider the indicator variable C := 1_{[X ≠ x1]}. Since C is a deterministic function of X, we have
\[
H(X\mid \Pi' Y = ty) = H(XC\mid \Pi' Y = ty) = H(X\mid C, (\Pi' Y = ty)) + H(C\mid \Pi' Y = ty). \tag{4.11}
\]
We obtain by Conditions (2) and (3) that
\[
\Pr[X = x_1\mid\Pi Y = ty] = \frac{\Pr[XY = x_1 y\mid\Pi = t]}{\Pr[Y = y\mid\Pi = t]} \le \frac{\Pr[XY = x_1 y\mid\Pi = t]}{\Pr[XY = x_0 y\mid\Pi = t]} \le \frac{\delta}{C_2}. \tag{4.12}
\]
Hence using (4.12), the first term in (4.11) can be bounded as
\begin{align*}
H(X\mid C, (\Pi' Y = ty)) &\ge (1-\varepsilon)\,H(X\mid C, (B\Pi' Y = 0ty)) \tag{4.13}\\
&\ge H(X\mid C, (\Pi Y = ty)) - \varepsilon K \tag{4.14}\\
&= H(XC\mid\Pi Y = ty) - H(C\mid\Pi Y = ty) - \varepsilon K \tag{4.15}\\
&\ge H(X\mid\Pi Y = ty) - h(\delta/C_2) - \varepsilon K. \tag{4.16}
\end{align*}
Next, we bound the second term H(C | Π′Y = ty) in (4.11):
\begin{align*}
\Pr[C = 0\mid\Pi' Y = ty] &= \Pr[X = x_1\mid\Pi' Y = ty] \tag{4.17}\\
&= \frac{\Pr[\Pi' XY = tx_1 y]}{\Pr[\Pi' Y = ty]} \tag{4.18}\\
&= \frac{\Pr[\Pi' = t\mid XY = x_1 y]\,\Pr[x_1 y]}{\Pr[\Pi' Y = ty]} \tag{4.19}\\
&= \frac{\Pr[x_1 y]}{\Pr[\Pi' Y = ty]}\bigl(\varepsilon\Pr[\Pi = t\mid XY = x_0 y] + (1-\varepsilon)\Pr[\Pi = t\mid XY = x_1 y]\bigr) \tag{4.20}\\
&= \frac{\varepsilon\,\Pr[\Pi XY = tx_0 y]\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\Pr[\Pi XY = tx_1 y]}{\Pr[\Pi' Y = ty]}. \tag{4.21}
\end{align*}
We start by upper bounding (4.21):
\begin{align*}
(4.21) &\le \frac{\varepsilon\,\Pr[\Pi XY = tx_0 y]\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\Pr[\Pi XY = tx_1 y]}{\Pr[\Pi' XY = tx_0 y]} \tag{4.22}\\
&= \frac{\varepsilon\,\Pr[\Pi XY = tx_0 y]\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\Pr[\Pi XY = tx_1 y]}{\Pr[\Pi XY = tx_0 y]} \tag{4.23}\\
&= \frac{\varepsilon\,\Pr[XY = x_0 y\mid\Pi = t]\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\Pr[XY = x_1 y\mid\Pi = t]}{\Pr[XY = x_0 y\mid\Pi = t]} \tag{4.24}\\
&\le \varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + (1-\varepsilon)\,\delta/C_2. \tag{4.25}
\end{align*}
Next, we lower bound (4.21). First, we obtain an upper bound on its denominator:
\begin{align*}
\Pr[\Pi' Y = ty] &= \sum_{x\ne x_1}\Pr[\Pi' XY = txy] + \Pr[\Pi' XY = tx_1 y] \tag{4.26}\\
&= \sum_{x\ne x_1}\Pr[\Pi XY = txy] + \varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Pr[\Pi XY = tx_0 y] + (1-\varepsilon)\Pr[\Pi XY = tx_1 y] \tag{4.27}\\
&\le \sum_{x}\Pr[\Pi XY = txy] + \varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Pr[\Pi XY = tx_0 y] \tag{4.28}\\
&= \Pr[\Pi Y = ty] + \varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Pr[\Pi XY = tx_0 y] \tag{4.29}\\
&\le 2\max\Bigl\{\Pr[\Pi Y = ty],\ \frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Pr[\Pi XY = tx_0 y]\Bigr\}, \tag{4.30}
\end{align*}
where (4.27) follows from the calculations between (4.18) and (4.21). Hence,
\begin{align*}
(4.21) &\ge \frac{\varepsilon\,\Pr[\Pi XY = tx_0 y]\frac{\Pr[x_1 y]}{\Pr[x_0 y]}}{2\max\Bigl\{\Pr[\Pi Y = ty],\ \Pr[\Pi XY = tx_0 y]\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\}} \tag{4.31}\\
&= \frac{\varepsilon}{2}\min\Bigl\{\frac{\Pr[\Pi XY = tx_0 y]}{\Pr[\Pi Y = ty]}\cdot\frac{\Pr[x_1 y]}{\Pr[x_0 y]},\ 1\Bigr\} \tag{4.32}\\
&= \frac{\varepsilon}{2}\min\Bigl\{\Pr[X = x_0\mid\Pi Y = ty]\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]},\ 1\Bigr\} \tag{4.33}\\
&= \frac{\varepsilon}{2}\min\Bigl\{\frac{\Pr[XY = x_0 y\mid\Pi = t]}{\Pr[Y = y\mid\Pi = t]}\cdot\frac{\Pr[x_1 y]}{\Pr[x_0 y]},\ 1\Bigr\} \tag{4.34}\\
&\ge \frac{\varepsilon}{2}\min\Bigl\{\Pr[XY = x_0 y\mid\Pi = t]\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]},\ 1\Bigr\} \tag{4.35}\\
&\ge \frac{\varepsilon}{2}\min\Bigl\{C_2\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]},\ 1\Bigr\}. \tag{4.36}
\end{align*}
Combining (4.21), (4.25) and (4.36):
\[
\frac{\varepsilon}{2}\min\Bigl\{1,\ C_2\frac{\Pr[x_1 y]}{\Pr[x_0 y]}\Bigr\} \le \Pr[C = 0\mid \Pi' Y = ty] \le \varepsilon\,\frac{\Pr[x_1 y]}{\Pr[x_0 y]} + \frac{(1-\varepsilon)\delta}{C_2}. \tag{4.37}
\]
This together with (4.11) and (4.16) gives (4.8) as desired, as long as ε > 0 is small enough such that the
upper bound in (4.37) is at most 1/2.
Theorem 3.2 (restated). Consider a function f : X × Y → Z and a probability measure µ on X × Y such
that ICµ ( f , 0) > 0. There exist positive constants τ, ε0 , depending on f and µ, such that for every ε ≤ ε0 ,
ICµ ( f , ε) ≤ ICµ ( f , 0) − τh(ε). (4.38)
Moreover:
Non-constant case: Suppose that f (a) 6= f (b) for two points a, b in the support of µ, and on the same
row or column. Then one can take τ = µ(a)2 µ(b)/32, and ε0 depends only on min(µ(a), µ(b))
and |X × Y|.
AND case: Let x0, x1 ∈ X and y0, y1 ∈ Y. Suppose that f(x0y0) = f(x0y1) = f(x1y0) = z0 and f(x1y1) = z1 ≠ z0, and that x0y0, x0y1, x1y0 ∈ supp µ. Then one can take
\[
\tau = \frac{\mu(x_0 y_0)^2}{64}\min\bigl(\mu(x_0 y_1),\ \mu(x_1 y_0)\bigr),
\]
and ε0 depends only on |X × Y| and the minimum of µ(x0y0), µ(x0y1), µ(x1y0).
Proof. In order to apply the assumption ICµ ( f , 0) > 0, we will need to use our characterization of
internal-trivial measures. Consider the graph Gµ defined on X × Y as given in Definition 3.18. By
Theorem 3.17 and Lemma 3.19, the assumption ICµ ( f , 0) > 0 implies the existence of a connected
component C of Gµ such that f is not constant on CA ×CB . Note that C ⊆ supp µ, and CA ×CB is the
corresponding rectangle given by C.
Case I: f is not constant on C.
As C is connected, there must be two adjacent points a, b ∈ C such that f (a) 6= f (b). By our definition
of adjacency in Definition 3.18, without loss of generality we can assume that a, b are in the same column.
Now consider any protocol π that solves [ f , 0]. Let L0 be the set of the transcripts that can occur when π
runs with input a; formally,
L0 = {t : Pr[Πa = t] > 0}. (4.39)
Clearly Pr[Π ∈ L0 ] ≥ µ(a). As f (a) 6= f (b) and π has no error, for every t ∈ L0 ,
Pr[XY = b | Π = t] = 0. (4.40)
Let
L = {t ∈ L0 : Pr[XY = a | Π = t] ≥ µ(a)/2}. (4.41)
We claim
\[
\Pr[\Pi\in L] \ge \mu(a)/2. \tag{4.42}
\]
Indeed, note
\[
\sum_{t\in L_0}\Pr[\Pi = t]\Pr[XY = a\mid\Pi = t] = \sum_t \Pr[\Pi = t]\Pr[XY = a\mid\Pi = t] = \mu(a). \tag{4.43}
\]
Using the trivial bound Pr[XY = a | Π = t] ≤ 1, we have
\begin{align*}
\mu(a) &= \sum_{t\in L}\Pr[\Pi = t]\Pr[XY = a\mid\Pi = t] + \sum_{t\in L_0\setminus L}\Pr[\Pi = t]\Pr[XY = a\mid\Pi = t]\\
&\le \sum_{t\in L}\Pr[\Pi = t] + \frac{\mu(a)}{2}\sum_{t\in L_0\setminus L}\Pr[\Pi = t] \le \Pr[\Pi\in L] + \frac{\mu(a)}{2}\bigl(1 - \Pr[\Pi\in L]\bigr),
\end{align*}
which gives Pr[Π ∈ L] ≥ µ(a)/(2 − µ(a)) ≥ µ(a)/2, as claimed. For small enough ε, the set L and the
points a, b satisfy the three conditions in Lemma 4.1 with C1 = C2 = µ(a)/2 and δ = 0, respectively
from (4.42), (4.41) and (4.40). We conclude that
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2}{4}\, h\!\left(\frac{\mu(b)}{4}\,\varepsilon\right) + 3\varepsilon K \quad\text{whenever } \frac{\mu(b)}{\mu(a)}\,\varepsilon \le 1/2, \tag{4.44}
\]
where K = log |X × Y|. Hence when ε ≤ 1/2, by (3.3) we have
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2\mu(b)}{16}\, h(\varepsilon) + 3\varepsilon K. \tag{4.45}
\]
We can thus find ε0 > 0, depending only on µ(a), µ(b), K, such that for ε ≤ ε0,
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2\mu(b)}{32}\, h(\varepsilon). \tag{4.46}
\]
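The passage from (4.44) to (4.45) uses the fact referred to as (3.3), namely h(cε) ≥ c·h(ε) for 0 < c ≤ 1; this follows from the concavity of h together with h(0) = 0, and is easy to spot-check numerically (a sketch, not from the paper):

```python
import math

def h(p):
    # binary entropy in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# concavity of h and h(0) = 0 give h(c*eps) >= c * h(eps) for 0 < c <= 1
ok = all(h(c * e) >= c * h(e) - 1e-12
         for c in [0.01, 0.1, 0.5, 0.9, 1.0]
         for e in [1e-4, 1e-2, 0.1, 0.5])
print(ok)
```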
Case II: f is constant on C but not on CA ×CB .
We first make a simple observation:
Property A: For any protocol π that performs [ f , 0], and for every transcript t of π, there exists at least
one point b ∈ C (which can depend on t) such that Pr[XY = b | Π = t] = 0.
Indeed, otherwise f would be constant on CA × CB by the rectangle property of protocols (i.e., Pr[Π = t | x1y1] Pr[Π = t | x2y2] = Pr[Π = t | x1y2] Pr[Π = t | x2y1] for all x1, x2, y1, y2).
Given a protocol π that performs [ f , 0] and a point a ∈ C, let the set L(π, a) of transcripts be defined
as
L(π, a) = {t : Pr[XY = a | Π = t] ≥ µ(a)/2}. (4.47)
The same argument as in Case I shows that Pr[Π ∈ L(π, a)] ≥ µ(a)/2. For any other point b ∈ C, define
L(π, a, b) = {t ∈ L(π, a) : Pr[XY = b | Π = t] = 0}. (4.48)
Let k := |C|; necessarily k ≥ 3. By Property A, we have
\[
L(\pi, a) = \bigcup_{b\in C} L(\pi, a, b). \tag{4.49}
\]
This implies the existence of a point b ∈ C with
\[
\Pr[\Pi\in L(\pi, a, b)] \ge \Pr[\Pi\in L(\pi, a)]/k \ge \mu(a)/2k.
\]
To sum up, we have shown that there exist two different points a, b ∈ C ⊆ supp µ such that the set of
transcripts L(π, a, b) satisfies the following properties:
(1’) Pr[Π ∈ L(π, a, b)] ≥ µ(a)/2k;
(2’) Pr[XY = a | Π = t] ≥ µ(a)/2 for every t ∈ L(π, a, b);
(3’) Pr[XY = b | Π = t] = 0 for every t ∈ L(π, a, b).
Now consider a sequence of protocols πn that all perform [ f , 0] and limn→∞ ICµ (πn ) = ICµ ( f , 0). Fix
(arbitrarily) a point a ∈ C. For every protocol πn we construct L(πn , a, bπn ) as above. Since there are only
k − 1 different values of b, by picking a subsequence of πn if necessary, without loss of generality, we
may assume that for some point b ∈ C, bπn = b for all πn . Hence for every πn we have a set of transcripts
L(πn , a, b) such that properties (1’), (2’) and (3’) are all satisfied.
If we compare these three conditions with the conditions in Lemma 4.1, we find that the only issue is
that we do not know whether a and b are in the same row or column (in terms of the graph Gµ , whether a
and b are adjacent).
Case IIa: a, b are adjacent in Gµ . As we expand on below, we can guarantee that this case happens
in the AND case (see theorem statement) by choosing a = x0 y0 .
For small enough ε, the set L(π, a, b) and the points a, b satisfy the three conditions in Lemma 4.1
with C1 = µ(a)/2k, C2 = µ(a)/2 and δ = 0, respectively from (1’), (2’) and (3’). We conclude that
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2}{4k}\, h\!\left(\frac{\mu(b)}{4}\,\varepsilon\right) + 3\varepsilon K \quad\text{whenever } \frac{\mu(b)}{\mu(a)}\,\varepsilon \le 1/2, \tag{4.50}
\]
where K = log |X × Y|. Repeating the calculations of Case I, we can find ε0 > 0, depending only on µ(a), µ(b), K, such that for ε ≤ ε0,
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2\mu(b)}{32k}\, h(\varepsilon). \tag{4.51}
\]
Suppose now that we are in the AND case. Choosing a = x0 y0 , we see that Property A must hold
for some b ∈ {x0 y1 , x1 y0 }, since a transcript having positive probability on both x0 y1 and x1 y0 also has
positive probability on x1 y1 , whereas f (x0 y1 ) 6= f (x1 y1 ) by assumption. Property (1’) thus holds with
k = 2, and we conclude that for ε ≤ ε0 ,
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)^2\mu(b)}{64}\, h(\varepsilon). \tag{4.52}
\]
Case IIb: a, b are not adjacent in Gµ . To handle this case, we run a binary search along a shortest
path connecting a and b in C.
Pick an arbitrary point c ∈ C in some shortest path connecting a and b. For every πn , sort the
transcripts in L(πn , a, b) according to
pn,t,c := Pr[XY = c | Πn = t]
in increasing order, where Πn is the random variable representing the transcript of πn . Let mn be the median
of the sequence pn,t,c according to the conditional probability measure νn (t) := Pr[Πn = t | t ∈ L(πn , a, b)],
i. e.,
νn ({t ∈ L(πn , a, b) : pn,t,c ≤ mn }), νn ({t ∈ L(πn , a, b) : pn,t,c ≥ mn }) ≥ 1/2. (4.53)
Such a median always exists: if mn is the smallest value such that νn ({t ∈ L(πn , a, b) : pn,t,c ≤ mn }) ≥ 1/2
then
νn ({t ∈ L(πn , a, b) : pn,t,c ≥ mn }) = 1 − νn ({t ∈ L(πn , a, b) : pn,t,c < mn }) ≥ 1/2 .
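For any finitely supported νn, the median in (4.53) indeed always exists; a small sketch (hypothetical helper, not the paper's code) with an explicit check of both one-sided masses:

```python
def weighted_median(pairs):
    # pairs: list of (value p_t, weight nu(t)) with weights summing to 1;
    # returns the smallest value m whose lower cumulative mass is >= 1/2
    pairs = sorted(pairs)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= 0.5:
            return v
    return pairs[-1][0]

pairs = [(0.1, 0.2), (0.3, 0.25), (0.4, 0.3), (0.9, 0.25)]
m = weighted_median(pairs)
low = sum(w for v, w in pairs if v <= m)
high = sum(w for v, w in pairs if v >= m)
print(m, low, high)  # both sides carry nu-mass at least 1/2, as in (4.53)
```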
As trivially mn ∈ [0, 1], the sequence mn must have a convergent subsequence. Again by picking
a subsequence from mn if necessary, we may assume that the sequence mn itself is convergent, say
limn→∞ mn = m; moreover, if m > 0, by picking another subsequence we can assume that mn ≥ m/2 for
all n. The binary search algorithm is then given as:
• If m = 0, update the set of transcripts to
L(πn , a, c) := {t ∈ L(πn , a, b) : pn,t,c ≤ mn }, (4.54)
and continue the algorithm with b replaced by c;
• If m > 0, update the set of transcripts to
L(πn , c, b) := {t ∈ L(πn , a, b) : pn,t,c ≥ mn }, (4.55)
and continue the algorithm with a replaced by c.
We argue that the three properties are roughly preserved. In the case m = 0, Property (2’) is kept, while
Properties (1’) and (3’) change to
Pr[Πn ∈ L(πn , a, c)] ≥ µ(a)/4k and Pr[XY = c | Πn = t] ≤ mn , ∀ t ∈ L(πn , a, c), (4.56)
respectively. In the case m > 0, Property (3’) is preserved while Properties (1’) and (2’) change to
Pr[Πn ∈ L(πn , c, b)] ≥ µ(a)/4k and Pr[XY = c | Πn = t] > m/2, ∀ t ∈ L(πn , c, b). (4.57)
In either case, we have seen that the new set of transcripts together with the two new points a and b satisfies Conditions (1), (2) and (3) of Lemma 4.1 with appropriate constants (e.g., δn in Condition (3) is at most mn for protocol πn, and mn → 0). After finitely many steps, the binary search algorithm has
to stop and return two adjacent points a and b. Suppose that it stops after s steps; note that s ≤ ⌈log k⌉. Lemma 4.1 then gives the upper bound
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(\pi_n) - \frac{\mu(a)}{2^{s+1}k}\,C_2\, h\Bigl(\frac{\varepsilon}{2}\min\{1,\,C_2 R\}\Bigr) + 3\varepsilon K + h(\delta_n/C_2) \tag{4.58}
\]
for some C2, R, K > 0 (where C2 and R depend on µ) and a sequence δn tending to zero, assuming that
\[
R\varepsilon + (1-\varepsilon)\,\delta_n/C_2 \le 1/2 \quad\text{and}\quad \delta_n/C_2 \le 1/2. \tag{4.59}
\]
By picking a subsequence, we can assume that δn ≤ C2/4 for all n. Lemma 4.1 then applies for all ε ≤ 1/(4R). Taking the limit of the right-hand side of (4.58) as n → ∞, we obtain
\[
\mathrm{IC}_\mu(f,\varepsilon) \le \mathrm{IC}_\mu(f,0) - \frac{\mu(a)}{2^{s+1}k}\,C_2\, h\Bigl(\frac{\varepsilon}{2}\min\{1,\,C_2 R\}\Bigr) + 3\varepsilon K = \mathrm{IC}_\mu(f,0) - \Omega(h(\varepsilon)). \tag{4.60}
\]
4.1.2 Proof of Theorem 3.5
Theorem 3.5 (restated). For all f , µ, ε, we have
\[
\mathrm{IC}_\mu(f,\varepsilon) \ge \mathrm{IC}_\mu(f,0) - 4|\mathcal X||\mathcal Y|\,h(\sqrt{\varepsilon}). \tag{4.61}
\]
Proof of Theorem 3.5. Without loss of generality, assume that µ is a full-support distribution; otherwise we can approximate it by a sequence of full-support distributions and appeal to the continuity of ICν(f, ε) with respect to ν. Consider a protocol π that performs [f, ε]. For every leaf ℓ of π, let zℓ and µℓ respectively denote the output of the leaf and the distribution of the inputs conditioned on reaching ℓ. We will complete π into a protocol π′ that performs [f, 0], as follows.
On input (X, Y):
• Alice and Bob run the protocol π and reach a leaf ℓ;
• For every (x, y) ∈ Ωℓ := {(x, y) : f(x, y) ≠ zℓ}, Alice and Bob verify whether XY = xy, as follows:
  – If µℓ(x) ≤ µℓ(y), Alice reveals whether X = x to Bob, and if yes, Bob reveals whether Y = y to Alice. If XY = xy, they terminate.
  – If µℓ(x) > µℓ(y), Bob initiates the verification process.
Clearly, in the end, either both Alice and Bob have already revealed their inputs to each other, or else they know XY ∉ Ωℓ, and hence zℓ is the correct output. Therefore π′ performs the task [f, 0].
Next we analyze ICµ(π′). Let πℓ,xy denote the sub-protocol that starts with the distribution µℓ and verifies whether XY = xy. In the case when Alice initiates the verification procedure, we have
\[
\mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) \le h(\mu_\ell(x)) + \mu_\ell(x)\,h\!\left(\frac{\mu_\ell(x,y)}{\mu_\ell(x)}\right) \le h(\mu_\ell(x)) + \mu_\ell(x) \le 2h(\mu_\ell(x)), \tag{4.62}
\]
where by an abuse of notation we denote by µℓ(x) the marginal of µℓ on x. We can obtain a similar bound for the case where Bob initiates the process, and hence
\begin{align*}
\mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) &\le 2\min\{h(\mu_\ell(x)),\ h(\mu_\ell(y))\}\\
&= 2h\Bigl(\mu_\ell(x,y) + \min\Bigl\{\Pr_{\mu_\ell}[X\ne x, Y=y],\ \Pr_{\mu_\ell}[X=x, Y\ne y]\Bigr\}\Bigr)\\
&\le 2h(\mu_\ell(x,y)) + 2h\Bigl(\min\Bigl\{\Pr_{\mu_\ell}[X\ne x, Y=y],\ \Pr_{\mu_\ell}[X=x, Y\ne y]\Bigr\}\Bigr)
\end{align*}
by the subadditivity of h. Using the monotonicity of h together with min{a, b} ≤ √(ab), we obtain that
\[
\mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) \le 2h(\mu_\ell(x,y)) + 2h\!\left(\sqrt{\Pr_{\mu_\ell}[X=x, Y\ne y]\,\Pr_{\mu_\ell}[X\ne x, Y=y]}\right) \tag{4.63}
\]
holds for every leaf ℓ and (x, y) ∈ Ωℓ. Let Πℓ,xy denote the transcript of πℓ,xy. Since πℓ,xy is a deterministic protocol, we have Hµℓ(Πℓ,xy | XY) = 0, and thus
\[
\mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) = I(\Pi_{\ell,xy}; Y\mid X) + I(\Pi_{\ell,xy}; X\mid Y) = H_{\mu_\ell}(\Pi_{\ell,xy}\mid X) + H_{\mu_\ell}(\Pi_{\ell,xy}\mid Y). \tag{4.64}
\]
Thus the subadditivity of entropy implies that the information cost of running all the protocols πℓ,xy (for all (x, y) ∈ Ωℓ) is bounded by the sum of their individual information costs. Let ℓ be a leaf of π sampled by running π on a random input. By (4.63),
\begin{align*}
\mathrm{IC}_\mu(\pi') - \mathrm{IC}_\mu(\pi) &\le \mathop{\mathbb{E}}_\ell \sum_{xy\in\Omega_\ell} \mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) = \sum_{(x,y)\in\mathcal X\times\mathcal Y} \mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\,\mathrm{IC}_{\mu_\ell}(\pi_{\ell,xy}) \tag{4.65}\\
&\le \sum_{(x,y)\in\mathcal X\times\mathcal Y} 2\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\, h(\mu_\ell(x,y)) \tag{4.66}\\
&\qquad+ \sum_{(x,y)\in\mathcal X\times\mathcal Y} 2\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\, h\!\left(\sqrt{\Pr_{\mu_\ell}[X=x,Y\ne y]\,\Pr_{\mu_\ell}[X\ne x,Y=y]}\right) \tag{4.67}\\
&\le \sum_{(x,y)\in\mathcal X\times\mathcal Y} 2h\Bigl(\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\,\mu_\ell(x,y)\Bigr) \tag{4.68}\\
&\qquad+ \sum_{(x,y)\in\mathcal X\times\mathcal Y} 2h\!\left(\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\sqrt{\Pr_{\mu_\ell}[X=x,Y\ne y]\,\Pr_{\mu_\ell}[X\ne x,Y=y]}\right), \tag{4.69}
\end{align*}
where we used the concavity of h in the last step.
For the first summand, we have that for every (x, y),
\begin{align*}
\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\,\mu_\ell(x,y) &= \sum_\ell \Pr[XY = xy,\ \pi\text{ reaches }\ell]\,\mathbf 1_{z_\ell\ne f(x,y)} \tag{4.70}\\
&= \sum_\ell \Pr[\pi\text{ reaches }\ell\mid XY = xy]\,\mu(xy)\,\mathbf 1_{z_\ell\ne f(x,y)} \tag{4.71}\\
&= \mu(xy)\sum_\ell \Pr[\pi_{x,y}\text{ reaches }\ell]\,\mathbf 1_{z_\ell\ne f(x,y)} = \mu(xy)\,\Pr[\pi(x,y)\ne f(x,y)] \tag{4.72}\\
&\le \mu(xy)\,\varepsilon \le \varepsilon, \tag{4.73}
\end{align*}
where we used that by definition µℓ(xy) = Pr[XY = xy | π reaches ℓ], and the fact that the protocol π performs the task [f, ε].
Now consider the second summand in (4.69). Since µℓ is obtained by scaling rows and columns of µ, for every x′ ≠ x, y′ ≠ y, we have (recall that we assumed µ has full support)
\[
\frac{\Pr_\mu[X=x,Y=y]\,\Pr_\mu[X=x',Y=y']}{\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y']} = \frac{\Pr_{\mu_\ell}[X=x,Y=y]\,\Pr_{\mu_\ell}[X=x',Y=y']}{\Pr_{\mu_\ell}[X=x',Y=y]\,\Pr_{\mu_\ell}[X=x,Y=y']}, \tag{4.74}
\]
hence
\begin{align*}
\Pr_{\mu_\ell}[X=x,Y\ne y]\,\Pr_{\mu_\ell}[Y=y,X\ne x] &= \Bigl(\sum_{y'\ne y}\Pr_{\mu_\ell}[X=x,Y=y']\Bigr)\cdot\Bigl(\sum_{x'\ne x}\Pr_{\mu_\ell}[X=x',Y=y]\Bigr)\\
&= \sum_{x'\ne x,\,y'\ne y}\Pr_{\mu_\ell}[X=x,Y=y']\,\Pr_{\mu_\ell}[X=x',Y=y]\\
&= \sum_{x'\ne x,\,y'\ne y}\frac{\Pr_{\mu_\ell}[X=x,Y=y]\,\Pr_{\mu_\ell}[X=x',Y=y']}{\Pr_\mu[X=x,Y=y]\,\Pr_\mu[X=x',Y=y']}\,\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y']\\
&= \frac{\Pr_{\mu_\ell}[X=x,Y=y]}{\Pr_\mu[X=x,Y=y]}\sum_{x'\ne x,\,y'\ne y}\frac{\Pr_{\mu_\ell}[X=x',Y=y']}{\Pr_\mu[X=x',Y=y']}\,\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y'].
\end{align*}
Define
\[
a_\ell := \mathbf 1_{z_\ell\ne f(x,y)}\,\frac{\Pr_{\mu_\ell}[X=x,Y=y]}{\Pr_\mu[X=x,Y=y]} \tag{4.75}
\]
and
\[
b_\ell := \sum_{x'\ne x,\,y'\ne y}\frac{\Pr_{\mu_\ell}[X=x',Y=y']}{\Pr_\mu[X=x',Y=y']}\,\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y']. \tag{4.76}
\]
Then
\[
\mathbf 1_{z_\ell\ne f(x,y)}\,\Pr_{\mu_\ell}[X=x,Y\ne y]\,\Pr_{\mu_\ell}[Y=y,X\ne x] = a_\ell b_\ell. \tag{4.77}
\]
Now for aℓ, by (4.73) we have
\[
\mathop{\mathbb{E}}_\ell a_\ell = \frac{1}{\mu(xy)}\mathop{\mathbb{E}}_\ell\,\mathbf 1_{z_\ell\ne f(x,y)}\,\mu_\ell(x,y) = \Pr[\pi(x,y)\ne f(x,y)] \le \varepsilon, \tag{4.78}
\]
and for bℓ, by the fact that E_ℓ Pr_{µℓ}[X = x′, Y = y′] = Pr_µ[X = x′, Y = y′], we have
\begin{align*}
\mathop{\mathbb{E}}_\ell b_\ell &= \sum_{x'\ne x,\,y'\ne y}\frac{\mathop{\mathbb{E}}_\ell \Pr_{\mu_\ell}[X=x',Y=y']}{\Pr_\mu[X=x',Y=y']}\,\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y']\\
&= \sum_{x'\ne x,\,y'\ne y}\Pr_\mu[X=x',Y=y]\,\Pr_\mu[X=x,Y=y']\\
&= \Pr_\mu[X=x,Y\ne y]\,\Pr_\mu[Y=y,X\ne x] \le 1.
\end{align*}
Hence we can bound the second summand in (4.69) using the Cauchy–Schwarz inequality:
\[
\mathop{\mathbb{E}}_\ell \sqrt{a_\ell b_\ell} \le \sqrt{\mathop{\mathbb{E}}_\ell a_\ell}\,\sqrt{\mathop{\mathbb{E}}_\ell b_\ell} \le \sqrt{\varepsilon}. \tag{4.79}
\]
Using (4.69), (4.73), (4.79), and the monotonicity of h, we have
\[
\mathrm{IC}_\mu(f,0) - \mathrm{IC}_\mu(\pi) \le \mathrm{IC}_\mu(\pi') - \mathrm{IC}_\mu(\pi) \le 2|\mathcal X\times\mathcal Y|\,h(\varepsilon) + 2|\mathcal X\times\mathcal Y|\,h(\sqrt{\varepsilon}) \le 4|\mathcal X\times\mathcal Y|\,h(\sqrt{\varepsilon}).
\]
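A quick numeric sanity check (not from the paper) of the Cauchy–Schwarz step (4.79): for nonnegative stand-ins for the aℓ and bℓ under arbitrary leaf weights, E[√(ab)] never exceeds √(E[a]·E[b]).

```python
import math
import random

random.seed(0)
n = 1000
w = [random.random() for _ in range(n)]
total = sum(w)
w = [x / total for x in w]               # probability weights over leaves
a = [random.random() for _ in range(n)]  # stand-ins for the a_l
b = [random.random() for _ in range(n)]  # stand-ins for the b_l
lhs = sum(wi * math.sqrt(ai * bi) for wi, ai, bi in zip(w, a, b))
rhs = math.sqrt(sum(wi * ai for wi, ai in zip(w, a)) *
                sum(wi * bi for wi, bi in zip(w, b)))
print(lhs <= rhs)
```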
4.1.3 Proof of Proposition 3.4
Proposition 3.4 (restated). Let µ be the distribution defined as
\[
\mu = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}. \tag{4.80}
\]
Then IC^ext_µ(XOR, ε) ≥ IC^ext_µ(XOR, 0) − 3ε.
Proof of Proposition 3.4. The distribution µ is supported on the inputs (0, 0) and (1, 1), on which the output is 0. It is easy to check (and it follows from the analysis below) that IC^ext_µ(XOR, 0) = 1, since at the end of any protocol that performs [XOR, 0], we know whether the input is (0, 0) or (1, 1).
Consider a protocol π having at most ε error on every input, where ε ≤ 1/3. Let Lz be the set of transcripts on which the output is z; every transcript is either in L0 or L1.
For each transcript t achievable from the initial distribution, the distribution of XY | t is of the form
\[
\begin{pmatrix} p & 0 \\ 0 & 1-p \end{pmatrix}
\]
for some p = p(t). Bayes' law shows that
\[
\Pr[t\mid 00] = \frac{\Pr[00\mid t]\Pr[t]}{\Pr[00]} = 2p(t)\Pr[t], \qquad \Pr[t\mid 11] = \frac{\Pr[11\mid t]\Pr[t]}{\Pr[11]} = 2(1-p(t))\Pr[t]. \tag{4.81}
\]
For each transcript t, the rectangle property says Pr[t | 00] Pr[t | 11] = Pr[t | 10] Pr[t | 01]. Hence
\[
\frac{\Pr[t\mid 01] + \Pr[t\mid 10]}{2} \ge \sqrt{\Pr[t\mid 01]\Pr[t\mid 10]} = \sqrt{\Pr[t\mid 00]\Pr[t\mid 11]} = 2\sqrt{p(t)(1-p(t))}\,\Pr[t]. \tag{4.82}
\]
The protocol π has distributional error at most ε, and so
\[
\Pr[L_1] = \sum_{t\in L_1}\Pr[t] \le \varepsilon, \qquad\text{and}\qquad \Pr[L_0] = \sum_{t\in L_0}\Pr[t] \ge 1-\varepsilon. \tag{4.83}
\]
On the other hand, since π has point-wise error at most ε, we have
\[
\sum_{t\in L_0}\sqrt{p(t)(1-p(t))}\,\Pr[t] \le \frac{1}{2}\sum_{t\in L_0}\frac{\Pr[t\mid 01]+\Pr[t\mid 10]}{2} \le \frac{\varepsilon}{2}. \tag{4.84}
\]
Finally,
\[
I(XY;\Pi) = H(XY) - H(XY\mid\Pi) = 1 - \sum_t \Pr[t]\,h(p(t)). \tag{4.85}
\]
t
Let T be a random transcript conditioned on belonging to L0, and consider the random variable P := p(T). On the one hand,
\[
1 - I(XY;\Pi) = \sum_t \Pr[t]\,h(p(t)) \le \Pr[L_0]\,\mathbb{E}[h(P)] + \Pr[L_1] \le \mathbb{E}[h(P)] + \varepsilon. \tag{4.86}
\]
On the other hand, by (4.84),
\[
\mathbb{E}\bigl[\sqrt{P(1-P)}\bigr] \le \frac{\varepsilon}{2\Pr[L_0]} \le \frac{\varepsilon}{2(1-\varepsilon)} \le \varepsilon, \tag{4.87}
\]
as we assumed ε ≤ 1/3. Thus it suffices to verify that E[h(P)] ≤ 2ε for any random variable P that takes values in [0, 1] and satisfies E[√(P(1 − P))] ≤ ε. Indeed, this would imply
\[
1 - I(XY;\Pi) \le \mathbb{E}[h(P)] + \varepsilon \le 3\varepsilon, \tag{4.88}
\]
or equivalently IC^ext_µ(XOR, ε) ≥ 1 − 3ε for all ε ≤ 1/3, which in turn shows that IC^ext_µ(XOR, 0) = 1.
Apply the change of variable Q = √(P(1 − P)), so that the assumption simplifies to E[Q] ≤ ε; note that 0 ≤ Q ≤ 1/2, and P = (1 ± √(1 − 4Q²))/2. Since h(P) = h(1 − P), we conclude that
\[
\mathbb{E}[h(P)] = \mathbb{E}[\varphi(Q)], \qquad\text{where } \varphi(Q) = h\!\left(\frac{1+\sqrt{1-4Q^2}}{2}\right). \tag{4.89}
\]
It is routine to check that the function φ is monotonically increasing and strictly convex. Since φ is continuous and the domain of Q is restricted to [0, 1/2], the maximum of E[φ(Q)] under the constraint E[Q] ≤ ε is achieved.³ Since φ is increasing, the maximum value of E[φ(Q)] is achieved when E[Q] = ε. Since φ is strictly convex, the maximum value of E[φ(Q)] is achieved on a measure supported on the endpoints 0, 1/2. Thus this measure must be Pr[Q = 1/2] = 2ε and Pr[Q = 0] = 1 − 2ε. So
\[
\mathbb{E}[h(P)] = \mathbb{E}[\varphi(Q)] \le (1-2\varepsilon)\varphi(0) + 2\varepsilon\varphi(1/2) = 2\varepsilon. \tag{4.90}
\]
³ This follows from Prokhorov's theorem, which implies that the set of probability measures over [0, 1/2] is compact with respect to the weak-* topology. The same result also follows from the Riesz representation theorem [28].
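The claimed properties of φ are easy to verify numerically (a sketch, not from the paper): φ is increasing and convex on [0, 1/2], with φ(0) = 0 and φ(1/2) = 1, which is exactly what the extreme-point argument above uses.

```python
import math

def h(p):
    # binary entropy in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def phi(q):
    # phi from (4.89); max(...) guards against tiny negative rounding at q = 1/2
    return h((1 + math.sqrt(max(0.0, 1 - 4 * q * q))) / 2)

qs = [i / 1000 * 0.5 for i in range(1001)]       # grid on [0, 1/2]
vals = [phi(q) for q in qs]
increasing = all(v1 <= v2 + 1e-12 for v1, v2 in zip(vals, vals[1:]))
convex = all(vals[i - 1] + vals[i + 1] >= 2 * vals[i] - 1e-9
             for i in range(1, len(vals) - 1))   # midpoint convexity on the grid
print(increasing, convex, phi(0.0), phi(0.5))
```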
4.2 Information complexity with distributional error
Theorem 3.6 (restated). Let µ be a probability measure on X × Y, and let f : X × Y → Z satisfy
ICµ ( f , µ, 0) > 0. We have
\[
\mathrm{IC}_\mu(f,\mu,0) - 4|\mathcal X||\mathcal Y|\,h\bigl(\sqrt{\varepsilon/\alpha}\bigr) \le \mathrm{IC}_\mu(f,\mu,\varepsilon) \le \mathrm{IC}_\mu(f,\mu,0) - \frac{\alpha^2}{4}\,h(\varepsilon\alpha/4) + 3\varepsilon\log|\mathcal X\times\mathcal Y|, \tag{4.91}
\]
where α = min_{xy ∈ supp µ} µ(x, y).
Proof of Theorem 3.6. Lower bound: The proof is almost identical to that of Theorem 3.5, except that now the distribution µ may not have full support. Consider a protocol π that performs [f, µ, ε], and define zℓ and µℓ as in the proof of Theorem 3.5. The new protocol π′ that performs [f, µ, 0] is defined like the one in the proof of Theorem 3.5, with the only difference that the verification is performed only on the set
\[
\Omega'_\ell := \{(x, y) : f(x, y) \ne z_\ell\} \cap \operatorname{supp}\mu. \tag{4.92}
\]
Obviously π′ solves [f, µ, 0]. Note that π has point-wise error at most ε/α on every point in supp µ. Thus the same analysis as in Theorem 3.5 shows
\[
\mathrm{IC}_\mu(f,\mu,0) - \mathrm{IC}_\mu(\pi) \le \mathrm{IC}_\mu(\pi') - \mathrm{IC}_\mu(\pi) \le 4|\mathcal X\times\mathcal Y|\,h\bigl(\sqrt{\varepsilon/\alpha}\bigr). \tag{4.93}
\]
Upper bound: For every z ∈ Z, let Xz denote the set of all x ∈ X such that f(x, y) = z for some (x, y) ∈ supp µ. Similarly, let Yz denote the set of all y ∈ Y such that f(x, y) = z for some (x, y) ∈ supp µ. The assumption ICµ(f, µ, 0) > 0 implies the existence of distinct z₁, z₂ ∈ Z such that either Xz₁ ∩ Xz₂ ≠ ∅ or Yz₁ ∩ Yz₂ ≠ ∅; otherwise, Alice and Bob can exchange the unique values of z determined by their inputs, and since with probability 1 these two values coincide, they can perform [f, µ, 0] with zero information cost. Hence, without loss of generality, assume that there exist x₀y, x₁y ∈ supp µ such that f(x₀, y) ≠ f(x₁, y) and µ(x₀y) ≥ µ(x₁y). We will apply Lemma 4.1. Consider a protocol π with transcript Π that performs [f, µ, 0], and define the set of transcripts

L := {t | Pr[x₀y | t] ≥ Pr[x₀y]/2},

and note that

Pr[x₀y] = Σ_t Pr[x₀y | t] Pr[Π = t] ≤ Pr[Π ∈ L] + Pr[Π ∉ L] · Pr[x₀y]/2,

which implies

Pr[Π ∈ L] ≥ Pr[x₀y]/2 ≥ α/2.
Note that the protocol π′ defined in Figure 1 performs [f, µ, ε]. Furthermore, we can set C₁ = C₂ = α/2 and δ = 0 to obtain

ICµ(π′) ≤ ICµ(π) − (α²/4) h(εα/4) + 3ε log|X × Y|,

for ε ≤ 1/2. As

−(α²/4) h(εα/4) + 3ε log|X × Y| ≥ 0

for ε ≥ 1/2, this finishes the proof for all 0 ≤ ε ≤ 1.
4.3 Non-distributional prior-free information cost
In this section we prove Theorem 3.15, that is,

IC(f, ε) ≤ IC(f, 0) − Ω(h(ε)).   (4.94)

First we present some lemmas; the proof of Theorem 3.15 appears at the end of this section.
While Theorem 3.2 does not give a uniform bound on the parameters C, ε₀ for every distribution µ, it does for distributions in which there exist two elements with different outputs that lie in the same row or column and whose probabilities are Ω(1). We will show that for any non-constant function, the worst distribution is of this form; this might be of independent interest.
We start with the following simple lemma.
Lemma 4.2. Let f : X × Y → Z. Suppose that supp µ ⊆ ⋃ᵢ Xᵢ × Yᵢ, where the Xᵢ are disjoint and the Yᵢ are disjoint. Then

ICµ(f, 0) = Σᵢ µ(Xᵢ × Yᵢ) ICµ|Xᵢ×Yᵢ(f|Xᵢ×Yᵢ).   (4.95)
Proof. The upper bound is easy to see: the players exchange which block they are in, and assuming that they are in the same block, they run an almost optimal protocol for that block. If they are not in the same block, then they exchange inputs, but this happens with probability zero.
In the other direction, let J be the block in which Alice's input lies. Since the value of J is determined by the value of X, for a protocol π with transcript Π we have

I(Y; Π | X) = I(Y; Π | XJ) = Σⱼ Pr[J = j] I(Y; Π | X, J = j) = Σⱼ µ(Xⱼ × Yⱼ) I(Y; Π | X, J = j).   (4.96)

With probability 1, J is also the block in which Bob's input lies, and so

ICµ(π) = Σⱼ µ(Xⱼ × Yⱼ)[I(X; Π | Y, J = j) + I(Y; Π | X, J = j)] ≥ Σⱼ µ(Xⱼ × Yⱼ) ICµ|Xⱼ×Yⱼ(f|Xⱼ×Yⱼ).   (4.97)
We can therefore restrict our attention (for now) to distributions supported on a single block. The crucial observation is the following.
Lemma 4.3. Let f : X × Y → Z, and let µ be a distribution such that f is constant on its support, each atom in the support has probability at least α, and the marginals of the support are X, Y. If f is not constant, then there is a distribution ν such that ICν(f, 0) ≥ ICµ(f, 0) + C(α), where C(α) > 0 depends only on α, |X|, |Y|.
Proof. Let (x0 , y0 ) be any point not in the support of µ such that f (x0 y0 ) is different from the constant
value of f on supp µ. Since the marginals of the support are X, Y and every atom in the support has
probability at least α, we see that Pr[X = x0 ], Pr[Y = y0 ] ≥ α.
Let ν = εδx₀y₀ + (1 − ε)µ, where ε is a parameter to be determined later, and δx₀y₀ denotes the Dirac measure concentrated on the point (x₀, y₀). Note that X′Y′ ∼ ν can be sampled in the following manner. First we pick XY ∼ µ and an independent Bernoulli random variable B with Pr[B = 1] = ε. Then

X′Y′ = XY if B = 0, and X′Y′ = x₀y₀ if B = 1.   (4.98)
Let π be a protocol that performs the task [f, 0], and let Πxy denote the transcript of this protocol when it is run on the input xy. Note that with probability 1, the value of B is determined by the value of X′Y′, and thus

I(X′; ΠX′Y′ | Y′) = I(X′B; ΠX′Y′ | Y′) = I(B; ΠX′Y′ | Y′) + I(X′; ΠX′Y′ | Y′B) = I(B; ΠX′Y′ | Y′) + (1 − ε)I(X; ΠXY | Y).

Moreover, since f(x₀, y₀) is different from the constant value of f on the support of µ, the value of B is determined by ΠX′Y′. Thus I(B; ΠX′Y′ | Y′) = H(B | Y′), and

I(X′; ΠX′Y′ | Y′) = H(B | Y′) + (1 − ε)I(X; ΠXY | Y).   (4.99)
To lower-bound H(B | Y′), note that

Pr[B = 1 | Y′ = y₀] = Pr[B = 1, Y′ = y₀]/Pr[Y′ = y₀] = ε/((1 − ε) Pr[Y = y₀] + ε) ≥ ε,   (4.100)

and on the other hand,

Pr[B = 1 | Y′ = y₀] ≤ ε/((1 − ε)α + ε),   (4.101)

which for ε ≤ √α/2 is at most 1 − ε. Since Pr[Y′ = y₀] = (1 − ε) Pr[Y = y₀] + ε ≥ α, we conclude that H(B | Y′) ≥ αh(ε). We deduce that

I(X′; ΠX′Y′ | Y′) ≥ αh(ε) + (1 − ε)I(X; ΠXY | Y) ≥ I(X; ΠXY | Y) + αh(ε) − ε log|X × Y|.   (4.102)
The gain is

I(X′; ΠX′Y′ | Y′) − I(X; ΠXY | Y) ≥ αε log(1/ε) − ε log|X × Y| = (α log(1/ε) − log|X × Y|)ε,   (4.103)

and so when ε ≤ ε₀ := |X × Y|^(−2/α), the gain is at least ε log|X × Y|. Taking ε = min(ε₀, √α/2), we obtain a constant C(α) > 0, depending on |X × Y|, such that

I(X′; ΠX′Y′ | Y′) ≥ I(X; ΠXY | Y) + C(α),   (4.104)

and similarly

I(Y′; ΠX′Y′ | X′) ≥ I(Y; ΠXY | X) + C(α).

This shows that

ICν(f, 0) ≥ ICµ(f, 0) + 2C(α).   (4.105)
We obtain the following important consequence.
Lemma 4.4. Let f : X × Y → Z be a non-constant function. There exist constants c, δ > 0, depending only on the function f, such that for every µ with ICµ(f, 0) ≥ IC(f, 0) − δ, there exist points P, Q on the same row or column such that µ(P), µ(Q) ≥ c and f(P) ≠ f(Q).
Proof. Call a distribution ν on X × Y optimal if IC( f , 0) = ICν ( f , 0). Braverman et al. [6] showed that
ICν ( f , 0) is continuous in ν, and this implies that optimal distributions exist, and moreover the set of
optimal distributions is closed. It is also convex, due to the concavity of ICν ( f , 0) (see [5]).
For a distribution ν, let β (ν) be the maximal value β such that there exist two points P, Q, on the
same row or column, such that ν(P), ν(Q) ≥ β and f (P) 6= f (Q). Note that β (ν) is continuous in ν.
Suppose that β(ν) = 0. For z ∈ Z, let Xz be the set of rows on which some point P ∈ supp ν satisfies f(P) = z, and define Yz analogously. We claim that the sets Xz for z ∈ Z are disjoint, and similarly that the Yz are disjoint. Indeed, if x ∈ Xz₁ ∩ Xz₂, then the row x contains two points P, Q in the support such that f(P) ≠ f(Q), and so β(ν) > 0. Next we show that supp ν ⊆ ⋃_z Xz × Yz. Indeed, if P ∈ Xz₁ × Yz₂ is in the support of ν and f(P) ≠ z₁, then there exists some point Q on the same row as P which is in the support and satisfies f(Q) = z₁, showing that β(ν) > 0; a similar conclusion is reached if f(P) ≠ z₂.
We will show that ν is not optimal. Since f is non-constant, IC(f, 0) > 0, hence we can assume that ICν(f, 0) > 0. From the characterization of trivial measures (Theorem 3.17) there exists a block Xz × Yz on which f is not constant. Lemma 4.3 shows that we can modify the component of ν on that block so as to increase its information complexity, and Lemma 4.2 shows that this increases the information complexity over the entire domain. This implies that ν is not optimal, as required.
For ρ ≥ 0, let Oρ = {ν : ICν(f, 0) ≥ IC(f, 0) − ρ}. Continuity of ICν(f, 0) shows that Oρ is closed, and hence compact. We define b(ρ) = inf{β(ν) : ν ∈ Oρ}; since β is continuous and Oρ is compact, the infimum is attained. In view of the preceding paragraph, b(0) > 0. Continuity of β(ν) and ICν(f, 0) shows that b(ρ) is continuous as well, and so b(δ) > 0 for some δ > 0. The proof is complete by taking c = b(δ).
We can now apply Theorem 3.2 to deduce that IC( f , ε) ≤ IC( f , 0) − Ω(h(ε)).
Theorem 3.15 (restated). If f : X × Y → Z is non-constant then

IC(f, ε) ≤ IC(f, 0) − C(f)h(ε),   (4.106)

where C(f) > 0.
Proof. Let c, δ be the parameters from Lemma 4.4. For a distribution µ, either ICµ(f, 0) ≤ IC(f, 0) − δ, or Theorem 3.2 shows that

ICµ(f, ε) ≤ ICµ(f, 0) − (c³/32)h(ε) ≤ IC(f, 0) − (c³/32)h(ε)

for all ε ≤ ε₀, where ε₀ depends only on c and |X × Y|. Choose ε sufficiently small that (c³/32)h(ε) ≤ δ and ε ≤ ε₀; in the first case ICµ(f, ε) ≤ ICµ(f, 0) ≤ IC(f, 0) − δ ≤ IC(f, 0) − (c³/32)h(ε), so in both cases we conclude that ICµ(f, ε) ≤ IC(f, 0) − C(f)h(ε).
4.4 A characterization of trivial measures
First we present the proof of the external case, i. e., Theorem 3.20, as it is simpler.
Theorem 3.20 (restated). Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y.
The distribution µ is external-trivial iff it is strongly external-trivial iff it is structurally external-trivial.
Proof of Theorem 3.20. If µ is external-trivial then µ is structurally external-trivial. Suppose that
µ is external-trivial but not structurally external-trivial. We will reach a contradiction.
We start by showing that if µ is external-trivial then f has to be constant on the support of µ. Indeed, suppose that the protocol π computes f correctly, and denote by Π the transcript of π. The data processing inequality shows that

I(Π; XY) ≥ I(Π; f(XY)) = H(f(XY)) − H(f(XY) | Π) = H(f(XY)).   (4.107)

This shows that µ can only be external-trivial if H(f(XY)) = 0, that is, if f is constant on the support of µ. From now on, we assume that this is indeed the case.
Let ab be an arbitrary point in the support of µ, and let c = f(ab). Since µ is not structurally external-trivial, there must be some input x₀y₀ ∈ SA × SB for which f(x₀y₀) ≠ c. Note that x₀y₀ is not in the support of µ. Since x₀ ∈ SA, x₀y₁ is in the support of µ for some y₁ ∈ SB. Similarly, x₁y₀ is in the support of µ for some x₁ ∈ SA.
Since µ is external-trivial, there is a sequence πₙ of protocols computing f correctly on every input such that I(XY; Πₙ) → 0, where XY ∼ µ. We think of πₙ also as a distribution over transcripts t. Since f(XY) = c with probability 1, if πₙ(t) > 0 then the transcript t indicates that the output is c. Let pₙ be the joint distribution of X, Y, t. Recall that

D(pₙ(x, y, t) ‖ µ(x, y)πₙ(t)) = I(XY; Πₙ),

hence D(pₙ(x, y, t) ‖ µ(x, y)πₙ(t)) → 0.
For two distributions µ and ν on a finite space, Pinsker's inequality states that

D(µ ‖ ν) ≥ (1/2)‖µ − ν‖₁².
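Pinsker's inequality in this form (with the divergence taken in nats) is easy to sanity-check numerically; the following is only an illustrative sketch, not part of the argument:

```python
import math
import random

def kl(mu, nu):
    """KL divergence D(mu || nu) in nats; assumes supp(mu) is contained in supp(nu)."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def l1(mu, nu):
    """L1 distance between two distributions given as probability vectors."""
    return sum(abs(m - n) for m, n in zip(mu, nu))

def random_dist(k):
    w = [random.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    mu, nu = random_dist(5), random_dist(5)
    # Pinsker: D(mu || nu) >= (1/2) * ||mu - nu||_1^2
    assert kl(mu, nu) >= 0.5 * l1(mu, nu) ** 2 - 1e-12
```

(The inequality only gets stronger if D is measured in bits, since log₂ e > 1.)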
This implies that ‖pₙ(x, y, t) − µ(x, y)πₙ(t)‖₁ → 0. On the other hand, for every transcript t appearing with positive probability, either pₙ(x₀, y₁, t) = 0 or pₙ(x₁, y₀, t) = 0: otherwise pₙ(x₀, y₀, t) > 0 (due to the rectangular property of protocols), contradicting the correctness of πₙ (since f(x₀y₀) ≠ c). Therefore

|µ(x₀, y₁)πₙ(t) − pₙ(x₀, y₁, t)| + |µ(x₁, y₀)πₙ(t) − pₙ(x₁, y₀, t)| ≥ πₙ(t) min(µ(x₀, y₁), µ(x₁, y₀)).   (4.108)

Summing over all transcripts having positive probability, we deduce that

‖pₙ(x, y, t) − µ(x, y)πₙ(t)‖₁ ≥ Σ_t πₙ(t) min(µ(x₀, y₁), µ(x₁, y₀)) = min(µ(x₀, y₁), µ(x₁, y₀)),   (4.109)

contradicting our assumption that ‖pₙ(x, y, t) − µ(x, y)πₙ(t)‖₁ → 0.
If µ is structurally external-trivial then µ is strongly external-trivial. Consider the following protocol. Alice tells Bob whether her input is in SA. Bob tells Alice whether his input is in SB. If the input is in SA × SB, then the output is known. Otherwise, the players reveal their inputs (but this happens with probability zero). It is not difficult to check that this protocol has zero external information cost.
If µ is strongly external-trivial then µ is external-trivial. This is obvious.
We comment that our proof gives an explicit lower bound on IC^ext_µ(f, 0) whenever µ is not external-trivial.
Next we present the proof of Theorem 3.17, showing that all our definitions of internal triviality are
equivalent. As before, we can get an explicit lower bound on ICµ ( f , 0) whenever µ is not internal-trivial.
Theorem 3.17 (restated). Let f : X × Y → Z be an arbitrary function, and µ a distribution on X × Y.
The distribution µ is internal-trivial iff it is strongly internal-trivial iff it is structurally internal-trivial.
Proof of Theorem 3.17. If µ is internal-trivial then µ is structurally internal-trivial. Suppose that
µ is internal-trivial but not structurally internal-trivial. We will reach a contradiction.
Since µ is internal-trivial, there is a sequence of protocols πn such that I(X; Πn | Y ) + I(Y ; Πn | X) → 0.
In particular,
I(X; Πn | Y ), I(Y ; Πn | X) → 0.
Moreover, for every x ∈ SA and for every y ∈ SB ,
I(X; Πn | Y = y), I(Y ; Πn | X = x) → 0.
Let pₙ(x, y, t) be the joint probability of the input being (x, y) and the transcript of πₙ being t. We also think of πₙ as a distribution over transcripts. As in the proof of Theorem 3.20, using Pinsker's inequality we deduce that for all y ∈ SB,

‖pₙ(x, t | y) − µ(x | y)πₙ(t | y)‖₁ → 0,

and so for all y ∈ SB,

By := Σ_{x,t} |pₙ(x, y, t) − µ(x, y)πₙ(t | y)| → 0.   (4.110)

Similarly, for all x ∈ SA we have

Ax := Σ_{y,t} |pₙ(x, y, t) − µ(x, y)πₙ(t | x)| → 0.   (4.111)
According to Lemma 3.19, there exists a connected component C of Gµ such that f is not constant on CA × CB. Suppose first that there is an edge (P, Q) on which f is not constant. Without loss of generality, assume P = (a, y₀) and Q = (a, y₁). Thus

Σ_t |pₙ(a, y₀, t) − µ(a, y₀)πₙ(t | a)| + |pₙ(a, y₁, t) − µ(a, y₁)πₙ(t | a)| → 0.   (4.112)
On the other hand, for each transcript t either pₙ(a, y₀, t) = 0 or pₙ(a, y₁, t) = 0, since f(ay₀) ≠ f(ay₁). Thus

Σ_t |pₙ(a, y₀, t) − µ(a, y₀)πₙ(t | a)| + |pₙ(a, y₁, t) − µ(a, y₁)πₙ(t | a)| ≥ Σ_t πₙ(t | a) min(µ(a, y₀), µ(a, y₁)) = min(µ(a, y₀), µ(a, y₁)),

contradicting the assumption that the left-hand side tends to zero.
Suppose next that f is constant across all edges (and so on the entire connected component), say f(x, y) = c for all (x, y) ∈ C. Since f is not monochromatic on CA × CB, there must exist a point P ∈ CA × CB such that f(P) ≠ c. There are points PA, PB ∈ supp µ in the same row and column (respectively) as P. Since PA, PB are in the same connected component, there is some path PA = Q₀, Q₁, …, Qₘ = PB connecting them: for every i < m, the points Qᵢ, Qᵢ₊₁ are either in the same row or in the same column. We can assume that m ≤ M := |X| + |Y|. No transcript can have positive probability for both Q₀ and Qₘ, since otherwise it would have positive probability for P as well (by the rectangular property of protocols), and this cannot happen since f(Q₀) = f(Qₘ) = c while f(P) ≠ c.
Let t be any transcript satisfying pₙ(Q₀, t) > 0. Since pₙ(Qₘ, t) = 0, there must be an index i such that

pₙ(t | Qᵢ) − pₙ(t | Qᵢ₊₁) ≥ pₙ(t | Q₀)/m ≥ pₙ(t | Q₀)/M.
Assume without loss of generality that Qᵢ = (a, y₀) and Qᵢ₊₁ = (a, y₁). The contribution of t to Aa is

|µ(a, y₀)πₙ(t | a) − pₙ(a, y₀, t)| + |µ(a, y₁)πₙ(t | a) − pₙ(a, y₁, t)|
= µ(a, y₀)|πₙ(t | a) − pₙ(t | a, y₀)| + µ(a, y₁)|πₙ(t | a) − pₙ(t | a, y₁)|
≥ (min(µ(a, y₀), µ(a, y₁))/M) pₙ(t | Q₀) ≥ (min(µ(a, y₀), µ(a, y₁))/M) pₙ(Q₀, t),

using the triangle inequality in the form |α − γ| + |γ − β| ≥ |α − β|.
Denoting by δ the minimum of µ(x, y) over the support of µ, we conclude that Σ_x Ax + Σ_y By is at least

Σ_t (δ/M) pₙ(Q₀, t) = (δ/M) µ(Q₀) ≥ δ²/M,   (4.113)

contradicting our assumption that

Σ_x Ax + Σ_y By → 0.
If µ is structurally internal-trivial then µ is strongly internal-trivial. Consider the following
protocol. Alice tells Bob which block Xi her input belongs to. Bob tells Alice which block Yi his input
belongs to. If the input is in Xi × Yi , then the output is known. Otherwise, the players reveal their inputs
(but this happens with probability zero). It is not difficult to check that this protocol has zero internal information cost.
If µ is strongly internal-trivial then µ is internal-trivial. This is obvious.
5 Parametrization of all distributions as product distributions
In Section 2.5 we discussed how a communication protocol can be interpreted as a random walk on the set of distributions on X × Y. Every time a player sends a signal, we update the underlying distribution based on the information provided by the sent signal. These updates scale either the X marginal or the Y marginal of the distribution. This restricted way in which the underlying distribution can be updated will allow us to parametrize the set of all distributions reachable from a specific distribution µ in such a way that the changes are captured by product measures. First note that each reachable distribution µ′ can be identified with the constants that multiplied µ to obtain µ′.
To formalize this intuition, we have the following definition.
Definition 5.1. For two distributions µ, ν ∈ ∆(X, Y), define

µ ⊙ ν := (µ · ν)/⟨µ, ν⟩,   (5.1)

where µ · ν is the usual point-wise product of the two measures.
Clearly, µ ⊙ ν ∈ ∆(X, Y) unless ⟨µ, ν⟩ = 0, in which case the product is undefined. For our purposes, we will consider decompositions of the form µ = ν ⊙ µ̄, where µ̄ is a product measure. The statement "µ is a distribution obtained from ν by scaling its rows and columns" is equivalent to "there exists a product measure µ̄ such that µ = ν ⊙ µ̄." Note that if µ̄ is the uniform distribution, then ν = µ̄ ⊙ ν for all distributions ν.
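The operation of Definition 5.1 (a normalized pointwise product) and its row/column-scaling characterization are easy to sketch in code. The following is a small illustrative implementation (the operation is written as `odot`, and distributions are dicts over X × Y; the specific numbers are arbitrary):

```python
def odot(mu, nu):
    """Normalized pointwise product of two distributions on X x Y (Definition 5.1)."""
    inner = sum(mu[k] * nu[k] for k in mu)          # <mu, nu>
    assert inner > 0, "the product is undefined when <mu, nu> = 0"
    return {k: mu[k] * nu[k] / inner for k in mu}

def product_dist(px, py):
    """Product measure on X x Y with marginals px and py."""
    return {(x, y): px[x] * py[y] for x in px for y in py}

# nu odot (product measure) = nu with rows scaled by px and columns by py,
# then renormalized -- matching the "scaling rows and columns" description.
nu = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
mu_bar = product_dist({0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5})
mu = odot(nu, mu_bar)
assert abs(sum(mu.values()) - 1) < 1e-12

# With a uniform product measure, the operation leaves nu unchanged
# (the remark after Definition 5.1).
uniform = product_dist({0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5})
back = odot(nu, uniform)
assert all(abs(back[k] - nu[k]) < 1e-12 for k in nu)
```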
Let µ be the prior distribution on X × Y in a communication protocol. We fix a decomposition µ = ν ⊙ µ̄, where µ̄ is a product distribution. For every distribution µ′ reachable from µ there is a product distribution µ̄′ such that µ′ = ν ⊙ µ̄′, for the same distribution ν. This follows from the fact that µ′ is obtained from µ by scaling its rows and columns; therefore if we scale the rows and columns of µ̄ by the same constants and then normalize, we obtain the desired µ̄′. In such a decomposition µ = ν ⊙ µ̄, µ is called the real distribution, ν the reference distribution, and µ̄ the pretend distribution.
We would like to work with product distributions since they are simpler and easier to analyze, as we will demonstrate in Section 6. Therefore, we define a pretend random walk, which is a random walk on pretend distributions, as opposed to the normal random walk presented in Section 2.5, which we call the real random walk to distinguish it from the pretend one. It starts from a product measure µ̄ = (µ̄^X, µ̄^Y), where µ̄^X and µ̄^Y are the X and Y marginals of µ̄. At each step we either move by scaling the ∆(X) marginal or the ∆(Y) marginal. The transition in ∆(X) is performed by moving with probability λ̄₀ to (µ̄₀, µ̄^Y) and with probability λ̄₁ to (µ̄₁, µ̄^Y), where 0 < λ̄₀, λ̄₁ < 1, λ̄₀ + λ̄₁ = 1 and

Σ_{b=0,1} λ̄_b µ̄_b = µ̄^X.

A step in the ∆(Y) direction is performed similarly.
Every pretend random walk corresponds to a real random walk performed by some protocol. Given such a pretend random walk and a reference distribution ν, if we replace every distribution µ̄ encountered in the random walk by ν ⊙ µ̄, and scale the transition probabilities, we obtain a real random walk performed by some protocol. Here ν can be any distribution such that ν ⊙ µ̄ is defined for every µ̄ encountered in the protocol (e. g., if supp ν includes the support of the initial distribution). The inverse transformation is also possible.
To formalize this idea, consider a pretend random walk step, from µ̄ to µ̄₀ and µ̄₁ with transition probabilities λ̄₀ and λ̄₁, respectively. Fix a reference distribution ν. Then

ν ⊙ µ̄ = (ν · µ̄)/⟨ν, µ̄⟩ = Σ_{b=0,1} λ̄_b (ν · µ̄_b)/⟨ν, µ̄⟩ = Σ_{b=0,1} λ̄_b (⟨ν, µ̄_b⟩/⟨ν, µ̄⟩)(ν ⊙ µ̄_b) = Σ_{b=0,1} λ_b (ν ⊙ µ̄_b)   (5.2)

for the values

λ_b = λ̄_b ⟨ν, µ̄_b⟩/⟨ν, µ̄⟩.   (5.3)
A calculation shows

Σ_{b=0,1} λ_b = Σ_{b=0,1} λ̄_b ⟨ν, µ̄_b⟩/⟨ν, µ̄⟩ = ⟨ν, Σ_{b=0,1} λ̄_b µ̄_b⟩/⟨ν, µ̄⟩ = ⟨ν, µ̄⟩/⟨ν, µ̄⟩ = 1.   (5.4)
Furthermore, if the pretend random walk step is performed in the ∆(X) direction, then ν ⊙ µ̄_b is obtained from ν ⊙ µ̄ by scaling its rows, and if in the ∆(Y) direction, then by scaling its columns. Therefore, there exists a real random walk step where we move from ν ⊙ µ̄ to ν ⊙ µ̄₀ and ν ⊙ µ̄₁ with probabilities λ₀ and λ₁, respectively. The conversion in the opposite direction, from the real world to the pretend world, is possible by essentially the same calculations.
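The pretend-to-real conversion of (5.2)–(5.4) can be checked mechanically. The sketch below is our own illustration (reusing a normalized-product helper, here called `odot`, with arbitrary numbers): it performs one pretend step in the ∆(X) direction and verifies that the converted probabilities form a convex combination reproducing the pre-step real distribution:

```python
def odot(mu, nu):
    """Normalized pointwise product (Definition 5.1)."""
    s = sum(mu[k] * nu[k] for k in mu)
    return {k: mu[k] * nu[k] / s for k in mu}

def product_dist(px, py):
    return {(x, y): px[x] * py[y] for x in px for y in py}

def inner(mu, nu):
    return sum(mu[k] * nu[k] for k in mu)

nu = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}   # reference distribution
py = {0: 0.5, 1: 0.5}                                        # Y marginal, unchanged by the step

# Pretend step in the Delta(X) direction: the X marginal (0.4, 0.6) splits into
# two marginals that average back to it (a martingale step).
lam_bar = (0.5, 0.5)                                         # pretend transition probabilities
px = {0: 0.4, 1: 0.6}
px0, px1 = {0: 0.2, 1: 0.8}, {0: 0.6, 1: 0.4}
assert all(abs(lam_bar[0]*px0[x] + lam_bar[1]*px1[x] - px[x]) < 1e-12 for x in px)

mu_bar = product_dist(px, py)
mu_bars = [product_dist(px0, py), product_dist(px1, py)]

# Real transition probabilities via (5.3).
lam = [lam_bar[b] * inner(nu, mu_bars[b]) / inner(nu, mu_bar) for b in (0, 1)]
assert abs(sum(lam) - 1) < 1e-12                             # equation (5.4)

# Equation (5.2): nu odot mu_bar is the lam-mixture of the nu odot mu_bar_b.
left = odot(nu, mu_bar)
for k in left:
    mix = sum(lam[b] * odot(nu, mu_bars[b])[k] for b in (0, 1))
    assert abs(left[k] - mix) < 1e-12
```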
Let π₀ and π₁ be the two branches of the protocol π corresponding to the value of the first bit that was sent. Let µ be an input distribution that moves either to µ⁰ or to µ¹ with probabilities λ₀ and λ₁, respectively. Recalling the notion of concealed information discussed in Section 2.5, the equation

CIµ(π) = Σ_{b=0,1} λ_b CIµᵇ(π_b)   (5.5)

translates to

CIν⊙µ̄(π) = Σ_{b=0,1} λ̄_b (⟨ν, µ̄_b⟩/⟨ν, µ̄⟩) CIν⊙µ̄_b(π_b).   (5.6)

Multiplying by ⟨ν, µ̄⟩ we get

⟨ν, µ̄⟩ CIν⊙µ̄(π) = Σ_{b=0,1} λ̄_b ⟨ν, µ̄_b⟩ CIν⊙µ̄_b(π_b).   (5.7)
This motivates the following definition.
Definition 5.2. Let ν be a fixed reference distribution. Define the scaled information of a protocol π with respect to a product distribution µ̄ as

SIMµ̄(π) := ⟨ν, µ̄⟩ CIν⊙µ̄(π).   (5.8)
Equations (5.7) and (5.8) allow us to write

SIMµ̄(π) = λ̄₀ SIMµ̄₀(π₀) + λ̄₁ SIMµ̄₁(π₁).   (5.9)
Recall that CI is the expected amount of entropy that the players have concealed from each other by the end of the protocol. To state this formally, let µ be a distribution over the inputs, π some protocol, and Π the random variable representing the transcript of the protocol. Let µΠ be the random variable that represents the distribution over the inputs given the transcript Π, as defined in Section 2.5. Then

CIµ(π) = E_Π [HµΠ(X | Y) + HµΠ(Y | X)].   (5.10)
We will translate (5.10) into a formula involving the pretend random walk. Let µ = ν ⊙ µ̄, and denote by µ̄Π the pretend distribution at which the pretend random walk ends if its associated protocol has the transcript Π; more formally, µ̄Π is the distribution such that ν ⊙ µ̄Π = µΠ. Equation (5.8) implies

SIMµ̄(π) = E_Π [⟨ν, µ̄Π⟩ (Hν⊙µ̄Π(X | Y) + Hν⊙µ̄Π(Y | X))],   (5.11)

where the probability of each transcript Π is according to the pretend random walk rather than the real one.
One should ask: what is the probability of a transcript t in the pretend random walk, given its probability λ in the real world? The answer turns out to be very simple. Let µ⁰, …, µᵏ be the real distributions encountered in the real random walk, where µ⁰ is the input distribution and µᵏ = µᵗ is the last distribution encountered. For all 1 ≤ i ≤ k, let λⁱ be the transition probability from µⁱ⁻¹ to µⁱ in the real random walk, so that λ = λ¹ ⋯ λᵏ. Let µ̄ⁱ be the pretend distribution associated with µⁱ, so that µⁱ = ν ⊙ µ̄ⁱ for all i. Then the transition probability from µ̄ⁱ⁻¹ to µ̄ⁱ in the pretend world equals

λ̄ⁱ = (⟨ν, µ̄ⁱ⁻¹⟩/⟨ν, µ̄ⁱ⟩) λⁱ,   (5.12)

using the conversion in (5.3). Multiplying these together, we get that the probability of t in the pretend world is

λ̄ = λ̄¹ ⋯ λ̄ᵏ = ∏_{i=1}^{k} (⟨ν, µ̄ⁱ⁻¹⟩/⟨ν, µ̄ⁱ⟩) λⁱ = (⟨ν, µ̄⁰⟩/⟨ν, µ̄ᵏ⟩) λ,   (5.13)

since the inner products telescope. This equation also shows how one can derive (5.11) from (5.10), by multiplying that equation by ⟨ν, µ̄⁰⟩.
6 The analysis of the AND function
This section is mainly devoted to proving the only remaining case of Theorem 3.7, i. e., the lower bound
on ICµ (AND, ε). This is presented below separately as Theorem 6.5. Our general strategy for this proof
was sketched in Section 3.3 following Theorem 3.7.
Preliminaries and notation. This section relies strongly on the parametrization of distributions as product distributions, as presented in Section 5. A real distribution is usually denoted by µ, and it is usually decomposed as µ = ν ⊙ µ̄, where ν is a symmetric reference distribution and µ̄ a pretend distribution. Pretend distributions are always product distributions. We will use the shorthand notation µ̄ = (p, q) for the product distribution in which p = µ̄(1, 0) + µ̄(1, 1) and q = µ̄(0, 1) + µ̄(1, 1). The distribution µ will usually be assumed to have full support, which in turn forces ν and µ̄ to have full support as well.
We are usually going to be working in a pretend world, dealing with the pretend distributions, and
keeping the reference distributions in the background. Furthermore, reference distributions are usually
kept fixed. We regard protocols as pretend random walks, as presented in Section 5.
Suppose that we run a protocol π starting at a distribution µ = ν ⊙ µ̄. As we explained in Section 5, for each transcript t of the protocol there is a product distribution µ̄_t such that ν ⊙ µ̄_t is the distribution of the players' inputs conditioned on the protocol terminating at the leaf t. Let Π be the random transcript of the pretend random walk associated with an execution of π on input distribution µ. Thus, for any transcript t, Pr[Π = t] is the probability of the transcript t in the pretend random walk, which might differ from the corresponding probability in the real random walk. Throughout this section we view the protocol only through the pretend random walk, so all random variables that correspond to Π are assumed to be distributed according to the pretend random walk. Since µ̄Π, the pretend distribution at the random transcript Π, is a product distribution, it can be written as µ̄Π = (p, q), where p and q are random variables. We call (p, q) the leaf distribution of π. We define a crucial random variable, ℓ = max(p, q).
If π is a zero-error protocol, then the leaf distribution is supported on product distributions of the
form (p, 0), (0, q) or (1, 1), since in order to know the AND of the two players’ inputs we need to know
that one of the players has input 0, or that both inputs are 1.
Since we are concerned with almost-optimal protocols, we would like to quantify optimality. Given a protocol π, define its wastage with respect to a distribution µ by

IWµ(π) = ICµ(π) − ICµ(AND, 0) = CIµ(AND, 0) − CIµ(π).   (6.1)
6.1 Stability results
Braverman et al. [5], studying the complexity of the AND function, suggested a continuous protocol
whose information complexity equals ICµ (AND, 0), called the buzzer protocol. This protocol is defined
differently for any input distribution µ. Here we denote this protocol by π ∗ . The buzzer protocol is not a
conventional communication protocol as it has access to a continuous clock, however, it can be viewed as
a limit of a sequence of genuine protocols. The information complexity of the protocols in that sequence
converge to that of the buzzer protocol, and their leaf distribution converges in distribution.
We start by presenting the leaf distribution of the buzzer protocol. We assume that the input reference
distribution is symmetric; its importance will become apparent later on.
Table 1: The leaf distribution of the buzzer protocol starting from (p, q), where p ≥ q.

  Distribution µ̄Π:               (p, 0)      (ℓ, 0), (0, ℓ) with p < ℓ < 1      (1, 1)
  Probability of reaching it:    1 − q/p     (pq/ℓ³) dℓ                         pq
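As a quick consistency check on Table 1 (reading the density pq/ℓ³ dℓ as applying to each of the two branches (ℓ, 0) and (0, ℓ), which is the reading under which the masses balance), the entries have total mass 1:

```latex
\int_p^1 \frac{pq}{\ell^3}\,d\ell \;=\; \frac{q}{2p} - \frac{pq}{2}
\quad\text{per branch, and}\quad
\Bigl(1-\frac{q}{p}\Bigr) \;+\; 2\Bigl(\frac{q}{2p}-\frac{pq}{2}\Bigr) \;+\; pq \;=\; 1 .
```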
As can be seen in Table 1, this is a mix of discrete probabilities and a continuous density. To verify that the above formulas are correct, one can convert the leaf distribution of the buzzer protocol, as calculated in [5] for the real random walk, to its corresponding leaf distribution in the pretend random walk. The formulas discussed in Section 5 give the appropriate scaling of the probabilities when converting the real random walk to the pretend one.
There is also a second, more intuitive way to obtain these formulas, by considering a sequence of protocols that converges to the buzzer protocol. We describe the protocols in that sequence by their pretend random walks. The initial distribution in the pretend world of a protocol in that sequence is (p, q), where p, q ∈ {0, 1/n, 2/n, …, 1}. In each step, the pretend random walk moves to one of two adjacent grid points, each with probability one half. If we are currently at a distribution (a/n, b/n) where a ≥ b, then the step moves to one of (a/n, (b + 1)/n) and (a/n, (b − 1)/n). Otherwise, the protocol moves to one of ((a + 1)/n, b/n) and ((a − 1)/n, b/n).
Therefore, starting at the point (a/n, b/n) where a ≥ b, the random walk moves along the y axis until it ends up either at (a/n, 0) or at (a/n, (a + 1)/n). Since this walk is balanced, the probabilities of reaching these points are 1 − b/(a + 1) and b/(a + 1), respectively. Then, from that point the random walk moves along the x axis, until it reaches either the point (0, (a + 1)/n) or ((a + 1)/n, (a + 1)/n), with probabilities 1/(a + 1) and a/(a + 1), respectively. Then again, it ends up either at ((a + 1)/n, 0) or at ((a + 1)/n, (a + 2)/n), then at (0, (a + 2)/n) or ((a + 2)/n, (a + 2)/n), and continues this way until it reaches either the point (1, 1) or a point of the form (0, i/n) or (i/n, 0). Calculating the leaf distribution of each pretend random walk in that sequence and taking the limit as n → ∞ yields a leaf distribution which equals that of the buzzer protocol, as will be explained below.
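Because every monotone stretch of this walk is a balanced gambler's ruin whose two exit probabilities are known, the leaf distribution of the discretized walk can be computed exactly, phase by phase, without simulating single steps. The sketch below is our own bookkeeping (not notation from the text); it recovers the entries of Table 1, and in fact the probability of the leaf (1, 1) comes out as exactly ab/n² = pq already at finite n:

```python
from fractions import Fraction

def leaf_distribution(a, b, n):
    """Exact leaf distribution of the discretized pretend walk started at
    (a/n, b/n) with a >= b.  Each monotone phase is a balanced walk that,
    started at s with absorbing barriers 0 and m, exits at m with
    probability s/m and at 0 with probability 1 - s/m."""
    leaves = {}
    # First phase: y moves from b with barriers 0 and a+1.
    leaves[(Fraction(a, n), 0)] = 1 - Fraction(b, a + 1)
    alive = Fraction(b, a + 1)          # probability of reaching y = (a+1)/n
    m = a + 1                           # current level: the larger coordinate is m/n
    while True:
        # x moves from m-1 with barriers 0 and m.
        leaves[(0, Fraction(m, n))] = alive * Fraction(1, m)
        alive *= Fraction(m - 1, m)
        if m == n:                      # tie at (1, 1): both inputs are 1
            leaves[(1, 1)] = alive
            break
        # y moves from m with barriers 0 and m+1.
        leaves[(Fraction(m, n), 0)] = alive * Fraction(1, m + 1)
        alive *= Fraction(m, m + 1)
        m += 1
    return leaves

n, a, b = 30, 18, 12                    # pretend start (p, q) = (0.6, 0.4)
L = leaf_distribution(a, b, n)
assert sum(L.values()) == 1                        # total mass is exactly 1
assert L[(1, 1)] == Fraction(a * b, n * n)         # exactly p*q, even at finite n
# Interior atoms: P[(m/n, 0)] = a*b / (m^2 * (m+1)) for m > a, matching the
# density pq / l^3 times the bin width 1/n up to lower-order terms.
assert L[(Fraction(20, n), 0)] == Fraction(a * b, 20 * 20 * 21)
```

The atom at (a/n, 0) is 1 − b/(a + 1) → 1 − q/p as n → ∞, in agreement with Table 1.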
The buzzer protocol can also be defined similarly as a limit of a sequence of protocols, where for each protocol in the sequence, the real-world analogue of moving in the y direction is performed whenever Pr[X = 1] ≥ Pr[Y = 1], while the analogue of moving in the x direction is performed otherwise. In order for our limit protocol to behave identically to the buzzer protocol, we would like the region Pr[X = 1] ≥ Pr[Y = 1] to correspond to the region p ≥ q. This is achieved by using a symmetric reference distribution.
Next, we would like to show a stability result, proving that every protocol performing the task [AND, 0] with nearly optimal information complexity is similar to the buzzer protocol. We measure similarity in terms of the leaf distribution (p, q), and define the following potential function.

Definition 6.1. Given a protocol π for [AND, 0], a constant 0 < c < 1, and a pretend distribution µ̄, let

Φc,µ̄(π) = E[((c − ℓ)₊)²],   (6.2)

where (·)₊ = max{·, 0} and ℓ = max(p, q). Denote Φc,µ̄ = Φc,µ̄(π*), where π* is the buzzer protocol.
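For a protocol with finitely many leaves, Φ is a plain expectation over the leaf distribution. Here is a tiny illustrative helper (our own sketch, with leaves given as (probability, p, q) triples):

```python
def phi(leaves, c):
    """Phi_c = E[((c - l)_+)^2] over a finite leaf distribution,
    where l = max(p, q) at each leaf."""
    total = 0.0
    for prob, p, q in leaves:
        ell = max(p, q)
        total += prob * max(c - ell, 0.0) ** 2
    return total

# Leaves with l >= c contribute nothing; only leaves falling short of c are penalized.
leaves = [(0.5, 0.3, 0.1), (0.5, 0.9, 0.0)]
assert abs(phi(leaves, 0.5) - 0.5 * (0.5 - 0.3) ** 2) < 1e-12
assert phi(leaves, 0.2) == 0.0
```

In particular, for the buzzer protocol started at (p, q) with c ≤ max(p, q), every leaf in Table 1 has ℓ ≥ max(p, q) ≥ c, so the potential of the buzzer protocol vanishes, consistent with the choice of the clamp (·)₊ discussed below.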
The following theorem shows that the value of the potential function is small for nearly optimal
protocols.
Theorem 6.2. Let µ be a full support distribution, and let µ = ν ⊙ µ̄ be its decomposition, where ν is a symmetric reference distribution and µ̄ = (p, q) is the product pretend distribution. Assume that c ≤ max{p, q}. Let π be a protocol performing [AND, 0]. Then

Φc,µ̄(π) = O(ICµ(π) − ICµ(AND, 0)) = O(IWµ(π)).   (6.3)

The constant in the O(·) is uniform whenever ν(0, 0), ν(0, 1), ν(1, 0), p, q are bounded away from 0 and 1.
In order to prove this theorem, we measure how each performed step contributes both to the wastage
and to the potential function. To measure the wastage, we work with SIM instead of IC, as it is a more
natural measure for this task.
Lemma 6.3. Let µ be a full support distribution, and let µ = ν ⊙ µ̄ be its decomposition, where ν is a symmetric reference distribution and µ̄ is the pretend distribution. Let 0 < c < 1, and let π be the protocol which behaves as follows:

1. One step of a pretend random walk is performed, which corresponds to one bit that is sent in the protocol.

2. The pretend random walk that corresponds to the buzzer protocol is simulated from that point: assuming that after the first bit was sent the pretend distribution is (p, q), let π*_(p,q) be the buzzer protocol for the input distribution ν ⊙ (p, q). Then the pretend random walk that corresponds to π*_(p,q) is simulated (the value of (p, q) is different when the first bit equals 1 and when it equals 0).

Then

Φc,µ̄(π) − Φc,µ̄ = O_ν(SIMµ̄(AND, 0) − SIMµ̄(π)).   (6.4)
The constant in the O(·) is uniform whenever ν(0, 0), ν(0, 1), ν(1, 0), c are bounded away from 0 and 1.
The potential function of Definition 6.1 is defined precisely so that Lemma 6.3 holds. Let us elaborate: assume that a protocol π is defined as in this lemma, with a pretend input distribution of (p, q), and that the first step moves from (p, q) either to (p + δ, q) or to (p − δ, q) with equal probability. Then
$$\mathrm{SIM}_{(p,q)}(\pi) - \mathrm{SIM}_{(p,q)}(\mathrm{AND},0) = \tfrac12\,\mathrm{SIM}_{(p+\delta,q)}(\mathrm{AND},0) + \tfrac12\,\mathrm{SIM}_{(p-\delta,q)}(\mathrm{AND},0) - \mathrm{SIM}_{(p,q)}(\mathrm{AND},0) \approx \frac{\delta^2}{2}\,\frac{\partial^2}{\partial p^2}\,\mathrm{SIM}_{(p,q)}(\mathrm{AND},0).$$
Thus, this difference has the same order of magnitude as δ². We would like the change in the potential function to have the same order. Looking at the function x², it holds that

$$\tfrac12 (x+\delta)^2 + \tfrac12 (x-\delta)^2 - x^2 = \delta^2. \tag{6.5}$$
If a protocol π moves according to the direction of the buzzer protocol, then π is the same as π ∗ and
both differences are zero. Therefore, assume that p > q, and π moves in the x direction, whereas the
buzzer protocol would have moved in the y direction. Roughly speaking, the leaf distribution of π is
obtained from the leaf distribution of π* by splitting some of the mass around ℓ ≈ p between ℓ ≈ p − δ and ℓ ≈ p + δ. Thus, Φ_{c,µ}(π) − Φ_{c,µ} approximately has the order of magnitude of

$$\tfrac12 (c-p-\delta)^2 + \tfrac12 (c-p+\delta)^2 - (c-p)^2 = \delta^2. \tag{6.6}$$

We chose (c − p)²₊ instead of (c − p)² since Lemma 6.8 requires the buzzer protocol to have a potential value of zero. Indeed, by choosing c carefully we can achieve this.
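The second-difference computation above is easy to spot-check; the following sketch (ours, not from the paper) verifies that for f(x) = x², an equal-probability ±δ step shifts the expectation by exactly δ², independently of x:

```python
# Check: (1/2)f(x+d) + (1/2)f(x-d) - f(x) = d^2 for f(x) = x^2,
# the quadratic second-difference used for the potential function.
def second_difference(f, x, d):
    return 0.5 * f(x + d) + 0.5 * f(x - d) - f(x)

for x in [0.1, 0.3, 0.7]:
    for d in [0.01, 0.05]:
        assert abs(second_difference(lambda t: t * t, x, d) - d * d) < 1e-12
```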
We will prove Lemma 6.3 using the following criterion.
Lemma 6.4. Let ν be a symmetric reference distribution, and C > 0 a constant. Define
F(p, q) = C SIM(p,q) (AND, 0) + Φc,(p,q) .
If for every q, F(p, q) is concave as a function of p, and for every p, F(p, q) is concave as a function of q,
then Lemma 6.3 holds, and the constant in the O(·) can be taken to be equal to C.
Proof. Let π be the protocol defined in Lemma 6.3, and let µ be its pretend input distribution. Assume that the pretend random walk of π first moves from µ either to µ0 or to µ1, with probabilities λ0 and λ1. We assume this step is in the x-direction; thus, the first step is from (p, q) to (p0, q) or to (p1, q). The analysis for the case that this step is in the y-direction is similar. Let 0 < c < 1. Then
    SIM_{(p,q)}(π) = Σ_b λ_b SIM_{(p_b,q)}(AND, 0),

and

    Φ_{c,(p,q)}(π) = Σ_b λ_b Φ_{c,(p_b,q)}.

From concavity,

    C SIM_{(p,q)}(AND, 0) + Φ_{c,(p,q)} = F(p, q) ≥ Σ_b λ_b F(p_b, q)
        = Σ_b λ_b (C SIM_{(p_b,q)}(AND, 0) + Φ_{c,(p_b,q)})
        = C SIM_{(p,q)}(π) + Φ_{c,(p,q)}(π).
Thus, our focus will be proving that these concavity conditions hold for some value C. We proceed by calculating Φ_{c,(p,q)}, assuming without loss of generality that p ≥ q. One can see that whenever p ≥ c, with probability 1 the leaf distribution of the buzzer protocol satisfies ℓ ≥ p ≥ c, and thus the potential
function evaluates to 0. Consider the case p < c. Using the leaf distribution, we obtain the formula

$$\Phi_{c,(p,q)} = (1 - q/p)(c - p)^2 + \int_{p}^{c} \frac{2pq}{\ell^3}(c - \ell)^2 \, d\ell. \tag{6.7}$$
Thus, the general definition is as follows:

$$\Phi_{c,(p,q)} = \begin{cases} 0 & \text{if } \max\{p, q\} \ge c, \\[2pt] (1 - q/p)(c - p)^2 + \int_{p}^{c} \frac{2pq}{\ell^3}(c - \ell)^2 \, d\ell & \text{if } q \le p \le c, \\[2pt] (1 - p/q)(c - q)^2 + \int_{q}^{c} \frac{2pq}{\ell^3}(c - \ell)^2 \, d\ell & \text{if } p \le q \le c. \end{cases} \tag{6.8}$$
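As a sanity check on the leaf distribution behind (6.7) and (6.8), the following sketch (ours; it assumes the leaf structure described in Section 6.1 for q ≤ p: an atom of mass 1 − q/p at ℓ = p, density 2pq/ℓ³ on (p, 1], and an atom of mass pq at ℓ = 1) verifies numerically that the total mass is 1:

```python
# Total mass of the buzzer leaf distribution for q <= p:
# atom (1 - q/p) at l = p, density 2pq/l^3 on (p, 1], atom pq at l = 1.
def total_mass(p, q, steps=100000):
    width = (1 - p) / steps
    # midpoint quadrature of the density part
    density = sum(2 * p * q / (p + (k + 0.5) * width) ** 3
                  for k in range(steps)) * width
    return (1 - q / p) + density + p * q

for p, q in [(0.3, 0.2), (0.5, 0.5), (0.9, 0.1)]:
    assert abs(total_mass(p, q) - 1.0) < 1e-6
```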
In order to apply Lemma 6.4, we start by showing that the function Φ_{c,(p,q)} is differentiable in p for every fixed value of q, and in q for every fixed value of p. This is done by calculating the two one-sided derivatives at the points suspected of non-differentiability: p = q and max{p, q} = c. In more detail, for any fixed q we calculate both

$$\frac{\partial \Phi_{c,(p,q)}}{\partial p^+} = \lim_{h \to 0^+} \frac{\Phi_{c,(p+h,q)} - \Phi_{c,(p,q)}}{h} \tag{6.9}$$

and

$$\frac{\partial \Phi_{c,(p,q)}}{\partial p^-} = \lim_{h \to 0^-} \frac{\Phi_{c,(p+h,q)} - \Phi_{c,(p,q)}}{h}, \tag{6.10}$$
and verify that the two values are equal at all suspected points. We do the same with the roles of p and q switched (though this is not required, as the potential function is symmetric, since we assume the reference distribution to be symmetric). Additionally, we calculate its second derivatives wherever they are defined. If max{p, q} > c, then they are trivially zero. For q < p < c, we get
$$\frac{\partial^2 \Phi_{c,(p,q)}}{\partial p^2} = 2(1 - q/p) \tag{6.11}$$

and

$$\frac{\partial^2 \Phi_{c,(p,q)}}{\partial q^2} = 0. \tag{6.12}$$
Actually, there is a reason why this second derivative with respect to q is zero. For any 0 < δ ≤ min{p − q, q}, consider a protocol π that first moves to (p, q − δ) or to (p, q + δ), each with probability 1/2, and then simulates the buzzer protocol. It has the same leaf distribution as the buzzer protocol (in the pretend world): both the buzzer protocol and π reach either the point (p, 0) or the point (p, p), with probabilities 1 − q/p and q/p, respectively, and from that point on both continue the same way, resulting in the same leaf distribution. This validates the equality

$$\Phi_{c,(p,q)} = \tfrac{1}{2}\Phi_{c,(p,q+\delta)} + \tfrac{1}{2}\Phi_{c,(p,q-\delta)} \tag{6.13}$$

for all q and all sufficiently small δ, which implies linearity in the region q ∈ [0, p] (given a fixed p).
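Both the linearity (6.13) and the second derivative (6.11) can be confirmed numerically. The sketch below (ours; it evaluates the integral in (6.8) in closed form, ∫_p^c 2pq(c−ℓ)²/ℓ³ dℓ = 2pq(3/2 + ln(c/p) + c²/(2p²) − 2c/p)) checks that Φ_{c,(p,q)} is linear in q on [0, p] and that ∂²Φ/∂p² = 2(1 − q/p):

```python
import math

def phi(c, p, q):
    """Potential (6.8) in the region q <= p; zero when p >= c."""
    if p >= c:
        return 0.0
    integral = 2 * p * q * (1.5 + math.log(c / p) + c * c / (2 * p * p) - 2 * c / p)
    return (1 - q / p) * (c - p) ** 2 + integral

c, p = 0.6, 0.3
# (6.13): midpoint property in q, i.e., linearity on [0, p]
for q, d in [(0.15, 0.1), (0.2, 0.05)]:
    assert abs(0.5 * phi(c, p, q - d) + 0.5 * phi(c, p, q + d) - phi(c, p, q)) < 1e-12
# (6.11): finite-difference second derivative in p equals 2*(1 - q/p)
q, step = 0.2, 1e-4
num = (phi(c, p + step, q) - 2 * phi(c, p, q) + phi(c, p - step, q)) / step ** 2
assert abs(num - 2 * (1 - q / p)) < 1e-3
```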
Similar calculations will now be performed for SIM_{(p,q)}(AND, 0).
Denote x = ν(0, 0), y = ν(1, 0) = ν(0, 1), z = ν(1, 1). It is possible to extract the value of this function from the equations in [5], using the conversion from SIM to CI (5.8) and from CI to IC (2.19). Nevertheless, we calculate it using the formula (5.11), which is an expectation over a value obtained at the leaves of the protocol. Let p ≥ q, and let Π correspond to the buzzer protocol, which starts at distribution (p, q). Then,
$$\mathrm{SIM}_{(p,q)}(\mathrm{AND}, 0) = \mathop{\mathbb{E}}_{\Pi}\left[\langle \nu, \mu_\Pi \rangle \left(H_{\mu_\Pi}(X \mid Y) + H_{\mu_\Pi}(Y \mid X)\right)\right]$$
$$= \left(1 - \frac{q}{p}\right)((1-p)x + py)\, h\!\left(\frac{py}{(1-p)x + py}\right) + \int_{p}^{1} \frac{2pq}{\ell^3}\, ((1-\ell)x + \ell y)\, h\!\left(\frac{y\ell}{x(1-\ell) + y\ell}\right) d\ell$$
$$= -q(1-p)y + (1-p)(1-q)x \log \frac{(1-p)x}{(1-p)x + py} + \frac{pqy^2}{x} + (p + q - 2pq)\, y \log \frac{py}{(1-p)x + py}.$$
Calculating the second derivative, we get for p > q,

$$\frac{\partial^2 \mathrm{SIM}_{(p,q)}(\mathrm{AND}, 0)}{\partial p^2} = -2(1 - q/p)\, \frac{xy}{2(1-p)p^2((1-p)x + py)}, \tag{6.14}$$

and

$$\frac{\partial^2 \mathrm{SIM}_{(p,q)}(\mathrm{AND}, 0)}{\partial q^2} = 0. \tag{6.15}$$
The reason that this second derivative is zero is the same as explained for the potential function. For proving differentiability (in each direction separately), the only suspected point is p = q. Comparing the two one-sided derivatives implies the result.
Now we are almost ready to apply Lemma 6.4. Define

$$C = \max_{0 \le p \le 1} \frac{2(1-p)p^2((1-p)x + py)}{xy}, \tag{6.16}$$
and F(p, q) = C SIM_{(p,q)}(AND, 0) + Φ_{c,(p,q)}. For any fixed q, ∂F(p, q)/∂p is continuous and piecewise differentiable, and its derivative ∂²F(p, q)/∂p² is non-positive wherever it is defined. Thus ∂F(p, q)/∂p is non-increasing, and F(p, q) is concave as a function of p. The same holds when switching the roles of p and q, so the conditions of Lemma 6.4 are satisfied, which concludes the proof of Lemma 6.3. Finally, we are able to prove Theorem 6.2.
Proof of Theorem 6.2. Let T be the protocol tree of π. This is a directed binary tree with two children
for each internal node. Each node corresponds to a state of the protocol when some communication has
taken place, and its children are the two consecutive states, chosen according to the bit sent by the player
owning the node.
We can construct T using a sequence of trees, T1 , T2 , . . . , Tk = T . The tree T1 contains only the root of
T , and for all i, Ti is obtained from Ti−1 by adding the children of a leaf of Ti−1 which is not a leaf of T .
Given a tree Ti, construct a protocol πi as follows: whenever πi reaches a state represented by a node v which is not a leaf of Ti, it behaves as π for the next bit sent, and if the state is represented by a leaf of Ti, then the buzzer protocol is simulated from that point on. Let D be the constant in the O(·) guaranteed by Lemma 6.3. The lemma implies that for all i,
Φc,µ (πi ) − Φc,µ (πi−1 ) ≤ D(SIMµ (πi−1 ) − SIMµ (πi )).
Summing over i, we get a telescoping sum:

    Φ_{c,µ}(π) = Φ_{c,µ}(π_k) − Φ_{c,µ}(π_1) ≤ D(SIM_µ(π_1) − SIM_µ(π_k)) = D(SIM_µ(AND, 0) − SIM_µ(π)).    (6.17)

We used the fact that

    Φ_{c,µ}(π_1) = Φ_{c,µ} = 0,

which holds since we assumed that c ≤ max{p, q}, and the leaf distribution of the buzzer protocol has zero mass on ℓ < max{p, q}, so its potential is zero. This finishes the proof, as

    SIM_µ(AND, 0) − SIM_µ(π) = ⟨ν, µ⟩(CI_µ(AND, 0) − CI_µ(π)) = ⟨ν, µ⟩ IW_µ(π) ≤ IW_µ(π).    (6.18)
6.2 Lower bound on the information complexity ICµ(AND, ε)
In this section, we prove Theorem 3.7 by showing that every distribution µ which is of full support,
except perhaps for µ(1, 1), satisfies ICµ (AND, ε) ≥ ICµ (AND, 0)−O(h(ε)). Recall that Theorem 3.7 (ii)
follows from Part (i) and we have already established the upper bound of Theorem 3.7 (i) in Theorem 3.2.
Hence it remains to prove the following theorem.
Theorem 6.5 (The remaining case of Theorem 3.7). Let µ be a full-support distribution, except perhaps
for µ(1, 1). For all ε ≥ 0,
ICµ (AND, ε) ≥ ICµ (AND, 0) − Oµ (h(ε)). (6.19)
The hidden constant can be fixed if µ(0, 0), µ(0, 1), µ(1, 0) are bounded away from 0.
The proof uses the idea of protocol completion: given a protocol π performing [AND, ε], we can
create a protocol π0 , which we call the zero-error completion of π. Such a protocol π0 takes the following
steps:
• First Alice and Bob simulate π until it terminates.
• Afterwards they run a protocol that solves the AND function with zero error.
The cost of completion is the amount of information revealed in the second step, and it is equal to IC_µ(π0) − IC_µ(π). We have shown in the proof of Theorem 3.5 that for general functions this cost is bounded by O(h(√ε)), but here we would like to prove a stronger bound of O(h(ε)) for protocols that are almost optimal for the AND function. This would yield the desired lower bound, and prove Theorem 6.5. The completion cost can be arbitrarily close to

    E_Π[IC_{µ_Π}(AND, 0)].
In order to bound this quantity, we first bound the information complexity of the AND function.
Lemma 6.6. Consider a reference distribution ν with ν(0, 0) = x, ν(1, 0) = ν(0, 1) = y, ν(1, 1) = z, such that x, y, z > 0. Let µ̃ = (p, q) be a pretend distribution. Let µ = ν ⊙ µ̃, and µ(1, 1) = δ. Let 0 < C < 1 be an arbitrary constant.

Firstly, IC_µ(AND, 0) ≤ 2h(1 − δ). Secondly,

$$\mathrm{IC}_\mu(\mathrm{AND}, 0) \le \begin{cases} O(h(\delta/z)) & \text{if } \max(p, q) \ge C, \\[2pt] O\big(h\big(\sqrt{\delta/z}\big)\big) & \text{if } p, q < C. \end{cases} \tag{6.20}$$
The hidden constants can be fixed if x, y,C are bounded away from both 0 and 1.
Proof. First we prove that IC_µ(AND, 0) ≤ 2h(1 − δ). Assume that δ ≥ 1/2, as otherwise the inequality trivially holds. The information complexity is at most that of the protocol in which both Alice and Bob send their inputs. The cost of that protocol is at most H(XY) ≤ H(X) + H(Y) ≤ 2h(δ) = 2h(1 − δ).
For proving the other bounds, assume that δ < 1/2, since otherwise the lemma trivially follows. If p, q > 1/2, then

    δ = ν(1, 1)pq / ⟨ν, µ⟩ ≥ ν(1, 1) = z,

since

    ⟨ν, µ⟩ = (1 − p)(1 − q)x + [p(1 − q) + (1 − p)q]y + pqz ≤ (x + 2y + z)pq = pq.    (6.21)

In this case, the lemma follows.
Assume then that either p ≤ 1/2 or q ≤ 1/2; without loss of generality, p ≤ q. We analyze the protocol in which Alice first sends her input to Bob, and if X = 1 then Bob sends his input to Alice. This protocol has a cost of

    H(X | Y) + Pr[X = 1] H(Y | X = 1) ≤ H(X) + Pr[X = 1] ≤ h(Pr[X = 1]) + Pr[X = 1].    (6.22)

The obtained bound is monotone in Pr[X = 1], a fact that we will use. Now

    Pr[X = 1] = (p(1 − q)y + pqz) / ⟨ν, µ⟩ ≤ p(y + z) / ⟨ν, µ⟩ = δ(y + z) / (zq).    (6.23)
Thus, if q ≥ C, then the cost of completion is at most

$$h\!\left(\frac{\delta(y+z)}{zC}\right) + \frac{\delta(y+z)}{zC} \le \frac{(y+z)\delta}{Cz} + \begin{cases} h(\delta/z) & \text{if } \frac{y+z}{C} < 1, \\[2pt] \frac{y+z}{C} \cdot 2h(\delta/z) & \text{otherwise,} \end{cases} \tag{6.24}$$
using the bound h(cx) ≤ 2ch(x) for all c > 1, from (3.3).
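The entropy bound h(cx) ≤ 2c·h(x) for c > 1, quoted from (3.3), is easy to spot-check numerically; the following sketch (ours) does so on a small grid, restricted to arguments where cx < 1:

```python
import math

def h(x):
    """Binary entropy."""
    return 0.0 if x <= 0 or x >= 1 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# h(c*x) <= 2*c*h(x) for c > 1, as used in (6.24)
for c in [1.5, 3.0, 10.0]:
    for x in [1e-4, 1e-3, 1e-2, 0.05]:
        if c * x < 1:
            assert h(c * x) <= 2 * c * h(x)
```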
If q ≤ C, then Pr[X = 1] is maximized at q = p, so assume that p = q. We bound q from below. The equation

    q²z / ⟨ν, µ⟩ = δ

implies

$$q = \sqrt{\frac{\delta \langle \nu, \mu \rangle}{z}}. \tag{6.25}$$
Now, since

    ⟨ν, µ⟩ ≥ ν(0, 0)µ(0, 0) = (1 − p)(1 − q)x ≥ (1 − C)²x,    (6.26)

we have

$$\Pr[X = 1] \le \frac{\delta(y+z)}{zq} \le \sqrt{\frac{\delta}{z}} \cdot \frac{y+z}{(1-C)\sqrt{x}}. \tag{6.27}$$

The proof concludes by applying calculations similar to those in (6.24).
Next, we use this bound to show that if the probability that max{p, q} does not exceed some constant is very small, then one can improve over h(√ε) for the completion cost.
Lemma 6.7. Let ν be a symmetric reference distribution with ν(0, 0) = x, ν(0, 1) = ν(1, 0) = y and ν(1, 1) = z > 0. Let µ̃ = (p, q) be a pretend distribution, and let µ = ν ⊙ µ̃. Let π be a protocol performing [AND, ε], let 0 < C < 1 be an arbitrary constant, and let

    κ = Pr[max{p, q} ≤ C].

The protocol π can be completed to a zero-error protocol using an additional information cost of

$$O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + (1 - \kappa)\, h\big(\tfrac{\varepsilon}{1-\kappa}\big)\right), \tag{6.28}$$

where the cost is according to the distribution µ, and the hidden constant in the O(·) can be fixed if x, y, p, q, C are all bounded away from both 0 and 1.
Proof. First, note that

    µ(1, 1) = zpq / ⟨ν, µ⟩ ≤ zpq / (x(1 − p)(1 − q)) = O(z).    (6.29)
Let ψ be the random variable denoting the completion cost as a function of Π. Let 1_{o=b} be the indicator of whether π outputs b given the transcript Π, for b = 0, 1. The total completion cost is

    E[ψ] = Σ_{b=0,1} E[ψ 1_{o=b}].    (6.30)

We start by bounding E[ψ 1_{o=1}]. Let δ be the random variable which equals µ_Π(1, 1). Then

    E[(1 − δ) 1_{o=1}] = Pr[(X, Y) ≠ (1, 1), π outputs 1] ≤ ε.    (6.31)
From Lemma 6.6, the completion cost ψ is at most 2h(1 − δ). From the concavity of h,

    E[ψ 1_{o=1}] = E[O(h(1 − δ)) 1_{o=1}] = E[O(h((1 − δ) 1_{o=1}))] ≤ O(h(E[(1 − δ) 1_{o=1}])) ≤ O(h(ε)).    (6.32)

This can be bounded as desired since in both cases κ > 1/2 and κ ≤ 1/2, we have

$$h(\varepsilon) = O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + (1 - \kappa)\, h\big(\tfrac{\varepsilon}{1-\kappa}\big)\right). \tag{6.33}$$
Next we bound E[ψ 1_{o=0}]. First,

    E[δ 1_{o=0}] = Pr[(X, Y) = (1, 1), π outputs 0] ≤ ε µ(1, 1) ≤ ε O(z).    (6.34)

Let S be the event that max{p, q} ≤ C. Then

    E[δ 1_{o=0} | S] ≤ ε O(z) / Pr[S] = ε O(z) / κ,    (6.35)
    E[δ 1_{o=0} | S̄] ≤ ε O(z) / (1 − κ).    (6.36)

From Lemma 6.6, the completion cost is of order h(√(δ/z)) when S happens, and h(δ/z) otherwise. Hence
$$\begin{aligned} \mathbb{E}[\psi 1_{o=0}] &= \Pr[S]\, \mathbb{E}[\psi 1_{o=0} \mid S] + \Pr[\bar S]\, \mathbb{E}[\psi 1_{o=0} \mid \bar S] & (6.37) \\ &= O\left(\kappa\, \mathbb{E}\big[h\big(\sqrt{\delta 1_{o=0}/z}\big) \mid S\big] + (1-\kappa)\, \mathbb{E}[h(\delta 1_{o=0}/z) \mid \bar S]\right) & (6.38) \\ &\le O\left(\kappa\, h\big(\sqrt{\mathbb{E}[\delta 1_{o=0} \mid S]/z}\big) + (1-\kappa)\, h(\mathbb{E}[\delta 1_{o=0} \mid \bar S]/z)\right) & (6.39) \\ &\le O\left(\kappa\, h\big(\sqrt{O(\varepsilon)/\kappa}\big) + (1-\kappa)\, h(O(\varepsilon)/(1-\kappa))\right) & (6.40) \\ &\le O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + (1-\kappa)\, h\big(\tfrac{\varepsilon}{1-\kappa}\big)\right), & (6.41) \end{aligned}$$

where (6.39) follows from the concavity of h(·/z) and h(√(·/z)), and (6.41) follows from (3.3).
Consider an almost optimal protocol π0 so that ICµ (π0 ) − ICµ (AND, 0) is small. Our stability result,
Theorem 6.2, translates this to a bound on the potential function introduced in Definition 6.1. The next
lemma uses this to show that for such a protocol π0 , one can obtain a strong bound on the value of κ in
Lemma 6.7.
Lemma 6.8. Let µ be a full-support distribution and let µ = ν ⊙ µ̃ be its decomposition, where ν is a symmetric reference distribution and µ̃ = (p, q) is the pretend distribution. Let c = max{Pr_µ[X = 1], Pr_µ[Y = 1]}. Let π be an arbitrary protocol, and let π0 be the completion of π to a protocol performing [AND, 0]. Then

    Pr[max{p, q} ≤ c/4] = O_{c,µ̃,ν}(IC_µ(π0) − IC_µ(AND, 0)).    (6.42)

The hidden constant can be fixed if p, q, µ(0, 0), µ(0, 1), µ(1, 0) are all bounded away from both 0 and 1.
Proof. Let ℓ_{p,q} be the distribution of ℓ that corresponds to the buzzer protocol when it is invoked from a pretend distribution parametrized by (p, q). We start by showing that for any 0 < p, q < 1,

    Pr[ℓ_{p,q} ≤ 2 max{p, q}] ≥ 3/4.    (6.43)

Assume without loss of generality that p ≥ q. Using the leaf distribution from Section 6.1,

$$\Pr[p \le \ell \le 2p] = \int_{p}^{2p} \frac{2pq}{\ell^3}\, d\ell + 1 - \frac{q}{p} = 1 - \frac{q}{4p} \ge \frac{3}{4}. \tag{6.44}$$
This implies

    Pr[ℓ_{π0} ≤ c/2] = Pr[ℓ_{π0} ≤ 2 · (c/4)]
        ≥ Pr[max{p, q} ≤ c/4] · Pr[ℓ_{p,q} ≤ 2 max{p, q}]
        ≥ (3/4) Pr[max{p, q} ≤ c/4].
Markov’s inequality and Theorem 6.2 imply

$$\Pr\left[\ell_{\pi_0} \le \frac{c}{2}\right] = \Pr\left[(c - \ell_{\pi_0})_+^2 \ge \frac{c^2}{4}\right] \le \frac{\mathbb{E}[(c - \ell_{\pi_0})_+^2]}{c^2/4} = \frac{\Phi_{c,\mu}(\pi_0)}{c^2/4} = O(\mathrm{IC}_\mu(\pi_0) - \mathrm{IC}_\mu(\mathrm{AND}, 0)). \tag{6.45}$$
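Equation (6.44) can be double-checked numerically. The sketch below (ours; it assumes the leaf structure from Section 6.1, with q ≤ p and 2p ≤ 1) integrates the density 2pq/ℓ³ over [p, 2p] by quadrature, adds the atom 1 − q/p at ℓ = p, and confirms the value 1 − q/(4p) ≥ 3/4:

```python
# Pr[p <= l <= 2p] for the buzzer leaf distribution, q <= p:
# atom (1 - q/p) at l = p plus density 2pq/l^3 on (p, 2p].
def prob_in_band(p, q, steps=100000):
    width = p / steps
    integral = sum(2 * p * q / (p + (k + 0.5) * width) ** 3
                   for k in range(steps)) * width
    return (1 - q / p) + integral

for p, q in [(0.3, 0.3), (0.5, 0.2), (0.4, 0.1)]:
    pr = prob_in_band(p, q)
    assert abs(pr - (1 - q / (4 * p))) < 1e-6
    assert pr >= 0.75 - 1e-9
```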
Now we are ready to prove Theorem 6.5, and thus complete the proof of Theorem 3.7.

Proof of Theorem 6.5. We first prove the theorem for full-support distributions. Consider such a distribution µ. Let π be a protocol performing [AND, ε]. We can assume that IC_µ(π) ≤ IC_µ(AND, 0). Let

    C = max{Pr_µ[X = 1], Pr_µ[Y = 1]}/4,    κ = Pr[max{p, q} ≤ C].
Lemma 6.7 constructs a zero-error protocol π0 whose wastage w is at most

$$w = O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + (1 - \kappa)\, h\big(\tfrac{\varepsilon}{1-\kappa}\big)\right). \tag{6.46}$$

Lemma 6.8 states that κ = O(w), and so

$$\kappa = O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + (1 - \kappa)\, h\big(\tfrac{\varepsilon}{1-\kappa}\big)\right). \tag{6.47}$$
If ε/(1 − κ) ≤ 1/2, then (3.3) shows that

$$\kappa = O\left(\kappa\, h\big(\sqrt{\varepsilon/\kappa}\big) + h(\varepsilon)\right). \tag{6.48}$$

Otherwise, κ ≥ 1 − 2ε ≥ 1/2 (assuming ε ≤ 1/4), and so

    κ = O(h(√ε) + (1 − κ)) = O(h(√ε) + ε),    (6.49)

which contradicts κ ≥ 1/2 for small enough ε.
Denoting the hidden constant in (6.48) by M, we get

$$\left(1 - M h\big(\sqrt{\varepsilon/\kappa}\big)\right) \kappa \le M h(\varepsilon). \tag{6.50}$$

We will show that for small ε, this forces κ ≤ 2Mh(ε). Indeed, suppose that κ > 2Mh(ε), which implies that κ > 2Mε log(1/ε). Then

$$\frac{\varepsilon}{\kappa} < \frac{1}{2M \log(1/\varepsilon)}, \tag{6.51}$$

and so for small enough ε, Mh(√(ε/κ)) < 1/2. This shows that

$$\left(1 - M h\big(\sqrt{\varepsilon/\kappa}\big)\right) \kappa > \frac{\kappa}{2} > M h(\varepsilon), \tag{6.52}$$

contradicting the inequality above. We conclude that for small ε we have κ = O(h(ε)).
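The self-improvement step above can also be illustrated numerically. The sketch below (ours, with the hidden constant set to the hypothetical value M = 1) iterates the map suggested by (6.47) to its fixed point and checks that the resulting κ stays within a constant factor of h(ε):

```python
import math

def h(x):
    return 0.0 if x <= 0 or x >= 1 else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def kappa_fixed_point(eps, M=1.0, iters=200):
    # iterate kappa <- M*(kappa*h(sqrt(eps/kappa)) + h(eps)), cf. (6.47)
    kappa = 0.5
    for _ in range(iters):
        kappa = M * (kappa * h(math.sqrt(eps / kappa)) + h(eps))
    return kappa

for eps in [1e-3, 1e-6, 1e-9]:
    ratio = kappa_fixed_point(eps) / h(eps)
    assert 1.0 <= ratio < 6.0   # kappa = O(h(eps)) with a modest constant
```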
Applying Lemma 6.7 again, we see that

$$\mathrm{IC}_\mu(\pi_0) - \mathrm{IC}_\mu(\pi) \le \kappa \cdot O\big(h\big(\sqrt{\varepsilon/\kappa}\big)\big) + O(h(\varepsilon)) \le O(\kappa) + O(h(\varepsilon)) = O(h(\varepsilon)). \tag{6.53}$$

Since IC_µ(π0) ≥ IC_µ(AND, 0), we conclude that IC_µ(π) ≥ IC_µ(AND, 0) − O(h(ε)).
Next consider a distribution µ with µ(1, 1) = 0 that assigns a strictly positive probability to every other input. There is a sequence of full-support distributions µ¹, µ², . . . that converges to µ, and we may assume without loss of generality that for every input a ∈ {0, 1}² and every n ∈ ℕ, µⁿ(a) ≥ µ(a)/2. From the continuity of information complexity with respect to the tasks [AND, 0] and [AND, ε],

    lim_{n→∞} IC_{µⁿ}(AND, 0) = IC_µ(AND, 0),    (6.54)

and

    lim_{n→∞} IC_{µⁿ}(AND, ε) = IC_µ(AND, ε).    (6.55)

Assume that µ(0, 0), µ(0, 1), µ(1, 0) are bounded away from 0. It is possible to decompose each µⁿ into ν ⊙ (p, q), where ν is symmetric and p, q, ν(0, 0), ν(0, 1), ν(1, 0) are bounded away from 0 and 1; this is done by considering a decomposition with p = 1/2 and q chosen so that ν is symmetric. Therefore, there is a constant C > 0 such that

    IC_{µⁿ}(AND, ε) ≥ IC_{µⁿ}(AND, 0) − Ch(ε).    (6.56)

Taking limits,

    IC_µ(AND, ε) ≥ IC_µ(AND, 0) − Ch(ε).    (6.57)
7 The set disjointness function with error
In this section we present the proofs of the results concerning the set disjointness function. It will be
convenient to switch the roles of 0 and 1 in the range of the function, and redefine DISJn as
DISJn (X,Y ) = ∨ni=1 (Xi ∧Yi ) ,
i. e., DISJn (X,Y ) = 0 if the inputs are disjoint and it is equal to 1 otherwise. Obviously, this will not
affect the correctness of our results.
7.1 Proof of Theorem 3.11
Theorem 3.11 (restated). For the set disjointness function DISJn on inputs of length n, we have

$$\limsup_{n \to \infty} \frac{R_\varepsilon(\mathrm{DISJ}_n)}{n} \le \mathrm{IC}_0(\mathrm{AND}, 0) - C h(\varepsilon), \tag{7.1}$$

for some constant C > 0.
We will prove the following lemma, from which Theorem 3.11 follows using Corollary 3.8.
Lemma 7.1.

$$\limsup_{n \to \infty} \frac{R_\varepsilon(\mathrm{DISJ}_n)}{n} \le \mathrm{IC}_0(\mathrm{AND}, \varepsilon, 1 \to 0). \tag{7.2}$$
Intuitively, an upper bound like Lemma 7.1 is essentially a compression result. Moreover, as DISJn has a self-reducible structure (see [6]), one can combine this fact with the Braverman–Rao [7] compression. A difficulty is that what we want to solve is [DISJn, ε], where the allowed error is non-distributional, while the error unavoidably introduced in the compression phase is distributional. Fortunately, this can be salvaged by a minimax argument introduced in Section 6.2 of [4].
In order to use self-reducibility and compression, one first needs control over the information cost of solving [DISJn, ε].

Lemma 7.2.

$$\limsup_{n \to \infty} \frac{\mathrm{IC}(\mathrm{DISJ}_n, \varepsilon, 1 \to 0)}{n} \le \mathrm{IC}_0(\mathrm{AND}, \varepsilon, 1 \to 0), \tag{7.3}$$

where IC(DISJn, ε, 1 → 0) := max_µ IC_µ(DISJn, ε, 1 → 0).
The proof is a direct adaptation of the proof of Lemma 8.5 in [5].
Proof. Let Ω0 denote the set of all measures µ on {0, 1}² with µ(1, 1) = 0. Let π be a protocol that computes [AND, ε, 1 → 0] and satisfies

    max_{µ∈Ω0} IC_µ(π) ≤ IC_0(AND, ε, 1 → 0) + δ

for some small δ > 0. Consider the following protocol τ that computes DISJn with error.
• Alice and Bob exchange (sampling with replacement using public randomness) n^{2/3} random coordinates. Denote this set of random coordinates by J. If for some j ∈ J, x_j = 1 and y_j = 1, then they output 1 and terminate.
• For each coordinate outside J, Alice and Bob run the protocol π and output 1 if π outputs 1 on
some coordinate. Otherwise they output 0.
The intuition behind the protocol is as follows: fix some µ. If, with high probability, xi = yi = 1 for many coordinates i, then with high probability only the first step of the protocol is performed, and IC_µ(τ) ≲ n^{2/3}.
Otherwise, with high probability, (xi, yi) ≠ (1, 1) for most coordinates i. Hence, for most coordinates i, the information complexity of evaluating π(xi, yi) is not much higher than its information complexity under some distribution that gives zero mass to x = y = 1, which in turn is not much higher than IC_0(AND, ε, 1 → 0). As π has one-sided 1 → 0 error, obviously τ has only one-sided 1 → 0 error too, and this error happens with probability at most ε^d ≤ ε, where d is the number of coordinates outside J which satisfy x_j = y_j = 1 (if x_j = y_j = 1 for some coordinate in J, there is no error). In particular, τ computes [DISJn, ε, 1 → 0].
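The one-sided error composition can be illustrated with a small Monte Carlo sketch (ours; `run_coordinate` is a hypothetical stand-in for the per-coordinate protocol π, modeled as erring 1 → 0 with probability ε): on disjoint inputs the combined protocol never errs, and with d intersecting coordinates it errs with probability about ε^d.

```python
import random

def run_coordinate(and_value, eps):
    # one-sided 1 -> 0 error: on AND = 1, output 0 with probability eps;
    # on AND = 0, never err
    if and_value == 1:
        return 0 if random.random() < eps else 1
    return 0

def tau(x, y, eps):
    # output 1 iff some per-coordinate run outputs 1
    return int(any(run_coordinate(xi & yi, eps) for xi, yi in zip(x, y)))

random.seed(0)
eps = 0.3
# disjoint input: never errs
assert all(tau([1, 0, 1, 0], [0, 1, 0, 1], eps) == 0 for _ in range(1000))
# input with d = 2 intersecting coordinates: error rate about eps^2 = 0.09
trials = 20000
errs = sum(tau([1, 1, 0], [1, 1, 0], eps) == 0 for _ in range(trials))
assert errs / trials < eps ** 2 + 0.02
```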
A direct inspection shows that the remaining part of the proof of Lemma 8.5 in [5] depends only on the protocol and not on the specific problem; hence that proof carries over, and the lemma follows.
Next we prove an amortized upper bound for DISJn.

Lemma 7.3. For every ε, δ > 0 and every n, there exists a constant C(n, ε, δ) > 0 such that for all N ≥ C(n, ε, δ),

$$\frac{R_\varepsilon(\mathrm{DISJ}_{nN})}{N} \le (1 + \delta)\, \mathrm{IC}(\mathrm{DISJ}_n, \varepsilon, 1 \to 0). \tag{7.4}$$
We will use the following claim [4, Claim 6.11]. Note that we state it slightly differently from its original form in order to adapt it to our purpose.
We introduce some notation that will be used in the statement of the following claim and in the proof of Lemma 7.3: let π be a protocol getting its inputs from X × Y. For every positive integer k ∈ ℕ, let π^k denote the protocol that gets inputs from X^k × Y^k and executes π on each pair (xi, yi) (i = 1, 2, . . . , k) independently. Define IC(π) := max_µ IC_µ(π), where the maximum is over all distributions µ ∈ ∆(X × Y).
Claim 7.4 ([4]). Let π be a protocol getting its inputs from X × Y. For any numbers δ1, δ2 > 0, there exists M ∈ ℕ such that for any integer m ≥ M, there exists a protocol π_{m³} that gets inputs from X^{m³} × Y^{m³} and:

• For each input in X^{m³} × Y^{m³}, the total variation distance between the output of π_{m³} and the output of π^{m³} is at most δ1.
• The worst-case communication complexity of π_{m³} is at most m³ IC(π)(1 + δ2).
Proof of Lemma 7.3. Denote I := IC(DISJn, ε, 1 → 0). By the continuity of information complexity with respect to ε (Lemma 2.4, which holds for one-sided error with the same proof), there exists some ξ > 0 such that

    IC(DISJn, ε − ξ, 1 → 0) ≤ I(1 + δ)^{1/4}.

Let π be a protocol for the task [DISJn, ε − ξ, 1 → 0] such that IC(π) ≤ I(1 + δ)^{1/2}. Applying Claim 7.4 to π with δ1 = ξ and δ2 = (1 + δ)^{1/4} − 1, we obtain a number M such that for any m ≥ M there is a protocol π_{m³} with communication complexity at most

    m³ IC(π)(1 + δ)^{1/4} ≤ m³ I(1 + δ)^{3/4},

and for each input (x, y) the total variation distance between the outputs of π_{m³} and π^{m³} is at most ξ.
Let τ_{nm³} be the protocol for DISJ_{nm³} in which the parties split their input into m³ blocks of n bits, simulate π_{m³}, and output “disjoint” if π_{m³} outputs “disjoint” for all m³ sub-problems. Similarly, define τ̄_{nm³} with respect to π^{m³}. Note that the communication complexity of τ_{nm³} is the same as the communication complexity of π_{m³}, which is bounded by m³ I(1 + δ)^{3/4}.
We will show that τ_{nm³} has a worst-case error of at most ε. For any input (x, y) ∈ {0, 1}^{nm³}, let T_{nm³}(x, y) be the random variable denoting the transcript of τ_{nm³} conditioned on the input being (x, y), and similarly define Π_{m³}(x, y) and Π^{m³}(x, y) with respect to π_{m³} and π^{m³}, respectively. Let ϕ : {0, 1}^{m³} → {0, 1} be the function such that τ_{nm³} outputs ϕ(v) whenever the simulated π_{m³} outputs v. The probability that τ_{nm³} errs on input (x, y) equals

$$\begin{aligned} \Pr[T_{nm^3}(x, y) \ne \mathrm{DISJ}_{nm^3}(x, y)] &= \Pr\left[\Pi_{m^3}(x, y) \in \{v \in \{0,1\}^{m^3} : \varphi(v) \ne \mathrm{DISJ}_{nm^3}(x, y)\}\right] & (7.6) \\ &\le \Pr\left[\Pi^{m^3}(x, y) \in \{v \in \{0,1\}^{m^3} : \varphi(v) \ne \mathrm{DISJ}_{nm^3}(x, y)\}\right] + \xi & (7.7) \\ &\le \varepsilon, & (7.8) \end{aligned}$$

where (7.7) follows from the fact that given any (x, y), the transcripts of π_{m³} and π^{m³} have a total variation distance of at most ξ, and (7.8) follows from the fact that π^{m³} simulates m³ copies of DISJn, each with a one-sided error of ε − ξ. This implies that for all m ≥ M,
$$\frac{R_\varepsilon(\mathrm{DISJ}_{nm^3})}{m^3} \le I(1 + \delta)^{3/4}. \tag{7.9}$$
To complete the proof, since

$$\lim_{N \to \infty} \frac{(N^{1/3} + 1)^3}{N} = 1,$$

there exists C0 > 0 such that for all N ≥ C0 one has

$$\frac{(N^{1/3} + 1)^3}{N} \le (1 + \delta)^{1/4}.$$

Set C(n, ε, δ) := max(M³, C0). Then for any N ≥ C(n, ε, δ),

$$\frac{R_\varepsilon(\mathrm{DISJ}_{nN})}{N} \le \frac{R_\varepsilon(\mathrm{DISJ}_{n\lceil N^{1/3}\rceil^3})}{\lceil N^{1/3}\rceil^3} \cdot \frac{\lceil N^{1/3}\rceil^3}{N} \le I(1+\delta)^{3/4}\, \frac{\lceil N^{1/3}\rceil^3}{N} \le I(1+\delta)^{3/4}\, \frac{(N^{1/3}+1)^3}{N} \le I(1+\delta), \tag{7.10}$$

as desired.
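The existence of C0 in the proof is just the statement that (N^{1/3} + 1)³/N → 1; a quick numerical sketch (ours, with δ = 0.1 as an example value):

```python
def factor(N):
    # the correction factor (N^(1/3) + 1)^3 / N used in (7.10)
    return (N ** (1 / 3) + 1) ** 3 / N

delta = 0.1
threshold = (1 + delta) ** 0.25          # about 1.0241
assert factor(10 ** 6) > factor(10 ** 9) > factor(10 ** 12) > 1
assert factor(10 ** 7) <= threshold      # so C0 = 10**7 suffices for delta = 0.1
```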
Now we prove the upper bound.

Proof of Lemma 7.1. Fix ε > 0. By Lemma 7.2, there exists T(ε) depending on ε such that

    IC(DISJn, ε, 1 → 0) ≤ n IC_0(AND, ε, 1 → 0) + o(n)    (7.11)

whenever n ≥ T(ε). For every such sufficiently large n, choose δ = 1/n. Lemma 7.3 states that

$$\frac{R_\varepsilon(\mathrm{DISJ}_{nN})}{N} \le \left(1 + \frac{1}{n}\right) \mathrm{IC}(\mathrm{DISJ}_n, \varepsilon, 1 \to 0) \tag{7.12}$$

whenever N ≥ C(n, ε) for some constant C(n, ε). Since IC(DISJn, ε, 1 → 0) ≤ n,

$$\frac{R_\varepsilon(\mathrm{DISJ}_{nN})}{nN} \le \mathrm{IC}_0(\mathrm{AND}, \varepsilon, 1 \to 0) + \frac{1}{n} + o(1) \tag{7.13}$$

for N ≥ C(n, ε). It follows that

$$\frac{R_\varepsilon(\mathrm{DISJ}_M)}{M} \le \mathrm{IC}_0(\mathrm{AND}, \varepsilon, 1 \to 0) + o(1), \tag{7.14}$$

where o(1) → 0 as M → ∞, completing the proof.
7.2 A protocol for Set-Disjointness
Theorem 3.13 (restated). For the set disjointness function DISJn on inputs of length n, we have

$$\mathrm{IC}_D(\mathrm{DISJ}_n, \varepsilon) \le n\left(\mathrm{IC}_0(\mathrm{AND}, 0) - C_1\sqrt{h(\varepsilon)}\right) + C_2 \log n, \tag{7.15}$$

for constants C1, C2 > 0.
Proof. We already established the lower bound in (3.25); it remains to prove the upper bound. Let µ be an input distribution for DISJn, and let p = Pr_µ[DISJn(X, Y) = 1]. We can assume that p ≥ ε, as otherwise IC_µ(DISJn, ε) = 0 and the upper bound trivially holds. Below we introduce a protocol π (Figure 2) that solves [DISJn, µ, ε] and has the desired information cost. In fact, our protocol is stronger in the sense that it has only one-sided error: the protocol π always outputs 0 correctly if the correct output is 0, and on the other hand, if there are t ≥ 1 coordinates satisfying Xi = Yi = 1, then π erroneously outputs 0 with probability at most (ε/2p)^t ≤ ε/2p. Thus the distributional error of π is at most p · ε/(2p) < ε, and π indeed solves [DISJn, µ, ε, 1 → 0].
On input (X, Y):

• Alice and Bob, using public randomness, jointly sample a permutation σ on the set {1, 2, . . . , n} uniformly at random, and they run the following sub-protocol π^σ:

• For i = 1, 2, . . . , n repeat:

    – Alice and Bob run a protocol π_i^σ that is (almost) optimal for IC_{νi}(AND, ε/2p, 1 → 0) on input (X_{σ(i)}, Y_{σ(i)}), where νi is the distribution of (X_{σ(i)}, Y_{σ(i)}) conditioned on the event that the protocol has not yet terminated;

    – if the protocol π_i^σ outputs 1, then terminate and output 1.

• If the for-loop ends without outputting 1, output 0 and terminate.

Figure 2: The protocol π that solves [DISJn, µ, ε, 1 → 0].
We now analyze the information cost, starting with the sub-protocol π^σ. Let Π^σ be the transcript of π^σ, and write

    Π^σ = Π_1^σ . . . Π_n^σ,

where Π_i^σ denotes the transcript of the protocol π_i^σ for i = 1, . . . , n. As usual, let

    Π_{<i}^σ = Π_1^σ . . . Π_{i−1}^σ

be the partial transcript. Let µi denote the distribution of X_{σ(i)}Y_{σ(i)}, and νi the distribution of X_{σ(i)}Y_{σ(i)} conditioned on Π_{<i}^σ. Corollary 3.9 (iii) gives a bound on the information exchanged in each round: there exist constants C1, C2 > 0 such that for any distribution ν,

    IC_ν(AND, ε/2p, 1 → 0) ≤ IC_0(AND, 0) + C1 h(ν(1, 1)) − C2 h(ε/p).    (7.16)
Note that (Π_i^σ | XY Π_{<i}^σ) has the same distribution as (Π_i^σ | X_{σ(i)}Y_{σ(i)} Π_{<i}^σ), and thus

$$I(Y; \Pi^\sigma \mid X) = \sum_{i=1}^{n} I(Y; \Pi_i^\sigma \mid X, \Pi_{<i}^\sigma) = \sum_{i=1}^{n} \left[H(\Pi_i^\sigma \mid X, \Pi_{<i}^\sigma) - H(\Pi_i^\sigma \mid XY, \Pi_{<i}^\sigma)\right]$$
$$\le \sum_{i=1}^{n} \left[H(\Pi_i^\sigma \mid X_{\sigma(i)}, \Pi_{<i}^\sigma) - H(\Pi_i^\sigma \mid X_{\sigma(i)}Y_{\sigma(i)}, \Pi_{<i}^\sigma)\right] = \sum_{i=1}^{n} I(Y_{\sigma(i)}; \Pi_i^\sigma \mid X_{\sigma(i)}, \Pi_{<i}^\sigma).$$
Thus, denoting by T^σ the number of AND protocols executed before the termination of π^σ, the above inequality implies (note that νi is a random variable, and π_i^σ depends on νi):

$$\begin{aligned} \mathrm{IC}_\mu(\pi^\sigma) &\le \sum_{i=1}^{n} \mathbb{E}\,\mathrm{IC}_{\nu_i}(\pi_i^\sigma) \le \sum_{i=1}^{n} \Pr[T^\sigma \ge i]\, \mathbb{E}\left[\mathrm{IC}_{\nu_i}(\pi_i^\sigma) \mid T^\sigma \ge i\right] \\ &\le \sum_{i=1}^{n} \Pr[T^\sigma \ge i]\, \mathbb{E}\left[\mathrm{IC}_0(\mathrm{AND}, 0) + C_1 h(\nu_i(1,1)) - C_2 h(\varepsilon/p) \mid T^\sigma \ge i\right] \\ &\le \left(\mathrm{IC}_0(\mathrm{AND}, 0) - C_2 h(\varepsilon/p)\right) \mathbb{E}[T^\sigma] + C_1 \sum_{i=1}^{n} \Pr[T^\sigma \ge i]\, \mathbb{E}\left[h(\nu_i(1,1)) \mid T^\sigma \ge i\right]. \end{aligned}$$
We want to bound the second term. Note that since p ≥ ε,

$$\Pr[T^\sigma = i \mid T^\sigma \ge i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1] = \Pr[\pi_i^\sigma(X_{\sigma(i)}Y_{\sigma(i)}) = 1 \mid T^\sigma \ge i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1] \ge 1 - \frac{\varepsilon}{2p} \ge \frac{1}{2}. \tag{7.17}$$
Hence, applying (3.3) twice and using the concavity of h, we get

$$\begin{aligned} \Pr[T^\sigma \ge i]\, \mathbb{E}\left[h(\nu_i(1,1)) \mid T^\sigma \ge i\right] &\le \Pr[T^\sigma \ge i]\, h\left(\mathbb{E}[\nu_i(1,1) \mid T^\sigma \ge i]\right) \\ &= \Pr[T^\sigma \ge i]\, h(\Pr[X_{\sigma(i)} = Y_{\sigma(i)} = 1 \mid T^\sigma \ge i]) \\ &\le h(\Pr[X_{\sigma(i)} = Y_{\sigma(i)} = 1 \mid T^\sigma \ge i]\, \Pr[T^\sigma \ge i]) \\ &= h(\Pr[T^\sigma \ge i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1]) \\ &\le 2 \Pr[T^\sigma = i \mid T^\sigma \ge i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1] \cdot h(\Pr[T^\sigma \ge i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1]) \\ &\le 2h(\Pr[T^\sigma = i,\, X_{\sigma(i)} = Y_{\sigma(i)} = 1]) \\ &\le 2h(\Pr[T^\sigma = i,\, \pi(X, Y) = 1]). \end{aligned}$$
Using concavity of h again,

$$\frac{1}{n} \sum_{i=1}^{n} h(\Pr[T^\sigma = i,\, \pi(X, Y) = 1]) \le h(\Pr[\pi(X, Y) = 1]/n) \le h(p/n). \tag{7.18}$$

Therefore

$$\sum_{i=1}^{n} \Pr[T^\sigma \ge i]\, \mathbb{E}\left[h(\nu_i(1,1)) \mid T^\sigma \ge i\right] \le 2nh(p/n). \tag{7.19}$$
That is, we have shown

    IC_µ(π^σ) ≤ (IC_0(AND, 0) − C_2 h(ε/p)) E[T^σ] + 2C_1 nh(p/n).    (7.20)

Taking the expectation with respect to σ, we obtain

    IC_µ(π) = E_σ[IC_µ(π^σ)] ≤ (IC_0(AND, 0) − C_2 h(ε/p)) E_{σ,XY}[T^σ] + 2C_1 nh(p/n).    (7.21)
Hence it remains to bound E[T σ ] where the expectation is over σ and the input XY .
Let x, y be such that DISJ(x, y) = 1, and let j be an index such that AND(x_j, y_j) = 1. Then

$$\begin{aligned} \mathbb{E}[T^\sigma \mid XY = xy] &= \sum_{i=1}^{n} \Pr[\sigma(i) = j]\, \mathbb{E}[T^\sigma \mid XY = xy,\, \sigma(i) = j] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[T^\sigma \mid XY = xy,\, \sigma(i) = j] \\ &= \frac{1}{n} \sum_{i=1}^{n} \sum_{b=0,1} \mathbb{E}[T^\sigma \mid XY = xy,\, \sigma(i) = j,\, \pi_i^\sigma(X, Y) = b]\, \Pr[\pi_i^\sigma(X, Y) = b \mid XY = xy,\, \sigma(i) = j] \\ &\le \frac{1}{n} \sum_{i=1}^{n} \left(i\, \Pr[\pi_i^\sigma(X, Y) = 1 \mid XY = xy,\, \sigma(i) = j] + n\, \Pr[\pi_i^\sigma(X, Y) = 0 \mid XY = xy,\, \sigma(i) = j]\right) \\ &\le \frac{1}{n} \sum_{i=1}^{n} i\left(1 - \frac{\varepsilon}{2p}\right) + \frac{\varepsilon}{2p}\, n = \left(1 - \frac{\varepsilon}{2p}\right) \frac{n+1}{2} + \frac{\varepsilon}{2p}\, n \le \frac{n+1}{2} + \frac{\varepsilon}{4p}\, n. \end{aligned}$$
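The stopping-time bound can be checked by simulation. The sketch below (ours) models a single intersecting coordinate placed uniformly by σ, detected with probability 1 − ε/2p in its round, and compares the empirical mean of T^σ with the exact expression (1 − ε/2p)(n + 1)/2 + (ε/2p)n and the bound (n + 1)/2 + εn/(4p):

```python
import random

def expected_stops(n, err, trials=100000, seed=1):
    # one intersecting coordinate, placed uniformly by the random permutation;
    # detected with probability 1 - err in its round, else the loop runs to n
    random.seed(seed)
    total = 0
    for _ in range(trials):
        i = random.randint(1, n)
        total += i if random.random() > err else n
    return total / trials

n, eps, p = 100, 0.2, 0.5
err = eps / (2 * p)
exact = (1 - err) * (n + 1) / 2 + err * n
assert abs(expected_stops(n, err) - exact) < 0.5
assert exact <= (n + 1) / 2 + eps * n / (4 * p)
```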
This yields the following bound:

E_{σ,XY}[T^σ] = Pr[DISJ(X,Y) = 1] E[T^σ | DISJ(X,Y) = 1] + Pr[DISJ(X,Y) = 0] E[T^σ | DISJ(X,Y) = 0]  (7.22)
≤ p ((n+1)/2 + (ε/(4p)) n) + (1 − p) n ≤ (2p/3) n + (ε/4) n + (1 − p) n = (1 − p/3 + ε/4) n.  (7.23)
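The elementary estimates behind (7.23), namely p(n+1)/2 ≤ (2p/3)n for n ≥ 3 and p · (ε/(4p))n = (ε/4)n, can be checked mechanically; a purely illustrative numerical sanity check:

```python
# Sanity check for the arithmetic in (7.23):
#   p*((n+1)/2 + (eps/(4p))*n) + (1-p)*n  <=  (1 - p/3 + eps/4)*n   for n >= 3.
for n in range(3, 200):
    for p in (0.1, 0.3, 0.5, 0.9, 1.0):
        for eps in (0.001, 0.01, 0.1):
            lhs = p * ((n + 1) / 2 + (eps / (4 * p)) * n) + (1 - p) * n
            rhs = (1 - p / 3 + eps / 4) * n
            # lhs - rhs = p*(3 - n)/6, which is <= 0 once n >= 3
            assert lhs <= rhs + 1e-9
```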
Combining (7.21) and (7.23), we get

IC_µ(π) ≤ n (1 − p/3 + ε/4) (IC_0(AND, 0) − C_2 h(ε/p)) + 2C_1 n h(p/n)
≤ n (IC_0(AND, 0) − C_3 h(ε/p) + p) + C_4 n h(p/n),

for constants C_3, C_4 > 0.
It remains to optimize over p. We start by minimizing p + h(ε/p). Up to a constant multiple, the minimum is attained at the point satisfying p = h(ε/p). A simple calculation shows that p ≈ √h(ε), and so p + h(ε/p) = Ω(√h(ε)). Thus

IC_µ(π) ≤ n[IC_0(AND, 0) − Ω(√h(ε))] + O(n h(p/n)).  (7.24)
The value of the error term O(nh(p/n)) is at most O(nh(1/n)) = O(n log(n)/n) = O(log n), and the
theorem follows.
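The balance point p = h(ε/p) and the claim p ≈ √h(ε) can be illustrated by a small grid search (purely illustrative; the restriction p ≥ 2ε below is an assumption keeping ε/p away from 1, where h(ε/p) degenerates to h(1) = 0):

```python
import math

def h(x):
    # binary entropy function
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

eps = 1e-4
N = 200000
# grid over p in [2*eps, 1]
ps = [2 * eps + k * (1 - 2 * eps) / N for k in range(N + 1)]
best_p = min(ps, key=lambda p: p + h(eps / p))
val = best_p + h(eps / best_p)
# both the minimizer and the minimum value come out of order sqrt(h(eps))
print(best_p, val, math.sqrt(h(eps)))
```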
8 Open problems and concluding remarks
• In Conjecture 3.12 we speculated that the exact asymptotics of Rε(DISJn) are given by the information complexity of the AND function when only one-sided error is allowed:

Rε(DISJn) = n IC_0(AND, ε, 1 → 0) ± o(n).  (8.1)
The set disjointness function has a “self-reducible” structure in the sense that it is possible to
solve an instance of the corresponding communication problem by dividing the input into blocks
and solving the same problem on each block separately. This structure allows us to relate the
communication complexity of the problem to its amortized communication complexity, and thus to
its information complexity via the fundamental result of Braverman and Rao [7]. Applying such
ideas we showed (the lower bound is obvious)
IC(DISJn, ε) ≤ Rε(DISJn) ≤ m IC(DISJ_{n/m}, ε, 1 → 0) + o(n),  (8.2)
for an appropriate choice of m = m(n) that tends to infinity as n → ∞. In Theorem 3.11 we combined this with our analysis of the information complexity of the set disjointness function to prove Rε(DISJn) = n[IC_0(AND, 0) − Θ(h(ε))]. More precisely, we showed
n IC0 (AND, ε) ≤ IC(DISJn , ε) ≤ IC(DISJn , ε, 1 → 0) ≤ n IC0 (AND, ε, 1 → 0) + o(n), (8.3)
and combined it with our results regarding the information complexity of the AND function. We
believe that the upper bound is the truth; that is
IC(DISJn , ε) ≥ n IC0 (AND, ε, 1 → 0) − o(n), (8.4)
which would imply Conjecture 3.12.
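The self-reducible structure described above can be made concrete. The toy sketch below only illustrates the combinatorial decomposition behind (8.2), not the protocol itself; following the convention used in Section 7, DISJ(x, y) = 1 when the inputs intersect, so DISJn splits into m independent copies of DISJ_{n/m}:

```python
from itertools import product

def disj(x, y):
    # DISJ(x, y) = 1 iff there is a coordinate j with AND(x_j, y_j) = 1
    return int(any(a & b for a, b in zip(x, y)))

def disj_via_blocks(x, y, m):
    # self-reducibility: split the n coordinates into m blocks of size n/m
    # and combine the answers of DISJ on the blocks
    n = len(x)
    assert n % m == 0
    s = n // m
    return int(any(disj(x[k*s:(k+1)*s], y[k*s:(k+1)*s]) for k in range(m)))

# exhaustive check on small inputs that the block decomposition is exact
n, m = 6, 3
for x in product([0, 1], repeat=n):
    for y in product([0, 1], repeat=n):
        assert disj(x, y) == disj_via_blocks(x, y, m)
```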
• The example of the AND function shows that the Ω(h(ε)) gain in the information cost appearing in our upper bounds in Theorems 3.2, 3.6, 3.15 and 3.16 is tight. However, we do not know whether the O(h(√ε)) gain appearing in the lower bounds in Theorems 3.5 and 3.6, Corollary 3.14 and Theorem 3.16 is sharp. In fact, we are not aware of any example that exhibits a gain that is not Θ(h(ε)). Is it true that for every function f : X × Y → Z and measure µ on X × Y with IC_µ(f, 0) > 0, we have IC_µ(f, ε) = IC_µ(f, 0) − Θ(h(ε))? One can ask a similar question for IC_µ(f, µ, ε), IC(f, ε), and IC_D(f, ε).
• Recall that the inner product function IP_n : {0,1}^n × {0,1}^n → {0,1} is defined as

IP_n : (x, y) ↦ ∑_{i=1}^n x_i y_i mod 2.  (8.5)
Let ν denote the uniform distribution on {0,1}^n × {0,1}^n. In [6, Theorem 1.3], Braverman et al. exploited the self-reducibility property of IP_n and showed that for any δ > 0 there exists ε > 0 such that

lim inf_{n→∞} IC_ν(IP_n, ν, ε)/n ≥ 1 − δ.  (8.6)

This in particular implies

lim_{n→∞} IC_ν(IP_n, ν, 0)/n = 1.

In the same paper they also showed IC_ν(IP_n, ν, ε) ≤ (1 − 2ε)(n + 1) for any 0 < ε < 1/2; alternatively,

lim sup_{n→∞} IC_ν(IP_n, ν, ε)/n ≤ 1 − 2ε.  (8.7)
In [6, Problem 1.4] they asked whether this upper bound is tight, possibly with the coefficient 2 replaced by some other constant α. Formally, does there exist a constant α ≥ 2 such that

lim inf_{n→∞} IC_ν(IP_n, ν, ε)/n ≥ 1 − αε − o(ε)?  (8.8)

A positive answer would imply that, by allowing ε-error for IP_n, one can save roughly Θ(ε)n bits of information compared to 0-error protocols as n → ∞. Note that our bound IC_ν(f, ν, ε) ≤ IC_ν(f, ν, 0) − Ω(h(ε)) in Theorem 3.6 does not refute (8.8), as the constant in the Ω(·) is o(n) when f = IP_n.
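For concreteness, the inner product function in (8.5) is straightforward to compute; a minimal illustrative sketch:

```python
def ip(x, y):
    # IP_n(x, y) = sum_i x_i * y_i mod 2
    return sum(a * b for a, b in zip(x, y)) % 2

# e.g., x = (1, 1, 0), y = (1, 1, 1): 1*1 + 1*1 + 0*1 = 2, so IP_3(x, y) = 0
```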
• The focus of this paper has been on the internal information complexity, since, compared to external information complexity, it has nice properties such as the direct sum property (see the introduction for references). However, it would still be interesting to study external information complexity. Considering that external information complexity is typically simpler than internal information complexity, we believe that the analogues of many of our results, especially those about the AND function, can be proven in this setting as well. We defer this to future research.
References
[1] László Babai, Peter Frankl, and János Simon: Complexity classes in communication complexity theory. In Proc. 27th FOCS, pp. 337–347. IEEE Comp. Soc. Press, 1986. [doi:10.1109/SFCS.1986.15] 4

[2] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar: An information statistics approach to data stream and communication complexity. J. Comput. System Sci., 68(4):702–732, 2004. Preliminary version in FOCS'02. [doi:10.1016/j.jcss.2003.11.006] 4

[3] Boaz Barak, Mark Braverman, Xi Chen, and Anup Rao: How to compress interactive communication. SIAM J. Comput., 42(3):1327–1363, 2013. Preliminary version in STOC'10. [doi:10.1137/100811969] 4, 11

[4] Mark Braverman: Interactive information complexity. SIAM J. Comput., 44(6):1698–1739, 2015. Preliminary version in STOC'12. [doi:10.1137/130938517] 4, 5, 6, 9, 13, 14, 22, 23, 24, 61, 62

[5] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein: From information to exact communication (extended abstract). In Proc. 45th STOC, pp. 151–160. ACM Press, 2013. [doi:10.1145/2488608.2488628] 4, 5, 6, 7, 13, 18, 19, 20, 21, 22, 42, 49, 50, 53, 61, 62

[6] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein: Information lower bounds via self-reducibility. Theory Comput. Syst., 59(2):377–396, 2016. Preliminary version in CSR'13. [doi:10.1007/s00224-015-9655-z] 13, 42, 61, 68, 69

[7] Mark Braverman and Anup Rao: Information equals amortized communication. IEEE Trans. Inform. Theory, 60(10):6058–6069, 2014. Preliminary version in FOCS'11. [doi:10.1109/TIT.2014.2347282, arXiv:1106.3595] 4, 22, 61, 68
[8] Mark Braverman, Anup Rao, Omri Weinstein, and Amir Yehudayoff: Direct product via round-preserving compression. In Proc. 40th Internat. Colloq. on Automata, Languages and Programming (ICALP'13), pp. 232–243. Springer, 2013. [doi:10.1007/978-3-642-39206-1_20] 4

[9] Mark Braverman, Anup Rao, Omri Weinstein, and Amir Yehudayoff: Direct products in communication complexity. In Proc. 54th FOCS, pp. 746–755. IEEE Comp. Soc. Press, 2013. [doi:10.1109/FOCS.2013.85] 4

[10] Mark Braverman and Jon Schneider: Information complexity is computable. In Proc. 43rd Internat. Colloq. on Automata, Languages and Programming (ICALP'16). DROPS, 2016. [doi:10.4230/LIPIcs.ICALP.2016.87, arXiv:1502.02971] 15

[11] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao: Informational complexity and the direct sum problem for simultaneous message complexity. In Proc. 42nd FOCS, pp. 270–278. IEEE Comp. Soc. Press, 2001. [doi:10.1109/SFCS.2001.959901] 4

[12] Arkadev Chattopadhyay and Toniann Pitassi: The story of set disjointness. SIGACT News, 41(3):59–85, 2010. [doi:10.1145/1855118.1855133] 4

[13] Yuval Dagan, Yuval Filmus, Hamed Hatami, and Yaqiao Li: Trading information complexity for error. In Proc. 32nd IEEE Conf. on Computational Complexity (CCC'17), pp. 16:1–16:59. DROPS, 2017. [doi:10.4230/LIPIcs.CCC.2017.16, arXiv:1611.06650] 1

[14] Tomás Feder, Eyal Kushilevitz, Moni Naor, and Noam Nisan: Amortized communication complexity. SIAM J. Comput., 24(4):736–750, 1995. Preliminary version in FOCS'91. [doi:10.1137/S0097539792235864] 4

[15] Yuval Filmus, Hamed Hatami, Yaqiao Li, and Suzin You: Information complexity of the AND function in the two-party, and multiparty settings. In Proc. 23rd Ann. Internat. Computing and Combinatorics Conf. (COCOON'17), pp. 200–211. Springer, 2017. [doi:10.1007/978-3-319-62389-4_17, arXiv:1703.07833] 18

[16] Anat Ganor, Gillat Kol, and Ran Raz: Exponential separation of information and communication for Boolean functions. J. ACM, 63(5):46:1–46:31, 2016. Preliminary version in STOC'15. [doi:10.1145/2907939] 4

[17] Prahladh Harsha, Rahul Jain, David A. McAllester, and Jaikumar Radhakrishnan: The communication complexity of correlation. IEEE Trans. Inform. Theory, 56(1):438–449, 2010. [doi:10.1109/TIT.2009.2034824] 4

[18] Rahul Jain: New strong direct product results in communication complexity. J. ACM, 62(3):20:1–20:27, 2015. Preliminary version in ECCC'11. [doi:10.1145/2699432] 4

[19] Rahul Jain, Attila Pereszlényi, and Penghui Yao: A direct product theorem for bounded-round public-coin randomized communication complexity. Algorithmica, 76(3):720–748, 2016. Preliminary version in FOCS'12. [doi:10.1007/s00453-015-0100-0, arXiv:1201.1666] 4
[20] Rahul Jain, Jaikumar Radhakrishnan, and Pranab Sen: A direct sum theorem in communication complexity via message compression. In Proc. 30th Internat. Colloq. on Automata, Languages and Programming (ICALP'03), pp. 300–315. Springer, 2003. [doi:10.1007/3-540-45061-0_26, arXiv:cs/0304020] 4

[21] Bala Kalyanasundaram and Georg Schnitger: The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992. Preliminary version in SCT'87. [doi:10.1137/0405044] 4

[22] Hartmut Klauck: A strong direct product theorem for disjointness. In Proc. 42nd STOC, pp. 77–86. ACM Press, 2010. [doi:10.1145/1806689.1806702, arXiv:0908.2940] 4

[23] Eyal Kushilevitz and Noam Nisan: Communication Complexity. Cambridge University Press, 1997. 12

[24] Nan Ma and Prakash Ishwar: Some results on distributed source coding for interactive function computation. IEEE Trans. Inform. Theory, 57(9):6180–6195, 2011. [doi:10.1109/TIT.2011.2161916] 19

[25] Nan Ma and Prakash Ishwar: The infinite-message limit of two-terminal interactive source coding. IEEE Trans. Inform. Theory, 59(7):4071–4094, 2013. Preliminary version in Allerton'09. [doi:10.1109/TIT.2013.2251412, arXiv:0908.3512] 19

[26] Marco Molinaro, David P. Woodruff, and Grigory Yaroslavtsev: Beating the direct sum theorem in communication complexity with implications for sketching. In Proc. 24th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA'13), pp. 1738–1756. ACM Press, 2013. [doi:10.1137/1.9781611973105.125] 9

[27] Alexander A. Razborov: On the distributional complexity of disjointness. Theoret. Comput. Sci., 106(2):385–390, 1992. Preliminary version in ICALP'90. [doi:10.1016/0304-3975(92)90260-M] 4

[28] Byron Schmuland: On the compacity of the space of probability measures. Mathematics Stack Exchange, January 18, 2014. 38

[29] Claude E. Shannon: A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948. [doi:10.1002/j.1538-7305.1948.tb01338.x] 4

[30] Alexander A. Sherstov: Communication complexity theory: Thirty-five years of set disjointness. In Proc. 39th Internat. Symp. on Mathematical Foundations of Computer Science (MFCS'14), pp. 24–43. Springer, 2014. [doi:10.1007/978-3-662-44522-8_3] 4

[31] Andrew Chi-Chih Yao: Some complexity questions related to distributive computing (preliminary report). In Proc. 11th STOC, pp. 209–213. ACM Press, 1979. [doi:10.1145/800135.804414] 4, 10
AUTHORS
Yuval Dagan
M. Sc. student
Technion — Israel Institute of Technology
Haifa, Israel
yuval.dagan@cs.technion.ac.il
Yuval Filmus
Assistant professor
Technion — Israel Institute of Technology
Haifa, Israel
yuvalfi@cs.technion.ac.il
http://www.cs.toronto.edu/~yuvalf/
Hamed Hatami
Associate professor
McGill University
Montreal, QC, Canada
hatami@cs.mcgill.ca
http://www.cs.mcgill.ca/~hatami/
Yaqiao Li
Ph. D. student
McGill University
Montreal, QC, Canada
yaqiao.li@mail.mcgill.ca
http://www.cs.mcgill.ca/~yli252/
ABOUT THE AUTHORS
Yuval Dagan studied at the Technion between 2011 and 2017; his advisor was Yuval Filmus. His thesis was in theoretical computer science, focusing on information complexity and learning theory. He lives in Tel Aviv, Israel.
Yuval Filmus graduated from the University of Toronto in 2013; his advisor was Toni Pitassi. His thesis focused on spectral aspects of extremal combinatorics, but he has also maintained a keen interest in computational complexity. In his free time, he enjoys helping people on Computer Science Stack Exchange.
Hamed Hatami graduated from the University of Toronto in 2009; his advisors were Michael Molloy and Balázs Szegedy. His work is focused on the applications of mathematical analysis to theoretical computer science and discrete mathematics.
Yaqiao Li is a Ph. D. student at McGill University, supervised by Hamed Hatami. His current research focuses on communication and information complexity. He grew up in Sichuan, where spicy food rules. He has a lot of fun playing badminton and fishing.