                          THEORY OF COMPUTING, Volume 14 (20), 2018, pp. 1–29
                                       www.theoryofcomputing.org




                     Simplified Separation of
                 Information and Communication
                                    Anup Rao∗                       Makrand Sinha†
             Received September 20, 2016; Revised January 2, 2018; Published December 27, 2018




       Abstract: We give an example of a Boolean function whose information complexity
       is exponentially smaller than its communication complexity. Such a separation was first
       demonstrated by Ganor, Kol and Raz (J. ACM 2016). We give a simpler proof of the same
       result. In the course of this simplification, we make several new contributions: we introduce
       a new communication lower-bound technique, the notion of a fooling distribution, which
       allows us to separate information and communication complexity, and we also give a more
       direct proof of the information complexity upper bound.
           We also prove a generalization of Shearer’s Lemma that may be of independent interest.
       A version of Shearer’s original lemma bounds the expected mutual information of a subset
       of random variables with another random variable, when the subset is chosen independently
       of all the random variables that are involved. Our generalization allows some dependence
       between the random subset and the random variables involved, and still gives us similar
       bounds with an appropriate error term.

ACM Classification: F.1.1, F.1.2, F.2.3
AMS Classification: 68Q05, 68Q10, 68Q17, 68Q85, 68Q87
Key words and phrases: communication complexity, information complexity, fooling distribution,
separation of complexity measures

    A preliminary version of this paper appeared in ECCC as Technical Report TR15-057.
   ∗ Supported by an Alfred P. Sloan Fellowship, the National Science Foundation under agreement CCF-1016565, an NSF
Career award, and by the Binational Science Foundation under agreement 2010089.
   † Supported by the National Science Foundation under agreement CCF-1016565, and by the Binational Science Foundation
under agreement 2010089.


© 2018 Anup Rao and Makrand Sinha
Licensed under a Creative Commons Attribution License (CC-BY)                         DOI: 10.4086/toc.2018.v014a020

1     Introduction
A fundamental question in the study of communication complexity is whether the information complexity
of a communication problem is the same as its communication complexity. If the messages in a protocol
reveal a small amount of information, does that mean that the protocol can be simulated using few bits
of communication? When the protocol is deterministic and one-way, a precise answer was given by
Shannon [26] who defined the notion of entropy, H(M), to measure the information content of a message
M, and showed that the number of bits of communication can always be made at most H(M) + 1 in
expectation, which is tight.
     For randomized and interactive (multi-round) protocols, this question was made explicit in a sequence
of papers. Chakrabarti, Shi, Wirth and Yao [8] defined what we now call the external information
cost of a protocol, which measures the information learned about the inputs by an external observer of
the messages. If M denotes the messages, R denotes the shared randomness and X,Y the inputs, the
external information cost is defined to be the mutual information I (XY : M | R). Barak, Braverman, Chen
and Rao [2] defined the internal information cost of a protocol as I (X : M | Y R) + I (Y : M | XR), the
information learned by the parties about the input to the other party. The internal information cost is
never greater than the external information cost.
     For bounded-round protocols, Harsha, Jain, McAllester and Radhakrishnan [16] (see also [4]) showed
how to give optimal simulations in terms of the external information cost, and Braverman and Rao [5]
gave near-optimal simulations in terms of the internal information cost (up to a 1 + o(1) factor). However,
many of the best known simulations for interactive protocols are not known to be optimal. We know how
to simulate any interactive protocol with external information cost I and communication C using a protocol
with communication O(I · log² C) [2]. We also know how to simulate a protocol with internal information
I and communication C using a protocol with communication O(√(I · C) · log C) [2]. Braverman [3] showed
how to simulate any protocol with internal information cost I using communication 2^{O(I)}. Natarajan
Ramamoorthy and Rao [21] gave better simulations when one party reveals less information than the other.
Very recently, Sherstov [27], building on the work of Kol [19], showed that when the input distribution is
a product distribution, protocols with internal information I can be simulated with communication O(I · log² I),
which is almost optimal.
     These results are closely tied to communication lower bounds. Information theory based methods
for proving lower bounds on the communication complexity of disjointness [17, 25, 1] can be seen as
precursors to some of them. They have been used to give answers to long-standing questions like the
direct sum [2] and direct product [6] questions in communication complexity.
     Braverman and Weinstein [7] (see also [18]) showed that any Boolean function f (x, y) that can
be computed with internal information cost I must have a (nearly) monochromatic rectangle (i. e., a
subset R = S × T of the inputs where the function is essentially constant) of density 2^{−O(I)}, and so
large discrepancy. This means that upper bounds on the size of monochromatic rectangles cannot be
used to prove lower bounds on the communication complexity of functions that have small information
complexity.1 So, for a long time, all known methods for proving lower bounds for communication failed
to prove lower bounds on functions that have large (0 and 1) monochromatic rectangles. This pointed to a
    1 The information-based methods for proving lower bounds on disjointness also prove that disjointness does not have large

1-monochromatic rectangles.



significant weakness in our ability to prove new communication lower bounds, since certainly, there are
functions that require large communication and still have large monochromatic rectangles. One can plant
large monochromatic rectangles into a random function to obtain such an example with high probability.
    In a remarkable sequence of papers, Ganor, Kol and Raz [13, 15] showed that there is a function with
internal information cost I with respect to a distribution that requires 2^{Ω(I)} communication under the same
distribution. This proved that Braverman’s simulation [3] is tight. Their proof gives a method to prove
communication lower bounds on functions that have many large monochromatic rectangles, potentially
leading to fundamentally different methods to prove lower bounds on communication problems.
    In this paper, we build on the work of [15] and give a new proof of their main result. Our proofs
are shorter and we find them more intuitive. For parameters k, n ∈ N, we define a Boolean function
called the k-ary pointer jumping function with inputs X, F (given to Alice) and Y, G (given to Bob) and a
distribution q(X, F,Y, G) on its inputs. We show that there is a protocol with small internal information
cost that computes the k-ary pointer jumping function on the distribution q(X, F,Y, G):
Theorem 1.1. There is a randomized protocol for the k-ary pointer jumping problem under the input
distribution q(X, F,Y, G) that has internal information cost O((log k + log log n) · 2^{(2 log n)/k}) and errs with
probability at most 4/log n.
    On the other hand, we show that no protocol with communication complexity significantly smaller
than min{k, log n} can compute the same function on the distribution q(X, F,Y, G).
Theorem 1.2. For large enough values of k and n, every protocol for the k-ary pointer jumping function
that has communication complexity at most ε³ · min{k, log n} errs with probability at least 1/2 − 8ε on
inputs drawn from the distribution q(X, F,Y, G).
    Note that the statement above applies to all protocols, whether they be deterministic or use public and
private randomness. Setting n = 2^k in Theorem 1.1 and Theorem 1.2, we obtain our main result: when
the inputs are sampled from the distribution q(X, F,Y, G), the internal information complexity of the k-ary
pointer jumping function is O(log k) but the communication complexity is Ω(k).
Corollary 1.3. There is a randomized protocol for the k-ary pointer jumping function under the input
distribution q(X, F,Y, G) that has internal information cost O(log k) and errs with probability at most
4/k. Moreover, for large enough k, every protocol for the k-ary pointer jumping function that has
communication complexity at most ε³k errs with probability at least 1/2 − 8ε on the input distribution
q(X, F,Y, G).
    One of the new ideas that simplify our proofs is the use of a new technique, the notion of a fooling
distribution, to prove communication lower bounds. This gives us another method to separate information
and communication apart from the relative discrepancy method introduced in [15]. We describe this
technique in Section 3.3 and compare it with the relative discrepancy method in Section 4.3.
    We remark that our function and distribution are a simpler variant of the bursting noise function defined
in [15], and as such we recover the same parameters in our main theorems as in the work of [15]. For
example, [15] proves that any protocol with communication at most 2^t computing the bursting noise
function with parameter t ∈ N must have error at least 1/2 − 2^{−t}. We recover the same parameters
from Corollary 1.3 by setting k = 512 · 2^{4t} and ε = 2^{−t−3}. An analogous statement also holds for the
information cost upper bound in Corollary 1.3. Moreover, the inputs to the k-ary pointer jumping function
are of length N where log log log N = Θ(k), so the communication and information complexity of this


function are really small in terms of the input length (a similar statement holds for the bursting noise
function).

1.1    Closely related work
Our work and the work of [15] prove that there is a Boolean function which has widely different
information complexity and communication complexity under a carefully designed input distribution.
One can also ask if information complexity and communication complexity are exponentially separated
for the worst-case input distribution (the non-distributional setting as defined in [3]). Fontes, Jain,
Kerenidis, Laplante, Laurière and Roland [12] showed that the relative discrepancy technique introduced
in [15] cannot be used to separate information and communication complexity for a Boolean function in
the non-distributional setting.
     After the dissemination of our paper, Ganor, Kol and Raz [14] gave an exponential separation of
external information and communication for a relation (search problem) that holds in the non-distributional
setting. The new proof uses a clever reduction to the randomized communication complexity of set
disjointness. However, as the new result of [14] proves a separation only for a search problem, our ideas
still give the most direct proof separating information complexity and communication complexity for
Boolean functions.
     In another closely related work, Natarajan Ramamoorthy and the second author [22] used one of the
technical lemmas (Lemma 4.1) clarified in this paper to give a simpler proof showing that the randomized
communication complexity of the Greater-Than function on n bits is Ω(log n). The same lower bound
was previously proved by Viola [28] and also by Braverman and Weinstein [7] using different techniques.

1.2    Organization
We start with the preliminaries in Section 2. Following this, in Section 3 we define the k-ary pointer
jumping function and state our results in a little more detail. We prove the communication lower bound in
Section 4. In Section 5, we bound the information complexity of the k-ary pointer jumping problem.


2     Preliminaries
2.1    Probability spaces and variables
Throughout this paper, log (and ln) denotes the logarithm taken in base two (and base e). We use [k] to
denote the set {1, 2, . . . , k} and [k]^{<n} to denote the set of all strings of length less than n over the alphabet
[k], including the empty string. The notation |z| denotes the length of the string z.
     Random variables are denoted by capital letters (e. g., A) and values they attain are denoted by
lower-case letters (e. g., a). We say a random variable A determines another random variable B if given
the value of A, the value of B is fixed. Events in a probability space will be denoted by calligraphic letters
(e. g., E). Given a = (a1 , a2 , . . . , an ), we write a≤i to denote a1 , . . . , ai . We define a<i similarly. We write
aS to denote the projection of a to the coordinates specified in the set S ⊆ [n].
     Given a probability space p and a random variable A in the underlying sample space, we use the
notation p(A) to denote the probability distribution of the variable A in the probability space p. We


will often consider multiple probability spaces with the same underlying sample space, so for example
p(A) and q(A) will denote the distribution of the random variable A under the probability spaces p and q
respectively with the underlying sample space of p and q being the same. We write p(A | b) to denote
the distribution of A conditioned on the event B = b. We write p(a) to denote the number Pr_p[A = a] and
p(a | b) to denote the number Pr_p[A = a | B = b]. Given a distribution p(A, B,C, D), we write p(A, B,C)
to denote the marginal distribution on the variables A, B,C. We often write p(AB) instead of p(A, B)
for conciseness of notation. Similarly, p(a, b, c) will denote the probability according to the marginal
distribution p(A, B,C) and we will often write it as p(abc) for conciseness.
    If W is an event, we write p(W) to denote its probability according to p. Given a probability space p
and a random variable A, when we write A ∈ W for an event W we only consider events in the space of
values taken by the variable A.
    Given a fixed value c, we denote by E_{p(b|c)}[g(a, b, c)] := ∑_b p(b | c) · g(a, b, c) the expected value of
the function g(a, b, c) under the distribution p(B | c). If the probability space p is clear from the context,
then we will just write E_{b|c}[g(a, b, c)] to denote the expectation. For a Boolean function h(a, b) and a
probability distribution p(A, B), denoting by 1[h(a, b) = 0] the indicator function for the event h(a, b) = 0,
we write p(h = 0) := E_{p(ab)}[1[h(a, b) = 0]] as the probability that h is 0 under inputs drawn from p.
     We write A − M − B as a shorthand to say that the random variables A, M and B form a Markov
chain, or in other words, A and B are independent given M: p(amb) = p(m) · p(a | m) · p(b | m) for every
a, b, m.
     To get familiar with the notation, consider the following example. Let A ∈ {0, 1}2 be a uniformly
distributed random variable in a probability space p. Then, p(A) is the uniform distribution on {0, 1}2 and
if a = (0, 0), p(a) = 1/4. Let A1 and A2 denote the first and second bits of A. If B = A1 + A2 mod 2,
then when b = 1, p(A | b) is the uniform distribution on {(0, 1), (1, 0)}. If a = (1, 0) and b = 1, then
p(a | b) = 1/2, and p(a, b) = 1/4. If E is the event that A1 = B, then p(E) = 1/2. Let q(A) = p(A | E),
then q(A) is the uniform distribution on {(0, 0), (1, 0)} and q(A2 ) is the distribution over the sample space
{0, 1} which takes the value 0 with probability 1.
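The arithmetic in this example can be verified mechanically. The following Python sketch (ours, purely illustrative and not part of the paper) checks each claim using exact rational arithmetic:

```python
from fractions import Fraction

# The worked example above: A uniform on {0,1}^2, B = A1 + A2 mod 2,
# and E the event that A1 = B.
p = {(a1, a2): Fraction(1, 4) for a1 in (0, 1) for a2 in (0, 1)}

def B(a):  # a = (a1, a2)
    return (a[0] + a[1]) % 2

# p(A | b = 1) is uniform on {(0, 1), (1, 0)}
pb1 = {a: pr for a, pr in p.items() if B(a) == 1}
total = sum(pb1.values())
assert {a: pr / total for a, pr in pb1.items()} == \
       {(0, 1): Fraction(1, 2), (1, 0): Fraction(1, 2)}

# p(E) = 1/2, where E is the event A1 = B
assert sum(pr for a, pr in p.items() if a[0] == B(a)) == Fraction(1, 2)

# q(A) = p(A | E) is uniform on {(0, 0), (1, 0)}, and q(A2) puts all mass on 0
q = {a: pr * 2 for a, pr in p.items() if a[0] == B(a)}
assert q == {(0, 0): Fraction(1, 2), (1, 0): Fraction(1, 2)}
assert sum(pr for a, pr in q.items() if a[1] == 0) == 1
```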



2.2     Statistical distance

For two distributions p(A), q(A), the statistical distance |p(A) − q(A)| between them is defined to be
|p(A) − q(A)| = max_Q (p(A ∈ Q) − q(A ∈ Q)), where Q ranges over events.

Proposition 2.1. |p(A) − q(A)| = (1/2) ∑_a |p(a) − q(a)| = ∑_{a : p(a) > q(a)} (p(a) − q(a)).

      We say p(A) and q(A) are ε-close if |p(A) − q(A)| ≤ ε and we write it as p(A) ≈_ε q(A).
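The two expressions in Proposition 2.1 are easy to compare numerically. The following Python sketch (ours, purely illustrative; the distributions are arbitrary) computes the statistical distance both ways:

```python
# Statistical distance between two distributions on a finite sample space,
# computed both ways from Proposition 2.1.

def tv_half_l1(p, q):
    # (1/2) * sum_a |p(a) - q(a)|
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)

def tv_max_event(p, q):
    # sum over {a : p(a) > q(a)} of (p(a) - q(a)); this set is exactly
    # the maximizing event Q in the definition of |p(A) - q(A)|
    return sum(p[a] - q[a] for a in p if p[a] > q[a])

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
assert abs(tv_half_l1(p, q) - tv_max_event(p, q)) < 1e-12
print(tv_half_l1(p, q))  # 0.3
```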
Proposition 2.2. If p(AB), q(AB) are such that p(A) = q(A), then

                                |p(AB) − q(AB)| = E_{p(a)} [ |p(B | a) − q(B | a)| ] .




2.3     Divergence and mutual information
The divergence between distributions p(A) and q(A) is denoted and defined by

                                p(A)/q(A) = ∑_a p(a) log ( p(a)/q(a) ) .

      For three random variables A, B, C and an event E in a probability space p, we will use the shorthand

                                (A | bcE)/(A | c) = p(A | bcE)/p(A | c) ,

when p is clear from context. The mutual information between A and B conditioned on C is defined as

      I (A : B | C) = E_{c,b} [ (A | bc)/(A | c) ] = E_{c,a} [ (B | ac)/(B | c) ] = ∑_{a,b,c} p(abc) log ( p(a | bc)/p(a | c) ) .

   We shall often work with multiple probability spaces over the same sample space. To avoid confusion,
we shall explicitly write I p (A : B | C) to specify the probability space p being used for computing the
mutual information.
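To make the definition concrete, here is a small Python sanity check (ours, not part of the paper). We take A and C to be independent uniform bits and B = A + C mod 2; given C, the bit B then determines A exactly, so I (A : B | C) should equal 1:

```python
from collections import defaultdict
from math import log2

# Joint distribution p(a, b, c) with A, C independent uniform bits and
# B = A + C mod 2.
p = {(a, (a + c) % 2, c): 0.25 for a in (0, 1) for c in (0, 1)}

def marginal(p, keep):
    # Marginal distribution on the coordinates named in `keep` (subset of "abc").
    m = defaultdict(float)
    for (a, b, c), pr in p.items():
        m[tuple(v for v, name in zip((a, b, c), "abc") if name in keep)] += pr
    return m

def cond_mutual_info(p):
    # I(A : B | C) = sum_{a,b,c} p(abc) * log( p(a | bc) / p(a | c) )
    pbc, pac, pc = marginal(p, "bc"), marginal(p, "ac"), marginal(p, "c")
    return sum(pr * log2((pr / pbc[(b, c)]) / (pac[(a, c)] / pc[(c,)]))
               for (a, b, c), pr in p.items())

print(cond_mutual_info(p))  # 1.0
```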

2.4     Basic divergence facts
The proofs of the following basic facts can be found in the book by Cover and Thomas [10]. In the
following, p and q are probability spaces (over the same sample space), and A, B and C are random
variables on the underlying sample space.
Proposition 2.3. If A ∈ {0, 1}^ℓ, then I (A : B) ≤ ℓ.
Proposition 2.4. If B determines C, then I (A : C) ≤ I (A : B).
                                                                               "                  #
                                                              p(A)   s             p(Ai | a<i )
Proposition 2.5 (Chain Rule). If A = (A1 , . . . , As ), then      =∑ E                               .
                                                              q(A)  i=1 p(a)       q(Ai | a<i )
                                                            ln 2 p(A)    p(A)
Proposition 2.6 (Pinsker’s Inequality). |p(A) − q(A)|2 ≤        ·      ≤      .
                                                             2    q(A)   q(A)
                     p(A)
Proposition 2.7.          ≥ 0.
                     q(A)

2.5     Divergence inequalities
The following propositions bound the change in divergence when extra conditioning is involved. Some of
these were proved in [6, 15]. For completeness, we include full proofs for all of them. Below, p and q are
probability spaces, and A, M and B are random variables on the underlying sample space.
Proposition 2.8. If A and B are independent and A − M − B, then I (A : M) = I (A : M | B).


Proof. Since A and B are independent, p(A | b) = p(A), and since A − M − B, p(A | mb) = p(A | m). So,
we have

                      I (A : M | B) = E_{bm} [ (A | mb)/(A | b) ] = E_{bm} [ (A | m)/A ] = I (A : M) .

Proposition 2.9 ([6]). For an event W and variables A, B in a probability space p, we have

                              E_{b|W} [ (A | bW)/A ] ≤ log (1/p(W)) + I (A : B | W) .

Proof. We can write

       E_{b|W} [ (A | bW)/A ] − I (A : B | W) = ∑_{ab} p(ab | W) log [ p(a | bW) · p(a | W) / (p(a) · p(a | bW)) ]
                                              = ∑_a p(a | W) log ( p(a | W)/p(a) )
                                              = ∑_a p(a | W) log ( p(W | a)/p(W) ) ≤ log (1/p(W)) .
                                      "               #
                                           p(A | b)           p(A)
Proposition 2.10 ([6]). E p(b)                            ≥        .
                                            q(A)              q(A)

Proof.

       E_{p(b)} [ p(A | b)/q(A) ] − p(A)/q(A) = ∑_{a,b} p(ab) log [ p(a | b) · q(a) / (q(a) · p(a)) ] = E_{p(b)} [ p(A | b)/p(A) ] ≥ 0 ,

where the last inequality follows from Proposition 2.7.
                                "           #          "          #
                                   p(A | b)              p(A | b)
Proposition 2.11 ([15]). E p(b)               ≤ E p(b)              .
                                    p(A)                  q(A)

Proof.

       E_{p(b)} [ p(A | b)/q(A) ] − E_{p(b)} [ p(A | b)/p(A) ] = ∑_{a,b} p(ab) log [ p(a | b) · p(a) / (q(a) · p(a | b)) ] = p(A)/q(A) ≥ 0 ,

where the last inequality follows from Proposition 2.7.

Proposition 2.12. Let A ∈ {0, 1} and let γ = p(a) and ε = q(a) for a = 1. Then,

                                   p(A)/q(A) ≥ γ log ( γ/(eε) ) .



Proof. We have

   p(A)/q(A) = γ log (γ/ε) + (1 − γ) log ((1 − γ)/(1 − ε))
             ≥ γ log (γ/ε) + (1 − γ) log (1 − γ)
             ≥ γ log (γ/ε) − γ log e = γ log ( γ/(eε) ) ,

where in the last inequality we used the fact that

            −(1 − γ) ln (1 − γ) = (1 − γ) ln ( 1 + γ/(1 − γ) ) ≤ (1 − γ) · γ/(1 − γ) = γ .

2.6     Communication complexity and information cost
The communication complexity of a protocol is the maximum number of bits that may be exchanged by
the protocol. Under an input distribution p(X,Y ), the internal information cost of a randomized protocol
(with both shared and private randomness) is defined to be I_p (X : M | Y R) + I_p (Y : M | XR), where X and
Y are inputs to the protocol, M denotes the messages of the protocol and R denotes the shared randomness
of the protocol.
    We briefly describe basic properties of communication protocols that we need. For more details see
the book by Kushilevitz and Nisan [20]. For a deterministic protocol π, let π(x, y) denote the messages
of the protocol on inputs x, y. For any transcript m of the protocol, define the following events:

            Sm = {x | ∃y such that π(x, y) = m} ,           Tm = {y | ∃x such that π(x, y) = m} .

We then have:
Proposition 2.13 (Messages correspond to rectangles). If m is a transcript and x, y are inputs to a
deterministic protocol π, then,

                                    π(x, y) = m ⇐⇒ x ∈ Sm ∧ y ∈ Tm .

      Proposition 2.13 implies:
Proposition 2.14 (Markov property of protocols). Let X and Y be random inputs to a deterministic
protocol and let M denote the messages of this protocol. If X and Y are independent then X − M −Y .
    Note that the above implies that if x and y are independent inputs sampled from a distribution p(X,Y )
and m is a transcript of a deterministic communication protocol, then

                     p(xy | m) = p(xy | x ∈ Sm , y ∈ Tm ) = p(x | x ∈ Sm )p(y | y ∈ Tm ) .

Lemma 2.15 (Errors and statistical distance). Let h(x, y) be a Boolean function and p(X,Y ) be a
distribution such that p(h = 0) = p(h = 1) = 1/2. If π is a deterministic protocol with messages M that
computes h with error δ on the distribution p, then |p(M | h = 0) − p(M | h = 1)| ≥ 1 − 2δ .

Proof. Since |p(M | h = 0) − p(M | h = 1)| = maxQ (p(M ∈ Q | h = 0) − p(M ∈ Q | h = 1)) it suffices to
exhibit an event Q such that p(M ∈ Q | h = 0) − p(M ∈ Q | h = 1) = 1 − 2δ . Let M0 denote the event that


the protocol outputs a zero. Then, since p(h = 0) = p(h = 1) = 1/2, writing the probability of success in
terms of M0 , we have
   1 − δ = p(M ∈ M0 | h = 0)/2 + (1 − p(M ∈ M0 | h = 1))/2 = 1/2 + (p(M ∈ M0 | h = 0) − p(M ∈ M0 | h = 1))/2 .
On rearranging, the above gives us that p(M ∈ M0 | h = 0) − p(M ∈ M0 | h = 1) = 1 − 2δ and hence the
statistical distance must be at least 1 − 2δ .


3     The k-ary pointer jumping function
For a parameter k ∈ N, k ≥ 2, we work with the alphabet [k] = {1, 2, . . . , k}. Let X,Y : [k]^{<n} → [k]
be functions mapping strings of length less than n to a single character. Let F, G : [k]^n → {0, 1} be
Boolean functions. For z ∈ [k]^n , let z≤r denote the prefix of z of length r. In the k-ary pointer jumping
problem, the first party is given X, F, and the second is given Y, G. The goal of the parties is to compute
F(z) + G(z) mod 2, where z ∈ [k]^n is the unique string satisfying the n equations

                                       X(z≤r ) + Y (z≤r ) = z_{r+1} mod k ,

for every r ∈ {0, 1, . . . , n − 1}.
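To make the definition concrete, the unique z can be found by simply following the pointers, as in the sketch below (our own illustration; the toy instance and the 0-indexed alphabet {0, . . . , k − 1} are ours, and do not change the function):

```python
from itertools import product
import random

k, n = 3, 2  # a toy instance

def pointer_jumping(X, Y, F, G):
    # The empty prefix determines z_1 = X(()) + Y(()) mod k, which in turn
    # determines z_2, and so on; after n steps z is fully determined.
    z = ()
    for _ in range(n):
        z = z + ((X[z] + Y[z]) % k,)
    return (F[z] + G[z]) % 2

prefixes = [t for r in range(n) for t in product(range(k), repeat=r)]
X = {z: random.randrange(k) for z in prefixes}
Y = {z: random.randrange(k) for z in prefixes}
F = {z: random.randrange(2) for z in product(range(k), repeat=n)}
G = {z: random.randrange(2) for z in product(range(k), repeat=n)}

print(pointer_jumping(X, Y, F, G))  # the answer bit, 0 or 1
```

Following the pointers one step at a time is exactly what the trivial protocol below does, at the cost of O(n log k) communication.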

3.1    Input and fooling distributions
For z ∈ [k]<n , J ∈ {0, 1, . . . , n − 1} and X,Y as above, we say z is consistent with X,Y, J, if |z| ≥ J + 1, and

                                       X(z≤J ) +Y (z≤J ) = zJ+1 mod k .

The distribution on inputs is described in Figure 1.

 Fooling Distribution p(X, F,Y, G): Index J ∈ {0, . . . , n − 1} is sampled uniformly at random. Func-
       tions X,Y : [k]^{<n} → [k] are sampled uniformly at random subject to the constraint that for any
       z ∈ [k]^{<J} , X(z) = Y (z). Functions F, G : [k]^n → {0, 1} are uniformly random.

 Input Distribution q(X, F,Y, G): Let E0 denote the event that for all consistent z, X(z) = Y (z) and
      F(z) = G(z) (when |z| = n). Let E1 denote the event that for all consistent z, X(z) = Y (z)
      and F(z) ≠ G(z) (when |z| = n). In the distribution q0 (X, F,Y, G), J is sampled uniformly
      from {0, 1, . . . , n − 1}, and the rest of the variables are sampled according to the distribution
      p(X, F,Y, G | E0 , j). In the distribution q1 (X, F,Y, G), J is sampled uniformly, and the rest
      of the variables are sampled according to the distribution p(X, F,Y, G | E1 , j). The input is
      sampled by sampling from q0 with probability 1/2 and from q1 with probability 1/2.


                        Figure 1: Distributions for the k-ary pointer jumping problem.

    The distribution on inputs, described in Figure 1, ensures that F(z) + G(z) mod 2 is the same for
every consistent z, so it is enough for the parties to find a consistent z to complete the goal.


Comparison with the bursting noise function. We remark that although our formulation allows us
to define the k-ary pointer jumping function and the input distribution much more compactly than the
bursting noise function defined in [15], our function is just a simpler variant of their construction. The
main difference is that our function can be thought of as being computed over a k-ary tree as opposed
to a binary tree (as in the work of [15]), and it is symmetric with respect to both parties since we are taking
the sum modulo k of the values computed by both parties (as opposed to each party going down the tree
alternately as in the work of [15]). Our distribution is also very similar to the distribution used by [15].

3.2   Simple protocols
To get a better understanding of the above function, let us present a few simple protocols for it.

Trivial protocol. There is a trivial protocol for this problem that has worst-case communication
O(n log k): in each step Alice and Bob send each other the values X(z≤r ),Y (z≤r ), until they have
computed z. The parties can compute F(z) + G(z) mod 2 with two more bits of communication. Note
that this protocol succeeds with zero error under any input distribution.

Binary search protocol. Under the input distribution q(X, F,Y, G), for any z ∈ [k]^n , we have that
X(z<J ) = Y (z<J ) with probability 1 while X(z≤J ) and Y (z≤J ) are different with probability 1 − 1/k. The
players can use a version of binary search [11] to find the first difference and therefore, with probability at
least 1 − (ε + 1/k), they can compute the index J with O (log ((n log k)/ε)) bits of communication. With
an additional 2 log k bits of communication the players can then find a consistent z (satisfying X(z≤J ) +
Y (z≤J ) = zJ+1 mod k) which suffices to compute the value of the function on the input distribution
q(X, F,Y, G). This protocol has communication O (log ((n log k)/ε) + log k) and the error probability is
at most ε + 1/k on the input distribution q(X, F,Y, G).
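The binary search step can be sketched as follows (our own illustration, not the paper's exact protocol; the fingerprint length and the sha256 stand-in for a shared-randomness hash are our choices):

```python
import hashlib
from itertools import product

def fingerprint(table, r, k, nbytes=4):
    # A short hash of the restriction of `table` to strings of length r.
    # Exchanging such fingerprints, rather than whole restrictions, models
    # the O(log((n log k)/eps)) bits of communication.
    vals = tuple(table[z] for z in product(range(k), repeat=r))
    return hashlib.sha256(repr(vals).encode()).digest()[:nbytes]

def first_disagreement(X, Y, n, k):
    # Binary search for the smallest depth r with X|_{[k]^r} != Y|_{[k]^r},
    # assuming (as holds with high probability under q) that the functions
    # agree on every depth below J and disagree on every depth J and above.
    lo, hi, ans = 0, n - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if fingerprint(X, mid, k) == fingerprint(Y, mid, k):
            lo = mid + 1
        else:
            ans, hi = mid, mid - 1
    return ans

# Toy instance with k = 2, n = 4 and hidden index J = 2.
prefixes = [z for r in range(4) for z in product(range(2), repeat=r)]
X = {z: 0 for z in prefixes}
Y = {z: (0 if len(z) < 2 else 1) for z in prefixes}
print(first_disagreement(X, Y, n=4, k=2))  # 2
```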

Sampling protocol. Let t = Θ(k^2 log(1/ε)). Using shared randomness, the players draw a subset S by
choosing t strings uniformly at random from [k]^n, exchange the values of F(z) and G(z) for each z ∈ S, and
output the majority of the bits {F(z) + G(z) mod 2}_{z∈S}. Under the input distribution q(X, F,Y, G), the
probability that a random string is consistent (that it satisfies X(z≤J) + Y(z≤J) = zJ+1 mod k) is 1/k, so
using standard concentration bounds, this protocol has error at most ε under the distribution q(X, F,Y, G).
Moreover, the communication is O(k^2 log(1/ε)).
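A sketch of the sampling protocol (the constant 9 inside t and the dictionary encoding of F, G are our own illustrative choices):

```python
import math
import random
from itertools import product

def sampling_protocol(F, G, n, k, eps, rng):
    """Exchange F(z), G(z) on t = Theta(k^2 log(1/eps)) shared random
    strings from [k]^n and output the majority of F(z) + G(z) mod 2."""
    t = math.ceil(9 * k * k * math.log(1 / eps))
    votes = 0
    for _ in range(t):
        z = tuple(rng.randrange(k) for _ in range(n))  # shared randomness
        votes += (F[z] + G[z]) % 2
    return 1 if 2 * votes > t else 0

# degenerate instance where F(z) + G(z) mod 2 = 1 everywhere,
# so the majority vote must output 1
n, k = 2, 2
F = {z: 0 for z in product(range(k), repeat=n)}
G = {z: 1 for z in product(range(k), repeat=n)}
out = sampling_protocol(F, G, n, k, eps=0.1, rng=random.Random(0))
```

The communication is two bits per sampled string, i.e., O(k^2 log(1/ε)) in total.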

3.3   Information and Communication Complexity
We prove that there is a low-information solution for this task, with internal information cost
O((log k + log log n) · 2^{(2 log n)/k}) on the distribution q(X, F,Y, G) (Theorem 1.1), but any randomized
protocol that errs with a constant probability on the input distribution q(X, F,Y, G) requires communica-
tion at least Ω(min{k, log n}) (Theorem 1.2). The lower bound is tight up to polynomial factors, as the
binary search and sampling protocols described previously show that the communication complexity is
O(min{k^2, log n + log k}). Setting n = 2^k, we get a correct protocol with information cost O(log k),
even though any protocol must communicate Ω(k) bits to succeed.


Low-information protocol. The low-information protocol for the problem is quite similar to the
trivial protocol, so let us first discuss the information cost of the trivial protocol under the distribution
q(X, F,Y, G). At the beginning of the protocol, with just their own inputs in hand, the parties do not have
any information about J. Using the trivial protocol, both parties learn the value of J with high probability.
This happens because J is close to the first point at which their inputs disagree. Since the entropy of
J is Θ(log n) bits and J is determined with high probability given X and Y , it is not too hard to argue
that the parties learn Ω(log n) bits of information about each other’s inputs. It follows that the internal
information cost of the trivial protocol is Ω(log n) bits, much larger than what we are aiming for.
     The low-information protocol adds some noise to hide the value of J. In each step of the low-
information protocol, the parties send each other the value X(z≤r ),Y (z≤r ) with probability 1 − (1/ log n)
and send a uniformly random value otherwise. The parties abort the protocol if they experience
log n/ log log n rounds where the messages they sent were not the same. The distribution on inputs
ensures that they will sample a consistent z with high probability. When the parties sample a consistent z,
the messages sent are almost always sampled from a distribution that the receiving party knows, while if
they sample a z that is not consistent, the protocol aborts shortly after the inconsistency. These properties
can be used to show that under the distribution q(X, F,Y, G), the information cost of the protocol is
O((log k + log log n) · 2^{(2 log n)/k}) and the error probability is at most 4/log n.
     Intuitively, this protocol does not reveal a lot of information about J because at the end of the protocol
the parties see a lot of disagreements, and so they only learn that J belongs to some set of density 1/log n.
Heuristically, the entropy of J conditioned on the entire exchange is ≈ log (n/ log n) = log n − log log n
and the amount of information revealed about J is O(log log n).
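The noisy walk can be sketched in Python (our own toy instantiation: X, Y, F, G are deterministic pseudorandom functions, we only exercise the regime where the parties' inputs agree, as they do below the index J, and all parameters are illustrative):

```python
import math
import random

def val(tag, z, mod):
    """Deterministic stand-in for a random function on prefixes
    (our own polynomial hash; illustrative only)."""
    h = tag
    for s in z:
        h = (h * 1000003 + s + 1) % (2**61 - 1)
    return h % mod

def noisy_protocol(X, Y, F, G, n, k, rng):
    """Sketch of the low-information protocol: each party sends its true
    value with probability 1 - 1/log n and a uniform value otherwise, and
    the parties abort after more than log n / log log n disagreeing rounds."""
    noise = 1 / math.log(n)                       # assumes log n > 1
    budget = math.log(n) / math.log(math.log(n))
    z, disagreements = (), 0
    for _ in range(n):
        a = X(z) if rng.random() > noise else rng.randrange(k)
        b = Y(z) if rng.random() > noise else rng.randrange(k)
        if a != b:
            disagreements += 1
            if disagreements > budget:
                return None                       # abort
        z = z + ((a + b) % k,)
    return (F(z) + G(z)) % 2

n, k = 64, 4
X = Y = (lambda z: val(1, z, k))   # agreeing inputs, as below the index J
F = G = (lambda z: val(2, z, 2))
out = noisy_protocol(X, Y, F, G, n, k, random.Random(7))
```

The abort budget is what hides J: a transcript with many disagreements only localizes J to a set of density about 1/log n.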
     In Section 5, we analyze the aforementioned low-information protocol and prove Theorem 1.1. Before
moving on, we remark that when n = 2^k, using this low information-cost protocol with the simulation
result of Braverman [3], one can obtain a deterministic protocol with communication complexity k^{c/ε^2}
for a constant c > 2, that computes the k-ary pointer jumping function with error probability O(ε + 1/k)
on the distribution q(X, F,Y, G).


Communication lower bound. Consider any protocol with ℓ bits of communication that solves the k-
ary pointer jumping problem. Without loss of generality, we may assume that the protocol is deterministic,
since any randomness can always be fixed to obtain a deterministic protocol that succeeds with high
probability. Let M denote the messages of the protocol. Our input distribution q(X, F,Y, G) has the
property that under the inputs drawn from this distribution the k-ary pointer jumping function h is balanced
(it takes values 1 and 0 with probability half each). Hence, if the protocol solved the k-ary pointer jumping
problem with error δ on the distribution q(X, F,Y, G), then the statistical distance between q0 (M) (the
induced distribution on M when inputs are drawn from q(X, F,Y, G) conditioned on the value of k-ary
pointer jumping function being 0) and q1 (M) (the induced distribution on M when inputs are drawn from
q(X, F,Y, G) conditioned on the value of k-ary pointer jumping function being 1) would be at least 1 − 2δ
(see Lemma 2.15), since these distributions have nearly disjoint supports.
      To prove the communication lower bound, we define the fooling distribution p(X, F,Y, G) on inputs,
and using information-theoretic techniques show that if the communication complexity ℓ of the protocol
is much less than min{k, log n}, then both of the distributions q0(M) and q1(M) are close to the fooling
distribution p(M), which implies that the statistical distance between q0(M) and q1(M) is close to 0. This


will give us a contradiction, since we argued in the paragraph above that these distributions must be far
apart.
    It will be convenient to state our results in terms of the function η : [0, ∞) → [0, 1] defined as

               ⎧ 0              if α = 0,
       η(α) =  ⎨ α log(1/α)     if α ∈ (0, 1/e),                                   (3.1)
               ⎩ log(e)/e       if α ≥ 1/e.
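A quick numerical check of the definition (3.1), taking log to be base 2 (consistent with writing log(e)/e rather than 1/e):

```python
import math

def eta(alpha):
    """The function eta from (3.1); logs are base 2, so that
    alpha*log(1/alpha) meets log(e)/e continuously at alpha = 1/e."""
    assert alpha >= 0
    if alpha == 0:
        return 0.0
    if alpha < 1 / math.e:
        return alpha * math.log2(1 / alpha)
    return math.log2(math.e) / math.e

# spot-check: non-decreasing on [0, 1] and continuous at the breakpoint
xs = [i / 1000 for i in range(1001)]
vals = [eta(x) for x in xs]
assert all(u <= v + 1e-12 for u, v in zip(vals, vals[1:]))
assert abs(eta(1 / math.e - 1e-9) - eta(1 / math.e)) < 1e-6
```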

One can check that η is non-decreasing, continuous, and concave. We prove the following upper bounds
on statistical distance between the induced message distributions.
Theorem 3.1. For any deterministic protocol for the k-ary pointer jumping function with communication
complexity at most ℓ, we have that

    q0(M) ≈^γ p(M) ≈^γ q1(M),   with γ = 4 · ( 2eℓ/k + 2ℓ·√(2^ℓ/n) + η(√(2^ℓ/n)) )^{1/3}.

    We stress that the fooling distribution p(X, F,Y, G) is only used in the analysis. The inputs to the
protocol come from the distribution q(X, F,Y, G).
    Theorem 3.1 implies that ℓ = Ω(min{k, log n}) for any protocol that solves the k-ary pointer jumping
problem for the input distribution q(X, F,Y, G). This gives us Theorem 1.2.

Proof of Theorem 1.2. Consider any protocol for the k-ary pointer jumping function that has communica-
tion complexity ℓ := ε^3 · min{k, log n} and errs with probability δ on the input distribution q(X, F,Y, G).
We may assume without loss of generality that the protocol is deterministic, since if the protocol used
public or private randomness, we can fix the randomness to get a deterministic protocol with the same
communication complexity and the same error on the input distribution q(X, F,Y, G). Since the protocol
has error at most δ, the statistical distance between q0(M) and q1(M) must be at least 1 − 2δ (see
Lemma 2.15). On the other hand, Theorem 3.1 implies that

                      |q0 (M) − q1 (M)| ≤ |q0 (M) − p(M)| + |p(M) − q1 (M)| ≤ 2γ ,

where

    γ = 4 · ( 2eℓ/k + 2ℓ·√(2^ℓ/n) + η(√(2^ℓ/n)) )^{1/3} < 8ε

for large enough values of k and n. Then, it must be that 1 − 2δ ≤ 2γ < 16ε and hence, the error
δ > 1/2 − 8ε.
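The choice ℓ = ε^3 · min{k, log n} can be checked numerically against the expression for γ (the parameters below are illustrative; n is passed as log₂ n to avoid huge integers):

```python
import math

def eta(alpha):
    """The function from (3.1), with base-2 logarithms."""
    if alpha == 0:
        return 0.0
    if alpha < 1 / math.e:
        return alpha * math.log2(1 / alpha)
    return math.log2(math.e) / math.e

def gamma(ell, k, log2n):
    """gamma from Theorem 3.1; n is given as log2(n)."""
    r = 2.0 ** ((ell - log2n) / 2)   # sqrt(2^ell / n), underflows to 0 here
    return 4 * (2 * math.e * ell / k + 2 * ell * r + eta(r)) ** (1 / 3)

# take n = 2^k and ell = eps^3 * min(k, log2 n); then gamma < 8*eps
eps, k = 0.1, 10**6
ell = eps**3 * min(k, k)             # log2 n = k
g = gamma(ell, k, k)
```

For these parameters the 2eℓ/k term dominates, and g ≈ 4·(2e·ε^3)^{1/3} < 8ε.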


4     Communication lower bound
4.1     High-level proof sketch for Theorem 3.1
Consider an ℓ-bit deterministic protocol. Our goal is to show that if ℓ ≪ min{k, log n}, then the induced
distribution of the messages M of the protocol is roughly the same under the fooling distribution


p(X, F,Y, G) and the distribution q0 (X, F,Y, G) (input distribution q(X, F,Y, G) conditioned on the value
of k-ary pointer jumping function being 0) and that a similar statement is true for p(X, F,Y, G) and
q1 (X, F,Y, G). As described in the previous section, this proves that communication complexity must be
Ω(min{k, log n}).
     How are the distributions p(X, F,Y, G) and q0 (X, F,Y, G) related? Let S denote the set of consistent
strings z ∈ [k]≤n and XSYS FS GS denote the projection of the random variables XY FG on the strings in the
set S and let X≤J ,Y≤J denote the restriction of X,Y to inputs of length at most J. Let XJ ,YJ denote the
restriction of X,Y to inputs of length J.
     First note that the set of consistent strings S is determined by XJ YJ J, since given XJ YJ J, one can
check whether any string z ∈ [k]n is consistent or not (whether it satisfies X(z≤J ) +Y (z≤J ) = zJ+1 mod k).
Since the distributions p(X, F,Y, G) and q0(X, F,Y, G) are the same on X≤J Y≤J J, it follows that the
distributions p(S) and q0(S) are also the same. Fixing the values X≤J Y≤J J, the distribution q0(X, F,Y, G)
is obtained from p(X, F,Y, G) by conditioning on the event XS FS = YS GS. Similarly, after fixing X≤J Y≤J J,
the distribution q1(X, F,Y, G) is obtained from p(X, F,Y, G) by conditioning on the event XS FS = YS ḠS,
where Ḡ denotes the function 1 − G.
     So, after fixing X≤J Y≤J J, to prove that p(M) ≈ q0(M), we need to show that the distribution p(M)
does not change by much, even if we condition on the event XS FS = YS GS (and similarly on XS FS = YS ḠS for q1(M)).
We prove this in three steps:
    • First, we argue using a subtle application of the chain rule, that if ℓ ≪ min{k, log n}, the protocol
      does not reveal much information about S under the fooling distribution p(X, F,Y, G).
    • Next, we show that since the players do not learn much information about S, they also do not learn
      much information about XS FS and YS GS in the fooling distribution p(X, F,Y, G). To argue this, we
      prove a generalization of Shearer’s Lemma [9, 23].
    • Lastly, using the above facts we show that the distribution p(M) stays roughly the same even after
      conditioning on the event XS FS = YS GS.
    Note that it is only in the last step that the input distributions q0(X, F,Y, G) and q1(X, F,Y, G) are
related to the fooling distribution p(X, F,Y, G). The first two steps analyze the behavior of the protocol
only on the fooling distribution p(X, F,Y, G). Next we elaborate on each of the steps.

Information revealed about consistent strings
As discussed before, the set S of consistent strings is determined by XJ YJ J in the fooling distribution
p(X, F,Y, G). We bound the amount of information revealed about S to each party as follows.
Lemma 4.1. Let M denote the messages of a deterministic ℓ-bit protocol. Then

    I_p(M : S | Y≤J J) ≤ I_p(M : XJ | Y<J J) ≤ 2^ℓ/n,   and
    I_p(M : S | X≤J J) ≤ I_p(M : YJ | X<J J) ≤ 2^ℓ/n.

     Since in the distribution p(X, F,Y, G) we have X<J = Y<J, it follows that I_p(M : XJ | Y<J J) =
I_p(M : XJ | X<J J). At first glance, one might think that the expression I_p(M : XJ | X<J J) can be
upper bounded by ℓ/n using the chain rule (by averaging over all n values of J). However, this is not true:


since J is essentially determined by X and Y and the messages M contain information about X and Y , it is
not clear if one can directly use the chain rule as simply as above. To understand this point in more detail,
it is worthwhile to consider a simple example.
     Consider the following variant of the binary search protocol for the k-ary pointer jumping problem
from Section 3.2. In that section, we discussed the protocol on the input distribution q(X, F,Y, G),
however the binary search phase of the protocol to find the index J works pretty much the same on the
fooling distribution p(X, F,Y, G). Using O(log n) bits, the players determine the value of J with a version
of binary search and then exchange one bit about the values of XJ and YJ. So, in this case the O(log n)-bit
protocol reveals at least 1 bit of information about XJ and YJ. Thus one cannot hope for a bound like
ℓ/n, since this O(log n)-bit protocol already reveals 1 bit of information. In fact, this protocol also shows
that the above lemma is tight. If we stop the binary search phase after O(ℓ) bits, the players will know a
set of size n/2^ℓ in which J lies. If the players then choose a random index K in this set and reveal one bit of
information about XK, the amount of information revealed about XJ is exactly 2^ℓ/n, as the players
reveal one bit about XJ with probability 2^ℓ/n.
     Despite the above, we are able to prove Lemma 4.1 by a subtle application of the chain rule and the
Markov property of protocols. Essentially, the proof argues that for each of the 2^ℓ possible transcripts, the
protocol, on average, reveals only 1/n bits of information.
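The tightness example can be verified exactly by brute force in a toy model (the parameters are our own): J is uniform in a known set T of size s, a uniform K ∈ T is chosen, and the message (K, X_K) is revealed; the information this reveals about X_J is exactly 1/s, matching 2^ℓ/n when s = n/2^ℓ:

```python
import math
from itertools import product

def mutual_information(joint):
    """I(A : B) in bits, computed from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

n, T = 6, (0, 1, 2)   # J uniform in T, |T| = s
s = len(T)
info = 0.0
for j in T:  # average over J of I(message : X_j)
    joint = {}
    for x in product((0, 1), repeat=n):      # X uniform on {0,1}^n
        for kk in T:                         # K uniform in T
            p = 0.5**n / s
            key = ((kk, x[kk]), x[j])        # (message, X_j)
            joint[key] = joint.get(key, 0.0) + p
    info += mutual_information(joint) / s
```

The enumeration returns info = 1/s exactly: the revealed bit determines X_J only when K = J, which happens with probability 1/s.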

Information revealed about XS FS and YS GS
Next, we want to show that the parties do not learn much information about the values of X, F,Y, G restricted
to S (denoted XS, FS, YS, GS). Note that only a 1/k fraction of the strings are in S, since for every z,
p(z ∈ S | J = j, X≤J = x≤j) ≤ 1/k. So, recalling the classical Shearer's Lemma [9, 23], one might hope
that I(M : XS FS) can be bounded by I(M : XF)/k ≤ ℓ/k. If S were independent of M, X, F, such a bound
would be easy to prove using the chain rule for information (see Lemma 8 in [15]).
    In our case, S is not independent of M but almost independent, as the amount of information that M
reveals about S is small (after conditioning on X<J J). So it is conceivable that I (M : XS FS ) can still be
bounded by O(I (M : XF) /k) plus some small error terms. We prove such a generalization of Shearer’s
Lemma which may be of independent interest. Below US denotes the restriction of U to the coordinates
in S. We prove the following lemma.
Lemma 4.2. Given a probability space p0, let U = (U1, . . . ,Ut), where U1, . . . ,Ut are mutually independent
random variables. Let random variables C ∈ {0, 1}^ℓ, S ⊆ [t], and V be such that U is independent of SV,
the Markov chain U − C − SV holds, and for all i ∈ [t], p0(i ∈ S) ≤ 1/k. Then

    I(C : US | V S) ≤ 2eℓ/k + 2ℓ·√(I(C : S)) + η(√(I(C : S))).
    Conditioned on any fixing of X≤J Y≤J J, we have that XF and Y G are independent under p, and since
M denotes the messages in a communication protocol, the Markov property (Proposition 2.14) implies
that XF − M −Y G holds. Thus, Lemma 4.2 in conjunction with Lemma 4.1 and convexity can be used to
show that the amount of information the messages reveal about XS FS (and similarly for YS GS) is small:

    I_p(M : XS FS | X≤J Y≤J J), I_p(M : YS GS | X≤J Y≤J J) ≤ 2eℓ/k + 2ℓ·√(2^ℓ/n) + η(√(2^ℓ/n)).    (4.1)



Conditioning on the event XS FS = YS GS

Lastly, we need to show that the distribution p(M) does not change much after conditioning on the event
XS FS = YS GS, which gives us the distribution q0(M). Since XS FS and YS GS are essentially independent of M
(the mutual information in (4.1) is small), one can expect that conditioning on the event XS FS = YS GS
should not change the distribution of M by much.
    In fact, we show the following general statement.
Lemma 4.3. Given a probability space p0, if A, B ∈ [T] are uniform and independent random variables,
and A − C − B, then

    p0(C) ≈^ε p0(C | A = B),   with ε = 2·(I(C : A))^{1/3} + 2·(I(C : B))^{1/3}.

      Lemma 4.3 together with (4.1) and another convexity argument completes the proof of Theorem 3.1.
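Lemma 4.3 can be sanity-checked numerically on a small example (our own toy construction): A and B are uniform independent bits and C = (noisy copy of A, noisy copy of B), so A − C − B holds; we verify that the statistical distance |p0(C) − p0(C | A = B)| is at most 2·I(C : A)^{1/3} + 2·I(C : B)^{1/3}:

```python
import math
from itertools import product

def tv(p, q):
    """Statistical (total variation) distance between two dict distributions."""
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))

def mi(joint):
    """I(A : B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), pr in joint.items():
        pa[a] = pa.get(a, 0.0) + pr
        pb[b] = pb.get(b, 0.0) + pr
    return sum(pr * math.log2(pr / (pa[a] * pb[b]))
               for (a, b), pr in joint.items() if pr > 0)

delta = 0.1                       # noise rate, an arbitrary choice
def flip(bit, out):               # p(out | bit) for a noisy copy
    return 1 - delta if out == bit else delta

pC, pC_eq, jCA, jCB = {}, {}, {}, {}
for a, b, c1, c2 in product((0, 1), repeat=4):
    pr = 0.25 * flip(a, c1) * flip(b, c2)   # A, B uniform independent bits
    c = (c1, c2)
    pC[c] = pC.get(c, 0.0) + pr
    jCA[(c, a)] = jCA.get((c, a), 0.0) + pr
    jCB[(c, b)] = jCB.get((c, b), 0.0) + pr
    if a == b:
        pC_eq[c] = pC_eq.get(c, 0.0) + pr
z = sum(pC_eq.values())                      # renormalize given A = B
pC_eq = {c: pr / z for c, pr in pC_eq.items()}

eps = 2 * mi(jCA) ** (1 / 3) + 2 * mi(jCB) ** (1 / 3)
```

Here conditioning on A = B visibly correlates the two coordinates of C, yet the distance stays well below the bound ε.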


4.2     Proof of Theorem 3.1
We shall prove that p(M) ≈^γ q0(M). The bound for the distribution q1(M) is analogous. We first give the
proof assuming Lemmas 4.1, 4.2, and 4.3, and then we prove the lemmas.
    By Lemma 4.1, I_p(M : S | X≤J J) ≤ 2^ℓ/n. After fixing x≤j, j, the set S is determined by YJ. For any such
fixing, we have that p(z ∈ S | x≤j j) ≤ 1/k holds for any string z, and that XF is independent of Y G.
Furthermore, by Proposition 2.14 we also have XF − M − SYJ after fixing x≤j, j. Thus we can apply
Lemma 4.2 with U = XF and C = M, V = YJ, to conclude that

    I_p(M : XS FS | SYJ x≤j j) ≤ 2eℓ/k + 2ℓ·√(I_p(M : S | x≤j j)) + η(√(I_p(M : S | x≤j j))).

Taking the expectation over the choice of x≤j j, and using the concavity of the functions √α and η(α), it
follows that

    I_p(M : XS FS | Y≤J X≤J J) ≤ 2eℓ/k + 2ℓ·√(I_p(M : S | X≤J J)) + η(√(I_p(M : S | X≤J J)))
                               ≤ 2eℓ/k + 2ℓ·√(2^ℓ/n) + η(√(2^ℓ/n)).


    The same bound applies to I_p(M : YS GS | Y≤J X≤J J). For each fixing of x≤j y≤j j, we have XS FS − M −
YS GS. Thus we can apply Lemma 4.3 to conclude that

    |p(M | x≤j y≤j j) − p(M | x≤j y≤j j, XS FS = YS GS)|
        ≤ 2·(I_p(M : XS FS | x≤j y≤j j))^{1/3} + 2·(I_p(M : YS GS | x≤j y≤j j))^{1/3}.



    Since p(X≤J Y≤J J) = q0(X≤J Y≤J J), we can use Proposition 2.2 to bound

    |p(M) − q0(M)| = E_{p(x≤j y≤j j)} [ |p(M | x≤j y≤j j) − p(M | x≤j y≤j j, XS FS = YS GS)| ]
                   ≤ E_{p(x≤j y≤j j)} [ 2·(I_p(M : XS FS | x≤j y≤j j))^{1/3} + 2·(I_p(M : YS GS | x≤j y≤j j))^{1/3} ]
                   ≤ 2·(I_p(M : XS FS | X≤J Y≤J J))^{1/3} + 2·(I_p(M : YS GS | X≤J Y≤J J))^{1/3}
                   ≤ 4·( 2eℓ/k + 2ℓ·√(2^ℓ/n) + η(√(2^ℓ/n)) )^{1/3} = γ,

where the second-to-last inequality follows from the concavity of α^{1/3} over the non-negative reals.

4.2.1   Proof of Lemma 4.1
Once Y≤J J are fixed, S is determined by XJ , since whether a string z is consistent or not is determined given
XJ YJ J. Furthermore, after Y<J J is fixed, XF and Y G are independent in the distribution p(X, F,Y, G), so
the Markov property of protocols (Proposition 2.14) implies that XJ − M − YJ (conditioned on X<J J).
Thus, using Propositions 2.4 and 2.8, we get that

    I_p(M : S | Y≤J J) ≤ I_p(M : XJ | Y≤J J) = I_p(M : XJ | Y<J J).                                  (4.2)

    Since X<J = Y<J, we have that I_p(M : XJ | Y<J J) = I_p(M : XJ | X<J J), which we shall show is at
most 2^ℓ/n. Recall that any message m induces a rectangle Sm × Tm in the input space, as given by
Proposition 2.13. Denoting by Sm (and Tm) the event that X ∈ Sm (and Y ∈ Tm), Proposition 2.13 implies
that M = m is equivalent to the event Sm ∧ Tm. Also, since XF and Y G are independent given x<j j, by
Proposition 2.14, we have p(X | m, x<j j) = p(X | Sm ∧ Tm, x<j j) = p(X | Sm, x<j j). So, we can write

    I_p(M : XJ | X<J J) = E_{m x<j j} [ D( Xj | m x<j j ‖ Xj | x<j j ) ] = E_{m x<j j} [ D( Xj | Sm, x<j j ‖ Xj | x<j j ) ],    (4.3)

where D(· ‖ ·) denotes the divergence.

The right-hand side in (4.3) can be rewritten as

    ∑_m p(Sm) p(Tm | Sm) · E_{x j | Sm, Tm} [ D( Xj | Sm, x<j j ‖ Xj | x<j j ) ]
        ≤ ∑_m p(Sm) · E_{x j | Sm} [ D( Xj | Sm, x<j j ‖ Xj | x<j j ) ],                             (4.4)



where the inequality follows from the fact that Ea [h(a)] ≥ p(W) Ea|W [h(a)], for any non-negative
function h.
    Since J is independent of X, we have that p(XJ) = p(J)p(X) and also p(XJ | Sm) = p(J)p(X | Sm).
So, we can write

    E_{x j | Sm} [ D( Xj | Sm, x<j j ‖ Xj | x<j j ) ] = ∑_{j=1}^n p(j) · E_{x | Sm} [ D( Xj | Sm, x<j ‖ Xj | x<j ) ].


    Since p(J) is uniform, we can use the chain rule to write the right-hand side above as

    ∑_{j=1}^n p(j) · E_{x | Sm} [ D( Xj | Sm, x<j ‖ Xj | x<j ) ] = (1/n) ∑_{j=1}^n E_{x | Sm} [ D( Xj | Sm, x<j ‖ Xj | x<j ) ] = (1/n) · D( X | Sm ‖ X ).

    Plugging the above into (4.4), we get that the left-hand side in (4.3) can be bounded by

    (1/n) ∑_m p(Sm) · D( X | Sm ‖ X ) ≤ (1/n) ∑_m p(Sm) log(1/p(Sm)) ≤ 2^ℓ/n,

where the first inequality follows from Proposition 2.9 (with B = ⊥ and W = Sm), and the second from
the fact that for 0 ≤ γ ≤ 1 it holds that γ log(1/γ) ≤ log(e)/e < 1, together with the fact that there are at
most 2^ℓ messages m.
    This proves that I_p(M : S | Y≤J J) ≤ I_p(M : XJ | Y<J J) ≤ 2^ℓ/n. The bound on I_p(M : S | X≤J J) follows
analogously.

4.2.2   Proof of Lemma 4.2
First we prove the following upper bound on the conditional information in terms of the divergence.

Claim 4.4. I(C : US | V S) ≤ ∑_{i=1}^t E_{c u} [ p0(i ∈ S | c) · D( Ui | c u<i ‖ Ui | u<i ) ].
    Call c bad if p0(i ∈ S | c) ≥ 2e/k + √(I(C : S)) for some i ∈ [t], and let W denote the event that C is
bad. Note that when the complement event W̄ occurs, then p0(i ∈ S | c) ≤ 2e/k + √(I(C : S)) for all i. We
next show that, if I(C : S) is small, C is bad only with a small probability.

Claim 4.5. p0(W) ≤ √(I(C : S)).

    We can now prove Lemma 4.2, using Claims 4.4 and 4.5. First, it holds that

    I(C : US | V S) ≤ p0(W) ∑_{i=1}^t E_{c u | W} [ D( Ui | c u<i ‖ Ui | u<i ) ]
                      + ( 2e/k + √(I(C : S)) ) · ∑_{i=1}^t E_{c u} [ D( Ui | c u<i ‖ Ui | u<i ) ].    (4.5)

When c is bad, we have that p0(Ui | c u<i W) = p0(Ui | c u<i), and so

    D( Ui | c u<i ‖ Ui | u<i ) = D( Ui | c u<i W ‖ Ui | u<i ).

Plugging this into the right-hand side in (4.5) and using the chain rule, we get that

    I(C : US | V S) ≤ p0(W) ∑_{i=1}^t E_{c u | W} [ D( Ui | c u<i W ‖ Ui | u<i ) ]
                      + ( 2e/k + √(I(C : S)) ) ∑_{i=1}^t E_{c u} [ D( Ui | c u<i ‖ Ui | u<i ) ]
                    = p0(W) · E_{c | W} [ D( U | c W ‖ U ) ] + ( 2e/k + √(I(C : S)) ) · E_c [ D( U | c ‖ U ) ].


                        T HEORY OF C OMPUTING, Volume 14 (20), 2018, pp. 1–29                                       17
                                     A NUP R AO AND M AKRAND S INHA

    Using Proposition 2.9 and the fact that

    E_c [ D( U | c ‖ U ) ] = I(U : C),

we can bound the above as
                                                                            
    I(C : US | V S) ≤ p0(W) · ( log(1/p0(W)) + I(U : C | W) ) + ( 2e/k + √(I(C : S)) ) · I(U : C)
                    ≤ η(p0(W)) + p0(W) · I(U : C | W) + √(I(C : S)) · I(U : C) + 2e·I(U : C)/k.    (4.6)
    Using Claim 4.5, p0(W) ≤ √(I(C : S)). Since η (see (3.1)) is a non-decreasing function, we can then
bound η(p0(W)) ≤ η(√(I(C : S))). By Proposition 2.3, I(U : C | W), I(U : C) ≤ ℓ, so plugging these bounds
into (4.6), we get that

    I(C : US | V S) ≤ η(√(I(C : S))) + 2ℓ·√(I(C : S)) + 2eℓ/k.

    This finishes the proof of Lemma 4.2. It only remains to prove the claims.

Proof of Claim 4.4. For i ∈ [t], set Ui* = Ui if i ∈ S and set Ui* = ⊥ otherwise. Since we have that
I(C : US | V S) = I(C : U1* U2* . . . Ut* | V S), by the chain rule we get

    I(C : US | V S) = E_{c v s} [ D( U1* . . . Ut* | c v s ‖ U1* . . . Ut* | v s ) ]
                    = E_{c u v s} [ ∑_{i=1}^t D( Ui* | c u*<i v s ‖ Ui* | u*<i v s ) ]
                    = E_{c u v s} [ ∑_{i∈s} D( Ui | c u*<i v s ‖ Ui | u*<i v s ) ],

where the last equality holds because when i ∉ S, Ui* = ⊥ (and so the divergence is 0), and when i ∈ S,
Ui* = Ui.
    By assumption, U is independent of V S, U − C − V S holds, and p0(ui | u*<i) = p0(ui) = p0(ui | u<i), so the
right-hand side above can be written as

    E_{c u v s} [ ∑_{i∈s} D( Ui | c u*<i v s ‖ Ui | u*<i v s ) ] = E_{c v s} E_{u | c} [ ∑_{i∈s} D( Ui | c u*<i ‖ Ui | u*<i ) ]
                                                                = E_{c v s} E_{u | c} [ ∑_{i∈s} D( Ui | c u*<i ‖ Ui | u<i ) ].    (4.7)

   Let 1[i ∈ S] denote the indicator variable for the event that i ∈ S. Using linearity of expectation and
Proposition 2.10, we have that

    (4.7) = E_{c v s} [ ∑_{i∈s} E_{u | c} D( Ui | c u*<i ‖ Ui | u<i ) ] ≤ E_{c v s} [ ∑_{i∈s} E_{u | c} D( Ui | c u<i ‖ Ui | u<i ) ]
          = E_{c u v s} [ ∑_i 1[i ∈ s] · D( Ui | c u<i ‖ Ui | u<i ) ],

where the inequality follows from Proposition 2.10 and the definition of Ui* (which depends
only on Ui and S).
    Finally, using U − C − V S once more, we get that the right-hand side above equals

    E_{c u} [ ∑_i E_{s | c} [ 1[i ∈ s] ] · D( Ui | c u<i ‖ Ui | u<i ) ] = ∑_{i=1}^t E_{c u} [ p0(i ∈ S | c) · D( Ui | c u<i ‖ Ui | u<i ) ].


                       T HEORY OF C OMPUTING, Volume 14 (20), 2018, pp. 1–29                              18

Proof of Claim 4.5. Define S_i = 1 if i ∈ S and 0 otherwise. We are going to show that whenever c is bad, we have that
\[
\mathrm{D}\!\left(S \mid c \,\middle\|\, S\right) \ge \sqrt{I(C : S)}.
\]
Since
\[
I(C : S) = \mathbb{E}_{c}\left[\mathrm{D}\!\left(S \mid c \,\middle\|\, S\right)\right],
\]
Markov's inequality will imply that the probability of c being bad is at most √I(C : S), and hence p(W) ≤ √I(C : S).
    When c is bad, there is an i^* such that p0(i^* ∈ S | c) ≥ 2e/k + √I(C : S). By the chain rule and Proposition 2.7,
\[
\mathrm{D}\!\left(S \mid c \,\middle\|\, S\right) \ge \mathrm{D}\!\left(S_{i^*} \mid c \,\middle\|\, S_{i^*}\right).
\]
Since p0(i ∈ S) ≤ 1/k for any i, by Proposition 2.12,
\[
\mathrm{D}\!\left(S \mid c \,\middle\|\, S\right) \ge \mathrm{D}\!\left(S_{i^*} \mid c \,\middle\|\, S_{i^*}\right) \ge p_0(i^*\in S \mid c)\log\left(\frac{k\cdot p_0(i^*\in S \mid c)}{e}\right) \ge \left(\frac{2e}{k}+\sqrt{I(C:S)}\right)\log\left(\frac{k}{e}\left(\frac{2e}{k}+\sqrt{I(C:S)}\right)\right) \ge \sqrt{I(C:S)}.
\]

This finishes the proof of the claim.

4.2.3   Proof of Lemma 4.3
We assume I(C : A), I(C : B) ≤ 1, since otherwise the lemma is trivially true. For brevity, set
\[
\alpha^3 = I(C : A) = \mathbb{E}_{c}\left[\mathrm{D}\!\left(A \mid c \,\middle\|\, A\right)\right] \quad\text{and}\quad \beta^3 = I(C : B) = \mathbb{E}_{c}\left[\mathrm{D}\!\left(B \mid c \,\middle\|\, B\right)\right].
\]
    Call c bad if
\[
\mathrm{D}\!\left(A \mid c \,\middle\|\, A\right) \ge \alpha^2 \quad\text{or}\quad \mathrm{D}\!\left(B \mid c \,\middle\|\, B\right) \ge \beta^2,
\]
and good otherwise. By Markov’s inequality, the probability that C is bad is at most α + β . To prove
Lemma 4.3, we need the following claim proved in [15]. For completeness, we include the short proof
after finishing the proof of Lemma 4.3.
Claim 4.6. Given independent random variables A^*, B^* ∈ [T] in a probability space p0, if A^* is γ_1-close to uniform, and B^* is γ_2-close to uniform, then
\[
p_0(A^* = B^*) \ge \frac{1-\gamma_1-\gamma_2}{T}.
\]


     When c is good, Pinsker’s inequality (Proposition 2.6) implies that conditioned on c, A is α-close
to uniform and B is β -close to uniform. Then, since A − C − B, using Claim 4.6 (with A∗ = A and
B∗ = B in the probability space p0 conditioned on c) implies that p0 (A = B | c) ≥ (1 − α − β )/T . Since
p0(A = B) = 1/T, we have that for a good c,
\[
p_0(c \mid A = B) = \frac{p_0(c)\cdot p_0(A = B \mid c)}{p_0(A = B)} \ge (1-\alpha-\beta)\cdot p_0(c). \tag{4.8}
\]
    For any event Q, (4.8) implies that
\[
p_0(C\in Q) - p_0(C\in Q \mid A=B) \le \sum_{\substack{c\in Q\\ c\ \text{bad}}} p_0(c) + \sum_{\substack{c\in Q\\ c\ \text{good}}} \bigl(p_0(c) - p_0(c \mid A=B)\bigr) \le p_0(C\ \text{is bad}) + \sum_{c} p_0(c)(\alpha+\beta) \le \alpha+\beta+\sum_{c} p_0(c)(\alpha+\beta) \le 2\alpha+2\beta,
\]
and since |p0(C) − p0(C | A = B)| = max_Q (p0(C ∈ Q) − p0(C ∈ Q | A = B)), we get the required bound on statistical distance.
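The last step uses the standard fact that the statistical distance between two distributions equals the maximum over events Q of the probability gap, which in turn equals the sum of the positive pointwise differences. A quick numerical sanity check of this identity (our own illustration in Python, not part of the argument):

```python
import itertools, random

def max_event_gap(p, q):
    """max over all events Q of p(Q) - q(Q), by brute force over subsets."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for Q in itertools.combinations(range(n), r):
            best = max(best, sum(p[i] - q[i] for i in Q))
    return best

def positive_part(p, q):
    """Sum of positive pointwise differences, the usual statistical-distance formula."""
    return sum(max(p[i] - q[i], 0.0) for i in range(len(p)))

rng = random.Random(1)
n = 8
w = [rng.random() for _ in range(n)]; p = [x / sum(w) for x in w]
w = [rng.random() for _ in range(n)]; q = [x / sum(w) for x in w]
assert abs(max_event_gap(p, q) - positive_part(p, q)) < 1e-12
```

The maximizing event is exactly the set of outcomes where p exceeds q, which is why the two quantities coincide.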

Proof of Claim 4.6. For each i, let p0(A^* = i) = 1/T + α_i and p0(B^* = i) = 1/T + β_i. Then, ∑_i α_i = ∑_i β_i = 0, and α_i, β_i ≥ −1/T. Using these facts,
\[
p_0(A^* = B^*) = \sum_i \left(\frac{1}{T}+\alpha_i\right)\left(\frac{1}{T}+\beta_i\right) = \frac{1}{T} + \frac{\sum_i\alpha_i}{T} + \frac{\sum_i\beta_i}{T} + \sum_i \alpha_i\beta_i = \frac{1}{T} + \sum_i \alpha_i\beta_i.
\]
To get a lower bound on the above, we will only consider the negative terms in the summation:
\[
p_0(A^* = B^*) \ge \frac{1}{T} + \sum_{i:\alpha_i>0,\,\beta_i<0} \alpha_i\beta_i + \sum_{i:\alpha_i<0,\,\beta_i>0} \alpha_i\beta_i \ge \frac{1}{T} - \frac{1}{T}\sum_{i:\alpha_i>0}\alpha_i - \frac{1}{T}\sum_{i:\beta_i>0}\beta_i.
\]

    From Proposition 2.1, it follows that ∑i:αi >0 αi is the statistical distance between A∗ and the uniform
distribution on [T ] and likewise for B∗ . So, we get,
\[
p_0(A^* = B^*) \ge \frac{1-\gamma_1-\gamma_2}{T}.
\]
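Claim 4.6 can also be checked numerically: for any two independent distributions on [T], the collision probability is at least (1 − γ_1 − γ_2)/T, where γ_1, γ_2 are their statistical distances from uniform. The following Python sketch (our illustration, not part of the proof) tests the bound on random distributions:

```python
import random

def tv_to_uniform(p):
    """Statistical distance of distribution p (list over [T]) from uniform."""
    T = len(p)
    return sum(max(pi - 1.0 / T, 0.0) for pi in p)

def collision_prob(p, q):
    """Probability that independent samples A* ~ p and B* ~ q are equal."""
    return sum(pi * qi for pi, qi in zip(p, q))

def random_dist(T, rng):
    w = [rng.random() for _ in range(T)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
T = 50
for _ in range(1000):
    p, q = random_dist(T, rng), random_dist(T, rng)
    g1, g2 = tv_to_uniform(p), tv_to_uniform(q)
    # the bound of Claim 4.6
    assert collision_prob(p, q) >= (1 - g1 - g2) / T - 1e-12
```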

4.3     Fooling distributions and relative discrepancy
In this section, we compare our techniques to that of Ganor, Kol and Raz [15]. The key concept introduced
in [15] to prove lower bounds is the notion of the relative discrepancy. Let f (x, y) be a Boolean function
and q(X,Y) be a distribution such that q(f = 0) = q(f = 1) = 1/2. Then, f has (ε, δ) relative discrepancy under q(X,Y) if there exists a distribution u(X,Y) such that for every rectangle S × T in the input space,
\[
\left.\begin{aligned} q(X\in S, Y\in T \mid f=0)\\ q(X\in S, Y\in T \mid f=1)\end{aligned}\right\} \ge (1-\varepsilon)\,u(X\in S, Y\in T) \qquad \text{if } u(X\in S, Y\in T)\ge\delta. \tag{4.9}
\]



    Ganor, Kol and Raz [15] proved that if f has (ε = 1/3, δ ) relative discrepancy under q(X,Y ), then f
has communication complexity Ω(log 1/δ ). They then prove a lower bound on the relative discrepancy
of the bursting noise function to give a communication lower bound. Note that the relative discrepancy
technique individually argues about each rectangle that has large measure under the distribution u(X,Y ).
Ganor, Kol and Raz [15] prove a lemma (Lemma 12 in [15]) that is very similar to our Lemma 4.1; however, they argue about each rectangle with large enough measure under u(X,Y).
    In contrast, the fooling distribution technique allows us to argue about the distribution on average as opposed to arguing about each rectangle individually, and this is where our proof becomes simpler
and more intuitive. Lemma 4.1 is an average-case version of the lemma proved in [15]. Together with our
generalization of Shearer’s Lemma (Lemma 4.2) and Lemma 4.3 we can argue about the messages on
average. Note that our fooling distribution p(X, F,Y, G) is analogous to the distribution u(X,Y ) in the
definition of relative discrepancy.
    Next we show some connections between relative discrepancy and fooling distributions. In fact, we
show that low relative discrepancy implies the existence of a fooling distribution. The converse does not
appear to be true.
Claim 4.7 (Relative Discrepancy implies Fooling Distribution). If f has relative discrepancy (ε, 2^{−2ℓ}), then there exists a distribution u(X,Y) such that if M ∈ {0,1}^ℓ denotes the messages of a deterministic protocol, then
\[
q(M \mid f=0) \overset{\gamma}{\approx} u(M) \overset{\gamma}{\approx} q(M \mid f=1),
\]
where γ = 2^{−ℓ} + ε.

Proof. Let u(X,Y) be the distribution that satisfies (4.9) with δ = 2^{−2ℓ}. We will show that u(M) is γ-close to q(M | f = 0). The proof for u(M) and q(M | f = 1) is similar. Define B = {m | u(m) < 2^{−2ℓ}} and note that u(M ∈ B) < 2^ℓ · 2^{−2ℓ} = 2^{−ℓ}. Also observe that when m ∉ B, then by (4.9) and Proposition 2.13, u(m) − q(m | f = 0) ≤ εu(m). Now for any event Q, we have
\[
u(M\in Q) - q(M\in Q \mid f=0) \le \sum_{m\in Q\cap B} u(m) + \sum_{m\in Q\cap \bar{B}} \bigl(u(m) - q(m \mid f=0)\bigr) \le u(B) + \varepsilon u(\bar{B}) < 2^{-\ell} + \varepsilon.
\]
Hence |u(M) − q(M | f = 0)| = max_Q (u(M ∈ Q) − q(M ∈ Q | f = 0)) < 2^{−ℓ} + ε.

    The above claim gives another reason why our proof is simpler than the one given in [15]: we are proving a weaker statement that still implies a communication lower bound. We mention that
in [15], a relaxed notion of relative discrepancy, called the adaptive relative discrepancy is also defined.
Adaptive relative discrepancy works with a partition of the input space into rectangles as opposed to
working with each rectangle individually, and an upper bound on this measure also suffices to prove a
communication lower bound. Using arguments similar to Claim 4.7 one can prove that the existence of a
fooling distribution implies that the adaptive relative discrepancy is small. Hence, the lower bounds given
by the fooling distribution technique are sandwiched between the lower bounds that can be obtained by
the relative discrepancy and the adaptive relative discrepancy methods.
    We remark that the results of Fontes et al. [12] do not rule out the possibility that one might be able
to separate information and communication complexity in the non-distributional setting (see Section 1.1)


by using either of the two techniques, fooling distributions or adaptive relative discrepancy. In fact,
Fontes et al. [12] show that the adaptive relative discrepancy is equal to the worst-case distributional
communication complexity of the function up to polynomial factors, but we do not know if the same
holds for fooling distributions.


5    Information upper bound
Let us recall the trivial protocol and the low-information protocol mentioned in Section 3. In the trivial
protocol Alice and Bob send each other X(z≤i ),Y (z≤i ) in each step i, until they have computed z. The
parties can compute F(z) + G(z) mod 2 with two more bits of communication. This protocol reveals at
least Ω(log n) bits of information as both parties learn the value of J with high probability. But note that
the z computed in the above manner is consistent and the input distribution q(X, F,Y, G) ensures that
F(z) + G(z) mod 2 is the same for every consistent z, so it is enough for the parties to find a consistent z
to complete the goal.
    For the low-information protocol, the parties send each other the values X(z≤i ),Y (z≤i ) with probability
1 − ε and send a uniformly random value otherwise. The parties abort the protocol if they see r rounds
where the messages they sent were not the same. The distribution on inputs ensures that they will sample
a consistent z with probability 1 − O(ε). When the parties sample a consistent z, the messages sent are
almost always sampled from a distribution that the receiving party knows, while if they sample a z that is
not consistent, the protocol aborts shortly after the inconsistency.
    The purpose of the noise is to prevent revealing a lot of information about J to both parties. When the
z that the parties sample is consistent they only know that the value of J is among the locations where the
values they have exchanged disagree. Since in this case there are εn disagreements in expectation, we
intuitively expect that the entropy of J should still be large at the end which means that the information
revealed about the value of J is small. In case the z sampled is not consistent (which happens with
probability O(ε)), the abort condition prevents the parties from revealing too much information. By
choosing the parameter ε carefully we can ensure that the total amount of information revealed is small.
    In Figure 2 below we describe the low-information protocol which is parameterized by ε.
Lemma 5.1. The protocol in Figure 2 outputs the correct answer with probability at least 1 − 4ε on
inputs drawn from the distribution q(X, F,Y, G).

Proof. Under the input distribution q(X, F, Y, G), whether a sampled z with |z| ≥ J + 1 is consistent is determined solely by the (J + 1)st step. So, the probability that the parties sample an inconsistent z with |z| ≥ J + 1 is at most 2ε. Given that this event does not happen, the probability that the
parties abort in a particular step is at most (2ε)r−1 , since to abort one of them must choose to send
uniform values in at least r − 1 steps where their inputs are equal. By the union bound, the probability
that the parties abort in any step is at most n(2ε)r−1 . If neither of these bad events happens, the protocol
computes the correct answer. Thus the probability of making an error is at most 2ε + n(2ε)r−1 ≤ 4ε, by
the choice of r.
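The arithmetic in the last step can be checked directly: with r = ⌈log n / log(1/2ε) + 2⌉ as in Figure 2, we have n(2ε)^{r−1} ≤ 2ε, so the total error is at most 4ε. A small Python sketch verifying this for a few parameter settings (our illustration, not part of the proof):

```python
import math

def error_bound(n, eps):
    """The error 2*eps + n*(2*eps)**(r - 1) from the proof of Lemma 5.1,
    with r = ceil(log n / log(1/(2 eps)) + 2) as in Figure 2."""
    r = math.ceil(math.log2(n) / math.log2(1.0 / (2 * eps)) + 2)
    return 2 * eps + n * (2 * eps) ** (r - 1)

# the bound holds for a range of n and eps, including eps = 1/log n
for n in [16, 256, 4096, 2 ** 20]:
    for eps in [0.01, 0.05, 1.0 / math.log2(n)]:
        assert error_bound(n, eps) <= 4 * eps + 1e-12
```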

   The following theorem proves that the information cost of the protocol is small. We directly argue about the average information revealed by the messages, in contrast to the proof of [15], which uses the



    Input: Alice is given (x, f), Bob is given (y, g). Both know a parameter ε ∈ (0, 1).
    Output: f(z) + g(z) mod 2 for some consistent z.

    Set z to be the empty string;
    Set r = ⌈log n / log(1/2ε) + 2⌉;
    for i = 1, 2, . . . , n do
        Alice sets a_i = x(z_<i) with probability 1 − ε, and a uniform element of [k] with probability ε;
        Bob sets b_i = y(z_<i) with probability 1 − ε, and a uniform element of [k] with probability ε;
        Send m_i = a_i, b_i to each other;
        Set w ∈ [k] so that w = a_i + b_i mod k, and append w to the string z;
        if i ≥ r and a_i′ ≠ b_i′ for all i′ with i − r + 1 ≤ i′ ≤ i    // if a_i and b_i disagree in each of the last r rounds
        then
            Terminate the protocol;
        end
    end
    Send f(z), g(z);

                                    Figure 2: Protocol π_ε.
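The protocol of Figure 2 can be sketched in Python as follows. The functions x, y, f, g are stand-ins for the actual inputs; the random-function table in the usage example below is our own illustration (the paper's inputs come from the distribution q), not the paper's construction:

```python
import math, random

def run_protocol(n, k, eps, x, y, f, g, rng):
    """One run of protocol pi_eps (Figure 2). x, y map prefixes (tuples over
    range(k)) to range(k); f, g map the final string z to bits."""
    r = math.ceil(math.log2(n) / math.log2(1.0 / (2 * eps)) + 2)
    z = ()
    history = []  # the messages m_i = (a_i, b_i)
    for i in range(1, n + 1):
        a = x(z) if rng.random() < 1 - eps else rng.randrange(k)  # Alice's a_i
        b = y(z) if rng.random() < 1 - eps else rng.randrange(k)  # Bob's b_i
        history.append((a, b))
        z = z + ((a + b) % k,)
        # abort if a_i and b_i disagreed in each of the last r rounds
        if i >= r and all(ai != bi for ai, bi in history[-r:]):
            return None
    return (f(z) + g(z)) % 2

# usage with a lazily sampled random function standing in for both players:
rng = random.Random(2)
k, n, eps = 8, 32, 0.05
table = {}
def x(z):
    if z not in table:
        table[z] = rng.randrange(k)
    return table[z]

f = lambda z: sum(z) % 2
outs = [run_protocol(n, k, eps, x, x, f, f, rng) for _ in range(200)]
```

With identical inputs on both sides (x = y, f = g), every completed run outputs f(z) + f(z) mod 2 = 0, and the abort condition fires only when noise causes r consecutive disagreements.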


notion of the tree divergence cost to argue about the same.
Theorem 5.2. If M denotes the messages in the protocol of Figure 2, then
\[
I_q(M : XF \mid YG),\; I_q(M : YG \mid XF) \;\le\; 2\log(k/\varepsilon)\cdot\Bigl(1 + 16\varepsilon\cdot(\log n + 3)\cdot 2^{(2\log n)/(k\log(1/2\varepsilon))}\Bigr).
\]

    Setting ε = 1/ log n gives Theorem 1.1. To prove Theorem 5.2, we bound Iq (M : XF | Y G). The
second term is bounded in the same way.
    Let M_i ∈ [k] × [k] be the messages exchanged in the ith round for i ∈ [n] and M_{n+1} ∈ {0,1} × {0,1} be the messages exchanged in the (n + 1)st round. Then, by the chain rule, we can write
\[
I_q(M : XF \mid YG) = \mathbb{E}_{q(xfyg)}\left[\mathrm{D}\!\left(q(M \mid xfyg) \,\middle\|\, q(M \mid yg)\right)\right] = \sum_{i=1}^{|m|}\mathbb{E}_{q(mxfyg)}\left[\mathrm{D}\!\left(q(M_i \mid m_{<i}xfyg) \,\middle\|\, q(M_i \mid m_{<i}yg)\right)\right]. \tag{5.1}
\]

    To bound (5.1), we apply Proposition 2.11 which allows us to replace the distribution q(Mi | m<i yg) in
the divergence expression above with a different distribution at the expense of increasing the divergence.
Define distribution u(MY G) = q(MY G | XF = Y G). It will be simpler to analyze the divergence with
respect to this distribution. Using Proposition 2.11, we then have




                                                                "                         #
                                               |m|
                                                                    q(Mi | m<i x f yg)
                                       (5.1) ≤ ∑      E                                       .
                                               i=1 q(mx f yg)        u(Mi | m<i yg)

   Next, we prove the following claim, which bounds the contribution to the divergence of each possible
message.
Claim 5.3.
\[
\mathrm{D}\!\left(q(M_i \mid m_{<i}xfyg) \,\middle\|\, u(M_i \mid m_{<i}yg)\right)
\begin{cases}
= 0 & \text{if } i\le n \text{ and } x(z_{<i}) = y(z_{<i}),\\
\le \log(k/\varepsilon) & \text{if } i\le n \text{ and } x(z_{<i}) \ne y(z_{<i}),\\
= 1 & \text{if } i = n+1.
\end{cases}
\]

    Before proving Claim 5.3, we show how to use it to bound the information. First note that the above
claim intuitively says that information is revealed by Mi only when X(Z<i ) 6= Y (Z<i ). Recall that with
probability 1 − O(ε), the parties sample a consistent z and hence, information is revealed only in one step
of the protocol (and also one bit at the end when exchanging values of F(z) and G(z)). However, in case
the parties do not sample a consistent z, there might be a lot of messages Mi such that X(Z<i ) 6= Y (Z<i ).
    To bound the information, we next show that the number of such messages is small on average, since the parties will abort if they see a lot of disagreements. Towards this end, define Q_i = 1 if |M| ≥ i and X(Z_{<i}) ≠ Y(Z_{<i}), and 0 otherwise. Claim 5.3 implies that
\[
I_q(M : XF \mid YG) \le 1 + \log(k/\varepsilon)\cdot \mathbb{E}_{q}\left[\sum_{i=1}^{n} Q_i\right],
\]

so it only remains to bound E_q[∑_{i=1}^{n} Q_i], which is the number of messages where information is revealed.
Claim 5.4.
\[
\mathbb{E}_{q}\left[\sum_{i=1}^{n} Q_i\right] \le 1 + \frac{2r\varepsilon}{(1-1/k)^r} \le 1 + 16\varepsilon\cdot(\log n + 3)\cdot 2^{(2\log n)/(k\log(1/2\varepsilon))}.
\]

    Claims 5.3 and 5.4 complete the proof of Theorem 5.2. We prove them next.

Proof of Claim 5.3. For any i ∈ [n + 1], let us write Mi = Ai , Bi where Ai denotes Alice’s message and Bi
denotes Bob’s message. Note that, when i = n + 1, Ai and Bi are bits, and otherwise they are numbers in
[k]. By the chain rule, for every i ∈ [n + 1],
\[
\mathrm{D}\!\left(q(M_i \mid m_{<i}xfyg) \,\middle\|\, u(M_i \mid m_{<i}yg)\right) = \mathrm{D}\!\left(q(A_i \mid m_{<i}xfyg) \,\middle\|\, u(A_i \mid m_{<i}yg)\right) + \mathbb{E}_{q(a_i\mid m_{<i}xfyg)}\left[\mathrm{D}\!\left(q(B_i \mid m_{<i}xfyg\,a_i) \,\middle\|\, u(B_i \mid m_{<i}yg\,a_i)\right)\right]. \tag{5.2}
\]

   By the definition of the protocol in Figure 2, for every i ∈ [n + 1], Bi is determined by Y GM<i .
This is because for i ∈ [n], Bi is either a uniform random value or Y (z<i ) for some z<i determined
by M<i and when i = n + 1, Bi = G(z) for some z determined by M≤n . It follows that for i ∈ [n + 1],
q(Bi | m<i x f ygai ) = q(Bi | m<i yg). Similarly, by the definition of the distribution u(MY G), we have


u(Bi | m<i x f ygai ) = u(Bi | m<i yg) = q(Bi | m<i yg) for every i ∈ [n + 1]. Hence, the second term in (5.2)
is zero since the distributions are the same. So, for i ∈ [n + 1] we get that
\[
\mathrm{D}\!\left(q(M_i \mid m_{<i}xfyg) \,\middle\|\, u(M_i \mid m_{<i}yg)\right) = \mathrm{D}\!\left(q(A_i \mid m_{<i}xfyg) \,\middle\|\, u(A_i \mid m_{<i}yg)\right). \tag{5.3}
\]
    When i = n + 1, using (5.3), a direct calculation shows
\[
\mathrm{D}\!\left(q(M_{n+1} \mid m_{\le n}xfyg) \,\middle\|\, u(M_{n+1} \mid m_{\le n}yg)\right) = 1\cdot\log\frac{1}{1/2} = 1.
\]

    Let us consider the case when i ∈ [n] now. If x(z_{<i}) = y(z_{<i}), the definition of the protocol given in Figure 2, together with (5.3), ensures that u(A_i | m_{<i}yg) = q(A_i | m_{<i}xfyg), proving the first bound. On the other hand, if x(z_{<i}) ≠ y(z_{<i}), then q(A_i = a_i | m_{<i}xfyg) = u(A_i = a_i | m_{<i}yg) except when a_i = x(z_{<i}) or a_i = y(z_{<i}), since for every other value the probability is ε/k under both distributions. Thus the divergence can be bounded by the contribution of these two values. We have that q(A_i = x(z_{<i}) | m_{<i}xfyg) = (1 − ε) + (ε/k), where the first and second terms account respectively for the cases when Alice sends the correct value and when Alice sends a random value. On the other hand, we have
\[
u(A_i = x(z_{<i}) \mid m_{<i}yg) = q(A_i = x(z_{<i}) \mid m_{<i}yg,\, X=y,\, F=g) = \varepsilon/k.
\]
Similarly, we get that
\[
q(A_i = y(z_{<i}) \mid m_{<i}xfyg) = \varepsilon/k \quad\text{and}\quad u(A_i = y(z_{<i}) \mid m_{<i}yg) = 1-\varepsilon+(\varepsilon/k).
\]

    Therefore, using (5.3) we can bound
\[
\mathrm{D}\!\left(q(M_i \mid m_{<i}xfyg) \,\middle\|\, u(M_i \mid m_{<i}yg)\right) = \left(1-\varepsilon+\frac{\varepsilon}{k}\right)\log\frac{1-\varepsilon+\varepsilon/k}{\varepsilon/k} + \frac{\varepsilon}{k}\log\frac{\varepsilon/k}{1-\varepsilon+\varepsilon/k} \le \log\frac{1-\varepsilon+\varepsilon/k}{\varepsilon/k} \le \log(k/\varepsilon),
\]

as required.
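The divergence computation above can be verified numerically. Under q, Alice's message puts mass 1 − ε + ε/k on x(z_{<i}) and ε/k on every other value; under u (where X = y), the large mass sits on y(z_{<i}) instead. The following Python sketch (our illustration) checks that the divergence between these two explicit distributions is at most log(k/ε):

```python
import math

def kl(p, q):
    """KL divergence (in bits) between distributions given as probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alice_msg_dists(k, eps):
    """Alice's message distribution in a disagreeing round: under q, mass
    1 - eps + eps/k sits on x(z_<i) (index 0); under u, where X = y, the large
    mass sits on y(z_<i) (index 1). All other values get eps/k in both."""
    hi, lo = 1 - eps + eps / k, eps / k
    p = [hi] + [lo] * (k - 1)
    u = [lo, hi] + [lo] * (k - 2)
    return p, u

for k in [2, 8, 64]:
    for eps in [0.01, 0.1, 0.3]:
        p, u = alice_msg_dists(k, eps)
        assert kl(p, u) <= math.log2(k / eps) + 1e-9
```

Only the two special indices contribute to the divergence, since the ratio p_i/u_i equals 1 everywhere else, exactly as in the calculation above.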

Proof of Claim 5.4. Let T be such that M_T is the last message sent in the protocol. Let W be the event that T > J and the Z sampled by the protocol is not consistent. Since q(W) ≤ 2ε (the event W requires that one of the parties does not send X(Z_{≤J}) or Y(Z_{≤J}) at the (J + 1)st step), and E_q[∑_{i=1}^{n} Q_i | ¬W] ≤ 1 (at most one message reveals information when Z is consistent), we have
\[
\mathbb{E}_{q}\left[\sum_{i=1}^{n} Q_i\right] \le 2\varepsilon\cdot\mathbb{E}_{q}\left[\sum_{i=1}^{n} Q_i \,\middle|\, W\right] + 1.
\]


    If T > J, then ∑_{i=1}^{n} Q_i = ∑_{i=J+1}^{T} Q_i ≤ T − J, so we get E_q[∑_{i=1}^{n} Q_i | W] ≤ E_q[T − J | W]. Note that T − J roughly behaves like a geometric random variable. To bound this expectation, let us bound the contribution


when T − J ≤ r and when T − J > r separately. Conditioned on the event W, the protocol aborts within any window of r rounds with probability at least (1 − 1/k)^r, so the probability that it does not abort in r rounds is at most 1 − (1 − 1/k)^r, and we have
\[
\begin{aligned}
\mathbb{E}_{q}[T-J \mid W] &\le q(T-J\le r \mid W)\,r + q(T-J>r \mid W)\bigl(r + \mathbb{E}_{q}[T-J \mid W]\bigr)\\
&\le (1-1/k)^r\, r + \bigl(1-(1-1/k)^r\bigr)\bigl(r + \mathbb{E}_{q}[T-J \mid W]\bigr)\\
\implies \mathbb{E}_{q}[T-J \mid W] &\le \frac{r}{(1-1/k)^r},
\end{aligned}
\]
as required. The second inequality in the statement of the claim follows from the fact that 1/(1 − 1/k) ≤ 2^{2/k} for k ≥ 2, and by the choice of r.
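The two facts used at the end of the proof can be checked mechanically: E = r/(1 − 1/k)^r solves the fixed-point inequality E ≤ r + (1 − (1 − 1/k)^r)E, and 1/(1 − 1/k) ≤ 2^{2/k} for every k ≥ 2. A Python sketch (our illustration, not part of the proof):

```python
import math

def expected_wait_bound(k, r):
    """The solution E = r / (1 - 1/k)**r of the fixed-point equation
    E = r + (1 - (1 - 1/k)**r) * E from the proof of Claim 5.4."""
    return r / (1 - 1 / k) ** r

for k in range(2, 200):
    # 1/(1 - 1/k) <= 2**(2/k) for k >= 2, used for the claim's second inequality
    assert 1 / (1 - 1 / k) <= 2 ** (2 / k) + 1e-12
    for r in [1, 5, 20]:
        E = expected_wait_bound(k, r)
        # verify E really is a fixed point of E = r + (1 - (1-1/k)**r) * E
        assert abs(E - (r + (1 - (1 - 1 / k) ** r) * E)) < 1e-6 * E
```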


6    Acknowledgments
We thank Paul Beame, Yuval Filmus, Moni Naor, Sivaramakrishnan Ramamoorthy and Ran Raz for
useful conversations and anonymous referees for careful reading and very helpful comments.


References
 [1] ZIV BAR-YOSSEF, T. S. JAYRAM, RAVI KUMAR, AND D. SIVAKUMAR: An information statistics
     approach to data stream and communication complexity. J. Comput. System Sci., 68(4):702–732,
     2004. Conference version in FOCS'02. [doi:10.1016/j.jcss.2003.11.006] 2

 [2] BOAZ BARAK, MARK BRAVERMAN, XI CHEN, AND ANUP RAO: How to compress interactive
     communication. SIAM J. Comput., 42(3):1327–1363, 2013. Conference version in STOC'10.
     [doi:10.1137/100811969] 2

 [3] MARK BRAVERMAN: Interactive information complexity. SIAM J. Comput., 44(6):1698–1739,
     2015. Conference version in STOC'12. [doi:10.1137/130938517] 2, 3, 4, 11

 [4] MARK BRAVERMAN AND ANKIT GARG: Public vs private coin in bounded-round information. In
     Proc. 41st Internat. Colloq. on Automata, Languages and Programming (ICALP'14), pp. 502–513.
     Springer, 2014. Full version ECCC TR13-130. [doi:10.1007/978-3-662-43948-7_42] 2

 [5] MARK BRAVERMAN AND ANUP RAO: Information equals amortized communication.
     IEEE Trans. Inform. Theory, 60(10):6058–6069, 2014. Conference version in FOCS'11.
     [doi:10.1109/TIT.2014.2347282, arXiv:1106.3595] 2

 [6] MARK BRAVERMAN, ANUP RAO, OMRI WEINSTEIN, AND AMIR YEHUDAYOFF: Direct products
     in communication complexity. In Proc. 54th FOCS, pp. 746–755. IEEE Comp. Soc. Press, 2013.
     [doi:10.1109/FOCS.2013.85] 2, 6, 7

 [7] MARK BRAVERMAN AND OMRI WEINSTEIN: A discrepancy lower bound for information complex-
     ity. Algorithmica, 76(3):846–864, 2016. Conference version in APPROX'12. [doi:10.1007/s00453-
     015-0093-8, arXiv:1112.2000] 2, 4


 [8] AMIT CHAKRABARTI, YAOYUN SHI, ANTHONY WIRTH, AND ANDREW CHI-CHIH YAO: Infor-
     mational complexity and the direct sum problem for simultaneous message complexity. In Proc.
     42nd FOCS, pp. 270–278. IEEE Comp. Soc. Press, 2001. [doi:10.1109/SFCS.2001.959901] 2

 [9] FAN R. K. CHUNG, RONALD L. GRAHAM, PETER FRANKL, AND JAMES B. SHEARER: Some
     intersection theorems for ordered sets and graphs. J. Combin. Theory Ser. A, 43(1):23–37, 1986.
     [doi:10.1016/0097-3165(86)90019-1] 13, 14

[10] THOMAS M. COVER AND JOY A. THOMAS: Elements of Information Theory (Wiley Series in
     Telecommunications and Signal Processing). Wiley-Interscience, 2006. 6

[11] URIEL FEIGE, PRABHAKAR RAGHAVAN, DAVID PELEG, AND ELI UPFAL: Computing with
     noisy information. SIAM J. Comput., 23(5):1001–1018, 1994. Conference version in STOC'90.
     [doi:10.1137/S0097539791195877] 10

[12] LILA FONTES, RAHUL JAIN, IORDANIS KERENIDIS, SOPHIE LAPLANTE, MATHIEU LAURIÈRE,
     AND JÉRÉMIE ROLAND: Relative discrepancy does not separate information and communication
     complexity. ACM Trans. Comput. Theory, 9(1):4:1–4:15, 2016. Conference version in ICALP'15.
     [doi:10.1145/2967605] 4, 21, 22

[13] ANAT GANOR, GILLAT KOL, AND RAN RAZ: Exponential separation of information
     and communication. In Proc. 55th FOCS, pp. 176–185. IEEE Comp. Soc. Press, 2014.
     [doi:10.1109/FOCS.2014.27] 3

[14] ANAT GANOR, GILLAT KOL, AND RAN RAZ: Exponential separation of communica-
     tion and external information. In Proc. 48th STOC, pp. 977–986. ACM Press, 2016.
     [doi:10.1145/2897518.2897535] 4

[15] ANAT GANOR, GILLAT KOL, AND RAN RAZ: Exponential separation of information and commu-
     nication for Boolean functions. J. ACM, 63(5):46:1–46:31, 2016. Conference version in STOC'15.
     [doi:10.1145/2907939] 3, 4, 6, 7, 10, 14, 19, 20, 21, 22

[16] PRAHLADH HARSHA, RAHUL JAIN, DAVID A. MCALLESTER, AND JAIKUMAR RADHAKRISH-
     NAN: The communication complexity of correlation. IEEE Trans. Inform. Theory, 56(1):438–449,
     2010. Conference version in CCC'07. [doi:10.1109/TIT.2009.2034824] 2

[17] BALA KALYANASUNDARAM AND GEORG SCHNITGER: The probabilistic communication com-
     plexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992. Conference version in 2nd
     Structure in Complexity Theory Conf., 1987. [doi:10.1137/0405044] 2

[18] IORDANIS KERENIDIS, SOPHIE LAPLANTE, VIRGINIE LERAYS, JÉRÉMIE ROLAND, AND
     DAVID XIAO: Lower bounds on information complexity via zero-communication protocols
     and applications. SIAM J. Comput., 44(5):1550–1572, 2015. Conference version in FOCS'12.
     [doi:10.1137/130928273, arXiv:1204.1505] 2

[19] GILLAT KOL: Interactive compression for product distributions. In Proc. 48th STOC, pp. 987–998.
     ACM Press, 2016. [doi:10.1145/2897518.2897537] 2


[20] E YAL K USHILEVITZ AND N OAM N ISAN: Communication Complexity. Cambridge University
     Press, 1997. 8

[21] S IVARAMAKRISHNAN NATARAJAN R AMAMOORTHY AND A NUP R AO: How to compress asym-
     metric communication. In Proc. 30th Conf. on Computational Complexity (CCC’15), pp. 102–123.
     Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2015. [doi:10.4230/LIPIcs.CCC.2015.102] 2

[22] SIVARAMAKRISHNAN NATARAJAN RAMAMOORTHY AND MAKRAND SINHA: On the communication complexity of greater-than. In Proc. 53rd Ann. Allerton Conf. on Communication, Control, and Computing (Allerton'15), pp. 442–444. IEEE Comp. Soc. Press, 2015. [doi:10.1109/ALLERTON.2015.7447037] 4

[23] JAIKUMAR RADHAKRISHNAN: Entropy and counting. In Computational Mathematics, Modelling and Algorithms, pp. 146–168. Narosa Publishers, 2003. Available from author's home page. 13, 14

[24] ANUP RAO AND MAKRAND SINHA: Simplified separation of information and communication. Electron. Colloq. on Comput. Complexity (ECCC), 2015. [ECCC:TR15-057]

[25] ALEXANDER A. RAZBOROV: On the distributional complexity of disjointness. Theoret. Comput. Sci., 106(2):385–390, 1992. Conference version in ICALP'90. [doi:10.1016/0304-3975(92)90260-M] 2

[26] CLAUDE E. SHANNON: A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. [doi:10.1002/j.1538-7305.1948.tb01338.x] 2

[27] ALEXANDER A. SHERSTOV: Compressing interactive communication under product distributions. SIAM J. Comput., 47(2):367–419, 2018. Conference version in FOCS'16. [doi:10.1137/16M109380X] 2

[28] EMANUELE VIOLA: The communication complexity of addition. Combinatorica, 35(6):703–747, 2015. Conference version in SODA'13. [doi:10.1007/s00493-014-3078-3] 4


AUTHORS

      Anup Rao
      Associate professor
      Paul G. Allen School of Computer Science & Engineering
      University of Washington
      Seattle, WA
anuprao@cs.washington.edu
      homes.cs.washington.edu/~anuprao/





    Makrand Sinha
Ph.D. student
    Paul G. Allen School of Computer Science & Engineering
    University of Washington
    Seattle, WA
makrand@cs.washington.edu
    homes.cs.washington.edu/~makrand/


ABOUT THE AUTHORS

    ANUP RAO is an associate professor of computer science at the Allen School of Computer
       Science & Engineering, University of Washington. He is interested in mathematics and
       theoretical computer science. When he is not exercising his brain on these topics, he
       likes to exercise his body by rock climbing, skiing, going on alpine adventures, dancing,
       and playing pickup basketball with undergraduates who become reluctant to foul him once
       they figure out that he is a professor. He loves to travel, and loves art of all kinds. Anup
       feels fortunate to have been advised by David Zuckerman.


    MAKRAND SINHA is a graduate student at the Allen School of Computer Science & En-
       gineering, University of Washington, where he is advised by Anup Rao. He is truly
       interested in all facets of the human experience, but currently his efforts are focused
       on understanding communication complexity and linear programs. When not scrib-
       bling mathematics on a piece of paper, he can be found enjoying fine coffee, literature,
       theater, and a multitude of physical activities.



