Potential-Function Proofs for Gradient Methods

Authors Nikhil Bansal, Anupam Gupta,
Plaintext
                           T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32
                                        www.theoryofcomputing.org

                                         RESEARCH SURVEY



                           Potential-Function Proofs
                             for Gradient Methods
                                 Nikhil Bansal∗                     Anupam Gupta†
                Received March 3, 2019; Revised August 6, 2019; Published September 12, 2019




       Abstract: This note discusses proofs of convergence for gradient methods (also called
       “first-order methods”) based on simple potential-function arguments. We cover methods
       like gradient descent (for both smooth and non-smooth settings), mirror descent, and some
       accelerated variants. We hope the structure and presentation of these amortized-analysis
       proofs will be useful as a guiding principle in learning and using these proofs.


1     Introduction
The “gradient descent” framework is a class of iterative methods for solving convex minimization
problems—indeed, since the gradient gives the direction of steepest increase in function value, a natural
approach to minimize the convex function is to move in the direction opposite to the gradient. Variants of
this general versatile approach have been central to convex optimization for many years. In recent years,
with the increased use of continuous methods in discrete optimization, this technique has also become
central for algorithm design in general.
    ∗ Supported in part by NWO Vici grant 639.023.812 and an ERC consolidator grant 617951.
    † Supported in part by NSF awards CCF-1536002, CCF-1540541, and CCF-1617790.



ACM Classification: G.1.6
AMS Classification: 68Q25
Key words and phrases: convex optimization, potential functions, amortized analysis


© 2019 Nikhil Bansal and Anupam Gupta
c b Licensed under a Creative Commons Attribution License (CC-BY)                             DOI: 10.4086/toc.2019.v015a004
                                   N IKHIL BANSAL AND A NUPAM G UPTA

    In this note we give convergence arguments for many commonly studied versions of gradient methods
(also referred to as “first-order methods”) using simple potential-function arguments. We find that
presenting the proofs in the amortized-analysis framework is useful as a guiding principle, since it imparts
a clear structure and direction to proofs. We hope others will also find this perspective useful, both in
learning and teaching these techniques and proofs, and also in extending them to other domains.
    A disclaimer: previously existing proofs for gradient methods are usually not difficult, and their
individual components are not substantially different from the ones in this note. However, using an
explicit potential to guide our proofs makes them arguably more intuitive. In fact, the intuition of viewing
these gradient methods as trying to control a potential function is also known to the specialists; e. g., see
the text of Nemirovski and Yudin [18, pp. 85–88] for a continuous perspective via Lyapunov functions.
This is more explicit in recent papers [24, 29, 17, 30, 8] relating continuous and discrete updates to
understand the acceleration phenomenon, e. g., Krichene et al. [17] give the potential function we use in
§5.2. However, these potential-function proofs and intuitions have not yet permeated into the commonly
presented expositions. The current note is an attempt to make such ideas more widely known.

Basic definitions. Recall that a set K ⊆ Rd is convex if for all x, y ∈ K, the convex combination
λ x + (1 − λ )y ∈ K for all λ ∈ [0, 1]. A function f : Rd → R is convex over a convex set K if
                   f (λ x + (1 − λ )y) ≤ λ f (x) + (1 − λ ) f (y)    ∀x, y ∈ K, ∀λ ∈ [0, 1] .
This is called the zeroth-order definition. There are other equivalent notions: if the function is differen-
tiable, the first-order definition is that f is convex over K if
                                f (y) ≥ f (x) + h∇ f (x), y − xi    ∀x, y ∈ K .                           (1.1)
(The second-order definition says that a twice-differentiable f is convex if its Hessian matrix ∇2 f is
positive-semidefinite.) For this note, we assume our convex sets K are closed, and the convex functions f
are differentiable. However, the proofs extend to non-differentiable functions in the natural way, using
subgradients. (See, e. g., [13] for more definitions and background on convexity and subgradients.)

The problems. Given a convex function f : Rd → R, and an error parameter ε, the (unconstrained)
convex minimization problem is to find a point xb such that f (b x) − minx∈Rd f (x) ≤ ε. In the constrained
version of the problem, we are also given a convex set K, and the goal is to find a point xb ∈ K which has
         x) − minx∈K f (x) ≤ ε. In either case, let x∗ denote the minimizer for f (·). We will be interested
error f (b
in bounding the number of gradient queries required to converge to the approximate minimizer xb, as a
function of the distance between x0 and x∗ and some parameters of the function f , depending on the
particular variant of gradient descent.
    In online convex optimization over a convex set K, at each timestep t = 1, 2, . . ., the algorithm outputs
a point xt ∈ K and an adversary produces a convex function ft . The algorithm’s loss at timestep t is
                                                             T                      T
defined to be ft (xt ). Now the regret of the algorithm is ∑t=1  ft (xt ) − minx∈K ∑t=1  ft (x), and the goal is
to determine the points xt online (without the knowledge of the current or future functions { fs }s≥t ) to
minimize the regret. Note that this generalizes the convex optimization setting above, which corresponds
to setting ft = f at each time t, and that any algorithm with sublinear regret o(T ) can be used to find an
approximate optimum xb up to any desired accuracy ε.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                 2
                         P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Assumptions. We assume that our convex functions are closed, convex, and differentiable, and that
the convex sets K are also closed with non-empty interior. We assume access to a gradient oracle for
each of the functions ft we consider, i. e., given any point x, we can get the gradient ∇ ft (x) of function
 ft at point x. We only work with the Euclidean norm k·k2 for the first few sections; general norms are
discussed from §4 onwards.


References. In this survey, we focus only on the exposition of the proofs. We omit most citations,
and also discussion of the “bigger picture.” There are many excellent sources for other proofs of these
results, with comprehensive bibliographies; e. g., see the classic text by Shor [23], the authoritative
notes by Nesterov [20], and Ben-Tal and Nemirovski [3], the monographs of Bubeck [6] and Shalev-
Shwartz [22], the textbooks by Cesa-Bianchi and Lugosi [7], Hazan [12], and lecture notes by Duchi [10]
and Vishnoi [28].
     There are several other perspectives on these proofs that the reader may find useful. One useful
perspective is that of viewing these methods as discretizations of suitable continuous dynamics; this
appears even in the classic work of Nemirovski and Yudin [18], and has been widely used recently (see,
e. g., [24, 29, 17, 30]). Another useful perspective is exhibit a “dual” lower bound on the optimal value
via the convex conjugate, and use the duality gap to bound the error (see, e. g., [8, 21]). We refer the
interested reader to the respective papers for more details.
     Finally, we discuss some concurrent and related work. Independently of our work, Karimi and
Vavasis [15] give potential-based convergence proofs for conjugate gradient and accelerated methods;
their potentials are similar to ours. And following up on a preprint of our results, Taylor and Bach [26]
analyze stochastic gradient methods using potential functions.


1.1   Results and organization
All of the proofs use the same general potential: for some fixed point x∗ (which can be thought of as the
optimal or reference point) we have

                          Φt = at · ( f (xt ) − f (x∗ )) + bt · (distance from xt to x∗ ) .                 (1.2)

Here at , bt are non-negative, and naturally, different proofs use slightly different choice of these multipliers,
and even the distance functions may vary. However, the general approach remains the same: we show
that Φt+1 − Φt ≤ Bt (where Bt is often zero). Since the potential and distance terms remain non-negative,
the telescoping sum gives

                                     T −1
                                                                                 Φ0 + ∑t Bt
                        ΦT ≤ Φ0 + ∑ Bt           =⇒        f (xT ) − f (x∗ ) ≤              .
                                      t=0                                           aT

    We begin in §2 with proofs of the basic (projected) gradient descent, for general and strongly convex
functions; these even work in the online regret-minimization setting where the function may change
at each timestep. Here the analysis is more along the lines of amortized analysis: we show that the
amortized cost, namely the cost of the algorithm plus the increase in potential is at most the optimal cost

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                   3
                                    N IKHIL BANSAL AND A NUPAM G UPTA

(plus B). I. e., ft (xt ) + (Φt+1 − Φt ) ≤ ft (x∗ ) + B. This telescopes to imply that the average regret is
                                                                     !
                                       1 T                                  Φ0
                                           ∑   ( ft (xt ) − ft (x∗ )) ≤ B +    .
                                      T t=1                                 T
The potential here is very simple: we set at = 0 and just use the distance of the current point xt to the
optimal point x∗ (according to the “right” distance). For example, for basic gradient descent, the potential
is just a scaled version of kxt − x∗ k2 .
     Next, we give proofs of convergence for the case of smooth convex functions in §3. In the simplest
case we just set bt = 0 and use at = t in (1.2) to prove Bt ≈ 1/t. This gives an error of ≈ (log T )/T , which
is in the right ballpark. (This can be optimized using better settings of the multipliers.) The proofs for
projected smooth gradient descent, gradient descent for well-conditioned functions, and the Frank–Wolfe
method, all follow this template. For these proofs, we now use the “value-based” terms in (1.2), i. e., the
terms that depend on f (xt ) − f (x∗ ).
     We then extend our understanding to mirror descent. This is a substantial generalization of gradient
descent to general norms. While the language necessarily becomes more technical (relying on dual
norms and Bregman divergences), the ideas remain clean. Indeed, the structure of the potential-based
proofs from §2 remains essentially the same as for basic gradient descent; the potential is now based on a
Bregman divergence, a natural generalization of the squared distance. These proofs appear in §4.
     An orthogonal extension is to potential-based proofs of Nesterov’s accelerated gradient descent
method for smooth and well-conditioned convex functions. The ideas in this section build on the simple
calculations we would have seen in §2 and §3. We can now use the full power of both distance-based and
value-based terms in the potential function (1.2), trading them off against each other. Moreover, in §5.1
we show how the basic analysis for smooth convex functions from §3 directly suggests how to obtain the
accelerated algorithm by coupling together one cautious and one aggressive gradient descent step.

Organization. The paper follows the above outline. We start with proofs for general convex functions
in §2 using simple distance-based potentials, then proceed to smooth and well-conditioned convex
functions in §3 using more sophisticated potentials. We then discuss the generalization to mirror descent
via Bregman divergences in §4. Finally, we give proofs for accelerated versions in §5. Note that §4 and
§5 are independent, and may be read in either order.


2     Online analyses
2.1   Basic gradient descent
The basic analysis works even for the online convex optimization case: at each step we are given a
function ft , we play xt , and want to minimize the regret. In this case the update rule is:
                                            xt+1 ← xt − ηt ∇ ft (xt ) .                                        (2.1)
An equivalent form for this update, that is easily verified by taking derivatives with respect to x, is:
                                             n1                            o
                            xt+1 ← arg min kx − xt k2 + ηt hx, ∇ ft (xt )i .                             (2.2)
                                          x    2

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                     4
                         P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Intuitively, we want to move in the direction of the negative gradient, but do not want to move too far.
Theorem 2.1 (Basic gradient descent). Let f1 , . . . , fT : Rn → R be G-Lipschitz functions, i. e. k∇ ft (x)k ≤
G for all x,t. Then starting at point x0 ∈ Rn and using updates (2.1) with step size
                                                                D
                                                  ηt = η =      √
                                                               G T
for T steps guarantees an average regret of
                               1 T −1                         G2    D2  DG
                                      ft (xt ) − ft (x∗ ) ≤ η
                                                         
                                 ∑                               +     ≤√ ,
                               T t=0                          2    2ηT   T
for all x∗ with kx0 − x∗ k ≤ D.
Proof. Consider the potential function
                                                         1
                                                Φt =       kxt − x∗ k2                                         (2.3)
                                                        2η
which is positive for all t. We show that, for some upper bound B,
                                         ft (xt ) − ft (x∗ ) + Φt+1 − Φt ≤ B .                                 (2.4)
Summing over all times t, the average regret is
                    1 T −1                            1                 Φ0       D2
                      ∑   ( ft (xt ) − ft (x∗ )) ≤ B + (Φ0 − ΦT ) ≤ B +    = B+     .                          (2.5)
                    T t=0                             T                 T       2ηT
Now we can compute B, and then balance the two terms. While the potential uses differences of the
form xt − x∗ , the key is to express as much as possible in terms of xt+1 − xt , because the update rule (2.1)
implies
                                          xt+1 − xt = −η∇ ft (xt ) .                                     (2.6)

The change in potential. Using that ka + bk2 − kak2 = 2ha, bi + kbk2 for the Euclidean norm,
                   1                                                             1
                     (kxt+1 − x∗ k2 − kxt − x∗ k2 ) = hxt+1 − xt , xt − x∗ i + kxt+1 − xt k2
                   2                                                             2
                                                                                  η2
                                                    = ηt h∇ ft (xt ), x∗ − xt i + t k∇ ft (xt )k2 .            (2.7)
                                                                                   2

The amortized cost. Setting ηt = η for all steps,
              ft (xt ) − ft (x∗ ) + Φt+1 − Φt
                                                                       η
                    = ft (xt ) − ft (x∗ ) + h∇ ft (xt ), x∗ − xt i +     k∇ ft (xt )k2                    (by (2.7))
                                                                       2
                            η                 ηG2
                    ≤ 0+      k∇ ft (xt )k2 ≤                            (by convexity, and the bound on gradients.)
                            2                  2
                                                         √
Substituting for B in (2.5) and simplifying with η = D/(G T ), we get the theorem.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                     5
                                     N IKHIL BANSAL AND A NUPAM G UPTA


                                                  −ηt∇t         x0t+1

                                                           xt+1 = ΠK (x0t+1)
                                            xt
                                                           x∗
                                                  K



                                     Figure 1: Projected Gradient Descent.


     The regret bound implies a convergence result for the offline case, i. e., for the case where ft = f for
                                   T −1
all t. Here, setting xb := (1/T ) ∑t=0  xt shows
                                              
                           ∗             1                 1                         DG
                   x) − f (x ) = f
                f (b                       ∑ xt − f (x∗ ) ≤ ∑ ( f (xt ) − f (x∗ )) ≤ √ ≤ ε,
                                         T t               T t                        T

as long as T ≥ (DG/ε)2 and η = ε/G2 .
                                                                                         √
    If the time horizon T is unknown, setting a time-dependent step size of ηt = D/(G t) works, with
an identical proof. It is also well-known that the convergence bound above is the best possible in general,
modulo constant factors (see, e. g., [20, Thm 3.2.1] or [6, Thm 3.13]).

2.1.1   Projected gradient descent
If we want to solve the constrained minimization problem for a convex body K, we update as follows:

                                                  0
                                                 xt+1 ← xt − η ∇ ft (xt ),                              (2.8)
                                                               0
                                                   xt+1 ← ΠK (xt+1 ).                                   (2.9)

where ΠK (x0 ) := arg minx∈K kx − x0 k is the projection of x0 onto the convex body K. See Figure 1.

Proposition 2.2 (Pythagorean property). Given a convex body K ⊆ Rn , let a ∈ K and b0 ∈ Rn . Let
b = ΠK (b0 ). Then ha − b, b0 − bi ≤ 0. Hence ka − bk2 ≤ ka − b0 k2 .

Proof. For the first part, the separating hyperplane at b has b0 on one side, and all of K (and hence a)
on the other side. Hence the angle between b0 − b and a − b must be obtuse, giving the negative inner
product. For the second part, ka − b0 k2 = ka − bk2 + kb − b0 k2 + 2ha − b, b − b0 i. But the latter two terms
are positive, which proves the lemma.

    Using this, we get that for any point x∗ ∈ K,

                                           kxt+1 − x∗ k2 ≤ kxt+1
                                                             0
                                                                 − x∗ k2 .

                        T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                6
                          P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Using the same potential function (2.3), this inequality implies:
                                                                         1
              ft (xt ) − ft (x∗ ) + Φt+1 − Φt ≤ ft (xt ) − ft (x∗ ) +         0
                                                                           (kxt+1 − x∗ k2 − kxt − x∗ k2 ) ,
                                                                        2η
or in other words, the projection only helps and we can follow the analysis from §2.1 starting at (2.7) to
bound the amortized cost by ηG2 /2. So this gives a regret bound identical to that of Theorem 2.1.

2.2   Strong convexity analysis
Let us a prove a better regret (and convergence) bound when the functions are “not too flat.” A function f
over a convex set K is α-strongly convex, where α ≥ 0, if for all u, v ∈ K
                                                                             α
                      f (λ u + (1 − λ )v) ≤ λ f (u) + (1 − λ ) f (v) −         λ (1 − λ )kv − uk2                (2.10)
                                                                             2
for all λ ∈ [0, 1]. For the case of differentiable functions, this is equivalent to saying that for all x, y ∈ K,
                                                                          α
                                   f (y) ≥ f (x) + h∇ f (x), y − xi +       ky − xk2 .                           (2.11)
                                                                          2
For α-strongly convex functions ft , we use the same update step but vary the step size ηt . Specifically,

                                           xt+1 ← ΠK (xt − ηt ∇ ft (xt )) ,                                      (2.12)

where ηt = 1/(α(t + 1)).
Theorem 2.3 (Strongly convex functions). If the functions ft are α-strongly convex and G is an upper
bound on k∇ ft (x)k for all x ∈ K, the update rule (2.12) with ηt = 1/(α(t + 1)) guarantees an average
regret of
                                    1 T −1                    ∗
                                                                  G2 log T
                                      ∑    f t (xt ) − f t (x   ) ≤         .
                                   T t=0                            2T α
Proof. The potential function is now

                                             1                tα
                                    Φt =         kxt − x∗ k2 = kxt − x∗ k2 .                                     (2.13)
                                           2ηt−1               2

The change in potential. Let xt+1 0  := xt − ηt ∇ ft (xt ) denote the intermediate point after the gradient
step, but before performing the projection.
                             α(t + 1)                    αt
               Φt+1 − Φt =            kxt+1 − x∗ k2 − kxt − x∗ k2
                                2                         2
                             α        ∗ 2    1                                 
                            = kxt − x k +          kxt+1 − x∗ k2 − kxt − x∗ k2
                             2              2ηt
                             α        ∗ 2    1  0             ∗ 2          ∗ 2
                                                                                
                            ≤ kxt − x k +          kxt+1 − x k − kxt − x k                               (by Prop. 2.2)
                             2              2ηt
                             α                                      ηt
                            = kxt − x∗ k2 + h∇ ft (xt ), x∗ − xt i + k∇ ft (xt )k2 .                          (by (2.7))
                             2                                      2

                          T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                        7
                                    N IKHIL BANSAL AND A NUPAM G UPTA

The amortized cost.

                ft (xt ) − ft (x∗ ) + Φt+1 − Φt
                                                α                                      ηt
                         ≤ ft (xt ) − ft (x∗ ) + kxt − x∗ k2 + h∇ ft (xt ), x∗ − xt i + k∇ ft (xt )k2
                           |                    2    {z                            } 2
                                            ≤ 0 by α-strong convexity

                          ηt                 ηt G2
                      ≤      k∇ ft (xt )k2 ≤             (by bound on gradients).                       (2.14)
                          2                    2
Now summing over all time steps t, the total regret is

                                                                   ηt           G2 log T
                           ∑ ( ft (xt ) − ft (x∗ )) ≤ Φ0 + ∑ 2 G2 ≤ 0 +           2α
                                                                                         .
                            t                                  t

Hence total√regret only increases logarithmically as log T with time if the ft are strongly convex, as
opposed to T in Theorem 2.1.

    This bound of O(log T ) on the average regret is tight: Takimoto and Warmuth [25] show a matching
lower bound. However, in the offline optimization setting where we have a fixed function ft = f , using
the same analysis but a better averaging shows a convergence rate of O(1/T ) with respect to a convex
combination of the points xt .

Theorem 2.4 (Strongly convex functions: Take II). Let f be α-strongly convex with gradients satisfying
k∇ f (x)k ≤ G for all x, and xt be the iterates produced by applying the update rule (2.1) with ηt = 1/(αt).
                            T
For any T ≥ 1, let xT := ∑t=1   λt xt denote the convex combination of xt with

                                                             2t
                                                  λt =             .
                                                         T (T + 1)

Then,
                                                                     G2
                                            f (xT ) − f (x∗ ) ≤             .
                                                                   α(T + 1)
Proof. Instead of summing up (2.14) directly over t in the regret analysis above, we first multiply (2.14)
by t, and then sum over t to obtain
                                        T
                                                                        1
                                       ∑ t( ft (xt ) − ft (x∗ )) ≤ 2α T G2 .
                                       t=1

Using ft = f and dividing by T (T + 1)/2 throughout, and by the convexity of f we obtain

                                                                     G2
                                            f (xT ) − f (x∗ ) ≤             .
                                                                   α(T + 1)

   Finally, we remark that in the constrained case the same analysis with the same potential function
works, exactly for same reason as in 2.1.1.

                        T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                8
                         P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

3     Bounds for smooth functions
We now turn to the setting where the functions are Lipschitz smooth, i. e., when √    the gradient does not
change too rapidly. We know that in the online case, the average regret of O(1/ T ) is tight even for
linear functions [7]. However we get better guarantees for the offline setting where the function ft = f for
all time steps. The potential functions now look more like (1.2), and use the difference ( f (xt ) − f (x∗ )) in
function value, not just in action space.
     Define a function f to be β -Lipschitz smooth (or simply β -smooth) if for all u, v

                                                                           β
                     f (λ u + (1 − λ )v) ≥ λ f (u) + (1 − λ ) f (v) −        λ (1 − λ )kv − uk2            (3.1)
                                                                           2

for all λ ∈ [0, 1]. For the case of differentiable functions, this is equivalent to saying that for x, y, we have

                                                                      β
                                 f (y) ≤ f (x) + h∇ f (x), y − xi +     ky − xk2 .                         (3.2)
                                                                      2

Observe the inequalities here are in the opposite directions from the definitions of convexity (1.1) and
strong-convexity (2.11). Indeed, smoothness implies that the function does not “grow too fast” anywhere.
The smoothness condition is equivalent to requiring that the gradients are Lipschitz continuous, i. e.,
k∇ f (x) − ∇ f (y)k2 ≤ β kx − yk2 for all x, y.


3.1   Smooth gradient descent
The update rule in this case has a time-invariant multiplier (where we use ∇t := ∇ f (xt ) for brevity).

                                                             1
                                              xt+1 ← xt −      ∇t .                                        (3.3)
                                                             β

    We first show an analysis based on a very natural potential, that gives a slightly sub-optimal bound
with an additional log T factor. We improve this later by slightly modifying the potential.

Theorem 3.1 (Smooth functions). If f is β -smooth and D := maxx {kx − x∗ k2 | f (x) ≤ f (x0 )}, the update
rule (3.3) guarantees
                                                       D2 (1 + ln T )
                                 f (xT ) − f (x∗ ) ≤ β                .
                                                            2T
Proof. To show a convergence rate of O(1/t), perhaps the most natural approach is to consider the
potential
                                           Φt = t · ( f (xt ) − f (x∗ ))

and try to show that ΦT = O(1). This works, but gives a weaker bound of ΦT = O(log T ) (note that
conveniently, Φ0 = 0) and hence f (xT ) − f (x∗ ) = O(log T )/T . Later we get rid of the logarithmic term.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                  9
                                    N IKHIL BANSAL AND A NUPAM G UPTA

The potential change.

                          Φt+1 − Φt = (t + 1)( f (xt+1 ) − f (x∗ )) − t( f (xt ) − f (x∗ ))
                                      = (t + 1)( f (xt+1 ) − f (xt )) + ( f (xt ) − f (x∗ )) .         (3.4)

To bound the first term, we use the smoothness of f with x = xt and y = xt+1 = xt − ηt ∇t :

                                f (xt+1 ) ≤ f (xt ) − ηt · k∇t k22 + β2 · ηt2 · k∇t k22 .

The choice of ηt = 1/β minimizes the right hand side above to give
                                                                 1
                                          f (xt+1 ) ≤ f (xt ) − 2β k∇t k22 .                           (3.5)

    For the second term in (3.4), just use convexity and Cauchy-Schwarz:

                     f (xt ) − f (x∗ ) ≤ h∇t , xt − x∗ i ≤ k∇t k2 · kxt − x∗ k2                        (3.6)
                                                                         2                  ∗ 2
                                                        ≤ 1/2 (ak∇t k + (1/a)kxt − x k ) ,

for any parameter a > 0. Note that (3.5) ensures that f (xt ) ≤ f (xt−1 ) ≤ · · · ≤ f (x0 ), so let us define
D := max{kx − x∗ k2 | f (x) ≤ f (x0 )}. So the potential change is
                                                    1
                          Φt+1 − Φt ≤ (t + 1) · (− 2β )k∇t k22 + 21 (ak∇t k22 + D2 /a).                (3.7)

Choosing a = (t + 1)/β cancels the gradient terms. Hence the potential increase is at most D2 β /(2(t + 1)),
and
                               ΦT   1 T −1               1 T −1 D2              D2 (1 + ln T )
           f (xT ) − f (x∗ ) =    = ∑ (Φt+1 − Φt ) ≤ ∑                   β ≤β                  .
                               T   T t=0                T t=0 2(t + 1)                2T

    The intuition is evident from (3.5) and (3.6): we improve a lot by (3.5) when the gradients are large,
or else we are close to the optimum by (3.6).

A tighter analysis. The logarithmic dependence in Theorem 3.1 can be removed by a simple trick of
multiplying the potential by a linear term in t, which avoids the sum over 1/t.

Theorem 3.2 (Smooth functions: Take II). If f is β -smooth, and D := maxx {kx − x∗ k2 | f (x) ≤ f (x0 )},
the update rule (3.3) guarantees
                                                          2D2
                                   f (xT ) − f (x∗ ) ≤ β      .
                                                         T +1
Proof. Consider the following potential

                                        Φt = t(t + 1) · ( f (xt ) − f (x∗ )) .

The potential change is

            Φt+1 − Φt = (t + 1)(t + 2) · ( f (xt+1 ) − f (xt )) + 2(t + 1) · ( f (xt ) − f (x∗ )) .


                        T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                              10
                        P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Plugging in (3.5) and (3.6) gives
                                                                                       t +1
                                              1
                        ≤ (t + 1)(t + 2) · − 2β k∇t k2 + 2(t + 1) · k∇t k2 · D ≤ 2D2 β ·      ,
                                                                                         t +2
where the last inequality is the maximum value of the preceding expression obtained at k∇t k =
2β D/(t + 2). Summing over the time steps, ΦT ≤ T · 2D2 β , so
                                                          2D2 β · T     2D2
                                    f (xT ) − f (x∗ ) ≤             =β      .
                                                          T (T + 1)    T +1

3.1.1   Yet another proof
Let’s see yet another proof that gets rid of the logarithmic term. Interestingly, the potential function now
combines both the difference in the function value, and the distance in the “action” space.
Theorem 3.3 (Smooth functions: Take III). If f is β -smooth, the update rule (3.3) guarantees
                                                                kx0 − x∗ k2
                                        f (xT ) − f (x∗ ) ≤ β               .
                                                                    2T
Proof. Consider the potential of the form
                                                                 a
                                    Φt = t ( f (xt ) − f (x∗ )) + kxt − x∗ k2
                                                                 2
where a will be chosen based on the analysis below. As Φ0 = (a/2)kx0 − x∗ k2 , if we show that Φt is
non-increasing,
                       a                                                    a
                         kx0 − x∗ k2 = Φ0 ≥ ΦT = T ( f (xt ) − f (x∗ )) + kxt − x∗ k2
                       2                                                    2
which gives
                                                              a
                                       f (xt ) − f (x∗ ) ≤      kx0 − x∗ k2
                                                            2T
as desired.
    The potential difference can be written as:
                                                                      a
      Φt+1 − Φt = (t + 1) ( f (xt+1 ) − f (xt )) + f (xt ) − f (x∗ ) + (kxt+1 − x∗ k2 − kxt − x∗ k2 ) . (3.8)
                           |        {z         } |         {z     } 2|              {z             }
                                    (3.5)             (convexity)               (2.7)

Using the bounds from the mentioned inequalities,
                                                       a
                         z }| { z }| {                    z            }|             {
                                                   ∗
             ≤ (t + 1) · − 2β k∇t k2 + h∇t , xt − x i + 2 ηt h∇t , x − xt i + ηt k∇t k22
                            1      2                                ∗           2               (3.9)
                                                       2
where ηt = 1/β in this case. Now, we set a = 1/ηt = β to cancel the inner-product terms, which gives
                                                       
                                                        t
                                    Φt+1 − Φt ≤ −           k∇t k2 ≤ 0 .
                                                       2β
    This guarantee is almost the same as in Theorem 3.2, with a slightly better definition of the distance
term (kx0 − x∗ k2 vs. D2 ). But we will revisit and build on this proof when we talk about Nesterov
acceleration in §5.

                       T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                               11
                                       N IKHIL BANSAL AND A NUPAM G UPTA

3.1.2     Projected smooth gradient descent
We now consider the constrained minimization problem for a convex body K. As previously, the update
involves taking a step and then projecting back onto K:
                                                0
                                               xt+1 ← xt − (1/β ) ∇ f (xt ) ,
                                                               0
                                                   xt+1 ← ΠK (xt+1 ).                                            (3.10)
Theorem 3.4 (Constrained smooth optimization). If f is β -smooth, the update rule (3.10) guarantees
                                                                β
                                                                    kx0 − x∗ k2
                                           f (xT ) − f (x∗ ) ≤ 2                .
                                                                       T
Proof. We use the same potential as in Theorem 3.3:
                                                                      β
                                       Φt = t( f (xt ) − f (x∗ )) +     kxt − x∗ k2 .
                                                                      2
The potential difference is now written as:
  Φt+1 − Φt = t( f (xt+1 ) − f (xt )) + f (xt+1 ) − f (x∗ ) + β2 (kxt+1 − x∗ k2 − kxt − x∗ k2 )
                                                                 |            {z             }
                                                                       kak2 −ka+bk2 =−2ha,bi−kbk2

                ≤ t ( f (xt+1 ) − f (xt )) + f (xt+1 ) − f (x∗ ) − β2 2hxt − xt+1 , xt+1 − x∗ i + kxt − xt+1 k2 . (3.11)
                                                                                                               
                    |         {z         } |         {z       }
                            (?)                    (??)

As xt+1 is the projected point, we cannot directly use (3.5) to bound the first and second terms, but we
can show the following claim (which we prove later) that follows from smoothness:
Claim 3.5. For any y ∈ K, f (xt+1 ) − f (y) ≤ β hxt − xt+1 , xt − yi − (β /2)kxt − xt+1 k2 .
    Using Claim 3.5 with y = xt and y = x∗ to bound the first and second terms of (3.11), we get
              (?)                       (??)
             z}|{ z                    }|                   {
        ≤ t · 0 + β hxt − xt+1 , xt − x∗ i − β/2kxt − xt+1 k2 −β/2 2hxt − xt+1 , xt+1 − x∗ i + kxt − xt+1 k2
                                                                                                             

        = β hxt − xt+1 , xt − xt+1 i − β kxt − xt+1 k2 = 0 .
This completes the proof.
Proof of Claim 3.5. We write f (xt+1 ) − f (y) = ( f (xt+1 ) − f (xt )) + ( f (xt ) − f (y)). Now using smooth-
ness and convexity for the first and second terms respectively, we have
                                                                           
                   f (xt+1 ) − f (y) ≤ h∇t , xt+1 − xt i + β/2kxt+1 − xt k2 + h∇t , xt − yi
                                        = h∇t , xt+1 − yi + β/2kxt+1 − xt k2 .                                   (3.12)
                 0 ,x
Since β hxt+1 − xt+1  t+1 − yi ≤ 0 by the Pythagorean property Prop. 2.2,
                                                 0
                     h∇t , xt+1 − yi = hβ (xt − xt+1  ), xt+1 − yi ≤ β hxt − xt+1 , xt+1 − yi
                                     = β hxt − xt+1 , xt − yi − β kxt+1 − xt k2 .
Substituting into (3.12) gives the result.

                           T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                      12
                        P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

3.1.3   The Frank–Wolfe method
One drawback of projected gradient descent is the projection step: given a point x0 and a body K, finding
the closest point ΠK (x0 ) might be computationally expensive. Instead, we can use a different rule, the
Frank–Wolfe method (also called conditional gradient descent) [11], that implements each gradient step
using linear optimization over the body K. Loosely, at each timestep we find the point in K that is furthest
from the current point in the direction of the negative gradient, and move a small distance towards it.
                                                 yt



                                                       xt+1
                                                                xt
                                                      ∇f (xt)



                                     Figure 2: The Frank–Wolfe Update.

    Formally, the update rule for Frank–Wolfe method is simple:

                                              yt ← arg min h∇t , yi,
                                                         y∈K

                                           xt+1 ← (1 − ηt )xt + ηt yt .                                (3.13)

Setting ηt = 1/(t + 1) in hindsight will give the following result.

Theorem 3.6 (Smooth functions: Frank–Wolfe). If f is β -smooth, K is a convex body with diameter
D := maxx,y∈K kx − yk, then the update rule (3.13) guarantees

                                                                D2 (1 + ln T )
                                      f (xT ) − f (x∗ ) ≤ β                    .
                                                                     2T
Proof. We use the simplest potential function from Theorem 3.1:

                                            Φt = t · ( f (xt ) − f (x∗ )) ,

and hence the change in potential is again:

                         Φt+1 − Φt = (t + 1)( f (xt+1 ) − f (xt )) + ( f (xt ) − f (x∗ )) .            (3.14)

To bound the change in potential (3.14), we observe that xt+1 − xt = ηt (yt − xt ).

                                                          β
                 f (xt+1 ) − f (xt ) ≤ h∇t , xt+1 − xt i +  kxt+1 − xt k2                     (by smoothness)
                                                          2
                                                          β ηt2
                                   = ηt h∇t , yt − xt i +       kyt − xt k2
                                                            2

                       T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                               13
                                         N IKHIL BANSAL AND A NUPAM G UPTA

                                                                 β ηt2
                                        ≤ ηt h∇t , x∗ − xt i +         kyt − xt k2                        (by optimality of yt )
                                                                   2
                       f (xt ) − f (x∗ ) ≤ h∇t , xt − x∗ i .                                                    (by convexity)

Setting ηt := 1/(t + 1) cancels the linear terms and hence the potential change (3.14) is at most β ηt D2 /2.
Summing over t and using Φ0 = 0, the final potential ΦT ≤ D2 (1 + ln T ), and hence

                                                                   (1 + ln T ) · D2
                                           f (xT ) − f (x∗ ) = β                    .
                                                                         2T

   We can remove the logarithmic dependence in the error by multiplying the potential by (t + 1) as in
Theorem 3.2; this gives the following theorem, whose simple proof we omit.

Theorem 3.7 (Smooth functions: Frank–Wolfe, take II). If f is β -smooth, K is a convex body with
D := maxx,y∈K kx − yk, then the update rule (3.13) with ηt = 2/(t + 1) guarantees

                                                                          D2
                                                f (xT ) − f (x∗ ) ≤ 2β        .
                                                                         T +1

3.2    Well-conditioned functions
If a function is both α-strongly convex and β -smooth, it must be that α ≤ β . The ratio κ := β /α is called
the condition number of the convex function. We now show a much stronger convergence guarantee
for “well-conditioned” functions, i. e., functions with small κ values. The update rule is the same as for
smooth functions:

                                                                     1
                                                     xt+1 ← xt −       ∇t .                                                (3.15)
                                                                     β

Theorem 3.8 (GD: Well-conditioned). Given a function f that is both α-strongly convex and β -smooth,
define κ := β /α. The update rule (3.15) ensures

                           f (xT ) − f (x∗ ) ≤ exp(−T /κ) · ( f (x0 ) − f (x∗ ))           for all x∗ .

Proof. We set γ = 1/(κ − 1) for brevity,1 and use the potential

                                             Φt = (1 + γ)t · ( f (xt ) − f (x∗ )) .                                        (3.16)

This is a natural potential to use, as we wish to show that f (xT ) − f (x0 ) falls exponentially with T .

   1 Note that κ = 1 iff f (x) = ax| x + b| x + c for suitable scalars a, c and b ∈ Rn ; in this case it is easily checked that the

optimum solution x∗ is reached in a single step.


                            T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                                14
                         P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

The potential change. A little rearrangement gives us
                                                                                     
                                  t                                                ∗
                                                                                    
               Φt+1 − Φt = (1 + γ) · (1 + γ) f (xt+1 ) − f (xt ) + γ f (xt ) − f (x ) .                  (3.17)

We bound the two terms separately. Using the smoothness analysis from (3.5):

                                                                     1
                                         f (xt+1 ) − f (xt ) ≤ −       k∇t k2 .
                                                                    2β

And by the definition of strong convexity,

                                                                  α                1
                          f (xt ) − f (x∗ ) ≤ h∇t , xt − x∗ i −     kxt − x∗ k2 ≤    k∇t k2
                                                                  2               2α
where the second inequality uses ha, bi − kbk2 /2 ≤ kak2 /2. Plugging this back into (3.17) gives
                                                           
                                                1+γ      γ
                                           t
                                   (1 + γ) −         +        k∇t k2
                                                 2β     2α

which is 0 by our choice of γ. Hence, after T steps,

                 f (xT ) − f (x∗ ) ≤ (1 + γ)−T ( f (x0 ) − f (x∗ )) = (1 − 1/κ )T ( f (x0 ) − f (x∗ ))
                                  ≤ e−T /κ ( f (x0 ) − f (x∗ )) .                                        (3.18)

Hence the proof.

    Here we can show that the algorithm’s point xT also gets rapidly closer to x∗ . If x∗ is the optimal point,
we know ∇ f (x∗ ) = 0. Now smoothness gives f (x0 ) − f (x∗ ) ≤ (β /2)kx0 − x∗ k2 , and strong convexity
gives (α/2)kxT − x∗ k2 ≤ f (xT ) − f (x∗ ). Plugging into (3.18) gives us that

                                      kxT − x∗ k2 ≤ κe−T /κ · kx0 − x∗ k2 .

    A few remarks. Firstly, Theorem 3.8 implies that reducing the error by a factor of 1/2 can be achieved
by increasing T additively by κ ln 2. Hence if the condition number κ is constant, every constant number
of rounds of gradient descent gives us one additional bit of accuracy! This behavior, where getting error
bounded by ε requires O(log ε −1 ) steps, is called linear convergence in the numerical analysis literature.
    One may ask if the convergence for smooth, and for well-conditioned functions is optimal as a
function of T . The answer is no: a famed result of Nesterov gives faster (and optimal) convergence rates.
We see this result and a potential-function-based proof in §5.
    Finally, the proof of Theorem 3.8 can be extended to the constrained case using the same potential
function and the update rule (3.10), but now using an analog of Claim 3.5 that shows that for any y ∈ K,

                   f (xt+1 ) − f (y) ≤ β hxt − xt+1 , xt − yi − β/2kxt − xt+1 k2 − α/2kxt − yk2 .

We omit the simple proof.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                               15
                                   N IKHIL BANSAL AND A NUPAM G UPTA

4     The mirror-descent framework
The gradient descent algorithms in the previous sections work by adding some multiple of the gradient
to current point. However, this should strike the reader as somewhat strange, since the point xt and the
gradient ∇ f (xt ) are objects that lie in different spaces and should be handled accordingly. In particular, if
xt lies in some vector space E, the gradient ∇ f (xt ) lies in the dual vector space E ∗ . (This did not matter
earlier since Rn equipped with the Euclidean norm is self-dual, but now we want to consider general
norms and would like to be careful.)
     A key insight of Nemirovski and Yudin [18] was that substantially more general and powerful results
can be obtained, without much additional work, by considering these spaces separately. For example, it is
well-known (and we will show) that the classic multiplicative-weights update method can be obtained as
a special case of this general approach.


4.1   Basic mirror descent

The key idea in mirror descent is to define an injective mapping between E to E ∗ , which is called the
mirror map. Given a point xt , we first map it to E ∗ , make the gradient update there, and then use the
inverse map back to obtain the point xt+1 .
    We start some basic concepts and notation. Consider some vector space E with an inner product h·, ·i,
and define a norm k·k on E. To measure distances in E ∗ , we use the dual norm defined as

                                             kyk∗ := max hx, yi .                                          (4.1)
                                                      x:kxk=1


By definition we have

                                              hx, yi ≤ kxk · kyk∗ ,                                        (4.2)

which is often referred to as the generalized Cauchy-Schwarz inequality.
   A function h is α-strongly convex with respect to k·k if

                                                                      α
                                 h(y) ≥ h(x) + h∇h(x), y − xi +         ky − xk2 .                         (4.3)
                                                                      2

Such a strongly convex function h defines a map from E to E ∗ via its gradient: indeed, the map x 7→ ∇h(x)
takes the point x ∈ E into a point in the dual space E ∗ . The strong convexity ensures that the map is 1-1
(i. e., ∇h(x) 6= ∇h(y) for x 6= y). Moreover, the map ∇h(·) is also surjective, so for any θ ∈ E ∗ there is an
inverse x ∈ E such that ∇h(x) = θ . In fact, this inverse map is given by the gradient of the Fenchel dual
for h, i. e., ∇h∗ (θ ) = x ⇐⇒ ∇h(x) = θ . (For the reader not familiar with Fenchel duality, it suffices to
interpret ∇h∗ (θ ) merely as (∇h)−1 (θ ).) Readers interested in the technical details can see, e. g., [2] or [4,
Chapter 7].

                        T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                  16
                        P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

                                      Dual Space               Primal Space


                                              θt       ∇h               xt
                                                                             K
                                  −ηt ∇ft (xt )


                                                                       xt+1 = ΠhK (x0t+1 )
                                      0
                                     θt+1
                                                      ∇h∗      x0t+1


                                          Figure 3: Mirror Descent.


4.1.1   The update rules
Since ∇h : E → E ∗ gives us a map from the primal space E to the dual space E ∗ , we keep track of the
image point θt = ∇h(xt ) as well. Now, the updates are the natural ones, given by

                                          0
                                        θt+1 = θt − ηt ∇ ft (xt ),                                    (4.4)
                                         0
                                        xt+1 = ∇h∗ (θt+1
                                                      0
                                                         ),
                                                                0
                                        xt+1 = arg min Dh (x k xt+1 ).
                                                    x∈K


In other words, given xt ∈ E, we add ηt times the negative gradient to its image θt = ∇h(xt ) in the
                    0 , pull the result back to x0                                     0         ∗ 0
dual space to get θt+1                             t+1 ∈ E (using the inverse mapping xt+1 = ∇h (θt+1 )),
and project it back onto K to get xt . Of course, we may not want to use the Euclidean distance for the
projection; the “right” distance in this case is the Bregman divergence Dh (y k x) from x to y, which we
discuss shortly.
    An equivalent way to present the mirror descent update is the following:
                                                                                
                           xt+1 = arg min hηt ∇ ft (xt ), x − xt i + Dh (x k xt ) .                   (4.5)
                                        x∈K


This is a generalization of (2.2). The equivalence is easy to see in the unconstrained case (just take
derivatives), for the constrained case one uses the KKT conditions.

4.1.2   Bregman divergences
Given a strictly convex function h : Rd → R, define the Bregman divergence from x to y

                                Dh (y k x) := h(y) − h(x) − h∇h(x), y − xi

to be the “error” at y between the actual function value and the value given by linearization at some point
x. The convexity of h means this quantity is non-negative; if h is β -strongly convex with respect to the

                       T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                             17
                                    N IKHIL BANSAL AND A NUPAM G UPTA

norm k·k, then Dh (y k x) ≥ (β /2)ky − xk2 . Also, Dh (y k x) is a convex function of y (for a fixed x, this
is a convex function minus a linear term), and the gradient of the divergence with respect to the first
argument is ∇y (Dh (y k x)) = ∇h(y) − ∇h(x).
    For example, the function h(x) := (1/2)kxk22 is 1-strongly convex with respect to `2 (and hence strictly
convex), and the associated Bregman divergence Dh (y k x) = (1/2)ky − xk22 , half the squared `2 distance.
This distance is not a metric, since it does not satisfy the triangle inequality. Or consider the negative
entropy function h(x) := ∑i xi ln xi defined on the probability simplex 4n := {x ∈ [0, 1]n | ∑i xi = 1}.
For x, y ∈ 4n , the associated Bregman divergence Dh (y k x) is ∑i yi ln(yi /xi ), the relative entropy or
Kullback-Leibler (KL) divergence from x to y. This distance is not even symmetric in x and y.

Bregman projection. Given a convex body K and a strictly convex function h, we define the Bregman
projection of a point x0 on K as
                                        ΠhK (x0 ) = argminx∈K Dh (x k x0 ) .
If x0 ∈ K, then ΠhK (x0 ) = x0 because Dh (x0 k x0 ) = 0. For h(x) = (1/2)kxk2 , this corresponds to the usual
Euclidean projection. A very useful feature of Bregman projections is that they satisfy a “Pythagorean
inequality” with respect to the divergence, analogous to Fact 2.2.
Proposition 4.1 (Generalized Pythagorean property). Given a convex body K ⊆ Rn , let a ∈ K and b0 ∈ Rn .
Let b = ΠhK (b0 ). Then
                                    h∇h(b0 ) − ∇h(b), a − bi ≤ 0 .
In particular,
                                     Dh (a k b0 ) ≥ Dh (a k b) + Dh (b k b0 ) ,                          (4.6)
and hence Dh (a k b) ≤ Dh (a k b0 ).
Proof. Recall that for any convex function g and convex body K, if x∗ = argminx∈K g(x) is the minimizer
of g in K, then h∇g(x∗ ), y − x∗ i ≥ 0 for all y ∈ K. Using g(x) = Dh (x k b0 ), and noting that g(x) is convex
with ∇g(x) = ∇h(x) − ∇h(b0 ) and that the minimizer x∗ = b, we get h∇h(b) − ∇h(b0 ), a − bi ≥ 0 for all
a ∈ K.
    For the second part, expand the terms using the definition of Dh (a k b) and cancel the common terms,
the desired inequality turns out to be equivalent to h∇h(b0 ) − ∇h(b), a − bi ≤ 0. The last inequality uses
that the divergences are non-negative.

4.1.3   The analysis
We consider the more general online optimization setting, and prove the following regret bound.
Theorem 4.2. Let K be a convex body, f1 , . . . , fT be convex functions defined on K, k · k be a norm, and
h be an αh -strongly convex function with respect to k · k. The mirror descent algorithm starting at x0 and
taking constant step size ηt = η in every iteration, produces x1 , . . . , xT such that
                 T            n
                                             Dh (x∗ k x0 ) η ∑t=1
                                                              T
                                                                  k∇ ft (xt )k2∗
                 ∑ ft (xt ) − ∑ ft (x∗ ) ≤        η
                                                          +
                                                                  2αh
                                                                                 ,   for all x∗ ∈ K.     (4.7)
                 t=1         t=1


                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                               18
                          P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Proof. Define the potential

                                                                Dh (x∗ k xt )
                                                   Φt =                       .                                      (4.8)
                                                                     η

Observe that plugging in h(x) = (1/2)kxk22 gives us the potential function (2.3) for the Euclidean norm.

The potential change. For brevity, use ∇t := ∇ ft (xt ).
        Dh (x∗ k xt+1 ) − Dh (x∗ k xt )
                       ≤ Dh (x∗ k xt+1
                                   0
                                       ) − Dh (x∗ k xt )                                (generalized Pythagorean property)
                              ∗     0            0
                       = h(x ) − h(xt+1 ) − h∇h(xt+1 ), x∗ − xt+1
                                                              0
                                                                  i − h(x∗ ) + h(xt ) + h∇h(xt ), x∗ − xt i
                                                  | {z }                                            | {z }
                                                           0
                                                         θt+1                                          θt
                                     0          0            0           0
                       = h(xt ) − h(xt+1 ) − hθt+1   , xt − xt+1  i − hθt+1  − θt , x∗ − xt i
                                     0                     0                     0
                       = h(xt ) − h(xt+1 ) − hθt , xt − xt+1   i +hηt ∇t , xt − xt+1 i + hηt ∇t , x∗ − xt i
                          |                 {z                       }
                                      strong convexity
                         αh 0                             0
                       ≤−   kx − xt k2 + ηt h∇t , xt − xt+1 i + ηt h∇t , x∗ − xt i
                          2 t+1
                        η2
                       ≤ t k∇t k2∗ + ηt h∇t , x∗ − xt i .                                                            (4.9)
                        2αh
The last inequality uses generalized Cauchy-Schwarz to get ha, bi ≤ kbkkak∗ ≤ kbk2 /2 + kak2∗ /2. Ob-
serve that (4.9) precisely maps to (2.7) when we consider the Euclidean norm.

The amortized cost. Recall that we set ηt = η for all steps. Hence, dividing (4.9) and substituting,
                                                                                              η
                 ft (xt ) − ft (x∗ ) + (Φt+1 − Φt ) ≤ ft (xt ) − ft (x∗ ) + h∇t , x∗ − xt i +   k∇t k2∗ .
                                                      |                 {z               } 2αh
                                                                    ≤ 0 by convexity of ft

The total regret then becomes
                                                           η                Dh (x∗ k x0 ) η ∑t=1
                                                                                             T
                                                                                                 k∇t k2∗
                 ∑( ft (xt ) − ft (x∗ )) ≤ Φ0 + ∑ 2αh k∇t k2∗ ≤                  η
                                                                                         +
                                                                                              2αh
                                                                                                         .
                  t                                 t

Hence the proof.

4.1.4   Special cases
To get some intuition, let us look at some well-known special cases. If we use the `2 norm, and h(x) :=
(1/2)kxk22 which is clearly 1-strongly convex with respect to `2 , the associated Bregman divergence
Dh (x∗ k x) = (1/2)kx∗ − xk22 . Moreover, the Euclidean norm is self-dual, so if we bound k∇ ft k2 by G,
the total regret bound above is
                                         1 ∗
                                           kx − x0 k22 + ηT G2 /2 .
                                        2η

                          T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                         19
                                    N IKHIL BANSAL AND A NUPAM G UPTA

This is the same result for projected gradient descent we derived in Theorem 2.1—and in fact the algorithm
is also precisely the same.
     Now consider the `1 norm, with K being the probability simplex 4n := {x ∈ [0, 1]n | ∑i xi = 1}.
If we choose the negative entropy function h(x) := ∑i xi ln xi , then Dh (x∗ k x) is just the well-known
Kullback-Liebler divergence. Moreover, Pinsker’s inequality says that KL(p k q) ≥ (1/2)kp − qk21 , which
implies that h is 1-strongly convex with respect to `1 . Applying Theorem 4.2 now gives a regret bound of

                                          KL(x∗ k x0 ) η
                                                      + ∑k∇t k2∞ .
                                              η        2 t

Let’s also see what the mirror descent algorithm does in this case. The mirror map takes the point x to
∇h(x) = (1 + log xi )i , and the inverse map takes θ to ∇h∗ (θ ) = (eθi −1 )i . This point may be outside the
probability simplex, so we do a Bregman projection, which in this case corresponds to just a rescaling
x 7→ x/kxk1 . Unrolling the process, one can get a closed-form expression for the point xT :

                                                   (x0 )i · exp{∑t (∇t )i }
                                       (xT )i =                                  .
                                                  ∑ j (x0 ) j · exp{∑t (∇t ) j }

     For example, if we specialize even further to online linear optimization, where each function
ft (x) = h`t , xi for some `t ∈ [0, 1]n , the gradient is `t and its `∞ -norm is k`t k∞ ≤ 1, giving us the familiar
regret bound of
                                                 KL(x∗ k x0 ) ηT
                                                               +
                                                       η            2
that we get from the multiplicative weights/Hedge algorithms. Which is not surprising, since this
algorithm is precisely the Hedge algorithm!

4.2   An aside: Smooth functions and general norms
Let us consider minimizing a function that is smooth with respect to non-Euclidean norms, in the
unconstrained case. When we consider an arbitrary norm k·k, the definition of a smooth function (3.2)
extends seamlessly. Now we can define an update rule by naturally extending (2.2):
                                                  n1                                  o
                               xt+1 ← arg min           kx − xt k2 + ηt h∇t , x − xt i ,                    (4.10)
                                              x     2

where the norm is no longer the Euclidean norm, but the norm in question. To evaluate the minimum on
the right side, we can use basic Fenchel duality: given a function g, its Fenchel dual is defined as

                                         g? (θ ) := max{hθ , zi − g(z)} .
                                                        z


If we define g(z) = (1/2)kzk2 , it is known that g? (θ ) = (1/2)kzk2? (see [5, Example 3.27]). Hence
               n1                               o       n                  1          o
         min      kx − xt k2 + ηt h∇t , x − xt i = − max ηt h∇t , xt − xi − kxt − xk2
          x     2                                     x                    2

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                   20
                          P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

                                                         n             1    o   1
                                                  = − max hηt ∇t , zi − kzk2 = − kηt ∇t k2? .           (4.11)
                                                       z               2        2
If a function f is β -smooth with respect to the norm, then setting ηt = 1/β gives:
                    (3.2)                            β
          f (xt+1 ) ≤ f (xt ) + h∇t , xt+1 − xt i + kxt+1 − xt k2
                                                     2
                                                      1                             1          
                   = f (xt ) + β hηt ∇t , xt+1 − xt i + kxt+1 − xt k2 = f (xt ) + β · − kηt ∇t k2? ,
                                                       2                               2
where the last equality uses that xt+1 is the minimizer of the expression in (4.11). Summarizing, we get
                                                                      1
                                            f (xt+1 ) ≤ f (xt ) −       k∇t k2? .                       (4.12)
                                                                     2β
This is analogous to the expression (3.3). Now we can continue the proof as in §3.1, again defining
D := max{kx − x∗ k | f (x) ≤ f (x0 )}, and using the generalized Cauchy-Schwarz inequality to get the
general-norm analog of Theorem 3.2, due to Jaggi [14].
Theorem 4.3 (GD: Smooth functions for general norms). Given a function f that is β -smooth with
respect to the norm k·k, the update rule (4.10) ensures
                                                                        2D2
                                             f (xT ) − f (x∗ ) ≤ β          .
                                                                       T +1

5     Nesterov acceleration: A potential function proof
In §3, we proved a convergence rate of O(1/T ) for smooth functions, using both projected gradient
descent and the Frank–Wolfe method. But the lower bound is only Ω(1/T 2 ). In this case, the algorithm
can be improved: Yurii Nesterov showed how to do it using his accelerated gradient descent methods [19].
Recently there has been much interest in gaining a deeper understanding of this process, with proofs
using “momentum” methods and continuous-time updates [29, 24, 17, 30, 8].
    Let us now see potential-based proofs for his theorem, both for the smooth case, and for the well-
conditioned case. We consider only the unconstrained case (i. e., when K = Rn ) and the Euclidean norm;
the extension to general norms is sketched in §5.4.

5.1    An illustrative failed attempt
One way to motivate Nesterov’s accelerated algorithm is to revisit the proof for smooth functions in
§3.1.1. Let us recall the essential facts. The potential was
                                                                   a
                                      Φt = t ( f (xt ) − f (x∗ )) + kxt − x∗ k2
                                                                   2
for some a > 0. Hence the potential difference was:
                                                                      a
      Φt+1 − Φt = (t + 1) ( f (xt+1 ) − f (xt )) + f (xt ) − f (x∗ ) + (kxt+1 − x∗ k2 − kxt − x∗ k2 )
                          |         {z         } |         {z     } 2|              {z             }
                                    (3.5)              (convexity)                  (2.7)


                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                              21
                                     N IKHIL BANSAL AND A NUPAM G UPTA

                         z }| { z }| {                   z            }|            {
                                                   ∗                         ηt2
             ≤ (t + 1) · − 2β k∇t k2 + h∇t , xt − x i +a ηt h∇t , x − xt i + 2 k∇t k22 = − 2βt k∇t k2 ≤ 0 .
                            1      2                               ∗
                                                                                      

In that last expression we set ηt = 1/β and a = 1/ηt = β to cancel the inner-product terms.
     Observe that the potential may be decreasing considerably, by −t/(2β ) · k∇t k2 , but we are ignoring
this large decrease. If we want to a show an O(1/t 2 ) rate of convergence, a first (incorrect) attempt to get
a better analysis would be to try to apply the analysis above with the potential changed to
                                                                        a
                                  Φt = t(t + 1) ( f (xt ) − f (x∗ )) + kxt − x∗ k2 .
                                                                        2
In particular note the factor at = t(t + 1) instead of (t + 1) above.
     At first glance, the potential change Φt+1 − Φt would be
                                                                                                  2
                             1
                               k∇t k22 )) + 2(t + 1) · h∇t , xt − x∗ i + a ηt h∇t , x∗ − xt i + η2t k∇t k22 .
                                                                                                           
        (t + 2)(t + 1) · (− 2β                                                                                (5.1)
                         |     {z     }
                             (3.5)

Now if we change the step length ηt from 1/β to something more aggressive, say ηt = (t + 1)/(2β ), and
choose a = 4β , the inner-product terms cancel, and the potential reduction seems to be at most
                                         (t + 1)(t + 2) (t + 1)2
                                                                  
                                    2
                              k∇t k −                  +             ≤ 0.
                                               2β           2β
This would seem to give us an O(1/T 2 ) convergence, with the standard update rule.
     So where’s the mistake? It’s in our erroneous use of (3.5) in (5.1)—we used the cautious update
with ηt = 1/β to get the first term of −1/(2β ) · k∇t k22 , but the aggressive update with ηt = (t + 1)(2β )
elsewhere. To fix this, how about running two processes, one cautious and one aggressive, and then
combining them together linearly (with decreasing weight on the aggressive term) to get the new point
xt ? This is precisely what Nesterov’s Accelerated Method does; let’s see it now. (This is another way to
arrive at the elegant linear-coupling view that Allen-Zhu and Orecchia present in [1].)

5.2   Getting the acceleration
As we just sketched, one way to view the accelerated gradient descent method is to run two iterative
processes for yt and zt , and then combine them together to get the actual point xt . The proof is almost the
same as before.

The update steps. Start with x0 = y0 = z0 . At time t, play xt . For brevity, define ∇t := ∇ f (xt ). Now
consider the update rules, where the color is used to emphasize the subtle differences (in particular, z is
updated by the gradient at x in (5.3)):

                                             yt+1 ← xt − β1 ∇ f (xt ),                                       (5.2)
                                             zt+1 ← zt − ηt ∇ f (xt ),                                       (5.3)
                                      xt+1 ← (1 − τt+1 )yt+1 + τt+1 zt+1 .                                   (5.4)
In (5.3), we will choose the “aggressive” step size ηt = (t + 1)/(2β ) as we did in the above failed attempt.
In (5.4) the mixing weight is τt = 2/(t + 2), but this will arise organically below.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                   22
                           P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

The potential. This is the same one from the failed attempt:

                                Φ(t) = t(t + 1) · ( f (yt ) − f (x∗ )) + 2β · kzt − x∗ k2 .                      (5.5)

The potential change. Define ∆Φt = Φ(t + 1) − Φ(t). By the standard GD analysis in (2.7),
                           1                                 η2
                             (kzt+1 − x∗ k2 − kzt − x∗ k2 ) = t k∇t k2 + ηt h∇t , x∗ − zt i                      (5.6)
                           2                                  2
implies that
                                                                                  ηt2
                                                                                                              
                                                                           ∗               2            ∗
  ∆Φt = t(t + 1) · ( f (yt+1 ) − f (yt )) + 2(t + 1) · ( f (yt+1 ) − f (x )) + 4β     k∇t k + ηt h∇t , x − zt i .
                                                                                   2
By smoothness and the update rule for yt+1 , (3.5) implies
                                                                      1
                                             f (yt+1 ) ≤ f (xt ) −      k∇t k2 .
                                                                     2β
Substituting, and noting 2β ηt2 = (t + 1)2 /2β ≤ (t + 1)(t + 2)/2β by the choice of ηt , the resulting
(negative) squared-norm term can be dropped to give

               ∆Φt ≤ t(t + 1) · ( f (xt ) − f (yt )) + 2(t + 1) · ( f (xt ) − f (x∗ )) + 4β ηt h∇t , x∗ − zt i
                    ≤ t(t + 1) · h∇t , xt − yt i + 2(t + 1) · h∇t , xt − x∗ i + 2(t + 1) · h∇t , x∗ − zt i

using convexity for the first two terms, and ηt = (t + 1)/2β for the last one. Collecting like terms,

                                  ∆Φt ≤ (t + 1) · h∇t , (t + 2)xt − tyt − 2zt i = 0 ,                            (5.7)

by using (5.4) and τt = 2/(t + 2). Hence Φt ≤ Φ0 for all t ≥ 0. This proves:
Theorem 5.1 (Accelerated GD). Given a β -smooth function f , the update rules (5.2)-(5.4) ensure

                                                                     kz0 − x∗ k2
                                           f (yt ) − f (x∗ ) ≤ 2β                .
                                                                      t(t + 1)

5.2.1   An aside: Optimizing parameters and making connections
Suppose we choose the generic potential
                                                                         β
                                             2
                                    Φ(t) = λt−1 ( f (yt ) − f (x∗ )) +     kzt − x∗ k2 ,
                                                                         2
where λt = Θ(t 2 ), and try to optimize the calculation above. Having λt2 − λt−1
                                                                              2 = λ and τ = 1/λ makes
                                                                                   t     t     t
the calculations work out very cleanly. Solving this recurrence leads to the (somewhat exotic-looking)
weights
                                                           q
                                                                    2
                                                        1 + 1 + 4λt−1
                                     λ0 = 0,      λt =                                           (5.8)
                                                               2

                          T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                      23
                                    N IKHIL BANSAL AND A NUPAM G UPTA

used in the standard exposition of AGM2.
     The update rules (5.2)–(5.4) have sometimes been called AGM2 (Nesterov’s second accelerated method)
in the literature. A different set of update rules (called AGM1) are the following: for the optimized choice
of λt from (5.8), define:
                                               1
                                    yt+1 ← xt − ∇ f (xt ) ,                                             (5.9)
                                               β
                                                      
                                                1 − λt           1 − λt
                                    xt+1 ← 1 −            yt+1 +        yt .                           (5.10)
                                                 λt+1             λt+1
      Let us show the simple equivalence (also found in, e. g., [27, 9, 16]).
Lemma 5.2. Using updates (5.9–5.10) and setting zt := λt xt −(λt −1)yt = λt (xt −yt )+yt , and τt := 1/λt
leads to the updates (5.2–5.4).
Proof. Clearly yt is the same as above, so it suffices to show that zt and xt behave identically. Indeed,
rewriting the definition of zt and substituting τt = 1/λt gives
                                xt = (1 − τt )yt + τt zt , and
                         zt+1 − zt = (λt+1 (xt+1 − yt+1 ) + yt+1 ) − (λt (xt − yt ) + yt ) .           (5.11)
Moreover, rewriting (5.10) gives
                                 λt+1 (xt+1 − yt+1 ) − (1 − λt )(yt − yt+1 ) = 0 .                     (5.12)
Subtracting (5.12) from (5.11) gives
                                                                      λt
                                   zt+1 − zt = λt yt+1 − λt xt = −       ∇ f (xt ) .                   (5.13)
                                                                      β
Recalling that λt = 1/τt = (β ηt ), this is precisely the update rule zt+1 ← zt − ηt ∇ f (xt ). This shows the
equivalence of the two update rules.

5.3     The constrained case with acceleration
The update steps. The update rule is very similar to the one above, it just involves projecting the points
onto the body K. Formally, again we start with x0 = y0 = z0 . At time t, play xt . For brevity, define
∇t := ∇ f (xt ). Now consider the update rules, where again the color is used to emphasize the subtle
differences:
                                     yt+1 ← ΠK (xt − β1 ∇ f (xt )) ,                                (5.14)
                                          zt+1 ← ΠK (zt − ηt ∇ f (xt )) ,                              (5.15)
                                       xt+1 ← (1 − τt+1 )yt+1 + τt+1 zt+1 .                            (5.16)
    We now show that this update rule satisfies same guarantee as in Theorem 5.1 for the unconstrained
case.
Theorem 5.3 (Accelerated GD). Given a β -smooth function f , the update rules (5.14)-(5.16) ensure
                                                                 kz0 − x∗ k2
                                        f (yt ) − f (x∗ ) ≤ 2β               .
                                                                  t(t + 1)

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                              24
                          P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

The potential.
                              Φ(t) = t(t + 1) · ( f (yt ) − f (x∗ )) + 2β · kzt − x∗ k2 .                     (5.17)

Change in potential.
    Φt+1 − Φt = (t + 1)(t + 2)( f (yt+1 ) − f (xt )) −t(t + 1)( f (yt ) − f (xt )) + 2(t + 1)( f (xt ) − f (x∗ ))
                |              {z                  } |                            {z                            }
                                   (?)                                               (??)
                                   ∗ 2            ∗ 2
                                                        
                + 2β · kzt+1 − x k − kzt − x k .
                  |             {z            }
                                  (???)

Using convexity on both differences in (??) gives
                                ≤ −t(t + 1)h∇t , yt − xt i + 2(t + 1)h∇t , xt − x∗ i
                                = (t + 1) · h∇t , −t(yt − xt ) + 2(xt − x∗ )i .
Now (1 − τt )(yt − xt ) = τt (xt − zt ), and if τt = 2/(t + 2), then we get t(yt − xt ) = 2(xt − zt ), and hence
                                          (??) = 2(t + 1)h∇t , zt − x∗ i .                                    (5.18)
The expression (? ? ?) is
                                    2β (hzt+1 − zt , zt+1 − x∗ i − kzt+1 − zt k2 .
                               0
By the Pythagorean property, hzt+1 − zt+1 , zt+1 − x∗ i ≥ 0 where zt+1
                                                                   0   = zt − ηt ∇t . Setting ηt = (t + 1)/β
and multiplying by 2β , adding to the term above, (? ? ?) is upper bounded by
                                            −2(t + 1)h∇t , zt+1 − x∗ i.
Combine with (5.18) to cancel x∗ , it suffices to show the following claim:
Lemma 5.4.
                               (?)                                    (??)+(???)
                z              }|                  { z                    }|                   {
                (t + 1)(t + 2)( f (yt+1 ) − f (xt )) + 2(t + 1)h∇t , zt − zt+1 i − kzt+1 − zt k2 ≤ 0 .
Proof. By smoothness,
                                                                β
                                                                   kyt+1 − xt k2 .
                              f (yt+1 ) − f (xt ) ≤ h∇t , yt+1 − xt i +
                                                                 2
As                                                                  
                                                          β
                                    min h∇t , y − xt i + ky − xt k2
                                    y∈K                   2
in minimized by the definition of yt+1 , we get that for any v ∈ K, the RHS is
                                                           β
                                          ≤ h∇t , v − xt i + kv − xt k2 .
                                                           2
Define v := (1 − τt )yt + τt zt+1 ∈ K, so v − xt = τt (zt+1 − zt ). Substituting, we get
                                                                      τ 2β
                          f (yt+1 ) − f (xt ) ≤ τt h∇t , zt+1 − zt i + t kzt+1 − zt k2 .
                                                                        2
Setting τt = 2/(t + 2), the claim follows.

                         T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                       25
                                   N IKHIL BANSAL AND A NUPAM G UPTA

5.4   The extension to arbitrary norms
Given an arbitrary norm k·k, the update rules now use the gradient descent update (4.10) for smooth
functions for the y variables, and the mirror descent update rules (4.4) for the z variables:
                                                                                
                                              β           2
                            yt+1 ← arg min       ky − xt k + h∇ f (xt ), y − xt i ,
                                         y     2
                                                                           
                            zt+1 ← arg min hηt ∇ f (xt ), zi + Dh (z k zt ) ,                 (5.19)
                                            z

                            xt+1 ← (1 − τt+1 )yt+1 + τt+1 zt+1 .
Given the discussion in the preceding sections, the update rules are the natural ones: the first is the
update (4.10) for smooth functions, and the second is the usual mirror descent update rule (4.5) for the
strongly-convex function h. The step size is now set to
                                                         (t + 1)αh
                                                ηt =               ,
                                                             2β
where αh is such that h is αh -strongly convex with respect to k·k. Then, the potential function becomes:
                                                                       4β
                           Φ(t) = t(t + 1) · ( f (yt ) − f (x∗ )) +       · Dh (x∗ k zt ) ,            (5.20)
                                                                       αh

which on substituting Dh (x∗ k zt ) = (1/2)kx∗ − zt k2 and αh = 1 gives (5.5).
    We already have all the pieces to bound the change in potential. Use the mirror descent analysis (4.9)
to get
                                                            η2
                        Dh (x∗ k zt+1 ) − Dh (x∗ k zt ) ≤ t k∇t k2∗ + ηt h∇t , x∗ − xt i.
                                                           2αh
which replaces (5.6). Infer
                                                                1
                                         f (yt+1 ) ≤ f (xt ) −    k∇t k2∗
                                                               2β
from (4.12) in the smooth case. Substitute these into the analysis from §5.2 (with minor changes for the
αh term) to get the following theorem:
Theorem 5.5 (Accelerated GD: General norms). Given a β -smooth function f with respect to norm k·k,
the update rules (5.19) ensure
                                                    4β Dh (x∗ k z0 ) − Dh (x∗ k zt )
                              f (yt ) − f (x∗ ) ≤      ·                             .
                                                    αh          t(t + 1)

5.5   Acceleration for strongly convex functions
Now consider the case when the function f is well-conditioned with√condition number κ = β /α, and
describe the algorithm of Nesterov with convergence rate exp(−t/ κ) [20]. Again, this is the best
possible for gradient methods.
    Before describing the algorithm and its analysis, it is instructive to see how the result for the smoothed
case from Section 5.2 already gives (in principle) such a result for the well-conditioned case.

                        T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                               26
                        P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Reduction from smooth case. Theorems 5.1 and 5.3 for the (constrained) smoothed case give that
                                                                 kx0 − x∗ |2
                                      f (yt ) − f (x∗ ) ≤ 2β                 .
                                                                  t(t + 1)
Together with strong convexity,
                                                             α
                                       f (yt ) − f (x∗ ) ≥     kyt − x∗ k2 ,
                                                             2
                x∗ k2 ≤ 4κkx0 − x∗ k2 /t(t + 1).
this gives kyt −√
    So in t = 4 κ steps, the distance kyt − x∗ k is at most half the initial distance kx0 − x∗ k from the
optimum. Starting the algorithm again with yt as the initial
                                                          √ point, and iterating this process thus gives an
overall algorithm, that in t steps has error at most 2−t/4 κ kx − x k.
                                                                   0
    Restarting the algorithm after every few steps is not ideal, and we now describe Nesterov’s algorithm
with this improved convergence rate. For simplicity only consider the unconstrained case.

The update rules. We now use the following updates (which look very much like the AGM1 updates):
                                              1
                                 yt+1 ← xt − ∇ f (xt ) ,                                            (5.21)
                                              β
                                              √               √
                                                  κ −1           κ −1
                                 xt+1 ← 1 + √            yt+1 − √     yt .                          (5.22)
                                                  κ +1           κ +1
                                                          √
For the analysis, it will be convenient to define τ = 1/( κ + 1) and set
                                         zt+1 := τ1 xt+1 − 1−τ
                                                            τ yt+1 .                                (5.23)
We now show that that the error after t steps is
                                                                                
                                         ∗              −t       α +β        ∗ 2
                             f (yt ) − f (x ) ≤ (1 + γ)               kx0 − x k ,                   (5.24)
                                                                   2
             √
where γ = 1/( κ − 1) (as in §3.2, for κ = 1 the algorithm reaches optimum in a single step and y1 = x∗ ,
and hence we assume that κ > 1). This improves on the error of
                                                    β
                                         (1 + 1/κ)−t kx0 − x∗ k2
                                                    2
we get from §3.2.

The potential. Consider the potential
                                                              α           
                            Φ(t) = (1 + γ)t f (yt ) − f (x∗ ) + kzt − x∗ k2 .
                                                               2
Observe that
                                                                 α
                                   Φ0 = f (y0 ) − f (x∗ ) +        kz0 − x∗ k2 .
                                                                 2
As x0 = y0 = z0 , and by β -smoothness of f ,
                                                 α +β
                                         Φ0 ≤         kx0 − x∗ k2 .
                                                   2

                       T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                             27
                                        N IKHIL BANSAL AND A NUPAM G UPTA

Change in potential. To show the error bound (5.24), it suffices to show that ∆Φ(t) = Φ(t +1)−Φ(t) ≤
0 for each t. This is equivalent to showing
                                                                   α
          (1 + γ)( f (yt+1 ) − f (x∗ )) − ( f (yt ) − f (x∗ )) +     (1 + γ)kzt+1 − x∗ k2 − kzt − x∗ k2 ≤ 0 .
                                                                                                       
                                                                   2
We first bound the terms involving f in the most obvious way. As above, we use ∇t as short-hand for
∇ f (xt ). By β -smoothness and the update rule, again

                                                                       1
                                              f (yt+1 ) ≤ f (xt ) −      k∇t k2 .
                                                                      2β

So,

           (1 + γ)( f (yt+1 ) − f (x∗ )) − ( f (yt ) − f (x∗ ))
                                                                                    1
                              ≤ f (xt ) − f (yt ) + γ( f (xt ) − f (x∗ )) − (1 + γ) k∇t k2
                                                                                   2β
                                                                                     
                                                                   ∗     α        ∗ 2   1+γ
                              ≤ h∇t , xt − yt i + γ h∇t , xt − x i − kxt − x k −            k∇t k2 ,            (5.25)
                                                                          2              2β

where the last inequality used convexity and strong convexity respectively.
    We now want to remove references to yt . By definition (5.23),
                                                           √
                                        
                                    1
                             zt =     − 1 (xt − yt ) + xt = κ(xt − yt ) + xt ,
                                    τ
                         √                                 √
so we infer γ(zt − x∗ ) = κγ(xt − yt ) + γ(xt − x∗ ). Using κγ = 1 + γ, simple algebra gives

                                                           1
                             (xt − yt ) + γ(xt − x∗ ) =       γ(zt − x∗ ) + γ 2 (xt − x∗ ) .
                                                                                          
                                                          1+γ

      For brevity we use Xt := xt − x∗ , Zt = zt − x∗ , and substitute the above expression into (5.25) to get
                                   1                         αγ          1+γ
                                      h∇t , γZt + γ 2 Xt i −    kXt k2 −     k∇t k2 .                           (5.26)
                                  1+γ                        2            2β

Now, let us upper bound the terms in ∆Φ(t) involving z. Conveniently, we can relate zt+1 and zt using a
simple calculation that we defer for the moment.

Claim 5.6. zt+1 = (1 − √1κ )zt + √1κ xt − α √
                                            1
                                               ∇ and so zt+1 − x∗ = 1+γ
                                              κ t
                                                                     1        γ
                                                                        Zt + 1+γ         γ
                                                                                 Xt − α(1+γ) ∇t .

      Now use Claim 5.6 and expand using ka + b + ck2 = kak2 + kbk2 + kck2 + 2ha, bi + 2hb, ci + 2ha, ci:

      (1 + γ)kzt+1 − x∗ k2 − kzt − x∗ k2
                                      γ2                                      2γ 2
                                                                                            
           1                                                    2γ
       =         kZt k2 + γ 2 kXt k2 + 2 k∇t k2 + 2γhZt , Xt i − h∇t , Zt i −      h∇t , Xt i − kZt k2 .        (5.27)
         1+γ                          α                         α              α


                            T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                                   28
                        P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

Now sum (5.26) and α/2 times (5.27). The terms involving k∇t k2 cancel since
                                            1+γ       αγ 2
                                                =
                                             2β   2α 2 (1 + γ)
(by the definition of γ). Moreover, the inner-product terms involving ∇t also cancel. Hence the potential
change is at most
                                                                       
                          αγ      2            γ       α       2     1         αγ
                ∆Φ(t) ≤      kXt k − 1 +            + kZt k             −1 +      hZt , Xt i
                           2                1+γ        2            1+γ      1+γ
                              αγ                                  
                       =−             kXt k2 + kZt k2 − 2hZt , Xt i                               (5.28)
                            2(1 + γ)
                              αγ
                       =−            kZt − Xt k2 ≤ 0 .                                            (5.29)
                            2(1 + γ)
Hence the potential does not increase, as claimed. It only remains to prove Claim 5.6.

Proof of Claim 5.6. The expression of xt+1 from (5.22) can be written as (2 − 2τ)yt+1 − (1 − 2τ)yt .
Plugging into the expression for zt+1 from (5.23) gives
                                1
                          zt+1 =  ((2 − 2τ)yt+1 − (1 − 2τ)yt − (1 − τ)yt+1 )
                                τ
                                1
                               = ((1 − τ)yt+1 − (1 − 2τ)yt ) .
                                τ
Using the update rule (5.21) for yt+1 , and the relation xt = (1 − τ)yt + τzt to eliminate yt
                                                                               
                                 1                   1        (1 − 2τ)
                              =     (1 − τ) xt − ∇t −                  (xt − τzt )
                                 τ                   β          1−τ
                                 1 − 2τ         τ        1−τ
                              =          zt +       xt −      ∇t .
                                  1−τ         1−τ         τβ
             √
Using τ = 1/( κ + 1) and β = κα now gives the claim.

Acknowledgments
We thank the referees and Laci Babai for their detailed and thoughtful feedback that helped improve
the presentation substantially. We also thank Sébastien Bubeck, Daniel Dadush, Jelena Diakonikolas,
Nick Harvey, Elad Hazan, Greg Koumoutsos, Raghu Meka, Aryan Mokhtari, Marco Molinaro, Thomas
Rothvoß, and Kunal Talwar for their helpful comments.


References
 [1] Z EYUAN A LLEN -Z HU AND L ORENZO O RECCHIA: Linear coupling: an ultimate unification of
     gradient and mirror descent. In Proc. 8th Innovations in Theoret. Comp. Sci. Conf. (ITCS’17), pp. 3:1–
     3:22. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017. [doi:10.4230/LIPIcs.ITCS.2017.3,
     arXiv:1407.1537] 22

                       T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                             29
                                N IKHIL BANSAL AND A NUPAM G UPTA

 [2] A MIR B ECK AND M ARC T EBOULLE: Mirror descent and nonlinear projected subgradient methods
     for convex optimization. Oper. Res. Lett., 31(3):167–175, 2003. [doi:10.1016/S0167-6377(02)00231-
     6] 16

 [3] A HARON B EN -TAL AND A RKADI N EMIROVSKI: Lectures on Modern Convex Optimization.
     Society for Industrial and Applied Mathematics, 2001. [doi:10.1137/1.9780898718829] 3

 [4] D MITRI P. B ERTSEKAS , A NGELIA N EDI Ć , AND A SUMAN E. O ZDAGLAR: Convex Analysis and
     Optimization. Athena Scientific, 2003. 16

 [5] S TEPHEN B OYD AND L IEVEN VANDENBERGHE: Convex Optimization. Cambridge University
     Press, 2004. [doi:10.1017/CBO9780511804441] 20

 [6] S ÉBASTIEN B UBECK: Convex optimization: algorithms and complexity. Foundations and Trends
     in Machine Learning, 8(3-4):231–357, 2015. [doi:10.1561/2200000050, arXiv:1405.4980] 3, 6

 [7] N ICOLÒ C ESA -B IANCHI AND G ÁBOR L UGOSI: Prediction, Learning, and Games. Cambridge
     University Press, Cambridge, 2006. 3, 9

 [8] J ELENA D IAKONIKOLAS AND L ORENZO O RECCHIA: The approximate gap technique: a uni-
     fied approach to optimal first-order methods. SIAM J. Optim., 29(1):660–689, 2019. SIAM.
     [arXiv:1712.02485] 2, 3, 21

 [9] YOEL D RORI AND M ARC T EBOULLE: Performance of first-order methods for smooth convex
     minimization: a novel approach. Math. Program., 145(1-2):451–482, 2014. [doi:10.1007/s10107-
     013-0653-0, arXiv:1206.3209] 24

[10] J OHN C. D UCHI: Introductory Lectures on Stochastic Optimization, 2016. Available at author’s
     webpage. 3

[11] M ARGUERITE F RANK AND P HILIP W OLFE: An algorithm for quadratic programming. Naval
     Research Logistics Quarterly, 3(1-2):95–110, 1956. [doi:10.1002/nav.3800030109] 13

[12] E LAD H AZAN: Introduction to online convex optimization. Foundations and Trends in Optimization,
     2(3-4):157–325, 2016. [doi:10.1561/2400000013] 3

[13] J EAN -BAPTISTE H IRIART-U RRUTY AND C LAUDE L EMARÉCHAL: Fundamentals of Convex
     Analysis. Springer, 2001. [doi:10.1007/978-3-642-56468-0] 2

[14] M ARTIN JAGGI: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In Proc. 13th
     Internat. Conf. on Machine Learning (ICML’13), pp. 427–435, 2013. [PMLR]. 21

[15] S AHAR K ARIMI AND S TEPHEN A. VAVASIS: A single potential governing convergence of conju-
     gate gradient, accelerated gradient and geometric descent, 2017. [arXiv:1712.09498] 3

[16] D ONGHWAN K IM AND J EFFREY A. F ESSLER: Optimized first-order methods for smooth con-
     vex minimization. Math. Program., 159(1-2):81–107, 2016. [doi:10.1007/s10107-015-0949-3,
     arXiv:1406.5468] 24

                      T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                         30
                       P OTENTIAL -F UNCTION P ROOFS FOR G RADIENT M ETHODS

[17] WALID K RICHENE , A LEXANDRE M. BAYEN , AND P ETER L. BARTLETT: Accelerated mirror
     descent in continuous and discrete time. In Adv. in Neural Inform. Processing Systems 28 (NIPS’15),
     pp. 2845–2853, 2015. NIPS. 2, 3, 21

[18] A RKADI S EMËNOVICH N EMIROVSKI AND DAVID B ERKOVICH Y UDIN: Problem Complexity and
     Method Efficiency in Optimization. John Wiley & Sons, Inc., 1983. 2, 3, 16

[19] Y URII N ESTEROV: A method of solving a convex programming problem with convergence rate
     O(1/k2 ). Soviet Mathematics Doklady, 27:372–376, 1983. LINK at Math-Net.ru. 21

[20] Y URII N ESTEROV: Introductory Lectures on Convex Optimization – A Basic Course. Kluwer
     Academic Publishers, 2004. [doi:10.1007/978-1-4419-8853-9] 3, 6, 26

[21] JAVIER P EÑA: Convergence of first-order methods via the convex conjugate. Oper. Res. Lett.,
     45(6):561–564, 2017. [doi:10.1016/j.orl.2017.08.013, arXiv:1707.09084] 3

[22] S HAI S HALEV-S HWARTZ: Online learning and online convex optimization. Foundations and
     Trends in Machine Learning, 4(2):107–194, 2012. [doi:10.1561/2200000018] 3

[23] NAUM Z USELEVICH S HOR: Minimization Methods for Non-differentiable Functions. Springer,
     1985. [doi:10.1007/978-3-642-82118-9] 3

[24] W EIJIE S U , S TEPHEN B OYD , AND E MMANUEL J. C ANDÈS: A differential equation for model-
     ing Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. (JMLR),
     17(153):1–43, 2016. 2, 3, 21

[25] E IJI TAKIMOTO AND M ANFRED K. WARMUTH: The minimax strategy for Gaussian density
     estimation. In Proc. 13th Ann. Conf. on Computational Learning Theory (COLT’00), pp. 100–106.
     Morgan Kaufmann Publ., 2000. ACM DL. 8

[26] A DRIEN TAYLOR AND F RANCIS BACH: Stochastic first-order methods: non-asymptotic and
     computer-aided analyses via potential functions. In Proc. 32th Ann. Conf. on Computational
     Learning Theory (COLT’19), pp. 2934–2992. MLR Press, 2019. [PMLR]. [arXiv:1902.00947] 3

[27] PAUL T SENG: On accelerated proximal gradient methods for convex-concave optimization, 2008.
     Available at author’s webpage. 24

[28] N ISHEETH K. V ISHNOI: A mini-course on convex optimization, 2018. Available at author’s
     webpage. 3

[29] A NDRE W IBISONO , A SHIA C. W ILSON , AND M ICHAEL I. J ORDAN: A variational perspective
     on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA, 113(47):E7351–E7358, 2016.
     [doi:10.1073/pnas.1614734113, arXiv:1603.04245] 2, 3, 21

[30] A SHIA C. W ILSON , B EN R ECHT , AND M ICHAEL I. J ORDAN: A Lyapunov analysis of momentum
     methods in optimization, 2016. [arXiv:1611.02635] 2, 3, 21



                      T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                           31
                              N IKHIL BANSAL AND A NUPAM G UPTA

AUTHORS

    Nikhil Bansal
    Researcher
    Centrum voor Wiskunde en Informatica
    Amsterdam
    The Netherlands
    bansal gmail com
    http://www.win.tue.nl/~nikhil


    Anupam Gupta
    Professor
    Computer Science Department
    Carnegie Mellon University
    Pittsburgh, PA, USA
    anupamg cs cmu edu
    http://www.cs.cmu.edu/~anupamg


ABOUT THE AUTHORS

    N IKHIL BANSAL is a researcher at the Centrum voor Wiskunde en Informatica, Amsterdam.
        He attended the Indian Institute of Technology, Mumbai for his B. Tech. degree, and
        received his Ph. D. from Carnegie Mellon University, where he was advised by Avrim
        Blum. He got fascinated by algorithms after taking an undergraduate class by Prof. Ajit
        A. Diwan. Since then he has enjoyed thinking about various kinds of algorithmic
        questions.


    A NUPAM G UPTA is a faculty member in the Computer Science Department at Carnegie
       Mellon University. He grew up in Calcutta, India, attended the Indian Institute of
       Technology, Kanpur for his B. Tech. degree, and received his Ph. D. from the University
       of California, Berkeley, under the guidance of Alistair Sinclair. Anupam is interested
       in most kinds of algorithmic problems, and his research focus is on approximation and
       online algorithms, and metric embeddings.




                    T HEORY OF C OMPUTING, Volume 15 (4), 2019, pp. 1–32                          32