Theory of Computing, Volume 18 (9), 2022, pp. 1–18
www.theoryofcomputing.org

Special Issue: APPROX-RANDOM 2019



Optimal Convergence Rate of Hamiltonian Monte Carlo for Strongly Logconcave Distributions

                    Zongchen Chen∗        Santosh S. Vempala†

        Received February 29, 2020; Revised September 23, 2020; Published April 30, 2022




       Abstract. We study the Hamiltonian Monte Carlo (HMC) algorithm for sampling from
       a strongly logconcave density proportional to e^{−f} where f : ℝ^d → ℝ is μ-strongly
       convex and L-smooth (the condition number is κ = L/μ). We show that the relaxation
       time (inverse of the spectral gap) of ideal HMC is O(κ), improving on the previous
       best bound of O(κ^{1.5}) (Lee et al., 2018); we complement this with an example where
       the relaxation time is Ω(κ), for any step-size. When implemented with an ODE solver,
       HMC returns an ε-approximate point in 2-Wasserstein distance using Õ((κd)^{0.5} ε^{−1})
       gradient evaluations per step and Õ((κd)^{1.5} ε^{−1}) total time.


    A conference version [5] of this paper appeared in the Proceedings of the 23rd International Conference on
Randomization and Computation (RANDOM 2019).
  ∗ Supported by CCF-1563838 and CCF-1617306.
  † Supported by CCF-1563838, CCF-1717349 and DMS-1839323.



ACM Classification: Theory of computation - Random walks and Markov chains, Theory of
computation - Design and analysis of algorithms
AMS Classification: 60J05, 60J25, 68W20
Key words and phrases: logconcave distribution, sampling, Hamiltonian Monte Carlo, spectral
gap, strong convexity


© 2022 Zongchen Chen and Santosh S. Vempala
Licensed under a Creative Commons Attribution License (CC-BY)                   DOI: 10.4086/toc.2022.v018a009

1   Introduction
Sampling logconcave densities is a basic problem that arises in machine learning, statistics,
optimization, computer science and other areas. The problem is described as follows. Let
f : ℝ^d → ℝ be a convex function. Our goal is to sample from the density proportional to e^{−f(x)}.
We study Hamiltonian Monte Carlo (HMC), one of the most widely used Markov chain Monte Carlo
(MCMC) algorithms for sampling from a probability distribution. In many settings, HMC is
believed to outperform other MCMC algorithms such as the Metropolis–Hastings algorithm or
Langevin dynamics. In terms of theory, rapid mixing has been established for HMC in recent
papers [14, 15, 18, 19, 20] in various settings. However, in spite of much progress, a gap remains
between the known upper and lower bounds even in the basic setting when 𝑓 is strongly convex
(e^{−f} is strongly logconcave) and has a Lipschitz gradient.
     Many sampling algorithms such as the Metropolis–Hastings algorithm or Langevin dynamics
maintain a position 𝑥 = 𝑥(𝑡) that changes with time, so that the distribution of 𝑥 will eventually
converge to the desired distribution, i. e., proportional to e^{−f(x)}. In HMC, besides the position
𝑥 = 𝑥(𝑡), we also maintain a velocity 𝑣 = 𝑣(𝑡). In the simplest Euclidean setting, the Hamiltonian
𝐻(𝑥, 𝑣) is defined as
                                       H(x, v) = f(x) + (1/2)‖v‖².                               (1.1)
Then in every step the pair (𝑥, 𝑣) is updated using the following system of differential equations
for a fixed time interval 𝑇:
                                  dx(t)/dt = ∇_v H(x, v) = v(t),
                                                                                                 (1.2)
                                  dv(t)/dt = −∇_x H(x, v) = −∇f(x(t)).
The initial position 𝑥(0) = 𝑥0 is the position from the last step, and the initial velocity 𝑣(0) = 𝑣 0
is chosen randomly from the standard Gaussian distribution 𝑁(0, 𝐼). The updated position is
𝑥(𝑇) where 𝑇 can be thought of as the step-size. It is well known that the stationary distribution
of HMC is the density proportional to e^{−f}. Observe that

                dH(x, v)/dt = ∇_x H(x, v) · x′(t) + ∇_v H(x, v) · v′(t) = ∇f(x) · v − v · ∇f(x) = 0,
so the Hamiltonian 𝐻(𝑥, 𝑣) does not change with 𝑡. We can also write (1.2) as the following
ordinary differential equation (ODE):

                           x″(t) = −∇f(x(t)),   x(0) = x₀,   x′(0) = v₀.                         (1.3)
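As a sanity check on the conservation property above, one can integrate (1.3) numerically for a simple quadratic and watch the Hamiltonian stay (nearly) constant. The sketch below is illustrative only: it assumes f(x) = (1/2)‖x‖² and uses a velocity-Verlet (leapfrog) discretization, which is not part of the paper's analysis.

```python
import numpy as np

def leapfrog(grad_f, x, v, T, n_steps):
    """Approximate the Hamiltonian flow x'' = -grad_f(x) for time T."""
    h = T / n_steps
    x, v = x.copy(), v.copy()
    for _ in range(n_steps):
        v -= 0.5 * h * grad_f(x)   # half-step on the velocity
        x += h * v                 # full step on the position
        v -= 0.5 * h * grad_f(x)   # half-step on the velocity
    return x, v

# f(x) = 0.5 ||x||^2, so grad f(x) = x and H(x, v) = 0.5 (||x||^2 + ||v||^2).
H = lambda x, v: 0.5 * (x @ x + v @ v)
rng = np.random.default_rng(0)
x0, v0 = rng.standard_normal(3), rng.standard_normal(3)
x1, v1 = leapfrog(lambda x: x, x0, v0, T=1.0, n_steps=1000)
assert abs(H(x1, v1) - H(x0, v0)) < 1e-4   # Hamiltonian is (approximately) conserved
```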

We state HMC explicitly as the following algorithm; also see Figure 1 for an illustration.

Hamiltonian Monte Carlo algorithm
    Input: f : ℝ^d → ℝ that is μ-strongly convex and L-smooth, and an error parameter ε.





                [Figure 1: A trajectory of Hamiltonian Monte Carlo, showing successive
                positions x^(0), x^(1), x^(2), x^(3).]


   1. Set starting point 𝑥 (0) , step-size 𝑇, number of steps 𝑁, and ODE error tolerance 𝛿.

   2. For 𝑘 = 1, . . . , 𝑁:

       (a) Let 𝑣 ∼ 𝑁(0, 𝐼);
       (b) Denote by 𝑥(𝑡) the solution to (1.2) with initial position 𝑥(0) = 𝑥 (𝑘−1) and initial
           velocity 𝑣(0) = 𝑣. Use the ODE solver to find a point 𝑥 (𝑘) such that

                                               ‖x^(k) − x(T)‖ ≤ δ.

   3. Output 𝑥 (𝑁) .

    In our analysis, we first consider ideal HMC where in every step we have the exact solution
to the ODE (1.2) and neglect the numerical error from solving the ODEs or integration (𝛿 = 0).
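The loop above can be sketched in a few lines. This is a toy illustration, not the ODE solver of [14]: it approximates the Hamiltonian flow with a leapfrog integrator (a common practical choice), omits the error-tolerance machinery, and uses placeholder values for T and N.

```python
import numpy as np

def hmc(grad_f, d, T, n_iters, n_leapfrog=30, seed=0):
    """Hamiltonian Monte Carlo with a leapfrog approximation of the ODE (1.2)."""
    rng = np.random.default_rng(seed)
    h = T / n_leapfrog
    x = np.zeros(d)                     # starting point x^(0)
    for _ in range(n_iters):
        v = rng.standard_normal(d)      # fresh velocity v ~ N(0, I) each step
        for _ in range(n_leapfrog):     # run the Hamiltonian flow for time T
            v -= 0.5 * h * grad_f(x)
            x += h * v
            v -= 0.5 * h * grad_f(x)
    return x

# Target: standard Gaussian, f(x) = 0.5 ||x||^2 (mu = L = kappa = 1).
samples = np.array([hmc(lambda x: x, d=2, T=0.5, n_iters=30, seed=s)
                    for s in range(150)])
# Empirical mean should be near 0 and coordinate variances near 1.
assert np.abs(samples.mean(axis=0)).max() < 0.35
assert np.abs(samples.var(axis=0) - 1.0).max() < 0.5
```

Without a Metropolis correction the leapfrog discretization introduces a small bias; the paper's ideal HMC instead assumes the exact flow (δ = 0).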

1.1   Preliminaries
We recall some standard definitions here. Let f : ℝ^d → ℝ be a continuously differentiable
function. For a positive real number μ we say that f is μ-strongly convex if for all x, y ∈ ℝ^d,

                              f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖².


We say that f is L-smooth if ∇f is L-Lipschitz, i. e., for all x, y ∈ ℝ^d,

                                       ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

If f is μ-strongly convex and L-smooth, then the condition number of f is κ = L/μ. When f is a
twice differentiable function, f is μ-strongly convex if and only if ∇²f(x) ⪰ μI for all x ∈ ℝ^d,
where ∇²f(x) denotes the Hessian matrix of f at x and I is the identity matrix; similarly, f is
L-smooth if and only if ∇²f(x) ⪯ LI for all x ∈ ℝ^d.
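For a concrete example, take the quadratic f(x) = (1/2) xᵀAx with A positive definite: the Hessian is A everywhere, so μ and L are the extreme eigenvalues of A. A small illustrative check of the first-order characterizations above:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)          # positive definite: f(x) = 0.5 x^T A x is strongly convex
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]        # strong-convexity and smoothness constants
kappa = L / mu                   # condition number

grad = lambda x: A @ x           # gradient of f
x, y = rng.standard_normal(4), rng.standard_normal(4)
g = grad(x) - grad(y)
# mu-strong convexity (first-order form) and L-smoothness:
assert g @ (x - y) >= mu * (x - y) @ (x - y) - 1e-9
assert np.linalg.norm(g) <= L * np.linalg.norm(x - y) + 1e-9
```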
    Consider a discrete-time reversible Markov chain ℳ on ℝ^d with stationary distribution π. Let

                        L²(π) = { f : ℝ^d → ℝ : ∫_{ℝ^d} f(x)² π(dx) < ∞ }

be the Hilbert space with inner product
                                  ⟨f, g⟩ = ∫_{ℝ^d} f(x) g(x) π(dx)

for f, g ∈ L²(π). Denote by P the transition kernel of ℳ. We can view P as a self-adjoint operator
from L²(π) to itself: for f ∈ L²(π),

                                  (Pf)(x) = ∫_{ℝ^d} f(y) P(x, dy).
Let L²₀(π) = { f ∈ L²(π) : ∫_{ℝ^d} f(x) π(dx) = 0 } be a closed subspace of L²(π). The (absolute)
spectral gap of P is defined to be

                 γ(P) = 1 − sup_{f ∈ L²₀(π)} ‖Pf‖/‖f‖ = 1 − sup_{f ∈ L²₀(π), ‖f‖=1} |⟨Pf, f⟩|.

The relaxation time of P is

                                       τ_rel(P) = 1/γ(P).
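For intuition, on a finite state space these quantities can be read off from the eigenvalues of the transition matrix. A toy sketch on a two-state chain (an illustrative discrete analogue, not the chains on ℝ^d studied here; the transition probabilities are arbitrary):

```python
import numpy as np

a, b = 0.3, 0.1
P = np.array([[1 - a, a],
              [b, 1 - b]])              # reversible two-state chain
pi = np.array([b, a]) / (a + b)         # stationary distribution
assert np.allclose(pi @ P, pi)

# The eigenvalues of P are 1 and 1 - a - b; the absolute spectral gap drops the
# trivial eigenvalue 1 and takes 1 minus the largest remaining modulus.
eig_moduli = np.sort(np.abs(np.linalg.eigvals(P)))
gamma = 1 - eig_moduli[0]               # absolute spectral gap
tau_rel = 1 / gamma                     # relaxation time
assert np.isclose(gamma, a + b)
assert np.isclose(tau_rel, 2.5)
```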
    Let ν₁, ν₂ be two distributions on ℝ^d with finite p-th moments, where p ≥ 1. The p-Wasserstein
distance between ν₁ and ν₂ is defined as

                      W_p(ν₁, ν₂) = ( inf_{(X,Y) ∈ 𝒞(ν₁,ν₂)} 𝔼[‖X − Y‖^p] )^{1/p},

where 𝒞(ν₁, ν₂) is the set of all couplings of ν₁ and ν₂. The Ricci curvature of P is defined as

                      Ric(P) = 1 − sup_{x,y ∈ ℝ^d, x ≠ y} W₁(P(x, ·), P(y, ·)) / ‖x − y‖.

It is known that γ(P) ≥ Ric(P) (see [22, Proposition 29]), which provides an effective way to
bound the spectral gap through coupling arguments. We refer to [22] for more background on
the Ricci curvature.
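On a two-point space with unit distance, the W₁ distance between two distributions is just the difference of their masses at one point, so the bound γ(P) ≥ Ric(P) can be checked directly. A toy sketch (hypothetical numbers; on two states the two quantities happen to coincide):

```python
import numpy as np

a, b = 0.3, 0.1
P = np.array([[1 - a, a],
              [b, 1 - b]])
# W1 between the rows P(0, .) and P(1, .), with d(0, 1) = 1:
W1 = abs(P[0, 0] - P[1, 0])
ric = 1 - W1                    # Ricci curvature: only one pair x != y to consider
gamma = 1 - np.sort(np.abs(np.linalg.eigvals(P)))[0]   # absolute spectral gap
assert gamma >= ric - 1e-12     # gamma(P) >= Ric(P), with equality on two states
```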


1.2   Related work
Various versions of Langevin dynamics have been studied in many recent papers, see [8, 9, 29,
23, 10, 7, 6, 3, 12, 27, 26, 17]. The convergence rate of HMC has also been studied recently in
[14, 15, 18, 19, 20, 24]. The first bound in our setting was obtained by Mangoubi and Smith [18],
who gave an 𝑂(𝜅2 ) bound on the convergence rate of ideal HMC.
Theorem 1.1 ([18, Theorem 1]). Let f : ℝ^d → ℝ be a twice differentiable function such that
μI ⪯ ∇²f(x) ⪯ LI for all x ∈ ℝ^d. Then the relaxation time of ideal HMC for sampling from the density
∝ e^{−f} with step-size T = √μ/(2√2 L) is O(κ²).
    This was improved by [14], which showed a bound of O(κ^{1.5}). They also gave an algorithm
with nearly optimal running time for solving the ODE that arises in the implementation of
HMC.
Theorem 1.2 ([14, Lemma 1.8]). Let f : ℝ^d → ℝ be a twice differentiable function such that
μI ⪯ ∇²f(x) ⪯ LI for all x ∈ ℝ^d. Then the relaxation time of ideal HMC for sampling from the density
∝ e^{−f} with step-size T = μ^{1/4}/(2L^{3/4}) is O(κ^{1.5}).
    Both papers suggest that the correct bound is linear in κ. It is mentioned in [18] that linear
dependence is the best one can expect. Meanwhile, [14] shows that there exists a choice of
step-sizes (the time for which the ODE is run), depending on f, the current position, and the
random initial velocity, that achieves a linear convergence rate ([14, Lemma 1.8], second part);
however, it was far from clear how to determine these step-sizes algorithmically.
    Other papers focus on various aspects and use stronger assumptions (e. g., bounds on
higher-order gradients) to get better bounds on the overall convergence time or the number
of gradient evaluations in some ranges of parameters. For example, [20] shows that the
dependence on dimension for the number of gradient evaluations can be as low as d^{1/4} with
suitable regularity assumptions (and higher dependence on the condition number). We note
also that sampling logconcave functions is a polynomial-time solvable problem, without the
assumptions of strong convexity or gradient Lipschitzness, and even when the function e^{−f} is
given only by an oracle with no access to gradients [1, 16]. The Riemannian version of HMC
provides a faster polynomial-time algorithm for uniformly sampling polytopes [15]. However,
the dependence on the dimension is significantly higher for these algorithms, both for the
contraction rate and the time per step.
    After the conference version [5] of our paper, several articles appeared on sampling from
logconcave densities. For example, [2, 11, 21, 25, 28] study the Langevin dynamics and its
variants, and [4, 13] study Metropolized HMC. Shen and Lee [25] propose the Randomized
Midpoint Method to discretize the underdamped Langevin diffusion; their algorithm achieves
a smaller total number of gradient calls, N = Õ(κ^{7/6} d^{1/6} ε^{−1/3} + κ d^{1/3} ε^{−2/3}), and a smaller
running time than our HMC method.

1.3   Results
In this paper, we show that the relaxation time of ideal HMC is Θ(𝜅) for strongly logconcave
functions with Lipschitz gradient.


Theorem 1.3. Suppose that f is μ-strongly convex and L-smooth. Then the relaxation time (inverse of
the spectral gap) of ideal HMC for sampling from the density ∝ e^{−f} with step-size T = 1/(2√L) is O(κ),
where κ = L/μ is the condition number.

    We remark that the only assumptions we make about f are strong convexity and smoothness.
(In particular, we do not require that f be twice differentiable, which is assumed in both [14] and
[18].)
    We also establish a matching lower bound on the relaxation time of ideal HMC, implying
the tightness of Theorem 1.3.

Theorem 1.4. For any L ≥ μ > 0 and T > 0, there exists a μ-strongly convex and L-smooth function f
such that the relaxation time of ideal HMC for sampling from the density ∝ e^{−f} with step-size T is Ω(κ),
where κ = L/μ is the condition number.

   Using the ODE solver from [14] with nearly optimal running time, we obtain the following
convergence rate in 2-Wasserstein distance for the HMC algorithm.

Theorem 1.5. Let π ∝ e^{−f} be the target distribution, and let π_hmc be the distribution of the output of
HMC with starting point x^(0) = arg min_x f(x), step-size T = 1/(16000√L), and ODE error tolerance
δ = √μ T² ε/16. For any 0 < ε < √d, if we run HMC for N = O(κ log(d/ε)) steps, where κ = L/μ,
then we have

                                       W₂(π_hmc, π) ≤ ε/√μ.                                        (1.4)

Each step takes O(√κ d^{3/2} ε^{−1} log(κd/ε)) time and O(√(κd) ε^{−1} log(κd/ε)) evaluations of ∇f,
amortized over all steps.

   We note that since our new convergence rate allows larger steps, the ODE solver must be run
over a longer time interval per step. One interesting open problem is to find a matching lower
bound for HMC with respect to the 2-Wasserstein distance.
   The comparison of convergence rates, running times and numbers of gradient evaluations is
summarized in the following table with polylog factors omitted.

                      reference      convergence rate    # gradients          total time
                      [18]           κ^{2}               κ^{6.5} d^{0.5}      κ^{6.5} d^{1.5}
                      [14]           κ^{1.5}             κ^{1.75} d^{0.5}     κ^{1.75} d^{1.5}
                      this paper     κ                   κ^{1.5} d^{0.5}      κ^{1.5} d^{1.5}

    Finally, we remark that Theorem 1.4 also implies an Ω(κ) computational lower bound for
HMC in terms of the number of gradient queries. This differs from [25] by a factor of Õ(κ^{1/6}).
As pointed out by one of the referees, a comparison with the optimization literature suggests
that the optimal number of gradient queries achievable by any gradient-based MCMC algorithm
should scale as √κ. This is corroborated by the fine analysis in [2] showing that the


convergence rate of continuous underdamped Langevin dynamics is O(√m), where m is the
Poincaré constant of the measure, suggesting that an ideal discretization of underdamped
Langevin dynamics with step-size O(√(1/L)) might be able to achieve O(√κ) convergence. In
this sense, HMC probably cannot achieve the optimal convergence rate.


2    Convergence of ideal HMC
In this section we show that the spectral gap of ideal HMC is Ω(1/κ), thus proving Theorem 1.3.
We first show a contraction bound for ideal HMC, which roughly says that the distance between
two coupled trajectories shrinks after one step of ideal HMC.
Lemma 2.1 (Contraction bound). Suppose that f is μ-strongly convex and L-smooth. Let x(t) and
y(t) be the solutions to (1.2) with initial positions x(0), y(0) and initial velocities x′(0) = y′(0). Then for
0 ≤ t ≤ 1/(2√L) we have

                      ‖x(t) − y(t)‖² ≤ (1 − (μ/4)t²) ‖x(0) − y(0)‖².                              (2.1)

In particular, by setting t = T = 1/(c√L) for some constant c ≥ 2 we get

                      ‖x(T) − y(T)‖² ≤ (1 − 1/(4c²κ)) ‖x(0) − y(0)‖²,                             (2.2)

where κ = L/μ.
Proof. Consider the two ODEs for HMC:

       x′(t) = u(t),   u′(t) = −∇f(x(t));     and     y′(t) = v(t),   v′(t) = −∇f(y(t)),

with initial points x(0), y(0) and initial velocities u(0) = v(0). We will show that

                      ‖x(t) − y(t)‖² ≤ (1 − (μ/4)t²) ‖x(0) − y(0)‖²

for all 0 ≤ t ≤ 1/(2√L).
    Consider the derivative of (1/2)‖x(t) − y(t)‖²:

    (d/dt) [(1/2)‖x(t) − y(t)‖²] = ⟨x′(t) − y′(t), x(t) − y(t)⟩ = ⟨u(t) − v(t), x(t) − y(t)⟩.      (2.3)
Taking the derivative on both sides, we get

    (d²/dt²) [(1/2)‖x(t) − y(t)‖²] = ⟨u′(t) − v′(t), x(t) − y(t)⟩ + ⟨u(t) − v(t), x′(t) − y′(t)⟩
                                   = −⟨∇f(x(t)) − ∇f(y(t)), x(t) − y(t)⟩ + ‖u(t) − v(t)‖²
                                   = −ρ(t)‖x(t) − y(t)‖² + ‖u(t) − v(t)‖²,                         (2.4)



where we define

                ρ(t) = ⟨∇f(x(t)) − ∇f(y(t)), x(t) − y(t)⟩ / ‖x(t) − y(t)‖².

Since f is μ-strongly convex and L-smooth, we have μ ≤ ρ(t) ≤ L for all t ≥ 0.
    We will give an upper bound on the term −ρ(t)‖x(t) − y(t)‖² + ‖u(t) − v(t)‖², while keeping
its dependence on ρ(t). To lower bound ‖x(t) − y(t)‖², we use the following crude bound.

Claim 2.2 (Crude bound). For all 0 ≤ t ≤ 1/(2√L) we have

                (1/2)‖x(0) − y(0)‖² ≤ ‖x(t) − y(t)‖² ≤ 2‖x(0) − y(0)‖².                            (2.5)
The proof of this claim is postponed to Section 2.1.
   Next we derive an upper bound on ‖u(t) − v(t)‖². By taking the derivative of (1/2)‖u(t) − v(t)‖²,
we get

    ‖u(t) − v(t)‖ · (d/dt)‖u(t) − v(t)‖ = (d/dt) [(1/2)‖u(t) − v(t)‖²]
                                        = ⟨u′(t) − v′(t), u(t) − v(t)⟩
                                        = −⟨∇f(x(t)) − ∇f(y(t)), u(t) − v(t)⟩.

Thus, its absolute value is not greater than

    |(d/dt)‖u(t) − v(t)‖| = |⟨∇f(x(t)) − ∇f(y(t)), u(t) − v(t)⟩| / ‖u(t) − v(t)‖ ≤ ‖∇f(x(t)) − ∇f(y(t))‖.

Since f is L-smooth and convex, it holds that

                ‖∇f(x(t)) − ∇f(y(t))‖² ≤ L⟨∇f(x(t)) − ∇f(y(t)), x(t) − y(t)⟩;

see, e. g., [30, Lemma 4] for a proof. Hence, we obtain

                ‖∇f(x(t)) − ∇f(y(t))‖² ≤ Lρ(t)‖x(t) − y(t)‖² ≤ 2Lρ(t)‖x(0) − y(0)‖²,

where the last inequality follows from the crude bound (2.5). Then, using the fact that u(0) = v(0)
and the Cauchy–Schwarz inequality, we can upper bound ‖u(t) − v(t)‖² by

                ‖u(t) − v(t)‖² ≤ ( ∫₀ᵗ |(d/ds)‖u(s) − v(s)‖| ds )²
                               ≤ ( ∫₀ᵗ √(2Lρ(s)) ‖x(0) − y(0)‖ ds )²
                               ≤ 2Lt ( ∫₀ᵗ ρ(s) ds ) ‖x(0) − y(0)‖².



Define the function

                                       P(t) = ∫₀ᵗ ρ(s) ds,

so P(t) is nonnegative and monotone increasing, with P(0) = 0. Also we have μt ≤ P(t) ≤ Lt for
all t ≥ 0. Then,

                       ‖u(t) − v(t)‖² ≤ 2Lt P(t) ‖x(0) − y(0)‖².                                   (2.6)
    Plugging (2.5) and (2.6) into (2.4), we deduce that

    (d²/dt²) [(1/2)‖x(t) − y(t)‖²] ≤ −ρ(t) · (1/2)‖x(0) − y(0)‖² + 2Lt P(t) ‖x(0) − y(0)‖².

If we define

                                       α(t) = (1/2)‖x(t) − y(t)‖²,

then we have

                                       α″(t) ≤ −α(0)(ρ(t) − 4Lt P(t)).

Integrating both sides and using α′(0) = 0, we obtain

                α′(t) = ∫₀ᵗ α″(s) ds
                      ≤ −α(0) ( ∫₀ᵗ ρ(s) ds − 4L ∫₀ᵗ s P(s) ds )
                      ≤ −α(0) ( P(t) − 4L P(t) ∫₀ᵗ s ds )
                      = −α(0) P(t) (1 − 2Lt²),
                                                                                          √
where the second inequality is due to the monotonicity of 𝑃(𝑠). Since for all 0 ≤ 𝑡 ≤ 1/(2 𝐿) we
have 𝑃(𝑡) ≥ 𝜇𝑡 and 1 − 2𝐿𝑡 2 ≥ 1/2, we deduce that
                                                     𝜇
                                        𝛼0(𝑡) ≤ −𝛼(0) 𝑡.
                                                     2
Finally, one more integration yields

                α(t) = α(0) + ∫₀ᵗ α′(s) ds ≤ α(0)(1 − (μ/4)t²),

and the lemma follows.                                                                             □
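The contraction bound (2.1) can also be checked numerically by running the two ODEs with the same (synchronously coupled) initial velocity. The sketch below is illustrative: it takes a diagonal quadratic f with μ = 1 and L = 4 and integrates the Hamiltonian flow with a velocity-Verlet scheme.

```python
import numpy as np

lam = np.array([1.0, 4.0])              # Hessian eigenvalues: mu = 1, L = 4
mu, L = 1.0, 4.0
grad = lambda x: lam * x                # f(x) = 0.5 * sum(lam * x**2)

def flow(x, v, t, n=2000):
    """Velocity-Verlet integration of the Hamiltonian ODE (1.2)."""
    h = t / n
    x, v = x.copy(), v.copy()
    for _ in range(n):
        v -= 0.5 * h * grad(x)
        x += h * v
        v -= 0.5 * h * grad(x)
    return x

rng = np.random.default_rng(2)
x0, y0 = rng.standard_normal(2), rng.standard_normal(2)
v0 = rng.standard_normal(2)             # same initial velocity: synchronous coupling
t = 1 / (2 * np.sqrt(L))
xt, yt = flow(x0, v0, t), flow(y0, v0, t)
lhs = np.sum((xt - yt) ** 2)
rhs = (1 - mu * t**2 / 4) * np.sum((x0 - y0) ** 2)
assert lhs <= rhs + 1e-8                # the contraction bound (2.1) holds
```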
Proof of Theorem 1.3. Recall from Section 1.1 that the Ricci curvature of a Markov chain P is
defined to be 1 − sup_{x≠y} W₁(P(x, ·), P(y, ·))/‖x − y‖, where W₁ is the 1-Wasserstein distance.
Since √(1 − z) ≤ 1 − z/2 for 0 ≤ z ≤ 1, Lemma 2.1 implies that for any constant c ≥ 2, the Ricci
curvature of ideal HMC with step-size T = 1/(c√L) is at least 1/(8c²κ). It follows from [22,
Proposition 29] that the spectral gap of ideal HMC is at least 1/(8c²κ). Hence, the relaxation
time is at most 8c²κ = O(κ).                                                                       □


2.1     Proof of Claim 2.2
We present the proof of Claim 2.2 in this section. We remark that a similar crude bound was
established in [14] for general matrix ODEs. Here we prove the crude bound specifically for the
Hamiltonian ODE, but without assuming that 𝑓 is twice differentiable.

Proof of Claim 2.2. We first derive a crude upper bound on ‖u(t) − v(t)‖. Since f is L-smooth,
we have

    (d/dt)‖u(t) − v(t)‖ = −⟨∇f(x(t)) − ∇f(y(t)), u(t) − v(t)⟩ / ‖u(t) − v(t)‖
                        ≤ ‖∇f(x(t)) − ∇f(y(t))‖ ≤ L‖x(t) − y(t)‖.

Then from u(0) = v(0) we get

    ‖u(t) − v(t)‖ = ∫₀ᵗ (d/ds)‖u(s) − v(s)‖ ds ≤ L ∫₀ᵗ ‖x(s) − y(s)‖ ds.

    To obtain the upper bound for ‖x(t) − y(t)‖, we first bound its derivative by

    |(d/dt)‖x(t) − y(t)‖| = |⟨u(t) − v(t), x(t) − y(t)⟩| / ‖x(t) − y(t)‖ ≤ ‖u(t) − v(t)‖ ≤ L ∫₀ᵗ ‖x(s) − y(s)‖ ds.   (2.7)

Therefore, we have

    ‖x(t) − y(t)‖ = ‖x(0) − y(0)‖ + ∫₀ᵗ (d/ds)‖x(s) − y(s)‖ ds
                  ≤ ‖x(0) − y(0)‖ + L ∫₀ᵗ ∫₀ˢ ‖x(r) − y(r)‖ dr ds
                  = ‖x(0) − y(0)‖ + L ∫₀ᵗ (t − s)‖x(s) − y(s)‖ ds.

Then, applying a Grönwall-type inequality from [14, Lemma A.5], we deduce that

    ‖x(t) − y(t)‖ ≤ ‖x(0) − y(0)‖ cosh(√L t) ≤ √2 ‖x(0) − y(0)‖,                                   (2.8)

where we also use the fact that cosh(√L t) ≤ cosh(1/2) ≤ √2.
  Next, we deduce from (2.7) and (2.8) that

    (d/dt)‖x(t) − y(t)‖ ≥ −L ∫₀ᵗ ‖x(s) − y(s)‖ ds
                        ≥ −L ‖x(0) − y(0)‖ ∫₀ᵗ cosh(√L s) ds
                        = −√L ‖x(0) − y(0)‖ sinh(√L t).



Thus, we obtain
              \|x(t) - y(t)\| = \|x(0) - y(0)\| + \int_0^t \frac{\mathrm{d}}{\mathrm{d}s}\,\|x(s) - y(s)\|\,\mathrm{d}s
                              \ge \|x(0) - y(0)\| - \sqrt{L}\,\|x(0) - y(0)\| \int_0^t \sinh(\sqrt{L}\,s)\,\mathrm{d}s
                              = \|x(0) - y(0)\|\,\bigl(2 - \cosh(\sqrt{L}\,t)\bigr) \ge \frac{1}{\sqrt{2}}\,\|x(0) - y(0)\|,

where we use $2 - \cosh(\sqrt{L}\,t) \ge 2 - \cosh(1/2) \ge 1/\sqrt{2}$.                                    □
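As a numerical sanity check (ours, not part of the paper), one can integrate two Hamiltonian curves for a quadratic test function with Hessian eigenvalues in [𝜇, 𝐿], coupled through a shared initial velocity, and confirm that ‖𝑥(𝑡) − 𝑦(𝑡)‖ stays inside the two-sided window [1/√2, √2] · ‖𝑥(0) − 𝑦(0)‖ for all 𝑡 ≤ 1/(2√𝐿):

```python
import numpy as np

def hamiltonian_curve(grad, x0, v0, t_end, n_steps=2000):
    """Integrate x''(t) = -grad f(x(t)) with a leapfrog scheme,
    returning the positions along the curve."""
    h = t_end / n_steps
    x, v = x0.copy(), v0.copy()
    traj = [x.copy()]
    for _ in range(n_steps):
        v = v - 0.5 * h * grad(x)
        x = x + h * v
        v = v - 0.5 * h * grad(x)
        traj.append(x.copy())
    return np.array(traj)

rng = np.random.default_rng(0)
mu, L, d = 1.0, 100.0, 5
a = rng.uniform(mu, L, size=d)      # Hessian eigenvalues, all in [mu, L]
grad = lambda x: a * x              # gradient of f(x) = 0.5 * sum(a * x**2)

x0, y0 = rng.standard_normal(d), rng.standard_normal(d)
v0 = rng.standard_normal(d)         # shared initial velocity (the coupling)
T = 1.0 / (2.0 * np.sqrt(L))        # largest t covered by the bound

xs = hamiltonian_curve(grad, x0, v0, T)
ys = hamiltonian_curve(grad, y0, v0, T)
ratios = np.linalg.norm(xs - ys, axis=1) / np.linalg.norm(x0 - y0)
print(ratios.min(), ratios.max())   # stays inside [1/sqrt(2), sqrt(2)]
```

For a quadratic, each eigencoordinate of 𝑥 − 𝑦 evolves as a cosine, so the ratio actually stays in [cos(1/2), 1], comfortably inside the proved window.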


3   Lower bound for ideal HMC
In this section, we show that for any step-size 𝑇, the relaxation time of ideal HMC can be Θ(𝜅) for some 𝜇-strongly convex and 𝐿-smooth function, thus proving Theorem 1.4.
    Consider the following one-dimensional quadratic function:
                                            f(x) = \frac{x^2}{2\sigma^2},                                (3.1)
where $1/\sqrt{L} \le \sigma \le 1/\sqrt{\mu}$. Thus, 𝑓 is 𝜇-strongly convex and 𝐿-smooth. The probability density 𝜈
proportional to e− 𝑓 is the Gaussian distribution: for 𝑥 ∈ ℝ,
                                   \nu(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right).       (3.2)
The following lemma shows that, for a certain choice of 𝜎, ideal HMC for the Gaussian distribution
𝜈 can have relaxation time Ω(𝜅). Theorem 1.4 then follows immediately.
Lemma 3.1. For any 𝑇 > 0 there exists $1/\sqrt{L} \le \sigma \le 1/\sqrt{\mu}$ such that the relaxation time of ideal HMC
for sampling from 𝜈 with step-size 𝑇 is at least $\kappa/(8\pi^2)$.
Proof. The Hamiltonian curve for 𝑓 is given by the ODE

                                    x''(t) = -f'(x(t)) = -\frac{x(t)}{\sigma^2}
with initial position 𝑥(0) ∈ ℝ and initial velocity 𝑣(0) = 𝑥′(0) drawn from the standard Gaussian 𝑁(0, 1).
Solving the ODE and plugging in the step-size 𝑡 = 𝑇, we get

                            𝑥(𝑇) = 𝑥(0) cos(𝑇/𝜎) + 𝑣(0)𝜎 sin(𝑇/𝜎).
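The closed-form update can be checked against a direct numerical integration of the ODE; the sketch below (our illustration, with arbitrary test values for 𝜎, 𝑇, 𝑥(0), 𝑣(0)) uses a leapfrog integrator:

```python
import numpy as np

def leapfrog(x0, v0, sigma, T, n=100_000):
    """Numerically integrate x''(t) = -x(t)/sigma**2 up to time T."""
    h = T / n
    x, v = x0, v0
    for _ in range(n):
        v -= 0.5 * h * x / sigma**2
        x += h * v
        v -= 0.5 * h * x / sigma**2
    return x

sigma, T = 0.7, 1.3                 # arbitrary test values
x0, v0 = 1.5, -0.4
closed_form = x0 * np.cos(T / sigma) + v0 * sigma * np.sin(T / sigma)
numeric = leapfrog(x0, v0, sigma, T)
print(abs(closed_form - numeric))   # tiny discretization error
```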
   Consider first the case that $T > 4\pi/\sqrt{L}$. We may assume that 𝐿 ≥ 4𝜇 (i. e., 𝜅 ≥ 4), since
otherwise the lemma trivially holds. Pick a positive integer 𝑘 such that

                                        \frac{T\sqrt{\mu}}{2\pi} \le k \le \frac{T\sqrt{L}}{2\pi}.


Note that such 𝑘 exists since

                                     \frac{T\sqrt{L}}{2\pi} - \frac{T\sqrt{\mu}}{2\pi} \ge \frac{T\sqrt{L}}{4\pi} > 1.

We then choose $\sigma = T/(2k\pi)$, which satisfies $1/\sqrt{L} \le \sigma \le 1/\sqrt{\mu}$. Observe that for such a choice
HMC becomes degenerate: since $T/\sigma = 2k\pi$, we have 𝑥(𝑇) = 𝑥(0), meaning that the chain does not move at all.
Thus, in this extremal case the relaxation time is infinite and the lemma holds.
    Next we consider the case that $T \le 4\pi/\sqrt{L}$. We shall pick $\sigma = 1/\sqrt{\mu}$. Let 𝑃 denote the
transition kernel of ideal HMC. Then for 𝑥, 𝑦 ∈ ℝ we have

                      P(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma\,|\sin(T/\sigma)|} \exp\left(-\frac{(y - x\cos(T/\sigma))^2}{2\sigma^2 \sin^2(T/\sigma)}\right).

Namely, given the current position 𝑥, the next position 𝑦 is drawn from a normal distribution with
mean $x\cos(T/\sigma)$ and variance $\sigma^2\sin^2(T/\sigma)$. Denote the spectral gap of 𝑃 by 𝛾. Let $h(x) = x$ and
note that $h \in L_0^2(\nu)$. Using the properties of spectral gaps, we deduce that

                             \gamma = 1 - \sup_{f \in L_0^2(\nu)} \frac{|\langle P f, f \rangle|}{\|f\|^2} \le 1 - \frac{|\langle P h, h \rangle|}{\|h\|^2}.

Since we have
                                    \|h\|^2 = \int_{-\infty}^{\infty} \nu(x)\, h(x)^2 \,\mathrm{d}x = \sigma^2

and
                    \langle P h, h \rangle = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \nu(x)\, P(x, y)\, h(x)\, h(y) \,\mathrm{d}x\,\mathrm{d}y = \sigma^2 \cos(T/\sigma),

we then get
                                 \gamma \le 1 - |\cos(T/\sigma)| \le \frac{T^2}{2\sigma^2} \le 8\pi^2 \cdot \frac{\mu}{L},

where the last inequality is due to our assumption $T \le 4\pi/\sqrt{L}$. Hence, $\tau_{\mathrm{rel}} = 1/\gamma \ge \kappa/(8\pi^2)$.
This completes the proof of the lemma.                                                     □
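The proof can be illustrated empirically (our sketch, not part of the paper): for the Gaussian target, ideal HMC is exactly the AR(1) chain $x_{k+1} = x_k \cos(T/\sigma) + \sigma\sin(T/\sigma)\,\xi_k$ with $\xi_k \sim N(0,1)$, and the lag-1 autocorrelation of $h(x) = x$ estimates $\langle P h, h\rangle/\|h\|^2 = \cos(T/\sigma)$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, T, n = 1.0, 0.3, 500_000
c, s = np.cos(T / sigma), np.sin(T / sigma)

# Ideal HMC for the Gaussian target is the AR(1) chain
#   x_{k+1} = x_k cos(T/sigma) + sigma sin(T/sigma) * xi_k,  xi_k ~ N(0,1).
x = np.empty(n)
x[0] = rng.normal(0.0, sigma)       # start at stationarity
xi = rng.standard_normal(n - 1)
for k in range(n - 1):
    x[k + 1] = c * x[k] + sigma * s * xi[k]

# Lag-1 autocorrelation of h(x) = x estimates <P h, h> / ||h||^2.
autocorr = np.mean(x[:-1] * x[1:]) / np.mean(x * x)
print(autocorr, c)                  # empirical vs. exact cos(T/sigma)
```

When 𝑇/𝜎 approaches a multiple of 2𝜋 the autocorrelation approaches 1 and the empirical relaxation time 1/(1 − |autocorr|) blows up, matching the degenerate case above.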


4     Convergence rate of discretized HMC
In this section, we show how our improved contraction bound (Lemma 2.1) implies that HMC
returns a good enough sample in $\tilde{O}((\kappa d)^{1.5})$ total time. We will use the framework from [14] to
establish Theorem 1.5.
    We first state the ODE solver from [14], which solves an ODE in nearly optimal time when
the solution to the ODE can be approximated by a piece-wise polynomial. We state it here only
for the special case of second-order ODEs arising from the Hamiltonian system, and refer to [14]
for general 𝑘th-order ODEs.


Theorem 4.1 ([14, Theorem 2.5]). Let 𝑥(𝑡) be the solution to the ODE

                             x''(t) = -\nabla f(x(t)), \quad x(0) = x_0, \quad x'(0) = v_0,                 (4.1)

where $x_0, v_0 \in \mathbb{R}^d$ and $0 \le t \le T$. Suppose that the following conditions hold:

   1. There exists a piece-wise polynomial 𝑞(𝑡) such that 𝑞(𝑡) is a polynomial of degree 𝐷 on each interval
      [𝑇𝑗−1 , 𝑇𝑗 ] where 0 = 𝑇0 < 𝑇1 < · · · < 𝑇𝑚 = 𝑇, and for all 0 ≤ 𝑡 ≤ 𝑇 we have

                                               \|q(t) - x''(t)\| \le \frac{\delta}{T^2};                    (4.2)

   2. $\{T_j\}_{j=1}^m$ and 𝐷 are given as input to the ODE solver;

   3. The function 𝑓 has an 𝐿-Lipschitz gradient; i. e., for all $x, y \in \mathbb{R}^d$,

                                          \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.                   (4.3)

If $\sqrt{L}\,T \le 1/16000$, then the ODE solver can find a piece-wise polynomial $\tilde{x}(t)$ such that for all $0 \le t \le T$,

                                           \|\tilde{x}(t) - x(t)\| \le O(\delta).                           (4.4)

The ODE solver uses $O(m(D+1)\log(CT/\delta))$ evaluations of $\nabla f$ and $O(dm(D+1)^2\log(CT/\delta))$ time,
where
                                 C = O\left(\|v_0\| + T\,\|\nabla f(x_0)\|\right).                          (4.5)
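To see how condition (4.2) translates into the guarantee (4.4), the sketch below (our illustration, not the solver of [14]) builds a piece-wise constant approximation 𝑞 of 𝑥″ for the explicitly solvable case $f(x) = x^2/2$ and verifies that integrating $x'' = q(t)$ twice stays within 𝛿 of the true curve:

```python
import numpy as np

x0, v0, T, m = 1.0, 0.5, 0.5, 50
true = lambda t: x0 * np.cos(t) + v0 * np.sin(t)   # solves x'' = -x

# Piece-wise constant q: on each of m intervals, the value of x''
# at the interval midpoint.
edges = np.linspace(0.0, T, m + 1)
mids = 0.5 * (edges[:-1] + edges[1:])
qvals = -true(mids)

ts = np.linspace(0.0, T, 5001)
idx = np.minimum((ts / (T / m)).astype(int), m - 1)
q = qvals[idx]

# The smallest delta for which (4.2) holds: |q(t) - x''(t)| <= delta / T^2.
delta = T**2 * np.max(np.abs(q + true(ts)))

# Integrate x'' = q twice with the trapezoid rule.
v = v0 + np.concatenate([[0.0], np.cumsum(0.5 * (q[1:] + q[:-1]) * np.diff(ts))])
x = x0 + np.concatenate([[0.0], np.cumsum(0.5 * (v[1:] + v[:-1]) * np.diff(ts))])

err = np.max(np.abs(x - true(ts)))
print(err, delta)                                  # err is at most delta
```

Here direct double integration gives error at most $\delta/2$ by Grönwall-free bookkeeping; the solver of [14] achieves the same $O(\delta)$ guarantee with far fewer gradient evaluations via collocation.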

    The following lemma, which combines Theorem 3.2, Lemma 4.1 and Lemma 4.2 from [14],
establishes the conditions of Theorem 4.1 in our setting. We remark that Lemmas 4.1 and 4.2
hold for all $T \le 1/(8\sqrt{L})$, and Theorem 3.2, though stated only for $T \le O(\mu^{1/4}/L^{3/4})$ in [14],
in fact holds for the whole region $T \le 1/(2\sqrt{L})$ where the contraction bound (Lemma 2.1) is true.
We omit these proofs here and refer the reader to [14] for more details.

Lemma 4.2. Let 𝑓 be a twice differentiable function such that $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x \in \mathbb{R}^d$.
Choose the starting point $x^{(0)} = \arg\min_x f(x)$, step-size $T = 1/(16000\sqrt{L})$, and ODE error tolerance
$\delta = \sqrt{\mu}\,T^2\varepsilon/16$ in the HMC algorithm. Let $\{x^{(k)}\}_{k=1}^N$ be the sequence of points we get from the HMC
algorithm and $\{v_0^{(k)}\}_{k=1}^N$ be the sequence of random Gaussian vectors we choose in each step. Let $\pi \propto e^{-f}$
be the target distribution and let $\pi_{\mathrm{hmc}}$ be the distribution of $x^{(N)}$, i. e., the output of HMC. For any
$0 < \varepsilon < \sqrt{d}$, if we run HMC for

                                  N = O\left(\frac{\log(d/\varepsilon)}{\mu T^2}\right) = O\left(\kappa \log(d/\varepsilon)\right)                          (4.6)

steps where $\kappa = L/\mu$, then:


  1. ([14, Theorem 3.2]) We have that

                                               W_2(\pi_{\mathrm{hmc}}, \pi) \le \frac{\varepsilon}{\sqrt{\mu}};                                             (4.7)

  2. ([14, Lemma 4.1]) For each 𝑘, let $x_k(t)$ be the solution to the ODE (1.3) in the 𝑘th step of HMC.
     Then there is a piece-wise constant function $q_k$ of $m_k$ pieces such that $\|q_k(t) - x_k''(t)\| \le \delta/T^2$ for
     all $0 \le t \le T$, where

                                       m_k = \frac{2LT^3}{\delta}\left(\|v_0^{(k-1)}\| + T\,\|\nabla f(x^{(k-1)})\|\right);                                 (4.8)

  3. ([14, Lemma 4.2]) We have that

                                                \frac{1}{N}\,\mathbb{E}\left[\sum_{k=1}^N \|\nabla f(x^{(k-1)})\|^2\right] \le O(Ld).                       (4.9)


Proof of Theorem 1.5. The convergence of HMC is guaranteed by part 1 of Lemma 4.2. In the
𝑘th step, the number of evaluations of $\nabla f$ is $O(m_k \log(C_k \sqrt{\kappa}/\varepsilon))$ by Theorem 4.1 and part 2 of
Lemma 4.2, where

                            m_k = O\left(\frac{\sqrt{\kappa}}{\varepsilon}\left(\|v_0^{(k-1)}\| + T\,\|\nabla f(x^{(k-1)})\|\right)\right)

and

                            C_k = O\left(\|v_0^{(k-1)}\| + T\,\|\nabla f(x^{(k-1)})\|\right).

Thus, the average number of evaluations of $\nabla f$ per step is not greater than

          \frac{1}{N}\,\mathbb{E}\left[\sum_{k=1}^N O\left(m_k \log(C_k \sqrt{\kappa}/\varepsilon)\right)\right] \le \frac{1}{N}\,\mathbb{E}\left[\sum_{k=1}^N O(m_k \log m_k)\right] \le \frac{1}{N}\,O\left(\mathbb{E}\left[M \log M\right]\right),

where $M = \sum_{k=1}^N m_k$. Since each $v_0^{(k-1)}$ is sampled from the standard Gaussian distribution, we
have $\mathbb{E}\bigl[\|v_0^{(k-1)}\|^2\bigr] = d$. Thus, by the Cauchy–Schwarz inequality and part 3 of Lemma 4.2, we
get

               \mathbb{E}\left[M^2\right] \le N \sum_{k=1}^N \mathbb{E}\left[m_k^2\right] \le O\left(\frac{N\kappa}{\varepsilon^2} \sum_{k=1}^N \left(\mathbb{E}\,\|v_0^{(k-1)}\|^2 + T^2\,\mathbb{E}\,\|\nabla f(x^{(k-1)})\|^2\right)\right) \le O\left(\frac{N^2 \kappa d}{\varepsilon^2}\right).

We then deduce again from the Cauchy–Schwarz inequality that

       \mathbb{E}[M \log M]^2 \le \mathbb{E}\left[M^2\right] \cdot \mathbb{E}\left[\log^2 M\right] \le \mathbb{E}\left[M^2\right] \cdot \log^2\left(\mathbb{E}[M]\right) \le \mathbb{E}\left[M^2\right] \cdot \log^2\sqrt{\mathbb{E}\left[M^2\right]},


where the second inequality is due to Jensen’s inequality and the fact that log2 (𝑥) is concave
when 𝑥 ≥ 3 (note that we can assume 𝑀 ≥ 3 by making 𝑁 ≥ 3). Therefore, the number of
evaluations of ∇ 𝑓 per step, amortized over all steps, is
                       \frac{1}{N}\,O\left(\sqrt{\mathbb{E}[M^2]}\,\log\sqrt{\mathbb{E}[M^2]}\right) \le O\left(\frac{\sqrt{\kappa d}}{\varepsilon}\,\log\frac{\kappa d}{\varepsilon}\right).

Using a similar argument we have the bound for the expected running time per step. This
completes the proof.                                                                 

Acknowledgments. We are grateful to Yin Tat Lee and Andre Wibisono for helpful discussions.
We also thank the anonymous referees for their valuable comments and suggestions.


References
 [1] David Applegate and Ravi Kannan: Sampling and integration of near log-concave functions.
     In Proc. 23rd STOC, pp. 156–163. ACM Press, 1991. [doi:10.1145/103418.103439] 5

 [2] Yu Cao, Jianfeng Lu, and Lihan Wang: On explicit 𝐿2 -convergence rate estimate for
     underdamped Langevin dynamics, 2019. [arXiv:1908.04746] 5, 6

 [3] Niladri S. Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L. Bartlett, and Michael I.
     Jordan: On the theory of variance reduction for stochastic gradient Monte Carlo. In Proc.
     35th Internat. Conf. on Machine Learning (ICML’18), pp. 80:764–773. MLR Press, 2018.
     PMLR. [arXiv:1802.05431] 5

 [4] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu: Fast mixing of Metropolized
     Hamiltonian Monte Carlo: Benefits of multi-step gradients. JMLR, 21(92):1–72, 2020. JMLR.
     [arXiv:1905.12247] 5

 [5] Zongchen Chen and Santosh S. Vempala: Optimal convergence rate of Hamiltonian Monte
     Carlo for strongly logconcave distributions. In Proc. 23rd Internat. Conf. on Randomization
     and Computation (RANDOM’19), pp. 64:1–12. Schloss Dagstuhl–Leibniz-Zentrum fuer
     Informatik, 2019. [doi:10.4230/LIPIcs.APPROX-RANDOM.2019.64, arXiv:1905.02313] 1, 5

 [6] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I.
     Jordan: Sharp convergence rates for Langevin dynamics in the nonconvex setting, 2018.
     [arXiv:1805.01648] 5

 [7] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan: Underdamped
     Langevin MCMC: A non-asymptotic analysis. In Proc. 31st Ann. Conf. on Learning Theory
     (COLT’18), pp. 75:1–24. MLR Press, 2018. PMLR. [arXiv:1707.03663] 5


 [8] Arnak S. Dalalyan: Theoretical guarantees for approximate sampling from smooth and log-
     concave densities. J. Royal Statistical Soc. Ser. B, 79(3):651–676, 2017. [doi:10.1111/rssb.12183,
     arXiv:1412.7392] 5

 [9] Arnak S. Dalalyan and Avetik Karagulyan: User-friendly guarantees for the Langevin
     Monte Carlo with inaccurate gradient. Stoch. Proc. Appl., 129(12):5278–5311, 2019.
     [doi:10.1016/j.spa.2019.02.016, arXiv:1710.00095] 5

[10] Arnak S. Dalalyan and Lionel Riou-Durand: On sampling from a log-concave density
     using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020. [doi:10.3150/19-BEJ1178,
     arXiv:1807.09382] 5

[11] Arnak S. Dalalyan, Lionel Riou-Durand, and Avetik Karagulyan: Bounding the
     error of discretized Langevin algorithms for non-strongly log-concave targets, 2019.
     [arXiv:1906.08530] 5

[12] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu: Log-concave sampling:
     Metropolis-Hastings algorithms are fast! JMLR, 20(183):1–42, 2019. JMLR. Preliminary
     version in COLT’18. [arXiv:1801.02309] 5

[13] Yin Tat Lee, Ruoqi Shen, and Kevin Tian: Logsmooth gradient concentration and tighter
     runtimes for Metropolized Hamiltonian Monte Carlo. In Proc. 33rd Ann. Conf. on Learning
     Theory (COLT’20), pp. 125:2565–2597. MLR Press, 2020. PMLR. [arXiv:2002.04121] 5

[14] Yin Tat Lee, Zhao Song, and Santosh S. Vempala: Algorithmic theory of ODEs and
     sampling from well-conditioned logconcave densities, 2018. [arXiv:1812.06243] 2, 5, 6, 10,
     12, 13, 14

[15] Yin Tat Lee and Santosh S. Vempala: Convergence rate of Riemannian Hamiltonian Monte
     Carlo and faster polytope volume computation. In Proc. 50th STOC, pp. 1115–1121. ACM
     Press, 2018. [doi:10.1145/3188745.3188774, arXiv:1710.06261] 2, 5

[16] László Lovász and Santosh S. Vempala: Fast algorithms for logconcave functions: Sam-
     pling, rounding, integration and optimization. In Proc. 47th FOCS, pp. 57–68. IEEE Comp.
     Soc., 2006. [doi:10.1109/FOCS.2006.28] 5

[17] Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and
     Michael I. Jordan: Is there an analog of Nesterov acceleration for MCMC?, 2019.
     [arXiv:1902.00996] 5

[18] Oren Mangoubi and Aaron Smith: Mixing of Hamiltonian Monte Carlo on strongly
     log-concave distributions 1: Continuous dynamics, 2017. [arXiv:1708.07114] 2, 5, 6

[19] Oren Mangoubi and Aaron Smith: Mixing of Hamiltonian Monte Carlo on strongly
     log-concave distributions 2: Numerical integrators. In Proc. 22nd Internat. Conf. Artificial
     Intelligence and Statistics (AISTATS’19), pp. 89:586–595. MLR Press, 2019. PMLR. 2, 5


[20] Oren Mangoubi and Nisheeth Vishnoi: Dimensionally tight bounds for second-order
     Hamiltonian Monte Carlo. In Proc. 31st Adv. Neural Info. Proc. Sys. (NeurIPS’18), pp.
     6027–6037. Curran Assoc., 2018. NeurIPS. 2, 5

[21] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan:
     High-order Langevin diffusion yields an accelerated MCMC algorithm. JMLR, 22(42):1–41,
     2021. JMLR. [arXiv:1908.10859] 5

[22] Yann Ollivier: Ricci curvature of Markov chains on metric spaces. J. Functional Anal.,
     256(3):810–864, 2009. [doi:10.1016/j.jfa.2008.11.001, arXiv:math/0701886] 4, 9

[23] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky: Non-convex learning via
     stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Proc. 30th Ann. Conf.
     on Learning Theory (COLT’17), pp. 65:1–30. MLR Press, 2017. PMLR. [arXiv:1702.03849] 5

[24] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes: Positive curvature and
     Hamiltonian Monte Carlo. In Proc. 27th Adv. Neural Info. Proc. Sys. (NIPS’14), pp. 586–594.
     Curran Assoc., 2014. NIPS. 5

[25] Ruoqi Shen and Yin Tat Lee: The Randomized Midpoint Method for log-concave sampling.
     In Proc. 32nd Adv. Neural Info. Proc. Sys. (NeurIPS’19), pp. 2100–2111. Curran Assoc., 2019.
     NeurIPS. [arXiv:1909.05503] 5, 6

[26] Santosh S. Vempala and Andre Wibisono: Rapid convergence of the unadjusted Langevin
     algorithm: Isoperimetry suffices. In Proc. 32nd Adv. Neural Info. Proc. Sys. (NeurIPS’19), pp.
     8094–8106. Curran Assoc., 2019. NeurIPS. [arXiv:1903.08568] 5

[27] Andre Wibisono: Sampling as optimization in the space of measures: The Langevin
     dynamics as a composite optimization problem. In Proc. 31st Ann. Conf. on Learning Theory
     (COLT’18), pp. 75:2093–3027. MLR Press, 2018. PMLR. [arXiv:1802.08089] 5

[28] Andre Wibisono: Proximal Langevin algorithm: Rapid convergence under isoperimetry,
     2019. [arXiv:1911.01469] 5

[29] Yuchen Zhang, Percy S. Liang, and Moses Charikar: A hitting time analysis of stochastic
     gradient Langevin dynamics. In Proc. 30th Ann. Conf. on Learning Theory (COLT’17), pp.
     65:1980–2022. MLR Press, 2017. PMLR. [arXiv:1702.05575] 5

[30] Xingyu Zhou: On the Fenchel Duality between strong convexity and Lipschitz Continuous
     Gradient, 2018. SLIDES. [arXiv:1803.06573] 8





AUTHORS

    Zongchen Chen
    Instructor
    Department of Mathematics
    Massachusetts Institute of Technology
    Cambridge, MA 02139, USA
    zongchen mit edu
    https://sites.google.com/view/zongchenchen


    Santosh S. Vempala
    Professor
    College of Computing
    Georgia Institute of Technology
    Atlanta, GA 30332, USA
    vempala gatech edu
    https://www.cc.gatech.edu/~vempala/


ABOUT THE AUTHORS

    Zongchen Chen is an instructor (postdoc) in Mathematics at the Massachusetts
      Institute of Technology. He obtained his Ph. D. in Algorithms, Combinatorics
      and Optimization at the Georgia Institute of Technology in 2021, advised by
      Eric Vigoda. Before that, he received his B. S. in Mathematics and Applied
      Mathematics from the Zhiyuan College at the Shanghai Jiao Tong University in
      2016. He has broad interests in randomized algorithms, discrete probability, and
      combinatorics. Currently, his research focuses on Markov chain Monte Carlo
      (MCMC) methods for sampling from Gibbs distributions, and machine learning
      problems related to undirected graphical models.


    Santosh Vempala is the Frederick Storey professor in the College of Computing
       and former director of the Algorithms and Randomness Center at Georgia Tech.
       His research interests, ironically, are in algorithms, randomness, and geometry.
       He graduated from CMU in 1997 following the advice of Avrim Blum and
       was at MIT till 2006 except for a year as a Miller fellow at UC Berkeley. He
       gets unreasonably excited when a phenomenon that appears complex from one
       perspective turns out to be simple from another. At Georgia Tech, he started the
       Computing-for-Good (C4G) program.



