Authors Zongchen Chen, Santosh S. Vempala,
License CC-BY-3.0
Theory of Computing, Volume 18 (9), 2022, pp. 1–18
www.theoryofcomputing.org
Special Issue: APPROX-RANDOM 2019
Optimal Convergence Rate of
Hamiltonian Monte Carlo for Strongly
Logconcave Distributions
Zongchen Chen∗ Santosh S. Vempala†
Received February 29, 2020; Revised September 23, 2020; Published April 30, 2022
Abstract. We study the Hamiltonian Monte Carlo (HMC) algorithm for sampling from
a strongly logconcave density proportional to e^{−𝑓} where 𝑓 : ℝ^𝑑 → ℝ is 𝜇-strongly
convex and 𝐿-smooth (the condition number is 𝜅 = 𝐿/𝜇). We show that the relaxation
time (inverse of the spectral gap) of ideal HMC is 𝑂(𝜅), improving on the previous
best bound of 𝑂(𝜅^{1.5}) (Lee et al., 2018); we complement this with an example where
the relaxation time is Ω(𝜅), for any step-size. When implemented with an ODE solver,
HMC returns an 𝜀-approximate point in 2-Wasserstein distance using Õ((𝜅𝑑)^{0.5} 𝜀^{−1})
gradient evaluations per step and Õ((𝜅𝑑)^{1.5} 𝜀^{−1}) total time.
A conference version [5] of this paper appeared in the Proceedings of the 23rd International Conference on
Randomization and Computation (RANDOM 2019).
∗ Supported by CCF-1563838 and CCF-1617306.
† Supported by CCF-1563838, CCF-1717349 and DMS-1839323.
ACM Classification: Theory of computation - Random walks and Markov chains, Theory of
computation - Design and analysis of algorithms
AMS Classification: 60J05, 60J25, 68W20
Key words and phrases: logconcave distribution, sampling, Hamiltonian Monte Carlo, spectral
gap, strong convexity
© 2022 Zongchen Chen and Santosh S. Vempala
Licensed under a Creative Commons Attribution License (CC-BY). DOI: 10.4086/toc.2022.v018a009
1 Introduction
Sampling logconcave densities is a basic problem that arises in machine learning, statistics,
optimization, computer science and other areas. The problem is described as follows. Let
𝑓 : ℝ^𝑑 → ℝ be a convex function. Our goal is to sample from the density proportional to e^{−𝑓(𝑥)}.
We study Hamiltonian Monte Carlo (HMC), one of the most widely used Markov chain Monte Carlo
(MCMC) algorithms for sampling from a probability distribution. In many settings, HMC is
believed to outperform other MCMC algorithms such as the Metropolis–Hastings algorithm or
Langevin dynamics. In terms of theory, rapid mixing has been established for HMC in recent
papers [14, 15, 18, 19, 20] in various settings. However, in spite of much progress, a gap remains
between the known upper and lower bounds even in the basic setting when 𝑓 is strongly convex
(e^{−𝑓} is strongly logconcave) and has a Lipschitz gradient.
Many sampling algorithms such as the Metropolis–Hastings algorithm or Langevin dynamics
maintain a position 𝑥 = 𝑥(𝑡) that changes with time, so that the distribution of 𝑥 will eventually
converge to the desired distribution, i. e., the one proportional to e^{−𝑓(𝑥)}. In HMC, besides the position
𝑥 = 𝑥(𝑡), we also maintain a velocity 𝑣 = 𝑣(𝑡). In the simplest Euclidean setting, the Hamiltonian
𝐻(𝑥, 𝑣) is defined as
𝐻(𝑥, 𝑣) = 𝑓(𝑥) + (1/2)‖𝑣‖².  (1.1)
Then in every step the pair (𝑥, 𝑣) is updated using the following system of differential equations
for a fixed time interval 𝑇:
d𝑥(𝑡)/d𝑡 = ∇_𝑣 𝐻(𝑥, 𝑣) = 𝑣(𝑡),
d𝑣(𝑡)/d𝑡 = −∇_𝑥 𝐻(𝑥, 𝑣) = −∇𝑓(𝑥(𝑡)).  (1.2)
The initial position 𝑥(0) = 𝑥0 is the position from the last step, and the initial velocity 𝑣(0) = 𝑣 0
is chosen randomly from the standard Gaussian distribution 𝑁(0, 𝐼). The updated position is
𝑥(𝑇) where 𝑇 can be thought of as the step-size. It is well known that the stationary distribution
of HMC is the density proportional to e^{−𝑓}. Observe that
d𝐻(𝑥, 𝑣)/d𝑡 = ∇_𝑥 𝐻(𝑥, 𝑣) · 𝑥′(𝑡) + ∇_𝑣 𝐻(𝑥, 𝑣) · 𝑣′(𝑡) = 0,
so the Hamiltonian 𝐻(𝑥, 𝑣) does not change with 𝑡. We can also write (1.2) as the following
ordinary differential equation (ODE):
𝑥″(𝑡) = −∇𝑓(𝑥(𝑡)),  𝑥(0) = 𝑥₀,  𝑥′(0) = 𝑣₀.  (1.3)
We state HMC explicitly as the following algorithm; also see Figure 1 for an illustration.
Hamiltonian Monte Carlo algorithm
Input: 𝑓 : ℝ^𝑑 → ℝ that is 𝜇-strongly convex and 𝐿-smooth; 𝜀, the error parameter.
Figure 1: A trajectory of Hamiltonian Monte Carlo, passing through positions 𝑥^{(0)}, 𝑥^{(1)}, 𝑥^{(2)}, 𝑥^{(3)}.
1. Set starting point 𝑥^{(0)}, step-size 𝑇, number of steps 𝑁, and ODE error tolerance 𝛿.
2. For 𝑘 = 1, . . . , 𝑁:
(a) Let 𝑣 ∼ 𝑁(0, 𝐼);
(b) Denote by 𝑥(𝑡) the solution to (1.2) with initial position 𝑥(0) = 𝑥^{(𝑘−1)} and initial
velocity 𝑣(0) = 𝑣. Use the ODE solver to find a point 𝑥^{(𝑘)} such that
‖𝑥^{(𝑘)} − 𝑥(𝑇)‖ ≤ 𝛿.
3. Output 𝑥^{(𝑁)}.
In our analysis, we first consider ideal HMC where in every step we have the exact solution
to the ODE (1.2) and neglect the numerical error from solving the ODEs or integration (𝛿 = 0).
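The loop above is easy to prototype. The following Python sketch is an illustrative stand-in (not the paper's implementation): step 2(b)'s ODE solver is replaced by a standard leapfrog integrator, and the hypothetical example target is the standard Gaussian, 𝑓(𝑥) = ‖𝑥‖²/2, so 𝜇 = 𝐿 = 1.

```python
import numpy as np

def hmc(grad_f, x0, T, n_steps, n_leapfrog=10, seed=0):
    """HMC with a fresh Gaussian velocity each step; step 2(b) is
    approximated by a leapfrog integrator (a stand-in for the paper's
    ODE solver, so there is a small discretization error here)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    h = T / n_leapfrog  # inner integration step for x'' = -grad f(x)
    samples = []
    for _ in range(n_steps):
        v = rng.standard_normal(x.shape)  # v ~ N(0, I)
        v = v - 0.5 * h * grad_f(x)       # half step for the velocity
        for _ in range(n_leapfrog - 1):
            x = x + h * v
            v = v - h * grad_f(x)
        x = x + h * v
        v = v - 0.5 * h * grad_f(x)       # final half step
        samples.append(x.copy())
    return np.array(samples)

# Hypothetical target: f(x) = ||x||^2 / 2, i.e., the density is N(0, I).
samples = hmc(lambda x: x, x0=np.zeros(1), T=0.5, n_steps=20000)
print(samples[2000:].mean(), samples[2000:].var())  # roughly 0 and 1
```

With step-size 𝑇 = 1/(2√𝐿) = 0.5 the empirical mean and variance of the chain approach those of the target, consistent with the stationarity claim above.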
1.1 Preliminaries
We recall some standard definitions here. Let 𝑓 : ℝ^𝑑 → ℝ be a continuously differentiable
function. For a positive real number 𝜇 we say that 𝑓 is 𝜇-strongly convex if for all 𝑥, 𝑦 ∈ ℝ^𝑑,
𝑓(𝑦) ≥ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ + (𝜇/2)‖𝑦 − 𝑥‖².
We say that 𝑓 is 𝐿-smooth if ∇𝑓 is 𝐿-Lipschitz, i. e., for all 𝑥, 𝑦 ∈ ℝ^𝑑,
‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖ ≤ 𝐿‖𝑥 − 𝑦‖.
If 𝑓 is 𝜇-strongly convex and 𝐿-smooth, then the condition number of 𝑓 is 𝜅 = 𝐿/𝜇. When 𝑓 is a
twice differentiable function, 𝑓 is 𝜇-strongly convex if and only if ∇²𝑓(𝑥) ⪰ 𝜇𝐼 for all 𝑥 ∈ ℝ^𝑑,
where ∇²𝑓(𝑥) denotes the Hessian matrix of 𝑓 at 𝑥 and 𝐼 is the identity matrix; similarly, 𝑓 is
𝐿-smooth if and only if ∇²𝑓(𝑥) ⪯ 𝐿𝐼 for all 𝑥 ∈ ℝ^𝑑.
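As a quick illustration of these definitions (our own example, not from the paper): for a quadratic 𝑓(𝑥) = (1/2)𝑥ᵀ𝐴𝑥 with symmetric positive definite 𝐴, the Hessian is 𝐴 everywhere, so 𝜇 and 𝐿 are simply the extreme eigenvalues of 𝐴.

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 * x^T A x: its Hessian is A everywhere,
# so f is mu-strongly convex and L-smooth with mu = lambda_min(A), L = lambda_max(A).
A = np.diag([1.0, 4.0])
eigs = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]
kappa = L / mu                # condition number kappa = L / mu
print(mu, L, kappa)
```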
Consider a discrete-time reversible Markov chain ℳ on ℝ^𝑑 with stationary distribution 𝜋. Let
𝐿²(𝜋) = { 𝑓 : ℝ^𝑑 → ℝ : ∫_{ℝ^𝑑} 𝑓(𝑥)² 𝜋(d𝑥) < ∞ }
be the Hilbert space with inner product
⟨𝑓, 𝑔⟩ = ∫_{ℝ^𝑑} 𝑓(𝑥)𝑔(𝑥) 𝜋(d𝑥)
for 𝑓, 𝑔 ∈ 𝐿²(𝜋). Denote by 𝑃 the transition kernel of ℳ. We can view 𝑃 as a self-adjoint operator
from 𝐿²(𝜋) to itself: for 𝑓 ∈ 𝐿²(𝜋),
(𝑃𝑓)(𝑥) = ∫_{ℝ^𝑑} 𝑓(𝑦) 𝑃(𝑥, d𝑦).
Let 𝐿²₀(𝜋) = { 𝑓 ∈ 𝐿²(𝜋) : ∫_{ℝ^𝑑} 𝑓(𝑥) 𝜋(d𝑥) = 0 } be a closed subspace of 𝐿²(𝜋). The (absolute) spectral
gap of 𝑃 is defined to be
𝛾(𝑃) = 1 − sup_{𝑓 ∈ 𝐿²₀(𝜋), 𝑓 ≠ 0} ‖𝑃𝑓‖/‖𝑓‖ = 1 − sup_{𝑓 ∈ 𝐿²₀(𝜋), ‖𝑓‖=1} |⟨𝑃𝑓, 𝑓⟩|.
The relaxation time of 𝑃 is 𝜏_rel(𝑃) = 1/𝛾(𝑃).
Let 𝜈₁, 𝜈₂ be two distributions on ℝ^𝑑 with finite 𝑝th moments, where 𝑝 ≥ 1. The 𝑝-Wasserstein
distance between 𝜈₁ and 𝜈₂ is defined as
𝑊_𝑝(𝜈₁, 𝜈₂) = inf_{(𝑋,𝑌)∈𝒞(𝜈₁,𝜈₂)} (𝔼‖𝑋 − 𝑌‖^𝑝)^{1/𝑝},
where 𝒞(𝜈₁, 𝜈₂) is the set of all couplings of 𝜈₁ and 𝜈₂. The Ricci curvature of 𝑃 is defined as
Ric(𝑃) = 1 − sup_{𝑥,𝑦∈ℝ^𝑑, 𝑥≠𝑦} 𝑊₁(𝑃(𝑥, ·), 𝑃(𝑦, ·)) / ‖𝑥 − 𝑦‖.
It is known that 𝛾(𝑃) ≥ Ric(𝑃) (see [22, Proposition 29]), which provides an effective way to
bound the spectral gap through coupling arguments. We refer to [22] for more background on
Ricci curvature.
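For intuition, these spectral quantities can be computed exactly for a small finite-state reversible chain (an illustrative analogue of the continuous-state setting above, not an object used in the paper): symmetrizing the kernel with respect to its stationary distribution exposes its real spectrum.

```python
import numpy as np

# Two-state reversible chain: P = [[1-a, a], [b, 1-b]],
# stationary pi = (b, a)/(a+b), eigenvalues 1 and 1-a-b.
a, b = 0.3, 0.2
P = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b, a]) / (a + b)

# D^{1/2} P D^{-1/2} is symmetric because P is reversible w.r.t. pi,
# so its eigenvalues are real and coincide with those of P.
d = np.sqrt(pi)
S = (d[:, None] * P) / d[None, :]
lam = np.sort(np.linalg.eigvalsh(S))  # ascending; the top eigenvalue is 1

gap = 1 - max(abs(l) for l in lam[:-1])  # absolute spectral gap gamma(P)
tau_rel = 1 / gap                        # relaxation time
print(gap, tau_rel)  # 0.5 and 2.0, since the second eigenvalue is 1-a-b = 0.5
```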
1.2 Related work
Various versions of Langevin dynamics have been studied in many recent papers, see [8, 9, 29,
23, 10, 7, 6, 3, 12, 27, 26, 17]. The convergence rate of HMC has also been studied recently in
[14, 15, 18, 19, 20, 24]. The first bound in our setting was obtained by Mangoubi and Smith [18],
who gave an 𝑂(𝜅²) bound on the convergence rate of ideal HMC.
Theorem 1.1 ([18, Theorem 1]). Let 𝑓 : ℝ^𝑑 → ℝ be a twice differentiable function such that
𝜇𝐼 ⪯ ∇²𝑓(𝑥) ⪯ 𝐿𝐼 for all 𝑥 ∈ ℝ^𝑑. Then the relaxation time of ideal HMC for sampling from the density
∝ e^{−𝑓} with step-size 𝑇 = √𝜇/(2√2 𝐿) is 𝑂(𝜅²).
This was improved by [14], which showed a bound of 𝑂(𝜅^{1.5}). They also gave an algorithm
with nearly optimal running time for solving the ODE that arises in the implementation of
HMC.
Theorem 1.2 ([14, Lemma 1.8]). Let 𝑓 : ℝ^𝑑 → ℝ be a twice differentiable function such that
𝜇𝐼 ⪯ ∇²𝑓(𝑥) ⪯ 𝐿𝐼 for all 𝑥 ∈ ℝ^𝑑. Then the relaxation time of ideal HMC for sampling from the density
∝ e^{−𝑓} with step-size 𝑇 = 𝜇^{1/4}/(2𝐿^{3/4}) is 𝑂(𝜅^{1.5}).
Both papers suggest that the correct bound is linear in 𝜅. It is mentioned in [18] that linear
dependency is the best one can expect. Meanwhile, [14] shows that there exists a choice of
step-sizes (the time for running the ODE), depending on 𝑓, the current position, and the random
initial velocity, that achieves a linear convergence rate (Lemma 1.8, second part); however, it was
far from clear how to determine these step-sizes algorithmically.
Other papers focus on various aspects and use stronger assumptions (e. g., bounds on
higher-order gradients) to get better bounds on the overall convergence time or the number
of gradient evaluations in some ranges of parameters. For example, [20] shows that the
dependence on dimension for the number of gradient evaluations can be as low as 𝑑^{1/4} with
suitable regularity assumptions (and higher dependence on the condition number). We note
also that sampling logconcave functions is a polynomial-time solvable problem, without the
assumptions of strong convexity or gradient Lipschitzness, and even when the function e− 𝑓 is
given only by an oracle with no access to gradients [1, 16]. The Riemannian version of HMC
provides a faster polynomial-time algorithm for uniformly sampling polytopes [15]. However,
the dependence on the dimension is significantly higher for these algorithms, both for the
contraction rate and the time per step.
After the conference version [5] of our paper, several articles appeared on sampling from
logconcave densities. For example, [2, 11, 21, 25, 28] study the Langevin dynamics and its
variants, and [4, 13] study Metropolized HMC. Shen and Lee [25] propose the Randomized
Midpoint Method to discretize the underdamped Langevin diffusion; their algorithm achieves
a smaller total number of gradient calls, 𝑁 = Õ(𝜅^{7/6}𝑑^{1/6}𝜀^{−1/3} + 𝜅𝑑^{1/3}𝜀^{−2/3}), and a shorter running time
compared to our HMC method.
1.3 Results
In this paper, we show that the relaxation time of ideal HMC is Θ(𝜅) for strongly logconcave
functions with Lipschitz gradient.
Theorem 1.3. Suppose that 𝑓 is 𝜇-strongly convex and 𝐿-smooth. Then the relaxation time (inverse of
the spectral gap) of ideal HMC for sampling from the density ∝ e^{−𝑓} with step-size 𝑇 = 1/(2√𝐿) is 𝑂(𝜅),
where 𝜅 = 𝐿/𝜇 is the condition number.
We remark that the only assumptions we make about 𝑓 are strong convexity and smoothness.
(In particular, we do not require that 𝑓 be twice differentiable, which is assumed in both [14] and
[18].)
We also establish a matching lower bound on the relaxation time of ideal HMC, implying
the tightness of Theorem 1.3.
Theorem 1.4. For any 𝐿 ≥ 𝜇 > 0 and 𝑇 > 0, there exists a 𝜇-strongly convex and 𝐿-smooth function 𝑓 ,
such that the relaxation time of ideal HMC for sampling from the density ∝ e− 𝑓 with step-size 𝑇 is Ω(𝜅),
where 𝜅 = 𝐿/𝜇 is the condition number.
Using the ODE solver from [14] with nearly optimal running time, we obtain the following
convergence rate in 2-Wasserstein distance for the HMC algorithm.
Theorem 1.5. Let 𝜋 ∝ e^{−𝑓} be the target distribution, and let 𝜋_hmc be the distribution of the output of
HMC with starting point 𝑥^{(0)} = arg min_𝑥 𝑓(𝑥), step-size 𝑇 = 1/(16000√𝐿), and ODE error tolerance
𝛿 = √𝜇 𝑇²𝜀/16. For any 0 < 𝜀 < √𝑑, if we run HMC for 𝑁 = 𝑂(𝜅 log(𝑑/𝜀)) steps where 𝜅 = 𝐿/𝜇,
then we have
𝑊₂(𝜋_hmc, 𝜋) ≤ 𝜀/√𝜇.  (1.4)
Each step takes 𝑂(√𝜅 𝑑^{3/2} 𝜀^{−1} log(𝜅𝑑/𝜀)) time and 𝑂(√(𝜅𝑑) 𝜀^{−1} log(𝜅𝑑/𝜀)) evaluations of ∇𝑓, amortized
over all steps.
We note that since our new convergence rate allows larger steps, the ODE solver is run over a
longer time interval in each step. One interesting open problem is to find a matching lower bound for HMC
with respect to the 2-Wasserstein distance.
The comparison of convergence rates, running times and numbers of gradient evaluations is
summarized in the following table with polylog factors omitted.
reference     convergence rate     # gradients          total time
[18]          𝜅²                   𝜅^{6.5} 𝑑^{0.5}      𝜅^{6.5} 𝑑^{1.5}
[14]          𝜅^{1.5}              𝜅^{1.75} 𝑑^{0.5}     𝜅^{1.75} 𝑑^{1.5}
this paper    𝜅                    𝜅^{1.5} 𝑑^{0.5}      𝜅^{1.5} 𝑑^{1.5}
Finally, we remark that Theorem 1.4 also implies an Ω(𝜅) computational lower bound for
HMC in terms of the number of gradient queries. This differs from [25] by a factor of Õ(𝜅^{1/6}).
As pointed out by one of the referees, a comparison with the optimization literature seems to
suggest that the optimal number of gradient queries achievable by any gradient-based
MCMC algorithm should scale as √𝜅. This corroborates the fine analysis in [2] showing that the
convergence rate of continuous underdamped Langevin dynamics is 𝑂(√𝑚), where 𝑚 is the
Poincaré constant of the measure, suggesting that an ideal discretization of underdamped
Langevin dynamics with step-size 𝑂(√(1/𝐿)) might be able to achieve 𝑂(√𝜅) convergence. In
this sense, HMC probably cannot achieve the optimal convergence rate.
2 Convergence of ideal HMC
In this section we show that the spectral gap of ideal HMC is Ω(1/𝜅), and thus prove Theorem 1.3.
We first show a contraction bound for ideal HMC, which roughly says that the distance between two
points shrinks after one step of ideal HMC.
Lemma 2.1 (Contraction bound). Suppose that 𝑓 is 𝜇-strongly convex and 𝐿-smooth. Let 𝑥(𝑡) and
𝑦(𝑡) be the solutions to (1.2) with initial positions 𝑥(0), 𝑦(0) and initial velocities 𝑥′(0) = 𝑦′(0). Then for
0 ≤ 𝑡 ≤ 1/(2√𝐿) we have
‖𝑥(𝑡) − 𝑦(𝑡)‖² ≤ (1 − (𝜇/4)𝑡²) ‖𝑥(0) − 𝑦(0)‖².  (2.1)
In particular, by setting 𝑡 = 𝑇 = 1/(𝑐√𝐿) for some constant 𝑐 ≥ 2 we get
‖𝑥(𝑇) − 𝑦(𝑇)‖² ≤ (1 − 1/(4𝑐²𝜅)) ‖𝑥(0) − 𝑦(0)‖²,  (2.2)
where 𝜅 = 𝐿/𝜇.
Proof. Consider the two ODEs for HMC:
𝑥′(𝑡) = 𝑢(𝑡),  𝑢′(𝑡) = −∇𝑓(𝑥(𝑡))   and   𝑦′(𝑡) = 𝑣(𝑡),  𝑣′(𝑡) = −∇𝑓(𝑦(𝑡)),
with initial positions 𝑥(0), 𝑦(0) and initial velocities 𝑢(0) = 𝑣(0). We will show that
‖𝑥(𝑡) − 𝑦(𝑡)‖² ≤ (1 − (𝜇/4)𝑡²) ‖𝑥(0) − 𝑦(0)‖²
for all 0 ≤ 𝑡 ≤ 1/(2√𝐿).
Consider the derivative of (1/2)‖𝑥(𝑡) − 𝑦(𝑡)‖²:
(d/d𝑡)(1/2)‖𝑥(𝑡) − 𝑦(𝑡)‖² = ⟨𝑥′(𝑡) − 𝑦′(𝑡), 𝑥(𝑡) − 𝑦(𝑡)⟩ = ⟨𝑢(𝑡) − 𝑣(𝑡), 𝑥(𝑡) − 𝑦(𝑡)⟩.  (2.3)
Taking the derivative on both sides, we get
(d²/d𝑡²)(1/2)‖𝑥(𝑡) − 𝑦(𝑡)‖² = ⟨𝑢′(𝑡) − 𝑣′(𝑡), 𝑥(𝑡) − 𝑦(𝑡)⟩ + ⟨𝑢(𝑡) − 𝑣(𝑡), 𝑥′(𝑡) − 𝑦′(𝑡)⟩
 = −⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑥(𝑡) − 𝑦(𝑡)⟩ + ‖𝑢(𝑡) − 𝑣(𝑡)‖²
 = −𝜌(𝑡)‖𝑥(𝑡) − 𝑦(𝑡)‖² + ‖𝑢(𝑡) − 𝑣(𝑡)‖²,  (2.4)
where we define
𝜌(𝑡) = ⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑥(𝑡) − 𝑦(𝑡)⟩ / ‖𝑥(𝑡) − 𝑦(𝑡)‖².
Since 𝑓 is 𝜇-strongly convex and 𝐿-smooth, we have 𝜇 ≤ 𝜌(𝑡) ≤ 𝐿 for all 𝑡 ≥ 0.
We will give an upper bound on the term −𝜌(𝑡)‖𝑥(𝑡) − 𝑦(𝑡)‖² + ‖𝑢(𝑡) − 𝑣(𝑡)‖², while keeping
its dependency on 𝜌(𝑡). To lower bound ‖𝑥(𝑡) − 𝑦(𝑡)‖², we use the following crude bound.
Claim 2.2 (Crude bound). For all 0 ≤ 𝑡 ≤ 1/(2√𝐿) we have
(1/2)‖𝑥(0) − 𝑦(0)‖² ≤ ‖𝑥(𝑡) − 𝑦(𝑡)‖² ≤ 2‖𝑥(0) − 𝑦(0)‖².  (2.5)
The proof of this claim is postponed to Section 2.1.
Next we derive an upper bound on ‖𝑢(𝑡) − 𝑣(𝑡)‖². Taking the derivative of (1/2)‖𝑢(𝑡) − 𝑣(𝑡)‖²,
we get
‖𝑢(𝑡) − 𝑣(𝑡)‖ · (d/d𝑡)‖𝑢(𝑡) − 𝑣(𝑡)‖ = (d/d𝑡)(1/2)‖𝑢(𝑡) − 𝑣(𝑡)‖²
 = ⟨𝑢′(𝑡) − 𝑣′(𝑡), 𝑢(𝑡) − 𝑣(𝑡)⟩
 = −⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑢(𝑡) − 𝑣(𝑡)⟩.
Thus, its absolute value satisfies
|(d/d𝑡)‖𝑢(𝑡) − 𝑣(𝑡)‖| = |⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑢(𝑡) − 𝑣(𝑡)⟩| / ‖𝑢(𝑡) − 𝑣(𝑡)‖ ≤ ‖∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡))‖.
Since 𝑓 is 𝐿-smooth and convex, it holds that
‖∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡))‖² ≤ 𝐿 ⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑥(𝑡) − 𝑦(𝑡)⟩;
see, e. g., [30, Lemma 4] for a proof. Hence, we obtain
‖∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡))‖² ≤ 𝐿𝜌(𝑡)‖𝑥(𝑡) − 𝑦(𝑡)‖² ≤ 2𝐿𝜌(𝑡)‖𝑥(0) − 𝑦(0)‖²,
where the last inequality follows from the crude bound (2.5). Then, using the fact that 𝑢(0) = 𝑣(0)
and the Cauchy–Schwarz inequality, we can upper bound ‖𝑢(𝑡) − 𝑣(𝑡)‖² by
‖𝑢(𝑡) − 𝑣(𝑡)‖² ≤ (∫₀ᵗ |(d/d𝑠)‖𝑢(𝑠) − 𝑣(𝑠)‖| d𝑠)²
 ≤ (∫₀ᵗ √(2𝐿𝜌(𝑠)) ‖𝑥(0) − 𝑦(0)‖ d𝑠)²
 ≤ 2𝐿𝑡 (∫₀ᵗ 𝜌(𝑠) d𝑠) ‖𝑥(0) − 𝑦(0)‖².
Define the function
𝑃(𝑡) = ∫₀ᵗ 𝜌(𝑠) d𝑠,
so 𝑃(𝑡) is nonnegative and monotone increasing, with 𝑃(0) = 0. Also we have 𝜇𝑡 ≤ 𝑃(𝑡) ≤ 𝐿𝑡 for
all 𝑡 ≥ 0. Then,
‖𝑢(𝑡) − 𝑣(𝑡)‖² ≤ 2𝐿𝑡𝑃(𝑡)‖𝑥(0) − 𝑦(0)‖².  (2.6)
Plugging (2.5) and (2.6) into (2.4), we deduce that
(d²/d𝑡²)(1/2)‖𝑥(𝑡) − 𝑦(𝑡)‖² ≤ −(1/2)𝜌(𝑡)‖𝑥(0) − 𝑦(0)‖² + 2𝐿𝑡𝑃(𝑡)‖𝑥(0) − 𝑦(0)‖².
If we define
𝛼(𝑡) = (1/2)‖𝑥(𝑡) − 𝑦(𝑡)‖²,
then we have
𝛼″(𝑡) ≤ −𝛼(0)(𝜌(𝑡) − 4𝐿𝑡𝑃(𝑡)).
Integrating both sides and using 𝛼′(0) = 0, we obtain
𝛼′(𝑡) = ∫₀ᵗ 𝛼″(𝑠) d𝑠
 ≤ −𝛼(0) (∫₀ᵗ 𝜌(𝑠) d𝑠 − 4𝐿 ∫₀ᵗ 𝑠𝑃(𝑠) d𝑠)
 ≤ −𝛼(0) (𝑃(𝑡) − 4𝐿𝑃(𝑡) ∫₀ᵗ 𝑠 d𝑠)
 = −𝛼(0)𝑃(𝑡)(1 − 2𝐿𝑡²),
where the second inequality is due to the monotonicity of 𝑃(𝑠). Since for all 0 ≤ 𝑡 ≤ 1/(2√𝐿) we
have 𝑃(𝑡) ≥ 𝜇𝑡 and 1 − 2𝐿𝑡² ≥ 1/2, we deduce that
𝛼′(𝑡) ≤ −𝛼(0)(𝜇/2)𝑡.
Finally, one more integration yields
𝛼(𝑡) = 𝛼(0) + ∫₀ᵗ 𝛼′(𝑠) d𝑠 ≤ 𝛼(0)(1 − (𝜇/4)𝑡²),
and the lemma follows.
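As a numerical sanity check of Lemma 2.1 (our own illustration, assuming a separable quadratic 𝑓 for which the Hamiltonian flow is known in closed form): with 𝑓(𝑥) = (1/2)Σᵢ 𝜆ᵢ𝑥ᵢ², two trajectories started with the same velocity differ by (𝑥ᵢ(0) − 𝑦ᵢ(0)) cos(√𝜆ᵢ 𝑡) in coordinate 𝑖, so the contraction bound (2.1) can be verified directly on a grid of times.

```python
import numpy as np

# f(x) = 0.5 * sum(lam_i * x_i^2): each coordinate of x'' = -grad f(x) evolves
# independently, and two solutions with equal initial velocities differ by
# (x_i(0) - y_i(0)) * cos(sqrt(lam_i) * t) in coordinate i.
lam = np.array([1.0, 9.0, 25.0])     # mu = 1, L = 25
mu, L = lam.min(), lam.max()
delta0 = np.array([1.0, -2.0, 0.5])  # x(0) - y(0)

ok = True
for t in np.linspace(0.0, 1 / (2 * np.sqrt(L)), 101):
    dist2 = np.sum((delta0 * np.cos(np.sqrt(lam) * t)) ** 2)
    bound = (1 - mu * t ** 2 / 4) * np.sum(delta0 ** 2)
    ok = ok and dist2 <= bound + 1e-12
print(ok)  # True: (2.1) holds along the whole interval [0, 1/(2 sqrt(L))]
```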
Proof of Theorem 1.3. Recall from Section 1.1 that the Ricci curvature of a Markov chain 𝑃 is
defined to be 1 − sup_{𝑥≠𝑦} 𝑊₁(𝑃(𝑥, ·), 𝑃(𝑦, ·))/‖𝑥 − 𝑦‖, where 𝑊₁ is the 1-Wasserstein distance.
Coupling the two chains so that they use the same initial velocity in each step, Lemma 2.1
together with √(1 − 𝑧) ≤ 1 − 𝑧/2 immediately implies that for any constant 𝑐 ≥ 2, the Ricci
curvature of ideal HMC with step-size 𝑇 = 1/(𝑐√𝐿) is at least 1/(8𝑐²𝜅). It follows from [22,
Proposition 29] that the spectral gap of ideal HMC is at least 1/(8𝑐²𝜅). Hence, the relaxation
time is at most 8𝑐²𝜅 = 𝑂(𝜅).
2.1 Proof of Claim 2.2
We present the proof of Claim 2.2 in this section. We remark that a similar crude bound was
established in [14] for general matrix ODEs. Here we prove the crude bound specifically for the
Hamiltonian ODE, but without assuming that 𝑓 is twice differentiable.
Proof of Claim 2.2. We first derive a crude upper bound on ‖𝑢(𝑡) − 𝑣(𝑡)‖. Since 𝑓 is 𝐿-smooth,
we have
(d/d𝑡)‖𝑢(𝑡) − 𝑣(𝑡)‖ = −⟨∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡)), 𝑢(𝑡) − 𝑣(𝑡)⟩ / ‖𝑢(𝑡) − 𝑣(𝑡)‖
 ≤ ‖∇𝑓(𝑥(𝑡)) − ∇𝑓(𝑦(𝑡))‖ ≤ 𝐿‖𝑥(𝑡) − 𝑦(𝑡)‖.
Then from 𝑢(0) = 𝑣(0) we get
‖𝑢(𝑡) − 𝑣(𝑡)‖ = ∫₀ᵗ (d/d𝑠)‖𝑢(𝑠) − 𝑣(𝑠)‖ d𝑠 ≤ 𝐿 ∫₀ᵗ ‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠.
To obtain the upper bound for ‖𝑥(𝑡) − 𝑦(𝑡)‖, we first bound its derivative by
(d/d𝑡)‖𝑥(𝑡) − 𝑦(𝑡)‖ = ⟨𝑢(𝑡) − 𝑣(𝑡), 𝑥(𝑡) − 𝑦(𝑡)⟩ / ‖𝑥(𝑡) − 𝑦(𝑡)‖ ≤ ‖𝑢(𝑡) − 𝑣(𝑡)‖ ≤ 𝐿 ∫₀ᵗ ‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠.  (2.7)
Therefore, we have
‖𝑥(𝑡) − 𝑦(𝑡)‖ = ‖𝑥(0) − 𝑦(0)‖ + ∫₀ᵗ (d/d𝑠)‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠
 ≤ ‖𝑥(0) − 𝑦(0)‖ + 𝐿 ∫₀ᵗ ∫₀ˢ ‖𝑥(𝑟) − 𝑦(𝑟)‖ d𝑟 d𝑠
 = ‖𝑥(0) − 𝑦(0)‖ + 𝐿 ∫₀ᵗ (𝑡 − 𝑠)‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠.
Then, applying a Grönwall-type inequality from [14, Lemma A.5], we deduce that
‖𝑥(𝑡) − 𝑦(𝑡)‖ ≤ ‖𝑥(0) − 𝑦(0)‖ cosh(√𝐿 𝑡) ≤ √2 ‖𝑥(0) − 𝑦(0)‖,  (2.8)
where we also use the fact that cosh(√𝐿 𝑡) ≤ cosh(1/2) ≤ √2.
Next, we deduce from (2.7) and (2.8) that
(d/d𝑡)‖𝑥(𝑡) − 𝑦(𝑡)‖ ≥ −𝐿 ∫₀ᵗ ‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠
 ≥ −𝐿 ∫₀ᵗ ‖𝑥(0) − 𝑦(0)‖ cosh(√𝐿 𝑠) d𝑠
 = −√𝐿 ‖𝑥(0) − 𝑦(0)‖ sinh(√𝐿 𝑡).
Thus, we obtain
‖𝑥(𝑡) − 𝑦(𝑡)‖ = ‖𝑥(0) − 𝑦(0)‖ + ∫₀ᵗ (d/d𝑠)‖𝑥(𝑠) − 𝑦(𝑠)‖ d𝑠
 ≥ ‖𝑥(0) − 𝑦(0)‖ − √𝐿 ‖𝑥(0) − 𝑦(0)‖ ∫₀ᵗ sinh(√𝐿 𝑠) d𝑠
 = ‖𝑥(0) − 𝑦(0)‖ (2 − cosh(√𝐿 𝑡)) ≥ (1/√2)‖𝑥(0) − 𝑦(0)‖,
where we use 2 − cosh(√𝐿 𝑡) ≥ 2 − cosh(1/2) ≥ 1/√2.
3 Lower bound for ideal HMC
In this section, we show that for any step-size 𝑇 there is a 𝜇-strongly convex and 𝐿-smooth
function for which the relaxation time of ideal HMC is Ω(𝜅), thus proving Theorem 1.4.
Consider the following one-dimensional quadratic function:
𝑓(𝑥) = 𝑥²/(2𝜎²),  (3.1)
where 1/√𝐿 ≤ 𝜎 ≤ 1/√𝜇. Thus, 𝑓 is 𝜇-strongly convex and 𝐿-smooth. The probability density 𝜈
proportional to e^{−𝑓} is the Gaussian distribution: for 𝑥 ∈ ℝ,
𝜈(𝑥) = (1/(√(2𝜋) 𝜎)) exp(−𝑥²/(2𝜎²)).  (3.2)
The following lemma shows that, for a certain choice of 𝜎, ideal HMC for the Gaussian distribution
𝜈 can have relaxation time Ω(𝜅). Theorem 1.4 then follows immediately.
Lemma 3.1. For any 𝑇 > 0 there exists 𝜎 with 1/√𝐿 ≤ 𝜎 ≤ 1/√𝜇 such that the relaxation time of ideal HMC
for sampling from 𝜈 with step-size 𝑇 is at least 𝜅/(8𝜋²).
Proof. The Hamiltonian curve for 𝑓 is given by the ODE
𝑥″(𝑡) = −𝑓′(𝑥(𝑡)) = −𝑥(𝑡)/𝜎²
with initial position 𝑥(0) ∈ ℝ and initial velocity 𝑥′(0) = 𝑣(0) drawn from the standard Gaussian 𝑁(0, 1).
Solving the ODE and plugging in the step-size 𝑡 = 𝑇, we get
𝑥(𝑇) = 𝑥(0) cos(𝑇/𝜎) + 𝑣(0) 𝜎 sin(𝑇/𝜎).
Consider first the case that 𝑇 > 4𝜋/√𝐿. We may assume that 𝐿 ≥ 4𝜇 (i. e., 𝜅 ≥ 4), since
otherwise the lemma trivially holds. Pick a positive integer 𝑘 such that
𝑇√𝜇/(2𝜋) ≤ 𝑘 ≤ 𝑇√𝐿/(2𝜋).
Note that such a 𝑘 exists since
𝑇√𝐿/(2𝜋) − 𝑇√𝜇/(2𝜋) ≥ 𝑇√𝐿/(4𝜋) > 1.
We then choose 𝜎 = 𝑇/(2𝑘𝜋), which satisfies 1/√𝐿 ≤ 𝜎 ≤ 1/√𝜇. Observe that for such a choice
HMC becomes degenerate: we have 𝑥(𝑇) = 𝑥(0), meaning that the chain does not move at all.
Thus, in this extremal case the relaxation time is infinite and the lemma holds.
Next we consider the case that 𝑇 ≤ 4𝜋/√𝐿. We shall pick 𝜎 = 1/√𝜇. Let 𝑃 denote the
transition kernel of ideal HMC. Then for 𝑥, 𝑦 ∈ ℝ we have
𝑃(𝑥, 𝑦) = (1/(√(2𝜋) 𝜎 sin(𝑇/𝜎))) exp(−(𝑦 − 𝑥 cos(𝑇/𝜎))² / (2𝜎² sin²(𝑇/𝜎))).
Namely, given the current position 𝑥, the next position 𝑦 follows a normal distribution with
mean 𝑥 cos(𝑇/𝜎) and variance 𝜎² sin²(𝑇/𝜎). Denote the spectral gap of 𝑃 by 𝛾. Let ℎ(𝑥) = 𝑥 and
note that ℎ ∈ 𝐿²₀(𝜈). Using the properties of spectral gaps, we deduce that
𝛾 = 1 − sup_{𝑓 ∈ 𝐿²₀(𝜈), ‖𝑓‖=1} |⟨𝑃𝑓, 𝑓⟩| ≤ 1 − |⟨𝑃ℎ, ℎ⟩| / ‖ℎ‖².
Since we have
‖ℎ‖² = ∫_ℝ 𝜈(𝑥)ℎ(𝑥)² d𝑥 = 𝜎²
and
⟨𝑃ℎ, ℎ⟩ = ∫_ℝ ∫_ℝ 𝜈(𝑥)𝑃(𝑥, 𝑦)ℎ(𝑥)ℎ(𝑦) d𝑥 d𝑦 = 𝜎² cos(𝑇/𝜎),
we then get
𝛾 ≤ 1 − |cos(𝑇/𝜎)| ≤ 𝑇²/(2𝜎²) ≤ 8𝜋² · 𝜇/𝐿,
where the last inequality is due to our assumption 𝑇 ≤ 4𝜋/√𝐿. Hence, 𝜏_rel = 1/𝛾 ≥ 𝜅/(8𝜋²).
This completes the proof of the lemma.
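The one-dimensional dynamics in this proof are explicit and easy to simulate (illustration only; the numerical choices below are arbitrary): each step is the rotation 𝑥 ← 𝑥 cos(𝑇/𝜎) + 𝑣𝜎 sin(𝑇/𝜎) with fresh 𝑣 ∼ 𝑁(0, 1). When 𝑇/𝜎 is a multiple of 2𝜋 the chain is frozen, and otherwise its lag-one autocorrelation matches cos(𝑇/𝜎) = ⟨𝑃ℎ, ℎ⟩/‖ℎ‖².

```python
import math
import numpy as np

def hmc_gaussian_step(x, sigma, T, rng):
    """One step of ideal HMC for f(x) = x^2 / (2 sigma^2): an explicit rotation."""
    v = rng.standard_normal()
    theta = T / sigma
    return x * math.cos(theta) + v * sigma * math.sin(theta)

rng = np.random.default_rng(1)

# Degenerate regime: T/sigma = 2*pi, so x(T) = x(0) up to floating-point error.
x0 = 1.23
x1 = hmc_gaussian_step(x0, sigma=1.0, T=2 * math.pi, rng=rng)
print(abs(x1 - x0))  # ~1e-16: the chain does not move

# Generic regime: lag-one autocorrelation should be cos(T/sigma).
sigma, T = 1.0, math.pi / 3  # cos(T/sigma) = 0.5
xs = [rng.standard_normal() * sigma]  # start in stationarity: x ~ N(0, sigma^2)
for _ in range(50000):
    xs.append(hmc_gaussian_step(xs[-1], sigma, T, rng))
xs = np.array(xs)
corr = np.corrcoef(xs[:-1], xs[1:])[0, 1]
print(corr)  # close to 0.5
```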
4 Convergence rate of discretized HMC
In this section, we show how our improved contraction bound (Lemma 2.1) implies that HMC
˜
returns a good enough sample after 𝑂((𝜅𝑑) 1.5 ) steps. We will use the framework from [14] to
establish Theorem 1.5.
We first state the ODE solver from [14], which solves an ODE in nearly optimal time when
the solution to the ODE can be approximated by a piece-wise polynomial. We state here only for
the special case of second order ODEs for the Hamiltonian system. We refer to [14] for general
𝑘th order ODEs.
Theorem 4.1 ([14, Theorem 2.5]). Let 𝑥(𝑡) be the solution to the ODE
𝑥″(𝑡) = −∇𝑓(𝑥(𝑡)),  𝑥(0) = 𝑥₀,  𝑥′(0) = 𝑣₀,  (4.1)
where 𝑥₀, 𝑣₀ ∈ ℝ^𝑑 and 0 ≤ 𝑡 ≤ 𝑇. Suppose that the following conditions hold:
1. There exists a piecewise polynomial 𝑞(𝑡) such that 𝑞(𝑡) is a polynomial of degree 𝐷 on each interval
[𝑇_{𝑗−1}, 𝑇_𝑗], where 0 = 𝑇₀ < 𝑇₁ < · · · < 𝑇_𝑚 = 𝑇, and for all 0 ≤ 𝑡 ≤ 𝑇 we have
‖𝑞(𝑡) − 𝑥″(𝑡)‖ ≤ 𝛿/𝑇²;  (4.2)
2. {𝑇_𝑗}_{𝑗=1}^𝑚 and 𝐷 are given as input to the ODE solver;
3. The function 𝑓 has an 𝐿-Lipschitz gradient; i. e., for all 𝑥, 𝑦 ∈ ℝ^𝑑,
‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖ ≤ 𝐿‖𝑥 − 𝑦‖.  (4.3)
If √𝐿 · 𝑇 ≤ 1/16000, then the ODE solver can find a piecewise polynomial 𝑥̃(𝑡) such that for all 0 ≤ 𝑡 ≤ 𝑇,
‖𝑥̃(𝑡) − 𝑥(𝑡)‖ ≤ 𝑂(𝛿).  (4.4)
The ODE solver uses 𝑂(𝑚(𝐷 + 1) log(𝐶𝑇/𝛿)) evaluations of ∇𝑓 and 𝑂(𝑑𝑚(𝐷 + 1)² log(𝐶𝑇/𝛿)) time,
where
𝐶 = 𝑂(‖𝑣₀‖ + 𝑇‖∇𝑓(𝑥₀)‖).  (4.5)
The following lemma, which combines Theorem 3.2, Lemma 4.1 and Lemma 4.2 from [14],
establishes the conditions of Theorem 4.1 in our setting. We remark that Lemmas 4.1 and 4.2
hold for all 𝑇 ≤ 1/(8√𝐿), and Theorem 3.2, though stated only for 𝑇 ≤ 𝑂(𝜇^{1/4}/𝐿^{3/4}) in [14],
in fact holds for the whole range 𝑇 ≤ 1/(2√𝐿) where the contraction bound (Lemma 2.1) is true.
We omit these proofs here and refer the reader to [14] for more details.
Lemma 4.2. Let 𝑓 be a twice differentiable function such that 𝜇𝐼 ⪯ ∇²𝑓(𝑥) ⪯ 𝐿𝐼 for all 𝑥 ∈ ℝ^𝑑.
Choose the starting point 𝑥^{(0)} = arg min_𝑥 𝑓(𝑥), step-size 𝑇 = 1/(16000√𝐿), and ODE error tolerance
𝛿 = √𝜇 𝑇²𝜀/16 in the HMC algorithm. Let {𝑥^{(𝑘)}}_{𝑘=1}^𝑁 be the sequence of points we get from the HMC
algorithm and {𝑣₀^{(𝑘)}}_{𝑘=1}^𝑁 be the sequence of random Gaussian vectors we choose in each step. Let 𝜋 ∝ e^{−𝑓}
be the target distribution and let 𝜋_hmc be the distribution of 𝑥^{(𝑁)}, i. e., the output of HMC. For any
0 < 𝜀 < √𝑑, if we run HMC for
𝑁 = 𝑂(log(𝑑/𝜀)/(𝜇𝑇²)) = 𝑂(𝜅 log(𝑑/𝜀))  (4.6)
steps where 𝜅 = 𝐿/𝜇, then:
1. ([14, Theorem 3.2]) We have that
𝑊₂(𝜋_hmc, 𝜋) ≤ 𝜀/√𝜇;  (4.7)
2. ([14, Lemma 4.1]) For each 𝑘, let 𝑥_𝑘(𝑡) be the solution to the ODE (1.3) in the 𝑘th step of HMC.
Then there is a piecewise constant function 𝑞_𝑘 of 𝑚_𝑘 pieces such that ‖𝑞_𝑘(𝑡) − 𝑥_𝑘″(𝑡)‖ ≤ 𝛿/𝑇² for
all 0 ≤ 𝑡 ≤ 𝑇, where
𝑚_𝑘 = (2𝐿𝑇³/𝛿) (‖𝑣₀^{(𝑘−1)}‖ + 𝑇‖∇𝑓(𝑥^{(𝑘−1)})‖);  (4.8)
3. ([14, Lemma 4.2]) We have that
𝔼[(1/𝑁) Σ_{𝑘=1}^𝑁 ‖∇𝑓(𝑥^{(𝑘−1)})‖²] ≤ 𝑂(𝐿𝑑).  (4.9)
Proof of Theorem 1.5. The convergence of HMC is guaranteed by part 1 of Lemma 4.2. In the
𝑘th step, the number of evaluations of ∇𝑓 is 𝑂(𝑚_𝑘 log(𝐶_𝑘 √𝜅/𝜀)) by Theorem 4.1 and part 2 of
Lemma 4.2, where
𝑚_𝑘 = 𝑂((√𝜅/𝜀) (‖𝑣₀^{(𝑘−1)}‖ + 𝑇‖∇𝑓(𝑥^{(𝑘−1)})‖))
and
𝐶_𝑘 = 𝑂(‖𝑣₀^{(𝑘−1)}‖ + 𝑇‖∇𝑓(𝑥^{(𝑘−1)})‖).
Thus, the average number of evaluations of ∇𝑓 per step is not greater than
(1/𝑁) 𝔼[Σ_{𝑘=1}^𝑁 𝑂(𝑚_𝑘 log(𝐶_𝑘 √𝜅/𝜀))] ≤ (1/𝑁) 𝔼[Σ_{𝑘=1}^𝑁 𝑂(𝑚_𝑘 log 𝑚_𝑘)] ≤ 𝑂((1/𝑁) 𝔼[𝑀 log 𝑀]),
where 𝑀 = Σ_{𝑘=1}^𝑁 𝑚_𝑘. Since each 𝑣₀^{(𝑘−1)} is sampled from the standard Gaussian distribution, we
have 𝔼[‖𝑣₀^{(𝑘−1)}‖²] = 𝑑. Thus, by the Cauchy–Schwarz inequality and part 3 of Lemma 4.2, we
get
𝔼[𝑀²] ≤ 𝑁 Σ_{𝑘=1}^𝑁 𝔼[𝑚_𝑘²] ≤ 𝑂((𝑁𝜅/𝜀²) Σ_{𝑘=1}^𝑁 (𝔼[‖𝑣₀^{(𝑘−1)}‖²] + 𝑇² 𝔼[‖∇𝑓(𝑥^{(𝑘−1)})‖²])) ≤ 𝑂(𝑁²𝜅𝑑/𝜀²).
We then deduce again from the Cauchy–Schwarz inequality that
𝔼[𝑀 log 𝑀] ≤ √(𝔼[𝑀²] · 𝔼[log² 𝑀]) ≤ √(𝔼[𝑀²] · log²(𝔼[𝑀])) ≤ √(𝔼[𝑀²]) · log(𝔼[𝑀²]),
where the second inequality is due to Jensen's inequality and the fact that log²(𝑥) is concave
when 𝑥 ≥ 3 (note that we can assume 𝑀 ≥ 3 by making 𝑁 ≥ 3). Therefore, the number of
evaluations of ∇𝑓 per step, amortized over all steps, is
𝑂((1/𝑁) √(𝔼[𝑀²]) log(𝔼[𝑀²])) ≤ 𝑂((√(𝜅𝑑)/𝜀) log(𝜅𝑑/𝜀)).
Using a similar argument, we obtain the bound on the expected running time per step. This
completes the proof.
Acknowledgments. We are grateful to Yin Tat Lee and Andre Wibisono for helpful discussions.
We also thank the anonymous referees for their valuable comments and suggestions.
References
[1] David Applegate and Ravi Kannan: Sampling and integration of near log-concave functions.
In Proc. 23rd STOC, pp. 156–163. ACM Press, 1991. [doi:10.1145/103418.103439] 5
[2] Yu Cao, Jianfeng Lu, and Lihan Wang: On explicit 𝐿²-convergence rate estimate for
underdamped Langevin dynamics, 2019. [arXiv:1908.04746] 5, 6
[3] Niladri S. Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L. Bartlett, and Michael I.
Jordan: On the theory of variance reduction for stochastic gradient Monte Carlo. In Proc.
35th Internat. Conf. Artificial Intelligence and Statistics (ICML’18), pp. 80:764–773. MLR Press,
2018. PMLR. [arXiv:1802.05431] 5
[4] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu: Fast mixing of Metropolized
Hamiltonian Monte Carlo: Benefits of multi-step gradients. JMLR, 21(92):1–72, 2019. JMLR.
[arXiv:1905.12247] 5
[5] Zongchen Chen and Santosh S. Vempala: Optimal convergence rate of Hamiltonian Monte
Carlo for strongly logconcave distributions. In Proc. 23rd Internat. Conf. on Randomization
and Computation (RANDOM’19), pp. 64:1–12. Schloss Dagstuhl–Leibniz-Zentrum fuer
Informatik, 2019. [doi:10.4230/LIPIcs.APPROX-RANDOM.2019.64, arXiv:1905.02313] 1, 5
[6] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I.
Jordan: Sharp convergence rates for Langevin dynamics in the nonconvex setting, 2018.
[arXiv:1805.01648] 5
[7] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan: Underdamped
Langevin MCMC: A non-asymptotic analysis. In Proc. 31st Ann. Conf. on Learning Theory
(COLT’18), pp. 75:1–24. MLR Press, 2018. PMLR. [arXiv:1707.03663] 5
[8] Arnak S. Dalalyan: Theoretical guarantees for approximate sampling from smooth and log-
concave densities. J. Royal Statistical Soc. Ser. B, 79(3):651–676, 2017. [doi:10.1111/rssb.12183,
arXiv:1412.7392] 5
[9] Arnak S. Dalalyan and Avetik Karagulyan: User-friendly guarantees for the Langevin
Monte Carlo with inaccurate gradient. Stoch. Proc. Appl., 129(12):5278–5311, 2019.
[doi:10.1016/j.spa.2019.02.016, arXiv:1710.00095] 5
[10] Arnak S. Dalalyan and Lionel Riou-Durand: On sampling from a log-concave density
using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020. [doi:10.3150/19-BEJ1178,
arXiv:1807.09382] 5
[11] Arnak S. Dalalyan, Lionel Riou-Durand, and Avetik Karagulyan: Bounding the
error of discretized Langevin algorithms for non-strongly log-concave targets, 2019.
[arXiv:1906.08530] 5
[12] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu: Log-concave sampling:
Metropolis-Hastings algorithms are fast! JMLR, 20(183):1–42, 2019. JMLR, Preliminary
version in COLT’18. [arXiv:1801.02309] 5
[13] Yin Tat Lee, Ruoqi Shen, and Kevin Tian: Logsmooth gradient concentration and tighter
runtimes for Metropolized Hamiltonian Monte Carlo. In Proc. 33rd Ann. Conf. on Learning
Theory (COLT’20), pp. 125:2565–2597. MLR Press, 2020. PMLR. [arXiv:2002.04121] 5
[14] Yin Tat Lee, Zhao Song, and Santosh S. Vempala: Algorithmic theory of ODEs and
sampling from well-conditioned logconcave densities, 2018. [arXiv:1812.06243] 2, 5, 6, 10,
12, 13, 14
[15] Yin Tat Lee and Santosh S. Vempala: Convergence rate of Riemannian Hamiltonian Monte
Carlo and faster polytope volume computation. In Proc. 50th STOC, pp. 1115–1121. ACM
Press, 2018. [doi:10.1145/3188745.3188774, arXiv:1710.06261] 2, 5
[16] László Lovász and Santosh S. Vempala: Fast algorithms for logconcave functions: Sampling,
rounding, integration and optimization. In Proc. 47th FOCS, pp. 57–68. IEEE Comp. Soc.,
2006. [doi:10.1109/FOCS.2006.28] 5
[17] Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and
Michael I. Jordan: Is there an analog of Nesterov acceleration for MCMC?, 2019.
[arXiv:1902.00996] 5
[18] Oren Mangoubi and Aaron Smith: Mixing of Hamiltonian Monte Carlo on strongly
log-concave distributions 1: Continuous dynamics, 2017. [arXiv:1708.07114] 2, 5, 6
[19] Oren Mangoubi and Aaron Smith: Mixing of Hamiltonian Monte Carlo on strongly
log-concave distributions 2: Numerical integrators. In Proc. 22nd Internat. Conf. Artificial
Intelligence and Statistics (AISTATS’19), pp. 89:586–595. MLR Press, 2019. PMLR. 2, 5
[20] Oren Mangoubi and Nisheeth Vishnoi: Dimensionally tight bounds for second-order
Hamiltonian Monte Carlo. In Proc. 31st Adv. Neural Info. Proc. Sys. (NeurIPS’18), pp.
6027–6037. Curran Assoc., 2018. NeurIPS. 2, 5
[21] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan:
High-order Langevin diffusion yields an accelerated MCMC algorithm. JMLR, 22(42):1–41,
2021. JMLR. [arXiv:1908.10859] 5
[22] Yann Ollivier: Ricci curvature of Markov chains on metric spaces. J. Functional Anal.,
256(3):810–864, 2009. [doi:10.1016/j.jfa.2008.11.001, arXiv:math/0701886] 4, 9
[23] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky: Non-convex learning via
stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Proc. 30th Ann. Conf.
on Learning Theory (COLT’17), pp. 65:1–30. MLR Press, 2017. PMLR. [arXiv:1702.03849] 5
[24] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes: Positive curvature and
Hamiltonian Monte Carlo. In Proc. 27th Adv. Neural Info. Proc. Sys. (NIPS’14), pp. 586–594.
Curran Assoc., 2014. NIPS. 5
[25] Ruoqi Shen and Yin Tat Lee: The Randomized Midpoint Method for log-concave sampling.
In Proc. 32nd Adv. Neural Info. Proc. Sys. (NeurIPS’19), pp. 2100–2111. Curran Assoc., 2019.
NeurIPS. [arXiv:1909.05503] 5, 6
[26] Santosh S. Vempala and Andre Wibisono: Rapid convergence of the unadjusted Langevin
algorithm: Isoperimetry suffices. In Proc. 32nd Adv. Neural Info. Proc. Sys. (NeurIPS’19), pp.
8094–8106. Curran Assoc., 2019. NeurIPS. [arXiv:1903.08568] 5
[27] Andre Wibisono: Sampling as optimization in the space of measures: The Langevin
dynamics as a composite optimization problem. In Proc. 31st Ann. Conf. on Learning Theory
(COLT’18), pp. 75:2093–3027. MLR Press, 2018. PMLR. [arXiv:1802.08089] 5
[28] Andre Wibisono: Proximal Langevin algorithm: Rapid convergence under isoperimetry,
2019. [arXiv:1911.01469] 5
[29] Yuchen Zhang, Percy S. Liang, and Moses Charikar: A hitting time analysis of stochastic
gradient Langevin dynamics. In Proc. 30th Ann. Conf. on Learning Theory (COLT’17), pp.
65:1980–2022. MLR Press, 2017. PMLR. [arXiv:1702.05575] 5
[30] Xingyu Zhou: On the Fenchel Duality between strong convexity and Lipschitz Continuous
Gradient, 2018. SLIDES. [arXiv:1803.06573] 8
AUTHORS
Zongchen Chen
Instructor
Department of Mathematics
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
zongchen mit edu
https://sites.google.com/view/zongchenchen
Santosh S. Vempala
Professor
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332, USA
vempala gatech edu
https://www.cc.gatech.edu/~vempala/
ABOUT THE AUTHORS
Zongchen Chen is an instructor (postdoc) in Mathematics at the Massachusetts
Institute of Technology. He obtained his Ph. D. in Algorithms, Combinatorics
and Optimization at the Georgia Institute of Technology in 2021, advised by
Eric Vigoda. Before that, he received his B. S. in Mathematics and Applied
Mathematics from the Zhiyuan College at the Shanghai Jiao Tong University in
2016. He has broad interests in randomized algorithms, discrete probability, and
combinatorics. Currently, his research focuses on Markov chain Monte Carlo
(MCMC) methods for sampling from Gibbs distributions, and machine learning
problems related to undirected graphical models.
Santosh Vempala is the Frederick Storey professor in the College of Computing
and former director of the Algorithms and Randomness Center at Georgia Tech.
His research interests, ironically, are in algorithms, randomness, and geometry.
He graduated from CMU in 1997 following the advice of Avrim Blum and
was at MIT till 2006 except for a year as a Miller fellow at UC Berkeley. He
gets unreasonably excited when a phenomenon that appears complex from one
perspective turns out to be simple from another. At Georgia Tech, he started the
Computing-for-Good (C4G) program.