Authors: Christian Coester, James R. Lee
License CC-BY-3.0
Theory of Computing, Volume 18 (23), 2022, pp. 1-24
www.theoryofcomputing.org
Pure Entropic Regularization for Metrical
Task Systems
Christian Coester*   James R. Lee†
Received July 23, 2019; Revised October 8, 2020; Published December 29, 2022
Abstract. We show that on every $n$-point HST metric, there is a randomized online algorithm for metrical task systems (MTS) that is $1$-competitive for service costs and $O(\log n)$-competitive for movement costs. In general, these refined guarantees are optimal up to the implicit constant. While an $O(\log n)$-competitive algorithm for MTS on HST metrics was developed by Bubeck et al. (SODA 2019), that approach could only establish an $O((\log n)^2)$-competitive ratio when the service costs are required to be $O(1)$-competitive. Our algorithm can be viewed as an instantiation of online mirror descent with the regularizer derived from a multiscale conditional entropy.

In fact, our algorithm satisfies a set of even more refined guarantees; we are able to exploit this property to combine it with known random embedding theorems and obtain, for any $n$-point metric space, a randomized algorithm that is $1$-competitive for service costs and $O((\log n)^2)$-competitive for movement costs.
An extended abstract of this paper appeared in the Proceedings of the 32nd Annual Conference on Learning Theory (COLT 2019) [17].
* Supported by EPSRC Award 1652110. Part of this work was carried out while C. Coester was visiting the University of Washington, hosted by J. R. Lee.
† Supported by NSF grants CCF-1616297 and CCF-1407779 and a Simons Investigator Award.
ACM Classification: F.2.0
AMS Classification: 68W27
Key words and phrases: online algorithms, competitive analysis, mirror descent, metrical task
systems, decision making under uncertainty
© 2022 Christian Coester and James R. Lee
Licensed under a Creative Commons Attribution License (CC-BY). DOI: 10.4086/toc.2022.v018a023
1 Introduction
Let $(X, d_X)$ be a finite metric space with $|X| = n > 1$. The Metrical Task Systems (MTS) problem, introduced in [11], is described as follows. The input is a sequence $\langle c_t : X \to \mathbb{R}_+ : t \geq 1 \rangle$ of nonnegative cost functions on the state space $X$. At every time $t$, an online algorithm maintains a state $\rho_t \in X$.
The corresponding cost is the sum of a service cost $c_t(\rho_t)$ and a movement cost $d_X(\rho_{t-1}, \rho_t)$. Formally, an online algorithm is a sequence of mappings $\rho = \langle \rho_1, \rho_2, \ldots \rangle$ where, for every $t \geq 1$, $\rho_t : (\mathbb{R}_+^X)^t \to X$ maps a sequence of cost functions $\langle c_1, \ldots, c_t \rangle$ to a state. The initial state $\rho_0 \in X$ is fixed. The total cost of the algorithm $\rho$ in servicing $c = \langle c_t : t \geq 1 \rangle$ is defined as
$$\mathrm{cost}_\rho(c) := \sum_{t \geq 1} \left[ c_t(\rho_t(c_1, \ldots, c_t)) + d_X\big(\rho_{t-1}(c_1, \ldots, c_{t-1}), \rho_t(c_1, \ldots, c_t)\big) \right].$$
The cost of the offline optimum, denoted $\mathrm{cost}^*(c)$, is the infimum of $\sum_{t \geq 1} [c_t(\rho_t) + d_X(\rho_{t-1}, \rho_t)]$ over all sequences $\langle \rho_t : t \geq 1 \rangle$ of states. A randomized online algorithm $\rho$ is said to be $\alpha$-competitive if for every $\rho_0 \in X$, there is a constant $\beta > 0$ such that for all cost sequences $c$:
$$\mathbb{E}\left[\mathrm{cost}_\rho(c)\right] \leq \alpha \cdot \mathrm{cost}^*(c) + \beta\,.$$
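The offline benchmark $\mathrm{cost}^*(c)$ admits a simple dynamic program: after each cost function arrives, record the cheapest total cost of any state sequence ending at each point. The sketch below (the helper name and dict-based encoding are ours, not from the paper) illustrates the cost model:

```python
def offline_opt(d, costs, rho0):
    """cost*(c): infimum over state sequences of total service + movement cost.

    d: metric as a dict of dicts d[x][y]; costs: list of dicts c_t[x]; rho0: initial state.
    best[x] holds the cheapest cost of any sequence ending in state x.
    """
    states = list(d)
    best = {x: d[rho0][x] + costs[0][x] for x in states}
    for c in costs[1:]:
        best = {y: min(best[x] + d[x][y] for x in states) + c[y] for y in states}
    return min(best.values())
```

On the $2$-point uniform metric, for instance, the offline optimum pays only the cheaper of serving in place or moving at each step.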
For the $n$-point uniform metric, a simple coupon-collector argument shows that the competitive ratio is $\Omega(\log n)$, and this is tight [11]. A long-standing conjecture is that this $\Theta(\log n)$ competitive ratio holds for an arbitrary $n$-point metric space. The lower bound has almost been established [8, 9]: for any $n$-point metric space, the competitive ratio is $\Omega(\log n / \log\log n)$. Following a long sequence of works (see, e. g., [20, 10, 7, 6, 19, 18]), an upper bound of $O((\log n)^2)$ was shown in [13].
Relation to adversarial multi-arm bandits  MTS is naturally related to the adversarial setting of the classical multi-arm bandits model in sequential decision making, and provides a very general framework for "bandits with switching costs." Unlike in the setting of regret minimization, where one competes against the best static strategy in hindsight (see, e. g., [12]), competitive analysis compares the performance of an online algorithm to the best dynamic offline algorithm. Thus this model emphasizes the importance of adaptivity in the face of changing environments. For MTS, the online algorithm has full information: access to the complete cost function $c_t$ is available when deciding on a point $\rho_t(c_1, \ldots, c_t) \in X$ at which to play. And yet one of the fascinating relationships between MTS and adversarial bandits is the parallel between adaptivity (a willingness to "try out" new strategies) and the classical exploration/exploitation tradeoff that occurs in models where one only has access to partial information about the loss functions.
HST metrics  The methods of [5] show that the competitive ratio for MTS is $O(\log n)$ on weighted star metrics. Recently, the authors of [13] generalized this result by designing an algorithm with competitive ratio $O(D_T \log n)$ on any weighted $n$-point tree metric with combinatorial depth $D_T$. We now discuss a special class of metrics.
Let $T = (V, E)$ be a finite tree with root $r$ and vertex weights $\{w_u > 0 : u \in V\}$, let $\mathcal{L} \subseteq V$ denote the leaves of $T$, and suppose that the vertex weights on $T$ are non-increasing along root-leaf paths. Consider the metric space $(\mathcal{L}, d_T)$, where $d_T(\ell, \ell')$ is the weighted length of the path connecting $\ell$ and $\ell'$ when the edge from a node $u$ to its parent has length $w_u$. We will use $D_T$ for the combinatorial (i. e., unweighted) depth of $T$.
$(\mathcal{L}, d_T)$ is called an HST metric (or, equivalently for finite metric spaces, an ultrametric). If, for some $\tau > 1$, the weights on $T$ satisfy the stronger inequality $w_v \leq w_u / \tau$ whenever $v$ is a child of $u$, the space $(\mathcal{L}, d_T)$ is said to be a $\tau$-HST metric. Such metric spaces play a special role in MTS since every $n$-point metric space can be probabilistically approximated by a distribution over such spaces [6, 18]. Indeed, the $O((\log n)^2)$-competitive ratio for general metric spaces established in [13] is a consequence of their $O(\log n)$-competitive algorithm for HSTs.
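To make the tree metric concrete, the following sketch (our own encoding: a parent map plus the vertex weights $w_u$) computes $d_T$ by summing edge lengths up to the least common ancestor; note the ultrametric property that two leaves splitting below the meeting point are equidistant from a third:

```python
def hst_distance(parent, w, a, b):
    """d_T(a, b): weighted length of the a-b path, where the edge from a
    node u to its parent has length w[u]."""
    def ancestors(u):
        path = [u]
        while u in parent:
            u = parent[u]
            path.append(u)
        return path
    up_a, up_b = ancestors(a), set(ancestors(b))
    lca = next(u for u in up_a if u in up_b)   # lowest common ancestor
    climb = lambda u: sum(w[v] for v in ancestors(u)[:ancestors(u).index(lca)])
    return climb(a) + climb(b)
```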
1.1 Refined guarantees

The authors of [4] observe that there is a more refined way to analyze competitive algorithms for MTS. For a randomized online algorithm $\rho$ and a cost sequence $c$, we denote by $\mathsf{S}_\rho(c)$ and $\mathsf{M}_\rho(c)$, respectively, the (expected) service cost and movement cost, that is,
$$\mathsf{S}_\rho(c) := \mathbb{E} \sum_{t \geq 1} c_t(\rho_t) \qquad \text{and} \qquad \mathsf{M}_\rho(c) := \mathbb{E} \sum_{t \geq 1} d_X(\rho_{t-1}, \rho_t)\,.$$
If there are numbers $\alpha, \alpha', \beta, \beta' > 0$ such that for every cost sequence $c$, it holds that
$$\mathsf{S}_\rho(c) \leq \alpha \cdot \mathrm{cost}^*(c) + \beta$$
$$\mathsf{M}_\rho(c) \leq \alpha' \cdot \mathrm{cost}^*(c) + \beta'\,,$$
one says that $\rho$ is $\alpha$-competitive for service costs and $\alpha'$-competitive for movement costs.
In [4], it is shown that on every $n$-point HST metric, and for every $\varepsilon > 0$, there is an online algorithm that is simultaneously $(1+\varepsilon)$-competitive for service costs and $O((\log(n/\varepsilon))^2)$-competitive for movement costs. The authors of [13] improve this slightly to show that actually there is an online algorithm that is simultaneously $1$-competitive for service costs and $O((\log n)^2)$-competitive for movement costs. We obtain the optimal refined guarantees.
Theorem 1.1. On any $n$-point HST metric $T$, there is a randomized online algorithm that is $1$-competitive for service costs and $O(\log n)$-competitive for movement costs.

Remark 1.2 (Optimality of the refined guarantees). Any finitely competitive algorithm for MTS on an $n$-point uniform metric cannot be better than $\Omega(\log n)$-competitive for movement costs, regardless of its competitive ratio for service costs. This is because the lower bound holds even if the cost functions only take values $0$ and $\infty$. Moreover, it cannot be better than $1$-competitive for service costs, regardless of its competitive ratio for movement costs. To see this, consider the case where each cost function is the constant function $1$.
Finely competitive guarantees  Suppose that for some numbers $\alpha_0, \alpha_1, \gamma, \beta, \beta' > 0$, a randomized online algorithm $\rho$ satisfies, for every cost sequence $c$ and every offline algorithm $\rho^*$:
$$\mathsf{S}_\rho(c) \leq \alpha_0\, \mathsf{S}_{\rho^*}(c) + \alpha_1\, \mathsf{M}_{\rho^*}(c) + \beta \qquad (1.1)$$
$$\mathsf{M}_\rho(c) \leq \gamma\, \mathsf{S}_\rho(c) + \beta'\,. \qquad (1.2)$$
In this case, we say that $\rho$ is $(\alpha_0, \alpha_1, \gamma)$-finely competitive. We establish the following.

Theorem 1.3. On any $n$-point HST metric $T$, for every $\kappa \geq 1$, there is an online randomized algorithm $\rho$ that is $(1, 1/\kappa, O(\kappa \log n))$-finely competitive. In fact, one can take $\beta = 0$ and $\beta' \leq O(\kappa\, \mathrm{diam}(T))$.
Combined with the random embedding from [18], this yields the following consequence for general $n$-point metric spaces.

Corollary 1.4. On any $n$-point metric space, there is an online randomized algorithm that is $1$-competitive for service costs and $O((\log n)^2)$-competitive for movement costs.
Proof. Consider an $n$-point metric space $(X, d_X)$. It is known [18] that there exists a random HST metric $(T, d_T)$ so that $\mathcal{L}(T) = X$ and for all $x, y \in X$:
1. $\Pr[d_T(x, y) \geq d_X(x, y)] = 1$,
2. $\mathbb{E}[d_T(x, y)] \leq D \cdot d_X(x, y)$,
where $D \leq O(\log n)$.
Let $\rho_T$ be the randomized algorithm for $(T, d_T)$ guaranteed by Theorem 1.3 with $\kappa = D$. Let $\rho$ denote the algorithm that results from sampling $(T, d_T)$ and then using $\rho_T$. We use $\mathsf{M}^X$ to denote movement cost measured in $d_X$ and $\mathsf{M}^T$ for movement cost measured in $d_T$.
Then for any cost sequence $c$ and any offline algorithm $\rho^*$, we have
$$\mathsf{S}_\rho(c) = \mathbb{E}[\mathsf{S}_{\rho_T}(c)] \leq \mathsf{S}_{\rho^*}(c) + \kappa^{-1}\, \mathbb{E}[\mathsf{M}^T_{\rho^*}(c)] + O(1) \leq \mathsf{S}_{\rho^*}(c) + \kappa^{-1} D\, \mathsf{M}^X_{\rho^*}(c) + O(1) = \mathsf{S}_{\rho^*}(c) + \mathsf{M}^X_{\rho^*}(c) + O(1)\,,$$
and
$$\mathsf{M}^X_\rho(c) = \mathbb{E}[\mathsf{M}^X_{\rho_T}(c)] \leq \mathbb{E}[\mathsf{M}^T_{\rho_T}(c)] \leq O(\kappa \log n)\, \mathbb{E}[\mathsf{S}_{\rho_T}(c)] + O(1)\,,$$
completing the proof.
1.2 The fractional model on trees

We will work in the following deterministic fractional setting, which is equivalent to the randomized integral setting described earlier (see [13, §2]). The state of a fractional algorithm is given by a point in the polytope
$$\mathsf{K}_T := \left\{ x \in \mathbb{R}_+^V : x_r = 1,\ \ x_u = \sum_{v \in \chi(u)} x_v \quad \forall u \in V \setminus \mathcal{L} \right\}, \qquad (1.3)$$
where we use $\chi(u)$ for the set of children of $u$ in $T$. For $u \neq r$, we will also write $\mathsf{p}(u)$ for the parent of $u$ in $T$.
A state $x \in \mathsf{K}_T$ corresponds to the situation that the state of a randomized integral algorithm is a leaf descendant of $u$ with probability $x_u$. Note that $\mathsf{K}_T$ is simply an affine encoding of the probability simplex on $\mathcal{L}$. In the fractional setting, changing from state $x$ to $x'$ incurs movement cost $\|x - x'\|_{\ell_1(w)}$, where
$$\|z\|_{\ell_1(w)} := \sum_{u \in V} w_u |z_u|$$
denotes the weighted $\ell_1$-norm on $\mathbb{R}^V$.
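As a sanity check on these definitions, here is a small sketch (helper names are ours) that verifies membership in $\mathsf{K}_T$ and evaluates the weighted $\ell_1$ movement cost:

```python
def in_K_T(x, parent, root, leaves, tol=1e-9):
    """Membership in K_T: x_root = 1 and, at every internal node u,
    x_u equals the sum of x_v over the children of u."""
    children = {}
    for v, u in parent.items():
        children.setdefault(u, []).append(v)
    if abs(x[root] - 1.0) > tol:
        return False
    return all(abs(x[u] - sum(x[v] for v in children[u])) <= tol
               for u in x if u not in leaves)

def movement(x, y, w):
    """Weighted ell_1 movement cost between two fractional states."""
    return sum(w.get(u, 0.0) * abs(x[u] - y[u]) for u in x)
```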
1.3 Mirror descent, metric filtrations, and regularization

Following [13], our algorithm is based on the mirror descent framework as established in [14]. This is a method for regularized online convex optimization, an approach that was previously explored for competitive analysis in [1, 15].
A central component of mirror descent is choosing the appropriate mirror map (which we will often refer to as the "regularizer"). This is a strictly convex function $\Phi : \mathsf{K}_T \to \mathbb{R}$ that endows $\mathsf{K}_T$ with a geometric (Riemannian) structure, specifying how to perform constrained vector flow. In other words, it specifies how one can move in a preferred direction while remaining inside $\mathsf{K}_T$.
The paper [13] employs the following regularizer:
$$\Phi_0(x) := \frac{1}{\eta} \sum_{u \in V \setminus \{r\}} w_u\, (x_u + \delta_u) \log(x_u + \delta_u)\,, \qquad (1.4)$$
with $\eta = \Theta(\log |\mathcal{L}|)$ and $\delta_u = |\mathcal{L}_u| / |\mathcal{L}|$, where $\mathcal{L}_u$ is the set of leaves in the subtree rooted at $u$.
1.3.1 Metric filtrations

It is straightforward that one can think of $\Phi_0$ as a type of multiscale entropy (this is the negative of the associated Shannon entropy, since we use the analyst's convention that the entropy is convex). To understand this notion, let us forget momentarily the weights on $T$. Then the structure of $T$ gives a natural filtration over probability measures on the leaves $\mathcal{L}$. Suppose that $L$ is a random variable taking values in $\mathcal{L}$ and, for $u \in V$, denote by $\mathcal{E}_u$ the event $\{L \in \mathcal{L}_u\}$. Then the chain rule for Shannon entropy yields
$$\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \sum_{u \in V \setminus \{r\}} \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathsf{p}(u)}]}{\Pr[\mathcal{E}_u]}\,.$$
If we now imagine that uncertainty at higher scales is more costly than uncertainty at lower scales, then we might define an analogous weighted entropy by
$$\sum_{u \in V \setminus \{r\}} w_u \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathsf{p}(u)}]}{\Pr[\mathcal{E}_u]}\,. \qquad (1.5)$$
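The chain-rule identity can be checked numerically; the sketch below (tree and distribution chosen arbitrarily) evaluates both sides on a depth-2 binary tree:

```python
from math import log

# Binary tree of depth 2: root r, internal nodes u1/u2, leaves a, b, c, d.
parent = {'u1': 'r', 'u2': 'r', 'a': 'u1', 'b': 'u1', 'c': 'u2', 'd': 'u2'}
leaf_pr = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}   # distribution of L

def subtree_pr(u):
    """Pr[E_u]: total probability that L lands at a leaf below u."""
    if u in leaf_pr:
        return leaf_pr[u]
    return sum(subtree_pr(v) for v, p in parent.items() if p == u)

lhs = sum(pr * log(1 / pr) for pr in leaf_pr.values())          # leaf entropy
rhs = sum(subtree_pr(u) * log(subtree_pr(parent[u]) / subtree_pr(u))
          for u in parent)                                       # chain-rule form
```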
Such a notion is natural in the context of "metric learning" problems.
Ignoring the $\{\delta_u\}$ values for a moment, consider that (1.4) is not analogous to (1.5). Indeed, it corresponds to the quantity
$$\sum_{u \in V \setminus \{r\}} w_u \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]}\,, \qquad (1.6)$$
and now one can see a fundamental reason why the algorithm associated to (1.4) only achieves an $O(D_T \log n)$ competitive ratio, where $D_T$ is the combinatorial depth of $T$: the quantity (1.6) overmeasures the metric uncertainty.
Suppose that $L$ is a uniformly random leaf. Then $\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \log n$, where $n = |\mathcal{L}|$. But, in general, one could have $\sum_{u \in V} \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]} \geq \Omega(D_T \log n)$. This fact was not lost on the authors of [13], but they bypass the problem by combining mirror descent on stars with a recursive composition method called "unfair gluing."
1.3.2 Multiscale conditional entropy

We employ a regularizer that is a more faithful analog of (1.5):
$$\Phi(x) := \sum_{u \in V \setminus \{r\}} \frac{w_u}{\eta_u} \left( x_u + \delta_u\, x_{\mathsf{p}(u)} \right) \log\left( \frac{x_u}{x_{\mathsf{p}(u)}} + \delta_u \right), \qquad (1.7)$$
where $\mathsf{p}(u)$ denotes the parent of $u$.
If one ignores the additional parameters $\{\eta_u \geq 1, \delta_u > 0\}$, this is precisely the negative weighted Shannon entropy written according to the chain rule. Here, we set
$$\theta_u := \frac{|\mathcal{L}_u|}{|\mathcal{L}_{\mathsf{p}(u)}|} \qquad (1.8)$$
$$\eta_u := 1 + \log(1/\theta_u) \qquad (1.9)$$
$$\delta_u := \theta_u / \eta_u\,. \qquad (1.10)$$
The numbers $\{\theta_u\}$ are the conditional probabilities of the uniform distribution on leaves.
The $\{\delta_u\}$ values are employed as "noise" added to the entropy calculation. Such noise is a fundamental aspect of competitive analysis, and distinguishes it from the application of mirror descent to regret minimization problems (see, e. g., [12]).¹ The effect of these noise parameters appears ubiquitously in applications of the primal-dual method to competitive analysis (see [16]), and manifests itself as an additive term in the update rules (see equation (1.11) below). Intuitively, it ensures that the conditional probability $\frac{x_u}{x_{\mathsf{p}(u)}}$ is updated fast enough even when it is close to $0$.
Finally, the numbers $\{\eta_u : u \in V\}$ are commonly referred to as "learning rates" in the study of online learning. They represent the rate at which information is discounted in the resulting algorithm; for MTS, this corresponds to the relative importance of costs arriving now vs. costs that arrived in the past.
¹One finds aspects of this "mixing with the uniform distribution" in the bandits setting as well, but used for variance reduction, a seemingly very different purpose.
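The parameters (1.8)-(1.10) and the regularizer (1.7) are easy to evaluate on a concrete tree; in the sketch below (function names are ours), $x$ is given on all of $V$ and the sum runs over non-root nodes:

```python
from math import log

def entropy_params(parent, leaves):
    """theta_u, eta_u, delta_u of (1.8)-(1.10) for every non-root node u."""
    def nleaves(u):
        if u in leaves:
            return 1
        return sum(nleaves(v) for v, p in parent.items() if p == u)
    theta = {u: nleaves(u) / nleaves(p) for u, p in parent.items()}
    eta = {u: 1 + log(1 / theta[u]) for u in theta}
    delta = {u: theta[u] / eta[u] for u in theta}
    return theta, eta, delta

def Phi(x, parent, w, eta, delta):
    """The multiscale conditional entropy (1.7)."""
    return sum((w[u] / eta[u]) * (x[u] + delta[u] * x[p]) * log(x[u] / x[p] + delta[u])
               for u, p in parent.items())
```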
1.3.3 The dynamics

We will derive in Section 3 the following continuous-time evolution of the resulting mirror descent algorithm $(x(t) \in \mathsf{K}_T : t \in [0, \infty))$ for a cost path $c : [0, \infty) \to \mathbb{R}_+^{\mathcal{L}}$:
$$\partial_t\, \frac{x_u(t)}{x_{\mathsf{p}(u)}(t)} = \frac{\eta_u}{w_u} \left( \frac{x_u(t)}{x_{\mathsf{p}(u)}(t)} + \delta_u \right) \left( \beta_{\mathsf{p}(u)}(t) - \sum_{\ell \in \mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell(t) \right) \qquad (1.11)$$
Here, $\beta_{\mathsf{p}(u)}(t)$ is a Lagrangian multiplier that ensures conservation of conditional probability:
$$\sum_{v \in \chi(\mathsf{p}(u))} \partial_t\, \frac{x_v(t)}{x_{\mathsf{p}(u)}(t)} = 0\,.$$
One can see that the evolution is being driven by the expected instantaneous cost incurred conditioned on the current state being in the subtree rooted at $u$.
One should interpret equation (1.11) only when $x(t)$ lies in the relative interior of $\mathsf{K}_T$. Otherwise, the conditional probabilities are ill-defined. One way to rectify this is to prevent $x(t)$ from hitting the relative boundary of $\mathsf{K}_T$ at all. It is possible to adaptively modify the cost functions by a suitably small perturbation so as to guarantee this property and, at the same time, ensure that the total discrepancy between the modified and true service cost is a small additive constant.
Instead, we will follow a different approach, by extending the dynamics to an analogous system of conditional probabilities $\{q_u(t) : u \in V \setminus \{r\}\}$:
$$\partial_t\, q_u(t) = \frac{\eta_u}{w_u} \left( q_u(t) + \delta_u \right) \left( \beta_{\mathsf{p}(u)}(t) - \hat{c}_u(t) + \alpha_u(t) \right), \qquad (1.12)$$
where $q_u(t) = \frac{x_u(t)}{x_{\mathsf{p}(u)}(t)}$ whenever $x_{\mathsf{p}(u)}(t) > 0$, $\alpha_u(t)$ is a Lagrangian multiplier for the constraint $q_u(t) \geq 0$, and $\hat{c}_u(t)$ is the "derived" cost in the subtree rooted at $u$:
$$\hat{c}_u(t) := \sum_{\ell \in \mathcal{L}_u} q_{\ell|u}(t)\, c_\ell(t)\,, \qquad q_{\ell|u}(t) := \prod_{v \in \gamma_{u,\ell} \setminus \{u\}} q_v(t)\,,$$
where $\gamma_{u,\ell}$ is the unique simple $u$-$\ell$ path in $T$.
Stated this way, the mirror descent algorithm can be envisioned as running a "weighted star" algorithm on the conditional probabilities at every internal node of $T$, with the derived costs at an internal node $u$ given by the average cost of the current strategy for playing one unit of mass in the subtree rooted at $u$.
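On a weighted star (a root whose children are the leaves), $x_\ell$ is itself the conditional probability at the root, and the conservation multiplier $\beta(t)$ has a closed form: the weighted average of the instantaneous costs. The toy Euler discretization below (step size and horizon are arbitrary; it ignores the multiplier $\alpha_u$, so it is only meaningful while all coordinates stay positive) shows mass draining away from the costly leaf while total mass is conserved:

```python
from math import log

w = {'a': 1.0, 'b': 1.0, 'c': 1.0}
theta = {l: 1 / 3 for l in w}                 # uniform conditionals (1.8)
eta = {l: 1 + log(1 / theta[l]) for l in w}   # learning rates (1.9)
delta = {l: theta[l] / eta[l] for l in w}     # noise parameters (1.10)

x = {'a': 1 / 3, 'b': 1 / 3, 'c': 1 / 3}
cost = {'a': 1.0, 'b': 0.0, 'c': 0.0}         # constant cost on leaf 'a'

dt = 1e-3
for _ in range(300):
    rate = {l: (eta[l] / w[l]) * (x[l] + delta[l]) for l in x}
    # beta makes the derivatives sum to zero (conservation of probability)
    beta = sum(rate[l] * cost[l] for l in x) / sum(rate.values())
    for l in x:
        x[l] += dt * rate[l] * (beta - cost[l])
```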
In the next section, we will implement and analyze a discretization of (1.12) using Bregman projections. Since our regularizer $\Phi$ and convex body $\mathsf{K}_T$ do not satisfy the assumptions underlying the existence and uniqueness theorem of [14], we need to construct a solution to (1.12) and, indeed, taking the discretization parameter in our algorithm to zero, one establishes a solution of bounded variation; see Section 3.3.
The major benefit of the formulations (1.11) and (1.12) is in motivating such an algorithm and prescribing the derived costs. In Section 3, we describe how these dynamics can be predicted from the definition (1.7).
2 The MTS algorithm

We will first establish some generic machinery which, at this point, is not yet specific to MTS. Consider a convex polytope $\mathsf{K}_0 \subseteq \mathbb{R}^m$, define $\mathsf{K} := \mathsf{K}_0 \cap \mathbb{R}_+^m$, and assume that $\mathsf{K}$ is compact. Suppose additionally that $\Phi : U \to \mathbb{R}$ is differentiable and strictly convex on an open neighborhood $U \supseteq \mathsf{K}$.
Let us write $D_\Phi$ for the corresponding Bregman divergence
$$D_\Phi(y \,\|\, x) := \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y - x \rangle\,,$$
which is non-negative due to convexity of $\Phi$. Then for $x, y, z \in \mathsf{K}$, we have:
$$D_\Phi(z \,\|\, y) - D_\Phi(z \,\|\, x) = -\Phi(y) + \Phi(x) - \langle \nabla \Phi(y), z - y \rangle + \langle \nabla \Phi(x), z - x \rangle\,. \qquad (2.1)$$
For a vector $c \in \mathbb{R}^m$ and $x \in \mathsf{K}$, define the projection
$$\Pi^c_{\mathsf{K}}(x) := \operatorname{argmin}\, \{ D_\Phi(y \,\|\, x) + \langle c, y \rangle : y \in \mathsf{K} \}\,.$$
Since $\mathsf{K}$ is compact and $\Phi$ is strictly convex, there is a unique minimizer $y^* \in \mathsf{K}$.
For $x \in \mathsf{K}$, recall the definition of the normal cone at $x$:
$$N_{\mathsf{K}}(x) = \{ p \in \mathbb{R}^m : \langle p, y - x \rangle \leq 0 \text{ for all } y \in \mathsf{K} \}\,.$$
Given a representation of $\mathsf{K}$ by inequality constraints, $\mathsf{K} = \{ x \in \mathbb{R}^m : Ax \leq b \}$ for $A \in \mathbb{R}^{k \times m}$ and $b \in \mathbb{R}^k$, it holds that
$$N_{\mathsf{K}}(x) = \{ A^{\mathsf{T}} y : y \geq 0 \text{ and } y^{\mathsf{T}} (Ax - b) = 0 \}\,.$$
The KKT conditions yield
$$\nabla \Phi(y^*) = \nabla \Phi(x) - c - p^*\,, \qquad (2.2)$$
where $p^* \in N_{\mathsf{K}}(y^*)$. Since $N_{\mathsf{K}}(y^*) = N_{\mathsf{K}_0}(y^*) + N_{\mathbb{R}_+^m}(y^*)$, we can decompose $p^* = \beta - \alpha$ with $\beta \in N_{\mathsf{K}_0}(y^*)$ and $-\alpha \in N_{\mathbb{R}_+^m}(y^*)$. In particular, we have $\alpha \geq 0$ and $\alpha_i > 0 \implies y^*_i = 0$ for every $i = 1, \ldots, m$.
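For the classical negative entropy $\Phi(y) = \sum_i y_i \log y_i$ on the probability simplex, the projection $\Pi^c_{\mathsf{K}}(x)$ has a closed form that follows directly from (2.2): each coordinate is scaled by $e^{-c_i}$ and renormalized, the normalization constant playing the role of the multiplier for the simplex constraint. A minimal sketch (this is not the regularizer used in this paper):

```python
from math import exp

def entropic_projection(x, c):
    """argmin_y D_Phi(y||x) + <c, y> over the simplex, for Phi(y) = sum y_i log y_i.

    KKT gives y*_i proportional to x_i * exp(-c_i); for x in the interior,
    the minimizer is interior, so no nonnegativity multipliers are active."""
    y = [xi * exp(-ci) for xi, ci in zip(x, c)]
    s = sum(y)
    return [yi / s for yi in y]
```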
Substituting this into equation (2.1) gives
$$D_\Phi(z \,\|\, y^*) - D_\Phi(z \,\|\, x) = -\Phi(y^*) + \Phi(x) + \langle \nabla \Phi(x), y^* - x \rangle + \langle c - \alpha + \beta, z - y^* \rangle \leq -D_\Phi(y^* \,\|\, x) + \langle c - \alpha, z - y^* \rangle\,,$$
where the inequality comes from $\langle \beta, z - y^* \rangle \leq 0$, since $z \in \mathsf{K}$ and $\beta \in N_{\mathsf{K}_0}(y^*)$. We have proved the following.

Lemma 2.1. For any $x, z \in \mathsf{K}$ and $c \in \mathbb{R}^m$, let $y^* = \Pi^c_{\mathsf{K}}(x)$ and let $p^*$ be as in (2.2). Then for any $\alpha \in -N_{\mathbb{R}_+^m}(y^*)$ such that $p^* + \alpha \in N_{\mathsf{K}_0}(y^*)$, it holds that
$$D_\Phi(z \,\|\, y^*) - D_\Phi(z \,\|\, x) \leq \langle c - \alpha, z - y^* \rangle\,.$$
2.1 Iterative Bregman projections

We describe now a discretization of the algorithm from the introduction. This discretization will mimic the continuous dynamics if the entries of each individual cost vector are small. We can achieve this by splitting each cost vector into several copies of scaled-down versions of itself, as discussed in Section 2.3. In Section 3.3, we will give a formal argument that this indeed yields a discretization of the continuous dynamics from the introduction.
Fix a tree $T$ and recall the definition of $\mathsf{K}_T$ from (1.3). Let $\mathsf{Q}_T$ denote the collection of vectors $q \in \mathbb{R}_+^{V \setminus \{r\}}$ such that for all $u \in V \setminus \mathcal{L}$,
$$\sum_{v \in \chi(u)} q_v = 1\,.$$
For $q \in \mathsf{Q}_T$ and $u \in V \setminus \mathcal{L}$, we use $q^{(u)} \in \mathbb{R}_+^{\chi(u)}$ to denote the vector defined by $q^{(u)}_v := q_v$ for $v \in \chi(u)$, and define the corresponding probability simplex $\mathsf{Q}^{(u)}_T := \{ q^{(u)} : q \in \mathsf{Q}_T \}$. We will use $\Lambda : \mathsf{Q}_T \to \mathsf{K}_T$ for the map which sends $q \in \mathsf{Q}_T$ to the (unique) $x = \Lambda(q) \in \mathsf{K}_T$ such that
$$x_v = x_u q_v \qquad \forall u \in V \setminus \mathcal{L},\ v \in \chi(u)\,.$$
Note that $q$ contains more information than $x$; the map $\Lambda$ fails to be invertible whenever there is some $u \in V \setminus \mathcal{L}$ with $x_u = 0$.
Fix $\kappa \geq 1$. On the open domain $U^{(u)} := (-\min_{v \in \chi(u)} \delta_v, \infty)^{\chi(u)}$, for $\delta_v$ as given in equation (1.10), define the strictly convex function $\Phi^{(u)} : U^{(u)} \to \mathbb{R}$ by
$$\Phi^{(u)}(q) := \frac{1}{\kappa} \sum_{v \in \chi(u)} \frac{w_v}{\eta_v}\, (q_v + \delta_v) \log(q_v + \delta_v)\,.$$
Denote the corresponding Bregman divergence on $\mathsf{Q}^{(u)}_T$ by
$$D^{(u)}(q \,\|\, q') = \frac{1}{\kappa} \sum_{v \in \chi(u)} \frac{w_v}{\eta_v} \left[ (q_v + \delta_v) \log \frac{q_v + \delta_v}{q'_v + \delta_v} + q'_v - q_v \right]\,.$$
We now define an algorithm that takes a point $p \in \mathsf{Q}_T$ and a cost vector $c \in \mathbb{R}_+^{\mathcal{L}}$ and outputs a point $q = q(p, c) \in \mathsf{Q}_T$. Fix a topological ordering $\langle u_1, u_2, \ldots, u_s \rangle$ of $V \setminus \mathcal{L}$ such that every child in $T$ occurs before its parent. We define $q$ inductively as follows. Let $\hat{c}_\ell := c_\ell$ for $\ell \in \mathcal{L}$. For every $j = 1, 2, \ldots, s$:
$$\hat{c}^{(u_j)}_v := \hat{c}_v \quad \forall v \in \chi(u_j) \qquad (2.3)$$
$$q^{(u_j)} := \operatorname{argmin} \left\{ D^{(u_j)}\big(q \,\|\, p^{(u_j)}\big) + \big\langle q, \hat{c}^{(u_j)} \big\rangle : q \in \mathsf{Q}^{(u_j)}_T \right\} \qquad (2.4)$$
$$\hat{c}_{u_j} := \sum_{v \in \chi(u_j)} q^{(u_j)}_v\, \hat{c}_v \qquad (2.5)$$
Let $\alpha^{(u_j)}$ be the vector of Lagrange multipliers corresponding to the nonnegativity constraints in equation (2.4) (recall Lemma 2.1). One should note that in this setting (a probability simplex), the nonnegativity multipliers are unique and thus well-defined.
We denote by $\alpha = \alpha^{p,c} \in \mathbb{R}_+^V$ the vector given by $\alpha_v := \alpha^{(\mathsf{p}(v))}_v$ for $v \neq r$ and $\alpha_r := 0$. Recall the complementary slackness conditions:
$$\alpha_v > 0 \implies q_v = 0\,. \qquad (2.6)$$
For $v \in \chi(u)$, calculate
$$\nabla \Phi^{(u)}(q)_v = \frac{1}{\kappa} \frac{w_v}{\eta_v} \left( 1 + \log(q_v + \delta_v) \right)\,.$$
Then using equation (2.2), we can write the algorithm as follows:

For $j = 1, 2, \ldots, s$:
  For $v \in \chi(u_j)$:
    $q^{(u_j)}_v := (p^{(u_j)}_v + \delta_v) \exp\left( \kappa\, \frac{\eta_v}{w_v} \big( \beta_{u_j} - (\hat{c}_v - \alpha_v) \big) \right) - \delta_v\,,$
  $\hat{c}_{u_j} := \sum_{v \in \chi(u_j)} q^{(u_j)}_v\, \hat{c}_v\,,$

where $\beta_{u_j} \geq 0$ is the multiplier for the constraint $\sum_{v \in \chi(u_j)} q^{(u_j)}_v \geq 1$. There is no multiplier for the constraint $\sum_{v \in \chi(u_j)} q^{(u_j)}_v \leq 1$ because this constraint will be satisfied automatically and is therefore not needed in (2.4): if it were violated, decreasing some $q^{(u_j)}_v$ with $q^{(u_j)}_v > p_v$ would yield a strictly better solution to the minimization problem (2.4).
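The update above can be simulated at a single internal node: assuming the minimizer of (2.4) is interior (so the nonnegativity multipliers $\alpha_v$ vanish, which is the relevant regime once costs are split small as in Section 2.3), the multiplier $\beta_{u_j}$ is determined by the normalization $\sum_v q_v = 1$ and can be found by bisection. A sketch with made-up inputs:

```python
from math import exp, log

def simplex_step(p, chat, w, eta, delta, kappa=1.0):
    """One multiplicative update at an internal node:
    q_v = (p_v + delta_v) * exp(kappa * (eta_v / w_v) * (beta - chat_v)) - delta_v,
    with beta >= 0 chosen by bisection so that sum_v q_v = 1. Assumes the
    minimizer is interior (all q_v > 0), i.e. alpha_v = 0 for every v."""
    def q(beta):
        return [(p[v] + delta[v]) * exp(kappa * (eta[v] / w[v]) * (beta - chat[v])) - delta[v]
                for v in range(len(p))]
    # sum(q(0)) <= 1 since costs are nonnegative; the upper bound makes sum(q(hi)) > 1
    lo, hi = 0.0, max(chat) + max(w) / (kappa * min(eta))
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if sum(q(mid)) < 1 else (lo, mid)
    return q((lo + hi) / 2)
```

Note how a small cost on the first child shifts conditional probability toward its sibling.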
2.2 The global divergence

For $z \in \mathsf{K}_T$ and $p \in \mathsf{Q}_T$, define the global divergence function
$$\hat{D}(z \,\|\, p) := \frac{1}{\kappa} \sum_{u \notin \mathcal{L}} \sum_{v \in \chi(u)} \frac{w_v}{\eta_v} \left[ (z_v + \delta_v z_u) \log\left( \frac{z_v / z_u + \delta_v}{p_v + \delta_v} \right) + z_u p_v - z_v \right]\,,$$
where we use the convention that $0 \log\left(\frac{0}{0} + \delta_v\right) = \lim_{a \to 0} a \log\left(\frac{a}{a} + \delta_v\right) = 0$.
Note that $\hat{D}$ is the Bregman divergence associated to the regularizer (1.7) (divided by $\kappa$) when $\frac{x_v}{x_u}$ is replaced by $q_v$. One can write
$$\hat{D}(z \,\|\, p) = \sum_{u \notin \mathcal{L}} z_u\, D^{(u)}\big(\zeta^{(u)} \,\|\, p^{(u)}\big)\,,$$
where $\zeta \in \Lambda^{-1}(z)$. In other words, $\zeta \in \mathsf{Q}_T$ is any point satisfying $z_v = \zeta_v z_u$ for all $u \notin \mathcal{L}$ and $v \in \chi(u)$.
We will use $\hat{D}$ as a potential function to prove inequality (1.1), and $z$ will denote the configuration of some offline algorithm. Note that the state of the online algorithm is encoded by $q \in \mathsf{Q}_T$, which contains more information than its configuration $\Lambda(q) \in \mathsf{K}_T$.
The next lemma shows that when an offline algorithm moves, the change in potential is bounded by $O(1/\kappa)$ times the offline movement cost.

Lemma 2.2. For any $p \in \mathsf{Q}_T$ and $z, z' \in \mathsf{K}_T$, it holds that
$$\hat{D}(z' \,\|\, p) - \hat{D}(z \,\|\, p) \leq \frac{6}{\kappa}\, \|z - z'\|_{\ell_1(w)}\,.$$

Proof. Consider a differentiable map $z : [0, 1] \to \mathbb{R}_{++}^V$ such that $\sum_{v \in \chi(u)} z_v(t) \leq z_u(t)$ for each $t$ and $u \notin \mathcal{L}$. It suffices to show that for each $t$ and every fixed $p \in \mathsf{Q}_T$,
$$\kappa\, \partial_t \hat{D}(z(t) \,\|\, p) \leq 6\, \|z'(t)\|_{\ell_1(w)}\,.$$
Moreover, it suffices to address the case when there is at most one $u \in V$ with $z'_u(t) \neq 0$.
A direct calculation gives
$$\kappa\, \partial_t \hat{D}(z(t) \,\|\, p) = \frac{w_u}{\eta_u}\, z'_u(t) \log \frac{z_u(t)/z_{\mathsf{p}(u)}(t) + \delta_u}{p_u + \delta_u} + \sum_{v \in \chi(u)} \frac{w_v}{\eta_v} \left[ \delta_v\, z'_u(t) \log \frac{z_v(t)/z_u(t) + \delta_v}{p_v + \delta_v} + z'_u(t) \left( p_v - \frac{z_v(t)}{z_u(t)} \right) \right]\,. \qquad (2.7)$$
Let us now use definitions (1.9) and (1.10) to observe that
$$\frac{1}{\eta_v} \log \frac{p_v + \delta_v}{q_v + \delta_v} \leq \frac{1}{\eta_v} \log \frac{1 + \delta_v}{\delta_v} \leq 2\,.$$
Using this in equation (2.7) yields
$$\kappa\, \partial_t \hat{D}(z(t) \,\|\, p) \leq w_u\, |z'_u(t)| \left( 2 + \sum_{v \in \chi(u)} \left[ 2 \delta_v + p_v + \frac{z_v(t)}{z_u(t)} \right] \right) \leq w_u\, |z'_u(t)|\, (2 + 4)\,,$$
where the last inequality uses $\sum_{v \in \chi(u)} \delta_v \leq \sum_{v \in \chi(u)} \theta_v \leq 1$ and $\sum_{v \in \chi(u)} z_v(t) \leq z_u(t)$.
We will sometimes implicitly restrict vectors $x \in \mathbb{R}^V$ to the subspace spanned by $\{e_\ell : \ell \in \mathcal{L}\}$. In this case, we employ the notation
$$\langle x, y \rangle_{\mathcal{L}} := \sum_{\ell \in \mathcal{L}} x_\ell\, y_\ell\,,$$
where either vector may lie in $\mathbb{R}^V$ or $\mathbb{R}^{\mathcal{L}}$.
According to the following lemma, the change in potential due to movement of the online algorithm is bounded by the difference in service cost between the offline and online algorithms.

Lemma 2.3. For any cost vector $c \in \mathbb{R}_+^{\mathcal{L}}$, $z \in \mathsf{K}_T$, and $p \in \mathsf{Q}_T$, it holds that if $q = q(p, c)$, then
$$\hat{D}(z \,\|\, q) - \hat{D}(z \,\|\, p) \leq \langle c, z - \Lambda(q) \rangle_{\mathcal{L}}\,.$$
Proof. Fix $p \in \mathsf{Q}_T$ and $c \in \mathbb{R}_+^{\mathcal{L}}$. Let $\alpha = \alpha^{p,c}$ denote the vector of multipliers defined in Section 2.1. For $u \in V \setminus \mathcal{L}$ with $z_u > 0$, define $z^{(u)} \in \mathsf{Q}^{(u)}_T$ by
$$z^{(u)}_v := \frac{z_v}{z_u}\,.$$
Then Lemma 2.1 gives
$$D^{(u)}\big(z^{(u)} \,\|\, q^{(u)}\big) - D^{(u)}\big(z^{(u)} \,\|\, p^{(u)}\big) \leq \left\langle \hat{c}^{(u)} - \alpha^{(u)},\, z^{(u)} - q^{(u)} \right\rangle_{\chi(u)}\,,$$
where we use $\langle \cdot, \cdot \rangle_{\chi(u)}$ for the standard inner product on $\mathbb{R}^{\chi(u)}$. Multiplying by $z_u$ and summing yields
$$\hat{D}(z \,\|\, q) - \hat{D}(z \,\|\, p) \leq \sum_{u \notin \mathcal{L}} z_u \left\langle \hat{c}^{(u)} - \alpha^{(u)},\, z^{(u)} - q^{(u)} \right\rangle_{\chi(u)} = \sum_{u \notin \mathcal{L}} \sum_{v \in \chi(u)} (\hat{c}_v - \alpha_v)\, z_v - \sum_{u \notin \mathcal{L}} z_u \sum_{v \in \chi(u)} (\hat{c}_v - \alpha_v)\, q^{(u)}_v\,.$$
Note that from implication (2.6), the latter expression is
$$\sum_{u \notin \mathcal{L}} z_u \sum_{v \in \chi(u)} \hat{c}_v\, q^{(u)}_v \stackrel{(2.5)}{=} \sum_{u \notin \mathcal{L}} z_u\, \hat{c}_u\,.$$
Noting that $\hat{c}_r = \sum_{\ell \in \mathcal{L}} \Lambda(q)_\ell\, c_\ell$, this gives
$$\hat{D}(z \,\|\, q) - \hat{D}(z \,\|\, p) \leq \sum_{u \neq r} (\hat{c}_u - \alpha_u)\, z_u - \sum_{u \notin \mathcal{L}} z_u\, \hat{c}_u \leq \langle c, z - \Lambda(q) \rangle_{\mathcal{L}}\,.$$
2.3 Algorithm and competitive analysis

Let us now outline the proof of inequality (1.2). First, we perform a standard reduction that allows us to bound only the "positive" movement costs when the algorithm moves from $x$ to $y$. Its proof is straightforward.

Lemma 2.4. For $x, y \in \mathsf{K}_T$, it holds that
$$\|x - y\|_{\ell_1(w)} = 2\, \|(x - y)_+\|_{\ell_1(w)} + [\phi(y) - \phi(x)]\,,$$
where $\phi(x) := \sum_{u \neq r} w_u\, x_u$ for $x \in \mathsf{K}_T$.
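The identity is the pointwise fact $|a| = 2a_+ - a$ summed with weights; a quick numeric check (arbitrary vectors, since the identity needs nothing from $\mathsf{K}_T$ beyond matching coordinates):

```python
def weighted_l1(z, w):
    return sum(w[u] * abs(z[u]) for u in z)

def phi(x, w):
    return sum(w[u] * x[u] for u in x)

def movement_split(x, y, w):
    """Both sides of Lemma 2.4: ||x - y|| and 2||(x - y)_+|| + (phi(y) - phi(x))."""
    pos = {u: max(x[u] - y[u], 0.0) for u in x}
    diff = {u: x[u] - y[u] for u in x}
    return weighted_l1(diff, w), 2 * weighted_l1(pos, w) + phi(y, w) - phi(x, w)
```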
We now state the key technical lemma which controls the positive movement cost by the service cost. To this end, we employ an auxiliary potential function $\Psi : \mathsf{Q}_T \to \mathbb{R}$ defined by
$$\Psi_u(p) := -\Lambda(p)_u\, D^{(u)}\big(\theta^{(u)} \,\|\, p^{(u)}\big)\,, \qquad \Psi(p) := \sum_{u \notin \mathcal{L}} \Psi_u(p)\,.$$
Intuitively, $\Psi(p)$ is a measure of the difference between the online configuration $p$ and the uniform distribution over leaves (whose conditional probabilities are given by $\theta$).
Let us give a brief explanation of the need for $\Psi$. Our addition of "noise" to the multiscale conditional entropy is to achieve the smoothness property established in Lemma 2.2. But this has the adverse effect of increasing the movement cost of the algorithm, as one can see from the $\delta_u$ term in (1.11). This additional movement cannot be easily charged against the service cost in the regime where the noise term is dominant: $\frac{x_u(t)}{x_{\mathsf{p}(u)}(t)} \ll \delta_u$. On the other hand, this additional movement has the effect of further decreasing $\frac{x_u(t)}{x_{\mathsf{p}(u)}(t)}$, which drives the conditional probabilities at $\mathsf{p}(u)$ away from the uniform distribution, decreasing $\Psi$. A formal statement appears later in Lemma 2.11.
For the next two results, take any $p \in \mathsf{Q}_T$ and cost $c \in \mathbb{R}_+^{\mathcal{L}}$, and denote $q = q(p, c)$, $x = \Lambda(p)$, $y = \Lambda(q)$.

Lemma 2.5 (Movement analysis). It holds that
$$\frac{\tau - 3}{\tau}\, \|(x - y)_+\|_{\ell_1(w)} \leq (2 D_T + \log n)\, \langle c, x \rangle_{\mathcal{L}} + [\Psi(p) - \Psi(q)]\,.$$
This lemma will be proved in Section 2.4. Let us first see that it can be used to establish bounds on the competitive ratio. Define $w_{\min} := \min\{w_\ell : \ell \in \mathcal{L}\}$ and
$$\varepsilon_T := \frac{w_{\min}}{2\,(2 D_T + \log n)} \cdot \frac{\tau - 3}{\tau\, n}\,.$$
Theorem 2.6. For any $z \in \mathsf{K}_T$:
$$\langle c, y \rangle_{\mathcal{L}} \leq \langle c, z \rangle_{\mathcal{L}} + \hat{D}(z \,\|\, p) - \hat{D}(z \,\|\, q) \qquad (2.8)$$
$$\kappa^{-1}\, \|x - y\|_{\ell_1(w)} \leq \frac{2\tau}{\tau - 3} \Big( [\phi(y) - \phi(x)] + [\Psi(p) - \Psi(q)] + (2 D_T + \log n)\, \langle c, x \rangle_{\mathcal{L}} \Big) \qquad (2.9)$$
Moreover, if $\|c\|_\infty \leq \varepsilon_T$, then
$$\kappa^{-1}\, \|x - y\|_{\ell_1(w)} \leq \frac{4\tau}{\tau - 3} \Big( [\phi(y) - \phi(x)] + [\Psi(p) - \Psi(q)] + (2 D_T + \log n)\, \langle c, y \rangle_{\mathcal{L}} \Big)\,. \qquad (2.10)$$

Proof. Inequality (2.8) follows from Lemma 2.3, and inequality (2.9) follows from Lemma 2.5 and Lemma 2.4. To see that inequality (2.10) follows from inequality (2.9) and Lemma 2.5, use the fact that
$$\langle c, x \rangle_{\mathcal{L}} \leq \langle c, y \rangle_{\mathcal{L}} + \frac{\|c\|_\infty}{w_{\min}}\, \|(x - y)_+\|_{\ell_1(w)}\,.$$

In light of Theorem 2.6, we can respond to a cost function $c \in \mathbb{R}_+^{\mathcal{L}}$ by splitting it into $M$ pieces $c^1, c^2, \ldots, c^M$, each equal to $c/M$, where $M = \lceil \|c\|_\infty / \varepsilon_T \rceil$. Now define $q^j := q(q^{j-1}, c/M)$, $q^0 := p$, and $\bar{q}(p, c) := q^M$.
Theorem 2.7. Fix $\tau \geq 4$. Consider the algorithm that begins in some configuration $q^0 \in \mathsf{Q}_T$. If $c_t \in \mathbb{R}_+^{\mathcal{L}}$ is the cost function that arrives at time $t$, denote $q^t := \bar{q}(q^{t-1}, c_t)$. Then the sequence $\langle \Lambda(q^0), \Lambda(q^1), \ldots \rangle$ is an online algorithm that is $(1, O(1/\kappa), O(\kappa (D_T + \log n)))$-finely competitive.

We prove this momentarily. The following fact is well-known and, in conjunction with the preceding theorem, yields the validity of Theorems 1.1 and 1.3.

Lemma 2.8 (See, e. g., [3, Thm. 2.4]). If $(\mathcal{L}, d_T)$ is an HST metric, then there is another weighted tree $T'$ with leaf set $\mathcal{L}$ such that:
1. $(\mathcal{L}, d_{T'})$ is a $7$-HST metric.
2. $D_{T'} \leq \log_2 |\mathcal{L}|$.
3. All the leaves of $T'$ have depth $D_{T'}$.
4. $d_T(\ell, \ell') \leq d_{T'}(\ell, \ell') \leq O(d_T(\ell, \ell'))$ for all $\ell, \ell' \in \mathcal{L}$.

Proof sketch. Replace every weight $w_v$ in $T$ by $\hat{w}_v := 7^{\lceil \log_7 w_v \rceil}$ and iteratively contract every edge $(\mathsf{p}(u), u)$ with $\hat{w}_{\mathsf{p}(u)} = \hat{w}_u$ and $u \notin \mathcal{L}$. The resulting weighted tree $T_1$ is a $7$-HST by construction.
Now iteratively contract every edge $(\mathsf{p}(u), u)$ in $T_1$ for which $|\mathcal{L}^{T_1}_u| > \frac{1}{2} |\mathcal{L}^{T_1}_{\mathsf{p}(u)}|$. The resulting tree $T'$ has depth $D_{T'} \leq \log_2 |\mathcal{L}|$. Finally, one can achieve property (3) by increasing the depth of every root-leaf path to $D_{T'}$ using vertex weights that decrease by a factor of $7$ along the path.
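The first step of the sketch is just rounding each weight up to the next power of $7$, which inflates any path length by at most a factor of $7$; a minimal check (step 1 only, without the contraction steps):

```python
from math import ceil, log

def round_up_7(wv):
    """hat{w}_v := 7^{ceil(log_7 w_v)}: round a vertex weight up to a power of 7."""
    return 7 ** ceil(log(wv) / log(7))
```

Since $w_v \leq \hat{w}_v \leq 7 w_v$ coordinate-wise, sums of weights (and hence tree distances) are distorted by at most a factor of $7$, and the rounding preserves monotonicity of weights along root-leaf paths.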
Proof of Theorem 2.7. Consider a sequence $\langle c_t : t \geq 1 \rangle$ of cost functions. By splitting the costs into smaller pieces, we may assume that $\|c_t\|_\infty \leq \varepsilon_T$ for all $t \geq 1$.
Let $\{z^t_*\}$ denote some offline algorithm with $z^0_* = \Lambda(q^0)$, and let $\{x^t = \Lambda(q^t)\}$ denote our online algorithm. Then using $\hat{D}(z^0_* \,\|\, q^0) = 0$ along with inequality (2.8) and Lemma 2.2 yields, for any time $t_1 \geq 1$,
$$\sum_{t=1}^{t_1} \langle c_t, x^t \rangle_{\mathcal{L}} \leq \sum_{t=1}^{t_1} \langle c_t, z^t_* \rangle_{\mathcal{L}} - \hat{D}\big(z^{t_1}_* \,\|\, q^{t_1}\big) + O(1/\kappa) \sum_{t=1}^{t_1} \|z^t_* - z^{t-1}_*\|_{\ell_1(w)} \leq \sum_{t=1}^{t_1} \langle c_t, z^t_* \rangle_{\mathcal{L}} + O(1/\kappa) \sum_{t=1}^{t_1} \|z^t_* - z^{t-1}_*\|_{\ell_1(w)}\,,$$
where we have used $\hat{D}(z \,\|\, q) \geq 0$ for all $z \in \mathsf{K}_T$ and $q \in \mathsf{Q}_T$. This verifies inequality (1.1) with $\alpha_0 = 1$, $\alpha_1 = O(1/\kappa)$, and $\beta = 0$. Moreover, inequality (2.10) gives
$$\frac{1}{\kappa} \sum_{t=1}^{t_1} \|x^t - x^{t-1}\|_{\ell_1(w)} \leq \frac{4\tau}{\tau - 3} \left( [\phi(x^{t_1}) - \phi(x^0)] + [\Psi(q^0) - \Psi(q^{t_1})] + (2 D_T + \log n) \sum_{t=1}^{t_1} \langle c_t, x^t \rangle_{\mathcal{L}} \right),$$
verifying inequality (1.2) with $\gamma \leq O(\kappa\, (D_T + \log n))$ and $\beta' \leq O(\kappa \max_{v \neq r} w_v)$ (see Lemma 2.10 below).
2.4 Movement analysis
It remains to prove Lemma 2.5. Recall that ๐ โ ๐๐ , ๐ โ โ+โ and ๐ = ๐(๐, ๐), ๐ฅ = ฮ(๐), ๐ฆ = ฮ(๐).
The KKT conditions (see equation (2.2)) give: For every ๐ฃ โ ๐(๐ข),
1 ๐ค๐ฃ ๐๐ฃ + ๐ฟ๐ฃ
log = ๐ฝ ๐ข โ ๐ห๐ฃ + ๐ผ ๐ฃ , (2.11)
๐
๐๐ฃ ๐๐ฃ + ๐ฟ๐ฃ
where ๐ฝ ๐ข โฅ 0 is the multiplier corresponding to the constraint ๐ฃโ๐(๐ข) ๐ ๐ฃ โฅ 1.
ร
Lemma 2.9. It holds that ๐ผ ๐ฃ โค ๐ห๐ฃ for all ๐ฃ โ ๐ \ {๐ฃ }.
Proof. Note that ๐ห๐ฃ โฅ 0 by construction. Thus if ๐ผ ๐ฃ = 0, we are done. Otherwise, by
๐ +๐ฟ
complementary slackness, it must be that ๐ ๐ฃ = 0, and therefore log( ๐๐ฃ๐ฃ +๐ฟ๐ฃ๐ฃ ) โค 0. Since ๐ฝ p(๐ฃ) โฅ 0,
equation (2.11) implies that ๐ผ ๐ฃ โค ๐ห๐ฃ .
๐ ๐ฃ +๐ฟ ๐ฃ
Define ๐๐ฃ := log ๐ ๐ฃ +๐ฟ ๐ฃ so that
๐ ๐ฃ โ ๐ ๐ฃ = (๐ ๐ฃ + ๐ฟ ๐ฃ )(1 โ e๐๐ฃ ). (2.12)
Recall that for ๐ฃ โ ๐(๐ข), we have ๐ฅ ๐ฃ = ๐ ๐ฃ ๐ฅ ๐ข and ๐ฆ๐ฃ = ๐ ๐ฃ ๐ฆ๐ข , thus
๐ฅ ๐ฃ โ ๐ฆ๐ฃ = ๐ฅ ๐ข (๐ ๐ฃ โ ๐ ๐ฃ ) + ๐ ๐ฃ (๐ฅ ๐ข โ ๐ฆ๐ข ) = (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅ ๐ข )(1 โ ๐ ๐๐ฃ ) + ๐ ๐ฃ (๐ฅ ๐ข โ ๐ฆ๐ข ).
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 15
C HRISTIAN C OESTER AND JAMES R. L EE
In particular,
๐ค ๐ฃ (๐ฅ ๐ฃ โ ๐ฆ๐ฃ )+ โค ๐ค ๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅ ๐ข )(1 โ e๐๐ฃ )+ + ๐ค ๐ฃ ๐ ๐ฃ (๐ฅ ๐ข โ ๐ฆ๐ข )+
๐ค๐ข
โค ๐ค ๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅ ๐ข )(1 โ e๐๐ฃ )+ + ๐ ๐ฃ (๐ฅ ๐ข โ ๐ฆ๐ข )+ .
๐
๐ฃโ๐(๐ข) ๐ ๐ฃ = 1 and summing over all vertices yields
ร
Using
ร ร 1ร
๐ค ๐ฃ (๐ฅ ๐ฃ โ ๐ฆ๐ฃ )+ โค ๐ค ๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅp(๐ฃ) )(1 โ e๐๐ฃ )+ + ๐ค ๐ฃ (๐ฅ ๐ฃ โ ๐ฆ๐ฃ )+ ,
๐ฃโ ๐ฃ ๐ฃโ ๐ฃ
๐ ๐ฃโ ๐ฃ
hence
ร ๐ ร
๐ค ๐ฃ (๐ฅ ๐ฃ โ ๐ฆ๐ฃ )+ โค ๐ค ๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅp(๐ฃ) )(1 โ e๐๐ฃ )+
๐ฃโ ๐ฃ
๐ โ 1 ๐ฃโ ๐ฃ
๐ ร
โค ๐ค ๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅp(๐ฃ) ) (๐๐ฃ )โ
๐ โ 1 ๐ฃโ ๐ฃ
๐
๐ ยฉร ร ร
โค ยญ ๐๐ฃ ๐ฅ ๐ฃ ๐ห๐ฃ + ๐ฅ๐ข ๐๐ฃ ( ๐ห๐ฃ โ ๐ผ ๐ฃ )ยฎ ,
ยช
(2.13)
๐ โ 1 ๐ฃโ ๐ฃ
ยซ ๐ขโโ ๐ฃโ๐(๐ข) ยฌ
where the last line uses Lemma 2.9 and equation (2.11), to bound ๐ค ๐ฃ (๐๐ฃ )โ โค ๐
๐๐ฃ ( ๐ห๐ฃ โ ๐ผ ๐ฃ ).
Note that
ร ร ร
๐๐ฃ ๐ฅ ๐ฃ ๐ห๐ฃ โค ๐โ ๐ฅโ ๐๐ฃ โค (๐๐ + log ๐) h๐, ๐ฅi , (2.14)
๐ฃโ ๐ฃ โ โโ ๐ฃโ๐พ๐ฃ ,โ \{๐ฃ }
since for any โ โ โ, it holds that
ร ร |โp(๐ฃ) |
๐๐ฃ = ๐๐ (โ ) + log = ๐๐ (โ ) + log ๐,
|โ๐ฃ |
๐ฃโ๐พ๐ฃ ,โ \{๐ฃ } ๐ฃโ๐พ๐ฃ ,โ \{๐ฃ }
where ๐๐ (โ ) is the combinatorial depth of โ .
The second sum in (2.13) can be interpreted as the service cost of hybrid configurations of ๐
and ๐: While ๐ฃโ๐(๐ข) ๐ฅ ๐ฃ ๐ห๐ฃ is the service cost of ๐ฅ in โ๐ข , the term ๐ฅ ๐ข ๐ฃโ๐(๐ข) ๐๐ฃ ๐ห๐ฃ is the service
ร ร
cost in โ๐ข of the modification of ๐ฅ whose conditional probabilities at the children of ๐ข are given
by ๐(๐ข) rather than ๐ (๐ข) . To bound this hybrid service cost, we will employ the auxiliary potential
ฮจ.
2.4.1 The hybrid cost
We require the following elementary estimate.
Lemma 2.10. For ๐ข โ โ it holds that
n
(๐ข) (๐ข)
o 2 ๐ค๐ข
max D (๐ k ๐) : ๐, ๐ โ ๐๐ โค .
๐
๐
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 16
P URE E NTROPIC R EGULARIZATION FOR M ETRICAL TASK S YSTEMS
Proof. Define ๐๐ฃ : (โ๐ฟ ๐ฃ , โ) โ โ by
1
๐๐ฃ (๐) := (๐ ๐ฃ + ๐ฟ ๐ฃ ) log(๐ ๐ฃ + ๐ฟ ๐ฃ ),
๐๐ฃ
and let
๐๐ฃ + ๐ฟ๐ฃ
1
D๐๐ฃ (๐ ๐ฃ k ๐ ๐ฃ ) = (๐ ๐ฃ + ๐ฟ ๐ฃ ) log + (๐ ๐ฃ โ ๐ ๐ฃ )
๐๐ฃ ๐๐ฃ + ๐ฟ๐ฃ
denote the corresponding Bregman divergence. Then for ๐ ๐ฃ , ๐ ๐ฃ โฅ 0, it holds that D๐๐ฃ (๐ ๐ฃ k ๐ ๐ฃ ) โฅ 0
since ๐๐ฃ is convex on โ+ . Employing the ๐-HST property of ๐, this implies that
1 ร ๐ค๐ข ร
D(๐ข) (๐ k ๐) = ๐ค ๐ฃ D๐๐ฃ (๐๐ฃ k ๐ ๐ฃ ) โค D๐๐ฃ (๐๐ฃ k ๐ ๐ฃ ).
๐
๐
๐
๐ฃโ๐(๐ข) ๐ฃโ๐(๐ข)
(๐ข) (๐ข)
Define ๐น : ๐๐ ร ๐๐ โ โ+ by ๐น(๐, ๐) := ๐ฃโ๐(๐ข) D๐๐ฃ (๐๐ฃ k ๐ ๐ฃ ). The map ๐ โฆโ ๐น(๐, ๐) is convex
ร
in general (for any Bregman divergence). The map ๐ โฆโ ๐น(๐, ๐) is convex as well, as this holds
for each map ๐ ๐ฃ โฆโ D๐๐ฃ (๐ ๐ฃ k ๐ ๐ฃ ) since โ log(๐ฅ) is convex on โ++ . Since the maximum of a convex
function on the a polytope is achieved at an extreme point, we have
1 + ๐ฟ๐ฃ ๐ฟ ๐ฃ0
n o
(๐ข) 1 1
max ๐น(๐, ๐) : ๐, ๐ โ ๐๐ โค max (1 + ๐ฟ ๐ฃ ) log โ1 + ๐ฟ ๐ฃ0 log +1
๐ฃ,๐ฃ โ๐(๐ข)
0 ๐๐ฃ ๐ฟ๐ฃ ๐๐ฃ 0 1 + ๐ฟ ๐ฃ0
๐ฃโ ๐ฃ 0
โค 2.
The next lemma is crucial: It relates the service cost (with respect to the reduced cost ๐ห โ ๐ผ)
of the hybrid configurations to the service cost of the actual configuration and the movement
cost.
Lemma 2.11. For any ๐ข โ โ, it holds that
2 ๐ค๐ข ร
ฮจ๐ข (๐) โ ฮจ๐ข (๐) โค (๐ฅ ๐ข โ ๐ฆ๐ข )+ + (๐ห๐ฃ โ ๐ผ ๐ฃ ) [๐ฅ ๐ฃ โ ๐๐ฃ ๐ฅ ๐ข ] . (2.15)
๐
๐
๐ฃโ๐(๐ข)
Proof. Write
ฮจ๐ข (๐) โ ฮจ๐ข (๐) = ๐ฅ ๐ข D(๐ข) ๐ (๐ข) k ๐ (๐ข) โ ๐ฆ๐ข D(๐ข) ๐ (๐ข) k ๐ (๐ข)
h i
= (๐ฅ ๐ข โ ๐ฆ๐ข )D(๐ข) (๐(๐ข) k ๐ (๐ข) ) + ๐ฅ ๐ข D(๐ข) (๐ (๐ข) k ๐ (๐ข) ) โ D(๐ข) (๐(๐ข) k ๐ (๐ข) ) .
Using Lemma 2.10, the first term is bounded by ๐
2 ๐ค๐๐ข (๐ฅ ๐ข โ ๐ฆ๐ข )+ .
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 17
C HRISTIAN C OESTER AND JAMES R. L EE
Let us now bound the second term. Using 1 + ๐ก โค e๐ก , we have
ร ๐ค ๐๐ฃ + ๐ฟ๐ฃ
๐ฃ
h i
(๐ข) (๐ข) (๐ข) (๐ข) (๐ข) (๐ข)
๐
๐ฅ ๐ข D (๐ k๐ )โD (๐ k๐ ) = ๐ฅ๐ข (๐๐ฃ + ๐ฟ ๐ฃ ) log + ๐๐ฃ โ ๐๐ฃ
๐๐ฃ ๐๐ฃ + ๐ฟ๐ฃ
๐ฃโ๐(๐ข)
ร ๐ค
๐ฃ
[(๐๐ฃ + ๐ฟ ๐ฃ )๐๐ฃ + (๐ ๐ฃ + ๐ฟ ๐ฃ )(1 โ e๐๐ฃ )]
(2.12)
= ๐ฅ๐ข
๐๐ฃ
๐ฃโ๐(๐ข)
ร ๐ค
๐ฃ
โค ๐ฅ๐ข ๐๐ฃ (๐๐ฃ โ ๐ ๐ฃ )
๐๐ฃ
๐ฃโ๐(๐ข)
ร ๐ค
๐ฃ
= ๐๐ฃ [๐๐ฃ ๐ฅ ๐ข โ ๐ฅ ๐ฃ ] .
๐๐ฃ
๐ฃโ๐(๐ข)
To finish the proof, observe that from equation (2.11),
ร ๐ค
๐ฃ
ร ร
๐๐ฃ [๐๐ฃ ๐ฅ ๐ข โ ๐ฅ ๐ฃ ] = ๐
(๐ฝ ๐ข โ ๐ห๐ฃ + ๐ผ ๐ฃ ) [๐๐ฃ ๐ฅ ๐ข โ ๐ฅ ๐ฃ ] = ๐
(๐ผ ๐ฃ โ ๐ห๐ฃ ) [๐๐ฃ ๐ฅ ๐ข โ ๐ฅ ๐ฃ ] ,
๐๐ฃ
๐ฃโ๐(๐ข) ๐ฃโ๐(๐ข) ๐ฃโ๐(๐ข)
๐ฃโ๐(๐ข) ๐ฅ ๐ฃ = ๐ฅ ๐ข and ๐ฃโ๐(๐ข) ๐๐ฃ = 1 (from (1.8)).
ร ร
where the last equality uses
Using the lemma gives
ร ร (2.15) 2 ร
๐ฅ๐ข ๐๐ฃ ( ๐ห๐ฃ โ ๐ผ ๐ฃ ) โค [ฮจ(๐) โ ฮจ(๐)] + (ฮ(๐) โ ฮ(๐))+ โ (๐ค) + ๐ห๐ฃ ๐ฅ ๐ฃ
๐
๐ 1
๐ฃโ ๐ฃ
๐ขโโ ๐ฃโ๐(๐ข)
2
โค [ฮจ(๐) โ ฮจ(๐)] + (ฮ(๐) โ ฮ(๐))+ โ (๐ค) + ๐๐ h๐, ๐ฅiโ .
๐
๐ 1
Combining this inequality with inequality (2.13) and inequality (2.14) gives
๐
โ1 2
๐
(๐ฅ โ ๐ฆ)+ โ (๐ค) โค 2๐๐ + log ๐ h๐, ๐ฅiโ + (ฮจ(๐) โ ฮจ(๐)) + (๐ฅ โ ๐ฆ)+ โ (๐ค) ,
1 ๐โ1 ๐
๐ 1
(2.16)
completing the verification of Lemma 2.5.
3 Derivation of the dynamics and derived costs
For the sake of motivating the dynamics (1.11), we review the continuous-time mirror descent
framework of [14]. Suppose that K โ โ ๐ is a convex set. We recall again the definition of the
normal cone to K at ๐ฅ โ K, which is given by
๐K (๐ฅ) := (K โ ๐ฅ)โฆ = ๐ โ โ ๐ : h๐, ๐ฆ โ ๐ฅi โค 0 for all ๐ฆ โ K .
Suppose additionally that ฮฆ : ๐ โ โ is ๐ 2 and strictly convex on an open neighborhood
๐ โ K so that the Hessian โ2 ฮฆ(๐ฅ) is well-defined and positive definite on ๐. Given a control
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 18
P URE E NTROPIC R EGULARIZATION FOR M ETRICAL TASK S YSTEMS
function ๐น : [0, โ) ร K โ โ ๐ and an initial point ๐ฅ 0 โ K, we will be concerned with absolutely
continuous solutions ๐ฅ : [0, โ) โ K to the differential inclusion
๐ฅ(0) = ๐ฅ 0 ,
โ ฮฆ(๐ฅ(๐ก))๐ฅ 0(๐ก) โ ๐น(๐ก, ๐ฅ(๐ก)) โ ๐K (๐ฅ(๐ก)) .
2
In other words, a trajectory that satisfies ๐ฅ(0) = ๐ฅ0 and for almost every ๐ก โฅ 0:
๐ฅ 0(๐ก) = โ2 ฮฆ(๐ฅ(๐ก))โ1 (๐น(๐ก, ๐ฅ(๐ก)) โ ๐พ(๐ก)) , (3.1)
with ๐พ(๐ก) โ ๐K (๐ฅ(๐ก)).
Under suitably strong conditions on ฮฆ and ๐น, there is a unique absolutely continuous
solution to equation (3.1) [14]. In our setup, these conditions are actually not satisfied unless we
prevent the path ๐ฅ from hitting the relative boundary of K. Nevertheless, the formal calculation
is elucidating and motivates the algorithm of Section 2. For simplicity, we assume ๐
:= 1 in this
section.
3.1 Hessian computation
๐
Let us take ฮฆ as in (1.7) and calculate โ2 ฮฆ(๐ฅ) for ๐ฅ โ โ++ . Fix ๐ข โ ๐ฃ . Then we have
๐ค๐ข ๐ฅ๐ข ร ๐ค ๐ฅ๐ฃ ๐ฅ๐ฃ
๐ฃ
๐๐ข ฮฆ(๐ฅ) = log + ๐ฟ๐ข + 1 + ๐ฟ ๐ฃ log + ๐ฟ๐ฃ โ . (3.2)
๐๐ข ๐ฅ p(๐ข) ๐๐ฃ ๐ฅ๐ข ๐ฅ๐ข
๐ฃโ๐(๐ข)
Moreover, ๐๐ข๐ฃ ฮฆ(๐ฅ) = 0 unless ๐ข = ๐ฃ, ๐ข โ ๐(๐ฃ), or ๐ฃ โ ๐(๐ข), and in this case,
๐ค๐ข ๐ฅ๐ฃ ๐ค๐ฃ
ร 2
๐๐ข๐ข ฮฆ(๐ฅ) = +
๐๐ข (๐ฅ ๐ข + ๐ฟ ๐ข ๐ฅp(๐ข) ) ๐ฅ๐ข ๐๐ฃ (๐ฅ ๐ฃ + ๐ฟ ๐ฃ ๐ฅ ๐ข )
๐ฃโ๐(๐ข)
๐ฅ๐ข ๐ค๐ข
๐๐ข,p(๐ข) ฮฆ(๐ฅ) = ๐p(๐ข),๐ข ฮฆ(๐ฅ) = โ .
๐ฅp(๐ข) ๐๐ข (๐ฅ ๐ข + ๐ฟ ๐ข ๐ฅ p(๐ข) )
3.2 Explicit dynamics
We are now in a position to calculate the formal dynamics. Let us define the control by
๐น(ยท, ๐ก) := โ๐(๐ก). We claim that for ๐ข โ ๐ฃ ,
!
๐ฅ ๐ข (๐ก) ๐๐ข ๐ฅ ๐ข (๐ก) ร ๐ฅ (๐ก)
โ
๐๐ก = + ๐ฟ๐ข ๐ฝp(๐ข) (๐ก) โ ๐โ , (3.3)
๐ฅ p(๐ข) (๐ก) ๐ค ๐ข ๐ฅp(๐ข) (๐ก) ๐ฅ ๐ข (๐ก)
โ โโ๐ข
where ๐ฝ ๐ข (๐ก) โฅ 0 denotes the Lagrange multiplier corresponding to the constraint ๐ฅ ๐ข = ๐ฃโ๐(๐ข) ๐ฅ ๐ฃ .
ร
To verify equation (3.3), let us define, for ๐ข โ ๐ฃ ,
๐ค๐ข ๐ฅ p(๐ข) (๐ก) ๐ฅ ๐ข (๐ก)
โฐ(๐ข) := ๐๐ก .
๐๐ข ๐ฅ ๐ข (๐ก) + ๐ฟ ๐ข ๐ฅ p(๐ข) (๐ก) ๐ฅ p(๐ข) (๐ก)
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 19
C HRISTIAN C OESTER AND JAMES R. L EE
Then equation (3.3) is equivalent to the assertion that
ร ๐ฅ (๐ก)
โ
โฐ(๐ข) = ๐ฝ p(๐ข) (๐ก) โ ๐โ (๐ก). (3.4)
๐ฅ ๐ข (๐ก)
โ โโ๐ข
Recalling equation (3.1), the equality โ2 ฮฆ(๐ฅ(๐ก))๐ฅ 0(๐ก) ๐ข = (๐น(๐ก, ๐ฅ(๐ก)) โ ๐พ(๐ก))๐ข is equivalent to
โฐ(โ ) = ๐ฝ p(โ ) (๐ก) โ ๐โ (๐ก), โ โ โ, (3.5)
ร ๐ฅ (๐ก)
๐ฃ
โฐ(๐ข) โ โฐ(๐ฃ) = ๐ฝ p(๐ข) (๐ก) โ ๐ฝ ๐ข (๐ก) , ๐ข โ ๐ \ (โ โช {๐ฃ }). (3.6)
๐ฅ ๐ข (๐ก)
๐ฃโ๐(๐ข)
Clearly equation (3.5) already confirms equation (3.4) for โ โ โ.
Let us conclude by verifying equation (3.4) for all ๐ข โ ๐ฃ by (reverse) induction on the depth.
Employing equation (3.6) along with the validity of equation (3.4) for {โฐ(๐ฃ) : ๐ฃ โ ๐(๐ข)} yields
!
ร ๐ฅ (๐ก) ร ๐ฅ (๐ก)
๐ฃ โ
โฐ(๐ข) = ๐ฝ p(๐ข) (๐ก) โ ๐ฝ ๐ข (๐ก) + ๐ฝ ๐ข (๐ก) โ ๐โ (๐ก)
๐ฅ ๐ข (๐ก) ๐ฅ ๐ข (๐ก)
๐ฃโ๐(๐ข) โ โโ๐ข
ร ๐ฅ (๐ก)
โ
= ๐ฝ p(๐ข) (๐ก) โ ๐โ (๐ก),
๐ฅ ๐ข (๐ก)
โ โโ๐ข
where we used the fact that ๐ฅ ๐ข = ๐ฃโ๐(๐ข) ๐ฅ ๐ฃ for ๐ฅ โ K๐ .
ร
3.3 Relationship between discrete and continuous dynamics
Recall the setup from Section 1.3.3. We consider a system of variables {๐ ๐ข (๐ก) : ๐ข โ ๐ \ {๐ฃ }}
satisfying the differential equations
๐๐ข
๐๐ก ๐ ๐ข (๐ก) = (๐ ๐ข (๐ก) + ๐ฟ ๐ข ) ๐ฝ p(๐ข) (๐ก) โ ๐ห๐ข (๐ก) + ๐ผ ๐ข (๐ก) ,
(3.7)
๐ค๐ข
where ๐ผ ๐ข (๐ก) is a Lagrangian multiplier for the constraint ๐ ๐ข (๐ก) โฅ 0, and ๐ห๐ข (๐ก) is the โderivedโ
cost in the subtree rooted at ๐ข:
ร
๐ห๐ข (๐ก) := ๐โ |๐ข (๐ก)๐โ (๐ก)
โ โโ๐ข
ร
๐โ |๐ข (๐ก) := ๐ ๐ฃ (๐ก) ,
๐ฃโ๐พ๐ข,โ \{๐ข}
where ๐พ๐ข,โ is the unique simple ๐ข-โ path in ๐. Now the values ๐โ |๐ฃ give a probability distribution
on the leaves.
Let us argue that when the discretization parameter of the algorithm presented in Section 2
goes to zero, one arrives at a solution to equation (3.7). Recall that in Section 2.3, we split
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 20
P URE E NTROPIC R EGULARIZATION FOR M ETRICAL TASK S YSTEMS
each cost function ๐ โ โ+โ into ๐ pieces ๐ โ1 ๐ and computed a sequence of configurations
๐0 , . . . , ๐ ๐ โ ๐๐ . Define the piecewise-linear function ๐(๐) : [0, 1] โ ๐๐ by
๐+๐ฟ
๐(๐) := (1 โ ๐ฟ)๐ ๐ + ๐ฟ๐ ๐+1 , ๐ฟ โ [0, 1], ๐ โ {0, . . . , ๐ โ 1}.
๐
Recalling Section 2.1, we have
n D E o
(๐ข)
๐๐ := argmin D(๐ข) ๐ k ๐ (๐ข)
๐โ1
(๐ข)
+ ๐, ๐ โ1 ๐ห ๐ ๐ โ ๐๐
(๐ข)
, (3.8)
where ร
(๐ข)
๐ห ๐ = (๐ ๐ )โ |๐ข ๐โ .
โ โโ๐ฃ
Thus for ๐ฃ โ ๐(๐ข) and ๐ โฅ 1,
๐๐ฃ
h i
(๐ข) (๐ข) (๐ข)
๐๐ = ๐ ๐โ1 + ๐ฟ ๐ฃ exp ๐ฝ ๐ข โ (๐ โ1 ( ๐ห ๐ )๐ฃ โ ๐ผ ๐ฃ ) โ ๐ฟ ๐ฃ .
๐ฃ ๐ฃ ๐ค๐ฃ
One can now verify that there is a constant ๐ฟ = ๐ฟ(๐, ๐) such that
๐ฟ
๐ ๐ ๐ฃ โ ๐ ๐โ1 ๐ฃ โค , ๐ โ {1, . . . , ๐}, ๐ฃ โ ๐ \ {๐ฃ }.
๐
In particular, we see that ๐(๐)
0
โ ๐ฟโ ([0, 1], โ๐\{๐ฃ } ) for every ๐ โฅ 1 and, moreover,
0
sup ๐ (๐) < โ. (3.9)
๐โฅ1 ๐ฟโ
Therefore by Arzelร โAscoli (see, e.g., [2, Thm. 0.3.1]), there is a subsequence {๐ ๐ } such that
๐(๐ ๐ ) converges uniformly to a function ๐ : [0, 1] โ ๐๐ .
Since the unit ball of ๐ฟโ ([0, 1], โ๐\{๐ฃ } ) is weakly compact (by the sequential BanachโAlaoglu
Theorem; see, e.g., [2, Thm. 0.3.3]), we can pass to a further subsequence {๐ 0๐ } along which ๐ (๐
0
0)
๐
โซ๐
converges weakly to some โ โ ๐ฟโ ([0, 1], โ๐\{๐ฃ } ). Moreover, since ๐(๐) (๐)โ ๐(๐) (๐) = ๐ 0 (๐ก) ๐๐ก
๐ (๐)
โซ๐
for all 0 โค ๐ < ๐ โค 1, it follows that ๐(๐) โ ๐(๐) = ๐ โ(๐ก) ๐๐ก as well, and therefore for almost all
๐ก โ [0, 1], we have ๐ 0(๐ก) = โ(๐ก).
๐\{๐ฃ }
If we similarly linearly interpolate the cost function to ๐ห(๐) : [0, 1] โ โ+ , then ๐ห(๐ ๐ ) โ ๐ห
along this sequence as well, and
ร
๐ห(๐ข) (๐ก) = ๐โ |๐ข (๐ก)๐โ .
โ โโ๐ฃ
Now the KKT conditions for optimality in (3.8) give
(๐ข) (๐ข) (๐ข) (๐ข)
โฮฆ(๐ข) ๐ ๐ โ โฮฆ(๐ข) ๐ ๐โ1 + ๐ โ1 ๐ห ๐ โ โN๐ (๐ข) ๐ ๐ ,
๐
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 21
C HRISTIAN C OESTER AND JAMES R. L EE
or equivalently,
(๐ข) (๐ข)
โฮฆ(๐ข) ๐ ๐ โ โฮฆ(๐ข) ๐ ๐โ1
(๐ข) (๐ข)
โ โ๐ห ๐ โ N๐ (๐ข) ๐ ๐ .
๐ โ1 ๐
By standard results in differential inclusion theory (e. g., the Convergence Theorem [2, Thm.
1.4.1]), we conclude that ๐ : [0, 1] โ ๐๐ solves the differential inclusion
โ2 ฮฆ(๐ข) ๐ (๐ข) (๐ก) ๐๐ก ๐ (๐ข) (๐ก) โ โ๐ห(๐ข) (๐ก) โ N๐ (๐ข) (๐ (๐ข) (๐ก)).
๐
Calculating the Hessian โ2 ฮฆ(๐ข) reveals that ๐(๐ก) is a solution to equation (3.7).
References
[1] Jacob Abernethy, Peter Bartlett, Niv Buchbinder, and Isabelle Stanton: A regularization
approach to metrical task systems. In Proc. 21st Internat. Conf. Algorithmic Learning Theory
(ALTโ10), pp. 270โ284. Springer, 2010. [doi:10.1007/978-3-642-16108-7_23] 5
[2] Jean-Pierre Aubin and Arrigo Cellina: Differential Inclusions: Set-Valued Maps and Viability
Theory. Springer, 1984. [doi:10.1007/978-3-642-69512-4] 21, 22
[3] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor: A polylogarithmic-
competitive algorithm for the ๐-server problem. J. ACM, 62(5):40:1โ49, 2015. Preliminary
version in FOCSโ11. [doi:10.1145/2783434, arXiv:1110.1580] 14
[4] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: Metrical task systems and the ๐-server
problem on HSTs. In Proc. 37th Internat. Colloq. on Automata, Languages, and Programming
(ICALPโ10), pp. 287โ298. Springer, 2010. [doi:10.1007/978-3-642-14165-2_25] 3
[5] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: A primal-dual randomized algorithm
for weighted paging. J. ACM, 59(4):19:1โ24, 2012. Preliminary version in FOCSโ07.
[doi:10.1145/2339123.2339126] 2
[6] Yair Bartal: Probabilistic approximations of metric spaces and its algorithmic applications.
In Proc. 37th FOCS, pp. 184โ193. IEEE Comp. Soc., 1996. [doi:10.1109/SFCS.1996.548477] 2,
3
[7] Yair Bartal, Avrim Blum, Carl Burch, and Andrew Tomkins: A polylog(n)-competitive
algorithm for metrical task systems. In Proc. 29th STOC, pp. 711โ719. ACM Press, 1997.
[doi:10.1145/258533.258667] 2
[8] Yair Bartal, Bรฉla Bollobรกs, and Manor Mendel: Ramsey-type theorems for metric
spaces with applications to online problems. J. Comput. System Sci., 72(5):890โ921, 2006.
[doi:10.1016/j.jcss.2005.05.008, arXiv:cs/0406028] 2
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 22
P URE E NTROPIC R EGULARIZATION FOR M ETRICAL TASK S YSTEMS
[9] Yair Bartal, Nathan Linial, Manor Mendel, and Assaf Naor: On metric Ramsey-
type phenomena. Ann. Math., 162(2):643โ709, 2005. Preliminary version in STOCโ03.
[doi:10.4007/annals.2005.162.643, arXiv:math/0406353] 2
[10] Avrim Blum, Howard Karloff, Yuval Rabani, and Michael Saks: A decomposition
theorem for task systems and bounds for randomized server problems. SIAM J. Comput.,
30(5):1624โ1661, 2000. [doi:10.1137/S0097539799351882] 2
[11] Allan Borodin, Nathan Linial, and Michael E. Saks: An optimal on-line algorithm for
metrical task system. J. ACM, 39(4):745โ763, 1992. [doi:10.1145/146585.146588] 2
[12] Sรฉbastien Bubeck and Nicolรฒ Cesa-Bianchi: Regret analysis of stochastic and non-
stochastic multi-armed bandit problems. Found. Trends Mach. Learning, 5(1):1โ122, 2012.
[doi:10.1561/2200000024, arXiv:1204.5721] 2, 6
[13] Sรฉbastien Bubeck, Michael B. Cohen, James R. Lee, and Yin Tat Lee: Metrical task systems
on trees via mirror descent and unfair gluing. SIAM J. Comput., 50(3):909โ923, 2021.
Preliminary version in SODAโ19. [doi:10.1137/19M1237879, arXiv:1807.04404] 2, 3, 4, 5, 6
[14] Sรฉbastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Madry:
๐-server via multiscale entropic regularization. In Proc. 50th STOC, pp. 3โ16. ACM Press,
2018. [doi:10.1145/3188745.3188798, arXiv:1711.01085] 5, 8, 18, 19
[15] Niv Buchbinder, Shahar Chen, and Joseph (Seffi) Naor: Competitive analysis via regular-
ization. In Proc. 25th Ann. ACMโSIAM Symp. on Discrete Algorithms (SODAโ14), pp. 436โ444.
SIAM, 2014. [doi:10.1137/1.9781611973402.32] 5
[16] Niv Buchbinder and Joseph Naor: The design of competitive online algorithms via a primal-
dual approach. Found. Trends Theor. Comp. Sci., 3(2โ3):93โ263, 2009. [doi:10.1561/0400000024]
6
[17] Christian Coester and James R. Lee: Pure entropic regularization for metrical task systems.
In Proc. 32nd Ann. Conf. on Learning Theory (COLTโ19), pp. 835โ848. MLR Press, 2019. PMLR.
1
[18] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar: A tight bound on approximating
arbitrary metrics by tree metrics. J. Comput. System Sci., 69(3):485โ497, 2004. Preliminary
version in STOCโ03. [doi:10.1016/j.jcss.2004.04.011] 2, 3, 4
[19] Amos Fiat and Manor Mendel: Better algorithms for unfair metrical task systems and
applications. SIAM J. Comput., 32(6):1403โ1422, 2003. Preliminary version in STOCโ00.
[doi:10.1137/S0097539700376159, arXiv:cs/0406034] 2
[20] Steve Seiden: Unfair problems and randomized algorithms for metrical task systems.
Inform. Comput., 148(2):219โ240, 1999. [doi:10.1006/inco.1998.2744] 2
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 23
C HRISTIAN C OESTER AND JAMES R. L EE
AUTHORS
Christian Coester
Associate professor
Department of Computer Science
University of Oxford
Oxford, United Kingdom
christian coester cs ox ac uk
https://www.cs.ox.ac.uk/people/christian.coester/
James R. Lee
Professor
Paul G. Allen Center for Computer Science & Engineering
University of Washington
Seattle, Washington, USA
jrl cs washington edu
https://homes.cs.washington.edu/~jrl/
ABOUT THE AUTHORS
Christian Coester is Associate Professor in the Department of Computer Science at
the University of Oxford and Tutorial Fellow at St Anneโs College. He received
his Ph. D. from Oxford in 2020, under the supervision of Elias Koutsoupias. After
stints as a postdoctoral researcher at CWI in Amsterdam and Tel Aviv University
and as a lecturer at the University of Sheffield, he returned to Oxford in 2022.
His research focuses on the design and analysis of algorithms, especially online
algorithms and learning-augmented algorithms. Outside of his research, he
enjoys sports, chess and playing the piano.
James R. Lee is a Professor of Computer Science & Engineering at the University of
Washington. He received his Ph. D. from the University of California, Berkeley
in 2005, under the supervision of Christos Papadimitriou, followed by a postdoc
at the Institute for Advanced Study in Princeton. His research interests are
varied and eclectic, ranging from spectral graph algorithms to functional analysis,
and from convex optimization to statistical physics. He challenged a class of
undergrads to compete against the MTS algorithm in this paper. They fought
(and coded) valiantly. They PyTorched and TensorFlowed. But in the end, the
Theory won.
T HEORY OF C OMPUTING, Volume 18 (23), 2022, pp. 1โ24 24