
Pure Entropic Regularization for Metrical Task Systems

Authors: Christian Coester, James R. Lee

License CC-BY-3.0

                          Theory of Computing, Volume 18 (23), 2022, pp. 1–24
                                       www.theoryofcomputing.org




Pure Entropic Regularization for Metrical
             Task Systems
                              Christian Coester*                    James R. Lee†
                 Received July 23, 2019; Revised October 8, 2020; Published December 29, 2022




       Abstract. We show that on every $n$-point HST metric, there is a randomized online
       algorithm for metrical task systems (MTS) that is 1-competitive for service costs and
       $O(\log n)$-competitive for movement costs. In general, these refined guarantees are
       optimal up to the implicit constant. While an $O(\log n)$-competitive algorithm for
       MTS on HST metrics was developed by Bubeck et al. (SODA'19), that approach could
       only establish an $O((\log n)^2)$-competitive ratio when the service costs are required
       to be $O(1)$-competitive. Our algorithm can be viewed as an instantiation of online
       mirror descent with the regularizer derived from a multiscale conditional entropy.
           In fact, our algorithm satisfies a set of even more refined guarantees; we are able
       to exploit this property to combine it with known random embedding theorems and
       obtain, for any $n$-point metric space, a randomized algorithm that is 1-competitive
       for service costs and $O((\log n)^2)$-competitive for movement costs.

    An extended abstract of this paper appeared in the Proceedings of the 32nd Ann. Conference on Learning
Theory (COLT 2019) [17].
  * Supported by EPSRC Award 1652110. Part of this work was carried out while C. Coester was visiting the
University of Washington, hosted by J. R. Lee.
  † Supported by NSF grants CCF-1616297 and CCF-1407779 and a Simons Investigator Award.



ACM Classification: F.2.0
AMS Classification: 68W27
Key words and phrases: online algorithms, competitive analysis, mirror descent, metrical task
systems, decision making under uncertainty


© 2022 Christian Coester and James R. Lee
Licensed under a Creative Commons Attribution License (CC-BY)                 DOI: 10.4086/toc.2022.v018a023

1    Introduction
Let (๐‘‹ , ๐‘‘๐‘‹ ) be a finite metric space with |๐‘‹ | = ๐‘› > 1. The Metrical Task Systems (MTS) problem,
introduced in [11] is described as follows. The input is a sequence h๐‘ ๐‘ก : ๐‘‹ โ†’ โ„+ : ๐‘ก โ‰ฅ 1i of
nonnegative cost functions on the state space ๐‘‹. At every time ๐‘ก, an online algorithm maintains
a state ๐œŒ๐‘ก โˆˆ ๐‘‹.
     The corresponding cost is the sum of a service cost ๐‘ ๐‘ก (๐œŒ๐‘ก ) and a movement cost ๐‘‘๐‘‹ (๐œŒ๐‘กโˆ’1 , ๐œŒ๐‘ก ).
Formally, an online algorithm is a sequence of mappings ๐† = h๐œŒ1 , ๐œŒ2 , . . . , i where, for every ๐‘ก โ‰ฅ 1,
๐œŒ๐‘ก : (โ„+๐‘‹ )๐‘ก โ†’ ๐‘‹ maps a sequence of cost functions h๐‘1 , . . . , ๐‘ ๐‘ก i to a state. The initial state ๐œŒ0 โˆˆ ๐‘‹
is fixed. The total cost of the algorithm ๐† in servicing ๐’„ = h๐‘ ๐‘ก : ๐‘ก โ‰ฅ 1i is defined as:
                            ร•
             cost๐† (๐’„) :=         [๐‘ ๐‘ก (๐œŒ๐‘ก (๐‘1 , . . . , ๐‘ ๐‘ก )) + ๐‘‘๐‘‹ (๐œŒ๐‘กโˆ’1 (๐‘1 , . . . , ๐‘ ๐‘กโˆ’1 ), ๐œŒ๐‘ก (๐‘1 , . . . , ๐‘ ๐‘ก ))] .
                            ๐‘กโ‰ฅ1

The cost of the offline optimum, denoted $\mathrm{cost}^*(\boldsymbol{c})$, is the infimum of $\sum_{t \geq 1} [c_t(\rho_t) + d_X(\rho_{t-1}, \rho_t)]$
over all sequences $\langle \rho_t : t \geq 1 \rangle$ of states. A randomized online algorithm $\boldsymbol{\rho}$ is said to be $\alpha$-competitive
if for every $\rho_0 \in X$, there is a constant $\beta > 0$ such that for all cost sequences $\boldsymbol{c}$:

$$\mathbb{E}\left[\mathrm{cost}_{\boldsymbol{\rho}}(\boldsymbol{c})\right] \leq \alpha \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta.$$

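The definitions above can be made concrete with a short sketch (ours, not from the paper; the function names `mts_cost` and `offline_opt` are illustrative). It computes the total cost of a fixed state sequence, and $\mathrm{cost}^*(\boldsymbol{c})$ by dynamic programming over states, which is how the offline optimum is evaluated on a finite metric space.

```python
# Illustrative sketch: total MTS cost of a state sequence, and the offline
# optimum cost*(c) via dynamic programming. The metric d is a dict of dicts.

def mts_cost(d, states, costs, rho0):
    """Total service + movement cost of the state sequence `states`."""
    total, prev = 0.0, rho0
    for c_t, rho_t in zip(costs, states):
        total += c_t[rho_t] + d[prev][rho_t]  # service cost + movement cost
        prev = rho_t
    return total

def offline_opt(d, points, costs, rho0):
    """cost*(c): infimum over all state sequences rho_1, rho_2, ..., via DP."""
    # best[y] = cheapest total cost of any sequence ending in state y so far
    best = {y: (0.0 if y == rho0 else float('inf')) for y in points}
    for c_t in costs:
        best = {x: min(best[y] + d[y][x] for y in points) + c_t[x]
                for x in points}
    return min(best.values())
```

For example, on the 2-point uniform metric with a single cost vector charging 5 at the start state and 0 elsewhere, the offline optimum moves once and pays only the movement cost of 1.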
    For the ๐‘›-point uniform metric, a simple coupon-collector argument shows that the com-
petitive ratio is ฮฉ(log ๐‘›), and this is tight [11]. A long-standing conjecture is that this ฮ˜(log ๐‘›)
competitive ratio holds for an arbitrary ๐‘›-point metric space. The lower bound has almost
been established [8, 9]; for any ๐‘›-point metric space, the competitive ratio is ฮฉ(log ๐‘›/log log ๐‘›).
Following a long sequence of works (see, e. g., [20, 10, 7, 6, 19, 18]), an upper bound of ๐‘‚((log ๐‘›)2 )
was shown in [13].

Relation to adversarial multi-armed bandits MTS is naturally related to the adversarial setting of
the classical multi-armed bandit model in sequential decision making, and provides a very general
framework for "bandits with switching costs." Unlike in the setting of regret minimization,
where one competes against the best static strategy in hindsight (see, e. g., [12]), competitive
analysis compares the performance of an online algorithm to the best dynamical offline algorithm.
    Thus this model emphasizes the importance of adaptivity in the face of changing
environments. For MTS, the online algorithm has full information: access to the complete cost
function $c_t$ is available when deciding on a point $\rho_t(c_1, \ldots, c_t) \in X$ at which to play. And yet
one of the fascinating relationships between MTS and adversarial bandits is the parallel between
adaptivity (being willing to "try out" new strategies) and the classical exploration/exploitation
tradeoff that occurs in models where one only has access to partial information about the loss
functions.

HST metrics The methods of [5] show that the competitive ratio for MTS is $O(\log n)$ on
weighted star metrics. Recently, the authors of [13] generalized this result by designing
an algorithm with competitive ratio $O(\mathfrak{D}_T \log n)$ on any weighted $n$-point tree metric with
combinatorial depth $\mathfrak{D}_T$. We now discuss a special class of metrics.


    Let ๐‘‡ = (๐‘‰ , ๐ธ) be a finite tree with root ๐•ฃ and vertex weights {๐‘ค ๐‘ข > 0 : ๐‘ข โˆˆ ๐‘‰ }, let โ„’ โІ ๐‘‰
denote the leaves of ๐‘‡, and suppose that the vertex weights on ๐‘‡ are non-increasing along
root-leaf paths. Consider the metric space (โ„’, ๐‘‘๐‘‡ ), where ๐‘‘๐‘‡ (โ„“ , โ„“ 0) is the weighted length of the
path connecting โ„“ and โ„“ 0 when the edge from a node ๐‘ข to its parent is ๐‘ค ๐‘ข . We will use ๐”‡๐‘‡ for
the combinatorial (i. e., unweighted) depth of ๐‘‡.
    (โ„’, ๐‘‘๐‘‡ ) is called an HST metric (or, equivalently for finite metric spaces, an ultrametric). If, for
some ๐œ > 1, the weights on ๐‘‡ satisfy the stronger inequality ๐‘ค ๐‘ฃ โ‰ค ๐‘ค ๐‘ข /๐œ whenever ๐‘ฃ is a child
of ๐‘ข, the space (โ„’, ๐‘‘๐‘‡ ) is said to be a ๐œ-HST metric. Such metric spaces play a special role in
MTS since every ๐‘›-point metric space can be probabilistically approximated by a distribution
over such spaces [6, 18]. Indeed, the ๐‘‚((log ๐‘›)2 )-competitive ratio for general metric spaces
established in [13] is a consequence of their ๐‘‚(log ๐‘›)-competitive algorithm for HSTs.
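The tree distance just described can be sketched in a few lines (ours; the dictionary encoding of the tree is an assumption for illustration): $d_T(\ell, \ell')$ sums the weights $w_u$ over all nodes on the $\ell$-$\ell'$ path, excluding the least common ancestor.

```python
# Illustrative sketch: HST/ultrametric distance on the leaves of a rooted
# tree, where the edge from node u to its parent has length w[u].

def hst_distance(parent, w, a, b):
    """Sum of w[u] over all nodes u on the a-b path, excluding their LCA."""
    # Collect the ancestors of a (inclusive) up to the root.
    anc, u = [], a
    while u is not None:
        anc.append(u)
        u = parent[u]
    anc_set = set(anc)
    # Walk up from b until we hit an ancestor of a (the LCA), summing weights.
    dist, u = 0.0, b
    while u not in anc_set:
        dist += w[u]
        u = parent[u]
    lca = u
    # Add the a-side of the path.
    u = a
    while u != lca:
        dist += w[u]
        u = parent[u]
    return dist
```

On a 2-level tree with $w_u = 4$ at depth one and $w_\ell = 1$ at the leaves (so a 4-HST), two leaves under the same internal node are at distance 2, while leaves in different subtrees are at distance 10.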


1.1   Refined guarantees
The authors of [4] observe that there is a more refined way to analyze competitive algorithms for
MTS. For a randomized online algorithm $\boldsymbol{\rho}$ and a cost sequence $\boldsymbol{c}$, we write $\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c})$ and $\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c})$
for the expected service cost and movement cost, respectively, that is,

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) := \mathbb{E} \sum_{t \geq 1} c_t(\rho_t) \qquad \text{and} \qquad \mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) := \mathbb{E} \sum_{t \geq 1} d_X(\rho_{t-1}, \rho_t).$$


If there are numbers $\alpha, \alpha', \beta, \beta' > 0$ such that for every cost sequence $\boldsymbol{c}$, it holds that

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta$$
$$\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha' \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta',$$

one says that $\boldsymbol{\rho}$ is $\alpha$-competitive for service costs and $\alpha'$-competitive for movement costs.
    In [4], it is shown that on every $n$-point HST metric, and for every $\varepsilon > 0$, there is an
online algorithm that is simultaneously $(1+\varepsilon)$-competitive for service costs and $O((\log(n/\varepsilon))^2)$-
competitive for movement costs. The authors of [13] improve this slightly to show that
actually there is an online algorithm that is simultaneously 1-competitive for service costs and
$O((\log n)^2)$-competitive for movement costs. We obtain the optimal refined guarantees.

Theorem 1.1. On any $n$-point HST metric $X$, there is a randomized online algorithm that is 1-competitive
for service costs and $O(\log n)$-competitive for movement costs.

Remark 1.2 (Optimality of the refined guarantees). Any finitely competitive algorithm for MTS
on an $n$-point uniform metric cannot be better than $\Omega(\log n)$-competitive for movement costs,
regardless of its competitive ratio for service costs. This is because the $\Omega(\log n)$ lower bound holds even
if the cost functions only take values 0 and $\infty$. Moreover, it cannot be better than 1-competitive
for service costs, regardless of its competitive ratio for movement costs. To see this, consider the
case where each cost function is the constant function 1.


Finely competitive guarantees Suppose that for some numbers $\alpha_0, \alpha_1, \gamma, \beta, \beta' > 0$, a random-
ized online algorithm $\boldsymbol{\rho}$ satisfies, for every cost sequence $\boldsymbol{c}$ and every offline algorithm $\boldsymbol{\rho}^*$:

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha_0\, \mathrm{S}_{\boldsymbol{\rho}^*}(\boldsymbol{c}) + \alpha_1\, \mathrm{M}_{\boldsymbol{\rho}^*}(\boldsymbol{c}) + \beta \tag{1.1}$$
$$\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \gamma\, \mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) + \beta'. \tag{1.2}$$

In this case, we say that $\boldsymbol{\rho}$ is $(\alpha_0, \alpha_1, \gamma)$-finely competitive. We establish the following.
Theorem 1.3. On any $n$-point HST metric $X$, for every $\kappa \geq 1$, there is an online randomized algorithm
$\boldsymbol{\rho}$ that is $(1, 1/\kappa, O(\kappa \log n))$-finely competitive. In fact, one can take $\beta = 0$ and $\beta' \leq O(\kappa\, \mathrm{diam}(X))$.

   Combined with the random embedding from [18], this yields the following consequence for
general $n$-point metric spaces.
Corollary 1.4. On any $n$-point metric space, there is an online randomized algorithm that is 1-competitive
for service costs and $O((\log n)^2)$-competitive for movement costs.
Proof. Consider an $n$-point metric space $(X, d_X)$. It is known [18] that there exists a random
HST metric $(T, d_T)$ so that $\mathcal{L}(T) = X$ and for all $x, y \in X$:
   1. $\Pr[d_T(x, y) \geq d_X(x, y)] = 1$,
   2. $\mathbb{E}[d_T(x, y)] \leq D \cdot d_X(x, y)$,
where $D \leq O(\log n)$.
   Let ๐†๐‘‡ be the randomized algorithm for (๐‘‡, ๐‘‘๐‘‡ ) guaranteed by Theorem 1.3 with ๐œ… = ๐ท. Let
๐† denote the algorithm that results from sampling (๐‘‡, ๐‘‘๐‘‡ ) and then using ๐†๐‘‡ . We use M๐‘‡ to
denote movement cost measured in ๐‘‘๐‘‡ and M๐‘‹ for movement cost measured in ๐‘‘๐‘‹ .
   Then for any cost ๐’„ and any offline algorithm ๐†โˆ— , we have
                         S๐† (๐’„) = ๐”ผ[S๐†๐‘‡ (๐’„)] โ‰ค S๐†โˆ— (๐’„) + ๐œ… โˆ’1 ๐”ผ[M๐‘‡๐†โˆ— (๐’„)] + ๐‘‚(1)
                                                 โ‰ค S๐†โˆ— (๐’„) + ๐œ… โˆ’1 ๐ท M๐†๐‘‹โˆ— (๐’„) + ๐‘‚(1)
                                                = S๐†โˆ— (๐’„) + M๐†๐‘‹โˆ— (๐’„) + ๐‘‚(1) ,
and
                 M๐†๐‘‹ (๐’„) = ๐”ผ[M๐†๐‘‹๐‘‡ (๐’„)] โ‰ค ๐”ผ[M๐‘‡๐†๐‘‡ (๐’„)] โ‰ค ๐‘‚(๐œ… log ๐‘›) ๐”ผ[S๐†๐‘‡ (๐’„)] + ๐‘‚(1),
completing the proof.                                                                                  

1.2   The fractional model on trees
We will work in the following deterministic fractional setting, which is equivalent to the
randomized integral setting described earlier (see [13, §2]). The state of a fractional algorithm is
given by a point in the polytope

$$\mathrm{K}_T := \left\{ x \in \mathbb{R}_+^V : x_{\mathfrak{r}} = 1,\ x_u = \sum_{v \in \chi(u)} x_v \quad \forall u \in V \setminus \mathcal{L} \right\}, \tag{1.3}$$


where we use $\chi(u)$ for the set of children of $u$ in $T$. For $u \neq \mathfrak{r}$, we will also write $\mathrm{p}(u)$ for the
parent of $u$ in $T$.
    A state $x \in \mathrm{K}_T$ corresponds to the situation that the state of a randomized integral algorithm
is a leaf descendant of $u$ with probability $x_u$. Note that $\mathrm{K}_T$ is simply an affine encoding of the
probability simplex on $\mathcal{L}$. In the fractional setting, changing from state $x$ to $x'$ incurs movement
cost $\|x - x'\|_{\ell_1(w)}$, where

$$\|z\|_{\ell_1(w)} := \sum_{u \in V} w_u |z_u|$$

denotes the weighted $\ell_1$-norm on $\mathbb{R}^V$.
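A minimal sketch (ours; the dictionary encoding and function names are illustrative) of the two ingredients of the fractional model: membership in the polytope $\mathrm{K}_T$ of (1.3), and the weighted $\ell_1$ movement cost.

```python
# Illustrative sketch: membership test for K_T and the fractional movement
# cost ||x - x'||_{l1(w)}. `children` maps each internal node to its children.

def in_KT(x, children, root, leaves, tol=1e-9):
    """x in K_T: x_root = 1, x >= 0, and flow conservation at internal nodes."""
    if abs(x[root] - 1.0) > tol or any(v < -tol for v in x.values()):
        return False
    return all(abs(x[u] - sum(x[v] for v in children[u])) <= tol
               for u in children if u not in leaves)

def movement(x, x2, w):
    """Weighted l1 distance: sum over nodes u of w[u] * |x[u] - x2[u]|."""
    return sum(w[u] * abs(x[u] - x2[u]) for u in w)
```

For a single-level tree (a weighted star), moving half the probability mass from one leaf to another with unit leaf weights costs exactly 1, matching the earthmover intuition.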

1.3     Mirror descent, metric filtrations, and regularization
Following [13], our algorithm is based on the mirror descent framework as established in [14].
This is a method for regularized online convex optimization, an approach that was previously
explored for competitive analysis in [1, 15].
     A central component of mirror descent is choosing the appropriate mirror map (which we
will often refer to as the "regularizer"). This is a strictly convex function $\Phi : \mathrm{K}_T \to \mathbb{R}$ that endows
$\mathrm{K}_T$ with a geometric (Riemannian) structure, specifying how to perform constrained vector flow.
In other words, it specifies how one can move in a preferred direction while remaining inside
$\mathrm{K}_T$.
     The paper [13] employs the following regularizer:

$$\Phi_0(x) := \frac{1}{\eta} \sum_{u \in V \setminus \{\mathfrak{r}\}} w_u (x_u + \delta_u) \log(x_u + \delta_u), \tag{1.4}$$

with $\eta = \Theta(\log|\mathcal{L}|)$ and $\delta_u = |\mathcal{L}_u|/|\mathcal{L}|$, where $\mathcal{L}_u$ is the set of leaves in the subtree rooted at $u$.

1.3.1    Metric filtrations
It is straightforward that one can think of $\Phi_0$ as a type of multiscale entropy (this is the negative
of the associated Shannon entropy, since we use the analyst's convention that the entropy is
convex). To understand this notion, let us forget momentarily the weights on $T$. Then the
structure of $T$ gives a natural filtration over probability measures on the leaves $\mathcal{L}$. Suppose that
$\boldsymbol{X}$ is a random variable taking values in $\mathcal{L}$ and, for $u \in V$, denote by $\mathcal{E}_u$ the event $\{\boldsymbol{X} \in \mathcal{L}_u\}$.
Then the chain rule for Shannon entropy yields

$$\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \sum_{u \in V \setminus \{\mathfrak{r}\}} \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathrm{p}(u)}]}{\Pr[\mathcal{E}_u]}.$$

    If we now imagine that uncertainty at higher scales is more costly than uncertainty at lower
scales, then we might define an analogous weighted entropy by

$$\sum_{u \in V \setminus \{\mathfrak{r}\}} w_u \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathrm{p}(u)}]}{\Pr[\mathcal{E}_u]}. \tag{1.5}$$

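The chain rule above is easy to verify numerically; here is a small check of our own on a 3-leaf tree (root $\mathfrak{r}$ with an internal child $u$ covering leaves $\{a, b\}$, plus a leaf $c$ directly under the root).

```python
# Numeric sanity check (ours, not from the paper) of the entropy chain rule:
# sum over leaves of Pr[E_l] log(1/Pr[E_l]) equals the telescoping sum of
# conditional terms Pr[E_u] log(Pr[E_p(u)] / Pr[E_u]) over non-root nodes.
import math

parent = {'u': 'r', 'a': 'u', 'b': 'u', 'c': 'r'}   # non-root nodes -> parent
pr = {'r': 1.0, 'u': 0.5, 'a': 0.25, 'b': 0.25, 'c': 0.5}  # Pr[E_v]
leaves = ['a', 'b', 'c']

lhs = sum(pr[l] * math.log(1 / pr[l]) for l in leaves)
rhs = sum(pr[v] * math.log(pr[parent[v]] / pr[v]) for v in parent)
assert abs(lhs - rhs) < 1e-12
```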


Such a notion is natural in the context of "metric learning" problems.
    Ignoring the $\{\delta_u\}$ values for a moment, consider that (1.4) is not analogous to (1.5). Indeed,
it corresponds to the quantity

$$\sum_{u \in V \setminus \{\mathfrak{r}\}} w_u \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]}, \tag{1.6}$$

and now one can see a fundamental reason why the algorithm associated to (1.4) only achieves
an $O(\mathfrak{D}_T \log n)$ competitive ratio, where $\mathfrak{D}_T$ is the combinatorial depth of $T$: The quantity (1.6)
overmeasures the metric uncertainty.
    Suppose that $\boldsymbol{X}$ is a uniformly random leaf. Then $\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \log n$, where
$n = |\mathcal{L}|$. But, in general, one could have $\sum_{u \in V} \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]} \geq \Omega(\mathfrak{D}_T \log n)$. This fact was not
lost on the authors of [13], but they bypass the problem by combining mirror descent on stars
with a recursive composition method called "unfair gluing."

1.3.2   Multiscale conditional entropy
We employ a regularizer that is a more faithful analog of (1.5):

$$\Phi(x) := \sum_{u \in V \setminus \{\mathfrak{r}\}} \frac{w_u}{\eta_u} \left( x_u + \delta_u x_{\mathrm{p}(u)} \right) \log \left( \frac{x_u}{x_{\mathrm{p}(u)}} + \delta_u \right), \tag{1.7}$$

where $\mathrm{p}(u)$ denotes the parent of $u$.
   If one ignores the additional parameters $\{\eta_u \geq 1, \delta_u > 0\}$, this is precisely the negative
weighted Shannon entropy written according to the chain rule. Here, we set

$$\theta_u := \frac{|\mathcal{L}_u|}{|\mathcal{L}_{\mathrm{p}(u)}|} \tag{1.8}$$
$$\eta_u := 1 + \log(1/\theta_u) \tag{1.9}$$
$$\delta_u := \theta_u / \eta_u. \tag{1.10}$$

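The parameters (1.8)-(1.10) are purely combinatorial, depending only on leaf counts; a short sketch (ours; the function name and tree encoding are illustrative) computes them for every non-root node.

```python
# Illustrative sketch: the parameters theta_u, eta_u, delta_u of (1.8)-(1.10),
# given the number of leaves |L_u| below each node.
import math

def parameters(parent, leaf_count):
    """Return (theta, eta, delta) dicts for all non-root nodes."""
    theta = {u: leaf_count[u] / leaf_count[p]
             for u, p in parent.items() if p is not None}
    eta = {u: 1 + math.log(1 / t) for u, t in theta.items()}
    delta = {u: theta[u] / eta[u] for u in theta}
    return theta, eta, delta
```

For instance, a leaf $c$ that is one of three children-side leaves of the root gets $\theta_c = 1/3$ and learning rate $\eta_c = 1 + \log 3$.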
    The numbers $\{\theta_u\}$ are the conditional probabilities of the uniform distribution on leaves.
The $\{\delta_u\}$ values are employed as "noise" added to the entropy calculation. Such noise is a
fundamental aspect for competitive analysis, and distinguishes it from the application of mirror
descent to regret minimization problems (see, e. g., [12]).¹ The effect of these noise parameters
appears ubiquitously in applications of the primal-dual method to competitive analysis (see
[16]), and manifests itself as an additive term in the update rules (see equation (1.11) below).
Intuitively, it ensures that the conditional probability $x_u / x_{\mathrm{p}(u)}$ is updated fast enough even when it
is close to 0.
    Finally, the numbers $\{\eta_u : u \in V\}$ are commonly referred to as "learning rates" in the study
of online learning. They represent the rate at which information is discounted in the resulting
algorithm; for MTS, this corresponds to the relative importance of costs arriving now vs. costs
that arrived in the past.
   ¹ One finds aspects of this "mixing with the uniform distribution" in the bandits setting as well, but used for
variance reduction, a seemingly very different purpose.

1.3.3   The dynamics
We will derive in Section 3 the following continuous time evolution of the resulting mirror descent
algorithm $(x(t) \in \mathrm{K}_T : t \in [0, \infty))$ for a cost path $c : [0, \infty) \to \mathbb{R}_+^{\mathcal{L}}$:

$$\partial_t \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)} = \frac{\eta_u}{w_u} \left( \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)} + \delta_u \right) \left( \beta_{\mathrm{p}(u)}(t) - \sum_{\ell \in \mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)} c_\ell(t) \right) \tag{1.11}$$

Here, $\beta_{\mathrm{p}(u)}(t)$ is a Lagrangian multiplier that ensures conservation of conditional probability:

$$\sum_{v \in \chi(\mathrm{p}(u))} \partial_t \left( \frac{x_v(t)}{x_{\mathrm{p}(u)}(t)} \right) = 0.$$

One can see that the evolution is being driven by the expected instantaneous cost incurred
conditioned on the current state being in the subtree rooted at ๐‘ข.
    One should interpret equation (1.11) only when ๐‘ฅ(๐‘ก) lies in the relative interior of K๐‘‡ .
Otherwise, the conditional probabilities are ill-defined. One way to rectify this is to prevent
๐‘ฅ(๐‘ก) from hitting the relative boundary of K๐‘‡ at all. It is possible to adaptively modify the cost
functions by a suitably small perturbation so as to guarantee this property and, at the same
time, ensure that the total discrepancy between the modified and true service cost is a small
additive constant.
    Instead, we will follow a different approach, by extending the dynamics to an analogous
system of conditional probabilities $\{q_u(t) : u \in V \setminus \{\mathfrak{r}\}\}$:

$$\partial_t q_u(t) = \frac{\eta_u}{w_u} (q_u(t) + \delta_u) \left( \beta_{\mathrm{p}(u)}(t) - \hat{c}_u(t) + \alpha_u(t) \right), \tag{1.12}$$

where $q_u(t) = \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)}$ whenever $x_{\mathrm{p}(u)}(t) > 0$, $\alpha_u(t)$ is a Lagrangian multiplier for the constraint
$q_u(t) \geq 0$, and $\hat{c}_u(t)$ is the "derived" cost in the subtree rooted at $u$:

$$\hat{c}_u(t) := \sum_{\ell \in \mathcal{L}_u} q_{\ell|u}(t)\, c_\ell(t), \qquad q_{\ell|u}(t) := \prod_{v \in \gamma_{u,\ell} \setminus \{u\}} q_v(t),$$

where $\gamma_{u,\ell}$ is the unique simple $u$-$\ell$ path in $T$.
    Stated this way, the mirror descent algorithm can be envisioned as running a "weighted star"
algorithm on the conditional probabilities at every internal node of $T$, with the derived costs at
an internal node $u$ given by the average cost of the current strategy for playing one unit of mass
in the subtree rooted at $u$.
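Since $q_{\ell|u}$ is a product of conditional probabilities along the $u$-$\ell$ path, the derived costs satisfy the recursion $\hat{c}_u = \sum_{v \in \chi(u)} q_v \hat{c}_v$, so they can be computed bottom-up in one pass. A sketch of our own (the tree encoding is illustrative):

```python
# Illustrative sketch: derived costs c_hat_u computed bottom-up via the
# recursion c_hat_u = sum over children v of q[v] * c_hat_v, which unrolls to
# the path-product formula for q_{l|u}.

def derived_costs(children, q, leaf_cost, root):
    """Return c_hat for every node of the tree rooted at `root`."""
    c_hat = dict(leaf_cost)              # c_hat_l := c_l at the leaves
    def visit(u):
        if u not in children:            # u is a leaf
            return c_hat[u]
        c_hat[u] = sum(q[v] * visit(v) for v in children[u])
        return c_hat[u]
    visit(root)
    return c_hat
```

On a balanced binary-style example with all conditional probabilities $1/2$, the derived cost at each internal node is just the average of its children's derived costs.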


    In the next section, we will implement and analyze a discretization of (1.12) using Bregman
projections. Since our regularizer $\Phi$ and convex body $\mathrm{K}_T$ do not satisfy the assumptions
underlying the existence and uniqueness theorem of [14], we need to construct a solution to
(1.12); indeed, taking the discretization parameter in our algorithm to zero, one establishes
a solution of bounded variation; see Section 3.3.
    The major benefit of the formulations (1.11) and (1.12) is in motivating such an algorithm and
prescribing the derived costs. In Section 3, we describe how these dynamics can be predicted
from the definition (1.7).


2   The MTS algorithm
We will first establish some generic machinery which, at this point, is not specific to MTS.
Consider a convex polytope $\mathrm{K}_0 \subseteq \mathbb{R}^n$, define $\mathrm{K} := \mathrm{K}_0 \cap \mathbb{R}_+^n$, and assume that $\mathrm{K}$ is compact. Suppose
additionally that $\Phi : \mathcal{D} \to \mathbb{R}$ is differentiable and strictly convex in an open neighborhood
$\mathcal{D} \supseteq \mathrm{K}$.
   Let us write $\mathrm{D}_\Phi$ for the corresponding Bregman divergence

$$\mathrm{D}_\Phi(y \,\|\, x) := \Phi(y) - \Phi(x) - \langle \nabla\Phi(x), y - x \rangle,$$

which is non-negative due to convexity of $\Phi$. Then for $x, y, z \in \mathrm{K}$, we have:

$$\mathrm{D}_\Phi(z \,\|\, y) - \mathrm{D}_\Phi(z \,\|\, x) = -\Phi(y) + \Phi(x) - \langle \nabla\Phi(y), z - y \rangle + \langle \nabla\Phi(x), z - x \rangle. \tag{2.1}$$

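Identity (2.1) holds for any differentiable $\Phi$, since the $\Phi(z)$ terms cancel when the two divergences are expanded. A numeric check of our own, using the negative entropy $\Phi(x) = \sum_i x_i \log x_i$ on the positive orthant:

```python
# Numeric check (ours) of identity (2.1) for Phi(x) = sum_i x_i log x_i.
import math

def phi(x):
    return sum(v * math.log(v) for v in x)

def grad(x):
    return [math.log(v) + 1 for v in x]   # d/dx_i of x_i log x_i

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

def breg(y, x):
    """Bregman divergence D_Phi(y || x)."""
    return phi(y) - phi(x) - inner(grad(x), [yi - xi for yi, xi in zip(y, x)])

x, y, z = [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]
lhs = breg(z, y) - breg(z, x)
rhs = (-phi(y) + phi(x)
       - inner(grad(y), [zi - yi for zi, yi in zip(z, y)])
       + inner(grad(x), [zi - xi for zi, xi in zip(z, x)]))
assert abs(lhs - rhs) < 1e-12
```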
    For a vector $c \in \mathbb{R}^n$ and $x \in \mathrm{K}$, define the projection

$$\Pi^c_{\mathrm{K}}(x) := \operatorname{argmin} \left\{ \mathrm{D}_\Phi(y \,\|\, x) + \langle c, y \rangle : y \in \mathrm{K} \right\}.$$

Since $\mathrm{K}$ is compact and $\Phi$ is strictly convex, there is a unique minimizer $y^* \in \mathrm{K}$.
   For $x \in \mathrm{K}$, recall the definition of the normal cone at $x$:

$$\mathrm{N}_{\mathrm{K}}(x) = \{ p \in \mathbb{R}^n : \langle p, y - x \rangle \leq 0 \text{ for all } y \in \mathrm{K} \}.$$

Given a representation of $\mathrm{K}$ by inequality constraints, $\mathrm{K} = \{ x \in \mathbb{R}^n : Ax \leq b \}$ for $A \in \mathbb{R}^{m \times n}$ and
$b \in \mathbb{R}^m$, it holds that

$$\mathrm{N}_{\mathrm{K}}(x) = \{ A^T y : y \geq 0 \text{ and } y^T (Ax - b) = 0 \}.$$

The KKT conditions yield
$$\nabla\Phi(y^*) = \nabla\Phi(x) - c - \lambda^*, \tag{2.2}$$
where $\lambda^* \in \mathrm{N}_{\mathrm{K}}(y^*)$. Since $\mathrm{N}_{\mathrm{K}}(y^*) = \mathrm{N}_{\mathrm{K}_0}(y^*) + \mathrm{N}_{\mathbb{R}_+^n}(y^*)$, we can decompose $\lambda^* = \beta - \alpha$ with
$\beta \in \mathrm{N}_{\mathrm{K}_0}(y^*)$ and $-\alpha \in \mathrm{N}_{\mathbb{R}_+^n}(y^*)$. In particular, we have $\alpha \geq 0$ and $\alpha_i > 0 \implies y^*_i = 0$ for every
$i = 1, \ldots, n$.


      Substituting this into equation (2.1) gives

$$\mathrm{D}_\Phi(z \,\|\, y^*) - \mathrm{D}_\Phi(z \,\|\, x) = -\Phi(y^*) + \Phi(x) + \langle \nabla\Phi(x), y^* - x \rangle + \langle c - \alpha + \beta, z - y^* \rangle \leq -\mathrm{D}_\Phi(y^* \,\|\, x) + \langle c - \alpha, z - y^* \rangle,$$

where the inequality comes from $\langle \beta, z - y^* \rangle \leq 0$ since $z \in \mathrm{K}$ and $\beta \in \mathrm{N}_{\mathrm{K}}(y^*)$. We have proved the
following.
Lemma 2.1. For any $x, z \in \mathrm{K}$, and $c \in \mathbb{R}^n$, let $y^* = \Pi^c_{\mathrm{K}}(x)$ and $\lambda^*$ be as in (2.2). Then for any
$\alpha \in -\mathrm{N}_{\mathbb{R}_+^n}(y^*)$ such that $\lambda^* + \alpha \in \mathrm{N}_{\mathrm{K}_0}(y^*)$, it holds that

$$\mathrm{D}_\Phi(z \,\|\, y^*) - \mathrm{D}_\Phi(z \,\|\, x) \leq \langle c - \alpha, z - y^* \rangle.$$

2.1     Iterative Bregman projections
We now describe a discretization of the algorithm from the introduction. This discretization
will mimic the continuous dynamics if the entries of each individual cost vector are small. We
can achieve this by splitting each cost vector into several copies of scaled down versions of itself,
as discussed in Section 2.3. In Section 3.3, we will give a formal argument that this indeed yields
a discretization of the continuous dynamics from the introduction.
    Fix a tree $T$ and recall the definition of $\mathrm{K}_T$ from (1.3). Let $Q_T$ denote the collection of vectors
$q \in \mathbb{R}_+^{V \setminus \{\mathfrak{r}\}}$ such that for all $u \in V \setminus \mathcal{L}$,

$$\sum_{v \in \chi(u)} q_v = 1.$$

For $q \in Q_T$ and $u \in V \setminus \mathcal{L}$, we use $q^{(u)} \in \mathbb{R}_+^{\chi(u)}$ to denote the vector defined by $q^{(u)}_v := q_v$ for
$v \in \chi(u)$, and define the corresponding probability simplex $Q^{(u)}_T := \{ q^{(u)} : q \in Q_T \}$. We will use
$\Delta : Q_T \to \mathrm{K}_T$ for the map which sends $q \in Q_T$ to the (unique) $x = \Delta(q) \in \mathrm{K}_T$ such that

$$x_v = x_u q_v \qquad \forall u \in V \setminus \mathcal{L},\ v \in \chi(u).$$

Note that $q$ contains more information than $x$; the map $\Delta$ fails to be invertible whenever there is
some $u \in V \setminus \mathcal{L}$ with $x_u = 0$.
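The map $\Delta$ is a simple top-down product of conditional probabilities; a sketch of our own (the name `delta_map` and the tree encoding are illustrative):

```python
# Illustrative sketch: the map Delta : Q_T -> K_T sending conditional
# probabilities q to absolute probabilities x via x_v = x_u * q[v], top-down.

def delta_map(children, q, root):
    """Return x = Delta(q) for the tree rooted at `root`."""
    x = {root: 1.0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):   # leaves have no entry in `children`
            x[v] = x[u] * q[v]
            stack.append(v)
    return x
```

With all conditional probabilities equal to $1/2$ on a two-level tree, the deepest leaves receive absolute mass $1/4$, and one checks directly that the flow-conservation constraints of (1.3) hold.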
    Fix κ ≥ 1. On the open domain 𝒟^(u) = (−min_{v∈χ(u)} δ_v, ∞)^{χ(u)}, for δ_v as given in
equation (1.10), define the strictly convex function Φ^(u) : 𝒟^(u) → ℝ by

    Φ^(u)(p) := (1/κ) Σ_{v∈χ(u)} (w_v/η_v) (p_v + δ_v) log(p_v + δ_v).

Denote the corresponding Bregman divergence on Q^(u)_T by

    D^(u)(p ‖ p′) = (1/κ) Σ_{v∈χ(u)} (w_v/η_v) [ (p_v + δ_v) log( (p_v + δ_v)/(p′_v + δ_v) ) + p′_v − p_v ].


Theory of Computing, Volume 18 (23), 2022, pp. 1–24
Christian Coester and James R. Lee

    We now define an algorithm that takes a point q ∈ Q_T and a cost vector c ∈ ℝ_+^ℒ and
outputs a point p = 𝒜(q, c) ∈ Q_T. Fix a topological ordering ⟨u₁, u₂, …, u_N⟩ of V ∖ ℒ such
that every child in T occurs before its parent. We define p inductively as follows. Let
ĉ_ℓ := c_ℓ for ℓ ∈ ℒ. For every j = 1, 2, …, N:

    ĉ^(u_j)_v := ĉ_v   ∀v ∈ χ(u_j)                                                       (2.3)
    p^(u_j) := argmin { D^(u_j)( p ‖ q^(u_j) ) + ⟨p, ĉ^(u_j)⟩ : p ∈ Q^(u_j)_T }          (2.4)
    ĉ_{u_j} := Σ_{v∈χ(u_j)} p^(u_j)_v ĉ_v                                                (2.5)

Let α^(u_j) be the vector of Lagrange multipliers corresponding to the nonnegativity
constraints in equation (2.4) (recall Lemma 2.1). Note that in this setting (a probability
simplex), the nonnegativity multipliers are unique and thus well-defined.
    We denote by α = α^{q,c} ∈ ℝ_+^V the vector given by α_v := α^(p(v))_v for v ≠ 𝕣 and
α_𝕣 := 0. Recall the complementary slackness conditions:

    α_v > 0 ⟹ p_v = 0.                                                                   (2.6)

For v ∈ χ(u), we calculate

    (∇Φ^(u)(p))_v = (1/κ)(w_v/η_v) (1 + log(p_v + δ_v)).

Then using equation (2.2), we can write the algorithm as follows:

    For j = 1, 2, …, N:
        For v ∈ χ(u_j):
            p^(u_j)_v := (q^(u_j)_v + δ_v) exp( (κη_v/w_v) (β_{u_j} − (ĉ_v − α_v)) ) − δ_v,
        ĉ_{u_j} := Σ_{v∈χ(u_j)} p^(u_j)_v ĉ_v,

where β_{u_j} ≥ 0 is the multiplier for the constraint Σ_{v∈χ(u_j)} p_v ≥ 1. There is no
multiplier for the constraint Σ_{v∈χ(u_j)} p_v ≤ 1 because this constraint is satisfied
automatically and is therefore not needed in (2.4): if it were violated, decreasing some p_v
with p_v > q^(u_j)_v would yield a strictly better solution to the minimization problem (2.4).
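Concretely, each local step can be computed by a one-dimensional search for the multiplier: the candidate p_v(β) = (q_v + δ_v) exp((κη_v/w_v)(β − ĉ_v)) − δ_v, clipped at 0 (the clipping playing the role of α_v), is nondecreasing in β, so β is found by bisection on the constraint Σ_v p_v = 1. The following Python sketch is our own illustration of this scheme, not code from the paper.

```python
import math

def project_step(q, chat, w, eta, dlt, kappa=1.0, iters=200):
    """One sibling-group projection: find beta >= 0 by bisection so that
    the clipped multiplicative update sums to 1.  Clipping at 0 realizes
    the nonnegativity multipliers alpha_v."""
    def p_of(beta):
        return [max((qv + dv) * math.exp(kappa * ev / wv * (beta - cv)) - dv, 0.0)
                for qv, cv, wv, ev, dv in zip(q, chat, w, eta, dlt)]
    lo, hi = 0.0, max(chat) + 1.0
    while sum(p_of(hi)) < 1.0:          # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):              # sum p_v(beta) is nondecreasing in beta
        mid = (lo + hi) / 2.0
        if sum(p_of(mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return p_of(hi)

q    = [0.5, 0.3, 0.2]
chat = [0.0, 0.0, 1.0]                  # cost only below the third child
w, eta, dlt = [1.0] * 3, [1.0] * 3, [0.1] * 3
p = project_step(q, chat, w, eta, dlt)
# Mass moves away from the costly child: p[2] < q[2], while sum(p) = 1.
```

Since ĉ ≥ 0 implies p_v(0) ≤ q_v, the bracket may always start at β = 0, matching the remark above that the upper constraint is never active.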

2.2   The global divergence

For z ∈ K_T and q ∈ Q_T, define the global divergence function

    D̃(z ‖ q) := (1/κ) Σ_{u∉ℒ} Σ_{v∈χ(u)} (w_v/η_v) [ (z_v + δ_v z_u) log( (z_v/z_u + δ_v)/(q_v + δ_v) ) + z_u q_v − z_v ],



where we use the convention that 0 log(0/0 + δ_v) = lim_{ε→0} ε log(0/ε + δ_v) = 0.
    Note that D̃ is the Bregman divergence associated to the regularizer (1.7) (divided by κ)
when x_v/x_u is replaced by q_v. One can write

    D̃(z ‖ q) = Σ_{u∉ℒ} z_u D^(u)( p^(u) ‖ q^(u) ),

where p ∈ Δ⁻¹(z). In other words, p ∈ Q_T is any point satisfying z_v = p_v z_u for all u ∉ ℒ
and v ∈ χ(u).
    We will use D̃ as a potential function to prove inequality (1.1), and z will denote the
configuration of some offline algorithm. Note that the state of the online algorithm is
encoded by q ∈ Q_T, which contains more information than its configuration Δ(q) ∈ K_T.
    The next lemma shows that when an offline algorithm moves, the change in potential is
bounded by O(1/κ) times the offline movement cost.
Lemma 2.2. For any q ∈ Q_T and z, z′ ∈ K_T, it holds that

    D̃(z′ ‖ q) − D̃(z ‖ q) ≤ (1/κ)(2 + 4/τ) ‖z − z′‖_{ℓ₁(w)}.

Proof. Consider a differentiable map z : [0, 1] → ℝ_{++}^V such that Σ_{v∈χ(u)} z_v(t) ≤ z_u(t)
for each t and u ∉ ℒ. It suffices to show that for each t and every fixed q ∈ Q_T,

    κ ∂_t D̃(z(t) ‖ q) ≤ (2 + 4/τ) ‖z′(t)‖_{ℓ₁(w)}.

Moreover, it suffices to address the case where there is at most one u ∈ V with z′_u(t) ≠ 0.
    A direct calculation gives

    κ ∂_t D̃(z(t) ‖ q) = (w_u/η_u) z′_u(t) log( (z_u(t)/z_{p(u)}(t) + δ_u)/(q_u + δ_u) )
        + Σ_{v∈χ(u)} (w_v/η_v) [ δ_v z′_u(t) log( (z_v(t)/z_u(t) + δ_v)/(q_v + δ_v) ) + z′_u(t) ( q_v − z_v(t)/z_u(t) ) ].   (2.7)

Let us now use definitions (1.9) and (1.10) to observe that

    (1/η_v) log( (p_v + δ_v)/(q_v + δ_v) ) ≤ (1/η_v) log( (1 + δ_v)/δ_v ) ≤ 2.

Using this in equation (2.7) yields

    κ ∂_t D̃(z(t) ‖ q) ≤ w_u |z′_u(t)| ( 2 + Σ_{v∈χ(u)} ( 2δ_v + q_v − z_v(t)/z_u(t) ) ) ≤ w_u |z′_u(t)| ( 2 + 4/τ ),

where the last inequality uses Σ_{v∈χ(u)} δ_v ≤ Σ_{v∈χ(u)} θ_v ≤ 1 and Σ_{v∈χ(u)} z_v(t) ≤ z_u(t).  □


    We will sometimes implicitly restrict vectors x ∈ ℝ^V to the subspace spanned by
{e_ℓ : ℓ ∈ ℒ}. In this case, we employ the notation

    ⟨x, y⟩_ℒ := Σ_{ℓ∈ℒ} x_ℓ y_ℓ,

when either vector lies in ℝ^V or ℝ^ℒ.
    According to the following lemma, the change in potential due to movement of the online
algorithm is bounded by the difference in service cost between the offline and online
algorithms.

Lemma 2.3. For any cost vector c ∈ ℝ_+^ℒ, z ∈ K_T, and q ∈ Q_T, if p = 𝒜(q, c), then

    D̃(z ‖ p) − D̃(z ‖ q) ≤ ⟨c, z − Δ(p)⟩_ℒ.

Proof. Fix q ∈ Q_T and c ∈ ℝ_+^ℒ. Let α = α^{q,c} denote the vector of multipliers defined in
Section 2.1. For u ∈ V ∖ ℒ with z_u > 0, define z^(u) ∈ Q^(u)_T by

    z^(u)_v := z_v / z_u.

Then Lemma 2.1 gives

    D^(u)( z^(u) ‖ p^(u) ) − D^(u)( z^(u) ‖ q^(u) ) ≤ ⟨ ĉ^(u) − α^(u), z^(u) − p^(u) ⟩_{χ(u)},

where we use ⟨·, ·⟩_{χ(u)} for the standard inner product on ℝ^{χ(u)}. Multiplying by z_u and
summing yields

    D̃(z ‖ p) − D̃(z ‖ q) ≤ Σ_{u∉ℒ} z_u ⟨ ĉ^(u) − α^(u), z^(u) − p^(u) ⟩_{χ(u)}
                         = Σ_{u∉ℒ} Σ_{v∈χ(u)} (ĉ_v − α_v) z_v − Σ_{u∉ℒ} z_u Σ_{v∈χ(u)} (ĉ_v − α_v) p^(u)_v.

Note that from implication (2.6), the latter (subtracted) expression equals

    Σ_{u∉ℒ} z_u Σ_{v∈χ(u)} ĉ_v p^(u)_v  =  Σ_{u∉ℒ} z_u ĉ_u,

where the equality is (2.5). Noting that ĉ_𝕣 = Σ_{ℓ∈ℒ} Δ(p)_ℓ c_ℓ, this gives

    D̃(z ‖ p) − D̃(z ‖ q) ≤ Σ_{u≠𝕣} (ĉ_u − α_u) z_u − Σ_{u∉ℒ} z_u ĉ_u ≤ ⟨c, z − Δ(p)⟩_ℒ.  □



2.3   Algorithm and competitive analysis

Let us now outline the proof of inequality (1.2). First, we perform a standard reduction that
allows us to bound only the "positive" movement costs when the algorithm moves from x to y.
Its proof is straightforward.

Lemma 2.4. For x, y ∈ K_T it holds that

    ‖x − y‖_{ℓ₁(w)} = 2 ‖(x − y)_+‖_{ℓ₁(w)} + [ψ(y) − ψ(x)],

where ψ(x) := Σ_{u≠𝕣} w_u x_u for x ∈ K_T.
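Lemma 2.4 reduces to the coordinate-wise identity |d| = 2·max(d, 0) − d applied with d = x_u − y_u and weight w_u, summed over u ≠ 𝕣. A quick numerical check in Python (our own sketch, with arbitrary sample vectors):

```python
def l1w(d, w):
    """Weighted l1 norm."""
    return sum(wi * abs(di) for di, wi in zip(d, w))

def posw(d, w):
    """Weighted l1 norm of the positive part."""
    return sum(wi * max(di, 0.0) for di, wi in zip(d, w))

x = [0.7, 0.3, 0.2, 0.5]            # coordinates x_u over u != root
y = [0.4, 0.6, 0.2, 0.5]
w = [2.0, 2.0, 1.0, 1.0]
psi = lambda v: sum(wi * vi for vi, wi in zip(v, w))   # psi(v) = sum_u w_u v_u

d = [a - b for a, b in zip(x, y)]
lhs = l1w(d, w)
rhs = 2 * posw(d, w) + (psi(y) - psi(x))
# lhs == rhs, verifying the decomposition of Lemma 2.4 coordinate-wise.
```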

    We now state the key technical lemma, which controls the positive movement cost by the
service cost. To this end, we employ an auxiliary potential function Ψ : Q_T → ℝ defined by

    Ψ_u(q) := −Δ(q)_u D^(u)( θ^(u) ‖ q^(u) ),
    Ψ(q) := Σ_{u∉ℒ} Ψ_u(q).

Intuitively, Ψ(q) is a measure of the difference between the online configuration q and the
uniform distribution over leaves (whose conditional probabilities are given by θ).
    Let us give a brief explanation of the need for Ψ. We add "noise" to the multiscale
conditional entropy in order to achieve the smoothness property established in Lemma 2.2. But
this has the adverse effect of increasing the movement cost of the algorithm, as one can see
from the δ_u term in (1.11). This additional movement cannot easily be charged against the
service cost in the regime where the noise term is dominant: x_u(t)/x_{p(u)}(t) ≪ δ_u. On the
other hand, this additional movement has the effect of further decreasing x_u(t)/x_{p(u)}(t),
which drives the conditional probabilities at p(u) away from the uniform distribution,
decreasing Ψ. A formal statement appears later in Lemma 2.11.
    For the next two results, take any q ∈ Q_T and cost c ∈ ℝ_+^ℒ, and denote p = 𝒜(q, c),
x = Δ(q), y = Δ(p).

Lemma 2.5 (Movement analysis). It holds that

    ((τ − 3)/(κτ)) ‖(x − y)_+‖_{ℓ₁(w)} ≤ (2𝔇_T + log n) ⟨c, x⟩_ℒ + [Ψ(q) − Ψ(p)].


    This lemma will be proved in Section 2.4. Let us first see that it can be used to establish
bounds on the competitive ratio. Define w_min := min{w_ℓ : ℓ ∈ ℒ} and

    ε_T := ( w_min / (2(2𝔇_T + log n)) ) · ( (τ − 3)/(τκ) ).




Theorem 2.6. For any z ∈ K_T:

    ⟨c, y⟩_ℒ ≤ ⟨c, z⟩_ℒ + D̃(z ‖ q) − D̃(z ‖ p)                                            (2.8)
    κ⁻¹ ‖x − y‖_{ℓ₁(w)} ≤ [ψ(y) − ψ(x)] + (2τ/(τ − 3)) [ [Ψ(q) − Ψ(p)] + (2𝔇_T + log n) ⟨c, x⟩_ℒ ]    (2.9)

Moreover, if ‖c‖_∞ ≤ ε_T, then

    κ⁻¹ ‖x − y‖_{ℓ₁(w)} ≤ [ψ(y) − ψ(x)] + (4τ/(τ − 3)) [ [Ψ(q) − Ψ(p)] + (2𝔇_T + log n) ⟨c, y⟩_ℒ ].   (2.10)

Proof. Inequality (2.8) follows from Lemma 2.3, and inequality (2.9) follows from Lemma 2.5
and Lemma 2.4. To see that inequality (2.10) follows from inequality (2.9) and Lemma 2.5, use
the fact that

    ⟨c, x⟩_ℒ ≤ ⟨c, y⟩_ℒ + (‖c‖_∞/w_min) ‖(x − y)_+‖_{ℓ₁(w)}.  □



    In light of Theorem 2.6, we can respond to a cost function c ∈ ℝ_+^ℒ by splitting it into
M equal pieces c/M, where M = ⌈‖c‖_∞/ε_T⌉. Now define q_i := 𝒜(q_{i−1}, c/M), q_0 := q, and
𝒜̄(q, c) := q_M.

Theorem 2.7. Fix τ ≥ 4. Consider the algorithm that begins in some configuration q_0 ∈ Q_T. If
c_t ∈ ℝ_+^ℒ is the cost function that arrives at time t, denote q_t := 𝒜̄(q_{t−1}, c_t). Then the
sequence ⟨Δ(q_0), Δ(q_1), …⟩ is an online algorithm that is (1, O(1/κ), O(κ(𝔇_T + log n)))-finely
competitive.

    We prove this momentarily. The following fact is well known and, in conjunction with the
preceding theorem, yields the validity of Theorems 1.1 and 1.3.

Lemma 2.8 (See, e.g., [3, Thm. 2.4]). If (ℒ, d_T) is an HST metric, then there is another weighted
tree T′ with leaf set ℒ such that

   1. (ℒ, d_{T′}) is a 7-HST metric.

   2. 𝔇_{T′} ≤ log₂ |ℒ|.

   3. All the leaves of T′ have depth 𝔇_{T′}.

   4. d_T(ℓ, ℓ′) ≤ d_{T′}(ℓ, ℓ′) ≤ O(d_T(ℓ, ℓ′)) for all ℓ, ℓ′ ∈ ℒ.

Proof sketch. Replace every weight w_v in T with ŵ_v := 7^⌈log₇ w_v⌉ and iteratively contract
every edge (p(u), u) with ŵ_{p(u)} = ŵ_u and u ∉ ℒ. The resulting weighted tree T₁ is a 7-HST
by construction.
    Now iteratively contract every edge (p(u), u) in T₁ for which |ℒ^{T₁}_u| > ½ |ℒ^{T₁}_{p(u)}|.
The resulting tree T′ has depth 𝔇_{T′} ≤ log₂ |ℒ|. Finally, one can achieve property (3) by
increasing the depth of every root-leaf path to 𝔇_{T′} using vertex weights that decrease by a
factor of 7 along the path.  □
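The first step of the sketch, rounding each weight up to the nearest power of 7, can be written as follows (our own illustration; the two contraction passes are omitted):

```python
def round_weight(w):
    """Smallest power of 7 that is >= w, i.e. 7^ceil(log_7 w) for w >= 1.
    Integer arithmetic avoids floating-point trouble at exact powers of 7."""
    p = 1
    while p < w:
        p *= 7
    return p

# round_weight(7) -> 7, round_weight(10) -> 49, round_weight(50) -> 343
```

The rounded weights satisfy w ≤ ŵ < 7w, which is why distances distort by only a constant factor, consistent with property (4).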


Proof of Theorem 2.7. Consider a sequence ⟨c_t : t ≥ 1⟩ of cost functions. By splitting the
costs into smaller pieces, we may assume that ‖c_t‖_∞ ≤ ε_T for all t ≥ 1.
    Let {z*_t} denote some offline algorithm with z*_0 = Δ(q_0), and let {x_t = Δ(q_t)} denote
our online algorithm. Then using D̃(z*_0 ‖ x_0) = 0 along with inequality (2.8) and Lemma 2.2
yields, for any time t₁ ≥ 1,

    Σ_{t=1}^{t₁} ⟨c_t, x_t⟩_ℒ ≤ Σ_{t=1}^{t₁} ⟨c_t, z*_t⟩_ℒ − D̃(z*_{t₁} ‖ q_{t₁}) + O(1/κ) Σ_{t=1}^{t₁} ‖z*_t − z*_{t−1}‖_{ℓ₁(w)}
                               ≤ Σ_{t=1}^{t₁} ⟨c_t, z*_t⟩_ℒ + O(1/κ) Σ_{t=1}^{t₁} ‖z*_t − z*_{t−1}‖_{ℓ₁(w)},

where we have used D̃(z ‖ q) ≥ 0 for all z ∈ K_T and q ∈ Q_T. This verifies inequality (1.1)
with α_0 = 1, α_1 = O(1/κ), and β = 0. Moreover, inequality (2.10) gives

    (1/κ) Σ_{t=1}^{t₁} ‖x_t − x_{t−1}‖_{ℓ₁(w)} ≤ [ψ(x_{t₁}) − ψ(x_0)] + (4τ/(τ − 3)) [ [Ψ(q_0) − Ψ(q_{t₁})] + (2𝔇_T + log n) Σ_{t=1}^{t₁} ⟨c_t, x_t⟩_ℒ ],

verifying inequality (1.2) with α_1 ≤ O(κ(𝔇_T + log n)) and β′ ≤ O(κ max_{v≠𝕣} w_v) (see
Lemma 2.10 below).  □

2.4   Movement analysis

It remains to prove Lemma 2.5. Recall that q ∈ Q_T, c ∈ ℝ_+^ℒ, and p = 𝒜(q, c), x = Δ(q),
y = Δ(p).
    The KKT conditions (see equation (2.2)) give: for every v ∈ χ(u),

    (1/κ)(w_v/η_v) log( (p_v + δ_v)/(q_v + δ_v) ) = β_u − ĉ_v + α_v,                      (2.11)

where β_u ≥ 0 is the multiplier corresponding to the constraint Σ_{v∈χ(u)} p_v ≥ 1.

Lemma 2.9. It holds that α_v ≤ ĉ_v for all v ∈ V ∖ {𝕣}.

Proof. Note that ĉ_v ≥ 0 by construction. Thus if α_v = 0, we are done. Otherwise, by
complementary slackness, it must be that p_v = 0, and therefore log( (p_v + δ_v)/(q_v + δ_v) ) ≤ 0.
Since β_{p(v)} ≥ 0, equation (2.11) implies that α_v ≤ ĉ_v.  □

    Define σ_v := log( (p_v + δ_v)/(q_v + δ_v) ), so that

    q_v − p_v = (q_v + δ_v)(1 − e^{σ_v}).                                                 (2.12)

Recall that for v ∈ χ(u), we have x_v = q_v x_u and y_v = p_v y_u, thus

    x_v − y_v = x_u (q_v − p_v) + p_v (x_u − y_u) = (x_v + δ_v x_u)(1 − e^{σ_v}) + p_v (x_u − y_u).

In particular,

    w_v (x_v − y_v)_+ ≤ w_v (x_v + δ_v x_u)(1 − e^{σ_v})_+ + w_v p_v (x_u − y_u)_+
                      ≤ w_v (x_v + δ_v x_u)(1 − e^{σ_v})_+ + (w_u/τ) p_v (x_u − y_u)_+.

Using Σ_{v∈χ(u)} p_v = 1 and summing over all vertices yields

    Σ_{v≠𝕣} w_v (x_v − y_v)_+ ≤ Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)})(1 − e^{σ_v})_+ + (1/τ) Σ_{v≠𝕣} w_v (x_v − y_v)_+,

hence

    Σ_{v≠𝕣} w_v (x_v − y_v)_+ ≤ (τ/(τ − 1)) Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)})(1 − e^{σ_v})_+
                              ≤ (τ/(τ − 1)) Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)}) (σ_v)_−
                              ≤ (κτ/(τ − 1)) [ Σ_{v≠𝕣} η_v x_v ĉ_v + Σ_{u∉ℒ} x_u Σ_{v∈χ(u)} θ_v (ĉ_v − α_v) ],   (2.13)

where the last line uses Lemma 2.9 and equation (2.11) to bound w_v (σ_v)_− ≤ κη_v (ĉ_v − α_v).
Note that
\[
\sum_{v\neq\mathbb{r}} \eta_v x_v \hat{c}_v \;\le\; \sum_{\ell\in\mathcal{L}} c_\ell x_\ell \sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \eta_v \;\le\; (\mathfrak{D}_T + \log n)\,\langle c, x\rangle, \tag{2.14}
\]
since for any $\ell\in\mathcal{L}$ it holds that
\[
\sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \eta_v \;=\; \mathfrak{D}_T(\ell) + \sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \log\frac{|\mathcal{L}_{p(v)}|}{|\mathcal{L}_v|} \;=\; \mathfrak{D}_T(\ell) + \log n,
\]
where $\mathfrak{D}_T(\ell)$ is the combinatorial depth of $\ell$.
The second sum in (2.13) can be interpreted as the service cost of hybrid configurations of $q$ and $\theta$: while $\sum_{v\in\chi(u)} x_v \hat{c}_v$ is the service cost of $x$ in $\mathcal{L}_u$, the term $x_u \sum_{v\in\chi(u)} \theta_v \hat{c}_v$ is the service cost in $\mathcal{L}_u$ of the modification of $x$ whose conditional probabilities at the children of $u$ are given by $\theta^{(u)}$ rather than $q^{(u)}$. To bound this hybrid service cost, we will employ the auxiliary potential $\Psi$.

2.4.1   The hybrid cost
We require the following elementary estimate.
Lemma 2.10. For $u\notin\mathcal{L}$ it holds that
\[
\max\left\{ \mathrm{D}^{(u)}(r \,\|\, p) : r, p \in Q_T^{(u)} \right\} \;\le\; \frac{2}{\kappa}\,\frac{w_u}{\tau}.
\]
Proof. Define $\phi_v : (-\delta_v, \infty) \to \mathbb{R}$ by
\[
\phi_v(p) := \frac{1}{\eta_v}\,(p + \delta_v)\log(p + \delta_v),
\]
and let
\[
\mathrm{D}_{\phi_v}(q_v \,\|\, p_v) = \frac{1}{\eta_v}\left[ (q_v + \delta_v)\log\frac{q_v + \delta_v}{p_v + \delta_v} + (p_v - q_v) \right]
\]
denote the corresponding Bregman divergence. Then for $q_v, p_v \ge 0$, it holds that $\mathrm{D}_{\phi_v}(q_v \,\|\, p_v) \ge 0$, since $\phi_v$ is convex on $\mathbb{R}_+$. Employing the $\tau$-HST property of $T$, this implies that
\[
\mathrm{D}^{(u)}(r \,\|\, p) = \frac{1}{\kappa}\sum_{v\in\chi(u)} w_v\, \mathrm{D}_{\phi_v}(r_v \,\|\, p_v) \;\le\; \frac{w_u}{\kappa\tau}\sum_{v\in\chi(u)} \mathrm{D}_{\phi_v}(r_v \,\|\, p_v).
\]


Define $F : Q_T^{(u)} \times Q_T^{(u)} \to \mathbb{R}_+$ by $F(r,p) := \sum_{v\in\chi(u)} \mathrm{D}_{\phi_v}(r_v \,\|\, p_v)$. The map $r \mapsto F(r,p)$ is convex in general (for any Bregman divergence). The map $p \mapsto F(r,p)$ is convex as well, as this holds for each map $p_v \mapsto \mathrm{D}_{\phi_v}(q_v \,\|\, p_v)$ since $-\log(x)$ is convex on $\mathbb{R}_{++}$. Since the maximum of a convex function on a polytope is achieved at an extreme point, we have
\[
\max\left\{ F(r,p) : r,p \in Q_T^{(u)} \right\} \;\le\; \max_{\substack{v,v'\in\chi(u)\\ v\neq v'}} \frac{1}{\eta_v}\left[ (1+\delta_v)\log\frac{1+\delta_v}{\delta_v} - 1 \right] + \frac{1}{\eta_{v'}}\left[ \delta_{v'}\log\frac{\delta_{v'}}{1+\delta_{v'}} + 1 \right] \;\le\; 2. \qquad\square
\]
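The closed form of $\mathrm{D}_{\phi_v}$ used in the proof can be sanity-checked against the definition of a Bregman divergence, $\phi(q) - \phi(p) - \phi'(p)(q-p)$. A minimal numerical sketch; the values of $\eta_v$ and $\delta_v$ below are arbitrary samples, not parameters from the paper.

```python
import math

def phi(p, eta, delta):
    # phi_v(p) = (1/eta) * (p + delta) * log(p + delta)
    return (p + delta) * math.log(p + delta) / eta

def dphi(p, eta, delta):
    # phi_v'(p) = (1/eta) * (log(p + delta) + 1)
    return (math.log(p + delta) + 1.0) / eta

def bregman(q, p, eta, delta):
    # Bregman divergence from the definition: phi(q) - phi(p) - phi'(p)(q - p).
    return phi(q, eta, delta) - phi(p, eta, delta) - dphi(p, eta, delta) * (q - p)

def bregman_closed_form(q, p, eta, delta):
    # Closed form appearing in the proof of Lemma 2.10.
    return ((q + delta) * math.log((q + delta) / (p + delta)) + (p - q)) / eta

eta, delta = 1.7, 0.25  # arbitrary sample parameters
grid = [i / 10 for i in range(11)]
for q in grid:
    for p in grid:
        a = bregman(q, p, eta, delta)
        b = bregman_closed_form(q, p, eta, delta)
        assert abs(a - b) < 1e-12   # the two expressions agree
        assert b >= -1e-12          # nonnegativity, since phi_v is convex
```

Nonnegativity on the grid reflects exactly the convexity argument in the proof; the agreement of the two expressions confirms the displayed closed form.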

The next lemma is crucial: it relates the service cost (with respect to the reduced cost $\hat{c} - \alpha$) of the hybrid configurations to the service cost of the actual configuration and the movement cost.

Lemma 2.11. For any $u\notin\mathcal{L}$, it holds that
\[
\Psi_u(p) - \Psi_u(q) \;\le\; \frac{2}{\kappa}\,\frac{w_u}{\tau}\,(x_u - y_u)_+ \;+\; \sum_{v\in\chi(u)} (\hat{c}_v - \alpha_v)\left[ x_v - \theta_v x_u \right]. \tag{2.15}
\]


Proof. Write
\[
\begin{aligned}
\Psi_u(p) - \Psi_u(q) &= x_u\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - y_u\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \\
&= (x_u - y_u)\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) + x_u\left[ \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \right].
\end{aligned}
\]
Using Lemma 2.10, the first term is bounded by $\frac{2}{\kappa}\,\frac{w_u}{\tau}\,(x_u - y_u)_+$.

Let us now bound the second term. Using $1 + t \le e^t$, we have
\[
\begin{aligned}
\kappa\, x_u\left[ \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \right]
&= x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\left[ (\theta_v + \delta_v)\log\frac{p_v + \delta_v}{q_v + \delta_v} + q_v - p_v \right] \\
&\stackrel{(2.12)}{=} x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\left[ (\theta_v + \delta_v)\sigma_v + (q_v + \delta_v)(1 - e^{\sigma_v}) \right] \\
&\le x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v(\theta_v - q_v) \\
&= \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v\left[ \theta_v x_u - x_v \right].
\end{aligned}
\]
To finish the proof, observe that from equation (2.11),
\[
\sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v\left[ \theta_v x_u - x_v \right]
= \kappa \sum_{v\in\chi(u)} (\beta_u - \hat{c}_v + \alpha_v)\left[ \theta_v x_u - x_v \right]
= \kappa \sum_{v\in\chi(u)} (\alpha_v - \hat{c}_v)\left[ \theta_v x_u - x_v \right],
\]
where the last equality uses $\sum_{v\in\chi(u)} x_v = x_u$ and $\sum_{v\in\chi(u)} \theta_v = 1$ (from (1.8)). $\square$
Using the lemma gives
\[
\begin{aligned}
\sum_{u\notin\mathcal{L}} x_u \sum_{v\in\chi(u)} \theta_v(\hat{c}_v - \alpha_v)
&\stackrel{(2.15)}{\le} \left[ \Psi(q) - \Psi(p) \right] + \frac{2}{\kappa\tau}\left\| (\Delta(q) - \Delta(p))_+ \right\|_{\ell_1(w)} + \sum_{v\neq\mathbb{r}} \hat{c}_v x_v \\
&\le \left[ \Psi(q) - \Psi(p) \right] + \frac{2}{\kappa\tau}\left\| (\Delta(q) - \Delta(p))_+ \right\|_{\ell_1(w)} + \mathfrak{D}_T\, \langle c, x\rangle_{\mathcal{L}}.
\end{aligned}
\]


Combining this inequality with inequalities (2.13) and (2.14) gives
\[
\kappa^{-1}\left\| (x-y)_+ \right\|_{\ell_1(w)} \;\le\; \frac{\tau}{\tau-1}\left[ \left( 2\mathfrak{D}_T + \log n \right) \langle c, x\rangle_{\mathcal{L}} + \left( \Psi(q) - \Psi(p) \right) + \frac{2}{\kappa\tau}\left\| (x-y)_+ \right\|_{\ell_1(w)} \right], \tag{2.16}
\]
completing the verification of Lemma 2.5.


3     Derivation of the dynamics and derived costs
For the sake of motivating the dynamics (1.11), we review the continuous-time mirror descent
framework of [14]. Suppose that $K \subseteq \mathbb{R}^N$ is a convex set. Recall the definition of the normal cone to $K$ at $x \in K$, which is given by
\[
N_K(x) := (K - x)^{\circ} = \left\{ p \in \mathbb{R}^N : \langle p, y - x\rangle \le 0 \text{ for all } y \in K \right\}.
\]

Suppose additionally that $\Phi : \mathcal{D} \to \mathbb{R}$ is $\mathcal{C}^2$ and strictly convex on an open neighborhood $\mathcal{D} \supseteq K$, so that the Hessian $\nabla^2\Phi(x)$ is well-defined and positive definite on $\mathcal{D}$. Given a control

function ๐น : [0, โˆž) ร— K โ†’ โ„ ๐‘ and an initial point ๐‘ฅ 0 โˆˆ K, we will be concerned with absolutely
continuous solutions ๐‘ฅ : [0, โˆž) โ†’ K to the differential inclusion

                                                    ๐‘ฅ(0) = ๐‘ฅ 0 ,
                                      โˆ‡ ฮฆ(๐‘ฅ(๐‘ก))๐‘ฅ 0(๐‘ก) โˆˆ ๐น(๐‘ก, ๐‘ฅ(๐‘ก)) โˆ’ ๐‘K (๐‘ฅ(๐‘ก)) .
                                          2


In other words, a trajectory that satisfies $x(0) = x_0$ and, for almost every $t \ge 0$,
\[
x'(t) = \nabla^2\Phi(x(t))^{-1}\left( F(t, x(t)) - \gamma(t) \right), \tag{3.1}
\]
with $\gamma(t) \in N_K(x(t))$.
Under suitably strong conditions on $\Phi$ and $F$, there is a unique absolutely continuous solution to equation (3.1) [14]. In our setup, these conditions are actually not satisfied unless we prevent the path $x$ from hitting the relative boundary of $K$. Nevertheless, the formal calculation is instructive and motivates the algorithm of Section 2. For simplicity, we assume $\kappa := 1$ in this section.
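Before specializing to the multiscale regularizer, the dynamics (3.1) can be made concrete in the simplest instance: $\Phi(x) = \sum_i x_i \log x_i$ on the probability simplex, where the Hessian is $\mathrm{diag}(1/x_i)$ and, in the interior, the normal-cone term reduces to the multiplier of the constraint $\sum_i x_i = 1$. With control $F = -c$ this gives $x_i' = x_i(\langle x, c\rangle - c_i)$. The Euler discretization below is a sketch of that single-level case only, with arbitrary sample costs; it is not the algorithm of this paper.

```python
# Euler discretization of x_i' = x_i * (<x, c> - c_i), the formal dynamics (3.1)
# for the plain negative-entropy regularizer on the simplex.

def euler_step(x, c, h):
    inner = sum(xi * ci for xi, ci in zip(x, c))
    # The term (inner - c_i) contains the Lagrange multiplier of sum(x) = 1,
    # chosen exactly so that sum(x') = 0.
    return [xi * (1 + h * (inner - ci)) for xi, ci in zip(x, c)]

x = [0.25, 0.25, 0.5]
c = [1.0, 0.2, 0.6]       # per-coordinate service costs (illustrative values)
for _ in range(1000):
    x = euler_step(x, c, h=0.01)

assert abs(sum(x) - 1.0) < 1e-9   # the multiplier keeps x on the simplex
assert x[1] == max(x)             # mass drifts toward the cheapest coordinate
assert min(x) > 0.0               # the interior is preserved for small steps
```

The update is multiplicative, as one expects from entropic regularization: each coordinate's mass grows or shrinks in proportion to how its cost compares with the current average cost.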

3.1   Hessian computation
                                                         ๐‘‰
Let us take ฮฆ as in (1.7) and calculate โˆ‡2 ฮฆ(๐‘ฅ) for ๐‘ฅ โˆˆ โ„++ . Fix ๐‘ข โ‰  ๐•ฃ . Then we have

                      ๐‘ค๐‘ข      ๐‘ฅ๐‘ข               ร• ๐‘ค           ๐‘ฅ๐‘ฃ        ๐‘ฅ๐‘ฃ
                                                                                                              
                                                   ๐‘ฃ
            ๐œ•๐‘ข ฮฆ(๐‘ฅ) =    log        + ๐›ฟ๐‘ข + 1 +       ๐›ฟ ๐‘ฃ log    + ๐›ฟ๐‘ฃ โˆ’    .                                                         (3.2)
                      ๐œ‚๐‘ข     ๐‘ฅ p(๐‘ข)              ๐œ‚๐‘ฃ          ๐‘ฅ๐‘ข        ๐‘ฅ๐‘ข
                                                                   ๐‘ฃโˆˆ๐œ’(๐‘ข)

Moreover, $\partial_{uv}\Phi(x) = 0$ unless $u = v$, $u \in \chi(v)$, or $v \in \chi(u)$, and in this case,
\[
\begin{aligned}
\partial_{uu}\Phi(x) &= \frac{w_u}{\eta_u\left( x_u + \delta_u x_{p(u)} \right)} + \sum_{v\in\chi(u)} \left( \frac{x_v}{x_u} \right)^{\!2} \frac{w_v}{\eta_v\left( x_v + \delta_v x_u \right)}, \\
\partial_{u,p(u)}\Phi(x) = \partial_{p(u),u}\Phi(x) &= -\frac{x_u}{x_{p(u)}} \cdot \frac{w_u}{\eta_u\left( x_u + \delta_u x_{p(u)} \right)}.
\end{aligned}
\]

3.2   Explicit dynamics
We are now in a position to calculate the formal dynamics. Let us define the control by $F(t,\cdot) := -c(t)$. We claim that for $u \neq \mathbb{r}$,
\[
\partial_t\!\left( \frac{x_u(t)}{x_{p(u)}(t)} \right) = \frac{\eta_u}{w_u}\left( \frac{x_u(t)}{x_{p(u)}(t)} + \delta_u \right)\left( \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell \right), \tag{3.3}
\]
where $\beta_u(t) \ge 0$ denotes the Lagrange multiplier corresponding to the constraint $x_u = \sum_{v\in\chi(u)} x_v$.
To verify equation (3.3), let us define, for $u \neq \mathbb{r}$,
\[
\mathcal{E}(u) := \frac{w_u}{\eta_u} \cdot \frac{x_{p(u)}(t)}{x_u(t) + \delta_u x_{p(u)}(t)}\; \partial_t\!\left( \frac{x_u(t)}{x_{p(u)}(t)} \right).
\]

Then equation (3.3) is equivalent to the assertion that
\[
\mathcal{E}(u) = \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell(t). \tag{3.4}
\]
Recalling equation (3.1), the equality $\left( \nabla^2\Phi(x(t))\, x'(t) \right)_u = \left( F(t, x(t)) - \gamma(t) \right)_u$ is equivalent to
\[
\mathcal{E}(\ell) = \beta_{p(\ell)}(t) - c_\ell(t), \qquad \ell \in \mathcal{L}, \tag{3.5}
\]
\[
\mathcal{E}(u) - \sum_{v\in\chi(u)} \frac{x_v(t)}{x_u(t)}\, \mathcal{E}(v) = \beta_{p(u)}(t) - \beta_u(t), \qquad u \in V \setminus (\mathcal{L} \cup \{\mathbb{r}\}). \tag{3.6}
\]
Clearly equation (3.5) already confirms equation (3.4) for $\ell \in \mathcal{L}$.
Let us conclude by verifying equation (3.4) for all $u \neq \mathbb{r}$ by (reverse) induction on the depth. Employing equation (3.6) along with the validity of equation (3.4) for $\{\mathcal{E}(v) : v \in \chi(u)\}$ yields
\[
\begin{aligned}
\mathcal{E}(u) &= \beta_{p(u)}(t) - \beta_u(t) + \sum_{v\in\chi(u)} \frac{x_v(t)}{x_u(t)}\left( \beta_u(t) - \sum_{\ell\in\mathcal{L}_v} \frac{x_\ell(t)}{x_v(t)}\, c_\ell(t) \right) \\
&= \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell(t),
\end{aligned}
\]
where we used the fact that $x_u = \sum_{v\in\chi(u)} x_v$ for $x \in K_T$. $\square$


3.3   Relationship between discrete and continuous dynamics
Recall the setup from Section 1.3.3. We consider a system of variables $\{q_u(t) : u \in V \setminus \{\mathbb{r}\}\}$ satisfying the differential equations
\[
\partial_t q_u(t) = \frac{\eta_u}{w_u}\,(q_u(t) + \delta_u)\left( \beta_{p(u)}(t) - \hat{c}_u(t) + \alpha_u(t) \right), \tag{3.7}
\]
where $\alpha_u(t)$ is a Lagrange multiplier for the constraint $q_u(t) \ge 0$, and $\hat{c}_u(t)$ is the "derived" cost in the subtree rooted at $u$:
\[
\hat{c}_u(t) := \sum_{\ell\in\mathcal{L}_u} q_{\ell|u}(t)\, c_\ell(t), \qquad q_{\ell|u}(t) := \prod_{v\in\gamma_{u,\ell}\setminus\{u\}} q_v(t),
\]
where $\gamma_{u,\ell}$ is the unique simple $u$-$\ell$ path in $T$. Now the values $q_{\ell|\mathbb{r}}$ give a probability distribution on the leaves.
    Let us argue that when the discretization parameter of the algorithm presented in Section 2
goes to zero, one arrives at a solution to equation (3.7). Recall that in Section 2.3, we split

each cost function $c \in \mathbb{R}_+^{\mathcal{L}}$ into $M$ pieces $M^{-1}c$ and computed a sequence of configurations $q_0, \ldots, q_M \in Q_T$. Define the piecewise-linear function $q^{(M)} : [0,1] \to Q_T$ by
\[
q^{(M)}\!\left( \frac{j+\delta}{M} \right) := (1-\delta)\, q_j + \delta\, q_{j+1}, \qquad \delta \in [0,1],\ j \in \{0, \ldots, M-1\}.
\]

Recalling Section 2.1, we have
\[
q_j^{(u)} := \operatorname{argmin}\left\{ \mathrm{D}^{(u)}\!\left( p \,\middle\|\, q_{j-1}^{(u)} \right) + \left\langle p,\, M^{-1} \hat{c}_j^{(u)} \right\rangle \;:\; p \in Q_T^{(u)} \right\}, \tag{3.8}
\]
where, for $v \in \chi(u)$,
\[
\left( \hat{c}_j \right)_v = \sum_{\ell\in\mathcal{L}_v} (q_j)_{\ell|v}\, c_\ell.
\]

Thus for $v \in \chi(u)$ and $j \ge 1$,
\[
\left( q_j^{(u)} \right)_v = \left( \left( q_{j-1}^{(u)} \right)_v + \delta_v \right) \exp\!\left( \frac{\eta_v}{w_v}\left( \beta_u - \left( M^{-1} (\hat{c}_j)_v - \alpha_v \right) \right) \right) - \delta_v.
\]
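The closed-form update suggests how a single step of (3.8) could be computed in practice: the multiplier $\beta_u$ is found by a one-dimensional search so that the children's masses sum to one, with the clamp at zero playing the role of the nonnegativity multipliers $\alpha_v$. A hedged sketch; all parameter values (eta, w, delta, costs) are illustrative, not taken from the paper.

```python
import math

def step(q_prev, c_hat, eta, w, delta):
    """One multiplicative update over the children of u, normalized via beta_u."""
    def mass(beta):
        return sum(
            max(0.0, (q + d) * math.exp(e / wt * (beta - c)) - d)
            for q, c, e, wt, d in zip(q_prev, c_hat, eta, w, delta)
        )
    lo, hi = -100.0, 100.0
    for _ in range(200):                 # bisection: mass(beta) is nondecreasing
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    beta = (lo + hi) / 2
    return [max(0.0, (q + d) * math.exp(e / wt * (beta - c)) - d)
            for q, c, e, wt, d in zip(q_prev, c_hat, eta, w, delta)]

q = [0.5, 0.3, 0.2]
c = [0.9, 0.1, 0.4]                      # derived costs, already scaled by 1/M
eta, w, delta = [1.0] * 3, [1.0] * 3, [0.1] * 3
q_new = step(q, c, eta, w, delta)

assert abs(sum(q_new) - 1.0) < 1e-9      # projected back onto the simplex
assert all(v >= 0.0 for v in q_new)
assert q_new[1] > q[1] and q_new[0] < q[0]  # mass shifts toward cheaper children
```

Bisection is a simple stand-in here; any monotone root-finder for $\beta_u$ would do, since each coordinate is nondecreasing in the multiplier.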

One can now verify that there is a constant $L = L(c, T)$ such that
\[
\left| (q_j)_v - (q_{j-1})_v \right| \le \frac{L}{M}, \qquad j \in \{1, \ldots, M\},\ v \in V \setminus \{\mathbb{r}\}.
\]
In particular, we see that $\big( q^{(M)} \big)' \in L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$ for every $M \ge 1$ and, moreover,
\[
\sup_{M\ge 1} \left\| \big( q^{(M)} \big)' \right\|_{L^\infty} < \infty. \tag{3.9}
\]

Therefore by Arzelà–Ascoli (see, e.g., [2, Thm. 0.3.1]), there is a subsequence $\{M_k\}$ such that $q^{(M_k)}$ converges uniformly to a function $q : [0,1] \to Q_T$.
    Since the unit ball of $L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$ is weakly compact (by the sequential Banach–Alaoglu theorem; see, e.g., [2, Thm. 0.3.3]), we can pass to a further subsequence $\{M'_k\}$ along which $\big( q^{(M'_k)} \big)'$ converges weakly to some $h \in L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$. Moreover, since $q^{(M)}(b) - q^{(M)}(a) = \int_a^b \big( q^{(M)} \big)'(t)\, dt$ for all $0 \le a < b \le 1$, it follows that $q(b) - q(a) = \int_a^b h(t)\, dt$ as well, and therefore $q'(t) = h(t)$ for almost all $t \in [0,1]$.
    If we similarly linearly interpolate the cost function to $\hat{c}^{(M)} : [0,1] \to \mathbb{R}_+^{V\setminus\{\mathbb{r}\}}$, then $\hat{c}^{(M_k)} \to \hat{c}$ along this sequence as well, and
\[
\hat{c}_u(t) = \sum_{\ell\in\mathcal{L}_u} q_{\ell|u}(t)\, c_\ell.
\]

    Now the KKT conditions for optimality in (3.8) give
\[
\nabla\Phi^{(u)}\!\left( q_j^{(u)} \right) - \nabla\Phi^{(u)}\!\left( q_{j-1}^{(u)} \right) + M^{-1} \hat{c}_j^{(u)} \in -N_{Q_T^{(u)}}\!\left( q_j^{(u)} \right),
\]


or equivalently,
\[
\frac{\nabla\Phi^{(u)}\!\left( q_j^{(u)} \right) - \nabla\Phi^{(u)}\!\left( q_{j-1}^{(u)} \right)}{M^{-1}} \in -\hat{c}_j^{(u)} - N_{Q_T^{(u)}}\!\left( q_j^{(u)} \right).
\]

By standard results in differential inclusion theory (e.g., the Convergence Theorem [2, Thm. 1.4.1]), we conclude that $q : [0,1] \to Q_T$ solves the differential inclusion
\[
\nabla^2\Phi^{(u)}\!\left( q^{(u)}(t) \right) \partial_t q^{(u)}(t) \in -\hat{c}^{(u)}(t) - N_{Q_T^{(u)}}\!\left( q^{(u)}(t) \right).
\]
Calculating the Hessian $\nabla^2\Phi^{(u)}$ reveals that $q(t)$ is a solution to equation (3.7).


References
 [1] Jacob Abernethy, Peter Bartlett, Niv Buchbinder, and Isabelle Stanton: A regularization
     approach to metrical task systems. In Proc. 21st Internat. Conf. Algorithmic Learning Theory
     (ALT’10), pp. 270–284. Springer, 2010. [doi:10.1007/978-3-642-16108-7_23] 5

 [2] Jean-Pierre Aubin and Arrigo Cellina: Differential Inclusions: Set-Valued Maps and Viability
     Theory. Springer, 1984. [doi:10.1007/978-3-642-69512-4] 21, 22

 [3] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor: A polylogarithmic-
     competitive algorithm for the k-server problem. J. ACM, 62(5):40:1–49, 2015. Preliminary
     version in FOCS’11. [doi:10.1145/2783434, arXiv:1110.1580] 14

 [4] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: Metrical task systems and the k-server
     problem on HSTs. In Proc. 37th Internat. Colloq. on Automata, Languages, and Programming
     (ICALP’10), pp. 287–298. Springer, 2010. [doi:10.1007/978-3-642-14165-2_25] 3

 [5] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: A primal-dual randomized algorithm
     for weighted paging. J. ACM, 59(4):19:1–24, 2012. Preliminary version in FOCS’07.
     [doi:10.1145/2339123.2339126] 2

 [6] Yair Bartal: Probabilistic approximations of metric spaces and its algorithmic applications.
     In Proc. 37th FOCS, pp. 184–193. IEEE Comp. Soc., 1996. [doi:10.1109/SFCS.1996.548477] 2,
     3

 [7] Yair Bartal, Avrim Blum, Carl Burch, and Andrew Tomkins: A polylog(n)-competitive
     algorithm for metrical task systems. In Proc. 29th STOC, pp. 711–719. ACM Press, 1997.
     [doi:10.1145/258533.258667] 2

 [8] Yair Bartal, Béla Bollobás, and Manor Mendel: Ramsey-type theorems for metric
     spaces with applications to online problems. J. Comput. System Sci., 72(5):890–921, 2006.
     [doi:10.1016/j.jcss.2005.05.008, arXiv:cs/0406028] 2

                     Theory of Computing, Volume 18 (23), 2022, pp. 1–24                               22

 [9] Yair Bartal, Nathan Linial, Manor Mendel, and Assaf Naor: On metric Ramsey-
     type phenomena. Ann. Math., 162(2):643โ€“709, 2005. Preliminary version in STOCโ€™03.
     [doi:10.4007/annals.2005.162.643, arXiv:math/0406353] 2

[10] Avrim Blum, Howard Karloff, Yuval Rabani, and Michael Saks: A decomposition
     theorem for task systems and bounds for randomized server problems. SIAM J. Comput.,
     30(5):1624โ€“1661, 2000. [doi:10.1137/S0097539799351882] 2

[11] Allan Borodin, Nathan Linial, and Michael E. Saks: An optimal on-line algorithm for
     metrical task system. J. ACM, 39(4):745โ€“763, 1992. [doi:10.1145/146585.146588] 2

[12] Sรฉbastien Bubeck and Nicolรฒ Cesa-Bianchi: Regret analysis of stochastic and non-
     stochastic multi-armed bandit problems. Found. Trends Mach. Learning, 5(1):1โ€“122, 2012.
     [doi:10.1561/2200000024, arXiv:1204.5721] 2, 6

[13] Sรฉbastien Bubeck, Michael B. Cohen, James R. Lee, and Yin Tat Lee: Metrical task systems
     on trees via mirror descent and unfair gluing. SIAM J. Comput., 50(3):909โ€“923, 2021.
     Preliminary version in SODAโ€™19. [doi:10.1137/19M1237879, arXiv:1807.04404] 2, 3, 4, 5, 6

[14] Sรฉbastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Madry:
     ๐‘˜-server via multiscale entropic regularization. In Proc. 50th STOC, pp. 3โ€“16. ACM Press,
     2018. [doi:10.1145/3188745.3188798, arXiv:1711.01085] 5, 8, 18, 19

[15] Niv Buchbinder, Shahar Chen, and Joseph (Seffi) Naor: Competitive analysis via regular-
     ization. In Proc. 25th Ann. ACMโ€“SIAM Symp. on Discrete Algorithms (SODAโ€™14), pp. 436โ€“444.
     SIAM, 2014. [doi:10.1137/1.9781611973402.32] 5

[16] Niv Buchbinder and Joseph Naor: The design of competitive online algorithms via a primal-
     dual approach. Found. Trends Theor. Comp. Sci., 3(2โ€“3):93โ€“263, 2009. [doi:10.1561/0400000024]
     6

[17] Christian Coester and James R. Lee: Pure entropic regularization for metrical task systems.
     In Proc. 32nd Ann. Conf. on Learning Theory (COLT’19), pp. 835–848. PMLR, 2019. 1

[18] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar: A tight bound on approximating
     arbitrary metrics by tree metrics. J. Comput. System Sci., 69(3):485โ€“497, 2004. Preliminary
     version in STOCโ€™03. [doi:10.1016/j.jcss.2004.04.011] 2, 3, 4

[19] Amos Fiat and Manor Mendel: Better algorithms for unfair metrical task systems and
     applications. SIAM J. Comput., 32(6):1403โ€“1422, 2003. Preliminary version in STOCโ€™00.
     [doi:10.1137/S0097539700376159, arXiv:cs/0406034] 2

[20] Steve Seiden: Unfair problems and randomized algorithms for metrical task systems.
     Inform. Comput., 148(2):219โ€“240, 1999. [doi:10.1006/inco.1998.2744] 2




AUTHORS

    Christian Coester
    Associate professor
    Department of Computer Science
    University of Oxford
    Oxford, United Kingdom
    christian coester cs ox ac uk
    https://www.cs.ox.ac.uk/people/christian.coester/


    James R. Lee
    Professor
    Paul G. Allen Center for Computer Science & Engineering
    University of Washington
    Seattle, Washington, USA
    jrl cs washington edu
    https://homes.cs.washington.edu/~jrl/


ABOUT THE AUTHORS

    Christian Coester is Associate Professor in the Department of Computer Science at
      the University of Oxford and Tutorial Fellow at St Anneโ€™s College. He received
      his Ph. D. from Oxford in 2020, under the supervision of Elias Koutsoupias. After
      stints as a postdoctoral researcher at CWI in Amsterdam and Tel Aviv University
      and as a lecturer at the University of Sheffield, he returned to Oxford in 2022.
      His research focuses on the design and analysis of algorithms, especially online
      algorithms and learning-augmented algorithms. Outside of his research, he
      enjoys sports, chess and playing the piano.


    James R. Lee is a Professor of Computer Science & Engineering at the University of
       Washington. He received his Ph. D. from the University of California, Berkeley
       in 2005, under the supervision of Christos Papadimitriou, followed by a postdoc
       at the Institute for Advanced Study in Princeton. His research interests are
       varied and eclectic, ranging from spectral graph algorithms to functional analysis,
       and from convex optimization to statistical physics. He challenged a class of
       undergrads to compete against the MTS algorithm in this paper. They fought
       (and coded) valiantly. They PyTorched and TensorFlowed. But in the end, the
       Theory won.



