
Pure Entropic Regularization for Metrical Task Systems

Authors: Christian Coester, James R. Lee

License CC-BY-3.0

                          Theory of Computing, Volume 18 (23), 2022, pp. 1–24
                                       www.theoryofcomputing.org




Pure Entropic Regularization for Metrical
             Task Systems
                              Christian Coester*                    James R. Lee†
                 Received July 23, 2019; Revised October 8, 2020; Published December 29, 2022




       Abstract. We show that on every $n$-point HST metric, there is a randomized online
       algorithm for metrical task systems (MTS) that is 1-competitive for service costs and
       $O(\log n)$-competitive for movement costs. In general, these refined guarantees are
       optimal up to the implicit constant. While an $O(\log n)$-competitive algorithm for
       MTS on HST metrics was developed by Bubeck et al. (SODA'19), that approach could
       only establish an $O((\log n)^2)$-competitive ratio when the service costs are required
       to be $O(1)$-competitive. Our algorithm can be viewed as an instantiation of online
       mirror descent with the regularizer derived from a multiscale conditional entropy.
           In fact, our algorithm satisfies a set of even more refined guarantees; we are able
       to exploit this property to combine it with known random embedding theorems and
       obtain, for any $n$-point metric space, a randomized algorithm that is 1-competitive
       for service costs and $O((\log n)^2)$-competitive for movement costs.

    An extended abstract of this paper appeared in the Proceedings of the 32nd Ann. Conference on Learning
Theory (COLT 2019) [17].
  * Supported by EPSRC Award 1652110. Part of this work was carried out while C. Coester was visiting the
University of Washington, hosted by J. R. Lee.
  † Supported by NSF grants CCF-1616297 and CCF-1407779 and a Simons Investigator Award.



ACM Classification: F.2.0
AMS Classification: 68W27
Key words and phrases: online algorithms, competitive analysis, mirror descent, metrical task
systems, decision making under uncertainty


© 2022 Christian Coester and James R. Lee
Licensed under a Creative Commons Attribution License (CC-BY)                 DOI: 10.4086/toc.2022.v018a023

1    Introduction
Let (๐‘‹ , ๐‘‘๐‘‹ ) be a finite metric space with |๐‘‹ | = ๐‘› > 1. The Metrical Task Systems (MTS) problem,
introduced in [11] is described as follows. The input is a sequence h๐‘ ๐‘ก : ๐‘‹ โ†’ โ„+ : ๐‘ก โ‰ฅ 1i of
nonnegative cost functions on the state space ๐‘‹. At every time ๐‘ก, an online algorithm maintains
a state ๐œŒ๐‘ก โˆˆ ๐‘‹.
     The corresponding cost is the sum of a service cost ๐‘ ๐‘ก (๐œŒ๐‘ก ) and a movement cost ๐‘‘๐‘‹ (๐œŒ๐‘กโˆ’1 , ๐œŒ๐‘ก ).
Formally, an online algorithm is a sequence of mappings ๐† = h๐œŒ1 , ๐œŒ2 , . . . , i where, for every ๐‘ก โ‰ฅ 1,
๐œŒ๐‘ก : (โ„+๐‘‹ )๐‘ก โ†’ ๐‘‹ maps a sequence of cost functions h๐‘1 , . . . , ๐‘ ๐‘ก i to a state. The initial state ๐œŒ0 โˆˆ ๐‘‹
is fixed. The total cost of the algorithm ๐† in servicing ๐’„ = h๐‘ ๐‘ก : ๐‘ก โ‰ฅ 1i is defined as:
                            ร•
             cost๐† (๐’„) :=         [๐‘ ๐‘ก (๐œŒ๐‘ก (๐‘1 , . . . , ๐‘ ๐‘ก )) + ๐‘‘๐‘‹ (๐œŒ๐‘กโˆ’1 (๐‘1 , . . . , ๐‘ ๐‘กโˆ’1 ), ๐œŒ๐‘ก (๐‘1 , . . . , ๐‘ ๐‘ก ))] .
                            ๐‘กโ‰ฅ1

The cost of the offline optimum, denoted $\mathrm{cost}^*(\boldsymbol{c})$, is the infimum of $\sum_{t \geq 1} [c_t(\rho_t) + d_X(\rho_{t-1}, \rho_t)]$
over all sequences $\langle \rho_t : t \geq 1 \rangle$ of states. A randomized online algorithm $\boldsymbol{\rho}$ is said to be $\alpha$-competitive
if for every $\rho_0 \in X$, there is a constant $\beta > 0$ such that for all cost sequences $\boldsymbol{c}$:

$$\mathbb{E}\left[\mathrm{cost}_{\boldsymbol{\rho}}(\boldsymbol{c})\right] \leq \alpha \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta.$$

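The definitions above can be made concrete with a short sketch (ours, not from the paper; the function names `mts_cost` and `offline_opt` are illustrative). It computes the total cost of a fixed state sequence, and $\mathrm{cost}^*(\boldsymbol{c})$ by dynamic programming over states, which is how the offline optimum is evaluated on a finite metric space.

```python
# Illustrative sketch: total MTS cost of a state sequence, and the offline
# optimum cost*(c) via dynamic programming. The metric d is a dict of dicts.

def mts_cost(d, states, costs, rho0):
    """Total service + movement cost of the state sequence `states`."""
    total, prev = 0.0, rho0
    for c_t, rho_t in zip(costs, states):
        total += c_t[rho_t] + d[prev][rho_t]  # service cost + movement cost
        prev = rho_t
    return total

def offline_opt(d, points, costs, rho0):
    """cost*(c): infimum over all state sequences rho_1, rho_2, ..., via DP."""
    # best[y] = cheapest total cost of any sequence ending in state y so far
    best = {y: (0.0 if y == rho0 else float('inf')) for y in points}
    for c_t in costs:
        best = {x: min(best[y] + d[y][x] for y in points) + c_t[x]
                for x in points}
    return min(best.values())
```

For example, on the 2-point uniform metric with a single cost vector charging 5 at the start state and 0 elsewhere, the offline optimum moves once and pays only the movement cost of 1.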
    For the ๐‘›-point uniform metric, a simple coupon-collector argument shows that the com-
petitive ratio is ฮฉ(log ๐‘›), and this is tight [11]. A long-standing conjecture is that this ฮ˜(log ๐‘›)
competitive ratio holds for an arbitrary ๐‘›-point metric space. The lower bound has almost
been established [8, 9]; for any ๐‘›-point metric space, the competitive ratio is ฮฉ(log ๐‘›/log log ๐‘›).
Following a long sequence of works (see, e. g., [20, 10, 7, 6, 19, 18]), an upper bound of ๐‘‚((log ๐‘›)2 )
was shown in [13].

Relation to adversarial multi-armed bandits MTS is naturally related to the adversarial setting of
the classical multi-armed bandit model in sequential decision making, and provides a very general
framework for "bandits with switching costs." Unlike in the setting of regret minimization,
where one competes against the best static strategy in hindsight (see, e. g., [12]), competitive
analysis compares the performance of an online algorithm to the best dynamical offline algorithm.
    Thus this model emphasizes the importance of adaptivity in the face of changing
environments. For MTS, the online algorithm has full information: access to the complete cost
function $c_t$ is available when deciding on a point $\rho_t(c_1, \ldots, c_t) \in X$ at which to play. And yet
one of the fascinating relationships between MTS and adversarial bandits is the parallel between
adaptivity (being willing to "try out" new strategies) and the classical exploration/exploitation
tradeoff that occurs in models where one only has access to partial information about the loss
functions.

HST metrics The methods of [5] show that the competitive ratio for MTS is $O(\log n)$ on
weighted star metrics. Recently, the authors of [13] generalized this result by designing
an algorithm with competitive ratio $O(\mathfrak{D}_T \log n)$ on any weighted $n$-point tree metric with
combinatorial depth $\mathfrak{D}_T$. We now discuss a special class of metrics.


    Let ๐‘‡ = (๐‘‰ , ๐ธ) be a finite tree with root ๐•ฃ and vertex weights {๐‘ค ๐‘ข > 0 : ๐‘ข โˆˆ ๐‘‰ }, let โ„’ โІ ๐‘‰
denote the leaves of ๐‘‡, and suppose that the vertex weights on ๐‘‡ are non-increasing along
root-leaf paths. Consider the metric space (โ„’, ๐‘‘๐‘‡ ), where ๐‘‘๐‘‡ (โ„“ , โ„“ 0) is the weighted length of the
path connecting โ„“ and โ„“ 0 when the edge from a node ๐‘ข to its parent is ๐‘ค ๐‘ข . We will use ๐”‡๐‘‡ for
the combinatorial (i. e., unweighted) depth of ๐‘‡.
    (โ„’, ๐‘‘๐‘‡ ) is called an HST metric (or, equivalently for finite metric spaces, an ultrametric). If, for
some ๐œ > 1, the weights on ๐‘‡ satisfy the stronger inequality ๐‘ค ๐‘ฃ โ‰ค ๐‘ค ๐‘ข /๐œ whenever ๐‘ฃ is a child
of ๐‘ข, the space (โ„’, ๐‘‘๐‘‡ ) is said to be a ๐œ-HST metric. Such metric spaces play a special role in
MTS since every ๐‘›-point metric space can be probabilistically approximated by a distribution
over such spaces [6, 18]. Indeed, the ๐‘‚((log ๐‘›)2 )-competitive ratio for general metric spaces
established in [13] is a consequence of their ๐‘‚(log ๐‘›)-competitive algorithm for HSTs.
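The tree distance just described can be sketched in a few lines (ours; the dictionary encoding of the tree is an assumption for illustration): $d_T(\ell, \ell')$ sums the weights $w_u$ over all nodes on the $\ell$-$\ell'$ path, excluding the least common ancestor.

```python
# Illustrative sketch: HST/ultrametric distance on the leaves of a rooted
# tree, where the edge from node u to its parent has length w[u].

def hst_distance(parent, w, a, b):
    """Sum of w[u] over all nodes u on the a-b path, excluding their LCA."""
    # Collect the ancestors of a (inclusive) up to the root.
    anc, u = [], a
    while u is not None:
        anc.append(u)
        u = parent[u]
    anc_set = set(anc)
    # Walk up from b until we hit an ancestor of a (the LCA), summing weights.
    dist, u = 0.0, b
    while u not in anc_set:
        dist += w[u]
        u = parent[u]
    lca = u
    # Add the a-side of the path.
    u = a
    while u != lca:
        dist += w[u]
        u = parent[u]
    return dist
```

On a 2-level tree with $w_u = 4$ at depth one and $w_\ell = 1$ at the leaves (so a 4-HST), two leaves under the same internal node are at distance 2, while leaves in different subtrees are at distance 10.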


1.1   Refined guarantees
The authors of [4] observe that there is a more refined way to analyze competitive algorithms for
MTS. For a randomized online algorithm $\boldsymbol{\rho}$ and a cost sequence $\boldsymbol{c}$, we write $\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c})$ and $\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c})$
for the expected service cost and movement cost, respectively, that is,

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) := \mathbb{E} \sum_{t \geq 1} c_t(\rho_t) \qquad \text{and} \qquad \mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) := \mathbb{E} \sum_{t \geq 1} d_X(\rho_{t-1}, \rho_t).$$


If there are numbers $\alpha, \alpha', \beta, \beta' > 0$ such that for every cost sequence $\boldsymbol{c}$, it holds that

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta$$
$$\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha' \cdot \mathrm{cost}^*(\boldsymbol{c}) + \beta',$$

one says that $\boldsymbol{\rho}$ is $\alpha$-competitive for service costs and $\alpha'$-competitive for movement costs.
    In [4], it is shown that on every $n$-point HST metric, and for every $\varepsilon > 0$, there is an
online algorithm that is simultaneously $(1+\varepsilon)$-competitive for service costs and $O((\log(n/\varepsilon))^2)$-
competitive for movement costs. The authors of [13] improve this slightly to show that
actually there is an online algorithm that is simultaneously 1-competitive for service costs and
$O((\log n)^2)$-competitive for movement costs. We obtain the optimal refined guarantees.

Theorem 1.1. On any $n$-point HST metric $X$, there is a randomized online algorithm that is 1-competitive
for service costs and $O(\log n)$-competitive for movement costs.

Remark 1.2 (Optimality of the refined guarantees). Any finitely competitive algorithm for MTS
on an $n$-point uniform metric cannot be better than $\Omega(\log n)$-competitive for movement costs,
regardless of its competitive ratio for service costs. This is because the $\Omega(\log n)$ lower bound holds even
if the cost functions only take values 0 and $\infty$. Moreover, it cannot be better than 1-competitive
for service costs, regardless of its competitive ratio for movement costs. To see this, consider the
case where each cost function is the constant function 1.


Finely competitive guarantees Suppose that for some numbers $\alpha_0, \alpha_1, \gamma, \beta, \beta' > 0$, a random-
ized online algorithm $\boldsymbol{\rho}$ satisfies, for every cost sequence $\boldsymbol{c}$ and every offline algorithm $\boldsymbol{\rho}^*$:

$$\mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \alpha_0\, \mathrm{S}_{\boldsymbol{\rho}^*}(\boldsymbol{c}) + \alpha_1\, \mathrm{M}_{\boldsymbol{\rho}^*}(\boldsymbol{c}) + \beta \tag{1.1}$$
$$\mathrm{M}_{\boldsymbol{\rho}}(\boldsymbol{c}) \leq \gamma\, \mathrm{S}_{\boldsymbol{\rho}}(\boldsymbol{c}) + \beta'. \tag{1.2}$$

In this case, we say that $\boldsymbol{\rho}$ is $(\alpha_0, \alpha_1, \gamma)$-finely competitive. We establish the following.
Theorem 1.3. On any $n$-point HST metric $X$, for every $\kappa \geq 1$, there is an online randomized algorithm
$\boldsymbol{\rho}$ that is $(1, 1/\kappa, O(\kappa \log n))$-finely competitive. In fact, one can take $\beta = 0$ and $\beta' \leq O(\kappa\, \mathrm{diam}(X))$.

   Combined with the random embedding from [18], this yields the following consequence for
general $n$-point metric spaces.
Corollary 1.4. On any $n$-point metric space, there is an online randomized algorithm that is 1-competitive
for service costs and $O((\log n)^2)$-competitive for movement costs.
Proof. Consider an $n$-point metric space $(X, d_X)$. It is known [18] that there exists a random
HST metric $(T, d_T)$ so that $\mathcal{L}(T) = X$ and for all $x, y \in X$:
   1. $\Pr[d_T(x, y) \geq d_X(x, y)] = 1$,
   2. $\mathbb{E}[d_T(x, y)] \leq D \cdot d_X(x, y)$,
where $D \leq O(\log n)$.
   Let ๐†๐‘‡ be the randomized algorithm for (๐‘‡, ๐‘‘๐‘‡ ) guaranteed by Theorem 1.3 with ๐œ… = ๐ท. Let
๐† denote the algorithm that results from sampling (๐‘‡, ๐‘‘๐‘‡ ) and then using ๐†๐‘‡ . We use M๐‘‡ to
denote movement cost measured in ๐‘‘๐‘‡ and M๐‘‹ for movement cost measured in ๐‘‘๐‘‹ .
   Then for any cost ๐’„ and any offline algorithm ๐†โˆ— , we have
                         S๐† (๐’„) = ๐”ผ[S๐†๐‘‡ (๐’„)] โ‰ค S๐†โˆ— (๐’„) + ๐œ… โˆ’1 ๐”ผ[M๐‘‡๐†โˆ— (๐’„)] + ๐‘‚(1)
                                                 โ‰ค S๐†โˆ— (๐’„) + ๐œ… โˆ’1 ๐ท M๐†๐‘‹โˆ— (๐’„) + ๐‘‚(1)
                                                = S๐†โˆ— (๐’„) + M๐†๐‘‹โˆ— (๐’„) + ๐‘‚(1) ,
and
                 M๐†๐‘‹ (๐’„) = ๐”ผ[M๐†๐‘‹๐‘‡ (๐’„)] โ‰ค ๐”ผ[M๐‘‡๐†๐‘‡ (๐’„)] โ‰ค ๐‘‚(๐œ… log ๐‘›) ๐”ผ[S๐†๐‘‡ (๐’„)] + ๐‘‚(1),
completing the proof.                                                                                  

1.2   The fractional model on trees
We will work in the following deterministic fractional setting, which is equivalent to the
randomized integral setting described earlier (see [13, §2]). The state of a fractional algorithm is
given by a point in the polytope

$$\mathrm{K}_T := \left\{ x \in \mathbb{R}_+^V : x_{\mathfrak{r}} = 1,\ x_u = \sum_{v \in \chi(u)} x_v \quad \forall u \in V \setminus \mathcal{L} \right\}, \tag{1.3}$$


where we use $\chi(u)$ for the set of children of $u$ in $T$. For $u \neq \mathfrak{r}$, we will also write $\mathrm{p}(u)$ for the
parent of $u$ in $T$.
    A state $x \in \mathrm{K}_T$ corresponds to the situation that the state of a randomized integral algorithm
is a leaf descendant of $u$ with probability $x_u$. Note that $\mathrm{K}_T$ is simply an affine encoding of the
probability simplex on $\mathcal{L}$. In the fractional setting, changing from state $x$ to $x'$ incurs movement
cost $\|x - x'\|_{\ell_1(w)}$, where

$$\|z\|_{\ell_1(w)} := \sum_{u \in V} w_u |z_u|$$

denotes the weighted $\ell_1$-norm on $\mathbb{R}^V$.
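A minimal sketch (ours; the dictionary encoding and function names are illustrative) of the two ingredients of the fractional model: membership in the polytope $\mathrm{K}_T$ of (1.3), and the weighted $\ell_1$ movement cost.

```python
# Illustrative sketch: membership test for K_T and the fractional movement
# cost ||x - x'||_{l1(w)}. `children` maps each internal node to its children.

def in_KT(x, children, root, leaves, tol=1e-9):
    """x in K_T: x_root = 1, x >= 0, and flow conservation at internal nodes."""
    if abs(x[root] - 1.0) > tol or any(v < -tol for v in x.values()):
        return False
    return all(abs(x[u] - sum(x[v] for v in children[u])) <= tol
               for u in children if u not in leaves)

def movement(x, x2, w):
    """Weighted l1 distance: sum over nodes u of w[u] * |x[u] - x2[u]|."""
    return sum(w[u] * abs(x[u] - x2[u]) for u in w)
```

For a single-level tree (a weighted star), moving half the probability mass from one leaf to another with unit leaf weights costs exactly 1, matching the earthmover intuition.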

1.3     Mirror descent, metric filtrations, and regularization
Following [13], our algorithm is based on the mirror descent framework as established in [14].
This is a method for regularized online convex optimization, an approach that was previously
explored for competitive analysis in [1, 15].
     A central component of mirror descent is choosing the appropriate mirror map (which we
will often refer to as the "regularizer"). This is a strictly convex function $\Phi : \mathrm{K}_T \to \mathbb{R}$ that endows
$\mathrm{K}_T$ with a geometric (Riemannian) structure, specifying how to perform constrained vector flow.
In other words, it specifies how one can move in a preferred direction while remaining inside
$\mathrm{K}_T$.
     The paper [13] employs the following regularizer:

$$\Phi_0(x) := \frac{1}{\eta} \sum_{u \in V \setminus \{\mathfrak{r}\}} w_u (x_u + \delta_u) \log(x_u + \delta_u), \tag{1.4}$$

with $\eta = \Theta(\log|\mathcal{L}|)$ and $\delta_u = |\mathcal{L}_u|/|\mathcal{L}|$, where $\mathcal{L}_u$ is the set of leaves in the subtree rooted at $u$.

1.3.1    Metric filtrations
It is straightforward that one can think of $\Phi_0$ as a type of multiscale entropy (this is the negative
of the associated Shannon entropy, since we use the analyst's convention that the entropy is
convex). To understand this notion, let us forget momentarily the weights on $T$. Then the
structure of $T$ gives a natural filtration over probability measures on the leaves $\mathcal{L}$. Suppose that
$\boldsymbol{X}$ is a random variable taking values in $\mathcal{L}$ and, for $u \in V$, denote by $\mathcal{E}_u$ the event $\{\boldsymbol{X} \in \mathcal{L}_u\}$.
Then the chain rule for Shannon entropy yields

$$\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \sum_{u \in V \setminus \{\mathfrak{r}\}} \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathrm{p}(u)}]}{\Pr[\mathcal{E}_u]}.$$

    If we now imagine that uncertainty at higher scales is more costly than uncertainty at lower
scales, then we might define an analogous weighted entropy by

$$\sum_{u \in V \setminus \{\mathfrak{r}\}} w_u \Pr[\mathcal{E}_u] \log \frac{\Pr[\mathcal{E}_{\mathrm{p}(u)}]}{\Pr[\mathcal{E}_u]}. \tag{1.5}$$

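The chain rule above is easy to verify numerically; here is a small check of our own on a 3-leaf tree (root $\mathfrak{r}$ with an internal child $u$ covering leaves $\{a, b\}$, plus a leaf $c$ directly under the root).

```python
# Numeric sanity check (ours, not from the paper) of the entropy chain rule:
# sum over leaves of Pr[E_l] log(1/Pr[E_l]) equals the telescoping sum of
# conditional terms Pr[E_u] log(Pr[E_p(u)] / Pr[E_u]) over non-root nodes.
import math

parent = {'u': 'r', 'a': 'u', 'b': 'u', 'c': 'r'}   # non-root nodes -> parent
pr = {'r': 1.0, 'u': 0.5, 'a': 0.25, 'b': 0.25, 'c': 0.5}  # Pr[E_v]
leaves = ['a', 'b', 'c']

lhs = sum(pr[l] * math.log(1 / pr[l]) for l in leaves)
rhs = sum(pr[v] * math.log(pr[parent[v]] / pr[v]) for v in parent)
assert abs(lhs - rhs) < 1e-12
```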


Such a notion is natural in the context of "metric learning" problems.
    Ignoring the $\{\delta_u\}$ values for a moment, consider that (1.4) is not analogous to (1.5). Indeed,
it corresponds to the quantity

$$\sum_{u \in V \setminus \{\mathfrak{r}\}} w_u \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]}, \tag{1.6}$$

and now one can see a fundamental reason why the algorithm associated to (1.4) only achieves
an $O(\mathfrak{D}_T \log n)$ competitive ratio, where $\mathfrak{D}_T$ is the combinatorial depth of $T$: The quantity (1.6)
overmeasures the metric uncertainty.
    Suppose that $\boldsymbol{X}$ is a uniformly random leaf. Then $\sum_{\ell \in \mathcal{L}} \Pr[\mathcal{E}_\ell] \log \frac{1}{\Pr[\mathcal{E}_\ell]} = \log n$, where
$n = |\mathcal{L}|$. But, in general, one could have $\sum_{u \in V} \Pr[\mathcal{E}_u] \log \frac{1}{\Pr[\mathcal{E}_u]} \geq \Omega(\mathfrak{D}_T \log n)$. This fact was not
lost on the authors of [13], but they bypass the problem by combining mirror descent on stars
with a recursive composition method called "unfair gluing."

1.3.2   Multiscale conditional entropy
We employ a regularizer that is a more faithful analog of (1.5):

$$\Phi(x) := \sum_{u \in V \setminus \{\mathfrak{r}\}} \frac{w_u}{\eta_u} \left( x_u + \delta_u x_{\mathrm{p}(u)} \right) \log \left( \frac{x_u}{x_{\mathrm{p}(u)}} + \delta_u \right), \tag{1.7}$$

where $\mathrm{p}(u)$ denotes the parent of $u$.
   If one ignores the additional parameters $\{\eta_u \geq 1, \delta_u > 0\}$, this is precisely the negative
weighted Shannon entropy written according to the chain rule. Here, we set

$$\theta_u := \frac{|\mathcal{L}_u|}{|\mathcal{L}_{\mathrm{p}(u)}|} \tag{1.8}$$
$$\eta_u := 1 + \log(1/\theta_u) \tag{1.9}$$
$$\delta_u := \theta_u / \eta_u. \tag{1.10}$$

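The parameters (1.8)-(1.10) are purely combinatorial, depending only on leaf counts; a short sketch (ours; the function name and tree encoding are illustrative) computes them for every non-root node.

```python
# Illustrative sketch: the parameters theta_u, eta_u, delta_u of (1.8)-(1.10),
# given the number of leaves |L_u| below each node.
import math

def parameters(parent, leaf_count):
    """Return (theta, eta, delta) dicts for all non-root nodes."""
    theta = {u: leaf_count[u] / leaf_count[p]
             for u, p in parent.items() if p is not None}
    eta = {u: 1 + math.log(1 / t) for u, t in theta.items()}
    delta = {u: theta[u] / eta[u] for u in theta}
    return theta, eta, delta
```

For instance, a leaf $c$ that is one of three children-side leaves of the root gets $\theta_c = 1/3$ and learning rate $\eta_c = 1 + \log 3$.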
    The numbers $\{\theta_u\}$ are the conditional probabilities of the uniform distribution on leaves.
The $\{\delta_u\}$ values are employed as "noise" added to the entropy calculation. Such noise is a
fundamental aspect for competitive analysis, and distinguishes it from the application of mirror
descent to regret minimization problems (see, e. g., [12]).¹ The effect of these noise parameters
appears ubiquitously in applications of the primal-dual method to competitive analysis (see
[16]), and manifests itself as an additive term in the update rules (see equation (1.11) below).
Intuitively, it ensures that the conditional probability $x_u / x_{\mathrm{p}(u)}$ is updated fast enough even when it
is close to 0.
    Finally, the numbers $\{\eta_u : u \in V\}$ are commonly referred to as "learning rates" in the study
of online learning. They represent the rate at which information is discounted in the resulting
algorithm; for MTS, this corresponds to the relative importance of costs arriving now vs. costs
that arrived in the past.
   ¹ One finds aspects of this "mixing with the uniform distribution" in the bandits setting as well, but used for
variance reduction, a seemingly very different purpose.

1.3.3   The dynamics
We will derive in Section 3 the following continuous time evolution of the resulting mirror descent
algorithm $(x(t) \in \mathrm{K}_T : t \in [0, \infty))$ for a cost path $c : [0, \infty) \to \mathbb{R}_+^{\mathcal{L}}$:

$$\partial_t \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)} = \frac{\eta_u}{w_u} \left( \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)} + \delta_u \right) \left( \beta_{\mathrm{p}(u)}(t) - \sum_{\ell \in \mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)} c_\ell(t) \right) \tag{1.11}$$

Here, $\beta_{\mathrm{p}(u)}(t)$ is a Lagrangian multiplier that ensures conservation of conditional probability:

$$\sum_{v \in \chi(\mathrm{p}(u))} \partial_t \left( \frac{x_v(t)}{x_{\mathrm{p}(u)}(t)} \right) = 0.$$

One can see that the evolution is being driven by the expected instantaneous cost incurred
conditioned on the current state being in the subtree rooted at ๐‘ข.
    One should interpret equation (1.11) only when ๐‘ฅ(๐‘ก) lies in the relative interior of K๐‘‡ .
Otherwise, the conditional probabilities are ill-defined. One way to rectify this is to prevent
๐‘ฅ(๐‘ก) from hitting the relative boundary of K๐‘‡ at all. It is possible to adaptively modify the cost
functions by a suitably small perturbation so as to guarantee this property and, at the same
time, ensure that the total discrepancy between the modified and true service cost is a small
additive constant.
    Instead, we will follow a different approach, by extending the dynamics to an analogous
system of conditional probabilities $\{q_u(t) : u \in V \setminus \{\mathfrak{r}\}\}$:

$$\partial_t q_u(t) = \frac{\eta_u}{w_u} (q_u(t) + \delta_u) \left( \beta_{\mathrm{p}(u)}(t) - \hat{c}_u(t) + \alpha_u(t) \right), \tag{1.12}$$

where $q_u(t) = \frac{x_u(t)}{x_{\mathrm{p}(u)}(t)}$ whenever $x_{\mathrm{p}(u)}(t) > 0$, $\alpha_u(t)$ is a Lagrangian multiplier for the constraint
$q_u(t) \geq 0$, and $\hat{c}_u(t)$ is the "derived" cost in the subtree rooted at $u$:

$$\hat{c}_u(t) := \sum_{\ell \in \mathcal{L}_u} q_{\ell|u}(t)\, c_\ell(t), \qquad q_{\ell|u}(t) := \prod_{v \in \gamma_{u,\ell} \setminus \{u\}} q_v(t),$$

where $\gamma_{u,\ell}$ is the unique simple $u$-$\ell$ path in $T$.
    Stated this way, the mirror descent algorithm can be envisioned as running a "weighted star"
algorithm on the conditional probabilities at every internal node of $T$, with the derived costs at
an internal node $u$ given by the average cost of the current strategy for playing one unit of mass
in the subtree rooted at $u$.
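Since $q_{\ell|u}$ is a product of conditional probabilities along the $u$-$\ell$ path, the derived costs satisfy the recursion $\hat{c}_u = \sum_{v \in \chi(u)} q_v \hat{c}_v$, so they can be computed bottom-up in one pass. A sketch of our own (the tree encoding is illustrative):

```python
# Illustrative sketch: derived costs c_hat_u computed bottom-up via the
# recursion c_hat_u = sum over children v of q[v] * c_hat_v, which unrolls to
# the path-product formula for q_{l|u}.

def derived_costs(children, q, leaf_cost, root):
    """Return c_hat for every node of the tree rooted at `root`."""
    c_hat = dict(leaf_cost)              # c_hat_l := c_l at the leaves
    def visit(u):
        if u not in children:            # u is a leaf
            return c_hat[u]
        c_hat[u] = sum(q[v] * visit(v) for v in children[u])
        return c_hat[u]
    visit(root)
    return c_hat
```

On a balanced binary-style example with all conditional probabilities $1/2$, the derived cost at each internal node is just the average of its children's derived costs.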


    In the next section, we will implement and analyze a discretization of (1.12) using Bregman
projections. Since our regularizer $\Phi$ and convex body $\mathrm{K}_T$ do not satisfy the assumptions
underlying the existence and uniqueness theorem of [14], we need to construct a solution to
(1.12); indeed, taking the discretization parameter in our algorithm to zero, one establishes
a solution of bounded variation; see Section 3.3.
    The major benefit of the formulations (1.11) and (1.12) is in motivating such an algorithm and
prescribing the derived costs. In Section 3, we describe how these dynamics can be predicted
from the definition (1.7).


2   The MTS algorithm
We will first establish some generic machinery which, at this point, is not specific to MTS.
Consider a convex polytope $\mathrm{K}_0 \subseteq \mathbb{R}^n$, define $\mathrm{K} := \mathrm{K}_0 \cap \mathbb{R}_+^n$, and assume that $\mathrm{K}$ is compact. Suppose
additionally that $\Phi : \mathcal{D} \to \mathbb{R}$ is differentiable and strictly convex in an open neighborhood
$\mathcal{D} \supseteq \mathrm{K}$.
   Let us write $\mathrm{D}_\Phi$ for the corresponding Bregman divergence

$$\mathrm{D}_\Phi(y \,\|\, x) := \Phi(y) - \Phi(x) - \langle \nabla\Phi(x), y - x \rangle,$$

which is non-negative due to convexity of $\Phi$. Then for $x, y, z \in \mathrm{K}$, we have:

$$\mathrm{D}_\Phi(z \,\|\, y) - \mathrm{D}_\Phi(z \,\|\, x) = -\Phi(y) + \Phi(x) - \langle \nabla\Phi(y), z - y \rangle + \langle \nabla\Phi(x), z - x \rangle. \tag{2.1}$$

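Identity (2.1) holds for any differentiable $\Phi$, since the $\Phi(z)$ terms cancel when the two divergences are expanded. A numeric check of our own, using the negative entropy $\Phi(x) = \sum_i x_i \log x_i$ on the positive orthant:

```python
# Numeric check (ours) of identity (2.1) for Phi(x) = sum_i x_i log x_i.
import math

def phi(x):
    return sum(v * math.log(v) for v in x)

def grad(x):
    return [math.log(v) + 1 for v in x]   # d/dx_i of x_i log x_i

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

def breg(y, x):
    """Bregman divergence D_Phi(y || x)."""
    return phi(y) - phi(x) - inner(grad(x), [yi - xi for yi, xi in zip(y, x)])

x, y, z = [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]
lhs = breg(z, y) - breg(z, x)
rhs = (-phi(y) + phi(x)
       - inner(grad(y), [zi - yi for zi, yi in zip(z, y)])
       + inner(grad(x), [zi - xi for zi, xi in zip(z, x)]))
assert abs(lhs - rhs) < 1e-12
```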
    For a vector $c \in \mathbb{R}^n$ and $x \in \mathrm{K}$, define the projection

$$\Pi^c_{\mathrm{K}}(x) := \operatorname{argmin} \left\{ \mathrm{D}_\Phi(y \,\|\, x) + \langle c, y \rangle : y \in \mathrm{K} \right\}.$$

Since $\mathrm{K}$ is compact and $\Phi$ is strictly convex, there is a unique minimizer $y^* \in \mathrm{K}$.
   For $x \in \mathrm{K}$, recall the definition of the normal cone at $x$:

$$\mathrm{N}_{\mathrm{K}}(x) = \{ p \in \mathbb{R}^n : \langle p, y - x \rangle \leq 0 \text{ for all } y \in \mathrm{K} \}.$$

Given a representation of $\mathrm{K}$ by inequality constraints, $\mathrm{K} = \{ x \in \mathbb{R}^n : Ax \leq b \}$ for $A \in \mathbb{R}^{m \times n}$ and
$b \in \mathbb{R}^m$, it holds that

$$\mathrm{N}_{\mathrm{K}}(x) = \{ A^T y : y \geq 0 \text{ and } y^T (Ax - b) = 0 \}.$$

The KKT conditions yield
$$\nabla\Phi(y^*) = \nabla\Phi(x) - c - \lambda^*, \tag{2.2}$$
where $\lambda^* \in \mathrm{N}_{\mathrm{K}}(y^*)$. Since $\mathrm{N}_{\mathrm{K}}(y^*) = \mathrm{N}_{\mathrm{K}_0}(y^*) + \mathrm{N}_{\mathbb{R}_+^n}(y^*)$, we can decompose $\lambda^* = \beta - \alpha$ with
$\beta \in \mathrm{N}_{\mathrm{K}_0}(y^*)$ and $-\alpha \in \mathrm{N}_{\mathbb{R}_+^n}(y^*)$. In particular, we have $\alpha \geq 0$ and $\alpha_i > 0 \implies y^*_i = 0$ for every
$i = 1, \ldots, n$.


      Substituting this into equation (2.1) gives

$$\mathrm{D}_\Phi(z \,\|\, y^*) - \mathrm{D}_\Phi(z \,\|\, x) = -\Phi(y^*) + \Phi(x) + \langle \nabla\Phi(x), y^* - x \rangle + \langle c - \alpha + \beta, z - y^* \rangle \leq -\mathrm{D}_\Phi(y^* \,\|\, x) + \langle c - \alpha, z - y^* \rangle,$$

where the inequality comes from $\langle \beta, z - y^* \rangle \leq 0$ since $z \in \mathrm{K}$ and $\beta \in \mathrm{N}_{\mathrm{K}}(y^*)$. We have proved the
following.
Lemma 2.1. For any $x, z \in \mathrm{K}$, and $c \in \mathbb{R}^n$, let $y^* = \Pi^c_{\mathrm{K}}(x)$ and $\lambda^*$ be as in (2.2). Then for any
$\alpha \in -\mathrm{N}_{\mathbb{R}_+^n}(y^*)$ such that $\lambda^* + \alpha \in \mathrm{N}_{\mathrm{K}_0}(y^*)$, it holds that

$$\mathrm{D}_\Phi(z \,\|\, y^*) - \mathrm{D}_\Phi(z \,\|\, x) \leq \langle c - \alpha, z - y^* \rangle.$$

2.1     Iterative Bregman projections
We now describe a discretization of the algorithm from the introduction. This discretization
will mimic the continuous dynamics if the entries of each individual cost vector are small. We
can achieve this by splitting each cost vector into several copies of scaled down versions of itself,
as discussed in Section 2.3. In Section 3.3, we will give a formal argument that this indeed yields
a discretization of the continuous dynamics from the introduction.
    Fix a tree $T$ and recall the definition of $\mathrm{K}_T$ from (1.3). Let $Q_T$ denote the collection of vectors
$q \in \mathbb{R}_+^{V \setminus \{\mathfrak{r}\}}$ such that for all $u \in V \setminus \mathcal{L}$,

$$\sum_{v \in \chi(u)} q_v = 1.$$

For $q \in Q_T$ and $u \in V \setminus \mathcal{L}$, we use $q^{(u)} \in \mathbb{R}_+^{\chi(u)}$ to denote the vector defined by $q^{(u)}_v := q_v$ for
$v \in \chi(u)$, and define the corresponding probability simplex $Q^{(u)}_T := \{ q^{(u)} : q \in Q_T \}$. We will use
$\Delta : Q_T \to \mathrm{K}_T$ for the map which sends $q \in Q_T$ to the (unique) $x = \Delta(q) \in \mathrm{K}_T$ such that

$$x_v = x_u q_v \qquad \forall u \in V \setminus \mathcal{L},\ v \in \chi(u).$$

Note that $q$ contains more information than $x$; the map $\Delta$ fails to be invertible whenever there is
some $u \in V \setminus \mathcal{L}$ with $x_u = 0$.
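The map $\Delta$ is a simple top-down product of conditional probabilities; a sketch of our own (the name `delta_map` and the tree encoding are illustrative):

```python
# Illustrative sketch: the map Delta : Q_T -> K_T sending conditional
# probabilities q to absolute probabilities x via x_v = x_u * q[v], top-down.

def delta_map(children, q, root):
    """Return x = Delta(q) for the tree rooted at `root`."""
    x = {root: 1.0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):   # leaves have no entry in `children`
            x[v] = x[u] * q[v]
            stack.append(v)
    return x
```

With all conditional probabilities equal to $1/2$ on a two-level tree, the deepest leaves receive absolute mass $1/4$, and one checks directly that the flow-conservation constraints of (1.3) hold.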
    Fix κ ≥ 1. On the open domain 𝒟^(u) = (−min_{v∈χ(u)} δ_v, ∞)^{χ(u)}, for δ_v as given in
equation (1.10), define the strictly convex function Φ^(u) : 𝒟^(u) → ℝ by

    Φ^(u)(p) := (1/κ) Σ_{v∈χ(u)} (w_v/η_v) (p_v + δ_v) log(p_v + δ_v).

Denote the corresponding Bregman divergence on Q^(u)_T by

    D^(u)(p ‖ p′) = (1/κ) Σ_{v∈χ(u)} (w_v/η_v) [ (p_v + δ_v) log( (p_v + δ_v)/(p′_v + δ_v) ) + p′_v − p_v ].


Theory of Computing, Volume 18 (23), 2022, pp. 1–24
Christian Coester and James R. Lee

    We now define an algorithm that takes a point q ∈ Q_T and a cost vector c ∈ ℝ_+^ℒ and
outputs a point p = 𝒜(q, c) ∈ Q_T. Fix a topological ordering ⟨u₁, u₂, …, u_N⟩ of V ∖ ℒ such
that every child in T occurs before its parent. We define p inductively as follows. Let
ĉ_ℓ := c_ℓ for ℓ ∈ ℒ. For every j = 1, 2, …, N:

    ĉ^(u_j)_v := ĉ_v   ∀v ∈ χ(u_j)                                                       (2.3)
    p^(u_j) := argmin { D^(u_j)( p ‖ q^(u_j) ) + ⟨p, ĉ^(u_j)⟩ : p ∈ Q^(u_j)_T }          (2.4)
    ĉ_{u_j} := Σ_{v∈χ(u_j)} p^(u_j)_v ĉ_v                                                (2.5)

Let α^(u_j) be the vector of Lagrange multipliers corresponding to the nonnegativity
constraints in equation (2.4) (recall Lemma 2.1). Note that in this setting (a probability
simplex), the nonnegativity multipliers are unique and thus well-defined.
    We denote by α = α^{q,c} ∈ ℝ_+^V the vector given by α_v := α^(p(v))_v for v ≠ 𝕣 and
α_𝕣 := 0. Recall the complementary slackness conditions:

    α_v > 0 ⟹ p_v = 0.                                                                   (2.6)

For v ∈ χ(u), we calculate

    (∇Φ^(u)(p))_v = (1/κ)(w_v/η_v) (1 + log(p_v + δ_v)).

Then using equation (2.2), we can write the algorithm as follows:

    For j = 1, 2, …, N:
        For v ∈ χ(u_j):
            p^(u_j)_v := (q^(u_j)_v + δ_v) exp( (κη_v/w_v) (β_{u_j} − (ĉ_v − α_v)) ) − δ_v,
        ĉ_{u_j} := Σ_{v∈χ(u_j)} p^(u_j)_v ĉ_v,

where β_{u_j} ≥ 0 is the multiplier for the constraint Σ_{v∈χ(u_j)} p_v ≥ 1. There is no
multiplier for the constraint Σ_{v∈χ(u_j)} p_v ≤ 1 because this constraint is satisfied
automatically and is therefore not needed in (2.4): if it were violated, decreasing some p_v
with p_v > q^(u_j)_v would yield a strictly better solution to the minimization problem (2.4).
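Concretely, each local step can be computed by a one-dimensional search for the multiplier: the candidate p_v(β) = (q_v + δ_v) exp((κη_v/w_v)(β − ĉ_v)) − δ_v, clipped at 0 (the clipping playing the role of α_v), is nondecreasing in β, so β is found by bisection on the constraint Σ_v p_v = 1. The following Python sketch is our own illustration of this scheme, not code from the paper.

```python
import math

def project_step(q, chat, w, eta, dlt, kappa=1.0, iters=200):
    """One sibling-group projection: find beta >= 0 by bisection so that
    the clipped multiplicative update sums to 1.  Clipping at 0 realizes
    the nonnegativity multipliers alpha_v."""
    def p_of(beta):
        return [max((qv + dv) * math.exp(kappa * ev / wv * (beta - cv)) - dv, 0.0)
                for qv, cv, wv, ev, dv in zip(q, chat, w, eta, dlt)]
    lo, hi = 0.0, max(chat) + 1.0
    while sum(p_of(hi)) < 1.0:          # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):              # sum p_v(beta) is nondecreasing in beta
        mid = (lo + hi) / 2.0
        if sum(p_of(mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return p_of(hi)

q    = [0.5, 0.3, 0.2]
chat = [0.0, 0.0, 1.0]                  # cost only below the third child
w, eta, dlt = [1.0] * 3, [1.0] * 3, [0.1] * 3
p = project_step(q, chat, w, eta, dlt)
# Mass moves away from the costly child: p[2] < q[2], while sum(p) = 1.
```

Since ĉ ≥ 0 implies p_v(0) ≤ q_v, the bracket may always start at β = 0, matching the remark above that the upper constraint is never active.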

2.2   The global divergence

For z ∈ K_T and q ∈ Q_T, define the global divergence function

    D̃(z ‖ q) := (1/κ) Σ_{u∉ℒ} Σ_{v∈χ(u)} (w_v/η_v) [ (z_v + δ_v z_u) log( (z_v/z_u + δ_v)/(q_v + δ_v) ) + z_u q_v − z_v ],



where we use the convention that 0 log(0/0 + δ_v) = lim_{ε→0} ε log(0/ε + δ_v) = 0.
    Note that D̃ is the Bregman divergence associated to the regularizer (1.7) (divided by κ)
when x_v/x_u is replaced by q_v. One can write

    D̃(z ‖ q) = Σ_{u∉ℒ} z_u D^(u)( p^(u) ‖ q^(u) ),

where p ∈ Δ⁻¹(z). In other words, p ∈ Q_T is any point satisfying z_v = p_v z_u for all u ∉ ℒ
and v ∈ χ(u).
    We will use D̃ as a potential function to prove inequality (1.1), and z will denote the
configuration of some offline algorithm. Note that the state of the online algorithm is
encoded by q ∈ Q_T, which contains more information than its configuration Δ(q) ∈ K_T.
    The next lemma shows that when an offline algorithm moves, the change in potential is
bounded by O(1/κ) times the offline movement cost.
Lemma 2.2. For any q ∈ Q_T and z, z′ ∈ K_T, it holds that

    D̃(z′ ‖ q) − D̃(z ‖ q) ≤ (1/κ)(2 + 4/τ) ‖z − z′‖_{ℓ₁(w)}.

Proof. Consider a differentiable map z : [0, 1] → ℝ_{++}^V such that Σ_{v∈χ(u)} z_v(t) ≤ z_u(t)
for each t and u ∉ ℒ. It suffices to show that for each t and every fixed q ∈ Q_T,

    κ ∂_t D̃(z(t) ‖ q) ≤ (2 + 4/τ) ‖z′(t)‖_{ℓ₁(w)}.

Moreover, it suffices to address the case where there is at most one u ∈ V with z′_u(t) ≠ 0.
    A direct calculation gives

    κ ∂_t D̃(z(t) ‖ q) = (w_u/η_u) z′_u(t) log( (z_u(t)/z_{p(u)}(t) + δ_u)/(q_u + δ_u) )
        + Σ_{v∈χ(u)} (w_v/η_v) [ δ_v z′_u(t) log( (z_v(t)/z_u(t) + δ_v)/(q_v + δ_v) ) + z′_u(t) ( q_v − z_v(t)/z_u(t) ) ].   (2.7)

Let us now use definitions (1.9) and (1.10) to observe that

    (1/η_v) log( (p_v + δ_v)/(q_v + δ_v) ) ≤ (1/η_v) log( (1 + δ_v)/δ_v ) ≤ 2.

Using this in equation (2.7) yields

    κ ∂_t D̃(z(t) ‖ q) ≤ w_u |z′_u(t)| ( 2 + Σ_{v∈χ(u)} ( 2δ_v + q_v − z_v(t)/z_u(t) ) ) ≤ w_u |z′_u(t)| ( 2 + 4/τ ),

where the last inequality uses Σ_{v∈χ(u)} δ_v ≤ Σ_{v∈χ(u)} θ_v ≤ 1 and Σ_{v∈χ(u)} z_v(t) ≤ z_u(t).  □


    We will sometimes implicitly restrict vectors x ∈ ℝ^V to the subspace spanned by
{e_ℓ : ℓ ∈ ℒ}. In this case, we employ the notation

    ⟨x, y⟩_ℒ := Σ_{ℓ∈ℒ} x_ℓ y_ℓ,

when either vector lies in ℝ^V or ℝ^ℒ.
    According to the following lemma, the change in potential due to movement of the online
algorithm is bounded by the difference in service cost between the offline and online
algorithms.

Lemma 2.3. For any cost vector c ∈ ℝ_+^ℒ, z ∈ K_T, and q ∈ Q_T, if p = 𝒜(q, c), then

    D̃(z ‖ p) − D̃(z ‖ q) ≤ ⟨c, z − Δ(p)⟩_ℒ.

Proof. Fix q ∈ Q_T and c ∈ ℝ_+^ℒ. Let α = α^{q,c} denote the vector of multipliers defined in
Section 2.1. For u ∈ V ∖ ℒ with z_u > 0, define z^(u) ∈ Q^(u)_T by

    z^(u)_v := z_v / z_u.

Then Lemma 2.1 gives

    D^(u)( z^(u) ‖ p^(u) ) − D^(u)( z^(u) ‖ q^(u) ) ≤ ⟨ ĉ^(u) − α^(u), z^(u) − p^(u) ⟩_{χ(u)},

where we use ⟨·, ·⟩_{χ(u)} for the standard inner product on ℝ^{χ(u)}. Multiplying by z_u and
summing yields

    D̃(z ‖ p) − D̃(z ‖ q) ≤ Σ_{u∉ℒ} z_u ⟨ ĉ^(u) − α^(u), z^(u) − p^(u) ⟩_{χ(u)}
                         = Σ_{u∉ℒ} Σ_{v∈χ(u)} (ĉ_v − α_v) z_v − Σ_{u∉ℒ} z_u Σ_{v∈χ(u)} (ĉ_v − α_v) p^(u)_v.

Note that from implication (2.6), the latter (subtracted) expression equals

    Σ_{u∉ℒ} z_u Σ_{v∈χ(u)} ĉ_v p^(u)_v  =  Σ_{u∉ℒ} z_u ĉ_u,

where the equality is (2.5). Noting that ĉ_𝕣 = Σ_{ℓ∈ℒ} Δ(p)_ℓ c_ℓ, this gives

    D̃(z ‖ p) − D̃(z ‖ q) ≤ Σ_{u≠𝕣} (ĉ_u − α_u) z_u − Σ_{u∉ℒ} z_u ĉ_u ≤ ⟨c, z − Δ(p)⟩_ℒ.  □



2.3   Algorithm and competitive analysis

Let us now outline the proof of inequality (1.2). First, we perform a standard reduction that
allows us to bound only the "positive" movement costs when the algorithm moves from x to y.
Its proof is straightforward.

Lemma 2.4. For x, y ∈ K_T it holds that

    ‖x − y‖_{ℓ₁(w)} = 2 ‖(x − y)_+‖_{ℓ₁(w)} + [ψ(y) − ψ(x)],

where ψ(x) := Σ_{u≠𝕣} w_u x_u for x ∈ K_T.
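Lemma 2.4 reduces to the coordinate-wise identity |d| = 2·max(d, 0) − d applied with d = x_u − y_u and weight w_u, summed over u ≠ 𝕣. A quick numerical check in Python (our own sketch, with arbitrary sample vectors):

```python
def l1w(d, w):
    """Weighted l1 norm."""
    return sum(wi * abs(di) for di, wi in zip(d, w))

def posw(d, w):
    """Weighted l1 norm of the positive part."""
    return sum(wi * max(di, 0.0) for di, wi in zip(d, w))

x = [0.7, 0.3, 0.2, 0.5]            # coordinates x_u over u != root
y = [0.4, 0.6, 0.2, 0.5]
w = [2.0, 2.0, 1.0, 1.0]
psi = lambda v: sum(wi * vi for vi, wi in zip(v, w))   # psi(v) = sum_u w_u v_u

d = [a - b for a, b in zip(x, y)]
lhs = l1w(d, w)
rhs = 2 * posw(d, w) + (psi(y) - psi(x))
# lhs == rhs, verifying the decomposition of Lemma 2.4 coordinate-wise.
```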

    We now state the key technical lemma, which controls the positive movement cost by the
service cost. To this end, we employ an auxiliary potential function Ψ : Q_T → ℝ defined by

    Ψ_u(q) := −Δ(q)_u D^(u)( θ^(u) ‖ q^(u) ),
    Ψ(q) := Σ_{u∉ℒ} Ψ_u(q).

Intuitively, Ψ(q) is a measure of the difference between the online configuration q and the
uniform distribution over leaves (whose conditional probabilities are given by θ).
    Let us give a brief explanation of the need for Ψ. We add "noise" to the multiscale
conditional entropy in order to achieve the smoothness property established in Lemma 2.2. But
this has the adverse effect of increasing the movement cost of the algorithm, as one can see
from the δ_u term in (1.11). This additional movement cannot easily be charged against the
service cost in the regime where the noise term is dominant: x_u(t)/x_{p(u)}(t) ≪ δ_u. On the
other hand, this additional movement has the effect of further decreasing x_u(t)/x_{p(u)}(t),
which drives the conditional probabilities at p(u) away from the uniform distribution,
decreasing Ψ. A formal statement appears later in Lemma 2.11.
    For the next two results, take any q ∈ Q_T and cost c ∈ ℝ_+^ℒ, and denote p = 𝒜(q, c),
x = Δ(q), y = Δ(p).

Lemma 2.5 (Movement analysis). It holds that

    ((τ − 3)/(κτ)) ‖(x − y)_+‖_{ℓ₁(w)} ≤ (2𝔇_T + log n) ⟨c, x⟩_ℒ + [Ψ(q) − Ψ(p)].


    This lemma will be proved in Section 2.4. Let us first see that it can be used to establish
bounds on the competitive ratio. Define w_min := min{w_ℓ : ℓ ∈ ℒ} and

    ε_T := ( w_min / (2(2𝔇_T + log n)) ) · ( (τ − 3)/(τκ) ).




Theorem 2.6. For any z ∈ K_T:

    ⟨c, y⟩_ℒ ≤ ⟨c, z⟩_ℒ + D̃(z ‖ q) − D̃(z ‖ p)                                            (2.8)
    κ⁻¹ ‖x − y‖_{ℓ₁(w)} ≤ [ψ(y) − ψ(x)] + (2τ/(τ − 3)) [ [Ψ(q) − Ψ(p)] + (2𝔇_T + log n) ⟨c, x⟩_ℒ ]    (2.9)

Moreover, if ‖c‖_∞ ≤ ε_T, then

    κ⁻¹ ‖x − y‖_{ℓ₁(w)} ≤ [ψ(y) − ψ(x)] + (4τ/(τ − 3)) [ [Ψ(q) − Ψ(p)] + (2𝔇_T + log n) ⟨c, y⟩_ℒ ].   (2.10)

Proof. Inequality (2.8) follows from Lemma 2.3, and inequality (2.9) follows from Lemma 2.5
and Lemma 2.4. To see that inequality (2.10) follows from inequality (2.9) and Lemma 2.5, use
the fact that

    ⟨c, x⟩_ℒ ≤ ⟨c, y⟩_ℒ + (‖c‖_∞/w_min) ‖(x − y)_+‖_{ℓ₁(w)}.  □



    In light of Theorem 2.6, we can respond to a cost function c ∈ ℝ_+^ℒ by splitting it into
M equal pieces c/M, where M = ⌈‖c‖_∞/ε_T⌉. Now define q_i := 𝒜(q_{i−1}, c/M), q_0 := q, and
𝒜̄(q, c) := q_M.

Theorem 2.7. Fix τ ≥ 4. Consider the algorithm that begins in some configuration q_0 ∈ Q_T. If
c_t ∈ ℝ_+^ℒ is the cost function that arrives at time t, denote q_t := 𝒜̄(q_{t−1}, c_t). Then the
sequence ⟨Δ(q_0), Δ(q_1), …⟩ is an online algorithm that is (1, O(1/κ), O(κ(𝔇_T + log n)))-finely
competitive.

    We prove this momentarily. The following fact is well known and, in conjunction with the
preceding theorem, yields the validity of Theorems 1.1 and 1.3.

Lemma 2.8 (See, e.g., [3, Thm. 2.4]). If (ℒ, d_T) is an HST metric, then there is another weighted
tree T′ with leaf set ℒ such that

   1. (ℒ, d_{T′}) is a 7-HST metric.

   2. 𝔇_{T′} ≤ log₂ |ℒ|.

   3. All the leaves of T′ have depth 𝔇_{T′}.

   4. d_T(ℓ, ℓ′) ≤ d_{T′}(ℓ, ℓ′) ≤ O(d_T(ℓ, ℓ′)) for all ℓ, ℓ′ ∈ ℒ.

Proof sketch. Replace every weight w_v in T with ŵ_v := 7^⌈log₇ w_v⌉ and iteratively contract
every edge (p(u), u) with ŵ_{p(u)} = ŵ_u and u ∉ ℒ. The resulting weighted tree T₁ is a 7-HST
by construction.
    Now iteratively contract every edge (p(u), u) in T₁ for which |ℒ^{T₁}_u| > ½ |ℒ^{T₁}_{p(u)}|.
The resulting tree T′ has depth 𝔇_{T′} ≤ log₂ |ℒ|. Finally, one can achieve property (3) by
increasing the depth of every root-leaf path to 𝔇_{T′} using vertex weights that decrease by a
factor of 7 along the path.  □
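The first step of the sketch, rounding each weight up to the nearest power of 7, can be written as follows (our own illustration; the two contraction passes are omitted):

```python
def round_weight(w):
    """Smallest power of 7 that is >= w, i.e. 7^ceil(log_7 w) for w >= 1.
    Integer arithmetic avoids floating-point trouble at exact powers of 7."""
    p = 1
    while p < w:
        p *= 7
    return p

# round_weight(7) -> 7, round_weight(10) -> 49, round_weight(50) -> 343
```

The rounded weights satisfy w ≤ ŵ < 7w, which is why distances distort by only a constant factor, consistent with property (4).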


Proof of Theorem 2.7. Consider a sequence ⟨c_t : t ≥ 1⟩ of cost functions. By splitting the
costs into smaller pieces, we may assume that ‖c_t‖_∞ ≤ ε_T for all t ≥ 1.
    Let {z*_t} denote some offline algorithm with z*_0 = Δ(q_0), and let {x_t = Δ(q_t)} denote
our online algorithm. Then using D̃(z*_0 ‖ x_0) = 0 along with inequality (2.8) and Lemma 2.2
yields, for any time t₁ ≥ 1,

    Σ_{t=1}^{t₁} ⟨c_t, x_t⟩_ℒ ≤ Σ_{t=1}^{t₁} ⟨c_t, z*_t⟩_ℒ − D̃(z*_{t₁} ‖ q_{t₁}) + O(1/κ) Σ_{t=1}^{t₁} ‖z*_t − z*_{t−1}‖_{ℓ₁(w)}
                               ≤ Σ_{t=1}^{t₁} ⟨c_t, z*_t⟩_ℒ + O(1/κ) Σ_{t=1}^{t₁} ‖z*_t − z*_{t−1}‖_{ℓ₁(w)},

where we have used D̃(z ‖ q) ≥ 0 for all z ∈ K_T and q ∈ Q_T. This verifies inequality (1.1)
with α_0 = 1, α_1 = O(1/κ), and β = 0. Moreover, inequality (2.10) gives

    (1/κ) Σ_{t=1}^{t₁} ‖x_t − x_{t−1}‖_{ℓ₁(w)} ≤ [ψ(x_{t₁}) − ψ(x_0)] + (4τ/(τ − 3)) [ [Ψ(q_0) − Ψ(q_{t₁})] + (2𝔇_T + log n) Σ_{t=1}^{t₁} ⟨c_t, x_t⟩_ℒ ],

verifying inequality (1.2) with α_1 ≤ O(κ(𝔇_T + log n)) and β′ ≤ O(κ max_{v≠𝕣} w_v) (see
Lemma 2.10 below).  □

2.4   Movement analysis

It remains to prove Lemma 2.5. Recall that q ∈ Q_T, c ∈ ℝ_+^ℒ, and p = 𝒜(q, c), x = Δ(q),
y = Δ(p).
    The KKT conditions (see equation (2.2)) give: for every v ∈ χ(u),

    (1/κ)(w_v/η_v) log( (p_v + δ_v)/(q_v + δ_v) ) = β_u − ĉ_v + α_v,                      (2.11)

where β_u ≥ 0 is the multiplier corresponding to the constraint Σ_{v∈χ(u)} p_v ≥ 1.

Lemma 2.9. It holds that α_v ≤ ĉ_v for all v ∈ V ∖ {𝕣}.

Proof. Note that ĉ_v ≥ 0 by construction. Thus if α_v = 0, we are done. Otherwise, by
complementary slackness, it must be that p_v = 0, and therefore log( (p_v + δ_v)/(q_v + δ_v) ) ≤ 0.
Since β_{p(v)} ≥ 0, equation (2.11) implies that α_v ≤ ĉ_v.  □

    Define σ_v := log( (p_v + δ_v)/(q_v + δ_v) ), so that

    q_v − p_v = (q_v + δ_v)(1 − e^{σ_v}).                                                 (2.12)

Recall that for v ∈ χ(u), we have x_v = q_v x_u and y_v = p_v y_u, thus

    x_v − y_v = x_u (q_v − p_v) + p_v (x_u − y_u) = (x_v + δ_v x_u)(1 − e^{σ_v}) + p_v (x_u − y_u).

In particular,

    w_v (x_v − y_v)_+ ≤ w_v (x_v + δ_v x_u)(1 − e^{σ_v})_+ + w_v p_v (x_u − y_u)_+
                      ≤ w_v (x_v + δ_v x_u)(1 − e^{σ_v})_+ + (w_u/τ) p_v (x_u − y_u)_+.

Using Σ_{v∈χ(u)} p_v = 1 and summing over all vertices yields

    Σ_{v≠𝕣} w_v (x_v − y_v)_+ ≤ Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)})(1 − e^{σ_v})_+ + (1/τ) Σ_{v≠𝕣} w_v (x_v − y_v)_+,

hence

    Σ_{v≠𝕣} w_v (x_v − y_v)_+ ≤ (τ/(τ − 1)) Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)})(1 − e^{σ_v})_+
                              ≤ (τ/(τ − 1)) Σ_{v≠𝕣} w_v (x_v + δ_v x_{p(v)}) (σ_v)_−
                              ≤ (κτ/(τ − 1)) [ Σ_{v≠𝕣} η_v x_v ĉ_v + Σ_{u∉ℒ} x_u Σ_{v∈χ(u)} θ_v (ĉ_v − α_v) ],   (2.13)

where the last line uses Lemma 2.9 and equation (2.11) to bound w_v (σ_v)_− ≤ κη_v (ĉ_v − α_v).
Note that
\[
\sum_{v\neq\mathbb{r}} \eta_v x_v \hat{c}_v \;\le\; \sum_{\ell\in\mathcal{L}} c_\ell x_\ell \sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \eta_v \;\le\; (\mathfrak{D}_T + \log n)\,\langle c, x\rangle, \tag{2.14}
\]
since for any $\ell\in\mathcal{L}$ it holds that
\[
\sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \eta_v \;=\; \mathfrak{D}_T(\ell) + \sum_{v\in\gamma_{\mathbb{r},\ell}\setminus\{\mathbb{r}\}} \log\frac{|\mathcal{L}_{p(v)}|}{|\mathcal{L}_v|} \;=\; \mathfrak{D}_T(\ell) + \log n,
\]
where $\mathfrak{D}_T(\ell)$ is the combinatorial depth of $\ell$.
The second sum in (2.13) can be interpreted as the service cost of hybrid configurations of $q$ and $\theta$: while $\sum_{v\in\chi(u)} x_v \hat{c}_v$ is the service cost of $x$ in $\mathcal{L}_u$, the term $x_u \sum_{v\in\chi(u)} \theta_v \hat{c}_v$ is the service cost in $\mathcal{L}_u$ of the modification of $x$ whose conditional probabilities at the children of $u$ are given by $\theta^{(u)}$ rather than $q^{(u)}$. To bound this hybrid service cost, we will employ the auxiliary potential $\Psi$.

2.4.1   The hybrid cost
We require the following elementary estimate.
Lemma 2.10. For $u\notin\mathcal{L}$ it holds that
\[
\max\left\{ \mathrm{D}^{(u)}(r \,\|\, p) : r, p \in Q_T^{(u)} \right\} \;\le\; \frac{2}{\kappa}\,\frac{w_u}{\tau}.
\]
Proof. Define $\phi_v : (-\delta_v, \infty) \to \mathbb{R}$ by
\[
\phi_v(p) := \frac{1}{\eta_v}\,(p + \delta_v)\log(p + \delta_v),
\]
and let
\[
\mathrm{D}_{\phi_v}(q_v \,\|\, p_v) = \frac{1}{\eta_v}\left[ (q_v + \delta_v)\log\frac{q_v + \delta_v}{p_v + \delta_v} + (p_v - q_v) \right]
\]
denote the corresponding Bregman divergence. Then for $q_v, p_v \ge 0$, it holds that $\mathrm{D}_{\phi_v}(q_v \,\|\, p_v) \ge 0$, since $\phi_v$ is convex on $\mathbb{R}_+$. Employing the $\tau$-HST property of $T$, this implies that
\[
\mathrm{D}^{(u)}(r \,\|\, p) = \frac{1}{\kappa}\sum_{v\in\chi(u)} w_v\, \mathrm{D}_{\phi_v}(r_v \,\|\, p_v) \;\le\; \frac{w_u}{\kappa\tau}\sum_{v\in\chi(u)} \mathrm{D}_{\phi_v}(r_v \,\|\, p_v).
\]


Define $F : Q_T^{(u)} \times Q_T^{(u)} \to \mathbb{R}_+$ by $F(r,p) := \sum_{v\in\chi(u)} \mathrm{D}_{\phi_v}(r_v \,\|\, p_v)$. The map $r \mapsto F(r,p)$ is convex in general (for any Bregman divergence). The map $p \mapsto F(r,p)$ is convex as well, as this holds for each map $p_v \mapsto \mathrm{D}_{\phi_v}(q_v \,\|\, p_v)$ since $-\log(x)$ is convex on $\mathbb{R}_{++}$. Since the maximum of a convex function on a polytope is achieved at an extreme point, we have
\[
\max\left\{ F(r,p) : r,p \in Q_T^{(u)} \right\} \;\le\; \max_{\substack{v,v'\in\chi(u)\\ v\neq v'}} \frac{1}{\eta_v}\left[ (1+\delta_v)\log\frac{1+\delta_v}{\delta_v} - 1 \right] + \frac{1}{\eta_{v'}}\left[ \delta_{v'}\log\frac{\delta_{v'}}{1+\delta_{v'}} + 1 \right] \;\le\; 2. \qquad\square
\]
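The closed form of $\mathrm{D}_{\phi_v}$ used in the proof can be sanity-checked against the definition of a Bregman divergence, $\phi(q) - \phi(p) - \phi'(p)(q-p)$. A minimal numerical sketch; the values of $\eta_v$ and $\delta_v$ below are arbitrary samples, not parameters from the paper.

```python
import math

def phi(p, eta, delta):
    # phi_v(p) = (1/eta) * (p + delta) * log(p + delta)
    return (p + delta) * math.log(p + delta) / eta

def dphi(p, eta, delta):
    # phi_v'(p) = (1/eta) * (log(p + delta) + 1)
    return (math.log(p + delta) + 1.0) / eta

def bregman(q, p, eta, delta):
    # Bregman divergence from the definition: phi(q) - phi(p) - phi'(p)(q - p).
    return phi(q, eta, delta) - phi(p, eta, delta) - dphi(p, eta, delta) * (q - p)

def bregman_closed_form(q, p, eta, delta):
    # Closed form appearing in the proof of Lemma 2.10.
    return ((q + delta) * math.log((q + delta) / (p + delta)) + (p - q)) / eta

eta, delta = 1.7, 0.25  # arbitrary sample parameters
grid = [i / 10 for i in range(11)]
for q in grid:
    for p in grid:
        a = bregman(q, p, eta, delta)
        b = bregman_closed_form(q, p, eta, delta)
        assert abs(a - b) < 1e-12   # the two expressions agree
        assert b >= -1e-12          # nonnegativity, since phi_v is convex
```

Nonnegativity on the grid reflects exactly the convexity argument in the proof; the agreement of the two expressions confirms the displayed closed form.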

The next lemma is crucial: it relates the service cost (with respect to the reduced cost $\hat{c} - \alpha$) of the hybrid configurations to the service cost of the actual configuration and the movement cost.

Lemma 2.11. For any $u\notin\mathcal{L}$, it holds that
\[
\Psi_u(p) - \Psi_u(q) \;\le\; \frac{2}{\kappa}\,\frac{w_u}{\tau}\,(x_u - y_u)_+ \;+\; \sum_{v\in\chi(u)} (\hat{c}_v - \alpha_v)\left[ x_v - \theta_v x_u \right]. \tag{2.15}
\]


Proof. Write
\[
\begin{aligned}
\Psi_u(p) - \Psi_u(q) &= x_u\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - y_u\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \\
&= (x_u - y_u)\, \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) + x_u\left[ \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \right].
\end{aligned}
\]
Using Lemma 2.10, the first term is bounded by $\frac{2}{\kappa}\,\frac{w_u}{\tau}\,(x_u - y_u)_+$.

Let us now bound the second term. Using $1 + t \le e^t$, we have
\[
\begin{aligned}
\kappa\, x_u\left[ \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, q^{(u)}\right) - \mathrm{D}^{(u)}\!\left(\theta^{(u)} \,\|\, p^{(u)}\right) \right]
&= x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\left[ (\theta_v + \delta_v)\log\frac{p_v + \delta_v}{q_v + \delta_v} + q_v - p_v \right] \\
&\stackrel{(2.12)}{=} x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\left[ (\theta_v + \delta_v)\sigma_v + (q_v + \delta_v)(1 - e^{\sigma_v}) \right] \\
&\le x_u \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v(\theta_v - q_v) \\
&= \sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v\left[ \theta_v x_u - x_v \right].
\end{aligned}
\]
To finish the proof, observe that from equation (2.11),
\[
\sum_{v\in\chi(u)} \frac{w_v}{\eta_v}\,\sigma_v\left[ \theta_v x_u - x_v \right]
= \kappa \sum_{v\in\chi(u)} (\beta_u - \hat{c}_v + \alpha_v)\left[ \theta_v x_u - x_v \right]
= \kappa \sum_{v\in\chi(u)} (\alpha_v - \hat{c}_v)\left[ \theta_v x_u - x_v \right],
\]
where the last equality uses $\sum_{v\in\chi(u)} x_v = x_u$ and $\sum_{v\in\chi(u)} \theta_v = 1$ (from (1.8)). $\square$
Using the lemma gives
\[
\begin{aligned}
\sum_{u\notin\mathcal{L}} x_u \sum_{v\in\chi(u)} \theta_v(\hat{c}_v - \alpha_v)
&\stackrel{(2.15)}{\le} \left[ \Psi(q) - \Psi(p) \right] + \frac{2}{\kappa\tau}\left\| (\Delta(q) - \Delta(p))_+ \right\|_{\ell_1(w)} + \sum_{v\neq\mathbb{r}} \hat{c}_v x_v \\
&\le \left[ \Psi(q) - \Psi(p) \right] + \frac{2}{\kappa\tau}\left\| (\Delta(q) - \Delta(p))_+ \right\|_{\ell_1(w)} + \mathfrak{D}_T\, \langle c, x\rangle_{\mathcal{L}}.
\end{aligned}
\]


Combining this inequality with inequalities (2.13) and (2.14) gives
\[
\kappa^{-1}\left\| (x-y)_+ \right\|_{\ell_1(w)} \;\le\; \frac{\tau}{\tau-1}\left[ \left( 2\mathfrak{D}_T + \log n \right) \langle c, x\rangle_{\mathcal{L}} + \left( \Psi(q) - \Psi(p) \right) + \frac{2}{\kappa\tau}\left\| (x-y)_+ \right\|_{\ell_1(w)} \right], \tag{2.16}
\]
completing the verification of Lemma 2.5.


3     Derivation of the dynamics and derived costs
For the sake of motivating the dynamics (1.11), we review the continuous-time mirror descent
framework of [14]. Suppose that $K \subseteq \mathbb{R}^N$ is a convex set. Recall the definition of the normal cone to $K$ at $x \in K$, which is given by
\[
N_K(x) := (K - x)^{\circ} = \left\{ p \in \mathbb{R}^N : \langle p, y - x\rangle \le 0 \text{ for all } y \in K \right\}.
\]

Suppose additionally that $\Phi : \mathcal{D} \to \mathbb{R}$ is $\mathcal{C}^2$ and strictly convex on an open neighborhood $\mathcal{D} \supseteq K$, so that the Hessian $\nabla^2\Phi(x)$ is well-defined and positive definite on $\mathcal{D}$. Given a control

function ๐น : [0, โˆž) ร— K โ†’ โ„ ๐‘ and an initial point ๐‘ฅ 0 โˆˆ K, we will be concerned with absolutely
continuous solutions ๐‘ฅ : [0, โˆž) โ†’ K to the differential inclusion

                                                    ๐‘ฅ(0) = ๐‘ฅ 0 ,
                                      โˆ‡ ฮฆ(๐‘ฅ(๐‘ก))๐‘ฅ 0(๐‘ก) โˆˆ ๐น(๐‘ก, ๐‘ฅ(๐‘ก)) โˆ’ ๐‘K (๐‘ฅ(๐‘ก)) .
                                          2


In other words, a trajectory that satisfies $x(0) = x_0$ and, for almost every $t \ge 0$,
\[
x'(t) = \nabla^2\Phi(x(t))^{-1}\left( F(t, x(t)) - \gamma(t) \right), \tag{3.1}
\]
with $\gamma(t) \in N_K(x(t))$.
Under suitably strong conditions on $\Phi$ and $F$, there is a unique absolutely continuous solution to equation (3.1) [14]. In our setup, these conditions are actually not satisfied unless we prevent the path $x$ from hitting the relative boundary of $K$. Nevertheless, the formal calculation is instructive and motivates the algorithm of Section 2. For simplicity, we assume $\kappa := 1$ in this section.
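Before specializing to the multiscale regularizer, the dynamics (3.1) can be made concrete in the simplest instance: $\Phi(x) = \sum_i x_i \log x_i$ on the probability simplex, where the Hessian is $\mathrm{diag}(1/x_i)$ and, in the interior, the normal-cone term reduces to the multiplier of the constraint $\sum_i x_i = 1$. With control $F = -c$ this gives $x_i' = x_i(\langle x, c\rangle - c_i)$. The Euler discretization below is a sketch of that single-level case only, with arbitrary sample costs; it is not the algorithm of this paper.

```python
# Euler discretization of x_i' = x_i * (<x, c> - c_i), the formal dynamics (3.1)
# for the plain negative-entropy regularizer on the simplex.

def euler_step(x, c, h):
    inner = sum(xi * ci for xi, ci in zip(x, c))
    # The term (inner - c_i) contains the Lagrange multiplier of sum(x) = 1,
    # chosen exactly so that sum(x') = 0.
    return [xi * (1 + h * (inner - ci)) for xi, ci in zip(x, c)]

x = [0.25, 0.25, 0.5]
c = [1.0, 0.2, 0.6]       # per-coordinate service costs (illustrative values)
for _ in range(1000):
    x = euler_step(x, c, h=0.01)

assert abs(sum(x) - 1.0) < 1e-9   # the multiplier keeps x on the simplex
assert x[1] == max(x)             # mass drifts toward the cheapest coordinate
assert min(x) > 0.0               # the interior is preserved for small steps
```

The update is multiplicative, as one expects from entropic regularization: each coordinate's mass grows or shrinks in proportion to how its cost compares with the current average cost.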

3.1   Hessian computation
                                                         ๐‘‰
Let us take ฮฆ as in (1.7) and calculate โˆ‡2 ฮฆ(๐‘ฅ) for ๐‘ฅ โˆˆ โ„++ . Fix ๐‘ข โ‰  ๐•ฃ . Then we have

                      ๐‘ค๐‘ข      ๐‘ฅ๐‘ข               ร• ๐‘ค           ๐‘ฅ๐‘ฃ        ๐‘ฅ๐‘ฃ
                                                                                                              
                                                   ๐‘ฃ
            ๐œ•๐‘ข ฮฆ(๐‘ฅ) =    log        + ๐›ฟ๐‘ข + 1 +       ๐›ฟ ๐‘ฃ log    + ๐›ฟ๐‘ฃ โˆ’    .                                                         (3.2)
                      ๐œ‚๐‘ข     ๐‘ฅ p(๐‘ข)              ๐œ‚๐‘ฃ          ๐‘ฅ๐‘ข        ๐‘ฅ๐‘ข
                                                                   ๐‘ฃโˆˆ๐œ’(๐‘ข)

Moreover, $\partial_{uv}\Phi(x) = 0$ unless $u = v$, $u \in \chi(v)$, or $v \in \chi(u)$, and in this case,
\[
\begin{aligned}
\partial_{uu}\Phi(x) &= \frac{w_u}{\eta_u\left( x_u + \delta_u x_{p(u)} \right)} + \sum_{v\in\chi(u)} \left( \frac{x_v}{x_u} \right)^{\!2} \frac{w_v}{\eta_v\left( x_v + \delta_v x_u \right)}, \\
\partial_{u,p(u)}\Phi(x) = \partial_{p(u),u}\Phi(x) &= -\frac{x_u}{x_{p(u)}} \cdot \frac{w_u}{\eta_u\left( x_u + \delta_u x_{p(u)} \right)}.
\end{aligned}
\]

3.2   Explicit dynamics
We are now in a position to calculate the formal dynamics. Let us define the control by $F(t,\cdot) := -c(t)$. We claim that for $u \neq \mathbb{r}$,
\[
\partial_t\!\left( \frac{x_u(t)}{x_{p(u)}(t)} \right) = \frac{\eta_u}{w_u}\left( \frac{x_u(t)}{x_{p(u)}(t)} + \delta_u \right)\left( \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell \right), \tag{3.3}
\]
where $\beta_u(t) \ge 0$ denotes the Lagrange multiplier corresponding to the constraint $x_u = \sum_{v\in\chi(u)} x_v$.
To verify equation (3.3), let us define, for $u \neq \mathbb{r}$,
\[
\mathcal{E}(u) := \frac{w_u}{\eta_u} \cdot \frac{x_{p(u)}(t)}{x_u(t) + \delta_u x_{p(u)}(t)}\; \partial_t\!\left( \frac{x_u(t)}{x_{p(u)}(t)} \right).
\]

Then equation (3.3) is equivalent to the assertion that
\[
\mathcal{E}(u) = \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell(t). \tag{3.4}
\]
Recalling equation (3.1), the equality $\left( \nabla^2\Phi(x(t))\, x'(t) \right)_u = \left( F(t, x(t)) - \gamma(t) \right)_u$ is equivalent to
\[
\mathcal{E}(\ell) = \beta_{p(\ell)}(t) - c_\ell(t), \qquad \ell \in \mathcal{L}, \tag{3.5}
\]
\[
\mathcal{E}(u) - \sum_{v\in\chi(u)} \frac{x_v(t)}{x_u(t)}\, \mathcal{E}(v) = \beta_{p(u)}(t) - \beta_u(t), \qquad u \in V \setminus (\mathcal{L} \cup \{\mathbb{r}\}). \tag{3.6}
\]
Clearly equation (3.5) already confirms equation (3.4) for $\ell \in \mathcal{L}$.
Let us conclude by verifying equation (3.4) for all $u \neq \mathbb{r}$ by (reverse) induction on the depth. Employing equation (3.6) along with the validity of equation (3.4) for $\{\mathcal{E}(v) : v \in \chi(u)\}$ yields
\[
\begin{aligned}
\mathcal{E}(u) &= \beta_{p(u)}(t) - \beta_u(t) + \sum_{v\in\chi(u)} \frac{x_v(t)}{x_u(t)}\left( \beta_u(t) - \sum_{\ell\in\mathcal{L}_v} \frac{x_\ell(t)}{x_v(t)}\, c_\ell(t) \right) \\
&= \beta_{p(u)}(t) - \sum_{\ell\in\mathcal{L}_u} \frac{x_\ell(t)}{x_u(t)}\, c_\ell(t),
\end{aligned}
\]
where we used the fact that $x_u = \sum_{v\in\chi(u)} x_v$ for $x \in K_T$. $\square$


3.3   Relationship between discrete and continuous dynamics
Recall the setup from Section 1.3.3. We consider a system of variables $\{q_u(t) : u \in V \setminus \{\mathbb{r}\}\}$ satisfying the differential equations
\[
\partial_t q_u(t) = \frac{\eta_u}{w_u}\,(q_u(t) + \delta_u)\left( \beta_{p(u)}(t) - \hat{c}_u(t) + \alpha_u(t) \right), \tag{3.7}
\]
where $\alpha_u(t)$ is a Lagrange multiplier for the constraint $q_u(t) \ge 0$, and $\hat{c}_u(t)$ is the "derived" cost in the subtree rooted at $u$:
\[
\hat{c}_u(t) := \sum_{\ell\in\mathcal{L}_u} q_{\ell|u}(t)\, c_\ell(t), \qquad q_{\ell|u}(t) := \prod_{v\in\gamma_{u,\ell}\setminus\{u\}} q_v(t),
\]
where $\gamma_{u,\ell}$ is the unique simple $u$-$\ell$ path in $T$. Now the values $q_{\ell|\mathbb{r}}$ give a probability distribution on the leaves.
    Let us argue that when the discretization parameter of the algorithm presented in Section 2
goes to zero, one arrives at a solution to equation (3.7). Recall that in Section 2.3, we split

each cost function $c \in \mathbb{R}_+^{\mathcal{L}}$ into $M$ pieces $M^{-1}c$ and computed a sequence of configurations $q_0, \ldots, q_M \in Q_T$. Define the piecewise-linear function $q^{(M)} : [0,1] \to Q_T$ by
\[
q^{(M)}\!\left( \frac{j+\delta}{M} \right) := (1-\delta)\, q_j + \delta\, q_{j+1}, \qquad \delta \in [0,1],\ j \in \{0, \ldots, M-1\}.
\]

Recalling Section 2.1, we have
\[
q_j^{(u)} := \operatorname{argmin}\left\{ \mathrm{D}^{(u)}\!\left( p \,\middle\|\, q_{j-1}^{(u)} \right) + \left\langle p,\, M^{-1} \hat{c}_j^{(u)} \right\rangle \;:\; p \in Q_T^{(u)} \right\}, \tag{3.8}
\]
where, for $v \in \chi(u)$,
\[
\left( \hat{c}_j \right)_v = \sum_{\ell\in\mathcal{L}_v} (q_j)_{\ell|v}\, c_\ell.
\]

Thus for $v \in \chi(u)$ and $j \ge 1$,
\[
\left( q_j^{(u)} \right)_v = \left( \left( q_{j-1}^{(u)} \right)_v + \delta_v \right) \exp\!\left( \frac{\eta_v}{w_v}\left( \beta_u - \left( M^{-1} (\hat{c}_j)_v - \alpha_v \right) \right) \right) - \delta_v.
\]
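The closed-form update suggests how a single step of (3.8) could be computed in practice: the multiplier $\beta_u$ is found by a one-dimensional search so that the children's masses sum to one, with the clamp at zero playing the role of the nonnegativity multipliers $\alpha_v$. A hedged sketch; all parameter values (eta, w, delta, costs) are illustrative, not taken from the paper.

```python
import math

def step(q_prev, c_hat, eta, w, delta):
    """One multiplicative update over the children of u, normalized via beta_u."""
    def mass(beta):
        return sum(
            max(0.0, (q + d) * math.exp(e / wt * (beta - c)) - d)
            for q, c, e, wt, d in zip(q_prev, c_hat, eta, w, delta)
        )
    lo, hi = -100.0, 100.0
    for _ in range(200):                 # bisection: mass(beta) is nondecreasing
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mass(mid) < 1.0 else (lo, mid)
    beta = (lo + hi) / 2
    return [max(0.0, (q + d) * math.exp(e / wt * (beta - c)) - d)
            for q, c, e, wt, d in zip(q_prev, c_hat, eta, w, delta)]

q = [0.5, 0.3, 0.2]
c = [0.9, 0.1, 0.4]                      # derived costs, already scaled by 1/M
eta, w, delta = [1.0] * 3, [1.0] * 3, [0.1] * 3
q_new = step(q, c, eta, w, delta)

assert abs(sum(q_new) - 1.0) < 1e-9      # projected back onto the simplex
assert all(v >= 0.0 for v in q_new)
assert q_new[1] > q[1] and q_new[0] < q[0]  # mass shifts toward cheaper children
```

Bisection is a simple stand-in here; any monotone root-finder for $\beta_u$ would do, since each coordinate is nondecreasing in the multiplier.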

One can now verify that there is a constant $L = L(c, T)$ such that
\[
\left| (q_j)_v - (q_{j-1})_v \right| \le \frac{L}{M}, \qquad j \in \{1, \ldots, M\},\ v \in V \setminus \{\mathbb{r}\}.
\]
In particular, we see that $\big( q^{(M)} \big)' \in L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$ for every $M \ge 1$ and, moreover,
\[
\sup_{M\ge 1} \left\| \big( q^{(M)} \big)' \right\|_{L^\infty} < \infty. \tag{3.9}
\]

Therefore by Arzelà–Ascoli (see, e.g., [2, Thm. 0.3.1]), there is a subsequence $\{M_k\}$ such that $q^{(M_k)}$ converges uniformly to a function $q : [0,1] \to Q_T$.
    Since the unit ball of $L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$ is weakly compact (by the sequential Banach–Alaoglu theorem; see, e.g., [2, Thm. 0.3.3]), we can pass to a further subsequence $\{M'_k\}$ along which $\big( q^{(M'_k)} \big)'$ converges weakly to some $h \in L^\infty\big( [0,1], \mathbb{R}^{V\setminus\{\mathbb{r}\}} \big)$. Moreover, since $q^{(M)}(b) - q^{(M)}(a) = \int_a^b \big( q^{(M)} \big)'(t)\, dt$ for all $0 \le a < b \le 1$, it follows that $q(b) - q(a) = \int_a^b h(t)\, dt$ as well, and therefore $q'(t) = h(t)$ for almost all $t \in [0,1]$.
    If we similarly linearly interpolate the cost function to $\hat{c}^{(M)} : [0,1] \to \mathbb{R}_+^{V\setminus\{\mathbb{r}\}}$, then $\hat{c}^{(M_k)} \to \hat{c}$ along this sequence as well, and
\[
\hat{c}_u(t) = \sum_{\ell\in\mathcal{L}_u} q_{\ell|u}(t)\, c_\ell.
\]

    Now the KKT conditions for optimality in (3.8) give
\[
\nabla\Phi^{(u)}\!\left( q_j^{(u)} \right) - \nabla\Phi^{(u)}\!\left( q_{j-1}^{(u)} \right) + M^{-1} \hat{c}_j^{(u)} \in -N_{Q_T^{(u)}}\!\left( q_j^{(u)} \right),
\]


or equivalently,
\[
\frac{\nabla\Phi^{(u)}\!\left( q_j^{(u)} \right) - \nabla\Phi^{(u)}\!\left( q_{j-1}^{(u)} \right)}{M^{-1}} \in -\hat{c}_j^{(u)} - N_{Q_T^{(u)}}\!\left( q_j^{(u)} \right).
\]

By standard results in differential inclusion theory (e.g., the Convergence Theorem [2, Thm. 1.4.1]), we conclude that $q : [0,1] \to Q_T$ solves the differential inclusion
\[
\nabla^2\Phi^{(u)}\!\left( q^{(u)}(t) \right) \partial_t q^{(u)}(t) \in -\hat{c}^{(u)}(t) - N_{Q_T^{(u)}}\!\left( q^{(u)}(t) \right).
\]
Calculating the Hessian $\nabla^2\Phi^{(u)}$ reveals that $q(t)$ is a solution to equation (3.7).


References
 [1] Jacob Abernethy, Peter Bartlett, Niv Buchbinder, and Isabelle Stanton: A regularization
     approach to metrical task systems. In Proc. 21st Internat. Conf. Algorithmic Learning Theory
     (ALT’10), pp. 270–284. Springer, 2010. [doi:10.1007/978-3-642-16108-7_23] 5

 [2] Jean-Pierre Aubin and Arrigo Cellina: Differential Inclusions: Set-Valued Maps and Viability
     Theory. Springer, 1984. [doi:10.1007/978-3-642-69512-4] 21, 22

 [3] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph Naor: A polylogarithmic-
     competitive algorithm for the k-server problem. J. ACM, 62(5):40:1–49, 2015. Preliminary
     version in FOCS’11. [doi:10.1145/2783434, arXiv:1110.1580] 14

 [4] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: Metrical task systems and the k-server
     problem on HSTs. In Proc. 37th Internat. Colloq. on Automata, Languages, and Programming
     (ICALP’10), pp. 287–298. Springer, 2010. [doi:10.1007/978-3-642-14165-2_25] 3

 [5] Nikhil Bansal, Niv Buchbinder, and Joseph Naor: A primal-dual randomized algorithm
     for weighted paging. J. ACM, 59(4):19:1–24, 2012. Preliminary version in FOCS’07.
     [doi:10.1145/2339123.2339126] 2

 [6] Yair Bartal: Probabilistic approximations of metric spaces and its algorithmic applications.
     In Proc. 37th FOCS, pp. 184–193. IEEE Comp. Soc., 1996. [doi:10.1109/SFCS.1996.548477] 2,
     3

 [7] Yair Bartal, Avrim Blum, Carl Burch, and Andrew Tomkins: A polylog(n)-competitive
     algorithm for metrical task systems. In Proc. 29th STOC, pp. 711–719. ACM Press, 1997.
     [doi:10.1145/258533.258667] 2

 [8] Yair Bartal, Béla Bollobás, and Manor Mendel: Ramsey-type theorems for metric
     spaces with applications to online problems. J. Comput. System Sci., 72(5):890–921, 2006.
     [doi:10.1016/j.jcss.2005.05.008, arXiv:cs/0406028] 2

                     Theory of Computing, Volume 18 (23), 2022, pp. 1–24                               22

 [9] Yair Bartal, Nathan Linial, Manor Mendel, and Assaf Naor: On metric Ramsey-
     type phenomena. Ann. Math., 162(2):643โ€“709, 2005. Preliminary version in STOCโ€™03.
     [doi:10.4007/annals.2005.162.643, arXiv:math/0406353] 2

[10] Avrim Blum, Howard Karloff, Yuval Rabani, and Michael Saks: A decomposition
     theorem for task systems and bounds for randomized server problems. SIAM J. Comput.,
     30(5):1624โ€“1661, 2000. [doi:10.1137/S0097539799351882] 2

[11] Allan Borodin, Nathan Linial, and Michael E. Saks: An optimal on-line algorithm for
     metrical task system. J. ACM, 39(4):745โ€“763, 1992. [doi:10.1145/146585.146588] 2

[12] Sรฉbastien Bubeck and Nicolรฒ Cesa-Bianchi: Regret analysis of stochastic and non-
     stochastic multi-armed bandit problems. Found. Trends Mach. Learning, 5(1):1โ€“122, 2012.
     [doi:10.1561/2200000024, arXiv:1204.5721] 2, 6

[13] Sรฉbastien Bubeck, Michael B. Cohen, James R. Lee, and Yin Tat Lee: Metrical task systems
     on trees via mirror descent and unfair gluing. SIAM J. Comput., 50(3):909โ€“923, 2021.
     Preliminary version in SODAโ€™19. [doi:10.1137/19M1237879, arXiv:1807.04404] 2, 3, 4, 5, 6

[14] Sรฉbastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Madry:
     ๐‘˜-server via multiscale entropic regularization. In Proc. 50th STOC, pp. 3โ€“16. ACM Press,
     2018. [doi:10.1145/3188745.3188798, arXiv:1711.01085] 5, 8, 18, 19

[15] Niv Buchbinder, Shahar Chen, and Joseph (Seffi) Naor: Competitive analysis via regular-
     ization. In Proc. 25th Ann. ACMโ€“SIAM Symp. on Discrete Algorithms (SODAโ€™14), pp. 436โ€“444.
     SIAM, 2014. [doi:10.1137/1.9781611973402.32] 5

[16] Niv Buchbinder and Joseph Naor: The design of competitive online algorithms via a primal-
     dual approach. Found. Trends Theor. Comp. Sci., 3(2โ€“3):93โ€“263, 2009. [doi:10.1561/0400000024]
     6

[17] Christian Coester and James R. Lee: Pure entropic regularization for metrical task systems.
     In Proc. 32nd Ann. Conf. on Learning Theory (COLT’19), pp. 835–848. PMLR, 2019. 1

[18] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar: A tight bound on approximating
     arbitrary metrics by tree metrics. J. Comput. System Sci., 69(3):485โ€“497, 2004. Preliminary
     version in STOCโ€™03. [doi:10.1016/j.jcss.2004.04.011] 2, 3, 4

[19] Amos Fiat and Manor Mendel: Better algorithms for unfair metrical task systems and
     applications. SIAM J. Comput., 32(6):1403โ€“1422, 2003. Preliminary version in STOCโ€™00.
     [doi:10.1137/S0097539700376159, arXiv:cs/0406034] 2

[20] Steve Seiden: Unfair problems and randomized algorithms for metrical task systems.
     Inform. Comput., 148(2):219โ€“240, 1999. [doi:10.1006/inco.1998.2744] 2




AUTHORS

    Christian Coester
    Associate professor
    Department of Computer Science
    University of Oxford
    Oxford, United Kingdom
    christian coester cs ox ac uk
    https://www.cs.ox.ac.uk/people/christian.coester/


    James R. Lee
    Professor
    Paul G. Allen Center for Computer Science & Engineering
    University of Washington
    Seattle, Washington, USA
    jrl cs washington edu
    https://homes.cs.washington.edu/~jrl/


ABOUT THE AUTHORS

    Christian Coester is Associate Professor in the Department of Computer Science at
      the University of Oxford and Tutorial Fellow at St Anneโ€™s College. He received
      his Ph. D. from Oxford in 2020, under the supervision of Elias Koutsoupias. After
      stints as a postdoctoral researcher at CWI in Amsterdam and Tel Aviv University
      and as a lecturer at the University of Sheffield, he returned to Oxford in 2022.
      His research focuses on the design and analysis of algorithms, especially online
      algorithms and learning-augmented algorithms. Outside of his research, he
      enjoys sports, chess and playing the piano.


    James R. Lee is a Professor of Computer Science & Engineering at the University of
       Washington. He received his Ph. D. from the University of California, Berkeley
       in 2005, under the supervision of Christos Papadimitriou, followed by a postdoc
       at the Institute for Advanced Study in Princeton. His research interests are
       varied and eclectic, ranging from spectral graph algorithms to functional analysis,
       and from convex optimization to statistical physics. He challenged a class of
       undergrads to compete against the MTS algorithm in this paper. They fought
       (and coded) valiantly. They PyTorched and TensorFlowed. But in the end, the
       Theory won.



