                          Theory of Computing, Volume 18 (20), 2022, pp. 1–32
                                       www.theoryofcomputing.org


     Universal Streaming of Subset Norms
      Vladimir Braverman∗                           Robert Krauthgamer†              Lin F. Yang
                 Received March 9, 2020; Revised November 5, 2021; Published August 6, 2022




       Abstract.
           Most known algorithms in the streaming model of computation aim to approximate
       a single function such as an ℓ 𝑝 norm. In 2009, Nelson [https://sublinear.info,
       Open Problem 30] asked if it is possible to design universal algorithms that
       simultaneously approximate multiple functions of the stream. In this paper we
       answer the question of Nelson for the class of subset-ℓ 0 norms in the insertion-only
       frequency-vector model. Given a family of subsets 𝒮 ⊂ 2[𝑛] , we provide a single
       streaming algorithm that can (1 ± 𝜀)-approximate the subset-ℓ 0 norm for every
       𝑆 ∈ 𝒮. Here, the subset-ℓ 𝑝 norm of 𝑣 ∈ ℝ 𝑛 with respect to the set 𝑆 ⊆ [𝑛] is the
       ℓ 𝑝 norm of 𝑣 |𝑆 (the vector 𝑣 restricted to 𝑆 by zeroing all other coordinates).
           Our main result is a nearly tight characterization of the space complexity of the
       subset-ℓ 0 norm for every family 𝒮 ⊂ 2[𝑛] in insertion-only streams, expressed in
       terms of the “heavy-hitter dimension” of 𝒮, a new combinatorial quantity related to
       the VC-dimension of 𝒮. We also show that the more general turnstile and sliding-
       window models require much larger space. All these results easily extend to
       the ℓ 1 norm.
   ∗ This material is based upon work supported in part by the National Science Foundation under Grants No.
1447639, 1650041 and 1652257, a Cisco faculty award, and by ONR Award N00014-18-1-2364.
   † Work partially supported by ONR Award N00014-18-1-2364, the Israel Science Foundation grant #1086/18, and
a Minerva foundation grant from the Federal German Ministry for Education and Research.


ACM Classification: F.2.1
AMS Classification: 68Q25
Key words and phrases: universal streaming, subset norms


© 2022 Vladimir Braverman, Robert Krauthgamer, and Lin F. Yang
Licensed under a Creative Commons Attribution License (CC-BY)                      DOI: 10.4086/toc.2022.v018.a020

         In addition, we design algorithms for two other subset-ℓ 𝑝 variants. These can
      be compared to the famous Priority Sampling algorithm of Duffield, Lund and
      Thorup [JACM 2007], which achieves additive approximation 𝜀 k𝑣k 1 for all possible
      subsets (𝒮 = 2[𝑛] ) in the entrywise update model. One of our algorithms extends
      their algorithm to handle turnstile updates, and another one achieves multiplicative
      approximation, given a family 𝒮.


1   Introduction
The streaming model of computation, where a space-bounded algorithm makes only a single
pass over an input stream, has gained popularity for its theoretical significance and usefulness
in practice. Efficient streaming algorithms have been designed for many fundamental problems,
including, for example, moments or norms of a frequency vector 𝑣 ∈ ℝ 𝑛 formed by a stream
of additive updates [2, 31, 33]; clustering a stream of points in ℝ 𝑑 [30]; and graph statistics for
streams of edge updates [27].
    Most algorithms designed for this model solve only a single problem. For instance, in the
extensively studied area of streaming ℓ 𝑝 norms of a frequency vector, an algorithm usually
makes a pass over the stream, and then it can use the summary it stores to compute only one
particular norm—the one it was designed for. Designing a new algorithm for each statistic can
be impractical in some applications. For example, in network monitoring, it is often desirable to
maintain a single summary of the observed traffic and use this summary for multiple tasks such
as approximating the entropy, finding elephant flows and heavy hitters, and detecting DDoS
attacks [39, 45].
    The importance of multi-functional summaries has been observed in the theory community
as well. Nelson [42, Open Problem 30] asked in 2009 if it is possible to design universal algorithms
for families of functions. More formally, given a family ℱ of functions of the form 𝑓 : ℝ 𝑛 → ℝ,
the goal is to compute, in one pass over a stream representing 𝑣 ∈ ℝ 𝑛 , a single summary that
can be used to evaluate 𝑓 (𝑣) for every function 𝑓 ∈ ℱ . Several algorithms [18, 19, 13, 11, 15]
provide universal sketches for some families of functions, for example all symmetric norms in
a certain class [11]. However, universal algorithms are an exception rather than the rule, and
Nelson’s question is still open in any reasonable generality.
    A simple systematic method to generate a family ℱ from a single function 𝑓 : ℝ 𝑛 → ℝ is to
apply this 𝑓 to different subsets of coordinates. More precisely, for every subset 𝑆 ⊂ [𝑛] define
the function 𝑓𝑆 : 𝑣 ↦→ 𝑓 (𝑣 |𝑆 ), where 𝑣 |𝑆 denotes zeroing out the coordinates of 𝑣 not in 𝑆. In this
way, every set system 𝒮 ⊂ 2[𝑛] describes a family of functions { 𝑓𝑆 : 𝑆 ∈ 𝒮}. We focus on the
basic case where 𝑓 is an ℓ 𝑝 norm, and call such function families subset-ℓ 𝑝 norms. As usual, we
wish to approximate each function multiplicatively, say within factor 1 ± 𝜀 for a given 𝜀 ∈ (0, 1);
this is clearly stronger than approximating additively by 𝜀 k𝑣 k 𝑝 .
    Subset-ℓ 𝑝 norms arise naturally in applications; for instance, 𝒮 could represent supported
queries to a database. Indeed, the well-known Subset Sum problem [1, 26, 48] and its variant,
the Disaggregated Subset Sum problem [24, 49], are equivalent to our subset-ℓ 1 -norm problem
in the entrywise and insertion-only models, respectively. Network-monitoring tasks, such as
worm detection, rely on subset-ℓ 𝑝 norms to approximate flow statistics [26]. Recall that a network
flow is a subset of network traffic defined by a source address, a customer, an organization,
an application or an arbitrary predicate. It is folklore that calculating flow volume is simply
a subset-ℓ 1 query, and the number of distinct packets in a flow is a subset-ℓ0 query. Recent
work [29, 41] reiterated that approximating these subset-ℓ 𝑝 queries and more general filters is
still an important open problem in network telemetry. In another recent example, Ting [49]
argued that Disaggregated Subset Sum is widely applicable in ad prediction, where future user
behavior is inferred from historical aggregate queries that have the form of subset-ℓ1 queries. In
both examples, data collection is challenging since future queries can be arbitrary, and thus it
is critical to answer large classes of subset-ℓ 1 queries. We refer the reader to [26, Section 1.4.1]
and [49, Section 2.2] for detailed discussions of these and additional applications in machine
learning [47], database query estimation [50] and denial of service attacks [44].
     Our main contribution to universal streaming is a near-tight characterization, for every 𝒮 ⊂ 2[𝑛] ,
of the space complexity of subset-ℓ 0 norms in insertion-only streams. We stress that this problem
asks to count the distinct items (non-zero coordinates of 𝑣) inside every subset 𝑆 ∈ 𝒮. Our
characterization connects the space complexity for a set system 𝒮 to a combinatorial notion
that we call the heavy-hitter dimension of 𝒮, which counts the maximum possible number of
coordinates in a single 𝑣 ∈ ℝ 𝑛 that may be a “heavy hitter” for some 𝑆 ∈ 𝒮 (see Definition 2.1 for
full details). This notion is related to VC-dimension by VCdim(𝒮) ≤ HHdim(𝒮), although the
gap between the two is not bounded by any fixed factor; we prove this inequality and make use
of it in Section 2.5. Throughout, Õ(·) suppresses a polylog(𝑛) factor, and 𝑂 𝜀 (·) suppresses a
factor depending only on 𝜀; taken together, Õ𝜀 ( 𝑓 ) stands for 𝑂(𝑔(𝜀) · (log 𝑛)𝑂(1) · 𝑓 ) for some
function 𝑔.

Theorem 1.1 (Informal Statement of Theorems 2.10 and 2.11). For every 𝒮 ⊂ 2[𝑛] and 𝜀 ∈ (0, 1),
there is a randomized universal algorithm for insertion-only streams that makes one pass using
Õ𝜀 (HHdim(𝒮)) words of storage, and can then (1 + 𝜀)-approximate each subset-ℓ 0 norm from 𝒮 with
high probability. Moreover, every such algorithm requires Ω(HHdim(𝒮)) bits of storage.

     To illustrate the scope of this result, we present in Table 1 a few examples and properties of
the heavy-hitter dimension (their proofs appear in Section 2.1). One interesting example is the
family of all large intervals in [𝑛], namely, of size Ω(𝑛), which has dimension 𝑂(1). A second
one is a family of poly(𝑛)-many uniformly-random sets (every set contains every index with
probability 1/2), which has dimension 𝑂(log 𝑛) with high probability. For both examples, our
algorithm uses small space (polylogarithmic in 𝑛) to achieve multiplicative (1+ 𝜀)-approximation,
which was not known before. (For intervals, additive approximation 𝜀 k𝑣k 1 can be achieved by
several known algorithms, including quantile estimates, range counting, and heavy-hitters over
dyadic intervals.)
     Another interesting example is a two-dimensional family derived from the last property
in the table, as follows. Suppose each index 𝑖 ∈ [𝑛] is actually a pair (𝑖1 , 𝑖2 ) ∈ [𝑛1 ] × [𝑛2 ], for
instance the source and destination address of a packet. Let 𝒮1 be the family of all large intervals
with respect to 𝑖1 , i. e., all [𝑎 . . 𝑏] × [𝑛2 ] with 𝑏 − 𝑎 = Ω(𝑛1 ), and similarly 𝒮2 with respect to
𝑖2 , where throughout [𝑎 . . 𝑏] = {𝑎, 𝑎 + 1, . . . , 𝑏} denotes an interval of integers. Each of 𝒮1 , 𝒮2
has heavy-hitter dimension 𝑂(1), and by the bound in the table, also their union-product
{𝑆1 ∪ 𝑆2 : 𝑆1 ∈ 𝒮1 , 𝑆2 ∈ 𝒮2 }. Our algorithm can then use small space (polylogarithmic in 𝑛) to
estimate distinct elements in subsets of form {(𝑖1 , 𝑖2 ) : 𝑖1 ∈ [𝑎 . . 𝑏] ∨ 𝑖2 ∈ [𝑐 . . 𝑑]}, capturing for
instance the logical-or between a range of source addresses and a range of destination addresses.

  Set System                               HHdim                            Description
  {𝑆1 , . . . , 𝑆 𝑘 }                      ≤ 𝑘                              any sets (tight for disjoint sets)
  {𝑆 ⊂ [𝑛] : |𝑆| ≥ 𝑛 − 𝑘}                 𝑘 + 1                            sets missing few coordinates
  {[𝑖 . . 𝑖′] : 𝑖′ − 𝑖 + 1 ≥ 𝑘}           Θ(𝑛/𝑘)                           intervals of size ≥ 𝑘
  {𝑠1 , . . . , 𝑠 𝑘 : all 𝑠 𝑖,𝑗 ∼ ℬ(𝑝)}   𝑂(log(𝑛 𝑘)/𝑝) whp                𝑘 random subsets of density 𝑝
  {𝑆′ ∪ 𝑆′′ : 𝑆′ , 𝑆′′ ∈ 𝒮}               HHdim(𝒮)                         self union of 𝒮
  𝒮1 ∪ 𝒮2                                 ≤ HHdim(𝒮1 ) + HHdim(𝒮2 )        subadditivity
  {𝑆1 ∪ 𝑆2 : 𝑆1 ∈ 𝒮1 , 𝑆2 ∈ 𝒮2 }          ≤ HHdim(𝒮1 ) + HHdim(𝒮2 )        union-product of 𝒮1 , 𝒮2

 Table 1: Simple examples and basic properties of heavy-hitter dimension over domain [𝑛].

    Let us consider some natural extensions of the above theorem. First, our algorithm extends
to subset-ℓ 1 as well, as shown in Theorem 2.22. Second, it is stated for insertion-only streams,
however for the more general turnstile and sliding-window models (i. e., streams with both
insertions and deletions or streams where old items expire), we show that the subset-ℓ 𝑝 problem,
for any 𝑝 ≥ 0, requires space Ω(𝑛) even if HHdim(𝒮) = 𝑂(1). This is striking because such a
large separation between insertion-only and turnstile stream or sliding-window algorithms is
rare.
    Indeed, it may be instructive to see why the smooth-histograms technique of [17], which often
extends insertion-only algorithms to the sliding-window model, fails in this case.

Theorem 1.2 (Informal Statement of Theorems 2.15 and 2.17). There exists 𝒮 ⊂ 2[𝑛] with
HHdim(𝒮) = 𝑂(1), such that every universal streaming algorithm achieving multiplicative approxima-
tion for subset-ℓ 𝑝 for 𝒮 requires Ω(𝑛) bits of space in both the turnstile and sliding-window models.


Variants of the problem. Duffield, Lund and Thorup [26] consider a similar problem in the
entrywise update model, in which each entry of the vector appears in the stream at most once.
Their “subset-sum” problem is equivalent to our subset-ℓ1 problem if all entries of the vector are
non-negative.1 They devise a Priority Sampling algorithm that approximates the subset-ℓ 1 norm
of every subset 𝑆 ⊂ [𝑛], achieving in fact an optimal space usage for this model [48]. However,
their result actually guarantees an additive approximation, i. e., the error for every subset-ℓ 1 query
is proportional to the ℓ 1 norm of the entire vector.2 We additionally provide two extensions to
their results in new directions.
   1Non-negativity is a mild assumption in this entrywise update model, because one can easily separate the positive
and negative entries and execute two algorithms in parallel.
   2Indeed, Theorem 1 of [48] bounds the variance of the estimator by k𝑣 k 1^2 /(𝑘 − 1), where 𝑘 is the number of
samples being stored. This implies that with high probability, the estimator’s additive error is at most
𝑂(k𝑣 k 1 /(𝑘 − 1)1/2 ).



    The first extension is a full characterization of the space complexity of multiplicative approximation
of subset-ℓ 𝑝 ’s in the entrywise update model. In contrast to the results of [26, 48], once a
multiplicative approximation is required, the space complexity depends on the query set system
𝒮. Indeed, by modifying the priority sampling algorithm of [26, 48] and employing our lower
bound, we show (in Theorem 2.23) that the space complexity is now precisely Θ̃(HHdim(𝒮)).
    Our second extension achieves the same additive approximation, but in the more general
turnstile model (i. e., additive updates to entries). Similarly to the algorithm of [26, 48], our
algorithm for the subset-ℓ 𝑝 problem achieves additive error 𝜀k𝑣 k 𝑝 with space complexity that
does not depend on the query set system 𝒮. This result is summarized in the following theorem;
we note that a matching lower bound follows immediately from known results (e. g., [7]), as the
case 𝑆 = [𝑛] is the usual approximation of ℓ 𝑝 norm.

Theorem 1.3 (Informal Statement of Theorem 3.1). There exists a one-pass streaming algorithm that,
given a stream of additive updates to a vector 𝑣 ∈ ℝ 𝑛 , uses only Õ(1) words of space for 0 ≤ 𝑝 ≤ 2
(and Õ𝜀 (𝑛 1−2/𝑝 ) words for 𝑝 > 2), and can then approximate k𝑣 |𝑆 k 𝑝 within additive error 𝜀k𝑣k 𝑝 for
each 𝑆 ⊂ [𝑛] with high probability.

Summary.           Our results are summarized in Table 2, where the first row lists the main results.

   Problem            Update Model        Approximation     Space                        Theorems
   subset-ℓ 0 or ℓ 1  insertion-only      multiplicative    Θ̃(HHdim(𝒮))                 2.10, 2.11, 2.22
   subset-ℓ 𝑝         turnstile           multiplicative    Ω(𝑛)                         2.15
                      sliding-window                        Ω(𝑛)                         2.17
                      entrywise                             Θ̃(HHdim(𝒮))                 2.23
   subset-ℓ 𝑝         turnstile           additive          Θ̃(1) for 0 ≤ 𝑝 ≤ 2          3.1
                                                            Θ̃(𝑛 1−2/𝑝 ) for 𝑝 > 2

                                      Table 2: Summary of our results


1.1     Related work
There is a large body of literature that deals with approximating functions of a vector, i. e.,
norms and heavy hitters, in the streaming model of computation. This line of work was started
by Alon, Matias, and Szegedy [2]; they gave a surprising logspace algorithm to approximate
the ℓ 2 norm, and proved that space 𝑛 Ω(1) is required for integer values 𝑝 ≥ 6. [31] gives the first
near-optimal algorithm (in terms of 𝑛) for ℓ 𝑝 for all 0 ≤ 𝑝 ≤ 2; [33] gives the first near-optimal
algorithm for ℓ 𝑝 for all 𝑝 > 2; and [32, 51, 21, 7] give tight lower bounds on this problem with respect
to the approximation parameter 𝜀 and dimension 𝑛. There is a sequence of papers gradually
improving the space complexity with respect to other parameters and studying variants of
the problem. Due to the lack of space we mention only a small subset of the relevant papers:
[23, 21, 9, 37, 36, 35, 5, 3, 16, 14, 15]; and references therein. Most of these papers design methods
that approximate a single function such as an ℓ 𝑝 norm for a fixed 𝑝.
    Our setting is also related to the “subset sum” problem [1, 26, 48, 49] where one is interested
in approximating a sum of the entries of a vector indexed by a subset. It is not difficult to see
that our problem is the same as the objective in [26] when the input is restricted to non-negative
vectors; indeed the subset-sum problem is equivalent to the subset-ℓ 1 problem. However, the
model in [26] is slightly different. In [26] the algorithm sees each coordinate of the frequency
vector at most once. In this paper we consider the additive updates streaming model that allows
incremental updates to the coordinates of the frequency vector. Thus, our model generalizes the
model in [26]. In addition, the algorithms in [26] solve the subset-sum problem but with an
additive error.

1.2   Preliminaries
We identify a binary vector 𝑠 ∈ {0, 1} 𝑛 with a subset of [𝑛]. For two vectors 𝑢, 𝑣 ∈ ℝ 𝑛 , we denote
by 𝑢 ◦ 𝑣 ∈ ℝ 𝑛 their Hadamard product, i. e., each (𝑢 ◦ 𝑣)𝑖 = 𝑢𝑖 𝑣 𝑖 . We denote the support of 𝑣,
supp(𝑣) ⊂ [𝑛], as the set of non-zero coordinates of 𝑣, i. e., supp(𝑣) = {𝑖 ∈ [𝑛] : 𝑣 𝑖 ≠ 0}. For
each 𝑝 > 0, the ℓ 𝑝 norm of a vector 𝑣 ∈ ℝ 𝑛 is k𝑣 k 𝑝 = (Σ𝑖∈[𝑛] |𝑣 𝑖 | 𝑝 )1/𝑝 . For 𝑝 = 0,
k𝑣k 0 := | supp(𝑣)| is the size of the support of 𝑣. For 𝑝 = ∞, k𝑣 k ∞ := max𝑖∈[𝑛] |𝑣 𝑖 |. Note that for
𝑝 < 1, ℓ 𝑝 is not an actual norm, but it is called a norm by convention.
      In this paper we focus on streams of updates to a vector 𝑣 ∈ ℝ 𝑛 . In the insertion-only
model, the input is a stream ⟨𝑎 1 , . . . , 𝑎 𝑚 ⟩, where each item 𝑎 𝑗 ∈ [𝑛] represents an increment to
coordinate 𝑎 𝑗 of a vector 𝑣 ∈ ℝ 𝑛 , which is initialized to all zeros. Thus the accumulated vector is
𝑣 = Σ 𝑗∈[𝑚] 𝑒 𝑎 𝑗 , where {𝑒 𝑖 : 𝑖 ∈ [𝑛]} is the standard basis. Here 𝑚 is usually assumed to be upper
bounded by poly(𝑛). In the turnstile model, the input is a stream ⟨(𝑎 1 , Δ1 ), . . . , (𝑎 𝑚 , Δ𝑚 )⟩, where
each item (𝑎 𝑗 , Δ 𝑗 ) ∈ [𝑛] × {−1, 1} represents an increment to coordinate 𝑎 𝑗 of a vector 𝑣 ∈ ℝ 𝑛 by
Δ 𝑗 .3 Thus the accumulated vector is 𝑣 = Σ 𝑗∈[𝑚] Δ 𝑗 · 𝑒 𝑎 𝑗 .
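
    To make these definitions concrete, the following Python sketch (ours, for illustration only;
it stores 𝑣 in full and is therefore not a streaming algorithm) accumulates the frequency vector
under both update models and evaluates a subset-ℓ 𝑝 query by brute force.

    # Reference (non-streaming) semantics of the two update models and of a
    # subset-l_p query; stores v in full, for illustration only.

    def accumulate_insertion_only(stream, n):
        """stream: items a_j in {0,...,n-1}; returns v = sum_j e_{a_j}."""
        v = [0] * n
        for a in stream:
            v[a] += 1
        return v

    def accumulate_turnstile(stream, n):
        """stream: pairs (a_j, delta_j) with delta_j in {-1, +1}."""
        v = [0] * n
        for a, delta in stream:
            v[a] += delta
        return v

    def subset_lp(v, s, p):
        """l_p norm of v restricted to subset s; for p = 0, the number of
        non-zero coordinates inside s."""
        if p == 0:
            return sum(1 for i in s if v[i] != 0)
        return sum(abs(v[i]) ** p for i in s) ** (1.0 / p)

    v = accumulate_insertion_only([0, 2, 2, 5, 0], 6)   # v = [2, 0, 2, 0, 0, 1]
    print(subset_lp(v, {0, 1, 2}, 0))                   # 2 distinct items in s
    print(subset_lp(v, {0, 1, 2}, 1))                   # 4.0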
      We are interested in the following problem.
Problem 1.4 (Subset-ℓ 𝑝 ). Let 𝛼 ≥ 1 be a parameter. Given a set of binary vectors 𝒮 ⊂ {0, 1} 𝑛 ,
design an algorithm that makes a single pass over a stream of updates to a vector 𝑣 ∈ ℝ 𝑛 , and at
the end of the stream the algorithm outputs a function 𝐸 : 𝒮 → ℝ that satisfies

                          ∀𝑠 ∈ 𝒮,       Pr[k𝑣 ◦ 𝑠 k 𝑝 ≤ 𝐸(𝑠) ≤ 𝛼k𝑣 ◦ 𝑠 k 𝑝 ] ≥ 0.9.

We call this problem the 𝛼-approximation subset-ℓ 𝑝 problem w.r.t. 𝒮.
    Note that the above definition requires the algorithm to approximate, for each given 𝑠 ∈ 𝒮,
the quantity k𝑣 ◦ 𝑠 k 𝑝 multiplicatively. A standard parallel-repetition argument yields the
“for-all” guarantee, i. e., an algorithm that succeeds in approximating k𝑣 ◦ 𝑠 k 𝑝 for all 𝑠 ∈ 𝒮
simultaneously. However, we pay an additional log |𝒮| factor in the space—this can be linear in 𝑛
if |𝒮| is large. It would be interesting to design a for-all algorithm whose space does not depend
on log |𝒮|.
   3A more general model allows Δ 𝑗 ∈ {−𝑀, . . . , 𝑀} for some 𝑀 = poly(𝑛). The space and time requirements of the
case 𝑀 = 1 are the same up to an 𝑂(log 𝑛) factor. We use 𝑀 = 1 for the sake of presentation.


    Note that the set system 𝒮 is given to the algorithm via a read-only tape, hence the space of
storing 𝒮 is not counted. A variant of this problem is the additive approximation problem.
Problem 1.5 (Additive Subset-ℓ 𝑝 ). For a set of binary vectors 𝒮 ⊂ {0, 1} 𝑛 , design an algorithm
that makes a single pass over a stream of updates to some underlying vector 𝑣 ∈ ℝ 𝑛 , and at the
end of the stream outputs a function 𝐸 : 𝒮 → ℝ that satisfies

                          ∀𝑠 ∈ 𝒮,   Pr[ | k𝑣 ◦ 𝑠 k 𝑝 − 𝐸(𝑠) | ≤ 𝜀k𝑣k 𝑝 ] ≥ 0.9.

We call this problem the 𝜀-additive-approximation subset-ℓ 𝑝 problem (w.r.t. 𝒮).

1.3   Technical overview
Multiplicative Subset-ℓ 𝑝 algorithm for 𝑝 ∈ {0, 1}. Estimating the ℓ0 norm of a stream is a
well-studied problem; the first streaming algorithm was given in [28], and the problem’s space
complexity was settled in [2, 51, 37, 10, 34]. Let us first recall a classical sample-and-estimate
technique for this problem (see, e. g., [8]). Here, the algorithm samples each coordinate of 𝑣 with
some probability 𝑞, and then uses the sampled non-zero coordinates to estimate k𝑣 k 0 (simply
count their number and divide by 𝑞). Suppose we could guess the correct rate 𝑞, such that the
number of non-zero samples is about Θ(𝜀−2 ); then we would obtain a good estimate of the ℓ 0
norm of the stream, i. e., a (1 ± 𝜀)-approximation with constant probability. The number of guesses
is at most Θ(log 𝑛), since k𝑣k 0 ≤ 𝑛, and we can try all of them in parallel. Observe that this
algorithm actually stores all distinct samples up to a point—when the samples for a guess 𝑞
exceed the 𝑂(𝜀−2 ) space bound, the algorithm starts rejecting any extra samples.
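
    As a concrete (non-paper) illustration of this technique, the following Python sketch keeps one
sample set per guessed rate 𝑞 = 2−𝑙 , caps each at 𝑂(𝜀−2 ) samples, and reports the first unsaturated
level; a real implementation would use a pairwise-independent hash in place of the cached coins.

    import random

    # Toy sample-and-estimate sketch for l_0 (our illustration, not the
    # paper's algorithm): one level per guessed rate q = 2^-l, each capped
    # at O(1/eps^2) stored samples. Cached coins stand in for the
    # pairwise-independent hash a real implementation would use.

    def l0_estimate(stream, n, eps, seed=0):
        rng = random.Random(seed)
        levels = max(1, n.bit_length())
        cap = int(4 / eps ** 2)                  # per-level space bound
        coin = [dict() for _ in range(levels)]   # lazily realized sampling bits
        samples = [set() for _ in range(levels)]
        saturated = [False] * levels
        for a in stream:
            for l in range(levels):
                if a not in coin[l]:
                    coin[l][a] = (rng.random() < 2.0 ** (-l))
                if coin[l][a] and not saturated[l]:
                    samples[l].add(a)
                    if len(samples[l]) > cap:    # start rejecting extra samples
                        saturated[l] = True
        for l in range(levels):                  # first level that kept everything
            if not saturated[l]:
                return len(samples[l]) * 2.0 ** l
        return None

    print(l0_estimate([i % 500 for i in range(5000)], n=1000, eps=0.5))  # ~500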
    Consider now approximating k𝑣 ◦ 𝑠 k 0 for any 𝑠 in a known set system 𝒮 ⊂ 2[𝑛] . To use
the above sample-and-estimate technique, the guess 𝑞 should be chosen according to k𝑣 ◦ 𝑠 k 0 .
However, an algorithm that is not tailored to 𝑠 will store (distinct) samples from all supp(𝑣),
and thus it might reach its 𝑂(𝜀−2 ) space bound and start rejecting samples, without storing
enough samples from supp(𝑣 ◦ 𝑠) ⊆ supp(𝑣). The challenge is thus to store enough samples
from supp(𝑣 ◦ 𝑠) for every 𝑠 ∈ 𝒮. Our idea is to rely on the structure of the set system 𝒮,
and store every sample that might be necessary for any 𝑠 ∈ 𝒮, which clearly maintains the
correctness (accuracy guarantee). However, this might require large space, perhaps even linear
in |𝒮|, and our solution is to actively delete samples that are not necessary.
    To formalize this idea, we let the algorithm store a set ℋ ⊂ [𝑛] of (distinct) samples from the
stream. Now whenever the number of samples from some 𝑠 ◦ 𝑣 is smaller than our 𝑂(𝜀−2 ) bound,
all these samples are stored, and then ℋ can always be used to estimate k𝑠 ◦ 𝑣 k 0 . However,
when some 𝑖 ∈ ℋ is no longer necessary for any 𝑠 ∈ 𝒮 (which might happen as new samples are
stored), the algorithm deletes this 𝑖 from ℋ . The question is then: what is the maximum possible
size of ℋ ? Luckily, we can show that |ℋ | = 𝑂[𝜀−2 · HHdim(𝒮)] via an inductive argument,
whose base case is precisely the heavy-hitter dimension. Maintaining ℋ in a streaming fashion
is straightforward and requires only 𝑂[𝜀−2 · HHdim(𝒮)] words of space. Recalling there are
𝑂(log 𝑛) guesses for 𝑞, the algorithm actually stores 𝑂(log 𝑛) sets of samples, which altogether
can simulate the sample-and-estimate algorithm for any 𝑠 ∈ 𝒮 given at the query phase, to
achieve a multiplicative approximation of k𝑣 ◦ 𝑠 k 0 . This result is presented in Theorem 2.10.

                     T HEORY OF C OMPUTING, Volume 18 (20), 2022, pp. 1–32                       7
                   V LADIMIR B RAVERMAN , ROBERT K RAUTHGAMER , AND L IN F. YANG

    In insertion-only streams, the ℓ 1 norm is just the sum of all (relevant) stream updates. We
can thus reduce the ℓ1 estimation problem to ℓ 0 estimation of a new vector of dimension 𝑛𝑚,
where 𝑚 is the length of the stream. We show that the converted set system has exactly the same
heavy-hitter dimension, yielding again an algorithm with space usage 𝑂[𝜀−2 · HHdim(𝒮)]. This
result is presented in Theorem 2.22.
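
    A minimal sketch of this reduction (ours): the 𝑡-th increment to coordinate 𝑖 becomes the
distinct item (𝑖, 𝑡) over the universe [𝑛] × [𝑚], and each set 𝑠 ∈ 𝒮 is mapped to 𝑠 × [𝑚], so the
subset-ℓ 1 value equals the number of distinct expanded items.

    from collections import defaultdict

    # l_1-to-l_0 reduction for insertion-only streams (our illustration):
    # the t-th increment to coordinate i becomes the distinct item (i, t).

    def expand_stream(stream):
        count = defaultdict(int)        # occurrences seen so far, per coordinate
        for a in stream:
            count[a] += 1
            yield (a, count[a])         # a distinct item of the expanded stream

    stream = [3, 1, 3, 3, 1]
    expanded = list(expand_stream(stream))   # [(3,1), (1,1), (3,2), (3,3), (1,2)]
    S = {3}
    l1 = sum(1 for a in stream if a in S)               # ||v|S||_1 = 3
    l0 = len({x for x in expanded if x[0] in S})        # distinct items = 3
    assert l1 == l0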
    The upper bound for the subset-ℓ 𝑝 norm in the entrywise update model follows similar ideas,
storing a small subset of the entries that is important to the set system. The only difference is that
we use the priority sampling technique [26, 48] as the bottom-level algorithm. This result is
presented in Theorem 2.23.

Lower bound for subset-ℓ 𝑝 . Our lower bound is via reduction from the well-known INDEX
problem.4 Given a set system with heavy-hitter dimension HHdim(𝒮) (formally
defined in Definition 2.1), we can find a vector 𝑣 with HHdim(𝒮) non-zero coordinates
such that for each coordinate 𝑣 𝑖 , there exists an 𝑠 𝑖 ∈ 𝒮 with {𝑖} = supp(𝑠 𝑖 ◦ 𝑣). Therefore, we can
encode an INDEX instance into the non-zero coordinates of 𝑣, and by approximating k𝑣 ◦ 𝑠 𝑖 k 𝑝
multiplicatively for any 𝑝, we obtain a protocol for the INDEX problem. This implies an
Ω(HHdim(𝒮)) lower bound. This result is presented in Theorem 2.11.
   4INDEX is a two-party communication problem, in which Alice’s input is a binary vector 𝑥 ∈ {0, 1} 𝑛 and Bob’s
input is an index 𝑗 ∈ [𝑛]. Alice sends one message to Bob (only one round), who needs to determine the coordinate
𝑥 𝑗 . This problem requires Ω(𝑛) bits of communication, see, e. g., [6] for details.

Strong lower bound in the turnstile model and sliding-window model. It is striking that in the
turnstile model or sliding-window model, there is no sublinear-space one-pass multiplicative-
approximation subset-ℓ 𝑝 algorithm even for a very simple set system. We show that for a simple
set system, e. g., the set system containing all the intervals of size 𝑛/2, which has heavy-hitter
dimension 𝑂(1), any multiplicative approximation of subset-ℓ 𝑝 for any 𝑝 ≥ 0 requires Ω(𝑛)
space. We show this via a reduction from a communication problem called Augmented INDEX.
In this problem, Alice has a binary vector 𝑥 ∈ {0, 1} 𝑛 , and Bob has an index 𝑗 ∈ [𝑛] together with
𝑥 𝑗+1 , 𝑥 𝑗+2 , . . . , 𝑥 𝑛 . Alice sends one message to Bob (only one round), who needs to determine
𝑥 𝑗 . It has been shown in [6] that any protocol that succeeds with constant probability
requires Ω(𝑛) bits of communication. We construct a protocol using the subset-ℓ 𝑝 algorithm.
Alice simply maps each coordinate of 𝑥 to some stream updates. Bob removes all 𝑥 𝑗′
for 𝑗′ > 𝑗. Bob then picks the interval that contains at most one non-zero coordinate, 𝑥 𝑗 , and
asks the algorithm to compute the ℓ 𝑝 norm. Hence any multiplicative approximation can be
used to decide whether 𝑥 𝑗 is 0. Thus, any algorithm in the turnstile model requires Ω(𝑛) space
for this simple set system. A similar lower bound of Ω(min(𝑊 , 𝑛)), where 𝑊 is the window size,
holds for the sliding-window model. These results are formally
presented in Theorem 2.15 and Theorem 2.17.
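
    The following small simulation (ours, with an exact ℓ 0 oracle standing in for the streaming
algorithm) illustrates why the reduction works: after Bob’s deletions, the interval starting at 𝑗
contains at most the single non-zero coordinate 𝑥 𝑗 .

    # Simulation of the Augmented-INDEX reduction behind the turnstile lower
    # bound (our sketch): one length-(n/2) interval query isolates x_j, so any
    # multiplicative approximation of subset-l_0 decides the bit.

    def solve_augmented_index(x, j):
        half = len(x)                    # Alice holds x in {0,1}^(n/2); n = 2*half
        v = [0] * (2 * half)             # the turnstile frequency vector
        for i, bit in enumerate(x):      # Alice: insert her vector
            v[i] += bit
        for i in range(j + 1, half):     # Bob: delete the suffix he already knows
            v[i] -= x[i]
        s = range(j, j + half)           # interval isolating coordinate j
        return sum(1 for i in s if v[i] != 0)   # exact subset-l_0 oracle

    x = [1, 0, 1, 1, 0, 1, 0, 0]
    assert all(solve_augmented_index(x, j) == x[j] for j in range(len(x)))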

Additive approximation. Our additive approximation to the subset-ℓ 𝑝 norm has a similar
flavor to the priority sampling algorithm [26, 48]. We use algorithmic ideas that appeared
in [20] (similar ideas also appear earlier in [5] and [3], but in a different form). To approximate
the ℓ 𝑝 norm of a vector, we first generate 𝑛 pseudorandom numbers to scale each entry of the
input vector 𝑣, which can be implemented using Θ(log 𝑛) space in the streaming setting. If the
distribution of the random numbers has a nice tail, e. g., Pr[𝑋 > 𝑥] = 1/𝑥 𝑝 , the ℓ 2 -heavy hitters of
the scaled vector can be shown to give a good estimate of the ℓ 𝑝 norm. The scaling is “oblivious”
to the subset, i. e., for each 𝑠 ∈ 2[𝑛] , the ℓ 2 -heavy hitter of 𝑠 ◦ 𝑣 ′ is a good estimator of k𝑠 ◦ 𝑣 k 𝑝 ,
where 𝑣 ′ is the scaled version of 𝑣. This result is presented in Theorem 3.1.
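
    As a numerical illustration of the scaling idea (ours; a max-based toy rather than the
heavy-hitter machinery of Theorem 3.1), take 𝑋𝑖 = 𝑈𝑖 −1/𝑝 for uniform 𝑈𝑖 , so that
Pr[𝑋𝑖 > 𝑥] = 𝑥 −𝑝 ; then the median of max𝑖 |𝑣 𝑖 | 𝑋𝑖 is k𝑣 k 𝑝 /(ln 2)1/𝑝 , for every subset at once.

    import math
    import random

    # Toy max-stability estimate of ||v||_p under the stated tail assumption
    # (our sketch): Pr[max_i |v_i| X_i <= t] ~ exp(-||v||_p^p / t^p), so the
    # median of the scaled maximum, corrected by (ln 2)^(1/p), estimates ||v||_p.

    def scaled_max(v, p, rng):
        return max(abs(vi) * (1.0 - rng.random()) ** (-1.0 / p)
                   for vi in v if vi != 0)

    def lp_estimate(v, p, trials=301, seed=1):
        rng = random.Random(seed)
        samples = sorted(scaled_max(v, p, rng) for _ in range(trials))
        return samples[trials // 2] * math.log(2) ** (1.0 / p)

    v = [3.0, -1.0, 0.0, 2.0, 5.0]
    exact = sum(abs(t) ** 1.5 for t in v) ** (1 / 1.5)
    print(round(lp_estimate(v, 1.5), 2), round(exact, 2))   # close to each other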


2     The streaming complexity of subset-ℓ 𝑝 norms
In this section we study algorithms for the subset-ℓ 𝑝 problem (namely, achieve multiplicative
approximation) for 𝑝 = 0 and 𝑝 = 1. Our main finding is that the space complexity in
insertion-only streams is characterized by the following combinatorial quantity.
Definition 2.1 (Heavy-Hitter Dimension). For a set system 𝒮 ⊂ 2[𝑛] and a vector 𝑣 ∈ ℝ 𝑛 , denote
by 𝐻(𝒮, 𝑣) the set of heavy hitters induced by 𝒮, i. e.,

                        𝐻(𝒮, 𝑣) := { 𝑖 ∈ [𝑛] : ∃𝑠 ∈ 𝒮 s.t. supp(𝑠 ◦ 𝑣) = {𝑖} },

and define the heavy-hitter dimension of 𝒮 ⊂ {0, 1} 𝑛 as

                                     HHdim(𝒮) := sup𝑣∈ℝ 𝑛 |𝐻(𝒮, 𝑣)|.

    Let us provide some intuition for this definition. If supp(𝑠 ◦ 𝑣) = {𝑖} for some vector 𝑣 and
set 𝑠 ∈ 𝒮, then we call the index 𝑖 a heavy-hitter, to reflect that it dominates the other (zero)
coordinates in 𝑠 ◦ 𝑣. The heavy-hitter dimension then measures the number of heavy-hitters that
a set system 𝒮 could possibly induce (for an unknown 𝑣). It is known from previous sketching
algorithms, cf. [32], that estimating a vector norm can essentially be reduced to estimating the
heavy-hitters, which may provide an intuition why HHdim characterizes the space complexity
of sketching subset norms. The above definition can be easily compared to that of VC-dimension
(for the same set system), see Section 2.5 for details.
    Our main result is a streaming algorithm with space complexity that is linear in the heavy-
hitter dimension, i. e., Õ𝜀 (HHdim(𝒮)); see Theorem 2.10 in Section 2.2. We then provide several
complementary results. From the direction of space lower bounds, we prove a linear lower
bound Ω(HHdim(𝒮)), which matches our algorithm above (in Section 2.3), and also a much
bigger bound for turnstile and sliding-window streams, which separates these richer models
from insertion-only streams (in Section 2.4). From the direction of applications of our algorithmic
techniques, we extend our algorithm to the “for-all” guarantee (in Section 2.5), and to the case
𝑝 = 1 (in Section 2.6), and we also design a variant for the more restricted model of entrywise
updates (in Section 2.7).

2.1   Examples and properties of heavy-hitter dimension
We present a few simple examples and basic properties of the heavy-hitter dimension that may
be useful in applications, essentially proving the bounds shown in Table 1.

                       T HEORY OF C OMPUTING, Volume 18 (20), 2022, pp. 1–32                               9
                  V LADIMIR B RAVERMAN , ROBERT K RAUTHGAMER , AND L IN F. YANG

    We begin with an alternative description of HHdim(𝒮) where we view the set system
𝒮 ⊂ 2[𝑛] as an incidence matrix, i. e., a 0 − 1 matrix describing the incidence between sets 𝑆 ∈ 𝒮
and coordinates 𝑖 ∈ [𝑛]. Recall that a matrix 𝑀 ∈ {0, 1} 𝑘×𝑘 is called a permutation matrix if every
row and every column contains a single non-zero entry, i. e., exactly one 1. Clearly, up to reordering
the rows and/or columns, such a matrix can be viewed as an identity matrix.
Lemma 2.2 (Permutation Submatrix). Let 𝒮 ⊂ 2[𝑛] , and let 𝑀 ∈ {0, 1} |𝒮|×𝑛 be its incidence matrix.
Then HHdim(𝒮) is exactly the maximum size (number of rows/columns) in a submatrix of 𝑀 that is a
permutation matrix.
Proof. Denote 𝒮 = {𝑠 1 , 𝑠2 , . . .}, and let 𝑘 be the largest size of permutation submatrix of 𝑀.
Suppose that this submatrix is formed of rows 𝑖1 , . . . , 𝑖 𝑘 and columns 𝑗1 , . . . , 𝑗 𝑘 . Consider a
vector 𝑣 ∈ {0, 1} 𝑛 in which coordinates 𝑗1 , . . . , 𝑗 𝑘 have value 1 and all other coordinates have
value 0. Then it is straightforward to see that

                                     ∀𝑙 ∈ [𝑘],   supp(𝑠 𝑖 𝑙 ◦ 𝑣) = {𝑗𝑙 }.

Thus 𝑘 ≤ HHdim(𝒮).
       For the other direction, let 𝑢 ∈ ℝ 𝑛 be a vector that realizes HHdim(𝒮). Then there are sets
𝑠 𝑖1 , . . . , 𝑠 𝑖HHdim(𝒮) ∈ 𝒮 and coordinates 𝑗1 , . . . , 𝑗HHdim(𝒮) ∈ [𝑛] such that

                                 ∀𝑙 ∈ [HHdim(𝒮)],      supp(𝑠 𝑖 𝑙 ◦ 𝑢) = {𝑗𝑙 }.

It is easily verified that rows 𝑖1 , . . . , 𝑖 HHdim(𝒮) and columns 𝑗1 , . . . , 𝑗HHdim(𝒮) form a permutation
submatrix of 𝑀. Thus, HHdim(𝒮) ≤ 𝑘, which completes the proof.                                              
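
    Lemma 2.2 yields a simple (exponential-time) procedure for computing HHdim on small
examples; the sketch below (ours) searches directly for the largest permutation submatrix.

    from itertools import combinations, permutations

    # Brute-force HHdim via Lemma 2.2 (our sketch; exponential time, for tiny
    # examples only): the largest k such that the incidence matrix of the set
    # system contains a k-by-k permutation submatrix.

    def hhdim(sets, n):
        best = 0
        for k in range(1, min(len(sets), n) + 1):
            found = any(
                all(cols[l] in sets[rows[l]]
                    and not any(cols[t] in sets[rows[l]] for t in range(k) if t != l)
                    for l in range(k))
                for rows in combinations(range(len(sets)), k)
                for cols in permutations(range(n), k)
            )
            if not found:                # monotone in k, so we can stop here
                break
            best = k
        return best

    # Intervals of size >= 2 over [4] (cf. the third row of Table 1).
    intervals = [set(range(a, b + 1)) for a in range(4) for b in range(a + 1, 4)]
    print(hhdim(intervals, 4))           # prints 2, matching Theta(n/k)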
    Using the above lemma one can analyze the heavy-hitter dimension of several explicit set
systems, as listed in the first four lines in Table 1. This lemma also implies bounds for several
operations on set systems, as listed in the last three lines in Table 1. The proofs are rather
straightforward and we give examples in Proposition 2.3 and Proposition 2.4 below.
Proposition 2.3 (Random Sets). Let 𝒮 ⊂ 2[𝑛] be a set system of size |𝒮| = 𝑘, whose incidence matrix
𝑀 is formed of entries that are independent Bernoulli random variables with parameter 𝑝 ∈ (0, 1/2]. In
other words, every set 𝑆 𝑖 ∈ 𝒮 contains every coordinate 𝑗 ∈ [𝑛] independently with probability 𝑝. Then

                     Pr[ HHdim(𝒮) ≤ 𝑂(log(𝑛 𝑘)/𝑝) ] ≥ 1 − 1/poly(𝑛 𝑘).

Proof. Fix 𝑡 rows and 𝑡 columns of 𝑀, where 𝑡 = 𝑐 log(𝑛 𝑘)/𝑝 for some absolute constant 𝑐 > 0, and
consider the corresponding submatrix 𝑀 ′. The probability that 𝑀 ′ is an identity matrix is
𝑝 𝑡 (1 − 𝑝)𝑡²−𝑡 ≤ 𝑒 −𝑝𝑡²/2 . For the event HHdim(𝒮) ≥ 𝑡 to occur there must be ordered sequences
of 𝑡 rows and 𝑡 columns that yield an identity 𝑀 ′. Since the number of such choices is at most
𝑛 𝑡 · 𝑘 𝑡 , we obtain

                Pr[ HHdim(𝒮) ≥ 𝑡 ] ≤ (𝑛 𝑘)𝑡 · 𝑒 −𝑝𝑡²/2 ≤ 𝑒 −𝑝𝑡²/4 ≤ 1/poly(𝑛 𝑘),

where we use the fact that (𝑛 𝑘)𝑡 = 𝑒 𝑝𝑡²/𝑐 and choose 𝑐 appropriately.


Proposition 2.4 (Subadditivity). For every 𝒮1 , 𝒮2 ⊂ 2[𝑛] ,

                          HHdim(𝒮1 ∪ 𝒮2 ) ≤ HHdim(𝒮1 ) + HHdim(𝒮2 ).

Proof. Let 𝑀1 and 𝑀2 be the incidence matrices of 𝒮1 and 𝒮2 , respectively. Then the incidence
matrix of 𝒮1 ∪ 𝒮2 is, in block-matrix notation, simply 𝑀 = [𝑀1 ; 𝑀2 ] (the rows of 𝑀1 stacked on
top of the rows of 𝑀2 ). The rows of every permutation submatrix of 𝑀 can be partitioned
between 𝑀1 and 𝑀2 , and each part, together with its own columns, forms a permutation
submatrix of 𝑀1 or 𝑀2 , respectively; the proposition follows.

2.2   Streaming algorithm for the subset-ℓ 0 norm
We now design a one-pass streaming algorithm for the subset-ℓ 0 problem in an insertion-only
stream. Recall that in this model the input is a stream ⟨𝑎 1 , . . . , 𝑎 𝑚 ⟩, where each item 𝑎 𝑗 ∈ [𝑛]
represents an increment to coordinate 𝑎 𝑗 of a vector 𝑣 ∈ ℝ 𝑛 . The streaming algorithm has two
phases, an update phase that scans the stream, and a query phase that evaluates a query 𝑠 ∈ 𝒮.
(In the case below, the query phase formally does not answer a specific query 𝑠 ∈ 𝒮, but rather
reports a list that implicitly represents all queries 𝑠 ∈ 𝒮.) We assume that the set system 𝒮 is
given to the algorithm via a read-only tape, and thus requires no storage. See Section 1.2 for
detailed definitions.
    The algorithm uses a well-known technique of subsampling the coordinates of 𝑣 (i. e., the
set [𝑛]) at a predetermined rate 𝑝 ∈ (0, 1], and producing an estimate only if the resulting vector
has Ω(𝜀−2 ) non-zeros. Usually, counting the number of sampled non-zeros requires little space,
but this is more challenging in our case, where all subsets 𝑠 ∈ 𝒮 must be handled at once.
    The key to bounding the total space usage is the following proposition, which bounds the
global number of samples stored, when each of these samples is “needed” locally by some
subset 𝑠 ∈ 𝒮. This condition has a parameter 𝑘, and the reader may initially think of 𝑘 = 1.
Proposition 2.5. Let 𝒮 ⊂ {0, 1} 𝑛 be a set system. Suppose 𝑍 ⊂ [𝑛] and 𝑘 ≥ 1 are such that for every
𝑖∈𝑍

                                   ∃𝑠 ∈ 𝒮,   𝑖 ∈ 𝑠 and |𝑍 ∩ 𝑠 | ≤ 𝑘                              (2.1)

(in words, index 𝑖 is “𝑘-needed” by some 𝑠 ∈ 𝒮). Then |𝑍| ≤ 𝑘 · HHdim(𝒮).
Proof. We proceed by induction on 𝑘. For the base case 𝑘 = 1, consider a vector 𝑣 ∈ {0, 1} 𝑛
whose support is exactly the given 𝑍. Then for every 𝑖 ∈ 𝑍, there is 𝑠 ∈ 𝒮 such that 𝑍 ∩ 𝑠 = {𝑖},
and thus 𝐻(𝑠, 𝑣) := supp(𝑠 ◦ 𝑣) = {𝑖}. It follows that |𝑍| ≤ |𝐻(𝒮, 𝑣)| ≤ HHdim(𝒮).
     For the inductive step, consider 𝑘 ≥ 2. Given 𝑍, construct 𝐴 ⊂ 𝑍 as follows. Start with
𝐴 = 𝑍, and then repeatedly, if 𝐴 contains an index 𝑖 for which there is no 𝑠 ∈ 𝒮 with 𝐴 ∩ 𝑠 = {𝑖}
(i. e., 𝑖 is not 1-needed), then remove 𝑖 from 𝐴. The repetitions stop when 𝐴 contains no such
index 𝑖. We claim that the final set 𝐴 satisfies

                               ∀𝑠 ∈ 𝒮,       𝑍 ∩ 𝑠 ≠ ∅ ⇔ 𝐴 ∩ 𝑠 ≠ ∅.

For the forward direction, observe that initially |𝐴 ∩ 𝑠| = |𝑍 ∩ 𝑠 | ≥ 1, and that no iteration ever
decreases any |𝐴 ∩ 𝑠 | from 1 to 0. The reverse direction is obvious because 𝐴 ⊂ 𝑍.

                     T HEORY OF C OMPUTING, Volume 18 (20), 2022, pp. 1–32                          11
                  V LADIMIR B RAVERMAN , ROBERT K RAUTHGAMER , AND L IN F. YANG

    We can verify that 𝑍 \ 𝐴 satisfies the induction hypothesis, i. e., that every 𝑖 ∈ 𝑍 \ 𝐴 is
(𝑘 − 1)-needed. Indeed, since 𝑖 ∈ 𝑍, it is 𝑘-needed by some 𝑠 ∈ 𝒮, as expressed by (2.1), and by
the claim, this same 𝑠 also satisfies |𝐴 ∩ 𝑠| ≥ 1. Hence, |(𝑍 \ 𝐴) ∩ 𝑠| ≤ 𝑘 − 1, which shows 𝑖 is
(𝑘 − 1)-needed. Applying the induction hypothesis, we have |𝑍 \ 𝐴| ≤ (𝑘 − 1) · HHdim(𝒮).
    In addition, 𝐴 satisfies the induction’s base case 𝑘 = 1, because the iterations stop when
every 𝑖 ∈ 𝐴 is 1-needed. Hence |𝐴| ≤ HHdim(𝒮), and we conclude that |𝑍| = |𝑍 \ 𝐴| + |𝐴| ≤
𝑘 · HHdim(𝒮).                                                                                  
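
    The peeling step in this proof is constructive; a small sketch (ours) of the procedure:

    # Peeling from the proof of Proposition 2.5 (our sketch): repeatedly drop
    # indices of A that are not 1-needed, i.e., that no set s isolates in A.

    def peel(Z, sets):
        A = set(Z)
        while True:
            stale = next((i for i in A
                          if not any(i in s and len(A & s) == 1 for s in sets)),
                         None)
            if stale is None:
                return A
            A.remove(stale)

    # Every index of Z is 2-needed here, and peeling leaves a 1-needed core A;
    # |A| <= HHdim and |Z \ A| <= HHdim, with HHdim = 2 for two disjoint sets.
    sets = [{0, 1}, {2, 3}]
    Z = {0, 1, 2, 3}
    A = peel(Z, sets)
    print(A, Z - A)     # one surviving index per set, e.g. {1, 3} and {0, 2}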

   Our algorithm is based on simulating the simple estimator defined in the following lemma.
(The difficulty will be to apply it to 𝑠 ◦ 𝑣 for 𝑠 ∈ 𝒮 that is not known in advance.)

Lemma 2.6. Fix 𝑣 ∈ ℝ 𝑛 , and sample its coordinates to form 𝑣 ′ ∈ ℝ 𝑛 as follows. Suppose each coordinate
is 𝑣 ′𝑖 = 𝑣 𝑖 𝑋𝑖 , where 𝑋1 , . . . , 𝑋𝑛 are pairwise-independent identically distributed Bernoulli random
variables with parameter 𝑝 ∈ (0, 1]. Then

      𝔼[ (1/𝑝) k𝑣 ′ k 0 ] = k𝑣k 0 ,      Var[ (1/𝑝) k𝑣 ′ k 0 ] = ((1 − 𝑝)/𝑝) k𝑣 k 0 ,      and

              Pr[ | (1/𝑝) k𝑣 ′ k 0 − k𝑣k 0 | ≥ 3( ((1 − 𝑝)/𝑝) k𝑣 k 0 )1/2 ] ≤ 1/9.

Proof. The expectation and variance follow from direct calculation (note that it suffices to assume
pairwise independence). The tail bound is straightforward from Chebyshev’s inequality.
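
    A quick numerical check of Lemma 2.6 (ours), realizing the pairwise-independent bits with
the standard hash ℎ(𝑖) = (𝑎𝑖 + 𝑏) mod 𝑃 for a random pair (𝑎, 𝑏) and a prime 𝑃 (an assumption
of this illustration, not the paper’s construction):

    import random

    # Empirical check that (1/p) * ||v'||_0 is an unbiased estimator of ||v||_0
    # when coordinates are kept pairwise independently at rate p (Lemma 2.6).

    P = 2 ** 31 - 1                          # a Mersenne prime

    def estimate_l0(support, p, rng):
        a, b = rng.randrange(1, P), rng.randrange(P)
        kept = sum(1 for i in support if (a * i + b) % P < p * P)
        return kept / p

    rng = random.Random(7)
    support = range(1, 1001)                 # ||v||_0 = 1000
    mean = sum(estimate_l0(support, 0.05, rng) for _ in range(2000)) / 2000
    print(round(mean, 1))                    # concentrates near 1000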

    Our basic algorithm for storing a subsample of the coordinates is described in Algorithm 1.
The idea is to sample the universe [𝑛] at rate 𝑝 ∈ (0, 1] using pairwise independent random
variables 𝜉1 , . . . , 𝜉𝑛 ∈ {0, 1} (this can be viewed as a hash function 𝜉 : [𝑛] → {0, 1}). We store
the sampled items (coordinates) from the stream so long as they are “needed” by some 𝑠 ∈ 𝒮,
or more precisely, 𝑈-needed in the sense of Proposition 2.5. Here, the budget parameter 𝑈
represents a “local” bound that holds separately for each 𝑠 ∈ 𝒮. We show that at the end, for
every 𝑠 ∈ 𝒮, if 𝑠 contains at most 𝑈 sampled coordinates, then all these samples are completely
stored; otherwise, the number of samples stored is at least 𝑈. We will later use this algorithm to
simulate a pairwise-independent sampling from a desired 𝑠 ∈ 𝒮.

Algorithm 1 Bounded-Sampler for 𝒮 ⊂ {0, 1} 𝑛
 1: Input: 𝑝 ∈ (0, 1], 𝑈 ∈ [1, 𝑛], an insertion-only stream ⟨𝑎 1 , . . . , 𝑎 𝑚 ⟩, where each item 𝑎 𝑗 ∈ [𝑛]
 2: Initialize:
 3:    ℋ ←∅
 4:    pick random 𝜉 = (𝜉1 , . . . , 𝜉𝑛 ) ∈ {0, 1} 𝑛 , with each Pr[𝜉𝑖 = 1] = 𝑝 and pairwise independent
 5: Update(𝑎 𝑗 ):
 6:    if 𝜉 𝑎 𝑗 = 1 and 𝑎 𝑗 ∉ ℋ then
 7:        ℋ ← ℋ ∪ {𝑎 𝑗 }
 8:        while there is 𝑖 ∈ ℋ such that all 𝑠 ∈ 𝒮 with 𝑖 ∈ 𝑠 satisfy |𝑠 ∩ ℋ | > 𝑈 do
 9:            remove this 𝑖 from ℋ                                ⊲ remove 𝑖 that is not 𝑈-needed
10: Query():
11:    return ℋ                                   ⊲ implicit answer supp(𝜉 ◦ 𝑠 ◦ 𝑣) for each 𝑠 ∈ 𝒮

Lemma 2.7. Consider Algorithm 1 for 𝒮 ⊂ {0, 1} 𝑛 with parameters 𝑝 ∈ (0, 1] and 𝑈 ∈ [1, 𝑛]. When
run on an insertion-only stream accumulating to 𝑣 ∈ ℝ 𝑛 , it makes one pass, uses 𝑂(𝑈) · HHdim(𝒮)
words of space, and outputs ℋ ⊂ [𝑛] of size |ℋ | ≤ 𝑈 · HHdim(𝒮). Moreover, suppose that 𝜉 ∈ {0, 1} 𝑛
is the sampling vector from the algorithm. Then for every 𝑠 ∈ 𝒮, if k𝜉 ◦ 𝑠 ◦ 𝑣k 0 ≤ 𝑈 then

                                           supp(𝜉 ◦ 𝑠 ◦ 𝑣) ⊂ ℋ ;

and otherwise |ℋ ∩ 𝑠 | ≥ 𝑈.

Proof. Observe that after each update operation, every 𝑖 ∈ ℋ is 𝑈-needed, i. e.,

                                   ∃𝑠 ∈ 𝒮,       𝑖 ∈ 𝑠 and |𝑠 ∩ ℋ | ≤ 𝑈.



By applying Proposition 2.5 to this ℋ we have that |ℋ | ≤ 𝑈 · HHdim(𝒮). This bound applies in
particular to the output ℋ , and also implies that the algorithm uses 𝑂(𝑈) · HHdim(𝒮) words of
space.
    Now consider ℋ at the end of the stream, and some 𝑠 ∈ 𝒮. Let 𝑍 = supp(𝜉 ◦ 𝑠 ◦ 𝑣). If
|𝑍| ≤ 𝑈, then every 𝑖 ∈ 𝑍 must have been stored in the final ℋ (because it must be added at
some update and can never be removed), which proves that 𝑍 ⊂ ℋ .
    Next, suppose that |𝑍| > 𝑈. Observe that every index 𝑖 ∈ 𝑍 is added at some point to ℋ ,
hence |𝑠 ∩ ℋ | is increased more than 𝑈 times. Moreover, whenever any 𝑖 ∈ 𝑠 ∩ ℋ is removed
from ℋ , it may only decrease |𝑠 ∩ ℋ | from 𝑈 + 1 to 𝑈, but never below 𝑈. It follows that at the
end of the execution, |𝑠 ∩ ℋ | ≥ 𝑈.                                                             
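
    For concreteness, here is a direct (unoptimized) Python rendering of Algorithm 1 (ours; the
𝑈-needed test is done by brute force over 𝒮, and fully independent bits stand in for the
pairwise-independent 𝜉):

    import random

    # Bounded-Sampler (Algorithm 1), rendered directly: store sampled items
    # and evict any index that is no longer U-needed by some s in S.

    class BoundedSampler:
        def __init__(self, sets, n, p, U, seed=None):
            rng = random.Random(seed)
            self.sets = [frozenset(s) for s in sets]
            self.U = U
            self.xi = [rng.random() < p for _ in range(n)]  # sampling bits
            self.H = set()

        def _needed(self, i):
            # i is U-needed if some s containing i has |s ∩ H| <= U
            return any(i in s and len(self.H & s) <= self.U for s in self.sets)

        def update(self, a):
            if self.xi[a] and a not in self.H:
                self.H.add(a)
                while True:              # evict indices that are not U-needed
                    stale = next((i for i in self.H if not self._needed(i)), None)
                    if stale is None:
                        break
                    self.H.remove(stale)

        def query(self):
            return self.H                # encodes supp(xi ∘ s ∘ v) for each s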

    We shall use Algorithm 1 as a subroutine twice, first to compute an 𝑂(1)-approximation
to a query 𝑠 ∈ 𝒮, and then (a refined version of it) to compute a (1 ± 𝜀)-approximation. En
route to an 𝑂(1)-approximation, we introduce Algorithm 2, which simply runs Algorithm 1
in parallel Θ(log log 𝑛) times, and then when given a query 𝑠 ∈ 𝒮, it outputs one bit. The next
lemma shows that this algorithm solves a promise (gap) version of the subset-ℓ0 problem: if
k𝑣 ◦ 𝑠 k 0 ≤ 1/(2𝑝) then with high probability it outputs 0, and if k𝑣 ◦ 𝑠 k 0 ≥ 2/𝑝 then with high
probability it outputs 1.

Algorithm 2 Constant-Detector for 𝒮 ⊂ {0, 1} 𝑛
 1: Input: 𝑟 ∈ [1/4, 𝑛], an insertion-only stream ⟨𝑎 1 , . . . , 𝑎 𝑚 ⟩, where each item 𝑎 𝑗 ∈ [𝑛]
 2: Initialize:
 3:    𝑈 ← 100, 𝑝 ← min(1, 𝑈/𝑟), 𝑡 ← Θ(log log 𝑛)
 4:    let 𝒜1 , . . . , 𝒜 𝑡 be instances of Bounded-Sampler with parameters 𝑝 and 𝑈
 5: Update(𝑎 𝑗 ):
 6:    update each 𝒜 𝑖 with 𝑎 𝑗
 7: Query(𝑠 ∈ 𝒮):
 8:    query each 𝒜 𝑖 for 𝑠 and let ℋ𝑖 be its output
 9:    𝑧̄ ← (1/𝑝) · median( |ℋ1 ∩ 𝑠|, . . . , |ℋ𝑡 ∩ 𝑠| )
10:    return 1{ 𝑧̄ ≥ 𝑟 }

Lemma 2.8. Consider Algorithm 2 for 𝒮 ⊂ {0, 1} 𝑛 with parameter 𝑟 ∈ [1/4, 𝑛]. When run on an
insertion-only stream accumulating to 𝑣 ∈ ℝ 𝑛 , it makes one pass and uses 𝑂(log log 𝑛 · HHdim(𝒮))
words of space, and then when queried for 𝑠 ∈ 𝒮, with probability at least 1 − 𝑂(1/log² 𝑛) its output
satisfies: if k𝑣 ◦ 𝑠 k 0 ≤ 𝑟/2 the output is 0, and if k𝑣 ◦ 𝑠 k 0 ≥ 2𝑟 the output is 1.

Proof. The algorithm’s space usage is dominated by the 𝑡 instances of the Bounded-Sampler,
and thus follows from Lemma 2.7.
    If 𝑟 ≤ 𝑈, then 𝑝 = 1 and the algorithm is deterministic, with the following guarantee by
Lemma 2.7. If k𝑠 ◦ 𝑣k 0 ≤ 𝑟/2 ≤ 𝑈, then 𝑧̄ = k𝑠 ◦ 𝑣 k 0 exactly, and the output is 0. If
k𝑠 ◦ 𝑣 k 0 ≥ 2𝑟, then 𝑧̄ ≥ min(k𝑠 ◦ 𝑣k 0 , 𝑈) ≥ 𝑟 and the output is 1.



    We thus assume henceforth that 𝑟 > 𝑈 and thus 𝑝 = 𝑈/𝑟. Consider an instance 𝒜 𝑖 of the
Bounded-Sampler, let ℋ𝑖 be its output, and let 𝜉 𝑖 ∈ {0, 1} 𝑛 be its sampling vector. We would
like to analyze the quantity (1/𝑝) |ℋ𝑖 ∩ 𝑠| used in the algorithm.
    Now suppose k𝑠 ◦ 𝑣k 0 ≥ 2𝑟 = 2𝑈/𝑝. Then by Lemma 2.6, the expectation is
𝔼[ (1/𝑝) k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ] = k𝑠 ◦ 𝑣 k 0 , and with probability at least 8/9,

  (1/𝑝) k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ≥ k𝑠 ◦ 𝑣k 0 − 3( ((1 − 𝑝)/𝑝) k𝑠 ◦ 𝑣k 0 )1/2
                         ≥ k𝑠 ◦ 𝑣k 0 − (3/(2𝑈)1/2 ) k𝑠 ◦ 𝑣 k 0 ≥ (3/4) k𝑠 ◦ 𝑣k 0 ≥ 3𝑈/(2𝑝).

In this event, k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ≥ 3𝑈/2 > 𝑈, which by Lemma 2.7 implies that |ℋ𝑖 ∩ 𝑠 | ≥ 𝑈, and
therefore the estimate obtained from the instance 𝒜 𝑖 is (1/𝑝) |ℋ𝑖 ∩ 𝑠 | ≥ (1/𝑝) 𝑈 ≥ 𝑟. By a standard
probability amplification argument, i. e., a Chernoff bound, with probability at least 1 − 𝑂(1/log² 𝑛),
the median of 𝑡 independent repetitions is 𝑧̄ ≥ 𝑟, and the output is 1.
    Suppose next that k𝑠 ◦ 𝑣k 0 ≤ 𝑟/2 = 𝑈/(2𝑝). Then by Lemma 2.6, with probability at least 8/9,

   (1/𝑝) k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ≤ k𝑠 ◦ 𝑣k 0 + 3( ((1 − 𝑝)/𝑝) k𝑠 ◦ 𝑣k 0 )1/2 ≤ 𝑈/(2𝑝) + (3/𝑝)(𝑈/2)1/2 ≤ 3𝑈/(4𝑝).

In this event, k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ≤ 3𝑈/4, which by Lemma 2.7 implies that ℋ𝑖 ∩ 𝑠 = supp(𝜉 𝑖 ◦ 𝑠 ◦ 𝑣), and
therefore the estimate obtained from the instance 𝒜 𝑖 is (1/𝑝) |ℋ𝑖 ∩ 𝑠| = (1/𝑝) k𝜉 𝑖 ◦ 𝑠 ◦ 𝑣 k 0 ≤
3𝑈/(4𝑝) < 𝑟. By a standard probability amplification argument, i. e., a Chernoff bound, with
probability at least 1 − 𝑂(1/log² 𝑛), the median of 𝑡 independent repetitions is 𝑧̄ < 𝑟, and the
output is 0.
    Using Algorithm 2 we can now easily design an algorithm achieving 𝑂(1)-approximation.
This algorithm uses standard ideas from the literature, like subsampling a stream at different
rates, which go back to [28]. We adapt it here to our setting of a set system.

Lemma 2.9 (8-approximation algorithm). There is an algorithm that, when run for 𝒮 ⊂ {0, 1} 𝑛 on
an insertion-only stream accumulating to 𝑣 ∈ ℝ 𝑛 , makes one pass and uses
𝑂(log 𝑛 · log log 𝑛 · HHdim(𝒮)) words of space, and then when queried for 𝑠 ∈ 𝒮, it outputs a number
𝑧 that with probability at least 0.99 satisfies k𝑠 ◦ 𝑣 k 0 < 𝑧 < 8k𝑠 ◦ 𝑣k 0 .


Proof. The algorithm consists of 𝑙 = 𝑂(log 𝑛) parallel independent instances of Algorithm 2,
denoted ℬ0 , ℬ1 , . . . , ℬ𝑙 . For each instance ℬ 𝑗 , the parameter is 𝑟 𝑗 = 2 𝑗−1 . To process a query
𝑠 ∈ 𝒮, the algorithm queries every instance for this 𝑠. Let 𝑥 𝑗 denote the output from instance ℬ 𝑗 .
By Lemma 2.8, 𝑥0 = 0 if and only if k𝑠 ◦ 𝑣k 0 = 0. Hence, 𝑥 0 can be used to distinguish whether
k𝑠 ◦ 𝑣 k 0 = 0, and we may assume henceforth that k𝑠 ◦ 𝑣k 0 > 0.
    The algorithm computes 𝑗 ∗ , the smallest 𝑗 ≥ 1 such that 𝑥 𝑗 = 0, and outputs 𝑧 = 2 𝑗∗ .
By Lemma 2.8 and a union bound, with probability at least 0.99, for all 𝑗 = 1, . . . , 𝑙, if
k𝑠 ◦ 𝑣k 0 ≥ 2𝑟 𝑗 = 2 𝑗 then 𝑥 𝑗 = 1, and if k𝑠 ◦ 𝑣 k 0 ≤ 𝑟 𝑗 /2 = 2 𝑗−2 then 𝑥 𝑗 = 0. Assuming this event
happens, and applying the first condition to 𝑗 ∗ and the second one to 𝑗 ∗ − 1 (both in the
contrapositive form), we obtain 2 (𝑗∗−1)−2 < k𝑠 ◦ 𝑣k 0 < 2 𝑗∗ = 𝑧.
    It is easy to verify the algorithm’s space requirement, and this completes the proof.
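
    The query logic of this lemma amounts to a few lines (our sketch): given the detector bits
𝑥 𝑗 from the instances ℬ 𝑗 , return 𝑧 = 2 𝑗∗ for the smallest 𝑗 ∗ ≥ 1 with 𝑥 𝑗∗ = 0.

    # 8-approximation from the detector bits (our sketch of the query phase).

    def eight_approx(x):
        if x[0] == 0:                    # x[0] distinguishes ||s∘v||_0 = 0
            return 0
        j_star = next((j for j in range(1, len(x)) if x[j] == 0), len(x))
        return 2 ** j_star

    print(eight_approx([1, 1, 1, 0, 0]))   # z = 8, consistent with 1 < ||s∘v||_0 < 8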
   We now design a (1 ± 𝜀)-approximation algorithm, using the above 𝑂(1)-approximation
algorithm and a variant of Algorithm 1.

Algorithm 3 Multiplicative Approximation for 𝒮 ⊂ {0, 1} 𝑛
 1: Input: 𝜀 ∈ (0, 1), an insertion-only stream ⟨𝑎 1 , . . . , 𝑎 𝑚 ⟩, where each item 𝑎 𝑗 ∈ [𝑛]
 2: Initialize:
 3:    𝑡 ← Θ(log 𝑛)
 4:    let {𝒜 𝑖 } 𝑖=1,...,𝑡 be instances of Bounded-Sampler with parameters 𝑝 𝑖 = 2 1−𝑖 and 𝑈 = ⌈400𝜀−2 ⌉
 5:    let ℬ be an 8-approximation algorithm (from Lemma 2.9)
 6: Update(𝑎 𝑗 ):
 7:    update ℬ and each 𝒜 𝑖 with 𝑎 𝑗
 8: Query(𝑠 ∈ 𝒮):
 9:    query ℬ for 𝑠 and let 𝑧 be its output
10:    𝑙 ← max(1, ⌈log(𝑧𝜀 2 /100)⌉)
11:    query 𝒜 𝑙 for 𝑠 and let ℋ𝑙 be its output
12:    return (1/𝑝 𝑙 ) · |ℋ𝑙 ∩ 𝑠|


Theorem 2.10. Consider Algorithm 3 for 𝒮 ⊂ {0, 1} 𝑛 with parameter 𝜀 ∈ (0, 1). When run on an
insertion-only stream accumulating to 𝑣 ∈ ℝ 𝑛 , it makes one pass and uses 𝑂((𝜀−2 + log log 𝑛) log 𝑛 ·
HHdim(𝒮)) words of space, and then when queried for 𝑠 ∈ 𝒮, its output 𝑧̂(𝑠) satisfies5

                        ∀𝑠 ∈ 𝒮,        Pr[ 𝑧̂(𝑠) ∈ (1 ± 𝜀) k𝑠 ◦ 𝑣k 0 ] ≥ 0.8.
    5Our sketching algorithm uses pairwise-independent random bits, and its success probability can be amplified by
independent parallel repetitions of the algorithm. A possibly better approach is to use 𝑘-wise independent random
bits for larger 𝑘 to amplify the success probability directly.

Proof. We assume k𝑠 ◦ 𝑣k 0 > 0, since otherwise the algorithm specified in Lemma 2.9 already
gives the correct answer. Next, by Lemma 2.9, with probability at least 0.99, the value 𝑧 reported
by ℬ is an 8-approximation to k𝑠 ◦ 𝑣 k 0 , and for the rest of the proof we assume this event
happens. Consider 𝑙 as in the algorithm, and its corresponding 𝑝 𝑙 = 2 1−𝑙 = min(1, 200𝜀−2 /𝑧);
then
                  min(1, 25𝜀−2 /k𝑠 ◦ 𝑣 k 0 ) ≤ 𝑝 𝑙 ≤ min(1, 200𝜀−2 /k𝑠 ◦ 𝑣k 0 ).
Let 𝑍 𝑙 ⊂ supp(𝑠 ◦ 𝑣) be the set of indices sampled in algorithm 𝒜 𝑙 by the hash function 𝜉,
and recall it samples each index in supp(𝑠 ◦ 𝑣) with probability 𝑝 𝑙 pairwise independently. By
Lemma 2.6, with probability at least 0.8,

          | (1/𝑝 𝑙 ) |𝑍 𝑙 | − k𝑠 ◦ 𝑣 k 0 | ≤ 3( ((1 − 𝑝 𝑙 )/𝑝 𝑙 ) k𝑠 ◦ 𝑣 k 0 )1/2 ≤ 𝜀 k𝑠 ◦ 𝑣k 0 ,

which implies that |𝑍 𝑙 | ≤ (1 + 𝜀) 𝑝 𝑙 k𝑠 ◦ 𝑣 k 0 ≤ 400𝜀−2 ≤ 𝑈. When this happens, by Lemma 2.7
instance 𝒜 𝑙 of Algorithm 1 outputs ℋ𝑙 that satisfies ℋ𝑙 ∩ 𝑠 = 𝑍 𝑙 , and we conclude that our
algorithm’s output is (1/𝑝 𝑙 ) |ℋ𝑙 ∩ 𝑠 | = (1/𝑝 𝑙 ) |𝑍 𝑙 | ∈ (1 ± 𝜀) k𝑠 ◦ 𝑣 k 0 .
    The proof of Theorem 2.10 is completed by easily verifying the space usage of the algorithm.
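
    The level selection in the query phase is simple arithmetic (our sketch): the 8-approximation
𝑧 fixes a rate 𝑝 𝑙 ≈ min(1, 200𝜀−2 /𝑧), coarse enough that level 𝑙 stores all of 𝑍 𝑙 , yet fine enough
for (1 ± 𝜀) concentration.

    import math

    # Query phase of Algorithm 3 (our sketch): choose the sampling level from
    # the 8-approximation z, then rescale the stored sample count.

    def choose_level(z, eps):
        l = max(1, math.ceil(math.log2(z * eps ** 2 / 100)))
        return l, 2.0 ** (1 - l)                 # (level, rate p_l)

    def estimate(H_l, s, p_l):
        return len(H_l & s) / p_l                # (1/p_l) * |H_l ∩ s|

    l, p_l = choose_level(z=2 ** 20, eps=0.1)
    print(l, p_l)                                # l = 7, p_l = 2^-6 ~ 200/(eps^2 z)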
                                                                                                           

2.3   Matching lower bound
We prove that for every set system 𝒮 ⊂ {0, 1} 𝑛 , the space complexity of every universal
streaming algorithm must be Ω(HHdim(𝒮)), which matches Theorem 2.10 in terms of the linear
dependence on the heavy-hitter dimension.
Theorem 2.11. Let 𝒮 ⊂ {0, 1} 𝑛 be a non-empty set system. Suppose 𝒜 is a (randomized) one-pass
streaming algorithm that solves the subset-ℓ 𝑝 problem for 𝒮 within approximation factor 𝛼 ≥ 1 for some
𝑝 ≥ 0. Then 𝒜 requires Ω(HHdim(𝒮)) bits of space for some insertion-only stream input. Moreover,
if 𝛼 = 1 + 𝜀 for some 𝜀 ≥ 1/(max𝑠∈𝒮 k𝑠 k 0 )1/2 and 𝑝 ≠ 1, then 𝒜 requires Ω(HHdim(𝒮) + 𝜀−2 ) bits of
space.
Proof. We begin with the Ω(HHdim(𝒮)) lower bound. Let 𝑘 = HHdim(𝒮) ≥ 1; then
there exists a vector 𝑣 ∈ ℝ 𝑛 such that |𝐻(𝒮, 𝑣)| ≥ 𝑘. Without loss of generality, we can
assume 𝑣 ∈ {0, 1} 𝑛 , since replacing each non-zero coordinate of 𝑣 with 1 does not change
𝐻(𝑠, 𝑣) := supp(𝑠) ∩ supp(𝑣) for any 𝑠 ∈ {0, 1} 𝑛 .
    Since 𝒮 is given to the algorithm before the stream begins, we can use the vector 𝑣 and the
algorithm 𝒜 to design a one-way communication protocol that solves INDEX(𝑘), in which Alice
holds a binary vector 𝑥 ∈ {0, 1} 𝑘 and Bob holds an index 𝑖 ∈ [𝑘]. Alice sends one round of
messages to Bob, who needs to determine the 𝑖-th coordinate of Alice’s vector. It is well-known
that any protocol with at least constant probability of success requires Ω(𝑘) bits (see, e. g., [38]).
    We now describe the protocol for the INDEX problem using the algorithm for the subset-ℓ 𝑝
problem. Firstly, since |𝐻(𝒮, 𝑣)| ≥ 𝑘, there exist 𝑠 1 , 𝑠2 , . . . , 𝑠 𝑘 ∈ 𝒮 such that each k𝑠 𝑗 ◦ 𝑣k 0 = 1
and the sets 𝐻(𝑠 𝑗 , 𝑣) are disjoint. Therefore, each 𝑠 𝑗 uniquely picks out a coordinate of 𝑣; we
denote the non-zero coordinate of 𝑠 𝑗 ◦ 𝑣 by 𝑧 𝑗 . Alice and Bob (without communication) then map
the 𝑗-th element of [𝑘] to the 𝑧 𝑗 -th coordinate of 𝑣. Alice then modifies 𝑣 such that 𝑣 𝑧 𝑗 = 𝑥 𝑗 for
each 𝑗. As such, Alice obtains a vector 𝑣 ′. She then converts the vector 𝑣 ′ to an insertion-only
stream, runs the algorithm 𝒜 on 𝑣 ′, and sends the memory content to Bob. Bob recovers the
instance of the algorithm 𝒜 and queries it for an approximation of the ℓ 𝑝 norm of the vector
𝑣 ′ ◦ 𝑠 𝑖 . Since k𝑣 ′ ◦ 𝑠 𝑖 k 𝑝 = 𝑥 𝑖 , Bob can recover an answer for the INDEX problem from any
𝛼-multiplicative approximation of k𝑣 ′ ◦ 𝑠 𝑖 k 𝑝 . Thus algorithm 𝒜 must use Ω(𝑘) bits of space in
the worst case.
    Lastly, the Ω(𝜀−2 ) lower bound follows from the standard lower bound for the ℓ 𝑝 norm [32].

Remark 2.12. For every 𝑝 ≤ 2, the lower bound Ω(HHdim(𝒮) + 𝜀−2 ) is existentially tight up
to a polylog(𝑛) factor. Indeed, the system 𝒮 ∗ = {𝑒1 , . . . , 𝑒 𝑘−1 , [𝑘 . . 𝑛]} has HHdim(𝒮 ∗ ) = |𝒮 ∗ | = 𝑘,
but admits an algorithm with space usage 𝑂(HHdim(𝒮 ∗ ) + 1/𝜀 2 ) that explicitly stores the first
𝑘 − 1 coordinates and runs an ℓ 𝑝 -norm algorithm for the coordinate subset [𝑘 . . 𝑛]. Thus,
Theorem 2.11 provides the best-possible dependence on 𝜀.

Remark 2.13. Some set systems 𝒮 ⊂ {0, 1} 𝑛 admit a stronger lower bound of Ω(HHdim(𝒮) ·
𝜀−2 ). Indeed, consider a set system consisting of HHdim(𝒮) disjoint subsets, each of
cardinality 𝜀−2 ; the lower bound then follows by a reduction from HHdim(𝒮) independent
instances of Gap Hamming Distance, applying [4], which is based on [22, 12]. Thus, the space
complexity in Theorem 2.10 provides the best-possible dependence on 𝜀.

2.4   Strong lower bounds for turnstile and sliding-window models
We now show an impossibility result for the subset-ℓ_𝑝 problem in richer data-stream models,
namely, the strict turnstile and sliding-window models. Specifically, we exhibit a family of
subsets that has a small heavy-hitter dimension but does not admit efficient (nontrivial)
streaming algorithms in those richer models. This shows a strong separation from the
insertion-only model.
   Recall that in the turnstile model, the stream contains additive updates to a vector 𝑣 ∈ ℝ 𝑛 ,
which is initialized to all-zeros. As the updates may be negative, it captures both insertions and
deletions. In the strict turnstile model, the coordinates of 𝑣 must remain non-negative at all
times. For an even integer 𝑛, let 𝒮int ⊂ 2^[𝑛] be the family of all intervals of length 𝑛/2 (and thus
cardinality 𝑛/2 + 1), i. e.,

                                𝒮int := { [𝑎 . . 𝑎 + 𝑛/2] : 𝑎 = 1, . . . , 𝑛/2 }.

We next show that 𝒮int has a small heavy-hitter dimension (much smaller than its cardinality
|𝒮int | = 𝑛/2), and thus admits efficient algorithms for insertion-only streams.

Proposition 2.14. HHdim(𝒮int ) ≤ 2.

Proof. Consider any vector 𝑣 ∈ ℝ^𝑛. We claim that 𝐻(𝒮int, 𝑣) contains at most one index from
[1 . . 𝑛/2]. Indeed, if there is such an index at all, let 𝑗 be the smallest one. Then there is 𝑠 ∈ 𝒮int
such that supp(𝑠 ◦ 𝑣) = {𝑗}; since such an 𝑠 contains all of [𝑗 . . 𝑛/2], this implies that
𝑣_{𝑗+1}, . . . , 𝑣_{𝑛/2} must all be equal to 0. It follows that 𝑗 + 1, . . . , 𝑛/2 ∉ 𝐻(𝒮int, 𝑣), and the
claim follows. A symmetric claim applies to the indices in [𝑛/2 + 1 . . 𝑛], and thus |𝐻(𝒮int, 𝑣)| ≤ 2. □
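
As a sanity check, the following brute-force snippet (ours; coordinates are 0-indexed) verifies
the proposition for a small even 𝑛 by enumerating all 0/1 vectors:

import itertools

n = 8                                    # a small even n
# S_int: the intervals [a .. a + n/2], here as 0-indexed coordinate sets.
S_int = [frozenset(range(a, a + n // 2 + 1)) for a in range(n // 2)]

def H(family, v):
    """Heavy-hitter set of v: indices j with supp(s o v) = {j} for some s."""
    hh = set()
    for s in family:
        sup = [j for j in s if v[j] != 0]
        if len(sup) == 1:
            hh.add(sup[0])
    return hh

worst = max(len(H(S_int, v)) for v in itertools.product([0, 1], repeat=n))
print("max |H(S_int, v)| over 0/1 vectors:", worst)   # expected: 2
assert worst <= 2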


  We now show a strong space lower bound for every streaming algorithm in the turnstile
model.

Theorem 2.15. Suppose 𝒜 is a (randomized) one-pass streaming algorithm that solves the subset-ℓ 𝑝
problem for 𝒮int within approximation factor 𝛼 ≥ 1 for some 𝑝 ≥ 0. Then for some turnstile stream input,
𝒜 requires Ω(𝑛) bits of space.

    Before proving the theorem, we recall a known communication lower bound, which was
introduced and proved in [6], following the classical Index problem from [40]. In the Augmented
Index problem, denoted AUG_𝑛, Alice holds a binary vector 𝑥 ∈ {0, 1}^𝑛, and Bob holds an index
𝑗 ∈ [𝑛] together with the suffix 𝑥_{𝑗+1}, . . . , 𝑥_𝑛 ∈ {0, 1} of Alice’s vector. Alice sends a single
message to Bob, who is required to output 𝑥_𝑗.

Theorem 2.16 (Lower Bound for Augmented Index [6]). In every shared-randomness protocol for
AUG𝑛 (with success probability at least 0.9), Alice must send Ω(𝑛) bits.

     We are now ready to prove our theorem.

Proof of Theorem 2.15. Without loss of generality, suppose 𝑛 > 0 is an even integer. Suppose we
have an algorithm 𝒜 that solves, with 𝛼-approximation, the subset-ℓ_𝑝 problem for 𝒮int. We
describe a protocol that solves the AUG_{𝑛/2} problem. Alice has a binary vector 𝑥 ∈ {0, 1}^{𝑛/2},
and Bob has an index 𝑗 and the bits 𝑥_{𝑗+1}, 𝑥_{𝑗+2}, . . . , 𝑥_{𝑛/2}. Alice treats her vector 𝑥 as a
stream of updates and feeds it to 𝒜. She then sends the memory content of 𝒜 to Bob, who
continues running 𝒜 from the received state. He then feeds −𝑥_{𝑗+1}, −𝑥_{𝑗+2}, . . . , −𝑥_{𝑛/2} to the
stream. After this, Bob queries the set 𝑠_𝑗 = [𝑗 . . 𝑗 + 𝑛/2]. If the algorithm answers a number > 0,
Bob claims 𝑥_𝑗 = 1; otherwise he claims 𝑥_𝑗 = 0.
    To show correctness, we observe that after Bob’s updates, the vector in the stream is
exactly 𝑥′ = (𝑥₁, 𝑥₂, . . . , 𝑥_𝑗, 0, 0, . . . , 0). Therefore, 𝑥′ ◦ 𝑠_𝑗 = (0, . . . , 0, 𝑥_𝑗, 0, . . . , 0). Hence if 𝑥_𝑗 = 0,
then with probability at least 0.9, 𝒜 outputs 0, and if 𝑥_𝑗 ≠ 0, then with probability at least 0.9, 𝒜
outputs a number > 0. Thus, the lower bound for AUG_{𝑛/2} implies a space lower bound for 𝒜. □
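
The same stand-in style illustrates the protocol concretely (ours; exact arithmetic replaces the
streaming algorithm 𝒜, and indices are 0-based):

import random

n = 10                                   # even
half = n // 2
x = [random.randint(0, 1) for _ in range(half)]   # Alice's bits

v = [0] * n
for i, xi in enumerate(x):               # Alice: insertion-only prefix
    v[i] += xi

j = random.randrange(half)               # Bob's index
for i in range(j + 1, half):             # Bob: turnstile deletions of the
    v[i] -= x[i]                         # suffix he already knows

s_j = range(j, j + half + 1)             # the interval s_j = [j .. j + n/2]
answer = sum(1 for i in s_j if v[i] != 0)         # exact subset-l_p surrogate
assert (answer > 0) == (x[j] == 1)
print(f"recovered x_{j} = {int(answer > 0)} (truth: {x[j]})")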

   A similar argument applies to the sliding-window model, where the stream has a parameter
𝑊 ≥ 1 called window-size, and the input vector 𝑣 ∈ ℝ 𝑛 at any time 𝑡 is determined by the last
𝑊 additive updates, i. e., items from time 𝑡 − 𝑊 or earlier in the stream expire (are ignored).

Theorem 2.17. Suppose 𝒜 is a (randomized) one-pass streaming algorithm that solves the subset-ℓ 𝑝
problem for 𝒮int within approximation factor 𝛼 ≥ 1 for some 𝑝 ≥ 0. Then for some sliding-window
stream input, 𝒜 requires Ω(min(𝑛, 𝑊)) bits of space.

Proof. Suppose we have an algorithm 𝒜 that solves, with 𝛼-approximation, the subset-ℓ_𝑝
problem for 𝒮int over the most recent 𝑊 updates at any time 𝑡. Let 𝑑 = min(𝑛/2, 𝑊). We
describe a protocol that solves the AUG_𝑑 problem. Alice has a binary vector 𝑥 ∈ {0, 1}^𝑑,
and Bob has an index 𝑗 and the bits 𝑥_{𝑗+1}, 𝑥_{𝑗+2}, . . . , 𝑥_𝑑. Alice treats her vector 𝑥 as a stream
of updates and feeds it to 𝒜 in the order 𝑥_𝑑, 𝑥_{𝑑−1}, . . . , 𝑥₁. She then sends the memory content
of 𝒜 to Bob, who continues running 𝒜 from the received state. He then feeds updates to the
(𝑗 − 1)-st coordinate of 𝑣 so that all updates of 𝑥_𝑑, 𝑥_{𝑑−1}, . . . , 𝑥_{𝑗+1} expire while the update
of 𝑥_𝑗 does not, i. e., Bob sends 𝑊 − Σ_{𝑗′=𝑗+1}^{𝑑} 𝑥_{𝑗′} updates to the (𝑗 − 1)-st coordinate of 𝑣.
After this, Bob queries the set 𝑠_𝑗 = [𝑗 . . 𝑗 + 𝑛/2]. If the algorithm answers a number > 0, Bob
claims 𝑥_𝑗 = 1; otherwise he claims 𝑥_𝑗 = 0.
    For correctness, note that after Bob’s padding updates, the only coordinate of 𝑠_𝑗 whose
update has not expired is 𝑗 itself (Bob’s updates go to coordinate 𝑗 − 1 ∉ 𝑠_𝑗), so the query
reveals 𝑥_𝑗 exactly as in the previous proof. This proves an Ω(min(𝑛, 𝑊)) space lower bound
for the algorithm. □

2.5   Streaming algorithm with “for all” guarantee
We now show how to extend our algorithm to achieve the “for all” guarantee using space
Õ_𝜀(HHdim(𝒮)²). The key is to establish a connection between the heavy-hitter dimension and
the VC-dimension (of every set system 𝒮); the algorithm then follows by the standard
technique of amplifying the success probability via independent repetitions.
    Recall that the VC-dimension of 𝒮 is defined as the maximum cardinality of a set 𝐴 ⊂ [𝑛]
that is shattered, where 𝐴 is called shattered if every subset of 𝐴 can be realized as 𝐴 ∩ 𝑆 for some
𝑆 ∈ 𝒮. The heavy-hitter dimension can be defined analogously, by modifying the definition
of being shattered to the following: for every element 𝑎 ∈ 𝐴, there is 𝑆 ∈ 𝒮 such that 𝑆 ∩ 𝐴 = {𝑎}. It
then follows easily that VCdim(𝒮) ≤ HHdim(𝒮), as proved formally in Proposition 2.18 below.
However, the gap between them cannot be bounded by any fixed factor: for instance, when 𝒮
consists of 𝑘 ≤ 𝑛 singleton sets, VCdim(𝒮) = 1 whereas HHdim(𝒮) = 𝑘.
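
To see the gap concretely, the following brute-force computation (ours, using the combinatorial
reformulation above; the function names are our own) reports VCdim = 1 and HHdim = 𝑘 for 𝑘
singletons:

import itertools

def vcdim(family, n):
    """Largest |A| such that every subset of A equals A & s for some s."""
    best = 0
    for size in range(n + 1):
        for A in itertools.combinations(range(n), size):
            A = frozenset(A)
            if len({A & s for s in family}) == 2 ** len(A):
                best = size
    return best

def hhdim(family, n):
    """Largest |A| such that each a in A satisfies A & s == {a} for some s."""
    best = 0
    for size in range(n + 1):
        for A in itertools.combinations(range(n), size):
            A = frozenset(A)
            if all(any(A & s == {a} for s in family) for a in A):
                best = size
    return best

n, k = 6, 4
singletons = [frozenset([i]) for i in range(k)]
print("VCdim:", vcdim(singletons, n))    # expected: 1
print("HHdim:", hhdim(singletons, n))    # expected: k = 4
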
Proposition 2.18. Let 𝒮 ⊂ 2[𝑛] , and denote its VC-dimension by VCdim(𝒮). Then
                                       VCdim(𝒮) ≤ HHdim(𝒮).
Proof. We show that 𝒮 cannot shatter any set of size HHdim(𝒮) + 1. Suppose towards a
contradiction that 𝒮 shatters a set 𝑆 ⊂ [𝑛] with |𝑆| = HHdim(𝒮) + 1. Then for each 𝑎 ∈ 𝑆 there is
a set 𝑠_𝑎 ∈ 𝒮 such that 𝑠_𝑎 ∩ 𝑆 = {𝑎}. Taking 𝑣 ∈ {0, 1}^𝑛 to be the indicator vector of 𝑆, we get
supp(𝑠_𝑎 ◦ 𝑣) = {𝑎} for every 𝑎 ∈ 𝑆, hence |𝐻(𝒮, 𝑣)| ≥ |𝑆|. This in turn indicates
HHdim(𝒮) ≥ HHdim(𝒮) + 1, a contradiction. □
Lemma 2.19 (Sauer-Shelah Lemma [43, 46]). Every 𝒮 ⊂ 2[𝑛] with VC-dimension 𝑘 has cardinality
|𝒮| = 𝑂(𝑛 𝑘 ).
    The next theorem achieves the “for all” guarantee by standard amplification, namely, by
reporting the median value of 𝑂(log|𝒮|) independent repetitions, and using the above to bound
this number of repetitions.
Theorem 2.20. For every 𝒮 ⊂ {0, 1}^𝑛 and 𝜀 ∈ (0, 1) there is a one-pass streaming algorithm that,
when run on an insertion-only stream accumulating to 𝑣 ∈ ℝ^𝑛, uses 𝑂((𝜀^{−2} + log log 𝑛) · HHdim(𝒮)² ·
log² 𝑛) words of space, and then when queried for 𝑠 ∈ 𝒮, its output 𝑧̂(𝑠) satisfies

                              Pr[∀𝑠 ∈ 𝒮, 𝑧̂(𝑠) ∈ (1 ± 𝜀) ‖𝑠 ◦ 𝑣‖₀] ≥ 0.9.

Proof. We run the algorithm of Theorem 2.10 𝑂(log|𝒮|) times independently in parallel, and
answer any query with the median value over all repetitions. For the space bound, we use
Lemma 2.19 and Proposition 2.18 to bound the number of repetitions by

                    𝑂(log|𝒮|) ≤ 𝑂(VCdim(𝒮) · log 𝑛) ≤ 𝑂(HHdim(𝒮) · log 𝑛).


For the desired error bound, we use a union bound over all 𝑠 ∈ 𝒮: by a standard Chernoff bound
and a suitable choice of constants, the error probability for any fixed query 𝑠 ∈ 𝒮 is at most
2^{−Ω(log|𝒮|)} ≤ 0.1/|𝒮|. □
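
The amplification is easy to see numerically; the following sketch (ours, with made-up
estimator parameters) shows that the median of 𝑡 independent estimates, each correct with
probability 0.9, is almost always within the desired range:

import random
import statistics

def noisy_estimate(truth, eps=0.1, p_fail=0.1):
    """A (1 +- eps)-estimate that fails (returns a wild value) w.p. p_fail."""
    if random.random() < p_fail:
        return truth * random.uniform(2.0, 3.0)
    return truth * random.uniform(1 - eps, 1 + eps)

truth, eps, t, trials = 100.0, 0.1, 31, 10000
good = sum(
    abs(statistics.median(noisy_estimate(truth, eps) for _ in range(t)) - truth)
    <= eps * truth
    for _ in range(trials)
)
print(f"median of {t} copies within (1 +- {eps}): {good / trials:.4f}")  # ~1.0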

2.6   Generalizing the algorithm to the subset-ℓ1 norm
In the insertion-only streaming model, the ℓ₁ norm of a vector 𝑣 is simply the sum of all updates
(effectively without absolute values). We can therefore reduce ℓ₁ to ℓ₀ and obtain an algorithm
for subset-ℓ₁, as follows. Assuming that the stream length 𝑚 is bounded by 𝑚 = poly(𝑛), we
convert each update 𝑎_𝑗 ∈ [𝑛] (the 𝑗-th item of the stream) to an update of the form (𝑎_𝑗, 𝑗) to
𝑣′ ∈ ℝ^{𝑛×𝑚}, a binary vector in a larger dimension. It is easy to verify that ‖𝑣′‖₀ = ‖𝑣‖₁. We also
convert the set system 𝒮 ⊂ 2^[𝑛] to the new universe as follows: for each 𝑠 ∈ 𝒮, we expand each of
its entries 𝑠_𝑖 to 𝑚 duplicates of 𝑠_𝑖, which yields a new set system 𝒮′ ⊂ 2^{[𝑛]×[𝑚]}. The following
lemma shows that the new set system 𝒮′ has the same heavy-hitter dimension as 𝒮.
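
A minimal sketch (ours) of this reduction, with exact sets standing in for the streaming
machinery, illustrates why ‖𝑣′‖₀ = ‖𝑣‖₁ and how a query set expands:

from collections import Counter

n, stream = 5, [2, 0, 2, 4, 2, 0]        # updates a_1..a_m over [n] (0-indexed)

v = Counter(stream)                      # the frequency vector v
v_prime = {(a_j, j) for j, a_j in enumerate(stream)}   # supp(v') in [n] x [m]
assert len(v_prime) == sum(v.values())   # ||v'||_0 == ||v||_1

s = {0, 2}                               # a set in S; s' = s x [m]
subset_l0 = sum(1 for (i, j) in v_prime if i in s)
subset_l1 = sum(v[i] for i in s)
print(subset_l0, subset_l1)              # both equal 5
assert subset_l0 == subset_l1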

Lemma 2.21. HHdim(𝒮′) = HHdim(𝒮).

Proof. We first show HHdim(𝒮) ≤ HHdim(𝒮′). For each vector 𝑣 ∈ ℝ^𝑛, we can pad each
coordinate with 𝑚 − 1 zeros to obtain a vector 𝑣′ ∈ ℝ^{𝑛×𝑚}; then |𝐻(𝒮, 𝑣)| = |𝐻(𝒮′, 𝑣′)|, and thus
HHdim(𝒮) ≤ HHdim(𝒮′). We now show the other direction, HHdim(𝒮′) ≤ HHdim(𝒮). For
any vector 𝑣′ ∈ ℝ^{𝑛×𝑚}, suppose an 𝑠′ ∈ 𝒮′ satisfies supp(𝑠′ ◦ 𝑣′) = {(𝑖, 𝑗)} for some (𝑖, 𝑗) ∈ [𝑛] × [𝑚].
Then it must be that 𝑣′_{(𝑖,𝑗′)} = 0 for every 𝑗′ ≠ 𝑗, since (𝑖, 𝑗′) ∈ 𝑠′ for all 𝑗′ ∈ [𝑚]. Now suppose
|𝐻(𝒮′, 𝑣′)| = 𝑘; then there exist 𝑘 distinct indices 𝑖₁, 𝑖₂, . . . , 𝑖_𝑘 ∈ [𝑛] and 𝑘 indices
𝑗₁, 𝑗₂, . . . , 𝑗_𝑘 ∈ [𝑚] such that for each 𝑙 ∈ [𝑘] there exists 𝑠′_𝑙 ∈ 𝒮′ with supp(𝑠′_𝑙 ◦ 𝑣′) = {(𝑖_𝑙, 𝑗_𝑙)}.
Consider the corresponding vector 𝑣 ∈ ℝ^𝑛 with 𝑣_{𝑖_𝑙} = 1 for 𝑙 = 1, 2, . . . , 𝑘 and 0 elsewhere. Then
for each 𝑙 ∈ [𝑘] we have supp(𝑠_𝑙 ◦ 𝑣) = {𝑖_𝑙}, where 𝑠_𝑙 ∈ 𝒮 is the set corresponding to 𝑠′_𝑙. Hence
HHdim(𝒮′) ≤ HHdim(𝒮). □

    We can now apply our algorithm for subset-ℓ₀ to 𝑣′, and obtain an algorithm for the subset-ℓ₁
norm.

Theorem 2.22. There is an algorithm that, when run on an insertion-only stream that accumulates to
𝑣 ∈ ℝ^𝑛 and has length poly(𝑛), makes one pass using 𝑂((𝜀^{−2} + log log 𝑛) log 𝑛 · HHdim(𝒮)) words of
space, and then when queried for 𝑠 ∈ 𝒮, its output 𝑧̂(𝑠) satisfies

                              ∀𝑠 ∈ 𝒮,    Pr[𝑧̂(𝑠) ∈ (1 ± 𝜀) ‖𝑠 ◦ 𝑣‖₁] ≥ 0.9.

2.7   The entrywise update model
In this section, we show an algorithm that computes subset-ℓ_𝑝 norms in the entrywise update
model. In this model, the algorithms for all 𝑝 ≥ 0 are essentially equivalent, because when an
entry 𝑣_𝑖 arrives, the algorithm can simply raise it to the power 𝑝 for free. Therefore, we present
only an algorithm for the subset-ℓ₁ problem, and the algorithms for all other 𝑝 follow
automatically. Unlike the priority sampling algorithm of [48, 1, 26], we simply reduce each
entrywise update to a sequence of updates to an insertion-only stream of a higher-dimensional
vector.
    Specifically, we assume that all entrywise updates are integers bounded by 𝑚 = poly(𝑛),
which ensures that every entry can be stored using 𝑂(log 𝑛) bits. Given an entrywise update,
say to entry 𝑣_𝑖 of a vector 𝑣 ∈ ℝ^𝑛, our algorithm converts it to a sequence of 𝑣_𝑖 updates to a
vector 𝑣′ ∈ ℝ^{𝑛×𝑚}; similarly to Section 2.6, these updates are of the form ((𝑖, 𝑗), 1) for
𝑗 = 1, 2, . . . , 𝑣_𝑖, and we also convert the set system 𝒮 ⊂ 2^[𝑛] to a corresponding set system
𝒮′ ⊂ 2^{[𝑛]×[𝑚]}. By applying Theorem 2.22, we obtain the following corollary.
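
For concreteness, a small sketch (ours) of the conversion from one entrywise update to a block
of unit insertions:

def expand_entrywise(i, value):
    """One entrywise update (i, value) -> unit updates ((i, 1), ..., (i, value))."""
    return [((i, j), 1) for j in range(1, value + 1)]

updates = [(0, 3), (2, 1), (4, 2)]       # each coordinate updated at most once
unit_stream = [u for (i, val) in updates for u in expand_entrywise(i, val)]

# ||v'||_0 over the expanded universe equals the sum of the entrywise values.
assert len({coord for (coord, _) in unit_stream}) == sum(val for _, val in updates)
print(unit_stream)
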
Corollary 2.23. Let 𝒮 ⊂ 2^[𝑛] be an arbitrary set system, and let 𝑝 ≥ 0 and 𝜀 ∈ (0, 1). In the entrywise
update model of a vector 𝑣 ∈ ℝ^𝑛, there exists an algorithm that uses 𝑂((𝜀^{−2} + log log 𝑛) log 𝑛 ·
HHdim(𝒮)) words of space and, for each query 𝑠 ∈ 𝒮, outputs a (1 ± 𝜀)-approximation to ‖𝑠 ◦ 𝑣‖_𝑝 with
probability at least 0.9.


3    Additive error for subset-ℓ 𝑝 norms
In this section we design additive-error approximation algorithms for subset norms (in contrast
to the multiplicative approximation of the preceding section). Consider a set system 𝒮 ⊂ 2^[𝑛]. If
𝒮 includes the all-ones vector (i. e., the set [𝑛]), then the space complexity of (𝜀‖𝑣‖_𝑝)-additive
approximation of the subset-ℓ_𝑝 norm is clearly at least that of (1 + 𝜀)-multiplicative approximation
of the ℓ_𝑝 norm (simply by considering the entire vector 𝑣). We present an algorithm that matches
this lower bound, up to poly(𝜀^{−1} log 𝑛) factors, for every 𝑝 ∈ (0, ∞). It works for all possible
subsets, i. e., the set system 𝒮 = 2^[𝑛].
Theorem 3.1. Given 𝑝 ∈ (0, ∞) and 𝜀 ∈ (0, 1), Algorithm 4 makes a single pass over a data stream of
additive updates to a vector 𝑣 ∈ ℝ^𝑛, and outputs a function 𝐹 : {0, 1}^𝑛 → ℝ that satisfies

                    ∀𝑠 ∈ {0, 1}^𝑛,    Pr[𝐹(𝑠) ∈ ‖𝑠 ◦ 𝑣‖_𝑝 ± 2𝜀 ‖𝑣‖_𝑝] ≥ 0.75,

where the probability is over the algorithm’s randomness. Moreover, the space complexity of the algorithm
and the function 𝐹 is 𝑂_𝑝(𝜀^{−3} polylog(𝑛)) bits for 𝑝 ≤ 2 and 𝑂_𝑝(𝜀^{−3} 𝑛^{1−2/𝑝} polylog(𝑛)) bits for 𝑝 > 2.
   At a high level, Algorithm 4 follows the framework developed in [36, 5, 20], where every
coordinate is scaled at random and then the Count-Sketch algorithm is used to find heavy-hitters.
We employ a specific simple method established in [20], and thus need the following definition
and lemma from their work.
Definition 3.2 (𝛼-inverse distribution [20]). Let 𝛼 ∈ (0, ∞) be a parameter. The distribution of a
random variable 𝑋 ∈ ℕ = {1, 2, . . .} is called 𝛼-inverse if

                    ∀𝑥 ∈ ℕ,    Pr[𝑋 < 𝑥] = 1 − 1/𝑥^𝛼.                                        (3.1)
Lemma 3.3 (Lemma 4 in [20]). Let 𝑝 ∈ (0, ∞) and 𝜀 ∈ (0, 1). Let 𝑘 ≥ 𝑐_𝑝 𝜀^{−2} be an even integer for a
large enough 𝑐_𝑝 > 0 that depends only on 𝑝. Let 𝑋 ∈ ℝ^{𝑘×𝑛} be a random matrix, whose entries have a
𝑝-inverse distribution and are pairwise independent. Given 𝑣 ∈ ℝ^𝑛, let 𝑋 ◦ 𝑣 ∈ ℝ^{𝑘×𝑛} be the matrix
given by (𝑋 ◦ 𝑣)_{𝑖,𝑗} := 𝑋_{𝑖,𝑗} 𝑣_𝑗. Let 𝑍 be the (𝑘/2)-largest entry in absolute value in 𝑋 ◦ 𝑣. Then

                              Pr[2^{−1/𝑝} |𝑍| ∈ (1 ± 𝜀) ‖𝑣‖_𝑝] ≥ 0.9.
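
The estimator is easy to simulate; the sketch below (ours) uses fully independent entries, which
is stronger than the pairwise independence the lemma requires, and stores 𝑋 ◦ 𝑣 explicitly:

import random

def sample_p_inverse(p):
    u = random.random()
    while u == 0.0:
        u = random.random()
    return int(u ** (-1.0 / p))

p, n, eps = 1.0, 200, 0.2
k = int(40 / eps ** 2)                   # k = Theta(eps^-2); here k = 1000
v = [random.uniform(0, 1) for _ in range(n)]
true_norm = sum(abs(t) ** p for t in v) ** (1 / p)

# Entries of X o v for a k x n matrix X with p-inverse entries.
entries = [sample_p_inverse(p) * v[j] for _ in range(k) for j in range(n)]
Z = sorted(map(abs, entries), reverse=True)[k // 2 - 1]   # (k/2)-largest
print(f"2^(-1/p)|Z| = {2 ** (-1 / p) * Z:.2f}  vs  ||v||_p = {true_norm:.2f}")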

    We henceforth define 𝑉 := 𝑋 ◦ 𝑣 ∈ ℝ^{𝑘𝑛}, viewing it as a vector by simply flattening the 𝑘 × 𝑛
matrix. Throughout, let 𝑉tail(𝑘) denote the vector obtained from 𝑉 by zeroing the 𝑘 entries of
largest absolute value. Thus,

                              ‖𝑉tail(𝑘)‖₂² := Σ_{𝑗=𝑘+1}^{𝑘𝑛} 𝑉_[𝑗]²,

where 𝑈_[𝑗] denotes the 𝑗-largest coordinate in absolute value of a vector 𝑈.
    Another ingredient is the famous Count-Sketch algorithm of [23], but we need a well-known
and slightly stronger guarantee from [25], as follows.

Proposition 3.4 (Lemma 7 in [25]). There is a one-pass algorithm, Count-Sketch, with parameters
𝑘 ∈ ℕ and 𝜀′ ∈ (0, 1), that given a stream of additive updates to a vector 𝑉 ∈ ℝ^{𝑛′}, uses space of
𝑂(𝑘/𝜀′² · log³ 𝑛′) bits to output an estimate 𝑉̂ ∈ ℝ^{𝑛′}, such that with high probability 1 − 1/𝑛′²,

                              ‖𝑉̂ − 𝑉‖_∞ ≤ (𝜀′/√𝑘) · ‖𝑉tail(𝑘)‖₂.
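
For reference, a minimal Count-Sketch implementation in Python (ours; Python's built-in hash
of seeded tuples stands in for the pairwise-independent hash families of [23], and the
parameters are illustrative rather than tuned to the guarantee above):

import random
import statistics
from collections import defaultdict

class CountSketch:
    """d rows of w buckets; a point query is the median of signed counters."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.tables = [defaultdict(float) for _ in range(d)]
        self.seeds = [(rng.random(), rng.random()) for _ in range(d)]

    def _hashes(self, i, row):
        bucket = hash((self.seeds[row][0], i)) % self.w
        sign = 1 if hash((self.seeds[row][1], i)) % 2 == 0 else -1
        return bucket, sign

    def update(self, i, delta):
        for row in range(self.d):
            bucket, sign = self._hashes(i, row)
            self.tables[row][bucket] += sign * delta

    def query(self, i):
        return statistics.median(
            sign * self.tables[row][bucket]
            for row in range(self.d)
            for bucket, sign in [self._hashes(i, row)]
        )

cs = CountSketch(w=64, d=7)
vec = {1: 100.0, 2: 80.0, **{i: random.uniform(-1, 1) for i in range(10, 500)}}
for i, val in vec.items():
    cs.update(i, val)
print(cs.query(1), cs.query(2))          # close to 100 and 80
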
    We can now present Algorithm 4, which is used in Theorem 3.1. The idea is to use
the estimator from Lemma 3.3, but in order to save space, instead of storing 𝑉 = 𝑋 ◦ 𝑣 ∈ ℝ^{𝑘𝑛}
explicitly, the algorithm estimates this vector using Count-Sketch (where the parameters 𝑘
and 𝜀′ of Proposition 3.4 are set to 𝑘 and 𝜀′(𝑝, 𝜀, 𝑛) given in line 5 of Algorithm 4). We write 𝑉|_𝑠,
for 𝑠 ∈ {0, 1}^𝑛 (representing 𝑠 ⊂ [𝑛]), to denote the vector obtained by restricting 𝑉 to the entries
indexed by (𝑖, 𝑗) with 𝑖 ∈ [𝑘] and 𝑗 ∈ 𝑠 (i. e., zeroing all other entries).
    Before proving Theorem 3.1, we need the next lemma to bound the error of the Count-Sketch
algorithm in our setting. Its proof follows from a direct calculation and appears in Section 3.1.
Lemma 3.5. Let 𝑉 := 𝑋 ◦ 𝑣 ∈ ℝ^{𝑘𝑛} be a vector defined as in Lemma 3.3. Then with probability at least
0.9,

                                   ⎧ 𝑘 ‖𝑣‖_𝑝²                          if 𝑝 < 2,
              ‖𝑉tail(𝑘)‖₂² ≤ 𝐶_𝑝 · ⎨ 𝑘 ‖𝑣‖₂² ≤ 𝑘 ‖𝑣‖_𝑝² · 𝑛^{1−2/𝑝}    if 𝑝 > 2,
                                   ⎩ 𝑘 ‖𝑣‖₂² · log 𝑛                   if 𝑝 = 2,

for a suitable 𝐶_𝑝 > 0 that depends only on 𝑝.
Proof (of Theorem 3.1). We know by Lemma 3.3 that to estimate ‖𝑣 ◦ 𝑠‖_𝑝, it suffices to find the
(𝑘/2)-largest coordinate in absolute value in 𝑉|_𝑠 for 𝑘 = Θ(𝜀^{−2}), and report its absolute value
scaled by 2^{−1/𝑝}. Observe that the input to the Count-Sketch instance CS is exactly the vector
𝑉 = 𝑋 ◦ 𝑣, and thus, with probability at least 0.99, its output vector 𝑉̂ satisfies

                              ‖𝑉̂ − 𝑉‖_∞ ≤ (𝜀′/√𝑘) · ‖𝑉tail(𝑘)‖₂,                             (3.2)


where 𝜀′ is defined in line 5 of Algorithm 4; we assume henceforth that this event occurs.
    Now let 𝑧̂ be as in the algorithm, i. e., the (𝑘/2)-largest absolute value of a coordinate of 𝑉̂|_𝑠,
and let 𝑧 be defined similarly for 𝑉|_𝑠. It follows easily from (3.2) that

                              |𝑧̂ − 𝑧| ≤ (𝜀′/√𝑘) · ‖𝑉tail(𝑘)‖₂.                               (3.3)

Indeed, by definition of 𝑧, at least 𝑘/2 coordinates in 𝑉|_𝑠 have absolute value at least 𝑧, and (3.2)
implies that all the corresponding coordinates in 𝑉̂|_𝑠 have absolute value at least 𝑧 − (𝜀′/√𝑘)‖𝑉tail(𝑘)‖₂,
which implies 𝑧̂ ≥ 𝑧 − (𝜀′/√𝑘)‖𝑉tail(𝑘)‖₂. The other direction is proved similarly.
    We now bound the additive error in (3.3) using Lemma 3.5. Plugging in the value of 𝜀′
(depending on whether 𝑝 < 2, 𝑝 > 2, or 𝑝 = 2) and choosing a sufficiently small 𝑐′_𝑝 > 0, with
probability at least 0.9,

                    (𝜀′/√𝑘) · ‖𝑉tail(𝑘)‖₂ ≤ 𝑐′_𝑝 𝐶_𝑝^{1/2} · 𝜀 ‖𝑣‖_𝑝 ≤ 2^{1/𝑝} · 𝜀 ‖𝑣‖_𝑝,

thus 2^{−1/𝑝} |𝑧̂ − 𝑧| ≤ 𝜀 ‖𝑣‖_𝑝. The claimed overall accuracy now follows, via a union bound, from
this last bound and Lemma 3.3.
    To complete the proof, observe that the space complexity is dominated by that of the
Count-Sketch instance, which is 𝑂(𝑘/𝜀′² · polylog(𝑛)) bits (Proposition 3.4), as required. □


Boosting the probability. The algorithm in Theorem 3.1 is oblivious to the vector 𝑠, and thus
one can boost its success probability by parallel repetition and reporting the median estimate.
In particular, with only an 𝑂(log |𝒮|) factor increase in the space, the additive approximation
will hold simultaneously for all the sets 𝑆 ∈ 𝒮.

3.1   Proof of Lemma 3.5
Proof of Lemma 3.5. We will prove the lemma separately for 𝑝 > 2 and for 𝑝 ≤ 2.

Case 𝑝 > 2: By definition of 𝑉,

        ‖𝑉‖₂² = Σ_{𝑖∈[𝑘], 𝑗∈[𝑛]} 𝑉_{𝑖,𝑗}²        and        𝔼[‖𝑉‖₂²] = Σ_{𝑖∈[𝑘], 𝑗∈[𝑛]} 𝑣_𝑗² 𝔼[𝑋_{𝑖,𝑗}²].

To calculate 𝔼[𝑋_{𝑖,𝑗}²], observe that (3.1) implies Pr[𝑋_{𝑖,𝑗} = 𝑥] = 1/𝑥^𝑝 − 1/(𝑥 + 1)^𝑝 for all 𝑥 ∈ ℕ,
and therefore

        𝔼[𝑋_{𝑖,𝑗}²] = Σ_{𝑥=1}^{∞} 𝑥² (1/𝑥^𝑝 − 1/(𝑥 + 1)^𝑝) = Σ_{𝑥=1}^{∞} 𝑥²/𝑥^𝑝 − Σ_{𝑥=2}^{∞} (𝑥 − 1)²/𝑥^𝑝 ≤ 𝐶′_𝑝 Σ_{𝑥=1}^{∞} 1/𝑥^{𝑝−1} ≤ 𝐶″_𝑝

for some constants 𝐶′_𝑝, 𝐶″_𝑝 > 0 that depend on 𝑝.






Algorithm 4 Additive Subset-ℓ_𝑝 of 𝑣 ∈ ℝ^𝑛
 1: Input: 𝑝 ∈ (0, ∞) and 𝜀 ∈ (0, 1)
 2: Initialize:
 3: 𝑘 ← Θ(𝜀^{−2})                                                          ⊲ 𝑘 is an even integer
 4: generate a random matrix 𝑋 ∈ ℝ^{𝑘×𝑛} whose entries have a 𝑝-inverse distribution and are
    pairwise independent
 5: initialize a Count-Sketch instance CS for a vector 𝑉 ∈ ℝ^{𝑘𝑛} with parameters 𝑘 and

                                 ⎧ 𝜀                   for 𝑝 < 2;
                     𝜀′ = 𝑐′_𝑝 · ⎨ 𝜀/(log 𝑛)^{1/2}     for 𝑝 = 2;
                                 ⎩ 𝜀/𝑛^{1/2−1/𝑝}       for 𝑝 > 2

    for a suitable constant 𝑐′_𝑝 > 0 that depends on 𝑝
 6: Update(𝑖, Δ):
 7: feed the Count-Sketch instance CS with 𝑘 updates:                      ⊲ maintain 𝑉 = 𝑋 ◦ 𝑣

                    ((1, 𝑖), 𝑋_{1,𝑖} Δ), ((2, 𝑖), 𝑋_{2,𝑖} Δ), . . . , ((𝑘, 𝑖), 𝑋_{𝑘,𝑖} Δ).

 8: Query(𝑠 ∈ {0, 1}^𝑛):
 9: let 𝑉̂ be the estimate of 𝑉 ∈ ℝ^{𝑘𝑛} provided by CS
10: let 𝑧̂ be the (𝑘/2)-largest coordinate in absolute value in 𝑉̂|_𝑠
11: return 2^{−1/𝑝} |𝑧̂|
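
The following Python rendering of Algorithm 4 (ours) stores 𝑉 = 𝑋 ◦ 𝑣 explicitly in place of the
Count-Sketch instance CS, so it reproduces the update/query flow and the estimator, but not the
space bound:

import random

def sample_p_inverse(p, rng):
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return int(u ** (-1.0 / p))

class AdditiveSubsetLp:
    """Algorithm 4 with exact storage of V = X o v instead of Count-Sketch."""
    def __init__(self, p, eps, n, seed=0):
        rng = random.Random(seed)
        self.p = p
        self.k = 2 * int(50 / eps ** 2)              # k = Theta(eps^-2), even
        self.X = [[sample_p_inverse(p, rng) for _ in range(n)]
                  for _ in range(self.k)]
        self.V = [[0.0] * n for _ in range(self.k)]

    def update(self, i, delta):                      # Update(i, Delta)
        for row in range(self.k):
            self.V[row][i] += self.X[row][i] * delta

    def query(self, s):                              # Query(s), s non-empty
        vals = sorted((abs(self.V[row][j])
                       for row in range(self.k) for j in s), reverse=True)
        return 2 ** (-1.0 / self.p) * vals[self.k // 2 - 1]

p, eps, n = 1.0, 0.5, 100
alg = AdditiveSubsetLp(p, eps, n)
v = [random.uniform(0, 2) for _ in range(n)]
for i, val in enumerate(v):
    alg.update(i, val)
s = list(range(n // 2))                              # query half the coordinates
exact = sum(abs(v[j]) ** p for j in s) ** (1 / p)
print(f"estimate {alg.query(s):.2f}  vs  ||v o s||_p = {exact:.2f}")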





Thus, 𝔼[‖𝑉‖₂²] ≤ 𝐶″_𝑝 𝑘 ‖𝑣‖₂², and by Markov’s inequality we obtain, as claimed,

                              Pr[‖𝑉‖₂² ≥ 10𝐶″_𝑝 𝑘 ‖𝑣‖₂²] ≤ 1/10,

and since ‖𝑉tail(𝑘)‖₂ ≤ ‖𝑉‖₂, we can also plug in the well-known comparison of norms
‖𝑣‖₂ ≤ ‖𝑣‖_𝑝 · 𝑛^{1/2−1/𝑝}.
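
A quick numeric check (ours) of the finiteness of 𝔼[𝑋²] for 𝑝 > 2: summation by parts gives
𝔼[𝑋²] = Σ_{𝑥≥1} (2𝑥 − 1)/𝑥^𝑝, a convergent series, which the sampler from Definition 3.2 matches
empirically:

import random

p = 4.0
series = sum((2 * x - 1) / x ** p for x in range(1, 10 ** 6))

def sample_p_inverse(p):
    u = random.random()
    while u == 0.0:
        u = random.random()
    return int(u ** (-1.0 / p))

trials = 200000
empirical = sum(sample_p_inverse(p) ** 2 for _ in range(trials)) / trials
print(f"series {series:.4f}  vs  empirical E[X^2] {empirical:.4f}")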

Case 𝑝 ≤ 2: Define the set

                    𝑊 := {(𝑖, 𝑗) ∈ [𝑘] × [𝑛] : |𝑉_{𝑖,𝑗}| ≥ 20^{1/𝑝} ‖𝑣‖_𝑝}.

A simple calculation shows that

        𝔼[|𝑊|] = 𝑘 Σ_{𝑗∈[𝑛]} Pr[|𝑣_𝑗| 𝑋_{𝑖,𝑗} ≥ 20^{1/𝑝} ‖𝑣‖_𝑝] = 𝑘 Σ_{𝑗∈[𝑛]} |𝑣_𝑗|^𝑝 / (20 ‖𝑣‖_𝑝^𝑝) = 𝑘/20.

By Markov’s inequality, the event ℰ = {|𝑊| ≥ 𝑘} has probability Pr[ℰ] ≤ 1/20. When the
complement event ℰ̄ occurs, every coordinate of 𝑉tail(𝑘) is not in 𝑊, i. e., has magnitude smaller
than 20^{1/𝑝} ‖𝑣‖_𝑝. Let us define the random variable

                    𝑅 := Σ_{𝑖∈[𝑘], 𝑗∈[𝑛]} 𝑣_𝑗² 𝑋_{𝑖,𝑗}² · 1{|𝑣_𝑗| 𝑋_{𝑖,𝑗} ≤ 20^{1/𝑝} ‖𝑣‖_𝑝}.


When ℰ̄ occurs, clearly ‖𝑉tail(𝑘)‖₂² ≤ 𝑅, and we thus wish to bound 𝑅 (with high probability).
    To this end, observe that for all 𝑖 ∈ [𝑘], 𝑗 ∈ [𝑛] with 𝑣_𝑗 ≠ 0,

        𝔼[𝑋_{𝑖,𝑗}² · 1{|𝑣_𝑗| 𝑋_{𝑖,𝑗} ≤ 20^{1/𝑝} ‖𝑣‖_𝑝}] = Σ_{𝑥=1}^{20^{1/𝑝} ‖𝑣‖_𝑝 / |𝑣_𝑗|} (𝑥²/𝑥^𝑝 − 𝑥²/(𝑥 + 1)^𝑝)

                                                  ≤ 𝐶′_𝑝 · Σ_{𝑥=1}^{20^{1/𝑝} ‖𝑣‖_𝑝 / |𝑣_𝑗|} 1/𝑥^{𝑝−1}

                                                  ≤ 𝐶″_𝑝 · ⎧ (‖𝑣‖_𝑝 / |𝑣_𝑗|)^{2−𝑝}    if 𝑝 < 2,
                                                           ⎩ log 𝑛                    if 𝑝 = 2,

for some constants 𝐶′_𝑝, 𝐶″_𝑝 > 0 that depend on 𝑝, where we used the fact that 𝑚 = poly(𝑛) and
thus log 𝑚 = 𝑂(log 𝑛). It immediately follows that

                                   ⎧ 𝑘 ‖𝑣‖_𝑝²            if 𝑝 < 2,
                    𝔼[𝑅] ≤ 𝐶″_𝑝 ·  ⎩ 𝑘 ‖𝑣‖₂² · log 𝑛     if 𝑝 = 2.


By Markov’s inequality,

        Pr[𝑅 ≥ 20𝐶″_𝑝 · 𝑘 ‖𝑣‖_𝑝²] ≤ 1/20              for 𝑝 < 2,
 and    Pr[𝑅 ≥ 20𝐶″_𝑝 · 𝑘 ‖𝑣‖₂² log 𝑛] ≤ 1/20         for 𝑝 = 2.

Now, by a union bound, with probability at least 0.9 the event ℰ̄ occurs and the above bound on 𝑅
holds, and then

                                         ⎧ 20𝐶″_𝑝 · 𝑘 ‖𝑣‖_𝑝²           if 𝑝 < 2,
                    ‖𝑉tail(𝑘)‖₂² ≤ 𝑅 <   ⎩ 20𝐶″_𝑝 · 𝑘 ‖𝑣‖₂² log 𝑛      if 𝑝 = 2,

which completes the proof of this case. □


4   Concluding remarks
To conclude, we study the universal streaming problem for subset-ℓ_𝑝 norms, i. e., providing a
single summary of a stream of insertion-only updates to an input vector 𝑣 ∈ ℝ^𝑛 that suffices to
approximate every subset-ℓ₀ norm from a given family 𝒮 ⊂ 2^[𝑛]. (Recall that a subset-ℓ_𝑝 norm of 𝑣
is the ℓ_𝑝 norm of the vector induced by a subset of coordinates 𝑆 ∈ 𝒮.) We prove that the space
complexity of this problem is characterized by the heavy-hitter dimension of the family 𝒮, a notion
that we introduce and define as the maximum number (over all 𝑣 ∈ ℝ^𝑛) of distinct heavy hitters
with respect to all subsets 𝑆 ∈ 𝒮. We further show that this characterization also holds for
subset-ℓ₁ norms in the same insertion-only setting. However, it does not hold in more general
streaming models, namely, the turnstile setting and the sliding-window setting, and thus there
is a strict separation between these models.
    For subset-ℓ_𝑝 norms with general 𝑝, namely, every 𝑝 ∈ (0, ∞) \ {1}, we prove that the heavy-
hitter dimension characterizes the space complexity of universal streaming in the entrywise
update model, where each coordinate of the vector is updated at most once. In the more general
model of insertion-only updates, it remains open whether subset-ℓ_𝑝 norms for 𝑝 ≠ 0, 1 admit
universal streaming with space complexity Õ(HHdim(𝒮)). For example, the major obstacle for
subset-ℓ₂ norms is how to maintain the distinct ℓ₂-heavy-hitters for every subset of coordinates
𝑆 ∈ 𝒮. We leave this problem for future investigations.


References
 [1] Noga Alon, Nick Duffield, Carsten Lund, and Mikkel Thorup: Estimating arbitrary
     subset sums with few probes. In Proc. 24th Symp. on Principles of Database Systems (PODS’05),
     pp. 317–325. ACM Press, 2005. [doi:10.1145/1065167.1065209] 2, 6, 20

 [2] Noga Alon, Yossi Matias, and Mario Szegedy: The space complexity of approximating
     the frequency moments. J. Comput. System Sci., 58(1):137–147, 1999. Preliminary version in
     STOC’96. [doi:10.1006/jcss.1997.1545] 2, 5, 7


 [3] Alexandr Andoni: High frequency moments via max-stability. In Proc. 42nd Internat. Conf.
     on Acoustics, Speech and Signal Processing (ICASSP’17), pp. 6364–6368. IEEE Comp. Soc.,
     2017. [doi:10.1109/ICASSP.2017.7953381] 6, 8
 [4] Alexandr Andoni, Jiecao Chen, Robert Krauthgamer, Bo Qin, David P. Woodruff, and Qin
     Zhang: On sketching quadratic forms. In Proc. 7th Innovations in Theoret. Comp. Sci. Conf.
     (ITCS’16), pp. 311–319. ACM Press, 2016. [doi:10.1145/2840728.2840753, arXiv:1511.06099]
     17
 [5] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak: Streaming algorithms
     via precision sampling. In Proc. 52nd FOCS, pp. 363–372. IEEE Comp. Soc., 2011.
     [doi:10.1109/FOCS.2011.82] 6, 8, 21
 [6] Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar: The sketching
     complexity of pattern matching. In Proc. 8th Internat. Workshop on Randomization and
     Computation (RANDOM’04), pp. 261–272. Springer, 2004. [doi:10.1007/978-3-540-27821-
     4_24] 8, 18
 [7] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar: An information statistics
     approach to data stream and communication complexity. J. Comput. System Sci., 68(4):702–
     732, 2004. Preliminary version in FOCS’02. [doi:10.1016/j.jcss.2003.11.006] 5
 [8] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan: Counting
     distinct elements in a data stream. In Proc. 6th Internat. Workshop on Randomization and
     Computation (RANDOM’02), pp. 1–10. Springer, 2002. [doi:10.1007/3-540-45726-7_1] 7
 [9] Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha: Sim-
     pler algorithm for estimating frequency moments of data streams. In Proc. 17th
     Ann. ACM–SIAM Symp. on Discrete Algorithms (SODA’06), pp. 708–713. SIAM, 2006.
     [doi:10.1145/1109557.1109634] 6
[10] Jarosław Błasiok: Optimal streaming and tracking distinct elements with high prob-
     ability. ACM Trans. Algorithms, 16(1):3:1–28, 2020. Preliminary version in SODA’18.
     [doi:10.1145/3309193, arXiv:1804.01642] 7
[11] Jarosław Błasiok, Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, and
     Lin F. Yang: Streaming symmetric norms via measure concentration. In Proc. 49th STOC,
     pp. 716–729. ACM Press, 2017. [doi:10.1145/3055399.3055424, arXiv:1511.01111] 2
[12] Mark Braverman, Ankit Garg, Denis Pankratov, and Omri Weinstein: Information lower
     bounds via self-reducibility. Theory Computing Sys., 59(2):377–396, 2016. Preliminary
     version in CSR’13. [doi:10.1007/s00224-015-9655-z, ECCC:TR12-177] 17
[13] Vladimir Braverman and Stephen R. Chestnut: Universal sketches for the frequency
     negative moments and other decreasing streaming sums. In Proc. 19th Internat. Workshop
     on Randomization and Computation (RANDOM’15), pp. 591–605. Schloss Dagstuhl–Leibniz-
     Zentrum fuer Informatik, 2015. [doi:10.4230/LIPIcs.APPROX-RANDOM.2015.591] 2


[14] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang,
     and David P. Woodruff: BPTree: An ℓ 2 heavy hitters algorithm using constant memory. In
     Proc. 36th Symp. on Principles of Database Systems (PODS’17), pp. 361–376. ACM Press, 2017.
     [doi:10.1145/3034786.3034798, arXiv:1603.00759] 6

[15] Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, and Lin F. Yang: Streaming
     space complexity of nearly all functions of one variable on frequency vectors. In Proc.
     35th Symp. on Principles of Database Systems (PODS’16), pp. 261–276. ACM Press, 2016.
     [doi:10.1145/2902251.2902282, arXiv:1601.07473] 2, 6

[16] Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger: An
     optimal algorithm for large frequency moments using 𝑂(𝑛 1−2/𝑘 ) bits. In Proc. 18th Internat.
     Workshop on Randomization and Computation (RANDOM’14), pp. 531–544. Schloss Dagstuhl–
     Leibniz-Zentrum fuer Informatik, 2014. [doi:10.4230/LIPIcs.APPROX-RANDOM.2014.531]
     6

[17] Vladimir Braverman and Rafail Ostrovsky: Smooth histograms for sliding windows. In
     Proc. 48th FOCS, pp. 283–293. IEEE Comp. Soc., 2007. [doi:10.1109/FOCS.2007.55] 4

[18] Vladimir Braverman and Rafail Ostrovsky: Zero-one frequency laws. In Proc. 42nd STOC,
     pp. 281–290. ACM Press, 2010. [doi:10.1145/1806689.1806729] 2

[19] Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman: Zero-one laws for sliding
     windows and universal sketches. In Proc. 19th Internat. Workshop on Randomization and Com-
     putation (RANDOM’15), pp. 573–590. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik,
     2015. [doi:10.4230/LIPIcs.APPROX-RANDOM.2015.573] 2

[20] Vladimir Braverman, Emanuele Viola, David P. Woodruff, and Lin F. Yang: Revisiting
     frequency moment estimation in random order streams. In Proc. 45th Internat. Colloq. on
     Automata, Languages, and Programming (ICALP’18), pp. 25:1–14. Schloss Dagstuhl–Leibniz-
     Zentrum fuer Informatik, 2018. [doi:10.4230/LIPIcs.ICALP.2018.25, arXiv:1803.02270] 8,
     21

[21] Amit Chakrabarti, Subhash Khot, and Xiaodong Sun: Near-optimal lower bounds on the
     multi-party communication complexity of set disjointness. In Proc. 18th IEEE Conf. on Comput.
     Complexity (CCC’03), pp. 107–117. IEEE Comp. Soc., 2003. [doi:10.1109/CCC.2003.1214414]
     5, 6

[22] Amit Chakrabarti and Oded Regev: An optimal lower bound on the communication
     complexity of GAP-HAMMING-DISTANCE. SIAM J. Comput., 41(5):1299–1317, 2012.
     [doi:10.1137/120861072] 17

[23] Moses Charikar, Kevin Chen, and Martin Farach-Colton: Finding frequent items in
     data streams. Theoret. Comput. Sci., 312(1):3–15, 2004. Preliminary version in ICALP’02.
     [doi:10.1016/S0304-3975(03)00400-6] 6, 22


[24] Edith Cohen, Nick Duffield, Haim Kaplan, Carstent Lund, and Mikkel Thorup: Algo-
     rithms and estimators for summarization of unaggregated data streams. J. Comput. System
     Sci., 80(7):1214–1244, 2014. [doi:10.1016/j.jcss.2014.04.009] 2

[25] Graham Cormode and S. Muthukrishnan: Combinatorial algorithms for compressed
     sensing. In Proc. 13th Internat. Colloq. on Structural Information and Communication Complexity
     (SIROCCO’06), pp. 280–294. Springer, 2006. [doi:10.1007/11780823_22] 22

[26] Nick Duffield, Carsten Lund, and Mikkel Thorup: Priority sampling for estimation of
     arbitrary subset sums. J. ACM, 54(6):32, 2007. [doi:10.1145/1314690.1314696] 2, 3, 4, 5, 6, 8,
     20

[27] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang:
     On graph problems in a semi-streaming model. Theoret. Comput. Sci., 348(2–3):207–216,
     2005. Preliminary version in ICALP’04. [doi:10.1016/j.tcs.2005.09.013] 2

[28] Philippe Flajolet and G. Nigel Martin: Probabilistic counting algorithms for data base
     applications. J. Comput. System Sci., 31(2):182–209, 1985. [doi:10.1016/0022-0000(85)90041-8]
     7, 14

[29] Arpit Gupta, Rob Harrison, Marco Canini, Nick Feamster, Jennifer Rexford, and
     Walter Willinger: Sonata: Query-driven streaming network telemetry. In Proc.
     31st SIG Data Communication Conf. (SIGCOMM’18), pp. 357–371. ACM Press, 2018.
     [doi:10.1145/3230543.3230555, arXiv:1705.01049] 3

[30] Sariel Har-Peled and Soham Mazumdar: On coresets for 𝑘-means and 𝑘-median clustering.
     In Proc. 36th STOC, pp. 291–300. ACM Press, 2004. [doi:10.1145/1007352.1007400] 2

[31] Piotr Indyk: Stable distributions, pseudorandom generators, embeddings, and data
     stream computation. J. ACM, 53(3):307–323, 2006. Preliminary version in FOCS’00.
     [doi:10.1145/1147954.1147955] 2, 5

[32] Piotr Indyk and David P. Woodruff: Tight lower bounds for the distinct elements problem.
     In Proc. 44th FOCS, pp. 283–288. IEEE Comp. Soc., 2003. [doi:10.1109/SFCS.2003.1238202]
     5, 9, 17

[33] Piotr Indyk and David P. Woodruff: Optimal approximations of the frequency
     moments of data streams. In Proc. 37th STOC, pp. 202–208. ACM Press, 2005.
     [doi:10.1145/1060590.1060621] 2, 5

[34] T. S. Jayram and David P. Woodruff: Optimal bounds for Johnson-Lindenstrauss transforms
     and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26:1–17, 2013.
     Preliminary version in SODA’11. [doi:10.1145/2483699.2483706] 7

[35] Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff: Fast moment
     estimation in data streams in optimal space. In Proc. 43rd STOC, pp. 745–754. ACM Press,
     2011. [doi:10.1145/1993636.1993735, arXiv:1007.4191] 6


[36] Daniel M. Kane, Jelani Nelson, and David P. Woodruff: On the exact space complexity
     of sketching and streaming small norms. In Proc. 21st Ann. ACM–SIAM Symp. on Discrete
     Algorithms (SODA’10), pp. 1161–1178. SIAM, 2010. [doi:10.1137/1.9781611973075.93] 6, 21
[37] Daniel M. Kane, Jelani Nelson, and David P. Woodruff: An optimal algorithm for the
     distinct elements problem. In Proc. 29th Symp. on Principles of Database Systems (PODS’10),
     pp. 41–52. ACM Press, 2010. [doi:10.1145/1807085.1807094] 6, 7
[38] Ilan Kremer, Noam Nisan, and Dana Ron: On randomized one-round communication
     complexity. Comput. Complexity, 8(1):21–49, 1999. Preliminary version in STOC’95, Errata
     in Comput. Complexity 10 (2001), 314–315. [doi:10.1007/s000370050018] 16
[39] Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman:
     One sketch to rule them all: Rethinking network flow monitoring with univmon. In
     Proc. 29th SIG Data Communication Conf. (SIGCOMM’16), pp. 101–114. ACM Press, 2016.
     [doi:10.1145/2934872.2934906] 2
[40] Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson: On data structures
     and asymmetric communication complexity. J. Comput. System Sci., 57(1):37–49, 1998.
     Preliminary version in STOC’95. [doi:10.1006/jcss.1998.1577] 18
[41] Srinivas Narayana, Anirudh Sivaraman, Vikram Nathan, Prateesh Goyal, Venkat Arun,
     Mohammad Alizadeh, Vimalkumar Jeyakumar, and Changhoon Kim: Language-directed
     hardware design for network performance monitoring. In Proc. 30th SIG Data Communication
     Conf. (SIGCOMM’17), pp. 85–98. ACM Press, 2017. [doi:10.1145/3098822.3098829] 3
[42] Jelani Nelson: List of open problems in sublinear algorithms: Problem 30. sublinear.info,
     accessed 2018-06-20. 2
[43] Norbert Sauer: On the density of families of sets. J. Combin. Theory–A, 13(1):145–147, 1972.
     [doi:10.1016/0097-3165(72)90019-2] 19
[44] Vyas Sekar, Nick G. Duffield, Oliver Spatscheck, Jacobus E. van der Merwe, and Hui
     Zhang: LADS: Large-scale automated DDoS detection system. In Proc. 12th USENIX
     Annual Technical Conference, General Track (UATC’06), pp. 171–184. USENIX Association,
     2006. USENIX. 3
[45] Vyas Sekar, Michael K. Reiter, and Hui Zhang: Revisiting the case for a minimalist
     approach for network flow monitoring. In Proc. 10th SIGCOMM Internet Measurement Conf.
     (IMC’10), pp. 328–341. ACM Press, 2010. [doi:10.1145/1879141.1879186] 2
[46] Saharon Shelah: A combinatorial problem; stability and order for models and theories in
     infinitary languages. Pacific J. Math, 41(1):247–261, 1972. [doi:10.2140/pjm.1972.41.247] 19
[47] Anshumali Shrivastava, Arnd Christian Konig, and Mikhail Bilenko: Time adaptive
     sketches (ada-sketches) for summarizing data streams. In Proc. 41st Internat. Conf. on Manage-
     ment of Data (SIGMOD’16), pp. 1417–1432. ACM Press, 2016. [doi:10.1145/2882903.2882946]
     3


[48] Mario Szegedy: The DLT priority sampling is essentially optimal. In Proc. 38th STOC, pp.
     150–158. ACM Press, 2006. [doi:10.1145/1132516.1132539] 2, 4, 5, 6, 8, 20

[49] Daniel Ting: Data sketches for disaggregated subset sum and frequent item estimation. In
     Proc. 44th Internat. Conf. on Management of Data (SIGMOD’18), pp. 1129–1140. ACM Press,
     2018. [doi:10.1145/3183713.3183759, arXiv:1709.04048] 2, 3, 6

[50] David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P. Chakkappen: Join
     size estimation subject to filter conditions. Proc. VLDB Endowment, 8(12):1530–1541, 2015.
     [doi:10.14778/2824032.2824051] 3

[51] David P. Woodruff: Optimal space lower bounds for all frequency moments. In Proc. 15th
     Ann. ACM–SIAM Symp. on Discrete Algorithms (SODA’04), pp. 167–175. SIAM, 2004. ACM
     DL. 5, 7


AUTHORS

     Vladimir Braverman
     Associate professor
     Department of Computer Science
     Johns Hopkins University
     Baltimore, Maryland, USA
     vova cs jhu edu
     https://www.cs.jhu.edu/~vova/


     Robert Krauthgamer
     Professor
     Department of Computer Science and Applied Mathematics
     The Weizmann Institute of Science
     Rehovot, Israel
     robert.krauthgamer weizmann ac il
     http://www.wisdom.weizmann.ac.il/~robi/


     Lin F. Yang
     Assistant Professor
     Department of Electrical and Computer Engineering
     University of California, Los Angeles
     Los Angeles, California, USA
     linyang ee ucla edu
     http://drlinyang.net





ABOUT THE AUTHORS

    Vladimir Braverman is an associate professor in the Department of Computer Science
       at Johns Hopkins University, with a secondary appointment in the Department
       of Applied Mathematics and Statistics. He is a member of the Algorithms and
       Complexity group, the Johns Hopkins Mathematical Institute for Data Science
       (MINDS), the Institute for Data Intensive Engineering and Science (IDIES) and
       the Machine Learning group. Braverman graduated from UCLA; before that, he
       was leading a research group at HyperRoll, a startup that was acquired by Oracle
       in 2009. Braverman’s research interests include efficient sublinear algorithms
       (such as sketches and coresets) and their applications to data science. He is a
       recipient of an NSF CAREER award, a Google Faculty Award and a Cisco Faculty
       Award.


    Robert Krauthgamer (called “Robi” by his friends and colleagues) received his
       Ph. D. at the Weizmann Institute of Science in 2001 under Uriel Feige. He was
       subsequently a postdoc in Berkeley’s theory group, and then a Research Staff
       Member at the theory group in the IBM Almaden Research Center. Since 2007,
       he has been a faculty member at the Weizmann Institute of Science. Robi’s main
       research area is the design of algorithms for problems involving combinatorial
       optimization, finite metric spaces, high-dimensional geometry, data analysis, and
       related areas. His favorite sport since youth has been swimming. Once he swam
       across the Sea of Galilee in a 10 km competitive race, and was the last one to
       arrive at the finish line.


    Lin F. Yang is an assistant professor in the Electrical and Computer Engineering
       Department at the University of California, Los Angeles. His research focuses on
       developing and applying fast algorithms for machine learning and data science.
       His current research focuses on reinforcement learning theory and applications,
       learning for control, non-convex optimization, and streaming algorithms. He
       received a Ph. D. in physics and a Ph. D. in computer science from Johns Hopkins
       University and was a postdoctoral fellow at the Department of Operations
       Research and Financial Engineering, Princeton University. He was a recipient of
       the Simons Research Fellowship, the Dean Robert H. Roy Fellowship, and the
       JHU MINDS best dissertation award.



