Theory of Computing, Volume 13 (14), 2017, pp. 1–18
www.theoryofcomputing.org
Special Issue: APPROX-RANDOM 2015
A Randomized Online Quantile Summary
in O((1/ε) log(1/ε)) Words
David Felber Rafail Ostrovsky∗
Received November 18, 2015; Revised May 16, 2017; Published November 14, 2017
Abstract: A quantile summary is a data structure that approximates to ε error the order
statistics of a much larger underlying dataset.
In this paper we develop a randomized online quantile summary for the cash register
data input model and comparison data domain model that uses O((1/ε) log(1/ε)) words
of memory. This improves upon the previous best upper bound of O((1/ε) log^{3/2}(1/ε)) by
Agarwal et al. (PODS 2012). Further, by a lower bound of Hung and Ting (FAW 2010) no
deterministic summary for the comparison model can outperform our randomized summary in
terms of space complexity. Lastly, our summary has the nice property that O((1/ε) log(1/ε))
words suffice to ensure that the success probability is at least 1 − exp(−poly(1/ε)).
ACM Classification: F.2.2, E.1, F.1.2
AMS Classification: 68W20, 68W25, 68W27, 68P05
Key words and phrases: algorithms, data structures, data stream, approximation, approximation algo-
rithms, online algorithms, randomized, summary, quantiles, RANDOM
A conference version of this paper appeared in the Proceedings of the 19th Internat. Workshop on Randomization and
Computation (RANDOM 2015).
∗ Research supported in part by NSF grants CCF-0916574; IIS-1065276; CCF-1016540; CNS-1118126; CNS-1136174;
US-Israel BSF grant 2008411, OKAWA Foundation Research Award, IBM Faculty Research Award, Xerox Faculty Research
Award, B. John Garrick Foundation Award, Teradata Research Award, and Lockheed-Martin Corporation Research Award. This
material is also based upon work supported by the Defense Advanced Research Projects Agency through the U.S. Office of
Naval Research under Contract N00014-11-1-0392. The views expressed are those of the author and do not reflect the official
policy or position of the Department of Defense or the U.S. Government.
© 2017 David Felber and Rafail Ostrovsky
Licensed under a Creative Commons Attribution License (CC-BY). DOI: 10.4086/toc.2017.v013a014
1 Introduction
A quantile summary S is a fundamental data structure that summarizes an underlying dataset X of size n,
in space much less than n. Given a query φ, S returns an element y of X such that the rank of y in X is (probably) approximately φn. Quantile summaries are used in sensor networks to aggregate data in an
energy-efficient manner and in database query optimizers to generate query execution plans.
1.1 The model
Quantile summaries have been developed for a variety of different models and metrics. The data input
model we consider is the standard online cash register streaming model. In this model there is a single
machine at which computation takes place. Attached to the machine is a data stream that feeds items
to the machine one at a time online. Time is measured in stream items received, so that the first item x1
arrives at t = 1, the second at time t = 2, and so forth. After each xt arrives, the machine may do any
processing that it needs to do. Timestep t does not end, and time t + 1 does not start, until this processing
is complete. The set of all stream items received through time t is denoted Xt or X(t). Xt can also be
viewed as a (multi)set of the t-item prefix of the infinite stream X of all stream items that will be fed to the
machine. In addition to receiving the data stream, the machine is also able to accept and answer queries.
After stream item xt arrives, the machine can be given queries φ , to which it responds within the same
timestep t. The next item xt+1 does not arrive until any queries for timestep t have completed. Further, for
randomized algorithms, the machine has a source of randomness from which it can sample random bits.
The data domain model we consider is the comparison model, in which stream items come from
an arbitrary totally ordered domain D of stream items. D is specifically not required to be the integers.
Stream items are indivisible tokens. The only operations permitted on stream items are to compare them
and to copy, move, and delete them. In particular, stream items do not need to have any numeric value,
and the comparison function can be an arbitrarily complex one.
The memory model we consider has two regions of memory. The first region can contain only stream
items. Each copy of a stream item uses one unit of memory. When a new stream item xt arrives from
the data stream it arrives at a set location in this region of memory, and during processing can be copied
to other locations in this region. When the next stream item xt+1 arrives, if xt was not copied, then xt+1
overwrites the only copy of xt , and therefore xt is lost and cannot be retrieved. Stream items can also be
deleted from this region, in which case they are also lost and cannot be retrieved.
The second region of memory is a scratch pad or bookkeeping area that can store numeric tokens
and perform operations with and on those tokens. In particular, the operations permitted on the tokens
are arithmetic, comparison, referencing locations of stream items in the first region of memory, and
referencing locations of numeric tokens in the second region of memory. Also, to support randomness,
the machine can in one operation sample a random bit into a fixed numeric token location in this second
memory region. Each numeric token can take on a value in {0, 1, 2, . . . , max}, where max is O(t) for a
constant number of those tokens and O(poly(1/ε)) for the rest of the tokens.
For simplicity, we assume that there are items inf D and sup D in D. If not, we can define two new
items inf D and sup D that are lesser and greater than any item in D, and simulate them in the scratch pad.
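As an illustration, the sentinel simulation is straightforward even though D supports only comparisons. The following Python sketch is ours (the paper only posits that such sentinels can be simulated in the scratch pad); it shows one comparison-only implementation.

    import functools

    @functools.total_ordering
    class Sentinel:
        """A comparison-only stand-in for inf D (lo=True) or sup D (lo=False)."""
        def __init__(self, lo):
            self.lo = lo
        def __eq__(self, other):
            return isinstance(other, Sentinel) and self.lo == other.lo
        def __lt__(self, other):
            if isinstance(other, Sentinel):
                return self.lo and not other.lo
            # inf D is below every domain item; sup D is below none
            return self.lo

    INF_D, SUP_D = Sentinel(True), Sentinel(False)
    # INF_D < x < SUP_D then holds for every item x of any totally ordered domain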
1.2 The problem
Our problem is defined by an error parameter ε ≤ 1/2. The goal is to maintain in the machine at all times
t a quantile summary St of the dataset Xt . A quantile summary is a data structure with two operations.
The first is an update operation, which takes a summary St−1 and a stream item xt and returns an updated summary St. This update operation is performed in the processing phase of each timestep after stream item xt arrives. The second operation is a query operation, which takes a summary St and a value φ in
(0, 1], and returns a stream item y = y(φ ) from the first memory region (and therefore from Xt ) so that
|R(y, Xt ) − φt| ≤ εt, where R(a, Z) is the rank of item a in set Z, defined as |{z ∈ Z : z ≤ a}|.
For randomized quantile summaries, there is an additional parameter δ , and we only require that for
any time t and any query φ we have that P(|R(y, Xt ) − φt| ≤ εt) ≥ 1 − δ ; that is, y’s rank is only probably
close to φt, not definitely close. For our main result we use a fixed δ = e^{−poly(1/ε)}. It will be easier to
deal with the rank directly, so we define ρ = φt and use that in what follows.
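In code, the correctness requirement reads as the following check (a Python sketch of ours, computing the rank naively against the full prefix Xt):

    def rank(a, zs):
        """R(a, Z) = |{z in Z : z <= a}|."""
        return sum(1 for z in zs if z <= a)

    def within_guarantee(y, x_t, rho, eps):
        """Does the answer y to rank query rho satisfy |R(y, X_t) - rho| <= eps*t?"""
        return abs(rank(y, x_t) - rho) <= eps * len(x_t)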
In either case, the measure of an algorithm for the problem is the space required to store the quantile summary: both the number of stream items stored and the number of numeric tokens stored should be small. The algorithm we develop here uses O((1/ε) log(1/ε)) stream items and
O((1/ε) log(1/ε)) numeric tokens through time t. Calling each of these stream items or numeric tokens
a memory word, our algorithm uses a total of O((1/ε) log(1/ε)) words.
The quantile summary is maintained in memory. Any stream items in the data structure are stored
in the first memory region that contains only stream items. Any bookkeeping information, such as the
number t, is stored in the second memory region. However, to make our algorithm easy to understand we
ignore this distinction in the algorithm’s description.
Lastly, note that the problem of maintaining a small summary is only interesting in the online case,
in which items arrive one by one. Offline, a trivial summary of Xn is the set of 1/ε items at ranks
εn, 2εn, 3εn, . . . , n.
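As a point of reference, that trivial offline summary is a few lines of Python (our sketch; it sorts the whole dataset, which is exactly what a streaming summary cannot do):

    import math

    def offline_summary(xs, eps):
        """Keep the ~1/eps items at ranks eps*n, 2*eps*n, ..., n of sorted X_n."""
        xs = sorted(xs)
        n = len(xs)
        ks = range(1, math.ceil(1 / eps) + 1)
        return [xs[min(n, math.ceil(k * eps * n)) - 1] for k in ks]

    def query(summary, phi):
        """Return the kept item whose rank is within eps*n of phi*n."""
        k = max(1, min(len(summary), round(phi * len(summary))))
        return summary[k - 1]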
1.3 Previous and related work
The two most directly relevant pieces of prior work ([1] and [8]) are randomized online quantile summaries
for the cash register/comparison model. Aside from oblivious sampling algorithms (which require storing
Ω(1/ε 2 ) samples) the only other such work of which we are aware is an approach by Wang, Luo, Yi, and
Cormode [12] that combines the methods of [1] and [8] into a hybrid with the same space bound as [1].
The newer of the two is that of Agarwal, Cormode, Huang, Phillips, Wei, and Yi [1]. Among other
results, Agarwal et al. develop a randomized online quantile summary for the cash register/comparison
model that uses O((1/ε) log3/2 (1/ε)) words of memory. This summary has the nice property that any
two such summaries can be combined to form a summary of the combined underlying dataset without
loss of accuracy or increase in size.
The earlier such summary is that of Manku, Rajagopalan, and Lindsay [8], which uses O((1/ε) log^2(1/ε))
space. At a high level, their algorithm downsamples the input stream in a non-uniform way and feeds the
downsampled stream into a deterministic summary, while periodically adjusting the downsampling rate.
We note here for those familiar with the result of Manku et al. that, while our algorithm at a high level
may appear similar, there are important differences. We defer a discussion of similarities and differences
to Section 4 after the presentation of our algorithm in Section 3.
For the comparison model, the best deterministic online summary to date is the (GK) summary of
Greenwald and Khanna [4], which uses O((1/ε) log(εn)) space. This improved upon a deterministic
(MRL) summary of Manku, Rajagopalan, and Lindsay [7] and a summary implied by Munro and
Paterson [9], which use O((1/ε) log^2(εn)) space.
A more restrictive domain model than the comparison model is the bounded universe model, in which
elements are drawn from the integers {1, . . . , u}. For this model there is a deterministic online summary
by Shrivastava, Buragohain, Agrawal, and Suri [10] that uses O((1/ε) log(u)) space.
Between submission of our paper to Theory of Computing and its acceptance, a new method was
developed by Karnin, Lang, and Liberty [6] that obtains matching upper and lower bounds for the problem
of maintaining a randomized online summary. Their upper bound, O((1/ε) log log(1/δ )), can be viewed
as extending our upper bound of O((1/ε) log(1/ε)) to arbitrary δ , which is particularly valuable when δ
is large compared to e−poly(1/ε) . Prior to [6] the best lower bound was a simple lower bound of Ω(1/ε)
which intuitively comes from the fact that no one element can satisfy more than 2εn different rank queries.
For the deterministic version of the problem there is a lower bound of Ω((1/ε) log(1/ε)) by Hung and
Ting [5].
1.4 Our contributions
In the next section we describe a simple O((1/ε) log(1/ε)) streaming summary that is online except
that it requires n to be given up front and that it is unable to process queries until it has seen a constant
fraction of the input stream. This simple summary is not new (it is mentioned in Wang et al. [12], for
example) but the discussion provides exposition for Section 3, in which we develop this summary into a
fully online summary with the same asymptotic space complexity that can answer queries at any point in
time. At that point we will have proven the following theorem, which constitutes our main result.
Theorem 1.1. There is a randomized online quantile summary algorithm that runs in deterministic
O((1/ε) log(1/ε)) space and guarantees that, for any sequence X of items, for any n, and for any rank
ρ ≤ n, the summary’s response to query ρ just after receiving xn will be an item y ∈ Xn such that
P(|R(y, Xn ) − ρ| ≤ εn) ≥ 1 − exp(−1/ε) .
Finally, we close in Section 4 by examining the similarities and differences between our summary
and previous work and discuss a design approach for similar streaming problems.
2 A simple streaming summary
Before we describe the algorithm we must first describe its two main components in a bit more detail
than was used in the introduction. The two components are Bernoulli sampling and the GK summary [4].
2.1 Bernoulli sampling
Bernoulli sampling downsamples a stream X of size n to a sample stream S by choosing to include each
next item into S with independent probability m/n. (As stated this requires knowing the size of X in
advance.) At the end of processing X, the expected size of S is m, and the expected rank of any sample y in
S is E(R(y, S)) = (m/n)R(y, X). In fact, for any times t ≤ n and partial streams Xt and St , where St is the
sample stream of Xt , we have E(|St |) = mt/n and E(R(y, St )) = (m/n)R(y, Xt ). To generate an estimate
for R(y, Xt ) from St we use R̂(y, Xt ) = (n/m)R(y, St ). The following theorem bounds the probability that
S is very large or that R̂(y, Xt ) is very far from R(y, Xt ). A generalization of this theorem is due to Vapnik
and Chervonenkis [11]; the proof of this special case is a simple known application of Chernoff bounds.
Theorem 2.1. Let 0 ≤ m ≤ n and 0 < ε < 1. For any time t ≥ n/64,
P(|St | > 2tm/n) < exp(−m/192) .
Further, for any time t ≥ n/64 and any item y,
P(|R̂(y, Xt) − R(y, Xt)| > εt/8) < 2 exp(−ε^2 m/12288) .
Proof. For the first part we use a Chernoff bound of

P(|St| > (1 + δ)E(|St|)) < exp(−δ E(|St|)/3) .

Here, E(|St|) = tm/n, so δ = 1, and

P(|St| > 2tm/n) < exp(−tm/3n) < exp(−m/192)

since t ≥ n/64. For the second part,

P(|R̂(y, Xt) − R(y, Xt)| > εt/8) = P(|R(y, St) − E(R(y, St))| > εtm/8n) .

The Chernoff bound is

P(|R(y, St) − E(R(y, St))| > δ E(R(y, St))) < 2 exp(−min{δ, δ^2} E(R(y, St))/3) .

Here, δ = εtm/(8n E(R(y, St))). If δ^2 > δ > 1 then

P < 2 exp(−εtm/24n) ≤ 2 exp(−εm/1536) ≤ 2 exp(−ε^2 m/12288) ,

using t ≥ n/64 and ε < 1. Otherwise, δ^2 ≤ δ ≤ 1, in which case

P < 2 exp(−ε^2 t^2 m^2 /(192 n^2 E(R(y, St)))) ≤ 2 exp(−ε^2 m/12288) ,

since E(R(y, St)) ≤ tm/n and t ≥ n/64, finishing the proof.
This means that, given any 1 ≤ ρ ≤ t, if we choose to return the sample y ∈ St with R(y, St) = ρm/n, then R(y, Xt) is likely to be close to ρ, as long as m is Ω((1/ε^2) log(1/ε)).
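A quick Python illustration of the estimator (ours; n is assumed known in advance, exactly the limitation Section 2.4 returns to):

    import random

    def bernoulli_sample(stream, m, n):
        """Keep each of the n items independently with probability m/n."""
        return [x for x in stream if random.random() < m / n]

    def rank(a, zs):
        """R(a, Z) = |{z in Z : z <= a}|."""
        return sum(1 for z in zs if z <= a)

    n, m = 100000, 5000
    X = list(range(1, n + 1))
    S = bernoulli_sample(X, m, n)
    y = 25000
    est = (n / m) * rank(y, S)      # R_hat(y, X) = (n/m) R(y, S)
    print(rank(y, X), round(est))   # 25000 vs. an estimate concentrated near it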
2.2 GK summary
The GK summary is a deterministic summary that can answer queries to ε error over any portion of
the received stream. Let Gt be the summary after inserting the first t items Xt from stream X into G.
Greenwald and Khanna guarantee in [4] that with only O((1/ε) log(εt)) words, given any 1 ≤ ρ ≤ t, Gt
can return an element y ∈ Xt so that |R(y, Xt ) − ρ| ≤ εt/8. We call this the GK guarantee.
2.3 A simple streaming summary
We combine Bernoulli sampling with the GK summary by downsampling the input data stream X to a
sample stream S and then feeding S into a GK summary G. It looks like this.
Figure 1: The big picture. The input stream X is downsampled at rate m/n to a sample stream S of ≈ m samples, which is fed into a GK(ε/8) summary that answers quantile queries.
The key reason this gives us a small summary is that we never need to store S; each time we sample
an item into S we immediately feed it into G. Therefore, we only use as much space as G(S(Xt )) uses. In
particular, for m = O(poly(1/ε)), we use only O((1/ε) log(1/ε)) words. To answer a query ρ for Xt , we
scale ρ by m/n, ask G(S(Xt )) for that, and return the resulting sample y.
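Concretely, the pipeline can be phrased as below (a Python sketch of ours; the SortedStandIn class is a naive exact summary playing the role of the GK black box, which we use only through insert and query):

    import bisect, random

    class SortedStandIn:
        """Plays the role of GK(eps/8): insert items, answer rank queries.
        It is exact and stores everything; real GK stores far less."""
        def __init__(self):
            self.xs = []
        def insert(self, x):
            bisect.insort(self.xs, x)
        def query(self, rho):
            # return the stored item whose (1-indexed) rank is nearest rho
            i = max(0, min(len(self.xs) - 1, round(rho) - 1))
            return self.xs[i]

    def summarize(stream, n, m):
        """Feed each sampled item straight into G; S itself is never stored."""
        g = SortedStandIn()
        for x in stream:
            if random.random() < m / n:
                g.insert(x)
        return g

    def answer(g, rho, m, n):
        """A rank query rho over X becomes the scaled query rho*m/n over S."""
        return g.query(rho * m / n)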
We formalize this intuition in the following lemma, which combines the ideas in the proof of
Theorem 2.1 with the GK guarantee to yield approximation and correctness guarantees.
Lemma 2.2. Fix some time t ≥ n/64 and some rank ρ ≤ t, and consider querying G(S(Xt)) with q = min{ρm/n, |S|}, obtaining y as the result. Say that S = S(Xt) is good if

||S| − mt/n| ≤ εmt/8n

and if none of the first ≤ mt/n samples z in S has

|R(z, S) − (m/n)R(z, Xt)| > εmt/8n .

If S is good then

|R(y, Xt) − ρ| ≤ εt/2 .

Further, if

m ≥ 400000 ln(1/ε) / ε^3

then

P(S is not good) ≤ (ε^3/8) e^{−1/ε} .
Proof. First, by the GK guarantee, G(S) returns some item y with |R(y, S) − q| ≤ εtm/8n. If S is good, then

|q − ρm/n| ≤ εmt/8n , and also |R(y, S) − (m/n)R(y, Xt)| ≤ εmt/8n .

By the triangle inequality,

|(m/n)R(y, Xt) − ρm/n| ≤ 3εmt/8n .

Equivalently, |R(y, Xt) − ρ| ≤ 3εt/8.
Now, following the proof of Theorem 2.1, we have that

P(||St| − mt/n| > εmt/8n) < 2 exp(−ε^2 m/12288)

and also for each of the first ≤ m samples z that

P(|R(z, S) − (m/n)R(z, Xt)| > εmt/8n) < 2 exp(−ε^2 m/12288) .

By the union bound,

P(S is not good) ≤ 4m exp(−ε^2 m/12288) .

Choosing

m ≥ 400000 ln(1/ε) / ε^3

suffices to bound this quantity by (ε^3/8) e^{−1/ε}.
2.4 Caveats
There are two serious issues with this summary. The first is that it requires us to know the value of n in
advance to perform the sampling. Also, as a byproduct of the sampling, we can only obtain approximation
guarantees after we have seen at least 1/64 (or at least some constant fraction) of the items. This means
that while the algorithm is sufficient for approximating order statistics over streams stored on disk, more
is needed to get it to work for online streaming applications, in which (1) the stream size n is not known
in advance, and (2) queries can be answered approximately at any time t ≤ n and not just when t ≥ n/64.
Adapting this basic streaming summary idea to work online constitutes the next section and the
bulk of our contribution. We start with a high-level overview of our online summary algorithm. In
Section 3.1 we formally define an initial version of our algorithm whose expected size at any given time is
O((1/ε) log(1/ε)) words. In Section 3.2 we show that our algorithm guarantees that for any n and any ρ
we have that P(|R(y, Xn ) − ρ| ≤ εn) ≥ 1 − exp(−1/ε). In Section 3.3 we discuss the slight modifications
necessary to get a deterministic O((1/ε) log(1/ε)) space complexity, and also perform a time complexity
analysis.
3 An online summary
Our algorithm works in rows, which are illustrated in Figure 2 and Figure 3. Row r is a summary of
the first 2^r · 32m stream items. Since we do not know how many items will actually be in the stream, we
cannot start all of these rows running at the outset. Therefore, we start each row r ≥ 1 once we have
seen 1/64 of its total items. However, since we cannot save these items for every row we start, we need
to construct an approximation of this fraction of the stream, which we do by using the summary of the
previous row, and join this approximating stream with the new items that arrive while the row is live. We
then wait until the row has seen a full half of its items before we permit it to start answering queries; this
dilutes the influence of approximating the 1/64 of its input that we could not store.
Operation within a row is very much like the operation of our fixed-n streaming summary. We feed the joint approximate prefix + new item stream through a Bernoulli sampler to get a sample stream, which is then fed into a GK summary (which is stored).
Figure 2: Each row r has its own copy Gr of the GK algorithm that approximates its input to ε/8 error. Ar is the prefix stream of row r, Br is its suffix stream, Rr is its prefix stream replacement (generated by the previous row), Jr is the joint stream Rr followed by Br, Sr is its sample stream, and Qr is a one-time stream generated from Gr at time 2^r m to get the replacement prefix R_{r+1}.
Figure 3: There are at most six live rows and one active row at any time. Time is shown here in dashed boxes at the bottom of the diagram on a logarithmic scale; for example, box 2m represents items m+1, . . . , 2m. The dashed red line at time 32m marks the first deletion of a row.
After row r has seen half of its items, its GK summary
becomes the one used to answer quantile queries. When row r + 1 has seen 1/64 of its total items, row r
generates an approximation of those items from its GK summary and feeds them as a stream into row
r + 1.
Row 0 is slightly different in order to bootstrap the algorithm. There is no join step since there is no
previous row to join. Also, row 0 is active from the start. Lastly, we get rid of the sampling step so that
we can answer queries over timesteps 1, . . . , m/2.
After the first 32m items, row 0 is no longer needed, so we can clean up the space used by its GK
summary. Similarly, after the first 2^r · 32m items, row r is no longer needed. The upshot of this is that we
never need storage for more than six rows at a time. Since each GK summary uses O((1/ε) log(1/ε))
words, the six live GK summaries also only use O((1/ε) log(1/ε)) words.
Our error analysis, on the other hand, will require us to look back as many as Θ(log 1/ε) rows to
ensure our approximation guarantee. We stress that we will not need to actually store these Θ(log 1/ε)
rows for our guarantee to hold; we will only need that they did not have any bad events (as will be defined)
when they were alive.
3.1 Algorithm description
Our algorithm works in rows. Each row r has its own copy Gr of the GK algorithm that approximates its
input to ε/8 error. For each row r we define several streams: Ar is the prefix stream of row r, Br is its
suffix stream, Rr is its prefix stream replacement (generated by the previous row), Jr is the joint stream Rr
followed by Br , Sr is its sample stream, and Qr is a one-time stream generated from Gr by querying it
with ranks ρ1 , . . . , ρ8/ε , where ρq = q(ε/8)(m/32) for r ≥ 1 and ρq = qεm/8 for r = 0.
The prefix stream Ar = X(2^{r−1} m) for row r ≥ 1, importantly, is not directly received by row r. Instead, at the end of timestep 2^{r−1} m, row r−1 generates Qr−1 and duplicates each of those 8/ε items 2^{r−1} εm/8 times to get the replacement prefix Rr, which is then immediately fed into row r before timestep 2^{r−1} m+1 begins.
Each row can be live or not and active or not. Row 0 is live in timesteps 1, . . . , 32m and row r ≥ 1 is live in timesteps 2^{r−1} m+1, . . . , 2^r · 32m. Live rows require space; once a row is no longer live we can free up the space it used. Row 0 is active in timesteps 1, . . . , 32m and row r ≥ 1 is active in timesteps 2^r · 16m+1, . . . , 2^r · 32m. This definition means that exactly one row r(t) is active in any given timestep t. Any queries that are asked in timestep t are answered by Gr(t). Given query ρ, we ask Gr(t) for ρ/(2^{r(t)} · 32) (if r ≥ 1) or for ρ (if r = 0) and return the result.
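The live and active windows are pure arithmetic on r and t. A small Python sketch (ours) of this bookkeeping:

    def is_live(r, t, m):
        """Row 0 is live for 1 <= t <= 32m; row r >= 1 for 2^(r-1) m < t <= 2^r * 32m."""
        if r == 0:
            return 1 <= t <= 32 * m
        return 2 ** (r - 1) * m < t <= 2 ** r * 32 * m

    def is_active(r, t, m):
        """Row 0 is active for 1 <= t <= 32m; row r >= 1 for 2^r * 16m < t <= 2^r * 32m."""
        if r == 0:
            return 1 <= t <= 32 * m
        return 2 ** r * 16 * m < t <= 2 ** r * 32 * m

    def active_row(t, m):
        """Exactly one row is active at any timestep t >= 1."""
        r = 0
        while not is_active(r, t, m):
            r += 1
        return r

    # Sanity checks: the active windows tile the timeline, and at most six
    # rows are live at once (e.g., rows 0..5 at t = 32m).
    m = 4
    assert all(sum(is_active(r, t, m) for r in range(20)) == 1 for t in range(1, 10000))
    assert max(sum(is_live(r, t, m) for r in range(20)) for t in range(1, 10000)) == 6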
At each timestep t, when item xt arrives, it is fed as the next item in the suffix stream Br for each live
row r. Br joined with Rr defines the joined input stream Jr . For r ≥ 1, Jr is downsampled to the sample
stream Sr by sampling each item independently with probability 1/(2^r · 32). For row 0, no downsampling is
performed, so S0 = J0 . Lastly, Sr is fed into Gr .
Figure 2 exhibits the operation of and the communication between the first six rows. Solid arrows
indicate continuous streams and dashed arrows indicate one-time messages. Figure 3 shows when rows
are created, change from live to active, and are deleted. Table 1 summarizes the variables used in this
section. Algorithm 1 is a pseudocode listing of the algorithm.
Initially, allocate space for G0. Mark row 0 as live and active.
for t = 1, 2, . . . do
    foreach live row r ≥ 0 do
        with probability 1/(2^r · 32) do (probability 1 for r = 0, since row 0 does not downsample)
            Insert xt into Gr.
    if t = 2^{r−1} m for some r ≥ 1 then
        Allocate space for Gr. Mark row r as live.
        Query Gr−1 with ρ1, . . . , ρ_{8/ε} to get y1, . . . , y_{8/ε}.
        for q = 1, . . . , 8/ε do
            for 1, . . . , 2^{r−1} εm/8 do
                with probability 1/(2^r · 32) do
                    Insert yq into Gr.
    if t = 2^r · 16m for some r ≥ 1 then
        Unmark row r−1 as active and mark row r as active.
        Unmark row r−1 as live and free space for Gr−1.
on query ρ do
    Let r = r(t) be the active row.
    Query Gr for rank ρ/(2^r · 32) (if r ≥ 1) or for rank ρ (if r = 0).
    Return the result.

Algorithm 1: Procedural listing of the algorithm in Section 3.1.
3.2 Error analysis
Define Cr = x(2^r · 32m+1), x(2^r · 32m+2), . . . and Yr to be Rr followed by Br and then Cr . That is, Yr is just
the continuation of Jr for the entire length of the input stream.
Fix some time t. All of our claims will be relative to time t; that is, if we write Sr we mean Sr (t). Our
error analysis proceeds as follows. We start by proving that R(y,Yr ) is a good approximation of R(y,Yr−1 )
when certain conditions hold for Sr−1 . By induction, this means that R(y,Yr ) is a good approximation of
R(y, X = Y0) when the conditions hold for all of S0, . . . , Sr−1, and actually it is enough for the conditions to hold for just S_{r−log(1/ε)}, . . . , Sr−1 to get a good approximation. Having proven this claim, we then
prove that the result y = y(ρ) of a query to our summary has R(y, X) close to ρ. Lastly, we show
that m = O(poly(1/ε)) suffices to ensure that the conditions hold for S_{r−log(1/ε)}, . . . , Sr−1 with very high probability (1 − e^{−1/ε}). Table 1 summarizes the quantities used in this and the preceding section.
Lemma 3.1. Let αr be the event that |Sr| > 2m and let βr be the event that any of the first ≤ 2m samples z in Sr has

|2^r · 32 · R(z, Sr) − R(z, Yr)| > εt/8 .

Say that Sr is good if neither αr nor βr occurs (or if r = 0). For all rows r ≥ 1 such that t ≥ tr = 2^{r−1} m, and for all items y, if Sr−1 is good then we have that

|R(y, Yr) − R(y, Yr−1)| ≤ 2^r εm .
Proof. At the end of time tr we have Yr(tr) = Rr(tr), which is each item y(ρq) in Qr−1 duplicated εtr/8 times. If Sr−1(tr) is good then |R(y(ρq), Yr−1(tr)) − 2^{r−1} · 32 · ρq| ≤ εtr/2, following Lemma 2.2.

Fix q so that y(ρq) ≤ y < y(ρq+1), where y(ρ0) and y(ρ_{1+8/ε}) are defined to be inf D and sup D for completeness. Fixing q this way implies that R(y, Yr(tr)) = 2^{r−1} · 32 · ρq. By the above bound on R(y(ρq), Yr−1(tr)) we also have that

2^{r−1} · 32 · ρq − εtr/2 ≤ R(y, Yr−1(tr)) < 2^{r−1} · 32 · ρq+1 + εtr/2 .

Recalling that ρq = qεm/256, these bounds imply that

|R(y, Yr(tr)) − R(y, Yr−1(tr))| ≤ 2^r εm .

For each time t after tr, the new item xt changes the rank of y in both streams Yr and Yr−1 by the same additive offset, so

|R(y, Yr) − R(y, Yr−1)| = |R(y, Yr(tr)) − R(y, Yr−1(tr))| ≤ 2^r εm ,

yielding the lemma.
By applying this lemma inductively we can bound the difference between Yr and X = Y0 as follows.
Corollary 3.2. For any r ≥ 1 such that t ≥ tr = 2^{r−1} m, if all of S0(t1), S1(t2), . . . , Sr−1(tr) are good, then |R(y, Yr) − R(y, X)| ≤ 2 · 2^r εm.
To ensure that all of these Si are good would require m to grow with n, which would be bad, because
the space complexity would also need to grow with n. Happily, it is enough to require only the last
log2(1/ε) sample summaries to be good, since the other items we disregard constitute only a small fraction
of the total stream.
Corollary 3.3. Let d = log2(1/ε). For any r ≥ 1 such that t ≥ tr = 2^{r−1} m, if all of Sr−1(tr), . . . , Sr−d(tr−d+1) are good, then |R(y, Yr) − R(y, X)| ≤ 2^{r+2} εm.

Proof. By Lemma 3.1 we have |R(y, Yr) − R(y, Yr−d)| ≤ 2^{r+1} εm. At time t ≥ tr−d, Yr−d and X share all except possibly the first 2^{(r−d)−1} m = 2^{r−1} m/2^d = 2^{r−1} εm items. Thus

|R(y, Yr) − R(y, X)| ≤ |R(y, Yr) − R(y, Yr−d)| + |R(y, Yr−d) − R(y, X)| ≤ 2^{r+1} εm + 2^r εm ≤ 2^{r+2} εm ,

proving the corollary.
We now prove that if the last several sample streams were good then querying our summary will give
us a good result.
Lemma 3.4. Let d = log2(1/ε) and r = r(t). If all of Sr(t), Sr−1(tr), . . . , Sr−d(tr−d+1) are good, then querying our summary with rank ρ (that is, querying the active GK summary Gr with ρ/(2^r · 32) if r ≥ 1, or with ρ if r = 0) returns y = y(ρ) such that |R(y, X) − ρ| ≤ εt.
Proof. For r ≥ 1 we have by Corollary 3.3 that |R(y, Yr) − R(y, X)| ≤ 2^{r+2} εm ≤ εt/2. We apply Lemma 2.2 once more at row r, which tells us that |R(y, Yr) − ρ| ≤ εt/2, and combine these bounds with the triangle inequality.
For r = 0, the GK guarantee alone proves the lemma.
Lastly, we prove that m = O(poly(1/ε)) suffices to ensure that all of Sr (t), Sr−1 (tr ), . . . , Sr−d (tr−d+1 )
are good with probability at least 1 − e^{−1/ε}.
Lemma 3.5. Let d = log2(1/ε) and r = r(t). If

m ≥ 400000 ln(1/ε) / ε^3

then all of Sr(t), Sr−1(tr), . . . , Sr−d(tr−d+1) are good with probability at least 1 − e^{−1/ε}.
Proof. There are at most 1 + log2(1/ε) ≤ 4/ε of these summary streams in total. Lemma 2.2 and the union bound give us

P(some Sr is bad) ≤ (4/ε) · (ε^3/8) · e^{−1/ε} ≤ e^{−1/ε} ,

which implies our claim.
3.3 Space and time complexity
A minor issue with the algorithm is that, as written in Section 3.1, we do not actually have a bound on
the worst-case space complexity of the algorithm; we only have a bound on the space needed at any
given point in time. This issue is due to the fact that there are low probability events in which |Sr | can
get arbitrarily large and the fact that over n items there are a total of Θ(log n) sample streams. The space
complexity of the algorithm is O(max |Sr |), and to bound this value with constant probability using the
Chernoff bound appears to require that max |Sr | = Ω(log log n), which is too big.
Fortunately, fixing this problem is simple. Instead of feeding every sample of Sr into the GK summary
Gr , we only feed each next sample if Gr has seen < 2m samples so far. That is, we deterministically
restrict Gr to receiving only 2m samples. Lemmas 3.1 through 3.4 condition on the goodness
of the sample streams Sr , which ensures that the Gr receive at most 2m samples each, and the claim of
Lemma 3.5 is independent of the operation of Gr . Therefore, by restricting each Gr to receive at most 2m
inputs we can ensure that the space complexity is deterministically O((1/ε) log(1/ε)) without breaking
our error guarantees.
The assumption in the streaming setting is that new items arrive over the input stream X at a high
rate, so both the worst-case per-item processing time as well as the amortized time to process n items
are important. For our per-item time complexity, the limiting factor is the duplication step that occurs
at the end of each time tr = 2^{r−1} m, which makes the worst-case per-item processing time as large as Θ(n). Instead, at time tr we could generate Qr−1 and store it in O(1/ε) words, and then on each arrival t = 2^{r−1} m+1, . . . , 2^r m we could insert both xt and also the next item in Rr. By the time tr+1 = 2tr that we generate Qr, all items in Rr will have been inserted into Jr. Thus the worst-case per-item time complexity
ε: The error parameter of our algorithm as a whole.
ε/8: The error parameter used in the internal GK summaries.
m: The size of our random sample in each row. Following Lemma 2.2 we use m ≥ 400000 ln(1/ε)/ε^3.
r = 0, 1, . . .: The current row.
2^r · 32m: The number of stream items summarized by row r.
2^{r−1} m (r ≥ 1): The time at which we create row r.
X(2^r · 32m): The first 2^r · 32m items of X, which are also the stream items summarized by row r.
A0: The prefix stream of row 0, which is empty.
Ar (r ≥ 1): The prefix stream of row r, which contains 2^{r−1} m items. By the time 2^{r−1} m at which we create row r these items have already expired.
Rr: The replacement prefix of row r, which is generated by row r−1 to act as a proxy for the expired items Ar.
Br: The suffix stream of row r, which is everything in X(2^r · 32m) except for Ar.
Jr: The result of joining Rr and Br. Jr is to (Rr, Br) as X(2^r · 32m) is to (Ar, Br).
Sr (r ≥ 1): The result of applying Bernoulli sampling to Jr at rate 1/(2^r · 32). This rate is chosen so that the expected size of Sr is m.
Qr: The result of querying the GK summary for row r with the 8/ε evenly-spaced queries ε/8, 2ε/8, 3ε/8, . . . , 1. Each item in Qr is duplicated 2^r mε/8 times to get R_{r+1}. The duplication number 2^r mε/8 is chosen so that R_{r+1} and A_{r+1} have the same size.
q = 1, 2, . . . , 8/ε: The indices of the evenly-spaced queries into Qr for the current value of r.
ρq: The evenly-spaced rank queries into Qr.
Cr: The continuation of the potentially infinite stream X beyond the items in Br; that is, x(2^r · 32m+1), x(2^r · 32m+2), . . .
Yr: The potentially infinite stream that is the result of joining Rr, Br, and Cr. Equivalently, the stream defined by replacing Ar in X with Rr.
R(z, Sr): The rank of item z in the Bernoulli sample Sr.
R(z, Yr): The rank of item z in the stream Yr.
2^r · 32 · R(z, Sr): R(z, Sr) scaled up so that it is approximately the size of R(z, Yr).
αr, βr: Low-probability events that can kill our algorithm if they occur. We proved that these are low-probability events in Lemma 2.2.
y(ρq): The item returned by Gr on a query of ρq.

Table 1: A table summarizing the variables used in the algorithm description of Section 3.1 and the statements and proofs of Section 3.2.
is O((1/ε) · T_GK^max), where T_GK^max is the worst-case per-item time to query or insert into one of our GK summaries. Over 2^r · 32m items there are at most 2m insertions into any one GK summary, so the amortized time over n items in either case is

O((m log(n/m)/n) · T_GK) ,

where T_GK is the amortized per-item time to query or insert into one of our GK summaries. Algorithm 2
includes the changes of this section, with the relevant lines highlighted to show the difference from
Algorithm 1.
Initially, allocate space for G0. Mark row 0 as live and active.
for t = 1, 2, . . . do
    foreach live row r ≥ 0 do
        with probability 1/(2^r · 32) do (probability 1 and no cap for r = 0, since S0 = J0)
            Insert xt into Gr if Gr has seen < 2m insertions.
        if r ≥ 1 and 2^{r−1} m < t ≤ 2^r m and Gr has seen < 2m insertions then
            with probability 1/(2^r · 32) do
                Also insert item t − 2^{r−1} m of Rr into Gr.
    if t = 2^{r−1} m for some r ≥ 1 then
        Allocate space for Gr. Mark row r as live.
        Query Gr−1 with ρ1, . . . , ρ_{8/ε} to get Qr−1 = y1, . . . , y_{8/ε}.
        Store Qr−1, to implicitly define Rr.
    if t = 2^r · 16m for some r ≥ 1 then
        Unmark row r−1 as active and mark row r as active.
        Unmark row r−1 as live and free space for Gr−1.
on query ρ do
    Let r = r(t) be the active row.
    Query Gr for rank ρ/(2^r · 32) (if r ≥ 1) or for rank ρ (if r = 0).
    Return the result.

Algorithm 2: Procedural listing of the algorithm in Section 3.3. The changes between Section 3.1 and Section 3.3 are that Gr never receives more than 2m insertions and that the items of Rr are paired with the items of Br.
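For readers who prefer running code, the following condensed Python rendering of Algorithm 2 may help; it is our sketch, not reference code. The ExactSummary class is a naive, exact stand-in for the GK boxes (so the sketch has none of the space savings), all names are ours, we assume 8/ε is an integer, and we read the sampling step and the 2m cap as applying only to rows r ≥ 1, since Section 3.1 specifies S0 = J0 and row 0 must summarize all of its 32m items.

    import bisect
    import random

    class ExactSummary:
        """Naive exact stand-in for a GK(eps/8) summary; we use only the
        generic insert/query interface that the paper relies on."""
        def __init__(self):
            self.xs = []
        def insert(self, x):
            bisect.insort(self.xs, x)
        def query(self, rho):
            # the stored item whose (1-indexed) rank is nearest to rho
            i = max(0, min(len(self.xs) - 1, round(rho) - 1))
            return self.xs[i]

    class OnlineSummary:
        def __init__(self, eps, m):
            self.eps, self.m, self.t = eps, m, 0
            self.gk = {0: ExactSummary()}   # G_r for each live row r
            self.ins = {0: 0}               # insertions into G_r (capped at 2m for r >= 1)
            self.q = {}                     # stored Q_{r-1} for row r, implicitly defining R_r
            self.active = 0

        def _feed(self, r, x):
            # row 0 does not downsample (S_0 = J_0); rows r >= 1 sample at rate 1/(2^r * 32)
            if r == 0:
                self.gk[0].insert(x)
            elif self.ins[r] < 2 * self.m and random.random() < 1.0 / (2 ** r * 32):
                self.gk[r].insert(x)
                self.ins[r] += 1

        def insert(self, x):
            self.t = t = self.t + 1
            m, eps = self.m, self.eps
            for r in list(self.gk):
                self._feed(r, x)                    # x_t is the next item of B_r
                if r >= 1 and 0 < t - 2 ** (r - 1) * m <= 2 ** (r - 1) * m:
                    i = t - 2 ** (r - 1) * m        # pair item i of R_r with x_t
                    dup = 2 ** (r - 1) * eps * m / 8
                    self._feed(r, self.q[r][int((i - 1) // dup)])
            if t % m == 0:
                k = t // m
                if (k & (k - 1)) == 0:              # t = 2^(r-1) m: start row r
                    r = k.bit_length()
                    nq = round(8 / eps)             # assumes 8/eps is an integer
                    step = (eps / 8) * (m / 32) if r >= 2 else eps * m / 8
                    self.q[r] = [self.gk[r - 1].query(j * step)
                                 for j in range(1, nq + 1)]
                    self.gk[r], self.ins[r] = ExactSummary(), 0
                kk = k // 16
                if k % 16 == 0 and (kk & (kk - 1)) == 0 and kk > 1:
                    r = kk.bit_length() - 1         # t = 2^r * 16m: activate row r,
                    self.active = r                 # retire row r-1
                    del self.gk[r - 1], self.ins[r - 1]
                    self.q.pop(r - 1, None)

        def query(self, rho):
            r = self.active
            return self.gk[r].query(rho / (2 ** r * 32) if r >= 1 else rho)

The arithmetic on k = t/m locates the two kinds of boundary timesteps: t = 2^{r−1} m, where row r is created and Q_{r−1} is stored to implicitly define Rr, and t = 2^r · 16m, where row r becomes active and row r−1 is retired.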
The modifications of this section notwithstanding, from a practical perspective we expect that the number of items required for our algorithm to begin to outperform a plain GK summary is likely to be prohibitive if we are required to maintain our provable guarantees, even if we optimize the number of live rows and the value of m for any given ε. (Already for ε = 1/10, the bound of Lemma 2.2 gives m ≥ 400000 · ln(10) · 10^3 ≈ 9.2 × 10^8.)
4 Discussion
Our starting point is a very natural idea used in Manku et al. [8]: downsample the input stream and feed
the resulting sample stream into a deterministic summary data structure (compare our Figure 1 with Figure
1 on page 254 of [8]). At a very high level, we are simply replacing their deterministic O((1/ε) log2 (εn))
MRL summary [7] with the deterministic O((1/ε) log(εn)) GK summary [4].
However, our implementation of this idea differs conceptually from the implementation of Manku et
al. in two important ways. First, we use the GK algorithm strictly as a black box, whereas Manku et al.
peek into the internals of their MRL algorithm, using its algorithm-specific interface (NEW, COLLAPSE, OUTPUT) rather than the more generic interface (INSERT, QUERY). At an equivalent level, dealing with
the GK algorithm is already unpleasant—the space complexity analysis in [4] is quite involved, and in
fact a simpler analysis of the GK algorithm is an open problem [2]. Using the generic interface, our
implementation could just as easily replace the GK boxes in the diagram in Figure 2 with MRL boxes; or,
for the bounded universe model, with boxes running the q-digest summary of Shrivastava et al. [10].
The second way in which our algorithm differs critically from that of Manku et al. is that we operate
on streams rather than on stream items. We use this approach in our proof strategy too; the key step
in our error analysis, Lemma 3.1, is a statement about (what to us are) static objects, so we can trade
out the complexity of dealing with time-varying data structures for a simple induction. We believe that
developing streaming algorithms with analyses that hinge on analyzing streams rather than just stream
items is likely to be a useful design approach for many problems.
Acknowledgements We thank the anonymous reviewers for Theory of Computing and RANDOM
2015 for providing historical context, pointers and insights into previous and related work, and suggestions
for clarifications. Research supported in part by NSF grant 1619348, DARPA, US-Israel BSF grant
2012366, OKAWA Foundation Research Award, IBM Faculty Research Award, Xerox Faculty Research
Award, B. John Garrick Foundation Award, Teradata Research Award, and Lockheed-Martin Corporation
Research Award. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
References
[1] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi: Mergeable summaries. ACM Trans. Database Syst., 38(4):26:1–26:28, 2013. Preliminary version in PODS'12. [doi:10.1145/2500128]

[2] Graham Cormode: List of Open Problems in Sublinear Algorithms: Problem 2. https://sublinear.info/2.

[3] David Felber and Rafail Ostrovsky: A randomized online quantile summary in O(1/ε · log(1/ε)) words. In Proc. 19th Internat. Workshop on Randomization and Computation (RANDOM'15), pp. 775–785. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2015. [doi:10.4230/LIPIcs.APPROX-RANDOM.2015.775]

[4] Michael Greenwald and Sanjeev Khanna: Space-efficient online computation of quantile summaries. In 2001 ACM SIGMOD Internat. Conf. on Management of Data, pp. 58–66. ACM Press, 2001. [doi:10.1145/375663.375670]

[5] Regant Y. S. Hung and Hingfung F. Ting: An Ω(1/ε · log 1/ε) space lower bound for finding ε-approximate quantiles in a data stream. In Proc. 4th Internat. Conf. on Frontiers in Algorithmics (FAW'10), pp. 89–100. Springer, 2010. [doi:10.1007/978-3-642-14553-7_11]

[6] Zohar Karnin, Kevin Lang, and Edo Liberty: Optimal quantile approximation in streams. In Proc. 57th FOCS, pp. 71–78. IEEE Comp. Soc. Press, 2016. [doi:10.1109/FOCS.2016.17, arXiv:1603.05346]

[7] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay: Approximate medians and other quantiles in one pass and with limited memory. In 1998 ACM SIGMOD Internat. Conf. on Management of Data, pp. 426–435. ACM Press, 1998. [doi:10.1145/276304.276342]

[8] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay: Random sampling techniques for space efficient online computation of order statistics of large datasets. In 1999 ACM SIGMOD Internat. Conf. on Management of Data, pp. 251–262. ACM Press, 1999. [doi:10.1145/304182.304204]

[9] J. Ian Munro and Mike Paterson: Selection and sorting with limited storage. Theoret. Comput. Sci., 12(3):315–323, 1980. Preliminary version in FOCS'78. [doi:10.1016/0304-3975(80)90061-4]

[10] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri: Medians and beyond: New aggregation techniques for sensor networks. In Proc. 2nd Internat. Conf. Embedded Networked Sensor Systems (SenSys'04), pp. 239–249. ACM Press, 2004. [doi:10.1145/1031495.1031524, arXiv:cs/0408039]

[11] Vladimir N. Vapnik and Alexey Ya. Chervonenkis: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971. [doi:10.1137/1116025]

[12] Lu Wang, Ge Luo, Ke Yi, and Graham Cormode: Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449–472, 2016. Preliminary version in SIGMOD'13. [doi:10.1007/s00778-016-0424-7]
AUTHORS
David Felber
Software engineer
Google, Inc.
dvfelber ucla edu
Rafail Ostrovsky
Professor
Department of Computer Science
Department of Mathematics
UCLA
Los Angeles, CA
rafail cs ucla edu
http://web.cs.ucla.edu/~rafail/
ABOUT THE AUTHORS
David Felber completed his Ph. D. in 2015 at UCLA under the supervision of Rafail Ostro-
vsky. At press time he works as a software engineer at Google.
Rafail (Rafi) Ostrovsky is Professor of Computer Science and Professor of Mathematics at
UCLA. He received his Ph. D. in computer science from MIT in 1992. He heads the
multidisciplinary Center for Information and Computation Security at the Henry Samueli
School of Engineering and Applied Science.