
Showing papers by "Mikkel Thorup published in 2009"


Proceedings ArticleDOI
04 Jan 2009
TL;DR: VarOptk is an efficient online reservoir sampling scheme that maintains a generic sample of a limited size k from a stream of weighted items, which can later be used to estimate the total weight of arbitrary subsets; it dominates all previous schemes in terms of estimation quality.
Abstract: From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of estimation quality. VarOptk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
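As a rough illustration of reservoir-style sampling with unbiased subset-sum estimates, the sketch below implements priority sampling, an earlier and simpler scheme; it is not the variance-optimal VarOptk algorithm from this paper, and the stream, the value of k, and the subset predicate in the example are invented.

```python
# A minimal sketch of priority sampling for unbiased subset-sum estimation.
# NOTE: this is a simpler precursor scheme, NOT the VarOptk algorithm from the
# paper; the stream contents, k, and the subset predicate below are made up.
import heapq
import random

def priority_sample(stream, k):
    """Keep the k items with highest priority q = w / u, where u ~ Uniform(0,1)."""
    heap = []          # min-heap of (priority, key, weight)
    threshold = 0.0    # tracks the (k+1)-st largest priority seen so far
    for key, weight in stream:
        q = weight / random.random()
        if len(heap) < k:
            heapq.heappush(heap, (q, key, weight))
        elif q > heap[0][0]:
            # Evict the current minimum priority; it leaves the sample.
            threshold = max(threshold, heapq.heapreplace(heap, (q, key, weight))[0])
        else:
            threshold = max(threshold, q)
    # Per-key unbiased weight estimate: max(weight, threshold) if sampled, 0 otherwise.
    return {key: max(weight, threshold) for _, key, weight in heap}

# Usage: estimate the total weight of an arbitrary subset (here: even keys).
sample = priority_sample(((i, 1.0 + i % 5) for i in range(10_000)), k=200)
estimate = sum(w for key, w in sample.items() if key % 2 == 0)
print(round(estimate))   # roughly 15,000 (the true subset weight), up to sampling error
```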

48 citations


Journal ArticleDOI
TL;DR: This paper reviews optimization techniques that have been deployed for managing intra-domain routing in networks operated with shortest path routing protocols, and the state-of-the-art research that has been carried out in this direction.
Abstract: Throughout the last decade, extensive deployment of popular intra-domain routing protocols such as Open Shortest Path First and Intermediate System–Intermediate System has drawn ever-increasing attention to Internet traffic engineering. This paper reviews optimization techniques that have been deployed for managing intra-domain routing in networks operated with shortest path routing protocols, and the state-of-the-art research that has been carried out in this direction.

47 citations


Journal ArticleDOI
TL;DR: In the most important case of $d=\Theta(\lg n)$, the first superconstant lower bound is obtained; these are the highest lower bounds known for any static data-structure problem, significantly improving on previous records.
Abstract: We convert cell-probe lower bounds for polynomial space into stronger lower bounds for near-linear space. Our technique applies to any lower bound proved through the richness method. For example, it applies to partial match and to near-neighbor problems, either for randomized exact search or for deterministic approximate search (which are thought to exhibit the curse of dimensionality). These problems are motivated by searching in large databases, so near-linear space is the most relevant regime. Typically, richness has been used to imply $\Omega(d/\lg n)$ lower bounds for polynomial-space data structures, where $d$ is the number of bits of a query. This is the highest lower bound provable through the classic reduction to communication complexity. However, for space $n\lg^{O(1)}n$, we now obtain bounds of $\Omega(d/\lg d)$. This is a significant improvement for natural values of $d$, such as $\lg^{O(1)}n$. In the most important case of $d=\Theta(\lg n)$, we have the first superconstant lower bound. From a complexity-theoretic perspective, our lower bounds are the highest known for any static data structure problem, significantly improving on previous records.
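To make the improvement concrete with the abstract's own parameters: for space $n\lg^{O(1)}n$ and query length $d=\Theta(\lg n)$, the classic richness reduction gives only $\Omega(d/\lg n)=\Omega(1)$, a constant, while the new technique gives $\Omega(d/\lg d)=\Omega(\lg n/\lg\lg n)$, the superconstant bound referred to above; similarly, for $d=\lg^{2}n$ the two bounds are $\Omega(\lg n)$ versus $\Omega(\lg^{2}n/\lg\lg n)$.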

46 citations


Book ChapterDOI
06 Jul 2009
TL;DR: This work shows how to solve regular expression matching in linear space and O(nm(log log n)/(log n)^{3/2} + n + m) time, where m is the length of the expression and n the length of the string.
Abstract: Regular expression matching is a key task (and often the computational bottleneck) in a variety of widely used software tools and applications, for instance, the unix grep and sed commands, scripting languages such as awk and perl, programs for analyzing massive data streams, etc. We show how to solve this ubiquitous task in linear space and O(nm(log log n)/(log n)^{3/2} + n + m) time, where m is the length of the expression and n the length of the string. This is the first improvement of the dominant O(nm/log n) term in Myers' O(nm/log n + (n + m)log n) bound [JACM 1992]. We also get improved bounds for external memory.
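For contrast with the improved running time, the sketch below spells out the classical Thompson construction plus state-set simulation, the straightforward O(nm)-time baseline (slower even than Myers' bound); it is not the paper's tabulation-based algorithm, supports only literals, '|', '*' and parentheses, and the example patterns are invented.

```python
# A minimal sketch of the classical Thompson construction plus state-set NFA
# simulation: the straightforward O(nm)-time baseline for regular expression
# matching, NOT the paper's faster tabulation-based algorithm. Only literals,
# '|', '*' and parentheses are supported; the example patterns are made up.

def parse(pattern):
    # Recursive descent: alt -> cat ('|' cat)*, cat -> rep*, rep -> atom '*'*
    pos = 0
    def alt():
        nonlocal pos
        node = cat()
        while pos < len(pattern) and pattern[pos] == '|':
            pos += 1
            node = ('alt', node, cat())
        return node
    def cat():
        nonlocal pos
        node = ('eps',)
        while pos < len(pattern) and pattern[pos] not in '|)':
            node = ('cat', node, rep())
        return node
    def rep():
        nonlocal pos
        node = atom()
        while pos < len(pattern) and pattern[pos] == '*':
            pos += 1
            node = ('star', node)
        return node
    def atom():
        nonlocal pos
        if pattern[pos] == '(':
            pos += 1
            node = alt()
            pos += 1  # skip ')'
            return node
        pos += 1
        return ('lit', pattern[pos - 1])
    return alt()

def build(node, trans, eps):
    # Thompson NFA: trans[s] = (char, target); eps[s] = list of epsilon successors.
    def new():
        eps.append([])
        return len(eps) - 1
    kind = node[0]
    if kind in ('eps', 'lit'):
        s, t = new(), new()
        if kind == 'lit':
            trans[s] = (node[1], t)
        else:
            eps[s].append(t)
        return s, t
    if kind == 'cat':
        s1, t1 = build(node[1], trans, eps)
        s2, t2 = build(node[2], trans, eps)
        eps[t1].append(s2)
        return s1, t2
    s1, t1 = build(node[1], trans, eps)
    s, t = new(), new()
    if kind == 'alt':
        s2, t2 = build(node[2], trans, eps)
        eps[s] += [s1, s2]
        eps[t1].append(t)
        eps[t2].append(t)
    else:  # 'star'
        eps[s] += [s1, t]
        eps[t1] += [s1, t]
    return s, t

def closure(states, eps):
    # Epsilon-closure by depth-first search.
    stack, seen = list(states), set(states)
    while stack:
        for t in eps[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def match(pattern, text):
    trans, eps = {}, []
    start, accept = build(parse(pattern), trans, eps)
    current = closure({start}, eps)
    for c in text:
        step = {trans[s][1] for s in current if s in trans and trans[s][0] == c}
        current = closure(step, eps)
    return accept in current

print(match("(a|b)*abb", "aababb"))   # True
print(match("(a|b)*abb", "aabab"))    # False
```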

38 citations


Proceedings ArticleDOI
04 Jan 2009
TL;DR: The authors' contribution is that, for an expected constant number of linear probes, it suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5-universal.
Abstract: Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC'07, Pagh et al. presented data sets where the standard implementation of 2-universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5-universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5-universal hashing for, say, variable-length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision-free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5-universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32-bit integers, e.g., 64-bit integers and fixed-length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al.
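The sketch below illustrates the two-level setup the abstract describes: a first hash from a complex domain (strings) into an intermediate integer domain, then a second hash into the array, with linear probing to resolve collisions. The hash functions here are simple stand-ins, not the 5-universal or collision-bounded functions that the paper's guarantees require; the class name and parameters are invented for the example.

```python
# A minimal sketch of linear probing with two-level hashing: complex key ->
# intermediate 64-bit integer -> table slot. The hash functions are simple
# placeholders, NOT the 5-universal / collision-bounded functions assumed by
# the paper's analysis; class name and parameters are invented.
import random

class LinearProbingTable:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.slots = [None] * capacity             # each slot is (key, value) or None
        self.a = random.randrange(1, 1 << 64) | 1  # odd multiplier for the second hash

    def _intermediate(self, key: str) -> int:
        # First level: variable-length string -> 64-bit integer.
        # Python's built-in hash stands in for the "O(1) expected collisions" map.
        return hash(key) & ((1 << 64) - 1)

    def _index(self, key: str) -> int:
        # Second level: intermediate integer -> table slot (multiply-shift style).
        return ((self.a * self._intermediate(key)) >> 32) % self.capacity

    def insert(self, key, value):
        # Assumes the load is kept well below capacity (no resizing in this sketch).
        i = self._index(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity            # probe consecutive locations
        self.slots[i] = (key, value)

    def lookup(self, key):
        i = self._index(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        return None

t = LinearProbingTable()
t.insert("foo", 1); t.insert("bar", 2)
print(t.lookup("foo"), t.lookup("baz"))            # 1 None
```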

29 citations


Proceedings ArticleDOI
04 Jan 2009
TL;DR: This work presents algorithms for finding optimal strategies for discounted, infinite-horizon, Deterministic Markov Decision Processes (DMDPs), improving on a recent O(mn^2) bound of Andersson and Vorobyov [2006], and also presents a randomized algorithm for finding Discounted All-Pairs Shortest Paths (DAPSP), improving on several previous algorithms.
Abstract: We present two new algorithms for finding optimal strategies for discounted, infinite-horizon, Deterministic Markov Decision Processes (DMDP). The first one is an adaptation of an algorithm of Young, Tarjan and Orlin for finding minimum mean weight cycles. It runs in O(mn + n^2 log n) time, where n is the number of vertices (or states) and m is the number of edges (or actions). The second one is an adaptation of a classical algorithm of Karp for finding minimum mean weight cycles. It runs in O(mn) time. The first algorithm has a slightly slower worst-case complexity, but is faster than the second algorithm in many situations. Both algorithms improve on a recent O(mn^2)-time algorithm of Andersson and Vorobyov. We also present a randomized O(m^{1/2}n^2)-time algorithm for finding Discounted All-Pairs Shortest Paths (DAPSP), improving several previous algorithms.
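As background for the second algorithm, the sketch below is Karp's classical minimum mean-weight cycle algorithm in its textbook form; it is not the paper's discounted adaptation, and the example graph is made up.

```python
# A minimal sketch of Karp's classical minimum mean-weight cycle algorithm, the
# textbook subroutine adapted by the second (O(mn)-time) algorithm above; this
# is NOT the paper's discounted DMDP adaptation, and the example graph is made up.

def min_mean_cycle(n, edges):
    """n vertices 0..n-1, edges = [(u, v, w)]; returns the minimum cycle mean,
    or None if the graph has no cycle."""
    INF = float("inf")
    # d[k][v] = minimum weight of a walk with exactly k edges ending at v
    # (equivalently, shortest k-edge path from an artificial zero-weight source).
    d = [[INF] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = None
    for v in range(n):
        if d[n][v] == INF:
            continue
        # Karp's formula: mu* = min_v max_k (d_n(v) - d_k(v)) / (n - k).
        val = max((d[n][v] - d[k][v]) / (n - k) for k in range(n) if d[k][v] < INF)
        best = val if best is None else min(best, val)
    return best

# Two cycles: 0->1->2->0 with mean 2, and 1->3->1 with mean 1.5.
print(min_mean_cycle(4, [(0, 1, 1), (1, 2, 2), (2, 0, 3), (1, 3, 1), (3, 1, 2)]))  # 1.5
```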

20 citations


Patent
Mikkel Thorup, Philip Bille
21 Dec 2009
TL;DR: In this paper, a regular expression matching system is presented for matching a regular expression in an input string; the system identifies all instances where an end occurrence of a particular substring matches a positive start state of that substring, and enters these instances as positive substring accept states in the regular expression matching process.
Abstract: Disclosed herein are systems, methods, and computer-readable storage media for matching a regular expression in an input string. A system configured to practice the method first identifies a number of substrings (k) in a regular expression of length (m). The system receives a stream of start states for each of the substrings generated according to a regular expression matching process and receives a stream of end occurrences generated according to a multi-string matching process. The system identifies all instances where an end occurrence of a particular substring matches a positive start state of the particular substring, and enters the instances as positive substring accept states in the regular expression matching process on the input string. In one aspect, the system is much more efficient when (k) is much less than (m). The system can match the regular expression based on a bit offset between the first and second stream.

16 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: This work develops a summarization framework for unaggregated data where summarization is a scalable and composable operator, and as such, can be tailored to meet resource constraints.
Abstract: Many data sets occur as unaggregated data sets, where multiple data points are associated with each key. In the aggregate view of the data, the weight of a key is the sum of the weights of data points associated with the key. Examples are measurements of IP packet header streams, distributed data streams produced by events registered by sensor networks, and Web page or multimedia requests to content distribution servers. We aim to combine sampling and aggregation to provide accurate and efficient summaries of the aggregate view. However, data points are scattered in time or across multiple servers, and hence aggregation is subject to resource constraints on the size of summaries that can be stored or transmitted. We develop a summarization framework for unaggregated data where summarization is a scalable and composable operator, and as such, can be tailored to meet resource constraints. Our summaries support unbiased estimates of the weight of subpopulations of keys specified using arbitrary selection predicates. While we prove that under such scenarios there is no variance optimal scheme, our estimators have the desirable property that the variance is progressively closer to the minimum possible when applied to a "more" aggregated data set. An extensive evaluation using synthetic and real data sets shows that our summarization framework outperforms all existing schemes for this fundamental problem, even for the special and well-studied case of data streams.

9 citations


Patent
Edith Cohen, Nick Duffield, Haim Kaplan, Carsten Lund, Mikkel Thorup
18 Dec 2009
TL;DR: In this article, the authors propose a method for producing a summary of data points in an unaggregated data stream, wherein the data points are in the form of weighted keys (a, w), where a is a key and w is a weight, and the summary is a sample of k keys a with adjusted weights w_a.
Abstract: A method for producing a summary A of data points in an unaggregated data stream, wherein the data points are in the form of weighted keys (a, w), where a is a key and w is a weight, and the summary is a sample of k keys a with adjusted weights w_a. A first reservoir L includes keys whose adjusted weights are the sums of the weights of the individual data points of the included keys, and a second reservoir T includes keys whose adjusted weights are each equal to a threshold value τ, whose value is adjusted based upon tests of new data points arriving in the data stream. The summary combines the keys and adjusted weights of the first reservoir L with the keys and adjusted weights of the second reservoir T to form the sample representing the data stream, upon which further analysis may be performed. The method proceeds by first merging new data points in the stream into the reservoir L until the reservoir contains k different keys, and thereafter applying a series of tests to newly arriving data points to determine which keys and weights are to be added to or removed from the reservoirs L and T, yielding a summary whose variance approaches the minimum possible for aggregated data sets. The method is composable, can be applied to high-speed data streams such as those found on the Internet, and can be implemented efficiently.

5 citations