scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Storing a Sparse Table with 0(1) Worst Case Access Time

TL;DR: A data structure for representing a set of n items from a universe of m items, which uses space n+o(n) and accommodates membership queries in constant time and is easy to implement.
Abstract: A data structure for representing a set of n items from a umverse of m items, which uses space n + o(n) and accommodates membership queries m constant time is described. Both the data structure and the query algorithm are easy to ~mplement.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
23 May 1998
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R d, which require space that is only polynomial in n and d.
Abstract: We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R d , the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.

4,478 citations

Book
01 Jan 2001
TL;DR: The complexity class P is formally defined as the set of concrete decision problems that are polynomial-time solvable, and encodings are used to map abstract problems to concrete problems.
Abstract: problems To understand the class of polynomial-time solvable problems, we must first have a formal notion of what a "problem" is. We define an abstract problem Q to be a binary relation on a set I of problem instances and a set S of problem solutions. For example, an instance for SHORTEST-PATH is a triple consisting of a graph and two vertices. A solution is a sequence of vertices in the graph, with perhaps the empty sequence denoting that no path exists. The problem SHORTEST-PATH itself is the relation that associates each instance of a graph and two vertices with a shortest path in the graph that connects the two vertices. Since shortest paths are not necessarily unique, a given problem instance may have more than one solution. This formulation of an abstract problem is more general than is required for our purposes. As we saw above, the theory of NP-completeness restricts attention to decision problems: those having a yes/no solution. In this case, we can view an abstract decision problem as a function that maps the instance set I to the solution set {0, 1}. For example, a decision problem related to SHORTEST-PATH is the problem PATH that we saw earlier. If i = G, u, v, k is an instance of the decision problem PATH, then PATH(i) = 1 (yes) if a shortest path from u to v has at most k edges, and PATH(i) = 0 (no) otherwise. Many abstract problems are not decision problems, but rather optimization problems, in which some value must be minimized or maximized. As we saw above, however, it is usually a simple matter to recast an optimization problem as a decision problem that is no harder. Encodings If a computer program is to solve an abstract problem, problem instances must be represented in a way that the program understands. An encoding of a set S of abstract objects is a mapping e from S to the set of binary strings. For example, we are all familiar with encoding the natural numbers N = {0, 1, 2, 3, 4,...} as the strings {0, 1, 10, 11, 100,...}. Using this encoding, e(17) = 10001. Anyone who has looked at computer representations of keyboard characters is familiar with either the ASCII or EBCDIC codes. In the ASCII code, the encoding of A is 1000001. Even a compound object can be encoded as a binary string by combining the representations of its constituent parts. Polygons, graphs, functions, ordered pairs, programs-all can be encoded as binary strings. Thus, a computer algorithm that "solves" some abstract decision problem actually takes an encoding of a problem instance as input. We call a problem whose instance set is the set of binary strings a concrete problem. We say that an algorithm solves a concrete problem in time O(T (n)) if, when it is provided a problem instance i of length n = |i|, the algorithm can produce the solution in O(T (n)) time. A concrete problem is polynomial-time solvable, therefore, if there exists an algorithm to solve it in time O(n) for some constant k. We can now formally define the complexity class P as the set of concrete decision problems that are polynomial-time solvable. We can use encodings to map abstract problems to concrete problems. Given an abstract decision problem Q mapping an instance set I to {0, 1}, an encoding e : I → {0, 1}* can be used to induce a related concrete decision problem, which we denote by e(Q). If the solution to an abstract-problem instance i I is Q(i) {0, 1}, then the solution to the concreteproblem instance e(i) {0, 1}* is also Q(i). As a technicality, there may be some binary strings that represent no meaningful abstract-problem instance. For convenience, we shall assume that any such string is mapped arbitrarily to 0. Thus, the concrete problem produces the same solutions as the abstract problem on binary-string instances that represent the encodings of abstract-problem instances. We would like to extend the definition of polynomial-time solvability from concrete problems to abstract problems by using encodings as the bridge, but we would like the definition to be independent of any particular encoding. That is, the efficiency of solving a problem should not depend on how the problem is encoded. Unfortunately, it depends quite heavily on the encoding. For example, suppose that an integer k is to be provided as the sole input to an algorithm, and suppose that the running time of the algorithm is Θ(k). If the integer k is provided in unary-a string of k 1's-then the running time of the algorithm is O(n) on length-n inputs, which is polynomial time. If we use the more natural binary representation of the integer k, however, then the input length is n = ⌊lg k⌋ + 1. In this case, the running time of the algorithm is Θ (k) = Θ(2), which is exponential in the size of the input. Thus, depending on the encoding, the algorithm runs in either polynomial or superpolynomial time. The encoding of an abstract problem is therefore quite important to our under-standing of polynomial time. We cannot really talk about solving an abstract problem without first specifying an encoding. Nevertheless, in practice, if we rule out "expensive" encodings such as unary ones, the actual encoding of a problem makes little difference to whether the problem can be solved in polynomial time. For example, representing integers in base 3 instead of binary has no effect on whether a problem is solvable in polynomial time, since an integer represented in base 3 can be converted to an integer represented in base 2 in polynomial time. We say that a function f : {0, 1}* → {0,1}* is polynomial-time computable if there exists a polynomial-time algorithm A that, given any input x {0, 1}*, produces as output f (x). For some set I of problem instances, we say that two encodings e1 and e2 are polynomially related if there exist two polynomial-time computable functions f12 and f21 such that for any i I , we have f12(e1(i)) = e2(i) and f21(e2(i)) = e1(i). That is, the encoding e2(i) can be computed from the encoding e1(i) by a polynomial-time algorithm, and vice versa. If two encodings e1 and e2 of an abstract problem are polynomially related, whether the problem is polynomial-time solvable or not is independent of which encoding we use, as the following lemma shows. Lemma 34.1 Let Q be an abstract decision problem on an instance set I , and let e1 and e2 be polynomially related encodings on I . Then, e1(Q) P if and only if e2(Q) P. Proof We need only prove the forward direction, since the backward direction is symmetric. Suppose, therefore, that e1(Q) can be solved in time O(nk) for some constant k. Further, suppose that for any problem instance i, the encoding e1(i) can be computed from the encoding e2(i) in time O(n) for some constant c, where n = |e2(i)|. To solve problem e2(Q), on input e2(i), we first compute e1(i) and then run the algorithm for e1(Q) on e1(i). How long does this take? The conversion of encodings takes time O(n), and therefore |e1(i)| = O(n), since the output of a serial computer cannot be longer than its running time. Solving the problem on e1(i) takes time O(|e1(i)|) = O(n), which is polynomial since both c and k are constants. Thus, whether an abstract problem has its instances encoded in binary or base 3 does not affect its "complexity," that is, whether it is polynomial-time solvable or not, but if instances are encoded in unary, its complexity may change. In order to be able to converse in an encoding-independent fashion, we shall generally assume that problem instances are encoded in any reasonable, concise fashion, unless we specifically say otherwise. To be precise, we shall assume that the encoding of an integer is polynomially related to its binary representation, and that the encoding of a finite set is polynomially related to its encoding as a list of its elements, enclosed in braces and separated by commas. (ASCII is one such encoding scheme.) With such a "standard" encoding in hand, we can derive reasonable encodings of other mathematical objects, such as tuples, graphs, and formulas. To denote the standard encoding of an object, we shall enclose the object in angle braces. Thus, G denotes the standard encoding of a graph G. As long as we implicitly use an encoding that is polynomially related to this standard encoding, we can talk directly about abstract problems without reference to any particular encoding, knowing that the choice of encoding has no effect on whether the abstract problem is polynomial-time solvable. Henceforth, we shall generally assume that all problem instances are binary strings encoded using the standard encoding, unless we explicitly specify the contrary. We shall also typically neglect the distinction between abstract and concrete problems. The reader should watch out for problems that arise in practice, however, in which a standard encoding is not obvious and the encoding does make a difference. A formal-language framework One of the convenient aspects of focusing on decision problems is that they make it easy to use the machinery of formal-language theory. It is worthwhile at this point to review some definitions from that theory. An alphabet Σ is a finite set of symbols. A language L over Σ is any set of strings made up of symbols from Σ. For example, if Σ = {0, 1}, the set L = {10, 11, 101, 111, 1011, 1101, 10001,...} is the language of binary representations of prime numbers. We denote the empty string by ε, and the empty language by Ø. The language of all strings over Σ is denoted Σ*. For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000,...} is the set of all binary strings. Every language L over Σ is a subset of Σ*. There are a variety of operations on languages. Set-theoretic operations, such as union and intersection, follow directly from the set-theoretic definitions. We define the complement of L by . The concatenation of two languages L1 and L2 is the language L = {x1x2 : x1 L1 and x2 L2}. The closure or Kleene star of a language L is the language L*= {ε} L L L ···, where Lk is the language obtained by

2,817 citations


Cites background from "Storing a Sparse Table with 0(1) Wo..."

  • ...Fredman, Komlós, and Szemerédi [96] developed the perfec t hashing scheme for static sets presented in Section 11....

    [...]

Book
05 Aug 2002
TL;DR: Digraphs is an essential, comprehensive reference for undergraduate and graduate students, and researchers in mathematics, operations research and computer science, and it will also prove invaluable to specialists in related areas, such as meteorology, physics and computational biology.
Abstract: The theory of directed graphs has developed enormously over recent decades, yet this book (first published in 2000) remains the only book to cover more than a small fraction of the results. New research in the field has made a second edition a necessity. Substantially revised, reorganised and updated, the book now comprises eighteen chapters, carefully arranged in a straightforward and logical manner, with many new results and open problems. As well as covering the theoretical aspects of the subject, with detailed proofs of many important results, the authors present a number of algorithms, and whole chapters are devoted to topics such as branchings, feedback arc and vertex sets, connectivity augmentations, sparse subdigraphs with prescribed connectivity, and also packing, covering and decompositions of digraphs. Throughout the book, there is a strong focus on applications which include quantum mechanics, bioinformatics, embedded computing, and the travelling salesman problem. Detailed indices and topic-oriented chapters ease navigation, and more than 650 exercises, 170 figures and 150 open problems are included to help immerse the reader in all aspects of the subject. Digraphs is an essential, comprehensive reference for undergraduate and graduate students, and researchers in mathematics, operations research and computer science. It will also prove invaluable to specialists in related areas, such as meteorology, physics and computational biology.

1,938 citations


Cites background from "Storing a Sparse Table with 0(1) Wo..."

  • ...Schmidt and Siegel [653] following Fredman, Komlós and Szemerédi [277] gave an explicit construction of a k-perfect family from {1, 2, ....

    [...]

Proceedings ArticleDOI
30 Oct 2006
TL;DR: In this paper, the authors proposed a searchable symmetric encryption (SSE) scheme for the multi-user setting, where queries to the server can be chosen adaptively during the execution of the search.
Abstract: Searchable symmetric encryption (SSE) allows a party to outsource the storage of its data to another party (a server) in a private manner, while maintaining the ability to selectively search over it. This problem has been the focus of active research in recent years. In this paper we show two solutions to SSE that simultaneously enjoy the following properties: Both solutions are more efficient than all previous constant-round schemes. In particular, the work performed by the server per returned document is constant as opposed to linear in the size of the data. Both solutions enjoy stronger security guarantees than previous constant-round schemes. In fact, we point out subtle but serious problems with previous notions of security for SSE, and show how to design constructions which avoid these pitfalls. Further, our second solution also achieves what we call adaptive SSE security, where queries to the server can be chosen adaptively (by the adversary) during the execution of the search; this notion is both important in practice and has not been previously considered.Surprisingly, despite being more secure and more efficient, our SSE schemes are remarkably simple. We consider the simplicity of both solutions as an important step towards the deployment of SSE technologies.As an additional contribution, we also consider multi-user SSE. All prior work on SSE studied the setting where only the owner of the data is capable of submitting search queries. We consider the natural extension where an arbitrary group of parties other than the owner can submit search queries. We formally define SSE in the multi-user setting, and present an efficient construction that achieves better performance than simply using access control mechanisms.

1,673 citations

Proceedings ArticleDOI
01 Jun 1997
TL;DR: This work develops the notion of mining rules that identify correlations (generalizing associations), and proposes measuring significance of associations via the chi-squared test for correlation from classical statistics, enabling the mining problem to reduce to the search for a border between correlated and uncorrelated itemsets in the lattice.
Abstract: One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing item A often also purchases item B.” Motivated by the goal of generalizing beyond market baskets and the association rules used with them, we develop the notion of mining rules that identify correlations (generalizing associations), and we consider both the absence and presence of items as a basis for generating rules. We propose measuring significance of associations via the chi-squared test for correlation from classical statistics. This leads to a measure that is upward closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between correlated and uncorrelated itemsets in the lattice. We develop pruning strategies and devise an efficient algorithm for the resulting problem. We demonstrate its effectiveness by testing it on census data and finding term dependence in a corpus of text documents, as well as on synthetic data.

1,528 citations


Cites methods from "Storing a Sparse Table with 0(1) Wo..."

  • ...We propose an implementation based on perfect hash tables (see [10, 7] for a description of the perfect hash function we used)....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: It is shown that, in a rather general model including al1 the commonly-used schemes, $\lceil $ lg(n+l) $\rceil$ probes to the table are needed in the worst case, provided the key space is sufficiently large.
Abstract: We examine optimality questions in the following information retrieval problem: Given a set S of n keys, store them so that queries of the form "Is x $\in$ S?" can be answered quickly. It is shown that, in a rather general model including al1 the commonly-used schemes, $\lceil$ lg(n+l) $\rceil$ probes to the table are needed in the worst case, provided the key space is sufficiently large. The effects of smaller key space and arbitrary encoding are also explored.

405 citations

Journal ArticleDOI
TL;DR: This work proposes a good worst-case method for storing a static table of n entries, each an integer between 0 and N - 1, and analysis shows why a simpler algorithm used for compressing LR parsing tables works so well.
Abstract: The problem of storing and searching large sparse tables is ubiquitous in computer science. The standard technique for storing such tables is hashing, but hashing has poor worst-case performance. We propose a good worst-case method for storing a static table of n entries, each an integer between 0 and N - 1. The method requires O(n) words of storage and allows O(lognN) access time. Although our method is a little complicated to use in practice, our analysis shows why a simpler algorithm used for compressing LR parsing tables works so well.

295 citations


"Storing a Sparse Table with 0(1) Wo..." refers background or methods in this paper

  • ...With Yao's method, the set S is stored in T by placing k in T[0], to invoke the coloring Ck, and placing x E S in T[j] where j is the color of x under Ck....

    [...]

  • ...The value k in T[O] induces a coloring of U with n colors, namely x ---, (kx mod p)mod n. Yao's two-probe method is likewise based on an indexed family x = {Ck}, I × I -- m, of n-colorings having the property that for each S C_ U, I S I = n, there exists a Ck in X that is one-to-one when restricted to S. Yao refers to such a family X as a separating system....

    [...]

  • ...To extend Yao's method when m = exp(o(n)), we resign ourselves to the fact that collisions are inevitable under the coloring induced by T[0]....

    [...]

  • ...Yao and Tarjan [3] show that O(n) storage and worst case query time O(log m/log n) can generally be attained....

    [...]

  • ...The design of this auxiliary data structure is a slight modification of a similar construction due to Tarjan and Yao [3]....

    [...]

Journal ArticleDOI
TL;DR: A refinement of hashing which allows retrieval of an item in a static table with a single probe is considered, and a rough comparison with ordinary hashing is given which shows that this method can be used conveniently in several practical applications.
Abstract: A refinement of hashing which allows retrieval of an item in a static table with a single probe is considered. Given a set I of identifiers, two methods are presented for building, in a mechanical way, perfect hashing functions, i.e. functions transforming the elements of I into unique addresses. The first method, the “quotient reduction” method, is shown to be complete in the sense that for every set I the smallest table in which the elements of I can be stored and from which they can be retrieved by using a perfect hashing function constructed by this method can be found. However, for nonuniformly distributed sets, this method can give rather sparse tables. The second method, the “remainder reduction” method, is not complete in the above sense, but it seems to give minimal (or almost minimal) tables for every kind of set. The two techniques are applicable directly to small sets. Some methods to extend these results to larger sets are also presented. A rough comparison with ordinary hashing is given which shows that this method can be used conveniently in several practical applications.

155 citations

Journal ArticleDOI
TL;DR: A method is presented for building minimal perfect hash functions, i.e., functions which allow single probe retrieval from minimally sized tables of identifier sets, and a proof of existence for minimalperfect hash functions of a special type (reciprocal type) is given.
Abstract: A method is presented for building minimal perfect hash functions, i.e., functions which allow single probe retrieval from minimally sized tables of identifier sets. A proof of existence for minimal perfect hash functions of a special type (reciprocal type) is given. Two algorithms for determining hash functions of reciprocal type are presented and their practical limitations are discussed. Further, some application results are given and compared with those of earlier approaches for perfect hashing.

95 citations


"Storing a Sparse Table with 0(1) Wo..." refers background in this paper

  • ...This question has been considered in various papers, for example, [1]-[4]....

    [...]