
Showing papers in "Journal of the ACM in 2010"


Journal ArticleDOI
TL;DR: The nested Chinese restaurant process (nCRP) as discussed by the authors is a stochastic process that assigns probability distributions to ensembles of infinitely deep, infinitely branching trees, and it can be used as a prior distribution in a Bayesian nonparametric model of document collections.
Abstract: We present the nested Chinese restaurant process (nCRP), a stochastic process that assigns probability distributions to ensembles of infinitely deep, infinitely branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning—the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
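The generative process itself is easy to state: each document starts at the root and, level by level, either follows an existing branch with probability proportional to how many earlier documents took it, or opens a new branch. A minimal sketch of this seating rule in Python (truncated to a fixed depth for illustration; function and parameter names are ours, not the paper's):

```python
import random
from collections import defaultdict

def crp_pick(counts, gamma):
    """Standard CRP seating rule: pick existing branch c with probability
    proportional to counts[c], or open a new branch with probability
    proportional to gamma."""
    total = sum(counts.values()) + gamma
    r = random.uniform(0, total)
    for c, n in counts.items():
        r -= n
        if r < 0:
            return c
    return max(counts, default=-1) + 1  # open a new branch

def ncrp_sample_paths(num_docs, depth, gamma=1.0):
    """Draw a root-to-leaf path of `depth` levels for each document;
    each tree node keeps CRP counts over its children."""
    child_counts = defaultdict(lambda: defaultdict(int))  # node -> child -> count
    paths = []
    for _ in range(num_docs):
        node, path = (), []
        for _ in range(depth):
            c = crp_pick(child_counts[node], gamma)
            child_counts[node][c] += 1
            node = node + (c,)
            path.append(c)
        paths.append(tuple(path))
    return paths

print(ncrp_sample_paths(num_docs=10, depth=3))
```

The preferential-attachment behavior described in the abstract is visible here: branches chosen by many earlier documents attract later ones, producing clusters of shared path prefixes.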

613 citations


Journal ArticleDOI
TL;DR: The Lovasz Local Lemma, as discussed in this paper, is a powerful tool to nonconstructively prove the existence of combinatorial objects meeting a prescribed collection of criteria; this paper provides a method for making almost all of its known applications algorithmic.
Abstract: The Lovasz Local Lemma discovered by Erdos and Lovasz in 1975 is a powerful tool to non-constructively prove the existence of combinatorial objects meeting a prescribed collection of criteria. In 1991, Jozsef Beck was the first to demonstrate that a constructive variant can be given under certain more restrictive conditions, starting a whole line of research aimed at improving his algorithm's performance and relaxing its restrictions. In the present article, we improve upon recent findings so as to provide a method for making almost all known applications of the general Local Lemma algorithmic.
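The algorithmic line of work referred to here culminates in resampling algorithms in the style of Moser and Tardos: draw all variables at random, and while some bad event (violated constraint) persists, redraw just the variables it depends on. A toy sketch for CNF formulas, assuming a satisfiable instance in the Local Lemma regime (our own illustration, not code from the article):

```python
import random

def moser_tardos_sat(clauses, num_vars, rng=random.Random(0)):
    """Moser-Tardos style resampling for a CNF formula.
    clauses: list of clauses, each a list of nonzero ints;
    literal v means variable |v| must be True, -v means False.
    Terminates quickly when each clause shares variables with few
    others (the Local Lemma regime)."""
    assign = [rng.random() < 0.5 for _ in range(num_vars + 1)]  # index 0 unused

    def violated(clause):
        # A clause is violated iff every one of its literals is falsified.
        return all(assign[abs(l)] != (l > 0) for l in clause)

    while True:
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign[1:]
        # Resample every variable of one violated clause independently.
        for l in bad[0]:
            assign[abs(l)] = rng.random() < 0.5

clauses = [[1, 2, -3], [-1, 3, 4], [2, -4, 5]]
print(moser_tardos_sat(clauses, num_vars=5))
```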

567 citations


Journal ArticleDOI
TL;DR: This issue inaugurates the Invited Articles section of JACM and is comprised of the article “Epistemic Privacy” by Alexandre Evfimievski, Ronald Fagin and David Woodruff, selected from the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), held in Vancouver, Canada, June 9–11, 2008.
Abstract: This issue inaugurates the Invited Articles section of JACM. Each year, JACM invites a small number of articles to appear in the journal based on the recommendation of the Program Committees of several major conferences in Computer Science. All invited articles are reviewed according to the standard JACM refereeing process. In this issue, the Invited Articles section is comprised of the article “Epistemic Privacy” by Alexandre Evfimievski, Ronald Fagin and David Woodruff, selected from the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), held in Vancouver, Canada, June 9–11, 2008. I thank the Program Committee of PODS 2008 and the PC Chair, Maurizio Lenzerini, for their help in selecting this invited paper.

424 citations


Journal ArticleDOI
TL;DR: This work presents a complete complexity classification of the constraint satisfaction problem (CSP) for temporal constraint languages: if the constraint language is contained in one out of nine temporal constraint languages, then the CSP can be solved in polynomial time; otherwise, the CSP is NP-complete.
Abstract: A temporal constraint language is a set of relations that has a first-order definition in (Q; <), the rational numbers with their usual (dense) linear order. We present a complete complexity classification of the constraint satisfaction problem (CSP) for temporal constraint languages: if the constraint language is contained in one out of nine temporal constraint languages, then the CSP can be solved in polynomial time; otherwise, the CSP is NP-complete.

178 citations


Journal ArticleDOI
TL;DR: This work presents a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions and leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems.
Abstract: We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1+ϵ)-approximations with probability ≥ 1/2 and running times of O(2^{(k/ϵ)^{O(1)}} dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions) assuming k and ϵ are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications.
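A key primitive behind such linear-time clustering schemes is that the centroid of a small random sample is already a good center: by a lemma of Inaba et al., the centroid of O(1/ϵ) uniformly sampled points is a (1+ϵ)-approximate 1-mean with constant probability. The sketch below illustrates only this sampling primitive, not the paper's full k-clustering algorithm; names and the repetition strategy are ours.

```python
import random

def one_mean_cost(points, center):
    """Sum of squared Euclidean distances from the points to a center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def sampled_one_mean(points, eps, trials=10, rng=random.Random(0)):
    """Centroid of a uniform sample of size ~2/eps is a (1+eps)-approximate
    1-mean with probability >= 1/2 (Inaba et al.); repeating `trials` times
    and keeping the cheapest candidate boosts the success probability."""
    m, d = max(1, int(2 / eps)), len(points[0])
    best, best_cost = None, float("inf")
    for _ in range(trials):
        sample = [rng.choice(points) for _ in range(m)]
        centroid = tuple(sum(p[i] for p in sample) / m for i in range(d))
        cost = one_mean_cost(points, centroid)
        if cost < best_cost:
            best, best_cost = centroid, cost
    return best

rng = random.Random(1)
pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
print(sampled_one_mean(pts, eps=0.1))   # close to the true mean (0, 0)
```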

153 citations


Journal ArticleDOI
TL;DR: The Routing Betweenness Centrality (RBC) measure is defined that generalizes previously well-known Betweenness measures such as the Shortest Path Betweenness, Flow Betweenness, and Traffic Load Centrality by considering network flows created by arbitrary loop-free routing strategies.
Abstract: The Betweenness-Centrality measure is often used in social and computer communication networks to estimate the potential monitoring and control capabilities a vertex may have on data flowing in the network. In this article, we define the Routing Betweenness Centrality (RBC) measure that generalizes previously well-known Betweenness measures such as the Shortest Path Betweenness, Flow Betweenness, and Traffic Load Centrality by considering network flows created by arbitrary loop-free routing strategies. We present algorithms for computing RBC of all the individual vertices in the network and algorithms for computing the RBC of a given group of vertices, where the RBC of a group of vertices represents their potential to collaboratively monitor and control data flows in the network. Two types of collaborations are considered: (i) conjunctive—the group is a sequence of vertices controlling traffic, where all members of the sequence process the traffic in the order defined by the sequence, and (ii) disjunctive—the group is a set of vertices controlling traffic, where at least one member of the set processes the traffic. The algorithms presented in this paper also take into consideration different sampling rates of network monitors, accommodate arbitrary communication patterns between the vertices (traffic matrices), and can be applied to groups consisting of vertices and/or edges. For the cases of routing strategies that depend on both the source and the target of the message, we present algorithms with time complexity of O(n^2 m), where n is the number of vertices in the network and m is the number of edges in the routing tree (or the routing directed acyclic graph (DAG) for the cases of multi-path routing strategies). The time complexity can be reduced by an order of n if we assume that the routing decisions depend solely on the target of the messages. Finally, we show that a preprocessing of O(n^2 m) time supports computations of RBC of sequences in O(kn) time and computations of RBC of sets in O(n·3^k) time, where k is the number of vertices in the sequence or the set.
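For a single source–target pair under a loop-free routing strategy, the quantity being aggregated is the probability that a packet passes through each vertex, which can be computed by pushing probability mass through the routing DAG in topological order; RBC then sums these probabilities over source–target pairs weighted by the traffic matrix. A toy sketch of that propagation step, under our own data layout (not the paper's O(n^2 m) algorithm):

```python
from collections import defaultdict, deque

def pass_through_probs(next_hop, s, t):
    """next_hop[(u, t)] maps each next hop w to its forwarding probability
    for packets headed to t; routing is assumed loop-free. Returns, for
    every vertex v, the probability that a packet from s to t visits v."""
    out, indeg, nodes = defaultdict(list), defaultdict(int), {s, t}
    for (u, tgt), hops in next_hop.items():
        if tgt != t:
            continue
        for w, p in hops.items():
            out[u].append((w, p))
            indeg[w] += 1
            nodes.update((u, w))
    prob = {v: 0.0 for v in nodes}
    prob[s] = 1.0
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:                      # push mass in topological order
        u = queue.popleft()
        if u == t:
            continue
        for w, p in out[u]:
            prob[w] += prob[u] * p
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return prob

# Two equal-probability paths 1->2->4 and 1->3->4.
routing = {(1, 4): {2: 0.5, 3: 0.5}, (2, 4): {4: 1.0}, (3, 4): {4: 1.0}}
print(pass_through_probs(routing, s=1, t=4))  # vertices 2 and 3 each get 0.5
```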

149 citations


Journal ArticleDOI
TL;DR: An augmented version of the PAC model designed for semi-supervised learning is described, that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community and provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues.
Abstract: Supervised learning—that is, learning from labeled examples—is an area of Machine Learning that has reached substantial maturity. It has generated general-purpose and practically successful algorithms and the foundations are quite well understood and captured by theoretical frameworks such as the PAC-learning model and the Statistical Learning theory framework. However, for many contemporary practical problems such as classifying web pages or detecting spam, there is often additional information available in the form of unlabeled data, which is often much cheaper and more plentiful than labeled data. As a consequence, there has recently been substantial interest in semi-supervised learning—using unlabeled data together with labeled data—since any useful information that reduces the amount of labeled data needed can be a significant benefit. Several techniques have been developed for doing this, along with experimental results on a variety of different learning problems. Unfortunately, the standard learning frameworks for reasoning about supervised learning do not capture the key aspects and the assumptions underlying these semi-supervised learning methods. In this article, we describe an augmented version of the PAC model designed for semi-supervised learning, that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model where, in addition to a concept class C, one also proposes a compatibility notion: a type of compatibility that one believes the target concept should have with the underlying distribution of data. Unlabeled data is then potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the size of the search space from the whole set of hypotheses C down to those that, according to one's assumptions, are a-priori reasonable with respect to the distribution. As we show, many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in this framework. After proposing the model, we then analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what the key quantities are that these numbers depend on. We also consider the algorithmic question of how to efficiently optimize for natural classes and compatibility notions, and provide several algorithmic results including an improved bound for Co-Training with linear separators when the distribution satisfies independence given the label.
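The mechanism of the model can be made concrete in a few lines: unlabeled data is used to estimate each hypothesis's compatibility with the underlying distribution and prune the hypothesis space, after which labeled data selects among the survivors. A schematic sketch under our own toy instantiation (threshold classifiers with a margin-style compatibility; all names are illustrative):

```python
def semi_supervised_select(hypotheses, compatibility, unlabeled, labeled, tau=0.9):
    """Prune to hypotheses whose average compatibility on the unlabeled
    sample is at least tau, then return the survivor with the fewest
    labeled errors (falling back to all hypotheses if none survives)."""
    def avg_compat(h):
        return sum(compatibility(h, x) for x in unlabeled) / len(unlabeled)
    survivors = [h for h in hypotheses if avg_compat(h) >= tau] or list(hypotheses)
    return min(survivors, key=lambda h: sum(h(x) != y for x, y in labeled))

# Toy usage: threshold classifiers h_t(x) = [x > t]; a point is compatible
# with h_t when it lies far from the decision threshold t (a margin notion).
def make_h(t):
    h = lambda x: int(x > t)
    h.t = t
    return h

hs = [make_h(t) for t in (0.0, 0.5, 1.0)]
compat = lambda h, x: 1.0 if abs(x - h.t) > 0.35 else 0.0
best = semi_supervised_select(hs, compat,
                              unlabeled=[-1.0, -0.8, 1.3, 1.6],
                              labeled=[(-0.9, 0), (1.4, 1)])
print(best.t)   # a highly compatible threshold consistent with the labels
```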

132 citations


Journal ArticleDOI
TL;DR: In this article, a duality-based algorithmic technique is proposed that yields a simple (2 + ϵ)-approximate greedy policy for the Feedback MAB problem, a subclass of the restless bandit problem, with close connections to the Whittle index.
Abstract: The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. In this article, we consider the Feedback MAB problem, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the Feedback MAB problem does not admit greedy index-based optimal policies. We develop a novel duality-based algorithmic technique that yields a surprisingly simple and intuitive (2+ϵ)-approximate greedy policy for this problem. We show that both in terms of approximation factor and computational efficiency, our policy is closely related to the Whittle index, which is widely used for its simplicity and efficiency of computation. Subsequently, we define a multi-state generalization, which we term Monotone bandits, that remains a subclass of the restless bandit problem. We show that our policy remains a 2-approximation in this setting, and further, our technique is robust enough to incorporate various side-constraints such as blocking plays, switching costs, and even models where determining the state of an arm is a separate operation from playing it. Our technique is also of independent interest for other restless bandit problems, and we provide an example in nonpreemptive machine replenishment. Interestingly, in this case, our policy provides a constant factor guarantee, whereas the Whittle index is provably polynomially worse. By presenting the first O(1) approximations for nontrivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts.
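For intuition about Feedback MAB policies: since an arm's on/off state is revealed only when played, a policy tracks a belief (the posterior probability that each arm is on) that evolves under the Markov dynamics and resets to the observed state on a play. The sketch below simulates the simplest such rule, the myopic "play the arm most likely to be on" index; this is only an illustration of the belief machinery, not the paper's duality-based (2+ϵ) policy:

```python
import random

def update_belief(b, p01, p10, observed=None):
    """Belief b = P(arm is 'on'). If the arm was just played, the state is
    revealed and the belief resets to it; then one step of the on/off
    Markov chain is applied (p01 = P(off->on), p10 = P(on->off))."""
    if observed is not None:
        b = 1.0 if observed else 0.0
    return b * (1.0 - p10) + (1.0 - b) * p01

def myopic_arm(beliefs):
    """Illustrative myopic index: play the arm most likely to be 'on'."""
    return max(range(len(beliefs)), key=beliefs.__getitem__)

rng = random.Random(0)
params = [(0.2, 0.1), (0.05, 0.05)]              # (p01, p10) per arm
state = [rng.random() < 0.5 for _ in params]     # hidden on/off states
beliefs = [0.5] * len(params)
reward = 0
for _ in range(1000):
    i = myopic_arm(beliefs)
    obs = state[i]                               # playing reveals the state
    reward += obs                                # reward 1 iff arm i is on
    state = [(rng.random() >= p10) if s else (rng.random() < p01)
             for s, (p01, p10) in zip(state, params)]
    beliefs = [update_belief(b, p01, p10, obs if j == i else None)
               for j, (b, (p01, p10)) in enumerate(zip(beliefs, params))]
print(reward / 1000.0)                           # time-average reward
```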

93 citations


Journal ArticleDOI
TL;DR: This is the first method that guarantees polylogarithmic update and query cost for arbitrary sequences of insertions and deletions, improving on the previous O(n^ϵ)-time method of Agarwal and Matoušek from a decade ago.
Abstract: We present a fully dynamic randomized data structure that can answer queries about the convex hull of a set of n points in three dimensions, where insertions take O(log^3 n) expected amortized time, deletions take O(log^6 n) expected amortized time, and extreme-point queries take O(log^2 n) worst-case time. This is the first method that guarantees polylogarithmic update and query cost for arbitrary sequences of insertions and deletions, and it improves on the previous O(n^ϵ)-time method of Agarwal and Matoušek from a decade ago. As a consequence, we obtain similar results for nearest neighbor queries in two dimensions and improved results for numerous fundamental geometric problems (such as levels in three dimensions and dynamic Euclidean minimum spanning trees in the plane).

90 citations


Journal ArticleDOI
TL;DR: It is shown that problems with small uniform constant-depth circuits have algorithms that simultaneously have small space and time bounds, and known time-space tradeoff lower bounds are used to show that SAT requires uniform depth-d TC^0 and AC^0[6] circuits of size n^{1+c} for some constant c depending on d.
Abstract: We observe that many important computational problems in NC^1 share a simple self-reducibility property. We then show that, for any problem A having this self-reducibility property, A has polynomial-size TC^0 circuits if and only if it has TC^0 circuits of size n^{1+ϵ} for every ϵ > 0 (counting the number of wires in a circuit as the size of the circuit). As an example of what this observation yields, consider the Boolean Formula Evaluation problem (BFE), which is complete for NC^1 and has the self-reducibility property. It follows from a lower bound of Impagliazzo, Paturi, and Saks that BFE requires depth-d TC^0 circuits of size n^{1+ϵ_d}. If one were able to improve this lower bound to show that there is some constant ϵ > 0 (independent of the depth d) such that every TC^0 circuit family recognizing BFE has size at least n^{1+ϵ}, then it would follow that TC^0 ≠ NC^1. We show that proving lower bounds of the form n^{1+ϵ} is not ruled out by the Natural Proof framework of Razborov and Rudich, and hence there is currently no known barrier for separating classes such as ACC^0, TC^0 and NC^1 via existing “natural” approaches to proving circuit lower bounds. We also show that problems with small uniform constant-depth circuits have algorithms that simultaneously have small space and time bounds. We then make use of known time-space tradeoff lower bounds to show that SAT requires uniform depth-d TC^0 and AC^0[6] circuits of size n^{1+c} for some constant c depending on d.

83 citations


Journal ArticleDOI
TL;DR: A theoretical framework for discovering relationships between two database instances over distinct and unknown schemata is introduced and it is shown that this definition yields “intuitive” results when applied on database instances derived from each other by basic operations.
Abstract: We introduce a theoretical framework for discovering relationships between two database instances over distinct and unknown schemata. This framework is grounded in the context of data exchange. We formalize the problem of understanding the relationship between two instances as that of obtaining a schema mapping so that a minimum repair of this mapping provides a perfect description of the target instance given the source instance. We show that this definition yields “intuitive” results when applied on database instances derived from each other by basic operations. We study the complexity of decision problems related to this optimality notion in the context of different logical languages and show that, even in very restricted cases, the problem is of high complexity.

Journal ArticleDOI
Gabriel Nivasch
TL;DR: In this paper, the maximum length λ_s(n) of a Davenport–Schinzel sequence of order s on n distinct symbols was shown to be at most n · 2^{(1/t!)α(n)^t + O(α(n)^{t-1})} for s ≥ 4 even.
Abstract: We present several new results regarding λ_s(n), the maximum length of a Davenport–Schinzel sequence of order s on n distinct symbols. First, we prove that λ_s(n) ≤ n · 2^{(1/t!)α(n)^t + O(α(n)^{t-1})} for s ≥ 4 even, and λ_s(n) ≤ n · 2^{(1/t!)α(n)^t log_2 α(n) + O(α(n)^t)} for s ≥ 3 odd, where t = ⌊(s-2)/2⌋, and α(n) denotes the inverse Ackermann function. The previous upper bounds, by Agarwal et al. [1989], had a leading coefficient of 1 instead of 1/t! in the exponent. The bounds for even s are now tight up to lower-order terms in the exponent. These new bounds result from a small improvement on the technique of Agarwal et al. More importantly, we also present a new technique for deriving upper bounds for λ_s(n). This new technique is very similar to the one we applied to the problem of stabbing interval chains [Alon et al. 2008]. With this new technique we: (1) re-derive the upper bound of λ_3(n) ≤ 2nα(n) + O(n√α(n)) (first shown by Klazar [1999]); (2) re-derive our own new upper bounds for general s; and (3) obtain improved upper bounds for the generalized Davenport–Schinzel sequences considered by Adamec et al. [1992]. Regarding lower bounds, we show that λ_3(n) ≥ 2nα(n) − O(n) (the previous lower bound (Sharir and Agarwal, 1995) had a coefficient of 1/2), so the coefficient 2 is tight. We also present a simpler variant of the construction of Agarwal et al. [1989] that achieves the known lower bounds of λ_s(n) ≥ n · 2^{(1/t!)α(n)^t − O(α(n)^{t-1})} for s ≥ 4 even.

Journal ArticleDOI
TL;DR: A novel clock synchronization algorithm is presented and it is proved that the techniques are optimal also with respect to the maximum clock drift, the uncertainty in message delays, and the imposed bounds on the clock rates.
Abstract: We present a novel clock synchronization algorithm and prove tight upper and lower bounds on the worst-case clock skew that may occur between any two participants in any given distributed system. More importantly, the worst-case clock skew between neighboring nodes is (asymptotically) at most a factor of two larger than the best possible bound. While previous results solely focused on the dependency of the skew bounds on the network diameter, we prove that our techniques are optimal also with respect to the maximum clock drift, the uncertainty in message delays, and the imposed bounds on the clock rates. The presented results all hold in a general model where both the clock drifts and the message delays may vary arbitrarily within pre-specified bounds. Furthermore, our algorithm exhibits a number of other highly desirable properties. First, the algorithm ensures that the clock values remain in an affine linear envelope of real time. A better worst-case bound on the accuracy with respect to real time cannot be achieved in the absence of an external timer. Second, the algorithm minimizes the number and size of messages that need to be exchanged in a given time period. Moreover, only a small number of bits must be stored locally for each neighbor. Finally, our algorithm can easily be adapted for a variety of other prominent synchronization models.

Journal ArticleDOI
TL;DR: The weakest failure detector for the basic register object is determined and it is shown to be the same for all popular atomic objects including test-and-set, fetch-and-add, queue, consensus and compare-and-swap.
Abstract: This article determines the weakest failure detectors to implement shared atomic objects in a distributed system with crash-prone processes. We first determine the weakest failure detector for the basic register object. We then use that to determine the weakest failure detector for all popular atomic objects including test-and-set, fetch-and-add, queue, consensus and compare-and-swap, which we show is the same.

Journal ArticleDOI
TL;DR: This improves on Blum and Kannan's algorithm for the uniform distribution over a ball, in the time and sample complexity and in the generality of the input distribution.
Abstract: We give an algorithm to learn an intersection of k halfspaces in R^n whose normals span an l-dimensional subspace. For any input distribution with a logconcave density such that the bounding hyperplanes of the k halfspaces pass through its mean, the algorithm (ϵ,δ)-learns with time and sample complexity bounded by (nkl/ϵ)^{O(l)} log(1/ϵδ). The hypothesis found is an intersection of O(k log(1/ϵ)) halfspaces. This improves on Blum and Kannan's algorithm for the uniform distribution over a ball, in the time and sample complexity (previously doubly exponential) and in the generality of the input distribution.

Journal ArticleDOI
TL;DR: It is shown that, contrary to Kleene's method, Newton's method always terminates for arbitrary idempotent and commutative semirings, and the number of iterations required to solve a system of n equations is at most n.
Abstract: This article presents a novel generic technique for solving dataflow equations in interprocedural dataflow analysis. The technique is obtained by generalizing Newton's method for computing a zero of a differentiable function to ω-continuous semirings. Complete semilattices, the common program analysis framework, are a special class of ω-continuous semirings. We show that our generalized method always converges to the solution, and requires at most as many iterations as current methods based on Kleene's fixed-point theorem. We also show that, contrary to Kleene's method, Newton's method always terminates for arbitrary idempotent and commutative semirings. More precisely, in the latter setting the number of iterations required to solve a system of n equations is at most n.
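The difference between the two iteration schemes is visible already on a one-variable equation over the nonnegative reals (an ω-continuous semiring). Below, a sketch comparing Kleene iteration x_{k+1} = f(x_k) with Newton's method applied to g(x) = f(x) − x on the toy equation X = 0.4 + 0.6·X^2, whose least solution is 2/3 (our example, not one from the article):

```python
def kleene(f, steps):
    """Kleene/fixed-point iteration: x_{k+1} = f(x_k), starting from 0."""
    x = 0.0
    for _ in range(steps):
        x = f(x)
    return x

def newton(f, df, steps):
    """Newton's method applied to g(x) = f(x) - x, starting from 0;
    df is the derivative of f."""
    x = 0.0
    for _ in range(steps):
        x = x - (f(x) - x) / (df(x) - 1.0)
    return x

# Dataflow-style equation X = 0.4 + 0.6 * X**2 over the reals;
# its least fixed point is 2/3.
f = lambda x: 0.4 + 0.6 * x * x
df = lambda x: 1.2 * x
for k in (2, 4, 8):
    print(k, kleene(f, k), newton(f, df, k))
```

Newton's iterates approach the least fixed point much faster than Kleene's here, mirroring the article's guarantee that Newton's method needs at most as many iterations as Kleene-based methods.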

Journal ArticleDOI
TL;DR: In this paper, the second eigenvalue of the Laplacian of graphs is upper bounded by using multi-commodity flows to deform the geometry of the graph and embed the resulting metric into Euclidean space.
Abstract: We present a new method for upper bounding the second eigenvalue of the Laplacian of graphs. Our approach uses multi-commodity flows to deform the geometry of the graph; we embed the resulting metric into Euclidean space to recover a bound on the Rayleigh quotient. Using this, we show that every n-vertex graph of genus g and maximum degree D satisfies λ_2(G) = O((g+1)^3 D/n). This recovers the O(D/n) bound of Spielman and Teng for planar graphs, and compares to Kelner's bound of O((g+1) poly(D)/n), but our proof does not make use of conformal mappings or circle packings. We are thus able to extend this to resolve positively a conjecture of Spielman and Teng, by proving that λ_2(G) = O(D h^6 log h / n) whenever G is K_h-minor free. This shows, in particular, that spectral partitioning can be used to recover O(√n)-sized separators in bounded degree graphs that exclude a fixed minor. We extend this further by obtaining nearly optimal bounds on λ_2 for graphs that exclude small-depth minors in the sense of Plotkin, Rao, and Smith. Consequently, we show that spectral algorithms find separators of sublinear size in a general class of geometric graphs. Moreover, while the standard “sweep” algorithm applied to the second eigenvector may fail to find good quotient cuts in graphs of unbounded degree, our approach produces a vector that works for arbitrary graphs. This yields an alternate proof of the well-known nonplanar separator theorem of Alon, Seymour, and Thomas that states that every excluded-minor family of graphs has O(√n)-node balanced separators.
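The "sweep" procedure mentioned is the standard spectral partitioning heuristic: compute the second eigenvector of the Laplacian, order the vertices by their coordinates, and return the best prefix cut. A generic dense-matrix sketch of that baseline (the standard textbook algorithm, not code from the paper):

```python
import numpy as np

def sweep_cut(adj):
    """adj: symmetric 0/1 adjacency matrix. Sorts vertices by their value
    in the second Laplacian eigenvector (the Fiedler vector) and returns
    the prefix whose cut has the lowest conductance."""
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj
    _, vecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    order = np.argsort(vecs[:, 1])         # sweep along the Fiedler vector
    n = len(order)
    best_ratio, best_set = np.inf, None
    in_set = np.zeros(n, dtype=bool)
    for i in range(n - 1):
        in_set[order[i]] = True
        cut = adj[in_set][:, ~in_set].sum()
        denom = min(deg[in_set].sum(), deg[~in_set].sum())
        ratio = cut / max(denom, 1)        # conductance of this sweep cut
        if ratio < best_ratio:
            best_ratio, best_set = ratio, in_set.copy()
    return np.flatnonzero(best_set)

# Two triangles joined by a single edge: the sweep cut finds that edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(sweep_cut(A))   # one of the two triangles, e.g. [0 1 2]
```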

Journal ArticleDOI
TL;DR: This work shows how factors such as schema information, the presence of node ids, and missing structural information affect the complexity of these main computational problems, and finds robust classes of incomplete XML descriptions that permit tractable query evaluation.
Abstract: We study models of incomplete information for XML, their computational properties, and query answering. While our approach is motivated by the study of relational incompleteness, incomplete information in XML documents may appear not only as null values but also as missing structural information. Our goal is to provide a classification of incomplete descriptions of XML documents, and separate features—or groups of features—that lead to hard computational problems from those that admit efficient algorithms. Our classification of incomplete information is based on the combination of null values with partial structural descriptions of documents. The key computational problems we consider are consistency of partial descriptions, representability of complete documents by incomplete ones, and query answering. We show how factors such as schema information, the presence of node ids, and missing structural information affect the complexity of these main computational problems, and find robust classes of incomplete XML descriptions that permit tractable query evaluation.

Journal ArticleDOI
TL;DR: New explicit constructions of deterministic randomness extractors, dispersers and related objects are presented, finding that objects that were designed to work with independent inputs sometimes perform well enough with correlated, high entropy inputs.
Abstract: We present new explicit constructions of deterministic randomness extractors, dispersers and related objects. We say that a distribution X on binary strings of length n is a δ-source if X assigns probability at most 2^{−δn} to any string of length n. For every δ > 0, we construct the following poly(n)-time computable functions:
2-source disperser: D: ({0,1}^n)^2 → {0,1} such that for any two independent δ-sources X_1, X_2 we have that the support of D(X_1, X_2) is {0,1}.
Bipartite Ramsey graph: Let N = 2^n. A corollary is that the function D is a 2-coloring of the edges of K_{N,N} (the complete bipartite graph over two sets of N vertices) such that any induced subgraph of size N^δ by N^δ is not monochromatic.
3-source extractor: E: ({0,1}^n)^3 → {0,1} such that for any three independent δ-sources X_1, X_2, X_3 we have that E(X_1, X_2, X_3) is o(1)-close to being an unbiased random bit.
No previous explicit construction was known for either of these for any δ < 1/2.

Journal ArticleDOI
TL;DR: In this paper, the authors introduce the first extensive axiomatic study of this setting, and explore a wide array of well-known and new personalized ranking systems, and fully classify the set of systems that satisfy all of these axioms.
Abstract: Personalized ranking systems and trust systems are an essential tool for collaboration in a multi-agent environment. In these systems, trust relations between many agents are aggregated to produce a personalized trust rating of the agents. In this article, we introduce the first extensive axiomatic study of this setting, and explore a wide array of well-known and new personalized ranking systems. We adapt several axioms (basic criteria) from the literature on global ranking systems to the context of personalized ranking systems, and fully classify the set of systems that satisfy all of these axioms. We further show that all these axioms are necessary for this result.

Journal ArticleDOI
TL;DR: A calculus of dependent types to serve as the semantic foundation for a family of languages called data description languages, designed to facilitate programming with ad hoc data, that is, data not in well-behaved relational or XML formats.
Abstract: In the spirit of Landin, we present a calculus of dependent types to serve as the semantic foundation for a family of languages called data description languages. Such languages, which include PADS, DataScript, and PacketTypes, are designed to facilitate programming with ad hoc data, that is, data not in well-behaved relational or XML formats. In the calculus, each type describes the physical layout and semantic properties of a data source. In the semantics, we interpret types simultaneously as the in-memory representation of the data described and as parsers for the data source. The parsing functions are robust, automatically detecting and recording errors in the data stream without halting parsing. We show the parsers are type-correct, returning data whose type matches the simple-type interpretation of the specification. We also prove the parsers are “error-correct,” accurately reporting the number of physical and semantic errors that occur in the returned data. We use the calculus to describe the features of various data description languages, and we discuss how we have used the calculus to improve PADS.

Journal ArticleDOI
TL;DR: It is shown that entangled quantum measurements on at least Ω(n log n) coset states are necessary to get useful information for the case of graph isomorphism, matching an information theoretic upper bound.
Abstract: It has been known for some time that graph isomorphism reduces to the hidden subgroup problem (HSP). What is more, most exponential speedups in quantum computation are obtained by solving instances of the HSP. A common feature of the resulting algorithms is the use of quantum coset states, which encode the hidden subgroup. An open question has been how hard it is to use these states to solve graph isomorphism. It was recently shown by Moore et al. [2005] that only an exponentially small amount of information is available from one, or a pair of coset states. A potential source of power to exploit are entangled quantum measurements that act jointly on many states at once. We show that entangled quantum measurements on at least Ω(n log n) coset states are necessary to get useful information for the case of graph isomorphism, matching an information theoretic upper bound. This may be viewed as a negative result because in general it seems hard to implement a given highly entangled measurement. Our main theorem is very general and also rules out using joint measurements on few coset states for some other groups, such as GL(n, F_{p^m}) and G^n where G is finite and satisfies a suitable property.

Journal ArticleDOI
TL;DR: It is shown that query evaluation can be done in polynomial time, but that emptiness (or, satisfiability) is 2ExpTime-complete, and that the expressive power of this XPath dialect equals that of FO(MTC) for Boolean, unary and binary queries.
Abstract: We study FO(MTC), first-order logic with monadic transitive closure, a logical formalism in between FO and MSO on trees. We characterize the expressive power of FO(MTC) in terms of nested tree-walking automata. Using the latter, we show that FO(MTC) is strictly less expressive than MSO, solving an open problem. We also present a temporal logic on trees that is expressively complete for FO(MTC), in the form of an extension of the XML document navigation language XPath with two operators: the Kleene star for taking the transitive closure of path expressions, and a subtree relativisation operator, allowing one to restrict attention to a specific subtree while evaluating a subexpression. We show that the expressive power of this XPath dialect equals that of FO(MTC) for Boolean, unary and binary queries. We also investigate the complexity of the automata model as well as the XPath dialect. We show that query evaluation can be done in polynomial time (combined complexity), but that emptiness (or satisfiability) is 2ExpTime-complete.

Journal ArticleDOI
TL;DR: A linear expected time algorithm for finding maximum cardinality matchings in sparse random graphs is presented and improves on previous results by a logarithmic factor.
Abstract: We present a linear expected time algorithm for finding maximum cardinality matchings in sparse random graphs. This is optimal and improves on previous results by a logarithmic factor.

Journal ArticleDOI
Ronald Fagin, Alan Nash
TL;DR: The notion of “essential conjunctions” is introduced, and it is shown that they play a crucial role in the study of inverses; they are used to give greatly simplified proofs of some known results about inverses.
Abstract: A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). The notion of an inverse of a schema mapping is subtle, because a schema mapping may associate many target instances with each source instance, and many source instances with each target instance. In PODS 2006, Fagin defined a notion of the inverse of a schema mapping. This notion is tailored to the types of schema mappings that commonly arise in practice (those specified by “source-to-target tuple-generating dependencies”, or s-t tgds). We resolve the key open problem of the complexity of deciding whether there is an inverse. We also explore a number of interesting questions, including: What is the structure of an inverse? When is the inverse unique? How many nonequivalent inverses can there be? When does an inverse have an inverse? How big must an inverse be? Surprisingly, these questions are all interrelated. We show that for schema mappings M specified by full s-t tgds (those with no existential quantifiers), if M has an inverse, then it has a polynomial-size inverse of a particularly nice form, and there is a polynomial-time algorithm for generating it. We introduce the notion of “essential conjunctions” (or “essential atoms” in the full case), and show that they play a crucial role in the study of inverses. We use them to give greatly simplified proofs of some known results about inverses. What emerges is a much deeper understanding about this fundamental and complex operator.

Journal ArticleDOI
TL;DR: This paper answers the question positively and shows that any doubling metric embeds into low-dimensional Euclidean spaces with small distortion, and gives a suite of embeddings with a smooth trade-off between distortion and dimension.
Abstract: We consider the problem of embedding a metric into low-dimensional Euclidean space. The classical theorems of Bourgain, and of Johnson and Lindenstrauss, say that any metric on n points embeds into an O(log n)-dimensional Euclidean space with O(log n) distortion. Moreover, a simple “volume” argument shows that this bound is nearly tight: a uniform metric on n points requires a nearly logarithmic number of dimensions to embed with logarithmic distortion. It is natural to ask whether such a volume restriction is the only hurdle to low-dimensional embeddings. In other words, do doubling metrics, which do not have large uniform submetrics, and thus no volume hurdles to low-dimensional embeddings, embed in low-dimensional Euclidean spaces with small distortion? In this article, we give a positive answer to this question. We show how to embed any doubling metric into O(log log n) dimensions with O(log n) distortion. This is the first embedding for doubling metrics into fewer than a logarithmic number of dimensions, even allowing for logarithmic distortion. This result is one extreme point of our general trade-off between distortion and dimension: given an n-point metric (V,d) with doubling dimension dim_D, and any target dimension T in the range Ω(dim_D log log n) ≤ T ≤ O(log n), we show that the metric embeds into Euclidean space R^T with O(log n · √(dim_D/T)) distortion.

Journal ArticleDOI
TL;DR: A new data structure for indexing protein 3-D structures based on the suffix tree, which can search for similar structures much faster than previous algorithms if the RMSD threshold is not larger than 1Å, and an efficient search algorithm is proposed.
Abstract: Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate index data structures for such 3-D structures are highly desired for research on proteins. This article proposes a new data structure for indexing protein 3-D structures. For strings, there are many efficient indexing structures such as suffix trees, but it has been considered very difficult to design such sophisticated data structures against 3-D structures like proteins. Our index structure is based on the suffix tree and is called the geometric suffix tree. By using the geometric suffix tree for a set of protein structures, we can exactly search for all of their substructures whose RMSDs (root mean square deviations) or URMSDs (unit-vector root mean square deviations) to a given query 3-D structure are not larger than a given bound. Though there are O(N^2) substructures in a structure of size N, our data structure requires only O(N) space for indexing all the substructures. We propose an O(N^2) construction algorithm for it, while a naive algorithm would require O(N^3) time to construct it. Moreover, we propose an efficient search algorithm. Experiments show that we can search for similar structures much faster than previous algorithms if the RMSD threshold is not larger than 1Å. The experiments also show that the construction time of the geometric suffix tree is practically almost linear in the size of the database, when applied to a protein structure database.
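The RMSD at the core of the index is the classical optimal-superposition RMSD, computable in closed form with the Kabsch/SVD method; a generic sketch of this primitive (the standard computation, not the geometric suffix tree itself):

```python
import numpy as np

def rmsd(P, Q):
    """Minimum RMSD between two equal-length 3-D point sets P, Q
    (n x 3 arrays), minimized over all rotations and translations
    via the Kabsch/SVD method."""
    P = P - P.mean(axis=0)                   # optimal translation: center both
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)        # SVD of the 3x3 covariance matrix
    S[-1] *= np.sign(np.linalg.det(U @ Vt))  # flip last singular value if the
                                             # best orthogonal map is a reflection
    n = len(P)
    val = ((P ** 2).sum() + (Q ** 2).sum() - 2.0 * S.sum()) / n
    return float(np.sqrt(max(val, 0.0)))

# Sanity check: a rotated and translated copy has RMSD ~ 0.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
B = A @ R.T + np.array([1.0, 2.0, 3.0])
print(rmsd(A, B))
```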

Journal ArticleDOI
TL;DR: It is proved that the problem of existence of a solution of a system of set constraints with projections is in NEXPTIME, and thus that it is NEXPTIME-complete.
Abstract: Set constraints form a constraint system where variables range over the domain of sets of trees. They give a natural formalism for many problems in program analysis. Syntactically, set constraints are conjunctions of inclusions between expressions built over variables, constructors (constants and function symbols from a given signature) and a choice of set operators that defines the specific class of set constraints. In this article, we are interested in the class of set constraints with projections, which is the class with all Boolean operators (union, intersection and complement) and projections that in program analysis directly correspond to type destructors. We prove that the problem of existence of a solution of a system of set constraints with projections is in NEXPTIME, and thus that it is NEXPTIME-complete.

Journal ArticleDOI
TL;DR: The new notion of privacy is formalized using the well-known semantics for reasoning about knowledge, where logical properties correspond to sets of possible worlds (databases) that satisfy these properties, and characterization theorems for the possibilistic case and the probabilistic case are proved.
Abstract: We present a novel definition of privacy in the framework of offline (retroactive) database query auditing. Given information about the database, a description of sensitive data, and assumptions about users' prior knowledge, our goal is to determine if answering a past user's query could have led to a privacy breach. According to our definition, an audited property A is private, given the disclosure of property B, if no user can gain confidence in A by learning B, subject to prior knowledge constraints. Privacy is not violated if the disclosure of B causes a loss of confidence in A. The new notion of privacy is formalized using the well-known semantics for reasoning about knowledge, where logical properties correspond to sets of possible worlds (databases) that satisfy these properties. Database users are modeled as either possibilistic agents whose knowledge is a set of possible worlds, or as probabilistic agents whose knowledge is a probability distribution on possible worlds.We analyze the new privacy notion, show its relationship with the conventional approach, and derive criteria that allow the auditor to test privacy efficiently in some important cases. In particular, we prove characterization theorems for the possibilistic case, and study in depth the probabilistic case under the assumption that all database records are considered a-priori independent by the user, as well as under more relaxed (or absent) prior-knowledge assumptions. In the probabilistic case we show that for certain families of distributions there is no efficient algorithm to test whether an audited property A is private given the disclosure of a property B, assuming P ≠ NP. Nevertheless, for many interesting families, such as the family of product distributions, we obtain algorithms that are efficient both in theory and in practice.