Showing papers by "Srikanta Bedathur published in 2013"
08 Apr 2013
TL;DR: A scalable and highly efficient index structure for the reachability problem over graphs that imposes an explicit bound on the size of the index and flexibly assigns approximate reachability ranges to nodes of the graph so that the number of index probes needed to answer a query is minimized.
Abstract: In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, in which the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes needed to answer a query is minimized. The resulting tunable index structure generates a better range labeling if the space budget is increased, thus providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered on the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets.
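The interval labeling scheme the abstract builds on can be illustrated with a minimal sketch. The code below handles only the tree case, where each node's reachable set is a single contiguous post-order range; the paper's contribution (bounded, approximate range sets for general graphs with recursive verification) is more involved, and all function names here are illustrative assumptions.

```python
# Minimal sketch of interval-based reachability on a tree: each node gets a
# post-order id and a [lo, hi] interval covering the ids of all nodes it can
# reach. A reachability query is then a single range-containment check.
# (The paper generalizes this to graphs with bounded, approximate range sets.)

def assign_postorder_intervals(adj, root):
    """Assign post-order ids and a single (lo, hi) interval per node."""
    order, intervals = {}, {}
    counter = 0

    def dfs(u):
        nonlocal counter
        lo = counter                      # first id assigned in u's subtree
        for v in adj.get(u, []):
            if v not in order:
                dfs(v)
        order[u] = counter                # u's own post-order id
        counter += 1
        intervals[u] = (lo, order[u])     # u reaches exactly ids in [lo, order[u]]

    dfs(root)
    return order, intervals

def reaches(u, v, order, intervals):
    """u reaches v iff v's post-order id falls inside u's interval."""
    lo, hi = intervals[u]
    return lo <= order[v] <= hi
```

With approximate ranges, as in the paper, a containment hit would not be definitive and would trigger recursive probes of the target's children.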
108 citations
23 Jun 2013
TL;DR: This paper presents a detailed account of integrating the recently proposed highly compact reachability index called FERRARI into the RDF-3X engine to support property path evaluation and develops a set of queries over real-world RDF data that can serve as a benchmark set for evaluating the efficiency of property path queries.
Abstract: As Semantic Web efforts continue to gather steam, RDF engines are faced with graphs with millions of nodes and billions of edges. While much recent work addressing the resulting scalability issues in processing queries over these datasets has mainly considered SPARQL 1.0, the next-generation query language recommendations have proposed the addition of regular-expression-restricted navigation queries to SPARQL. We address the problem of supporting efficient processing of property paths in RDF-3X, a high-performance RDF engine. In this paper, we focus on a restricted class of property paths that is not only tractable but also most commonly used: instead of enumerating all paths that satisfy the given query, we focus on regular-expression-based reachability queries. Based on this, we make the following three major technical contributions: first, we present a detailed account of integrating the recently proposed highly compact reachability index called FERRARI into the RDF-3X engine to support property path evaluation; second, we show how property path queries can be efficiently answered using multiple instances of this index, one instance for each distinct label in the graph; and finally, we develop a set of queries over real-world RDF data that can serve as a benchmark set for evaluating the efficiency of property path queries. Our experimental results over Yago2, a large RDF-based knowledge base, show that our proposed approach is highly scalable and flexible.
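The "one index instance per distinct label" idea can be grounded with a small sketch (not RDF-3X/FERRARI code; names and the triple layout here are illustrative assumptions): a property path query such as `?x p+ ?y` reduces to reachability in the subgraph containing only edges labeled `p`, which a per-label index would answer without search.

```python
# Sketch: reachability restricted to one edge label, evaluated here by BFS.
# A per-label reachability index (as with FERRARI in the paper) would
# precompute and compress these answers instead of traversing at query time.

from collections import defaultdict, deque

def reachable_under_label(triples, label, src, dst):
    """triples: iterable of (subject, predicate, object).
    Returns True iff dst is reachable from src using only `label` edges."""
    adj = defaultdict(list)
    for s, p, o in triples:
        if p == label:
            adj[s].append(o)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```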
47 citations
18 Mar 2013
TL;DR: This work studies how n-gram statistics can be computed efficiently by harnessing MapReduce for distributed data processing, and describes different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes.
Abstract: Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently by harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
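The simplest method the abstract mentions, the "extension of word counting", can be sketched on a single machine (Suffix-σ itself, which sorts and aggregates suffixes, is more involved; the function name and parameters here are illustrative assumptions):

```python
# Sketch of the word-counting baseline for n-gram statistics: emit every
# n-gram up to a maximum length, count occurrences across the collection,
# and keep only those meeting a minimum collection frequency. In MapReduce,
# the inner loops would run in the mapper and the Counter would be the reducer.

from collections import Counter

def ngram_counts(documents, max_n=3, min_freq=2):
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for n in range(1, max_n + 1):               # bounded n-gram length
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # apply the minimum-collection-frequency restriction
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

The weakness this baseline exposes, and which motivates the suffix-based method, is that every token is emitted up to max_n times, inflating the data shuffled between mappers and reducers.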
31 citations
01 Jan 2013
TL;DR: A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query.
Abstract: We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
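To make the notion of temporal diversity concrete, here is a generic greedy re-ranking sketch in its spirit; this is not the paper's method, and the scoring rule, parameter `lam`, and all names are illustrative assumptions.

```python
# Sketch: greedily re-rank candidates, trading relevance against covering
# publication times not yet represented in the result list (an MMR-style
# heuristic, shown only to illustrate the idea of temporal diversification).

def temporally_diverse_topk(candidates, k, lam=0.5):
    """candidates: list of (doc_id, relevance, year).
    Greedily pick k documents, rewarding years not yet covered."""
    selected, covered_years = [], set()
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            _, rel, year = c
            novelty = 0.0 if year in covered_years else 1.0
            return lam * rel + (1 - lam) * novelty
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best[0])
        covered_years.add(best[2])
    return selected
```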
26 citations
19 Dec 2013
TL;DR: This short position paper outlines two different scenarios that result in slightly different formulations of the problem and pursues the idea of using entity-centric summarization as a way of closing the interpretability gap in relationship-finding results.
Abstract: With the availability of large entity-relationship graphs, finding the best relationship between entities is a problem that has attracted a lot of attention. Given two or more entities, the goal of most algorithms is to produce a graph structure of varying complexity (i.e., a simple path, a minimal weighted tree, a dense subgraph, etc.) as a way of characterizing the relationship between the given entities. However, no attention is paid to the interpretability of these results, i.e., the ability of humans to read them and comprehend the context in which these relationships exist. A key obstacle in this direction is the lack of necessary linguistic context and natural textual result formulations. We pursue the idea of using entity-centric summarization as a way of closing this gap. We aim to turn the resulting graph structures into one or more coherent textual snippets (or summaries) that can be easily read and interpreted. In this short position paper, we first outline two different scenarios that result in slightly different formulations of the problem. Based on preliminary experimental results, we discuss the challenges that are inherent in this setting.
10 citations
27 Oct 2013
TL;DR: The SkIt index structure is developed, which supports a wide range of label constraints on paths and returns an accurate estimation of the shortest path that satisfies the constraints.
Abstract: Shortest path querying is a fundamental graph problem which is computationally quite challenging when operating over massive-scale graphs. Recent results have addressed the problem of computing either exact or good approximate shortest path distances efficiently. Some of these techniques also quickly return the path corresponding to the estimated shortest path distance. However, none of these techniques work very well when we have additional constraints on the labels associated with the edges that constitute the path. In this paper, we develop the SkIt index structure, which supports a wide range of label constraints on paths and returns an accurate estimation of the shortest path that satisfies the constraints. We conduct experiments over graphs such as social networks and knowledge graphs that contain millions of nodes/edges, and show that the SkIt index is fast, accurate in the estimated distance, and has a high recall for paths that satisfy the constraints.
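The exact baseline that label-constrained shortest path queries reduce to can be sketched directly (this is not the SkIt index itself, which answers such queries approximately from precomputed sketches; the edge-tuple layout and names are illustrative assumptions):

```python
# Sketch of the exact baseline: Dijkstra restricted to edges whose labels
# are in the allowed set. An index like SkIt aims to avoid this per-query
# search on massive graphs.

import heapq

def constrained_shortest_path(edges, allowed_labels, src, dst):
    """edges: list of (u, v, label, weight).
    Returns the shortest distance from src to dst using only edges whose
    label is in allowed_labels, or None if dst is unreachable."""
    adj = {}
    for u, v, label, w in edges:
        if label in allowed_labels:          # drop edges violating the constraint
            adj.setdefault(u, []).append((v, w))
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float('inf')):    # stale heap entry
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None
```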
7 citations
01 Jan 2013
TL;DR: D-Hive is put forward, a system facilitating analytics over RDF-style (SPO) triples augmented with text and (validity/transaction) time, capable of addressing the functionality and scalability requirements that current solutions cannot meet.
Abstract: Although the problem of integrating IR and DB solutions is considered "old", the increasing importance of big data analytics and its formidable demands for both enriched functionality and scalable performance create the need to revisit the problem itself and to see possible solutions from a new perspective. Our goal is to develop a system that will make large corpora aware of entities and relationships (ER), addressing the challenges in searching and analyzing ER patterns in web data and social media. We put forward D-Hive, a system facilitating analytics over RDF-style (SPO) triples augmented with text and (validity/transaction) time, capable of addressing the functionality and scalability requirements that current solutions cannot meet. We consider various alternatives for the data modeling, storage, indexing, and query processing engines of D-Hive, paying attention to the challenges that must be met, which include i) scalable joint indexing of SPO-text-time tuples (quads, quints, octs, etc.), ii) efficient processing of complex queries that involve RDF star and path joins, filtering and grouping on text phrases, band joins over time, and more, as well as iii) optimizing the execution plans for such analytics.
1 citation
23 Oct 2013
TL;DR: This paper presents an intuitive and efficiently computable vertex centrality measure that captures the importance of a node with respect to the explanation of the relationship between the pair of query sets.
Abstract: Given two sets of entities, potentially the results of two queries on a knowledge graph like YAGO or DBpedia, characterizing the relationship between these sets in the form of important people, events, and organizations is an analytics task useful in many domains. In this paper, we present an intuitive and efficiently computable vertex centrality measure that captures the importance of a node with respect to the explanation of the relationship between the pair of query sets. Using a weighted link graph of entities contained in the English Wikipedia, we demonstrate the usefulness of the proposed measure.
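One simple instantiation of "importance of a node for explaining the relationship between two query sets" can be sketched as set-to-set betweenness: count, for each intermediate node, how many source-target pairs it lies on a shortest path between. This is an illustrative assumption, not the paper's measure, and it assumes an unweighted, undirected link graph.

```python
# Sketch: set-to-set betweenness on an undirected, unweighted graph.
# A node v lies on a shortest s-t path iff d(s, v) + d(v, t) == d(s, t).

from collections import deque, defaultdict

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def set_betweenness(adj, set_a, set_b):
    """For each node, count (s, t) pairs from set_a x set_b whose shortest
    paths it lies on (endpoints excluded)."""
    from_a = {s: bfs_dist(adj, s) for s in set_a}
    from_b = {t: bfs_dist(adj, t) for t in set_b}
    score = defaultdict(int)
    for s, ds in from_a.items():
        for t, dt in from_b.items():
            if t not in ds:            # t unreachable from s
                continue
            for v in ds:
                if v not in (s, t) and v in dt and ds[v] + dt[v] == ds[t]:
                    score[v] += 1
    return dict(score)
```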
1 citation
TL;DR: The results of a preliminary study show that current search engines are sensitive in their rankings to query formulation, highlighting the need for developing more robust ranking methods.
Abstract: Voice search is becoming a popular mode for interacting with search engines. As a result, research has gone into building better voice transcription engines, interfaces, and search engines that better handle the inherent verbosity of queries. However, when one considers its use by non-native speakers of English, another aspect that becomes important is the formulation of the query by users. In this paper, we present the results of a preliminary study that we conducted with non-native English speakers who formulated queries for given retrieval tasks. Our results show that current search engines are sensitive in their rankings to query formulation, and thus highlight the need for developing more robust ranking methods.