Showing papers by "Srikanta Bedathur published in 2013"
08 Apr 2013
TL;DR: A scalable and highly efficient index structure for the reachability problem over graphs that imposes an explicit bound on the size of the index and flexibly assigns approximate reachability ranges to nodes of the graph so that the number of index probes needed to answer a query is minimized.
Abstract: In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, in which the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes needed to answer a query is minimized. The resulting tunable index structure generates a better range labeling if the space budget is increased, thus providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered on the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets.
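The interval labeling scheme the abstract builds on can be illustrated with a minimal sketch. The code below handles only the tree case, where each node's reachable set is a single contiguous post-order range; the paper's contribution (bounded, approximate range sets for general graphs with recursive verification) is more involved, and all function names here are illustrative assumptions.

```python
# Minimal sketch of interval-based reachability on a tree: each node gets a
# post-order id and a [lo, hi] interval covering the ids of all nodes it can
# reach. A reachability query is then a single range-containment check.
# (The paper generalizes this to graphs with bounded, approximate range sets.)

def assign_postorder_intervals(adj, root):
    """Assign post-order ids and a single (lo, hi) interval per node."""
    order, intervals = {}, {}
    counter = 0

    def dfs(u):
        nonlocal counter
        lo = counter                      # first id assigned in u's subtree
        for v in adj.get(u, []):
            if v not in order:
                dfs(v)
        order[u] = counter                # u's own post-order id
        counter += 1
        intervals[u] = (lo, order[u])     # u reaches exactly ids in [lo, order[u]]

    dfs(root)
    return order, intervals

def reaches(u, v, order, intervals):
    """u reaches v iff v's post-order id falls inside u's interval."""
    lo, hi = intervals[u]
    return lo <= order[v] <= hi
```

With approximate ranges, as in the paper, a containment hit would not be definitive and would trigger recursive probes of the target's children.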
108 citations
23 Jun 2013
TL;DR: This paper presents a detailed account of integrating the recently proposed highly compact reachability index called FERRARI into the RDF-3X engine to support property path evaluation and develops a set of queries over real-world RDF data that can serve as a benchmark set for evaluating the efficiency of property path queries.
Abstract: As Semantic Web efforts continue to gather steam, RDF engines are faced with graphs with millions of nodes and billions of edges. While much recent work addressing the resulting scalability issues in processing queries over these datasets has mainly considered SPARQL 1.0, the next-generation query language recommendations have proposed the addition of regular-expression-restricted navigation queries to SPARQL. We address the problem of supporting efficient processing of property paths in RDF-3X, a high-performance RDF engine. In this paper, we focus on a restricted class of property paths that is not only tractable but also most commonly used: instead of enumerating all paths that satisfy the given query, we focus on regular-expression-based reachability queries. Based on this, we make the following three major technical contributions: first, we present a detailed account of integrating the recently proposed highly compact reachability index called FERRARI into the RDF-3X engine to support property path evaluation; second, we show how property path queries can be efficiently answered using multiple instances of this index, one instance for each distinct label in the graph; and finally, we develop a set of queries over real-world RDF data that can serve as a benchmark set for evaluating the efficiency of property path queries. Our experimental results over Yago2, a large RDF-based knowledge base, show that our proposed approach is highly scalable and flexible.
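The "one index instance per distinct label" idea can be grounded with a small sketch (not RDF-3X/FERRARI code; names and the triple layout here are illustrative assumptions): a property path query such as `?x p+ ?y` reduces to reachability in the subgraph containing only edges labeled `p`, which a per-label index would answer without search.

```python
# Sketch: reachability restricted to one edge label, evaluated here by BFS.
# A per-label reachability index (as with FERRARI in the paper) would
# precompute and compress these answers instead of traversing at query time.

from collections import defaultdict, deque

def reachable_under_label(triples, label, src, dst):
    """triples: iterable of (subject, predicate, object).
    Returns True iff dst is reachable from src using only `label` edges."""
    adj = defaultdict(list)
    for s, p, o in triples:
        if p == label:
            adj[s].append(o)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```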
47 citations
18 Mar 2013
TL;DR: This work studies how n-gram statistics can be computed efficiently by harnessing MapReduce for distributed data processing, and describes different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes.
Abstract: Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently by harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
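The simplest method the abstract mentions, the "extension of word counting", can be sketched on a single machine (Suffix-σ itself, which sorts and aggregates suffixes, is more involved; the function name and parameters here are illustrative assumptions):

```python
# Sketch of the word-counting baseline for n-gram statistics: emit every
# n-gram up to a maximum length, count occurrences across the collection,
# and keep only those meeting a minimum collection frequency. In MapReduce,
# the inner loops would run in the mapper and the Counter would be the reducer.

from collections import Counter

def ngram_counts(documents, max_n=3, min_freq=2):
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for n in range(1, max_n + 1):               # bounded n-gram length
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # apply the minimum-collection-frequency restriction
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

The weakness this baseline exposes, and which motivates the suffix-based method, is that every token is emitted up to max_n times, inflating the data shuffled between mappers and reducers.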
31 citations
01 Jan 2013
TL;DR: A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query.
Abstract: We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
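To make the notion of temporal diversity concrete, here is a generic greedy re-ranking sketch in its spirit; this is not the paper's method, and the scoring rule, parameter `lam`, and all names are illustrative assumptions.

```python
# Sketch: greedily re-rank candidates, trading relevance against covering
# publication times not yet represented in the result list (an MMR-style
# heuristic, shown only to illustrate the idea of temporal diversification).

def temporally_diverse_topk(candidates, k, lam=0.5):
    """candidates: list of (doc_id, relevance, year).
    Greedily pick k documents, rewarding years not yet covered."""
    selected, covered_years = [], set()
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            _, rel, year = c
            novelty = 0.0 if year in covered_years else 1.0
            return lam * rel + (1 - lam) * novelty
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best[0])
        covered_years.add(best[2])
    return selected
```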
26 citations
19 Dec 2013
TL;DR: This short position paper outlines two different scenarios that result in slightly different formulations of the problem and pursues the idea of using entity-centric summarization as a way of closing the interpretability gap in relationship-finding results.
Abstract: With the availability of large entity-relationship graphs, finding the best relationship between entities is a problem that has attracted a lot of attention. Given two or more entities, the goal of most algorithms is to produce a graph structure of varying complexity (i.e., a simple path, a minimal weighted tree, a dense subgraph, etc.) as a way of characterizing the relationship between the given entities. However, no attention is paid to the interpretability of these results, i.e., the ability of humans to read them and comprehend the context in which these relationships exist. A key obstacle in this direction is the lack of necessary linguistic context and natural textual result formulations. We pursue the idea of using entity-centric summarization as a way of closing this gap. We aim to turn the resulting graph structures into one or more coherent textual snippets (or summaries) that can be easily read and interpreted. In this short position paper, we first outline two different scenarios that result in slightly different formulations of the problem. Based on preliminary experimental results, we discuss the challenges that are inherent in this setting.
10 citations
27 Oct 2013
TL;DR: The SkIt index structure is developed, which supports a wide range of label constraints on paths and returns an accurate estimation of the shortest path that satisfies the constraints.
Abstract: Shortest path querying is a fundamental graph problem which is computationally quite challenging when operating over massive-scale graphs. Recent results have addressed the problem of computing either exact or good approximate shortest path distances efficiently. Some of these techniques also quickly return the path corresponding to the estimated shortest path distance. However, none of these techniques work very well when we have additional constraints on the labels associated with the edges that constitute the path. In this paper, we develop the SkIt index structure, which supports a wide range of label constraints on paths and returns an accurate estimation of the shortest path that satisfies the constraints. We conduct experiments over graphs such as social networks and knowledge graphs that contain millions of nodes/edges, and show that the SkIt index is fast, accurate in the estimated distance, and has a high recall for paths that satisfy the constraints.
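The exact baseline that label-constrained shortest path queries reduce to can be sketched directly (this is not the SkIt index itself, which answers such queries approximately from precomputed sketches; the edge-tuple layout and names are illustrative assumptions):

```python
# Sketch of the exact baseline: Dijkstra restricted to edges whose labels
# are in the allowed set. An index like SkIt aims to avoid this per-query
# search on massive graphs.

import heapq

def constrained_shortest_path(edges, allowed_labels, src, dst):
    """edges: list of (u, v, label, weight).
    Returns the shortest distance from src to dst using only edges whose
    label is in allowed_labels, or None if dst is unreachable."""
    adj = {}
    for u, v, label, w in edges:
        if label in allowed_labels:          # drop edges violating the constraint
            adj.setdefault(u, []).append((v, w))
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float('inf')):    # stale heap entry
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None
```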
7 citations
01 Jan 2013
TL;DR: D-Hive is put forward, a system facilitating analytics over RDF-style (SPO) triples augmented with text and (validity/transaction) time, capable of addressing the functionality and scalability requirements that current solutions cannot meet.
Abstract: Although the problem of integrating IR and DB solutions is considered "old", the increasing importance of big data analytics and its formidable demands for both enriched functionality and scalable performance create the need to revisit the problem itself and to see possible solutions from a new perspective. Our goal is to develop a system that will make large corpora aware of entities and relationships (ER), addressing the challenges in searching and analyzing ER patterns in web data and social media. We put forward D-Hive, a system facilitating analytics over RDF-style (SPO) triples augmented with text and (validity/transaction) time, capable of addressing the functionality and scalability requirements that current solutions cannot meet. We consider various alternatives for the data modeling, storage, indexing, and query processing engines of D-Hive, paying attention to the challenges that must be met, which include i) scalable joint indexing of SPO-text-time tuples (quads, quints, octs, etc.), ii) efficient processing of complex queries that involve RDF star and path joins, filtering and grouping on text phrases, band joins over time, and more, as well as iii) optimizing the execution plans for such analytics.
1 citation
23 Oct 2013
TL;DR: This paper presents an intuitive and efficiently computable vertex centrality measure that captures the importance of a node with respect to the explanation of the relationship between the pair of query sets.
Abstract: Given two sets of entities, potentially the results of two queries on a knowledge graph like YAGO or DBpedia, characterizing the relationship between these sets in the form of important people, events, and organizations is an analytics task useful in many domains. In this paper, we present an intuitive and efficiently computable vertex centrality measure that captures the importance of a node with respect to the explanation of the relationship between the pair of query sets. Using a weighted link graph of entities contained in the English Wikipedia, we demonstrate the usefulness of the proposed measure.
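One simple instantiation of "importance of a node for explaining the relationship between two query sets" can be sketched as set-to-set betweenness: count, for each intermediate node, how many source-target pairs it lies on a shortest path between. This is an illustrative assumption, not the paper's measure, and it assumes an unweighted, undirected link graph.

```python
# Sketch: set-to-set betweenness on an undirected, unweighted graph.
# A node v lies on a shortest s-t path iff d(s, v) + d(v, t) == d(s, t).

from collections import deque, defaultdict

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def set_betweenness(adj, set_a, set_b):
    """For each node, count (s, t) pairs from set_a x set_b whose shortest
    paths it lies on (endpoints excluded)."""
    from_a = {s: bfs_dist(adj, s) for s in set_a}
    from_b = {t: bfs_dist(adj, t) for t in set_b}
    score = defaultdict(int)
    for s, ds in from_a.items():
        for t, dt in from_b.items():
            if t not in ds:            # t unreachable from s
                continue
            for v in ds:
                if v not in (s, t) and v in dt and ds[v] + dt[v] == ds[t]:
                    score[v] += 1
    return dict(score)
```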
1 citation
TL;DR: The results of a preliminary study show that current search engines are sensitive in their rankings to query formulation, highlighting the need for developing more robust ranking methods.
Abstract: Voice search is becoming a popular mode for interacting with search engines. As a result, research has gone into building better voice transcription engines, interfaces, and search engines that better handle the inherent verbosity of queries. However, when one considers its use by non-native speakers of English, another aspect that becomes important is the formulation of the query by users. In this paper, we present the results of a preliminary study that we conducted with non-native English speakers who formulated queries for given retrieval tasks. Our results show that current search engines are sensitive in their rankings to query formulation, and thus highlight the need for developing more robust ranking methods.