
Showing papers by "Srikanta Bedathur" published in 2017


Proceedings ArticleDOI
23 Apr 2017
TL;DR: This paper develops a joint model called LoCaTe, consisting of a user mobility model estimated using kernel density estimates, a model of the semantics of the location using topic models, and a model of the time gap between check-ins using an exponential distribution; LoCaTe significantly outperforms state-of-the-art models for the same task.
Abstract: Location-based social networks (LBSNs) such as Foursquare offer a platform for users to share and be aware of each other’s physical movements. As a result of sharing check-in information with each other, users can be influenced to visit the locations visited by their friends. Quantifying such influences in these LBSNs is useful in various settings such as location promotion, personalized recommendations, and mobility pattern prediction. In this paper, we focus on the problem of location promotion and develop a model to quantify the influence specific to a location between a pair of users. Specifically, we develop a joint model called LoCaTe, consisting of (i) a user mobility model estimated using kernel density estimates; (ii) a model of the semantics of the location using topic models; and (iii) a model of the time gap between check-ins using an exponential distribution. We validate our model on a long-term crawl of Foursquare data collected between Jan 2015 – Feb 2016, as well as on publicly available LBSN datasets. Our experiments demonstrate that LoCaTe significantly outperforms state-of-the-art models for the same task.
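A minimal sketch of how the three LoCaTe components might be combined into a single per-location influence score. The multiplicative combination, the toy check-in coordinates, the topic vectors, and the rate parameter below are all illustrative assumptions, not the paper's actual estimator:

```python
import numpy as np
from scipy.stats import gaussian_kde, expon

# Toy (lat, lon) check-ins for one user; values are made up.
checkins = np.array([[40.74, -73.99], [40.75, -73.98],
                     [40.73, -74.02], [40.76, -74.00]]).T  # shape (2, n)

# (i) User mobility: kernel density estimate over past check-in locations.
mobility_kde = gaussian_kde(checkins)

# (ii) Location semantics: cosine similarity between topic distributions
# (e.g., from a topic model over venue descriptions).
def topic_similarity(theta_user, theta_location):
    return float(np.dot(theta_user, theta_location) /
                 (np.linalg.norm(theta_user) * np.linalg.norm(theta_location)))

# (iii) Time gap between check-ins, modeled with an exponential distribution.
def time_gap_likelihood(gap_hours, rate=1 / 24.0):  # rate is an assumption
    return expon.pdf(gap_hours, scale=1 / rate)

# Combine the three components multiplicatively (an assumed combination,
# not necessarily the paper's formulation) into one influence score.
def locate_score(location, theta_user, theta_location, gap_hours):
    p_mobility = mobility_kde(np.asarray(location).reshape(2, 1))[0]
    return p_mobility * topic_similarity(theta_user, theta_location) \
                      * time_gap_likelihood(gap_hours)

print(locate_score([40.745, -73.985],
                   theta_user=np.array([0.6, 0.3, 0.1]),
                   theta_location=np.array([0.5, 0.4, 0.1]),
                   gap_hours=12.0))
```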

21 citations


Proceedings ArticleDOI
06 Nov 2017
TL;DR: This work proposes a framework based on provenance polynomials to track the impact of knowledge graph changes on arbitrary SPARQL query results, and shows how to efficiently determine the queries impacted by a change, incrementally maintain these polynomials, and implement the approach efficiently on top of RDF graph databases.
Abstract: Critical business applications in domains ranging from technical support to healthcare increasingly rely on large-scale, automatically constructed knowledge graphs. These applications use the results of complex queries over knowledge graphs in order to help users take crucial decisions, such as which drug to administer or whether certain actions are compliant with all the regulatory requirements. However, these knowledge graphs constantly evolve, and the newer versions may adversely impact the results of queries that previously taken business decisions were based on. We propose a framework based on provenance polynomials to track the impact of knowledge graph changes on arbitrary SPARQL query results. Focusing on the deletion of facts, we show how to efficiently determine the queries impacted by the change, develop ways to incrementally maintain these polynomials, and present an efficient implementation on top of RDF graph databases. Our experimental evaluation over large-scale RDF/SPARQL benchmarks shows the effectiveness of our proposal.
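The core bookkeeping can be illustrated with a small sketch: each query result carries a provenance polynomial over fact identifiers, deleting a fact corresponds to setting its variable to zero, and a result whose polynomial vanishes is no longer derivable. The sum-of-products encoding below is a simplified rendering for illustration, not the paper's implementation over RDF engines:

```python
# A provenance polynomial in sum-of-products form: each inner frozenset
# is one derivation (a product of facts); the outer set is the sum over
# alternative derivations of the same query result.
Polynomial = set  # of frozensets of fact IDs

def survives_deletion(poly: Polynomial, deleted: set) -> bool:
    """A result survives if at least one derivation uses no deleted fact,
    i.e. the polynomial is nonzero after setting deleted variables to 0."""
    return any(derivation.isdisjoint(deleted) for derivation in poly)

# A result derivable either via facts {f1, f2} or via {f3} alone.
result_poly: Polynomial = {frozenset({"f1", "f2"}), frozenset({"f3"})}

print(survives_deletion(result_poly, {"f1"}))        # True: {f3} still holds
print(survives_deletion(result_poly, {"f1", "f3"}))  # False: both derivations gone
```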

4 citations


Proceedings ArticleDOI
19 Apr 2017
TL;DR: This work introduces a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter and also allows us to reconstruct the set efficiently.
Abstract: In this paper, we address the problem of sampling from a set and reconstructing a set stored as a Bloom filter. To the best of our knowledge, our work is the first to address this question. We introduce a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter, and also allows us to reconstruct the set efficiently. In the case where the hash functions used in the Bloom filter implementation are partially invertible, in the sense that it is easy to calculate the set of elements that map to a particular hash value, we propose a second, more space-efficient method called HashInvert for the reconstruction. We study the properties of these two methods both analytically and experimentally. We provide bounds on the run times of both methods and on the sample quality of the BloomSampleTree-based algorithm, and show through an extensive experimental evaluation that our methods are efficient and effective.
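A toy rendering of the hierarchical-pruning idea behind sampling from a Bloom filter. It assumes elements are small integers and folds the hierarchy into a single filter of dyadic-range tags, whereas the actual BloomSampleTree indexes a pre-existing Bloom filter with a separate tree and comes with uniformity guarantees this sketch does not have:

```python
import hashlib
import random

class Bloom:
    """A minimal Bloom filter over hashable keys (illustrative parameters)."""
    def __init__(self, m=4096, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}|{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, key):
        for h in self._hashes(key):
            self.bits |= 1 << h
    def maybe_contains(self, key):
        return all(self.bits >> h & 1 for h in self._hashes(key))

UNIVERSE = 128  # elements are integers in [0, UNIVERSE); an assumption

def insert(bf, x):
    # Tag every dyadic range containing x, so the sampler can prune
    # whole subtrees that are (probably) empty.
    lo, hi = 0, UNIVERSE
    while True:
        bf.add((lo, hi))
        if hi - lo == 1:
            break
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if x < mid else (mid, hi)

def sample(bf, lo=0, hi=UNIVERSE):
    # Random descent over the virtual binary tree: recurse only into
    # halves whose range tag is (probably) present in the filter.
    if not bf.maybe_contains((lo, hi)):
        return None                    # subtree looks empty: prune
    if hi - lo == 1:
        return lo                      # leaf reached: emit the element
    mid = (lo + hi) // 2
    halves = [(lo, mid), (mid, hi)]
    random.shuffle(halves)             # randomize for (rough) uniformity
    for a, b in halves:
        found = sample(bf, a, b)
        if found is not None:
            return found
    return None

bf = Bloom()
for x in (7, 42, 99):
    insert(bf, x)
print(sample(bf))  # one of 7, 42, 99, up to Bloom false positives
```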

3 citations


Posted Content
TL;DR: Streak is an RDF data management system designed to support a wide range of queries with spatial filters, including complex joins, top-k queries, and higher-order relationships over spatially enriched databases, and it can scale to some of the largest publicly available semantic data resources, which contain spatial entities and quantifiable predicates useful for result ranking.
Abstract: The importance of geo-spatial data in critical applications such as emergency response, transportation, and agriculture has prompted the adoption of the recent GeoSPARQL standard in many RDF processing engines. In addition to large repositories of geo-spatial data -- e.g., LinkedGeoData, OpenStreetMap, etc. -- spatial data is also routinely found in automatically constructed knowledge bases such as Yago and WikiData. While there have been research efforts for efficient processing of spatial data in RDF/SPARQL, very little effort has gone into building end-to-end systems that can holistically handle complex SPARQL queries along with spatial filters. In this paper, we present Streak, an RDF data management system designed to support a wide range of queries with spatial filters, including complex joins, top-k queries, and higher-order relationships over spatially enriched databases. Streak introduces various novel features such as a careful identifier encoding strategy for spatial and non-spatial entities, the use of a semantics-aware Quad-tree index that allows for early termination, and a clever use of adaptive query processing with zero plan-switch cost. We show that Streak can scale to some of the largest publicly available semantic data resources, such as Yago3 and LinkedGeoData, which contain spatial entities and quantifiable predicates useful for result ranking. For experimental evaluations, we focus on top-k distance join queries and demonstrate that Streak outperforms popular spatial join algorithms as well as state-of-the-art end-to-end systems like Virtuoso and PostgreSQL.
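To make the target query class concrete, here is a deliberately naive top-k distance join: scan all pairs and keep the k closest in a bounded max-heap. Streak's contribution is answering exactly this kind of query without the exhaustive scan, via its identifier encoding, semantics-aware Quad-tree, and adaptive processing; the function and data below are illustrative only:

```python
import heapq
import math

def top_k_distance_join(left, right, k):
    """Return the k closest (distance, left_id, right_id) pairs by
    Euclidean distance, using a nested-loop scan and a size-k max-heap."""
    heap = []  # max-heap simulated by negating distances
    for lid, (lx, ly) in left.items():
        for rid, (rx, ry) in right.items():
            d = math.hypot(lx - rx, ly - ry)
            if len(heap) < k:
                heapq.heappush(heap, (-d, lid, rid))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, lid, rid))  # evict current worst
    return sorted((-nd, lid, rid) for nd, lid, rid in heap)

# Hypothetical spatial entities as (x, y) coordinates.
cities = {"c1": (77.2, 28.6), "c2": (72.9, 19.1)}
rivers = {"r1": (77.1, 28.7), "r2": (88.4, 22.6)}
print(top_k_distance_join(cities, rivers, k=2))
```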

1 citation


Posted Content
TL;DR: DataVizard is a system that automatically recommends the most appropriate visual presentation for structured data, covering both the results of structured queries such as SQL and data tables with an associated short description (e.g., tables from the Web).
Abstract: Selecting the appropriate visual presentation of data, such that it preserves the semantics of the underlying data and at the same time provides an intuitive summary, is an important, and often the final, step of data analytics. Unfortunately, this is also a step involving significant human effort, from selecting groups of columns in the structured results of analytics stages to selecting the right visualization by experimenting with various alternatives. In this paper, we describe our DataVizard system, aimed at reducing this overhead by automatically recommending the most appropriate visual presentation for the structured result. Specifically, we consider the following two scenarios: first, when one needs to visualize the results of a structured query such as SQL; and second, when one has acquired a data table with an associated short description (e.g., tables from the Web). Using a corpus of real-world database queries (and their results) and a number of statistical tables crawled from the Web, we show that DataVizard is capable of recommending visual presentations with high accuracy. We also present the results of a user survey that we conducted in order to assess user views of the suitability of the presented charts vis-a-vis the plain text captions of the data.
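For intuition about what a chart recommender decides, here is a tiny hand-written heuristic that maps column types to chart types. DataVizard's actual recommendations are derived from real query results and Web tables rather than hard-coded rules like these; the type labels and rules below are assumptions for illustration:

```python
def recommend_chart(columns):
    """Pick a chart type from (name, type) pairs, where type is one of
    'temporal', 'categorical', or 'numeric'. Purely illustrative rules."""
    types = [col_type for _, col_type in columns]
    if "temporal" in types and "numeric" in types:
        return "line chart"    # a measure trending over time
    if types.count("numeric") >= 2:
        return "scatter plot"  # numeric vs. numeric relationship
    if "categorical" in types and "numeric" in types:
        return "bar chart"     # a measure per category
    return "table"             # fall back to plain tabular display

print(recommend_chart([("month", "temporal"), ("sales", "numeric")]))    # line chart
print(recommend_chart([("country", "categorical"), ("gdp", "numeric")])) # bar chart
```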

1 citation


Journal ArticleDOI
TL;DR: This work introduces a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter and also allows us to reconstruct the set efficiently.
Abstract: In this paper, we address the problem of sampling from a set and reconstructing a set stored as a Bloom filter. To the best of our knowledge, our work is the first to address this question. We introduce a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter, and also allows us to reconstruct the set efficiently. In the case where the hash functions used in the Bloom filter implementation are partially invertible, in the sense that it is easy to calculate the set of elements that map to a particular hash value, we propose a second, more space-efficient method called HashInvert for the reconstruction. We study the properties of these two methods both analytically and experimentally. We provide bounds on the run times of both methods and on the sample quality of the BloomSampleTree-based algorithm, and show through an extensive experimental evaluation that our methods are efficient and effective.

Dissertation
01 Dec 2017
TL;DR: This thesis develops an RDF database named RQ-RDF-3X for efficiently querying RDF graphs containing annotations over native RDF triples, and proposes indexing and query processing techniques for making top-k querying efficient.
Abstract: RDF data management has received a lot of attention in the past decade due to the widespread growth of the Semantic Web and Linked Open Data initiatives. RDF data is expressed in the form of triples (Subject Predicate Object), with SPARQL used for querying it. Many novel database systems such as RDF-3X, TripleBit, etc. – storing RDF in its native form or within traditional relational storage – have demonstrated their ability to scale to large volumes of RDF content. However, it is increasingly becoming obvious from the knowledge representation applications of RDF that it is equally important to integrate additional information with RDF triples, such as the source, the time and place of occurrence, uncertainty, etc. Consider the RDF fact (BarackObama, isPresidentOf, UnitedStates). While this fact is useful for finding information regarding the president of the United States, it does not provide sufficient information for answering many challenging questions, such as: what is the temporal validity of this fact? where did this fact come from? Annotations like confidence, geolocation, time, etc. can be modeled in RDF through a technique called reification, which is also a W3C recommendation. Reification retains the triple nature of RDF and associates annotations using blank nodes. The focus of this thesis is on the database aspects of storing and querying RDF graphs containing annotations such as confidence on RDF triples. In this thesis, we start by developing an RDF database, named RQ-RDF-3X, for efficiently querying these RDF graphs containing annotations over native RDF triples. Next, we observed that more than 62% of the facts in real-world RDF datasets like YAGO, DBpedia, etc. have numerical object values, suggesting the use of queries containing an ORDER-BY clause on top of traditional SPARQL graph pattern queries. State-of-the-art RDF processing systems such as Virtuoso, Jena, etc. handle such queries by first collecting the results and then sorting them in-memory based on the user-specified function, which makes them not very scalable. In order to efficiently retrieve the results of top-k queries, i.e., queries returning the top-k results ordered by a user-defined scoring function, we developed a top-k query processing database named Quark-X. In Quark-X, we propose indexing and query processing techniques for making top-k querying efficient. Motivated by the importance of geo-spatial data in critical applications such as emergency response, transportation, agriculture, etc. In addition to its widespread use in knowl-
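To make the reification step concrete, the sketch below expands one annotated fact into plain triples via a statement node, following the W3C reification vocabulary, and shows the kind of ORDER-BY/LIMIT query that Quark-X targets. The annotation property names and the query text are illustrative assumptions, not taken from the thesis:

```python
import itertools

_counter = itertools.count()

def reify(subject, predicate, obj, annotations):
    """Expand one annotated fact into standard RDF triples using a
    blank statement node, per the W3C reification vocabulary."""
    stmt = f"_:stmt{next(_counter)}"
    triples = [
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", subject),
        (stmt, "rdf:predicate", predicate),
        (stmt, "rdf:object", obj),
    ]
    # Annotations (confidence, time, geolocation, ...) attach to the
    # statement node; the property names here are hypothetical.
    triples += [(stmt, prop, value) for prop, value in annotations.items()]
    return triples

for t in reify("BarackObama", "isPresidentOf", "UnitedStates",
               {"ex:confidence": "0.98", "ex:validFrom": "2009-01-20"}):
    print(t)

# The kind of top-k query Quark-X accelerates: a graph pattern ordered by
# a numerical object value, with a LIMIT (illustrative query text).
TOP_K_QUERY = """
SELECT ?person ?height WHERE {
  ?person rdf:type yago:Politician .
  ?person yago:hasHeight ?height .
} ORDER BY DESC(?height) LIMIT 10
"""
```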