
Showing papers by "Nikos Mamoulis published in 2016"


Proceedings ArticleDOI
13 Aug 2016
TL;DR: This paper proposes the meta structure, a directed acyclic graph of object types connected by typed edges, to measure the proximity between objects, and develops three relevance measures based on it.
Abstract: A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures can be used in various applications, including entity resolution, recommendation, and information retrieval. Several studies have investigated the use of HIN information for relevance computation; however, most of them utilize only simple structures, such as paths, to measure the similarity between objects. In this paper, we propose to use the meta structure, a directed acyclic graph of object types with edge types connecting them, to measure the proximity between objects. The strength of the meta structure is that it can describe complex relationships between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We develop three relevance measures based on the meta structure. Due to the computational complexity of these measures, we further design an algorithm, together with supporting data structures, for their efficient evaluation. Our extensive experiments on YAGO and DBLP show that meta structure-based relevance is more effective than state-of-the-art approaches and can be efficiently computed.

182 citations
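Meta structures generalize meta paths (sequences of object types). As a minimal illustration of type-constrained relevance in a HIN, the following sketch computes a PathCount-style measure on a toy DBLP-like graph; the node names, types, and meta path are invented for the example, and the paper's meta structure measures are strictly more general than this.

```python
from collections import defaultdict

# Toy HIN: typed nodes and undirected edges (all names are illustrative).
node_type = {"p1": "Paper", "p2": "Paper", "a1": "Author", "a2": "Author"}
edges = [("p1", "a1"), ("p1", "a2"), ("p2", "a1"), ("p2", "a2")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def path_count(src, dst, meta_path):
    """Count instances of a meta path (a list of node types) from src to dst.

    PathCount is one of the simplest meta-path-based relevance measures;
    the paper's meta structures generalize such paths to DAGs of types.
    """
    frontier = {src: 1} if node_type[src] == meta_path[0] else {}
    for t in meta_path[1:]:
        nxt = defaultdict(int)
        for node, cnt in frontier.items():
            for nb in adj[node]:
                if node_type[nb] == t:
                    nxt[nb] += cnt
        frontier = nxt
    return frontier.get(dst, 0)

# Paper-Author-Paper: p1 and p2 share two authors -> 2 path instances.
print(path_count("p1", "p2", ["Paper", "Author", "Paper"]))  # 2
```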


Journal ArticleDOI
01 Jun 2016
TL;DR: This paper proposes AdPart, a distributed RDF system that starts faster than all existing systems, processes thousands of queries before other systems come online, and gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
Abstract: State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase, leading to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system, which addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, which distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support the fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems come online; and (3) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-second time.

87 citations
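The lightweight initial placement described in the abstract can be sketched in a few lines. The worker count, example triples, and the use of crc32 as the hash function are illustrative assumptions, not details taken from the paper:

```python
import zlib

def partition_by_subject(triples, num_workers):
    """Assign each (subject, predicate, object) triple to a worker by
    hashing its subject, as in AdPart's lightweight initial partitioning.
    All triples sharing a subject land on the same worker, so star joins
    on the subject run without inter-worker communication."""
    workers = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        wid = zlib.crc32(s.encode()) % num_workers  # deterministic hash
        workers[wid].append((s, p, o))
    return workers

triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "acme"),
    ("bob", "knows", "carol"),
]
workers = partition_by_subject(triples, 4)
# Both "alice" triples are co-located on one worker, so a subject star
# join on alice needs no data exchange between workers.
```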


Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper formulates the P2G problem and proposes probabilistic models that capture the preference of a group towards a package, incorporating factors such as user impact and package viability; it also investigates the issue of recommendation fairness.
Abstract: The success of recommender systems has made them the focus of a massive research effort in both industry and academia. Recent work has generalized recommendations to suggest packages of items to single users, or single items to groups of users. However, to the best of our knowledge, the interesting problem of recommending a package to a group of users (P2G) has not been studied to date. This is a problem with several practical applications, such as recommending vacation packages to tourist groups, entertainment packages to groups of friends, or sets of courses to groups of students. In this paper, we formulate the P2G problem, and we propose probabilistic models that capture the preference of a group towards a package, incorporating factors such as user impact and package viability. We also investigate the issue of recommendation fairness. This is a novel consideration that arises in our setting, where we require that no user is consistently slighted by the item selection in the package. We present aggregation algorithms for finding the best packages and compare our suggested models with baseline approaches stemming from previous work. The results show that our models find packages of high quality which consider all special requirements of P2G recommendation.

35 citations
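As a rough sketch of what scoring a package for a group involves: the paper's probabilistic models additionally capture user impact and package viability, while the aggregation, the ratings, and the fairness threshold below are invented for illustration.

```python
def package_score(ratings, group, package):
    """Average affinity of a group of users for a package of items.

    ratings[user][item] is a predicted preference in [0, 1]. This keeps
    only the basic aggregation; the paper's models are richer.
    """
    total = sum(ratings[u][i] for u in group for i in package)
    return total / (len(group) * len(package))

def is_fair(ratings, group, package, tau=0.5):
    """Toy fairness check: every user must like at least one package item
    at level tau, so no user is consistently slighted by the selection."""
    return all(max(ratings[u][i] for i in package) >= tau for u in group)

ratings = {
    "u1": {"museum": 0.9, "hike": 0.2, "concert": 0.6},
    "u2": {"museum": 0.3, "hike": 0.8, "concert": 0.7},
}
pkg = ["museum", "hike"]
print(round(package_score(ratings, ["u1", "u2"], pkg), 2))  # 0.55
print(is_fair(ratings, ["u1", "u2"], pkg))  # True: each user loves one item
```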


Proceedings ArticleDOI
26 Jun 2016
TL;DR: This work proposes and studies a novel location-based keyword search query on RDF data, designs a basic method for the processing of kSP queries, and proposes two pruning approaches and a data preprocessing technique to accelerate kSP retrieval.
Abstract: RDF data are traditionally accessed using structured query languages, such as SPARQL. However, this requires users to understand the language as well as the RDF schema. Keyword search on RDF data aims at relieving the user from these requirements; the user only inputs a set of keywords and the goal is to find small RDF subgraphs which contain all keywords. At the same time, popular RDF knowledge bases also include spatial semantics, which opens the road to location-based search operations. In this work, we propose and study a novel location-based keyword search query on RDF data. The objective of top-k relevant semantic places (kSP) retrieval is to find RDF subgraphs which contain the query keywords and are rooted at spatial entities close to the query location. The novelty of kSP queries is that they are location-aware and that they do not rely on the use of structured query languages. We design a basic method for the processing of kSP queries. To further accelerate kSP retrieval, two pruning approaches and a data preprocessing technique are proposed. Extensive empirical studies on two real datasets demonstrate the superior and robust performance of our proposals compared to the basic method.

27 citations
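To make the kSP objective concrete, here is a toy ranking that adds the distance to the query location and a "looseness" value standing in for how tightly the query keywords are covered in the subgraph around each place. The additive scoring function and all data are illustrative assumptions, not the paper's actual definition:

```python
import math

def ksp_rank(places, q_loc, k):
    """Rank candidate places by distance-to-query plus subgraph looseness.

    Each place is (name, (x, y), looseness), where a smaller looseness
    means the query keywords lie closer to the place's node in the RDF
    graph. The real kSP ranking is defined in the paper; this additive
    combination is used purely for illustration.
    """
    def score(place):
        name, loc, looseness = place
        return math.dist(loc, q_loc) + looseness
    return [p[0] for p in sorted(places, key=score)[:k]]

places = [
    ("cafe",   (1.0, 0.0), 1.5),  # near, but loosely matching keywords
    ("museum", (3.0, 0.0), 0.0),  # farther, keywords right at the node
    ("park",   (9.0, 0.0), 0.5),
]
print(ksp_rank(places, (0.0, 0.0), 2))  # ['cafe', 'museum']
```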


Proceedings ArticleDOI
Wenting Tu, David W. Cheung, Nikos Mamoulis, Min Yang, Ziyu Lu
07 Jul 2016
TL;DR: This paper improves investment recommendation by modeling and using the quality of each investment opinion, based on multiple categories of features generated from the author information, the opinion content, and the characteristics of the stocks to which the opinion refers.
Abstract: Investor social media, such as StockTwist, are gaining increasing popularity. These sites allow users to post their investing opinions and suggestions in the form of microblogs. Given the growth of the posted data, a significant and challenging research problem is how to utilize the personal wisdom and different viewpoints in these opinions to help investment. Previous work aggregates sentiments related to stocks and generates buy or hold recommendations for stocks obtaining favorable votes while suggesting sell or short actions for stocks with negative votes. However, considering the fact that there always exist unreasonable or misleading posts, sentiment aggregation should be improved to be robust to noise. In this paper, we improve investment recommendation by modeling and using the quality of each investment opinion. To model the quality of an opinion, we use multiple categories of features generated from the author information, opinion content and the characteristics of stocks to which the opinion refers. Then, we discuss how to perform investment recommendation (including opinion recommendation and portfolio recommendation) with predicted qualities of investor opinions. Experimental results on real datasets demonstrate effectiveness of our work in recommending high-quality opinions and generating profitable investment decisions.

25 citations


Journal ArticleDOI
TL;DR: This paper designs a location-aware keyword query suggestion framework built on a weighted keyword-document graph, which captures both the semantic relevance between keyword queries and the spatial distance between the resulting documents and the user location.
Abstract: Keyword suggestion in web search helps users to access relevant information without having to know how to precisely express their queries. Existing keyword suggestion techniques do not consider the locations of the users and the query results; i.e., the spatial proximity of a user to the retrieved results is not taken as a factor in the recommendation. However, the relevance of search results in many applications (e.g., location-based services) is known to be correlated with their spatial proximity to the query issuer. In this paper, we design a location-aware keyword query suggestion framework. We propose a weighted keyword-document graph, which captures both the semantic relevance between keyword queries and the spatial distance between the resulting documents and the user location. The graph is browsed in a random-walk-with-restart fashion, to select the keyword queries with the highest scores as suggestions. To make our framework scalable, we propose a partition-based approach that outperforms the baseline algorithm by up to an order of magnitude. The appropriateness of our framework and the performance of the algorithms are evaluated using real data.

24 citations
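The random-walk-with-restart browsing of the graph can be sketched with plain power iteration. The toy unweighted graph below is an assumption; the paper's keyword-document graph is weighted, with edge weights that also encode spatial distance:

```python
def rwr(adj, restart_node, alpha=0.15, iters=50):
    """Random walk with restart: with probability alpha jump back to the
    restart node, otherwise follow an outgoing edge uniformly at random.
    Returns (approximate) stationary scores; a higher score means a node
    is more relevant to the restart node."""
    nodes = list(adj)
    p = {n: 1.0 if n == restart_node else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha if n == restart_node else 0.0) for n in nodes}
        for n in nodes:
            share = (1 - alpha) * p[n] / len(adj[n])
            for nb in adj[n]:
                nxt[nb] += share
        p = nxt
    return p

# Toy keyword-document graph (names invented): queries link to the
# documents they retrieve, and documents link back to their queries.
adj = {
    "q:pizza": ["d:luigi", "d:mario"],
    "q:pasta": ["d:mario"],
    "d:luigi": ["q:pizza"],
    "d:mario": ["q:pizza", "q:pasta"],
}
scores = rwr(adj, "q:pizza")
# "q:pasta" gets a positive score via the shared document d:mario,
# so it would be suggested as a related query for "q:pizza".
```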


Journal ArticleDOI
TL;DR: The authors propose a framework that limits the prefix tree construction via an adaptive, cost-model-based methodology and partitions the objects of each collection, significantly reducing the space and time cost of the join as well as its maximum memory requirements.
Abstract: Given two collections of set objects $R$ and $S$, the $R \bowtie_{\subseteq} S$ set containment join returns all object pairs $(r, s) \in R \times S$ such that $r \subseteq s$. Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (PRETTI) builds an inverted index on the right-hand collection $S$ and a prefix tree on the left-hand collection $R$ that groups set objects with common prefixes and thus, avoids redundant processing. In this paper, we present a framework which improves PRETTI in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms PRETTI by a wide margin.

19 citations
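The inverted-index core that PRETTI and this framework build on can be sketched as follows; this is a brute-force illustration without the prefix tree, the cost model, or the partitioning that the paper contributes:

```python
from collections import defaultdict
from functools import reduce

def containment_join(R, S):
    """Set containment join: all pairs (i, j) with R[i] a subset of S[j],
    computed via an inverted index on S. This is the core of PRETTI; the
    paper additionally builds a prefix tree on R (adaptively limited by a
    cost model) and processes both collections in partitions."""
    inv = defaultdict(set)  # item -> ids of the sets in S that contain it
    for j, s in enumerate(S):
        for item in s:
            inv[item].add(j)
    out = []
    for i, r in enumerate(R):
        # Intersect the posting lists of r's items, smallest list first.
        lists = sorted((inv[item] for item in r), key=len)
        cands = reduce(set.intersection, lists) if lists else set(range(len(S)))
        out.extend((i, j) for j in sorted(cands))
    return out

R = [{1, 2}, {2, 5}]
S = [{1, 2, 3}, {2, 4, 5}, {1, 2, 5}]
print(containment_join(R, S))  # [(0, 0), (0, 2), (1, 1), (1, 2)]
```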


Journal ArticleDOI
TL;DR: In this article, a spatio-textual skyline (STS) query is proposed, in which the skyline points are selected not only based on their distances to a set of query locations, but also based on their relevance to a set of query keywords.
Abstract: We study the modeling and evaluation of a spatio-textual skyline (STS) query, in which the skyline points are selected not only based on their distances to a set of query locations, but also based on their relevance to a set of query keywords. STS is especially relevant to modern applications, where points of interest are typically augmented with textual descriptions. We investigate three models for integrating textual relevance into the spatial skyline. Among them, model STD, which combines spatial distance with textual relevance in a derived dimensional space, is found to be the most effective one. STD computes a skyline which not only satisfies the intent of STS, but also has a small and easy-to-interpret size. We propose an efficient algorithm for computing STD-based skylines, which operates on an IR-tree that indexes the data. The effectiveness of our STD model and the efficiency of the proposed algorithm are evaluated on real data sets.

15 citations
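The skyline (dominance) computation underlying STS can be illustrated on plain 2D points, where each tuple stands in for the paper's derived dimensions combining spatial distance and inverted textual relevance; all values below are made up:

```python
def skyline(points):
    """Compute the skyline (Pareto-optimal set) under 'smaller is better'
    on every dimension: a point survives unless some other point is at
    least as good everywhere and strictly better somewhere."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (distance to query, 1 - textual relevance); values are illustrative.
pts = [(1.0, 0.8), (2.0, 0.3), (3.0, 0.1), (2.5, 0.4), (4.0, 0.9)]
print(skyline(pts))  # [(1.0, 0.8), (2.0, 0.3), (3.0, 0.1)]
```

The STD model maps each point into such a derived space before taking the skyline, which is why its result stays small and easy to interpret.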


Journal ArticleDOI
01 Dec 2016
TL;DR: This paper investigates the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs, which consider not only each node's importance but also its pairwise relevance to the other nodes in the OS and the snippet.
Abstract: The abundance and ubiquity of graphs (e.g., online social networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitates the effective and efficient search over them. Given a set of keywords that can identify a data subject (DS), a recently proposed keyword search paradigm produces a set of object summaries (OSs) as results. An OS is a tree structure rooted at the DS node (i.e., a node containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. A size-l OS is a partial OS containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximize the total importance score may result in an uninformative size-l OS, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper, we investigate the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, besides the importance of each node, we also consider its pairwise relevance (similarity) to the other nodes in the OS and the snippet. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g., by asking DBLP authors (i.e., the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.

14 citations
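As a hypothetical stand-in for the DSize-l objective, here is an MMR-style greedy that trades each node's importance against its similarity to the nodes already chosen. The exact objective and algorithms are defined in the paper; all names and scores below are invented:

```python
def diverse_snippet(importance, sim, l, lam=0.5):
    """Greedily pick l nodes, balancing importance against redundancy.

    importance: node -> score; sim: (node, node) -> similarity in [0, 1].
    A node's marginal gain is its (weighted) importance minus its maximum
    similarity to any already-selected node.
    """
    chosen = []
    rest = set(importance)
    while rest and len(chosen) < l:
        def gain(n):
            redundancy = max((sim(n, c) for c in chosen), default=0.0)
            return lam * importance[n] - (1 - lam) * redundancy
        best = max(rest, key=gain)
        chosen.append(best)
        rest.remove(best)
    return chosen

imp = {"KDD'14 paper": 0.9, "KDD'15 paper": 0.85, "PhD student": 0.4}
# Toy similarity: the two KDD papers are near-duplicates of each other.
def sim(a, b):
    return 0.95 if {a, b} == {"KDD'14 paper", "KDD'15 paper"} else 0.0

# Picks the top paper, then the student instead of the redundant paper.
print(diverse_snippet(imp, sim, 2))
```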


Proceedings ArticleDOI
16 May 2016
TL;DR: This work shows how applications like company and friend recommendation could significantly benefit from incorporating social and spatial proximity, and develops highly scalable algorithms for its processing, and enhances them with elaborate optimizations.
Abstract: The diffusion of social networks introduces new challenges and opportunities for advanced services, especially so with their ongoing addition of location-based features. We show how applications like company and friend recommendation could significantly benefit from incorporating social and spatial proximity, and study a query type that captures these two-fold semantics. We develop highly scalable algorithms for its processing, and use real social network data to empirically verify their efficiency and efficacy.

13 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: This paper designs a location-aware keyword query suggestion framework built on a weighted keyword-document graph, which captures both the semantic relevance between keyword queries and the spatial distance between the resulting documents and the user location.
Abstract: Consider a user who has issued a keyword query to a search engine. We study the effective suggestion of alternative keyword queries to the user, which are semantically relevant to the original query and they have as results documents that correspond to objects near the user's location. For this purpose, we propose a weighted keyword-document graph which captures semantic and proximity relevance between queries and documents. Then, we use the graph to suggest queries that are near in terms of graph distance to the original queries. To make our framework scalable, we propose a partition-based approach that greatly outperforms the baseline algorithm.

Proceedings Article
01 Jan 2016
TL;DR: This paper addresses the problem of predicting the topic of a micro-review written by a user who visits a new venue; it tackles the short, fragmented nature of micro-reviews with pooling strategies and proposes novel probabilistic models based on Latent Dirichlet Allocation (LDA) for extracting the topics related to a user-venue pair.
Abstract: Location-based social sites, such as Foursquare or Yelp, are gaining increasing popularity. These sites allow users to check in at venues and leave a short commentary in the form of a micro-review. Micro-reviews are rich in content as they offer a distilled and concise account of user experience. In this paper we consider the problem of predicting the topic of a micro-review by a user who visits a new venue. Such a prediction can help users make informed decisions, and also help venue owners personalize users' experiences. However, topic modeling for micro-reviews is particularly difficult, due to their short and fragmented nature. We address this issue using pooling strategies, which aggregate micro-reviews at the venue or user level, and we propose novel probabilistic models based on Latent Dirichlet Allocation (LDA) for extracting the topics related to a user-venue pair. Our best topic model integrates influences from both venue-inherent properties and user preferences, considering at the same time the sentiment orientation of the users. Experimental results on real datasets demonstrate the superiority of this model compared to simpler models and previous work; they also show that venue-inherent properties have higher influence on the topics of micro-reviews.
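The pooling step can be sketched independently of the LDA models themselves; the check-in tuples and field layout below are invented for illustration:

```python
from collections import defaultdict

def pool_micro_reviews(checkins, by="venue"):
    """Pool short micro-reviews into longer pseudo-documents before topic
    modeling, since individual micro-reviews are too short and fragmented
    for LDA on their own. Each check-in is (user, venue, text); pooling
    at the venue or user level mirrors the pooling strategies described
    in the abstract (the LDA models themselves are not reproduced here).
    """
    key = 1 if by == "venue" else 0
    pools = defaultdict(list)
    for checkin in checkins:
        pools[checkin[key]].append(checkin[2])
    return {k: " ".join(texts) for k, texts in pools.items()}

checkins = [
    ("u1", "cafe", "great espresso"),
    ("u2", "cafe", "cosy place, slow wifi"),
    ("u1", "gym", "crowded at noon"),
]
docs = pool_micro_reviews(checkins, by="venue")
# docs["cafe"] concatenates both cafe micro-reviews into one document.
```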

Proceedings ArticleDOI
16 May 2016
TL;DR: This paper proposes a linear-time algorithm for determining the optimal selection range for an attribute, together with techniques for choosing and prioritizing the most promising selection predicates to apply.
Abstract: Given a database table with records that can be ranked, an interesting problem is to identify selection conditions, which are qualified by an input record and render its ranking as high as possible among the qualifying tuples. In this paper, we study this standing maximization problem, which finds application in object promotion and characterization. We propose greedy methods, which are experimentally shown to achieve high accuracy compared to exhaustive enumeration, while scaling very well to the problem size. Our contributions include a linear-time algorithm for determining the optimal selection range for an attribute and techniques for choosing and prioritizing the most promising selection predicates to apply. Experiments on real datasets confirm the effectiveness and efficiency of our techniques.
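To make the standing maximization objective concrete, here is a brute-force O(n^2) enumeration of candidate ranges on a single attribute. The paper's algorithm finds the optimal range in linear time; the "standing" definition and all data below are illustrative assumptions:

```python
def best_range(values, scores, t):
    """Find the selection range [lo, hi] on one attribute that maximizes
    the standing of target record t among the qualifying records.

    values[i] is the attribute value and scores[i] the ranking score of
    record i. Standing is defined here as the fraction of qualifying
    records that t outscores (higher is better). This exhaustive search
    only illustrates the objective; it is not the paper's algorithm.
    """
    n = len(values)
    best = (-1.0, None)
    cand = sorted(set(values))
    for lo in cand:
        for hi in cand:
            if not (lo <= values[t] <= hi):
                continue  # the range must qualify the target record
            qual = [i for i in range(n) if lo <= values[i] <= hi]
            standing = sum(scores[t] > scores[i] for i in qual) / len(qual)
            if standing > best[0]:
                best = (standing, (lo, hi))
    return best

# Record 2 (value 30, score 5) ranks best once the high scorers with
# smaller attribute values are excluded by the range.
values = [10, 20, 30, 40, 50]
scores = [9, 8, 5, 3, 1]
print(best_range(values, scores, 2))  # best standing 2/3 via range (30, 50)
```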