
Showing papers by "Nikos Mamoulis published in 2015"


Proceedings ArticleDOI
16 Sep 2015
TL;DR: This paper proposes two alternative models that incorporate the overlapping community regularization into the matrix factorization framework and shows that these approaches outperform the state-of-the-art algorithms in both traditional and social-network based recommender systems regarding both cold-start users and normal users.
Abstract: Recommender systems have become de facto tools for suggesting items that are of potential interest to users. Predicting a user's rating on an item is the fundamental recommendation task. Traditional methods that generate predictions by analyzing the user-item rating matrix perform poorly when the matrix is sparse. Recent approaches use data from social networks to improve accuracy. However, most of the social-network based recommender systems only consider direct friendships and they are less effective when the targeted user has few social connections. In this paper, we propose two alternative models that incorporate the overlapping community regularization into the matrix factorization framework. Our empirical study on four real datasets shows that our approaches outperform the state-of-the-art algorithms in both traditional and social-network based recommender systems regarding both cold-start users and normal users.

73 citations
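
To make the idea concrete, below is a minimal, illustrative Python sketch of matrix factorization with a community-based regularization term that pulls members of the same (possibly overlapping) community toward a shared centroid. This is not the paper's two models; the update rule, function names, and toy data are assumptions for illustration only.

```python
import numpy as np

def train_mf_with_community_reg(R, communities, k=10, lr=0.01, reg=0.1, com_reg=0.1, epochs=50):
    """Matrix factorization with a simple community-based regularizer (illustrative sketch).

    R           : (n_users, n_items) rating matrix; 0 marks a missing rating.
    communities : list of user-index arrays; a user may appear in several (overlapping) communities.
    """
    n_users, n_items = R.shape
    P = 0.1 * np.random.randn(n_users, k)    # user latent factors
    Q = 0.1 * np.random.randn(n_items, k)    # item latent factors
    obs = np.argwhere(R > 0)                 # observed (user, item) pairs

    for _ in range(epochs):
        for u, i in obs:                     # SGD step on the squared rating error
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
        # pull each member's factors toward the centroid of every community it belongs to
        for members in communities:
            centroid = P[members].mean(axis=0)
            P[members] += lr * com_reg * (centroid - P[members])
    return P, Q

# toy usage
R = np.array([[5, 3, 0], [4, 0, 0], [0, 2, 5]], dtype=float)
P, Q = train_mf_with_community_reg(R, communities=[np.array([0, 1])], k=2)
print(np.round(P @ Q.T, 2))   # predicted ratings
```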


Proceedings ArticleDOI
27 May 2015
TL;DR: A generalized framework for fair reviewer assignment is proposed, which first extracts domain knowledge from the reviewers' published papers and models it as a set of topics; a greedy algorithm is then proposed that achieves a 1/2-approximation ratio compared to the exact solution.
Abstract: Peer reviewing is a standard process for assessing the quality of submissions at academic conferences and journals. A very important task in this process is the assignment of reviewers to papers. However, achieving an appropriate assignment is not easy, because all reviewers should have similar load and the subjects of the assigned papers should be consistent with the reviewers' expertise. In this paper, we propose a generalized framework for fair reviewer assignment. We first extract the domain knowledge from the reviewers' published papers and model this knowledge as a set of topics. Then, we perform a group assignment of reviewers to papers, which is a generalization of the classic Reviewer Assignment Problem (RAP), considering the relevance of the papers to topics as weights. We study a special case of the problem, where reviewers are to be found for just one paper (Journal Assignment Problem) and propose an exact algorithm which is fast in practice, as opposed to brute-force solutions. For the general case of having to assign multiple papers, which is too hard to be solved exactly, we propose a greedy algorithm that achieves a 1/2-approximation ratio compared to the exact solution. This is a great improvement compared to the 1/3-approximation solution proposed in previous work for the simpler coverage-based reviewer assignment problem, where there are no weights on topics. We theoretically prove the approximation bound of our solution and experimentally show that it is superior to the current state-of-the-art.

43 citations
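
The greedy principle described above can be illustrated with a small sketch that repeatedly assigns the (paper, reviewer) pair with the largest marginal gain in weighted topic coverage, subject to per-paper quotas and reviewer load limits. This is a simplified stand-in for the paper's algorithm, with assumed data structures and toy data; it only conveys the weighted-coverage objective and the greedy selection step.

```python
import numpy as np

def greedy_reviewer_assignment(paper_topics, reviewer_topics, per_paper=3, load=2):
    """Greedy weighted-coverage assignment (illustrative, not the paper's exact algorithm).

    paper_topics[p][t]    : weight of topic t in paper p
    reviewer_topics[r][t] : expertise of reviewer r on topic t
    Coverage of paper p by reviewer set S: sum_t paper_topics[p][t] * max_{r in S} reviewer_topics[r][t]
    """
    P, T = paper_topics.shape
    R, _ = reviewer_topics.shape
    assigned = {p: [] for p in range(P)}
    remaining_load = [load] * R
    covered = np.zeros((P, T))               # best expertise seen so far per (paper, topic)

    def gain(p, r):
        return float(np.sum(paper_topics[p] * np.maximum(reviewer_topics[r] - covered[p], 0.0)))

    while True:
        best = None
        for p in range(P):
            if len(assigned[p]) >= per_paper:
                continue
            for r in range(R):
                if remaining_load[r] == 0 or r in assigned[p]:
                    continue
                g = gain(p, r)
                if best is None or g > best[0]:
                    best = (g, p, r)
        if best is None or best[0] <= 0:     # no pair improves coverage any further
            break
        _, p, r = best
        assigned[p].append(r)
        remaining_load[r] -= 1
        covered[p] = np.maximum(covered[p], reviewer_topics[r])
    return assigned

papers = np.array([[0.7, 0.3, 0.0], [0.0, 0.5, 0.5]])
reviewers = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.2, 0.9]])
print(greedy_reviewer_assignment(papers, reviewers, per_paper=2, load=2))
```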


Journal ArticleDOI
01 Aug 2015
TL;DR: This paper proposes AdHash, a distributed RDF system that drastically reduces the startup cost through lightweight hash partitioning, favors the parallel processing of join patterns on subjects without any data communication, and gracefully adapts to the query load.
Abstract: Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication and it is performed in an initial, data pre-processing phase. Therefore, the resulting partitions are static and do not adapt to changes in the query workload; as a result, existing systems are unable to consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system which addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically minimizes the startup cost, while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds. In this demonstration, the audience can use the graphical interface of AdHash to verify its performance superiority compared to state-of-the-art distributed RDF systems.

38 citations
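
The lightweight initial step mentioned in the abstract, hashing triples on their subjects so that subject star joins become local, can be sketched as follows; the function and toy triples are illustrative, not AdHash's actual code.

```python
from collections import defaultdict
from zlib import crc32

def hash_partition_by_subject(triples, n_workers):
    """Assign each (subject, predicate, object) triple to a worker by hashing its subject.

    All triples sharing a subject land on the same worker, so star joins on that
    subject can be evaluated locally, without inter-worker communication.
    """
    partitions = defaultdict(list)
    for s, p, o in triples:
        worker = crc32(s.encode("utf-8")) % n_workers
        partitions[worker].append((s, p, o))
    return partitions

triples = [
    ("ex:alice", "ex:knows",   "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob",   "ex:knows",   "ex:carol"),
]
for w, part in sorted(hash_partition_by_subject(triples, n_workers=2).items()):
    print(w, part)
```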


Journal ArticleDOI
01 Aug 2015
TL;DR: This paper demonstrates the Reviewer Assignment System (RAS), which has advanced features compared to broadly used CMSs and includes a recently published assignment model by the authors' research group, which maximizes, for each paper, the coverage of its topics by the profiles of its assigned reviewers.
Abstract: Peer reviewing is a widely accepted mechanism for assessing the quality of submitted articles to scientific conferences or journals. Conference management systems (CMS) are used by conference organizers to invite appropriate reviewers and assign them to submitted papers. Typical CMS rely on paper bids entered by the reviewers and apply simple matching algorithms to compute the paper assignment. In this paper, we demonstrate our Reviewer Assignment System (RAS), which has advanced features compared to broadly used CMSs. First, RAS automatically extracts the profiles of reviewers and submissions in the form of topic vectors. These profiles can be used to automatically assign reviewers to papers without relying on a bidding process, which can be tedious and error-prone. Second, besides supporting classic assignment models (e.g., stable marriage and optimal assignment), RAS includes a recently published assignment model by our research group, which maximizes, for each paper, the coverage of its topics by the profiles of its reviewers. The features of the demonstration include (1) automatic extraction of paper and reviewer profiles, (2) assignment computation by different models, and (3) visualization of the results by different models, in order to assess their effectiveness.

31 citations


Proceedings Article
01 Jan 2015
TL;DR: A location recommendation framework that combines results from various recommenders considering different factors and estimates, for each individual user, the underlying influence of each factor on her.
Abstract: Location recommendation is an important feature of social network applications and location-based services. Most existing studies focus on developing one single method or model for all users. By analyzing data from two real location-based social networks (Foursquare and Gowalla), in this paper we reveal that the decisions of users on place visits depend on multiple factors, and different users may be affected differently by these factors. We design a location recommendation framework that combines results from various recommenders that consider different factors. Our framework estimates, for each individual user, the underlying influence of each factor on her. Based on the estimation, we aggregate suggestions from different recommenders to derive personalized recommendations. Experiments on Foursquare and Gowalla show that our proposed method outperforms the state-of-the-art methods on location recommendation.

31 citations
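
A hedged sketch of the aggregation idea: each factor-specific recommender produces scores for candidate locations, and a per-user weight vector (estimated elsewhere) combines them into one ranking. The names and toy weights are assumptions, not the paper's implementation.

```python
def personalized_aggregate(scores_by_recommender, user_weights):
    """Combine candidate-location scores from several recommenders with per-user weights.

    scores_by_recommender : dict  factor_name -> {location: score}
    user_weights          : dict  factor_name -> weight learned for this user
    Returns locations ranked by the weighted score.
    """
    combined = {}
    for factor, scores in scores_by_recommender.items():
        w = user_weights.get(factor, 0.0)
        for loc, s in scores.items():
            combined[loc] = combined.get(loc, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

scores = {
    "geographical": {"cafe": 0.8, "museum": 0.2},
    "social":       {"cafe": 0.3, "museum": 0.9},
}
# hypothetical per-user influence weights (e.g., estimated from the user's history)
print(personalized_aggregate(scores, {"geographical": 0.7, "social": 0.3}))
```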


Journal ArticleDOI
TL;DR: This work shows how applications like company and friend recommendation could significantly benefit from incorporating social and spatial proximity, studies a query type that captures these two-fold semantics, develops highly scalable algorithms for its processing, and enhances them with elaborate optimizations.
Abstract: The diffusion of social networks introduces new challenges and opportunities for advanced services, especially so with their ongoing addition of location-based features. We show how applications like company and friend recommendation could significantly benefit from incorporating social and spatial proximity, and study a query type that captures these two-fold semantics. We develop highly scalable algorithms for its processing, and enhance them with elaborate optimizations. Finally, we use real social network data to empirically verify the efficiency and efficacy of our solutions.

29 citations
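
One simple way to combine the two proximities is a weighted score over social similarity and spatial closeness, as in the following illustrative sketch; the paper's actual query definition and algorithms are more involved.

```python
import math

def geo_social_score(candidate, query_user, alpha=0.5):
    """Rank a candidate by a weighted mix of social and spatial proximity (illustrative only).

    candidate / query_user: dicts with 'friends' (set of ids) and 'loc' ((x, y) coordinates).
    Social proximity  : Jaccard similarity of friend sets.
    Spatial proximity : inverse of Euclidean distance.
    """
    common = len(candidate["friends"] & query_user["friends"])
    union = len(candidate["friends"] | query_user["friends"]) or 1
    social = common / union
    dx = candidate["loc"][0] - query_user["loc"][0]
    dy = candidate["loc"][1] - query_user["loc"][1]
    spatial = 1.0 / (1.0 + math.hypot(dx, dy))
    return alpha * social + (1 - alpha) * spatial

me = {"friends": {"a", "b", "c"}, "loc": (0.0, 0.0)}
candidates = {
    "u1": {"friends": {"a", "b"}, "loc": (1.0, 1.0)},
    "u2": {"friends": {"d"},      "loc": (0.1, 0.1)},
}
print(sorted(candidates, key=lambda u: geo_social_score(candidates[u], me), reverse=True))
```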


Proceedings ArticleDOI
27 May 2015
TL;DR: This paper investigates the effective and efficient generation of two novel types of OS snippets, i.e. diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs, which, apart from the importance of each node, also consider its frequency in the OS and its repetitions in the snippets.
Abstract: The abundance and ubiquity of graphs (e.g., Online Social Networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitate the effective and efficient search over them. Given a set of keywords that can identify a Data Subject (DS), a recently proposed relational keyword search paradigm produces, as a query result, a set of Object Summaries (OSs). An OS is a tree structure rooted at the DS node (i.e., a tuple containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. Size-l OSs are partial OSs containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximize the total importance score may result in an uninformative size-l OS, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e. diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, apart from the importance of each node, we also consider its frequency in the OS and its repetitions in the snippets. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g. by asking DBLP authors (i.e. the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.

24 citations
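
The diversity idea can be sketched as a greedy selection that discounts a node's importance by how many similar nodes have already been picked. This is only an illustration of the trade-off; the paper's DSize-l and PSize-l objectives and algorithms are more refined.

```python
def diverse_size_l(nodes, l):
    """Greedily pick l nodes, discounting importance by how often similar nodes were already picked.

    nodes: list of (node_id, label, importance). The discount is one simple way to trade
    importance for diversity; it is not the paper's exact objective.
    """
    picked, label_count = [], {}
    remaining = list(nodes)
    for _ in range(min(l, len(remaining))):
        best = max(remaining,
                   key=lambda n: n[2] / (1 + label_count.get(n[1], 0)))
        picked.append(best)
        label_count[best[1]] = label_count.get(best[1], 0) + 1
        remaining.remove(best)
    return picked

nodes = [(1, "paper", 0.9), (2, "paper", 0.85), (3, "coauthor", 0.6), (4, "venue", 0.5)]
print(diverse_size_l(nodes, l=3))   # a high-importance paper first, then other node types get a chance
```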


Journal ArticleDOI
01 Aug 2015
TL;DR: SPARTex is introduced, an RDF analytics framework based on the vertex-centric computation model that evaluates queries that combine SPARQL and generic graph computations orders of magnitude faster than existing RDF engines.
Abstract: A growing number of applications require combining SPARQL queries with generic graph search on RDF data. However, the lack of procedural capabilities in SPARQL makes it inappropriate for graph analytics. Moreover, RDF engines focus on SPARQL query evaluation whereas graph management frameworks perform only generic graph computations. In this work, we bridge the gap by introducing SPARTex, an RDF analytics framework based on the vertex-centric computation model. In SPARTex, user-defined vertex centric programs can be invoked from SPARQL as stored procedures. SPARTex allows the execution of a pipeline of graph algorithms without the need for multiple reads/writes of input data and intermediate results. We use a cost-based optimizer for minimizing the communication cost. SPARTex evaluates queries that combine SPARQL and generic graph computations orders of magnitude faster than existing RDF engines. We demonstrate a real system prototype of SPARTex running on a local cluster using real and synthetic datasets. SPARTex has a real-time graphical user interface that allows the participants to write regular SPARQL queries, use our proposed SPARQL extension to declaratively invoke graph algorithms or combine/pipeline both SPARQL querying and generic graph analytics.

22 citations
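
For readers unfamiliar with the vertex-centric model, the sketch below runs a Pregel-style superstep loop (here, a simplified PageRank) single-threaded; it is a conceptual illustration only, not SPARTex's distributed engine or its SPARQL stored-procedure interface.

```python
def vertex_centric_pagerank(edges, iterations=20, d=0.85):
    """A tiny synchronous vertex-centric loop (Pregel-style supersteps), run single-threaded
    for illustration; dangling-vertex handling is omitted for brevity."""
    vertices = {v for e in edges for v in e}
    out = {v: [] for v in vertices}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        # each vertex "sends" rank / out-degree along its out-edges
        msgs = {v: 0.0 for v in vertices}
        for v in vertices:
            if out[v]:
                share = rank[v] / len(out[v])
                for dst in out[v]:
                    msgs[dst] += share
        # each vertex updates its value from the received messages
        rank = {v: (1 - d) / len(vertices) + d * msgs[v] for v in vertices}
    return rank

edges = [("ex:alice", "ex:bob"), ("ex:bob", "ex:carol"), ("ex:carol", "ex:alice")]
print(vertex_centric_pagerank(edges, iterations=10))
```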


Book ChapterDOI
Wenting Tu1, David W. Cheung1, Nikos Mamoulis1, Min Yang1, Ziyu Lu1 
19 May 2015
TL;DR: The usefulness of suggesting activity partners together with items in recommender systems is identified and several methods for activity-partner recommendation are proposed and compared.
Abstract: In many activities, such as watching movies or having dinner, people prefer to find partners before participation. Therefore, when recommending activity items (e.g., movie tickets) to users, it makes sense to also recommend suitable activity partners. This way, (i) users save the time of finding activity partners, (ii) the effectiveness of the item recommendation is increased (users may prefer activity items more if they can find suitable activity partners), and (iii) recommender systems become more interesting and enkindle users’ social enthusiasm. In this paper, we identify the usefulness of suggesting activity partners together with items in recommender systems. In addition, we propose and compare several methods for activity-partner recommendation. Our study includes experiments that test the practical value of activity-partner recommendation and evaluate the effectiveness of all suggested methods as well as some alternative strategies.

22 citations
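
A minimal sketch of one plausible baseline: rank the user's friends by a mix of tie strength and the friend's own predicted interest in the activity item. This is an assumed strategy for illustration; the paper proposes and compares several methods.

```python
def recommend_partners(user, item, friendships, predicted_rating, top_n=3):
    """Suggest activity partners for (user, item): rank the user's friends by tie strength
    times the friend's predicted interest in the item (illustrative baseline only).

    friendships[user]      : dict friend -> tie strength in [0, 1]
    predicted_rating(u, i) : any rating predictor, e.g. from a recommender model
    """
    friends = friendships.get(user, {})
    scored = [(f, strength * predicted_rating(f, item)) for f, strength in friends.items()]
    scored.sort(key=lambda fs: fs[1], reverse=True)
    return scored[:top_n]

friendships = {"alice": {"bob": 0.9, "carol": 0.4, "dave": 0.7}}
ratings = {("bob", "movie_x"): 2.0, ("carol", "movie_x"): 4.5, ("dave", "movie_x"): 4.0}
print(recommend_partners("alice", "movie_x", friendships,
                         lambda u, i: ratings.get((u, i), 0.0)))
```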


Book ChapterDOI
26 Aug 2015
TL;DR: This paper proposes and studies the practical variant of bounded distance-based search, which takes into account the temporal characteristics of the searched trajectories, and shows that the range-based approach outperforms previous methods by at least one order of magnitude.
Abstract: Trajectory data capture the traveling history of moving objects such as people or vehicles. With the proliferation of GPS and tracking technology, huge volumes of trajectories are rapidly generated and collected. In this context, applications such as route recommendation and traveling behavior mining call for efficient trajectory retrieval. In this paper, we first focus on distance-based trajectory search; given a collection of trajectories and a set of query points, the goal is to retrieve the top-k trajectories that pass as close as possible to all query points. We advance the state-of-the-art by combining existing approaches into a hybrid method and also proposing an alternative, more efficient range-based approach. Second, we propose and study the practical variant of bounded distance-based search, which takes into account the temporal characteristics of the searched trajectories. Through an extensive experimental analysis with real trajectory data, we show that our range-based approach outperforms previous methods by at least one order of magnitude.

15 citations
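
The distance-based search objective can be illustrated with a brute-force sketch that ranks trajectories by the sum, over all query points, of the distance to the nearest trajectory point; the paper's hybrid and range-based methods compute the same ranking far more efficiently.

```python
import heapq
import math

def topk_trajectories(trajectories, query_points, k=2):
    """Brute-force top-k distance-based trajectory search (no indexing): rank trajectories by
    the sum, over all query points, of the distance to the nearest trajectory point."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    scored = []
    for tid, points in trajectories.items():
        cost = sum(min(dist(q, p) for p in points) for q in query_points)
        scored.append((cost, tid))
    return heapq.nsmallest(k, scored)          # smaller aggregate distance = better

trajectories = {
    "t1": [(0, 0), (1, 1), (2, 2)],
    "t2": [(5, 5), (6, 6)],
}
print(topk_trajectories(trajectories, query_points=[(0, 1), (2, 1)], k=1))
```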


Proceedings ArticleDOI
31 May 2015
TL;DR: This work introduces geo-social co-location mining, the problem of finding social groups that are frequently found at the same location, and proposes a probabilistic model to estimate the probability that a user is located at a given location at a given time, creating the notion of probabilistic co-locations.
Abstract: Modern technology for capturing geo-spatial information produces a huge flood of geo-spatial and geo-spatio-temporal data, accompanied by a new user mentality of voluntarily sharing information. This location information, enriched with social information, is a new source for discovering new and useful knowledge. This work introduces geo-social co-location mining, the problem of finding social groups that are frequently found at the same location. This problem has applications in social sciences, allowing researchers to study interactions between social groups and enabling social-link prediction. It can be divided into two sub-problems. The first sub-problem, finding spatial co-location instances, requires properly addressing the inherent uncertainty in geo-social network data, which is a consequence of generally very sparse check-in data, and thus very sparse trajectory information. For this purpose, we propose a probabilistic model to estimate the probability that a user is located at a given location at a given time, creating the notion of probabilistic co-locations. The second sub-problem of mining the resulting probabilistic co-location instances requires efficient methods for large databases having a high degree of uncertainty. Our approach solves this problem by extending solutions for probabilistic frequent itemset mining. Our experimental evaluation performed on real (but anonymized) geo-social network data shows the high efficiency of our approach, and its ability to find new social interactions.
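
A sketch of the probabilistic ingredient: assuming users' positions are independent, the probability that a whole group is co-located somewhere in a given snapshot is a sum of products of per-user location probabilities. The function and toy data below are illustrative only; the paper's mining step additionally extends probabilistic frequent itemset mining.

```python
from math import prod

def expected_colocations(group, location_prob, timestamps, locations):
    """Expected number of snapshots at which all members of `group` are at the same location,
    assuming users' positions are independent (a simplifying assumption for illustration).

    location_prob[(user, t)] : dict location -> probability that the user is there at time t
    """
    expected = 0.0
    for t in timestamps:
        # P(everyone at location l at time t) = product of the members' probabilities for l
        p_together = sum(
            prod(location_prob[(u, t)].get(l, 0.0) for u in group)
            for l in locations
        )
        expected += p_together
    return expected

location_prob = {
    ("alice", 1): {"cafe": 0.6, "gym": 0.4},
    ("bob",   1): {"cafe": 0.5, "gym": 0.5},
    ("alice", 2): {"cafe": 0.1, "gym": 0.9},
    ("bob",   2): {"cafe": 0.8, "gym": 0.2},
}
print(expected_colocations(("alice", "bob"), location_prob,
                           timestamps=[1, 2], locations=["cafe", "gym"]))
```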

Proceedings Article
25 Jan 2015
TL;DR: Preliminary work is presented on extracting reference time tags and integrating them into an opinion mining model, in order to improve the accuracy of future event prediction.
Abstract: Users commonly use Web 2.0 platforms to post their opinions and their predictions about future events (e.g., the movement of a stock). Therefore, opinion mining can be used as a tool for predicting future events. Previous work on opinion mining extracts from the text only the polarity of opinions as sentiment indicators. We observe that a typical opinion post also contains temporal references which can improve prediction. This short paper presents our preliminary work on extracting reference time tags and integrating them into an opinion mining model, in order to improve the accuracy of future event prediction. We conduct an experimental evaluation using a collection of microblogs posted by investors to demonstrate the effectiveness of our approach.
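
As a purely illustrative sketch of the kind of signal involved, the snippet below pairs a tiny word-list polarity score with regex-extracted future-time expressions; the lexicon, patterns, and scoring are assumptions and not the paper's model.

```python
import re

# hypothetical, minimal lexicon of future-time references; a real system would use a
# proper temporal tagger and a richer sentiment model
FUTURE_PATTERNS = re.compile(r"\b(tomorrow|next (week|month|quarter)|by (monday|friday)|soon)\b", re.I)
POSITIVE = {"bullish", "up", "buy", "rally"}
NEGATIVE = {"bearish", "down", "sell", "drop"}

def score_post(text):
    """Return (polarity, time_tags) for an opinion post: polarity from a tiny word list,
    plus any future-time expressions that could anchor the prediction horizon."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    polarity = len(words & POSITIVE) - len(words & NEGATIVE)
    time_tags = [m.group(0) for m in FUTURE_PATTERNS.finditer(text)]
    return polarity, time_tags

print(score_post("Feeling bullish on $ACME, expecting a rally by Friday"))
# -> (2, ['by Friday'])
```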

Journal ArticleDOI
TL;DR: A divide-and-conquer based framework is proposed, which outperforms a baseline approach in terms of not only execution time but also space complexity, and an approximation solution is studied, which provides a good trade-off between computation cost and quality of result.
Abstract: Creating a new product that dominates all its competitors is one of the main objectives in marketing. Nevertheless, this might not be feasible since in practice the development process is confined by some constraints, e.g., limited funding or low target selling price. We model these constraints by a constraint function, which determines the feasible characteristics of a new product. Given such a budget, our task is to decide the best possible features of the new product that maximize its profitability. In general, a product is marketable if it dominates a large set of existing products, while it is not dominated by many. Based on this, we define dominance relationship analysis and use it to measure the profitability of the new product. The decision problem is then modeled as a budget constrained optimization query (BOQ). Computing BOQ is challenging due to the exponential increase in the search space with dimensionality. We propose a divide-and-conquer based framework, which outperforms a baseline approach in terms of not only execution time but also space complexity. Based on the proposed framework, we further study an approximation solution, which provides a good trade-off between computation cost and quality of result.
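
The dominance-based profitability measure can be illustrated by brute force: count how many existing products a candidate dominates, minus how many dominate it. This only conveys the objective; the paper's divide-and-conquer framework avoids such exhaustive counting.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every dimension and strictly better on one
    (here, larger values are better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def profitability(candidate, existing):
    """A simple dominance-based profitability score for a candidate product: the number of
    existing products it dominates minus the number that dominate it. The BOQ searches the
    feasible candidates (under the budget constraint) for the one maximizing such a measure;
    this brute-force scoring is only meant to illustrate the objective."""
    dominated = sum(dominates(candidate, e) for e in existing)
    dominating = sum(dominates(e, candidate) for e in existing)
    return dominated - dominating

existing = [(3, 2), (1, 4), (2, 2)]
for cand in [(3, 3), (1, 1)]:          # hypothetical feasible candidates under some budget
    print(cand, profitability(cand, existing))
```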

Journal ArticleDOI
TL;DR: This paper examines the shapes of safe regions in the problem’s context, proposes feasible approximations for them, designs efficient algorithms for computing these safe regions, and studies a variant of the problem called the sum-optimal meeting point.
Abstract: In applications like social networking services and online games, multiple moving users which form a group may wish to be continuously notified about the best meeting point from their locations. A promising technique for reducing the communication frequency of the application server is to employ safe regions, which capture the validity of query results with respect to the users’ locations. Unfortunately, the safe regions in our problem exhibit characteristics such as irregular shapes and inter-dependencies, which render existing methods that compute a single safe region inapplicable to our problem. To tackle these challenges, we first examine the shapes of safe regions in our problem’s context and propose feasible approximations for them. We design efficient algorithms for computing these safe regions. We also study a variant of the problem called the sum-optimal meeting point and extend our solutions to solve this variant. Experiments with both real and synthetic data demonstrate the effectiveness of our proposal in terms of computational and communication costs.
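
For reference, the sum-optimal meeting point mentioned above is the point minimizing the total distance to the users; a standard way to approximate it is Weiszfeld iteration, sketched below. This is background illustration only; the paper's contribution concerns the safe regions around such results, not this computation.

```python
def sum_optimal_meeting_point(points, iters=100, eps=1e-9):
    """Weiszfeld iteration for the point minimizing the sum of Euclidean distances to the
    users' locations (the sum-optimal meeting point variant); a simple sketch, not the
    paper's algorithm."""
    x = sum(p[0] for p in points) / len(points)     # start at the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
            if d < eps:                              # estimate coincides with a user location
                return px, py
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        x, y = num_x / denom, num_y / denom
    return x, y

print(sum_optimal_meeting_point([(0, 0), (4, 0), (0, 4)]))
```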

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper proposes a novel similarity join approach, which is based on the dynamic decomposition of the tree objects into subgraphs, according to the similarity threshold, and shows that it outperforms the state-of-the-art methods by up to an order of magnitude.
Abstract: Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. The state-of-the-art similarity join methods compare simpler approximations of the objects (e.g., strings), in order to prune pairs that cannot be part of the similarity join result based on distance bounds derived by the approximations. In this paper, we propose a novel similarity join approach, which is based on the dynamic decomposition of the tree objects into subgraphs, according to the similarity threshold. Our technique avoids computing the exact distance between two tree objects, if the objects do not share at least one common subgraph. In order to scale up the join, the computed subgraphs are managed in a two-layer index. Our experimental results on real and synthetic data collections show that our approach outperforms the state-of-the-art methods by up to an order of magnitude.
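
The filtering idea can be sketched with a deliberately simple decomposition (parent-child label pairs): pairs of trees sharing no piece are pruned, and only the survivors would be handed to an exact tree edit distance computation. The decomposition and toy data below are assumptions; the paper's threshold-driven decomposition and two-layer index are more elaborate.

```python
from itertools import combinations

def decompose(tree):
    """Decompose a tree (nested tuples: (label, [children])) into simple pieces, here
    parent-child label pairs. This stands in for the paper's threshold-driven subgraph
    decomposition, which is more refined."""
    label, children = tree
    pieces = set()
    for child in children:
        pieces.add((label, child[0]))
        pieces |= decompose(child)
    return pieces

def candidate_pairs(trees):
    """Filtering step of a similarity join: keep only pairs of trees that share at least one
    decomposition piece; only these would be passed on to an exact tree edit distance check."""
    pieces = {tid: decompose(t) for tid, t in trees.items()}
    return [(a, b) for a, b in combinations(trees, 2) if pieces[a] & pieces[b]]

trees = {
    "x": ("article", [("title", []), ("author", [])]),
    "y": ("article", [("title", []), ("year", [])]),
    "z": ("book",    [("isbn", [])]),
}
print(candidate_pairs(trees))   # ('x', 'y') survives; pairs involving 'z' are pruned
```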

Proceedings Article
Wenting Tu1, David W. Cheung1, Nikos Mamoulis1, Min Yang1, Ziyu Lu1 
01 Oct 2015
TL;DR: This work proposes a news detection approach that does not rely on propagation patterns, together with a real-time sorting strategy that orders the detected news microblogs using a translational approach, and demonstrates its effectiveness on a large-scale microblogging dataset.
Abstract: Due to the increasing popularity of microblogging platforms (e.g., Twitter), detecting realtime news from microblogs (e.g., tweets) has recently drawn a lot of attention. Most of the previous work on this subject detect news by analyzing propagation patterns of microblogs. This approach has two limitations: (i) many non-news microblogs (e.g. marketing activities) have propagation patterns similar to news microblogs and therefore they can be falsely reported as news; (ii) using propagation patterns to identify news involves a time delay until the pattern is formed, therefore news are not detected in real time. We propose an alternative approach, which, motivated by the necessity of real-time detection of news, does not rely on propagation of posts. Moreover, we propose a real-time sorting strategy that orders the detected news microblogs using a translational approach. An experimental evaluation on a large-scale microblogging dataset demonstrates the effectiveness of our approach.

Posted Content
TL;DR: AdHash is proposed, a distributed RDF system that starts faster than all existing systems, processes thousands of queries before other systems become online, and gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds.
Abstract: Distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics aiming at minimizing inter-node communication during query evaluation. This requires an expensive data preprocessing phase, leading to high startup costs for very large RDF knowledge bases. A priori knowledge of the query workload has also been used to create partitions, which however are static and do not adapt to workload changes; hence, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning. In this paper, we propose AdHash, a distributed RDF system, which addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning to the initial data, which distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds.

Journal Article
TL;DR: A model for clustering geographic locations based on GeoSN data is presented; the paper also discusses how this model can be extended to consider temporal information from checkins, and how the accuracy of community detection approaches can be improved by taking into account the checkins of users in a GeoSN.
Abstract: The rapid growth of Geo-Social Networks (GeoSNs) provides a new and rich form of data. Users of GeoSNs can capture their geographic locations and share them with other users via an operation named checkin. Thus, GeoSNs can track the connections (and the time of these connections) of geographic data to their users. In addition, the users are organized in a social network, which can be extended to a heterogeneous network if the connections to places via checkins are also considered. The goal of this paper is to analyze the opportunities in clustering this rich form of data. We first present a model for clustering geographic locations, based on GeoSN data. Then, we discuss how this model can be extended to consider temporal information from checkins. Finally, we study how the accuracy of community detection approaches can be improved by taking into account the checkins of users in a GeoSN.
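
As a toy illustration of clustering locations from check-in data, the sketch below greedily groups locations whose visitor sets overlap (Jaccard similarity); it is a simple stand-in with assumed parameters, not the model presented in the paper.

```python
def cluster_locations(checkins, threshold=0.3):
    """Greedy single-pass clustering of locations by the overlap of their visitors
    (Jaccard similarity of check-in user sets); a deliberately simple illustration.

    checkins: dict location -> set of user ids that checked in there
    """
    clusters = []                                 # each cluster: (representative user set, [locations])
    for loc, users in checkins.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            rep = cluster[0]
            sim = len(users & rep) / len(users | rep)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((set(users), [loc]))
        else:
            best[0].update(users)
            best[1].append(loc)
    return [locs for _, locs in clusters]

checkins = {
    "cafe":   {"alice", "bob"},
    "bar":    {"alice", "bob", "carol"},
    "museum": {"dave"},
}
print(cluster_locations(checkins))   # e.g. [['cafe', 'bar'], ['museum']]
```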

Journal ArticleDOI
TL;DR: A linear-time algorithm for determining the optimal selection range for an ordinal attribute and techniques for choosing and prioritizing the most promising selection predicates to apply are proposed.
Abstract: Given a database table with records that can be ranked, an interesting problem is to identify selection conditions for the table, which are qualified by an input record and render its ranking as high as possible among the qualifying tuples. In this paper, we study this standing maximization problem, which finds application in object promotion and characterization. After showing the hardness of the problem, we propose greedy methods, which are experimentally shown to achieve high accuracy compared to exhaustive enumeration, while scaling very well to the problem input size. Our contributions include a linear-time algorithm for determining the optimal selection range for an ordinal attribute and techniques for choosing and prioritizing the most promising selection predicates to apply. Experiments on real datasets confirm the effectiveness and efficiency of our techniques.
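
The objective can be illustrated by brute force over all value ranges of one ordinal attribute, picking the range that maximizes the input record's standing among the qualifying tuples; note this is not the paper's linear-time algorithm, only an assumed illustration of what it optimizes.

```python
def best_range_for_standing(records, target, attr, rank_key):
    """Brute force over all value ranges of an ordinal attribute, returning the range that
    gives `target` the best standing (fraction of qualifying records it outranks)."""
    values = sorted({r[attr] for r in records} | {target[attr]})
    best = None
    for i, lo in enumerate(values):
        for hi in values[i:]:
            if not (lo <= target[attr] <= hi):
                continue
            qualifying = [r for r in records if lo <= r[attr] <= hi]
            outranked = sum(rank_key(target) > rank_key(r) for r in qualifying)
            standing = outranked / len(qualifying) if qualifying else 0.0
            if best is None or standing > best[0]:
                best = (standing, (lo, hi))
    return best

records = [
    {"year": 2010, "score": 50}, {"year": 2012, "score": 90},
    {"year": 2014, "score": 70}, {"year": 2015, "score": 60},
]
target = {"year": 2013, "score": 80}
print(best_range_for_standing(records, target, attr="year", rank_key=lambda r: r["score"]))
```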

Proceedings Article
25 Jan 2015
TL;DR: This work proposes a methodology that constructs a simulated microblogging corpus rather than directly building a model from the exterior corpus, and demonstrates the superiority of this technique compared to the previous approaches.
Abstract: A large-scale training corpus consisting of microblogs belonging to a desired category is important for high-accuracy microblog retrieval. Obtaining such a large-scale microblogging corpus manually is very time- and labor-consuming. Therefore, some models for the automatic retrieval of microblogs from an exterior corpus have been proposed. However, these approaches may fail to consider microblog-specific features. To alleviate this issue, we propose a methodology that constructs a simulated microblogging corpus rather than directly building a model from the exterior corpus. Our model performs better because microblog-specific knowledge from the microblogging corpus is ultimately used by the retrieval model. Experimental results on real-world microblogs demonstrate the superiority of our technique compared to the previous approaches.

01 Jan 2015
TL;DR: A novel evaluation paradigm for top-k joins, which aims at minimizing the computation cost without compromising the access cost, is proposed and evaluated by extensive experimentation on both real and synthetic data.
Abstract: Consider two collections of objects R and S, where each object is assigned a score (e.g., a rating). Given a join predicate and an integer k, a top-k join query returns the k pairs of objects which have the highest combined score (based on an aggregate scoring function) among all object pairs in R × S that qualify the join predicate. This query type has been extensively studied in the relational database context where the join predicate is equality, with the main goal of minimizing the number of tuples accessed from relations R and S. However, if the top-k join involves a non-equijoin predicate on complex data types, the computational cost can easily become the bottleneck of query evaluation. In view of this, we propose a novel evaluation paradigm for top-k joins, which aims at minimizing the computation cost, without compromising the access cost. The main idea behind our paradigm is to examine blocks of data from R and S ordered by the object scores; by performing the top-k join in a block-wise fashion, we avoid (i) building expensive indexes incrementally and (ii) comparing pairs of blocks that may not contain results (using appropriate bounds). We show how our paradigm can be applied for the cases of top-k spatial and string joins and conduct an analysis on how to derive the optimal block size for each case. Finally, we evaluate our proposal by extensive experimentation on both real and synthetic data.
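
A minimal sketch of the block-wise paradigm: both inputs are read in score order, block pairs are visited in decreasing order of an upper bound on their combined score, and evaluation stops once the current top-k can no longer be improved. Everything below (data layout, bound, toy join predicate) is an assumption for illustration, not the paper's implementation.

```python
import heapq

def blockwise_topk_join(R, S, k, join_pred, combine, block_size=2):
    """Block-wise top-k join sketch: R and S are lists of (object, score) sorted by score
    descending; block pairs are examined in order of an upper bound on their combined
    score, and processing stops once k results can no longer be improved."""
    blocks_R = [R[i:i + block_size] for i in range(0, len(R), block_size)]
    blocks_S = [S[i:i + block_size] for i in range(0, len(S), block_size)]
    # upper bound for a block pair: best score in each block, combined
    pairs = sorted(
        ((i, j) for i in range(len(blocks_R)) for j in range(len(blocks_S))),
        key=lambda ij: combine(blocks_R[ij[0]][0][1], blocks_S[ij[1]][0][1]),
        reverse=True,
    )
    topk = []                                     # min-heap of (combined score, r, s)
    for i, j in pairs:
        bound = combine(blocks_R[i][0][1], blocks_S[j][0][1])
        if len(topk) == k and bound <= topk[0][0]:
            break                                 # no remaining block pair can improve the result
        for r, rs in blocks_R[i]:
            for s, ss in blocks_S[j]:
                if join_pred(r, s):
                    item = (combine(rs, ss), r, s)
                    if len(topk) < k:
                        heapq.heappush(topk, item)
                    elif item[0] > topk[0][0]:
                        heapq.heapreplace(topk, item)
    return sorted(topk, reverse=True)

# toy example: a string join on a shared first letter, combining scores by summation
R = [("apple", 9), ("ant", 7), ("bee", 3)]
S = [("avocado", 8), ("bear", 6), ("axe", 1)]
print(blockwise_topk_join(R, S, k=2,
                          join_pred=lambda r, s: r[0] == s[0],
                          combine=lambda a, b: a + b))
```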