
Showing papers by "Wang-Chien Lee published in 2014"


Journal ArticleDOI
TL;DR: A new framework, community-based influence maximization (CIM), is developed to tackle the influence maximization problem with an emphasis on the time efficiency issue; experiments show that CIM significantly outperforms the state-of-the-art algorithms in terms of efficiency and scalability, with almost no compromise of effectiveness.
Abstract: Given a social graph, the problem of influence maximization is to determine a set of nodes that maximizes the spread of influences. While some recent research has studied the problem of influence maximization, these works are generally too time consuming for practical use in a large-scale social network. In this article, we develop a new framework, community-based influence maximization (CIM), to tackle the influence maximization problem with an emphasis on the time efficiency issue. Our proposed framework, CIM, comprises three phases: (i) community detection, (ii) candidate generation, and (iii) seed selection. Specifically, phase (i) discovers the community structure of the network; phase (ii) uses the information of communities to narrow down the possible seed candidates; and phase (iii) finalizes the seed nodes from the candidate set. By exploiting the properties of the community structures, we are able to avoid overlapping information and thus efficiently select seeds that maximize the information spread. The experimental results on both synthetic and real datasets show that the proposed CIM algorithm significantly outperforms the state-of-the-art algorithms in terms of efficiency and scalability, with almost no compromise of effectiveness.
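The abstract's three-phase pipeline can be illustrated concretely. Below is a minimal Python sketch of the CIM flow; the one-pass label propagation, the one-candidate-per-community rule, and the degree-based seed ranking are all simplified stand-ins, not the paper's actual algorithms.

```python
# Hedged sketch of the three CIM phases on an adjacency-set graph.
from collections import defaultdict

def detect_communities(graph):
    # Placeholder for phase (i): one pass of label propagation.
    labels = {v: v for v in graph}
    for v in graph:
        counts = defaultdict(int)
        for u in graph[v]:
            counts[labels[u]] += 1
        if counts:
            labels[v] = max(counts, key=counts.get)
    communities = defaultdict(set)
    for v, lab in labels.items():
        communities[lab].add(v)
    return list(communities.values())

def cim(graph, k):
    communities = detect_communities(graph)          # phase (i)
    candidates = [max(c, key=lambda v: len(graph[v]))  # phase (ii): one
                  for c in communities]                # candidate per community
    # Phase (iii): degree as a crude proxy for marginal influence spread.
    return sorted(candidates, key=lambda v: len(graph[v]), reverse=True)[:k]

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(cim(graph, 2))
```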

136 citations


Journal ArticleDOI
TL;DR: This article proposes a novel mining-based location prediction approach called Geographic-Temporal-Semantic-based Location Prediction (GTS-LP), which takes into account a user's geographic-triggered intentions, temporal-triggered intentions, and semantic-triggered intentions to estimate the probability of the user visiting a location.
Abstract: In recent years, research on location predictions by mining trajectories of users has attracted a lot of attention. Existing studies on this topic mostly treat such predictions as just a type of location recommendation, that is, they predict the next location of a user using location recommenders. However, a user usually visits somewhere for reasons other than interestingness. In this article, we propose a novel mining-based location prediction approach called Geographic-Temporal-Semantic-based Location Prediction (GTS-LP), which takes into account a user's geographic-triggered intentions, temporal-triggered intentions, and semantic-triggered intentions, to estimate the probability of the user visiting a location. The core idea underlying our proposal is the discovery of trajectory patterns of users, namely GTS patterns, to capture frequent movements triggered by the three kinds of intentions. To achieve this goal, we define a new trajectory pattern to capture the key properties of the behaviors that are motivated by the three kinds of intentions from trajectories of users. In our GTS-LP approach, we propose a series of novel matching strategies to calculate the similarity between the current movement of a user and discovered GTS patterns based on various moving intentions. Based on this similarity, we make an online prediction as to the location the user intends to visit. To the best of our knowledge, this is the first work on location prediction based on trajectory pattern mining that explores the geographic, temporal, and semantic properties simultaneously. By means of a comprehensive evaluation using various real trajectory datasets, we show that our proposed GTS-LP approach delivers excellent performance and significantly outperforms existing state-of-the-art location prediction methods.
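As a rough illustration of the matching idea, the sketch below scores a candidate GTS pattern against a user's current movement by blending geographic, temporal, and semantic similarity. The three similarity functions and the weights are illustrative assumptions; the paper's actual matching strategies are more elaborate.

```python
# Hedged sketch: blend geo/temporal/semantic similarity to pick the
# best-matching mined pattern and predict its next location.
import math

def geo_sim(p, q, scale=1.0):
    # Closer points -> similarity nearer 1 (exponential distance decay).
    return math.exp(-math.dist(p, q) / scale)

def temp_sim(h1, h2):
    # Circular distance between hours of day, normalized to [0, 1].
    d = min(abs(h1 - h2), 24 - abs(h1 - h2))
    return 1.0 - d / 12.0

def sem_sim(cat1, cat2):
    return 1.0 if cat1 == cat2 else 0.0

def pattern_score(cur, pat, w=(0.4, 0.3, 0.3)):
    return (w[0] * geo_sim(cur["loc"], pat["loc"])
            + w[1] * temp_sim(cur["hour"], pat["hour"])
            + w[2] * sem_sim(cur["category"], pat["category"]))

current = {"loc": (0.0, 0.0), "hour": 19, "category": "restaurant"}
patterns = [
    {"loc": (0.1, 0.2), "hour": 20, "category": "restaurant", "next": "cinema"},
    {"loc": (3.0, 4.0), "hour": 9, "category": "office", "next": "gym"},
]
best = max(patterns, key=lambda p: pattern_score(current, p))
print(best["next"])  # predicted next location
```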

128 citations


Proceedings ArticleDOI
14 Dec 2014
TL;DR: A unified framework is proposed, called PGT, that considers personal, global, and temporal factors to measure the strength of the relationship between two given mobile users and significantly outperforms the state-of-the-art methods.
Abstract: Rich location data of mobile users collected from smart phones and location-based social networking services enable us to measure the mobility relationship strength based on their interactions in the physical world. A commonly used measure for such a relationship is the frequency of meeting events (i.e., co-locating at the same time). That is, the more frequently two persons meet, the stronger their mobility relationship is. However, we argue that not all the meeting events are equally important in measuring the mobility relationship and propose to consider personal and global factors to differentiate meeting events. The personal factor models the probability for an individual user to visit a certain location, whereas the global factor models the popularity of a location based on the behavior of the general public. In addition, we introduce the temporal factor to further consider the time gaps between meeting events. Accordingly, we propose a unified framework, called PGT, that considers personal, global, and temporal factors to measure the strength of the relationship between two given mobile users. Extensive experiments on real datasets validate our ideas and show that our method significantly outperforms the state-of-the-art methods.
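The weighting idea can be sketched as follows: a meeting event contributes more to relationship strength when the place is rarely visited by either user (personal factor) and unpopular overall (global factor), and meetings that follow shortly after a previous one are discounted (temporal factor). The exact formulas below are illustrative assumptions, not the PGT model itself.

```python
# Hedged sketch of PGT-style weighted meeting events.
import math

def meeting_weight(p_u, p_v, popularity):
    personal = -math.log(p_u * p_v)    # rarer visits -> larger weight
    global_f = -math.log(popularity)   # less popular place -> larger weight
    return personal + global_f

def relationship_strength(meetings, half_life=24.0):
    # meetings: time-sorted list of (time_in_hours, p_u, p_v, popularity).
    strength, last_t = 0.0, None
    for t, p_u, p_v, pop in meetings:
        w = meeting_weight(p_u, p_v, pop)
        if last_t is not None:
            # Temporal factor: discount meetings close in time.
            w *= 1.0 - math.exp(-(t - last_t) / half_life)
        strength += w
        last_t = t
    return strength

meetings = [(0, 0.01, 0.02, 0.001), (2, 0.01, 0.02, 0.001), (100, 0.3, 0.4, 0.5)]
print(relationship_strength(meetings))
```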

75 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Experimental results show that the proposed algorithms can produce good quality summaries and scale well with increasing data sizes, and this is the first work to study distributed graph summarization methods.
Abstract: Graph has been a ubiquitous and essential data representation to model real world objects and their relationships. Today, large amounts of graph data have been generated by various applications. Graph summarization techniques are crucial in uncovering useful insights about the patterns hidden in the underlying data. However, all existing works in graph summarization are single-process solutions, and as a result cannot scale to large graphs. In this paper, we introduce three distributed graph summarization algorithms to address this problem. Experimental results show that the proposed algorithms can produce good quality summaries and scale well with increasing data sizes. To the best of our knowledge, this is the first work to study distributed graph summarization methods.

41 citations


Proceedings ArticleDOI
11 Aug 2014
TL;DR: This paper makes the B+-tree PCM-friendly by reducing write accesses and proposes three schemes that efficiently improve performance, reduce memory energy consumption, and extend the lifetime of PCM memory.
Abstract: Phase change memory (PCM) is a promising technology for building future large-scale and low-power main memory systems. Main memory databases (MMDBs) can benefit from the high density of PCM. However, its long write latency, high write energy, and limited lifetime bring challenges to database algorithm design for PCM-based memory systems. In this paper, we focus on making the B+-tree PCM-friendly by reducing the write accesses to PCM. We propose three different schemes. Experimental results show that they can efficiently improve the performance, reduce the memory energy consumption, and extend the lifetime of PCM memory.
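The abstract does not detail the three schemes, but a well-known write-reduction idea in this line of work is to keep leaf entries unsorted, so an insert appends a single slot instead of rewriting sorted ones. The sketch below illustrates that idea only, under the assumption that each appended slot costs one PCM write; it is not a reconstruction of the paper's schemes.

```python
# Hedged sketch: unsorted B+-tree leaf trading cheap PCM reads
# (linear scan) for fewer PCM writes (no slot shifting on insert).
class UnsortedLeaf:
    def __init__(self, capacity=8):
        self.slots = []          # (key, value) pairs in arrival order
        self.capacity = capacity
        self.pcm_writes = 0

    def insert(self, key, value):
        if len(self.slots) >= self.capacity:
            raise OverflowError("leaf full; split not sketched")
        self.slots.append((key, value))  # one slot write, no shifting
        self.pcm_writes += 1

    def search(self, key):
        # Reads are fast on PCM; a linear scan replaces binary search.
        for k, v in self.slots:
            if k == key:
                return v
        return None

leaf = UnsortedLeaf()
for k in [42, 7, 19]:
    leaf.insert(k, str(k))
print(leaf.search(7), leaf.pcm_writes)  # '7' 3 (sorted slots would cost more writes)
```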

38 citations


Journal ArticleDOI
TL;DR: A prefetching-based approach is developed that enables clients to compute new LASQ results locally during movement, without frequently contacting the server for query re-evaluation; a basic Merkle Skyline R-tree method and a novel Partial S4-tree method are also proposed to authenticate one-shot LASQs.
Abstract: With the ever-increasing use of smartphones and tablet devices, location-based services (LBSs) have experienced explosive growth in the past few years. To scale up services, there has been a rising trend of outsourcing data management to Cloud service providers, which provide query services to clients on behalf of data owners. However, in this data-outsourcing model, the service provider can be untrustworthy or compromised, thereby returning incorrect or incomplete query results to clients, intentionally or not. Therefore, empowering clients to authenticate query results is imperative for outsourced databases. In this paper, we study the authentication problem for location-based arbitrary-subspace skyline queries (LASQs), which represent an important class of LBS applications. We propose a basic Merkle Skyline R-tree method and a novel Partial S4-tree method to authenticate one-shot LASQs. For the authentication of continuous LASQs, we develop a prefetching-based approach that enables clients to compute new LASQ results locally during movement, without frequently contacting the server for query re-evaluation. Experimental results demonstrate the efficiency of our proposed methods and algorithms under various system settings.
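At the core of such authentication is Merkle-style verification: the server returns results together with sibling hashes (a verification object), and the client recomputes the root digest and checks it against the owner-signed one. The sketch below shows only this generic verification step; the skyline-specific Merkle Skyline R-tree and Partial S4-tree layouts are omitted.

```python
# Hedged sketch of Merkle-path verification on the client side.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(leaf_data: bytes, proof, signed_root: bytes) -> bool:
    # proof: list of (sibling_hash, sibling_is_left) pairs, leaf to root.
    digest = h(leaf_data)
    for sibling, is_left in proof:
        digest = h(sibling + digest) if is_left else h(digest + sibling)
    return digest == signed_root

# Tiny two-leaf tree: root = h(h(a) + h(b)).
a, b = b"object-a", b"object-b"
root = h(h(a) + h(b))
print(verify(a, [(h(b), False)], root))            # True: result authentic
print(verify(b"tampered", [(h(b), False)], root))  # False: detected
```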

36 citations


Journal ArticleDOI
TL;DR: This paper presents a novel key design based on an R+-tree (KR+-index) for retrieving skewed spatial data efficiently and shows that the KR+-index outperforms the state-of-the-art methods.

35 citations


Book ChapterDOI
13 May 2014
TL;DR: A novel algorithm, namely, Correlation Pattern Miner (CoPMiner), is developed to capture the usage patterns and correlations among appliances probabilistically and can reduce the search space effectively and efficiently.
Abstract: With the advent of sensor technology, the usage data of appliances in a house can be logged and collected easily today. However, it is a challenge for the residents to visualize how these appliances are used. Thus, mining algorithms are much needed to discover appliance usage patterns. Most previous studies on usage pattern discovery mainly focus on analyzing the patterns of a single appliance rather than mining the usage correlation among appliances. In this paper, a novel algorithm, namely, Correlation Pattern Miner (CoPMiner), is developed to capture the usage patterns and correlations among appliances probabilistically. With several new optimization techniques, CoPMiner can reduce the search space effectively and efficiently. Furthermore, the proposed algorithm is applied on a real-world dataset to show the practicability of correlation pattern mining.

34 citations


Book ChapterDOI
13 May 2014
TL;DR: This work proposes a trajectory recommendation framework and develops three recommendation methods, namely Activity-Based Recommendation (ABR), GPS-Based Recommendation (GBR), and Hybrid Recommendation; the hybrid solution turns out to deliver the best performance.
Abstract: The wide use of GPS sensors in smart phones encourages people to record their personal trajectories and share them with others on the Internet. A recommendation service is needed to help people process the large quantity of trajectories and select potentially interesting ones. GPS trace data is a new format of information, and few works focus on building user preference profiles from it. In this work we propose a trajectory recommendation framework and develop three recommendation methods, namely, Activity-Based Recommendation (ABR), GPS-Based Recommendation (GBR), and Hybrid Recommendation. The ABR recommends trajectories purely relying on activity tags. For GBR, we propose a generative model to construct user profiles based on GPS traces. The Hybrid Recommendation combines the ABR and GBR. We finally conducted extensive experiments to evaluate the proposed solutions, and it turned out that the hybrid solution delivers the best performance.
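A minimal sketch of the hybrid step: blend an activity-tag score (ABR) with a GPS-profile score (GBR) per candidate trajectory. The Jaccard tag overlap, the assumed [0, 1] GBR probability, and the mixing weight alpha are illustrative assumptions.

```python
# Hedged sketch of ABR/GBR blending for trajectory ranking.
def abr_score(user_tags, traj_tags):
    # Jaccard overlap between the user's preferred activity tags
    # and the trajectory's tags.
    inter = len(user_tags & traj_tags)
    union = len(user_tags | traj_tags) or 1
    return inter / union

def hybrid_score(user_tags, traj_tags, gbr, alpha=0.5):
    # gbr: probability of the trajectory under the user's GPS-trace
    # profile (e.g., from a generative model), assumed in [0, 1].
    return alpha * abr_score(user_tags, traj_tags) + (1 - alpha) * gbr

user = {"hiking", "photography"}
candidates = [({"hiking", "camping"}, 0.7), ({"shopping"}, 0.9)]
ranked = sorted(candidates,
                key=lambda c: hybrid_score(user, c[0], c[1]),
                reverse=True)
print(ranked[0])
```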

20 citations


Book ChapterDOI
13 May 2014
TL;DR: This paper categorizes twitter accounts into two types, namely Personal Communication Account (PCA) and Public Dissemination Account (PDA), and develops probabilistic models based on these features to identify PDAs.
Abstract: There are millions of accounts in Twitter. In this paper, we categorize Twitter accounts into two types, namely Personal Communication Accounts (PCAs) and Public Dissemination Accounts (PDAs). PCAs are accounts operated by individuals and are used to express that individual's thoughts and feelings. PDAs, on the other hand, refer to accounts owned by non-individuals such as companies, governments, etc. Generally, tweets from a PDA (i) disseminate a specific type of information (e.g., job openings, shopping deals, car accidents) rather than sharing an individual's personal life; and (ii) may be produced by non-human entities (e.g., bots). We aim to develop techniques for identifying PDAs so as to (i) help social scientists reduce "noise" in their study of human behaviors, and (ii) index them for potential recommendation to users looking for specific types of information. Through analysis, we find these two types of accounts follow different temporal, spatial, and textual patterns. Accordingly, we develop probabilistic models based on these features to identify PDAs. We also conduct a series of experiments to evaluate those algorithms for cleaning the Twitter data stream.
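A naive-Bayes-style combination of such features might look like the sketch below, where an account is flagged as a PDA when the accumulated log-odds turn positive. The features, likelihoods, and prior are illustrative assumptions rather than the paper's fitted models.

```python
# Hedged sketch: log-odds classification of PDA vs. PCA from
# temporal, textual, and spatial indicator features.
import math

def log_odds_pda(features, likelihoods, prior_pda=0.2):
    # likelihoods[f] = (P(f | PDA), P(f | PCA)) for each observed feature.
    score = math.log(prior_pda / (1 - prior_pda))
    for f in features:
        p_pda, p_pca = likelihoods[f]
        score += math.log(p_pda / p_pca)
    return score

likelihoods = {
    "posts_around_the_clock": (0.7, 0.1),   # temporal: bots tweet 24/7
    "single_topic_vocabulary": (0.6, 0.2),  # textual: narrow topic feed
    "geo_spread_many_cities": (0.5, 0.1),   # spatial: city-wide deal feeds
}
account = ["posts_around_the_clock", "single_topic_vocabulary"]
print("PDA" if log_odds_pda(account, likelihoods) > 0 else "PCA")
```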

15 citations


Posted Content
TL;DR: A new family of geo-social group queries with minimum acquaintance constraint (GSGQs) is proposed, which are more appealing than existing geo-social group queries in terms of producing a cohesive group that guarantees a worst-case acquaintance level.
Abstract: The prosperity of location-based social networking services enables geo-social group queries for group-based activity planning and marketing. This paper proposes a new family of geo-social group queries with minimum acquaintance constraint (GSGQs), which are more appealing than existing geo-social group queries in terms of producing a cohesive group that guarantees the worst-case acquaintance level. GSGQs, also specified with various spatial constraints, are more complex than conventional spatial queries; particularly, those with a strict kNN spatial constraint are proved to be NP-hard. For efficient processing of general GSGQ queries on large location-based social networks, we devise two social-aware index structures, namely SaR-tree and SaR*-tree. The latter features a novel clustering technique that considers both spatial and social factors. Based on SaR-tree and SaR*-tree, efficient algorithms are developed to process various GSGQs. Extensive experiments on real-world Gowalla and Dianping datasets show that our proposed methods substantially outperform the baseline algorithms based on R-tree.
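The minimum acquaintance constraint can be sketched directly: every member of the returned group must know at least c other members, which amounts to taking a c-core of the candidate set by repeatedly deleting under-connected members. The sketch below shows only this constraint check; the SaR-tree/SaR*-tree index machinery is omitted.

```python
# Hedged sketch: enforce the minimum acquaintance constraint by
# iteratively removing members who know fewer than c others.
def acquaintance_core(candidates, friends, c):
    group = set(candidates)
    changed = True
    while changed:
        changed = False
        for v in list(group):
            if len(friends[v] & group) < c:
                group.remove(v)   # v knows too few members; drop it
                changed = True
    return group

friends = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
print(acquaintance_core({1, 2, 3, 4, 5}, friends, c=2))  # {1, 2, 3}
```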

Proceedings ArticleDOI
03 Nov 2014
TL;DR: Four DoK models are proposed and integrated with three SRI methods under the proposed Expert Ranking (ER) framework to rank the candidate expert collaborators based on their likelihood of collaborating in response to a query formulated by the social network of a query initiator and certain required skills to a project/task.
Abstract: We consider the expert recommendation problem for open collaborative projects in large-scale Open Source Software (OSS) communities. In a large-scale online community, recommending expert collaborators to a project coordinator or lead developer faces two prominent challenges: (i) the "cold shoulder" problem, which is the lack of interest from the experts to collaborate and share their skills, and (ii) the "cold start" problem, which is an issue with community members who have a scarce data history. In this paper, we consider the Degree of Knowledge (DoK), which captures the knowledge-of-skills factor, and the Social Relative Importance (SRI), which captures the social-distance factor, to tackle the aforementioned challenges. We propose four DoK models and integrate them with three SRI methods under our proposed Expert Ranking (ER) framework to rank candidate expert collaborators based on their likelihood of collaborating in response to a query formulated from the social network of a query initiator and certain skills required for a project/task. We evaluate our proposal using a dataset collected from Github.com, one of the fastest-growing large-scale online OSS communities. In addition, we test the models under different data scarcity levels. The experiments show promising results of recommending expert collaborators who tend to make real collaborations on projects.
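A hedged sketch of the ranking step: each candidate's score combines a Degree-of-Knowledge match against the required skills with a Social Relative Importance factor that decays with social distance from the query initiator. Both component functions below are illustrative assumptions, not the paper's four DoK models or three SRI methods.

```python
# Hedged sketch of DoK x SRI expert scoring.
def dok(candidate_skills, required_skills):
    # Average proficiency (0..1) over the required skills.
    if not required_skills:
        return 0.0
    return sum(candidate_skills.get(s, 0.0)
               for s in required_skills) / len(required_skills)

def sri(social_distance, decay=0.5):
    # Closer in the initiator's network -> higher importance.
    return decay ** social_distance

def expert_score(candidate_skills, required_skills, social_distance):
    return dok(candidate_skills, required_skills) * sri(social_distance)

required = {"python", "databases"}
candidates = {
    "alice": ({"python": 0.9, "databases": 0.8}, 1),  # direct collaborator
    "bob":   ({"python": 1.0, "databases": 1.0}, 3),  # distant expert
}
ranking = sorted(candidates,
                 key=lambda n: expert_score(candidates[n][0], required,
                                            candidates[n][1]),
                 reverse=True)
print(ranking)  # ['alice', 'bob'] -- the nearby, capable expert ranks first
```

Note how the social factor lets a nearby, adequately skilled collaborator outrank a distant top expert, which is one plausible way to mitigate the "cold shoulder" problem described above.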

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper analyzes the social interactions of users and investigates the development of their social ties using the data trail of 'how social ties grow' left in mobile and social networking services, and develops a Social-aware Hidden Markov Model (SaHMM) that explicitly takes into account the factor of common friends in measuring social tie development.
Abstract: Understanding social tie development among users is crucial for user engagement in social networking services. In this paper, we analyze the social interactions, both online and offline, of users and investigate the development of their social ties using the data trail of 'how social ties grow' left in mobile and social networking services. To the best of our knowledge, this is the first research attempt at studying social tie development by considering both online and offline interactions in a heterogeneous yet realistic relationship. In this study, we aim to answer three key questions: (i) is there a correlation between online and offline interactions? (ii) how is the social tie developed via heterogeneous interaction channels? and (iii) would the development of a social tie between two users be affected by their common friends? To achieve our goal, we develop a Social-aware Hidden Markov Model (SaHMM) that explicitly takes into account the factor of common friends in measuring social tie development. Our experiments show that, compared with results obtained using an HMM and other heuristic methods, the social tie development captured by our SaHMM is significantly more consistent with the lifetime profiles of users.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: It is argued that patent citations can either be technological citations that indicate knowledge transfer or be legal citations that delimit the legal scope of citing patents, and a probabilistic citation network based algorithm and a prediction model for patent valuation are proposed.
Abstract: Effective patent valuation is important for patent holders. Forward patent citations, widely used in assessing patent value, have been considered as reflecting knowledge flows, just like paper citations. However, patent citations also carry legal implication, which is important for patent valuation. We argue that patent citations can either be technological citations that indicate knowledge transfer or be legal citations that delimit the legal scope of citing patents. In this paper, we first develop citation-network based methods to infer patent quality measures at either the legal or technological dimension. Then we propose a probabilistic mixture approach to incorporate both the legal and technological dimensions in patent citations, and an iterative learning process that integrates a temporal decay function on legal citations, a probabilistic citation network based algorithm and a prediction model for patent valuation. We learn all the parameters together and use them for patent valuation. We demonstrate the effectiveness of our approach by using patent maintenance status as an indicator of patent value and discuss the insights we learned from this study.
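The mixture view can be sketched as follows: each forward citation contributes a technological component and a legal component, with the legal component decayed by citation age. The mixture weight and the exponential decay form below are illustrative assumptions; the paper learns its parameters jointly through an iterative process.

```python
# Hedged sketch: decay-weighted mixture of legal and technological
# citation contributions to a patent's value score.
import math

def citation_value(age_years, tech_score, legal_score, lam=0.6, rate=0.3):
    # Legal citations delimit scope most strongly while young.
    legal = math.exp(-rate * age_years) * legal_score
    return lam * tech_score + (1 - lam) * legal

citations = [(1.0, 0.8, 0.2),   # (age, technological, legal) per citation
             (5.0, 0.3, 0.9)]
patent_value = sum(citation_value(*c) for c in citations)
print(patent_value)
```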

Journal ArticleDOI
TL;DR: An API query language that allows mobile mashup applications to readily specify and obtain desired information by instructing a proxy to filter unnecessary information returned from Web API servers is designed and an image multi-get module is devised, which results in mobile mashups applications with smaller transfer sizes.
Abstract: Recently, the proliferation of smartphones and the extensive coverage of wireless networks have enabled numerous mobile users to access Web resources with smartphones. Mobile mashup applications are very attractive to smartphone users due to specialized services and user-friendly GUIs. However, to offer new services through the integration of Web resources via Web API invocations, mobile mashup applications suffer from high energy consumption and long response time. In this paper, we propose a proxy system and two techniques to reduce the size of data transfer, thereby enabling mobile mashup applications to achieve energy-efficient and cost-effective Web API invocations. Specifically, we design an API query language that allows mobile mashup applications to readily specify and obtain desired information by instructing a proxy to filter unnecessary information returned from Web API servers. We also devise an image multi-get module, which results in mobile mashup applications with smaller transfer sizes by combining multiple images and adjusting the quality, scale, or resolution of the images. With the proposed proxy and techniques, a mobile mashup application can rapidly retrieve Web resources via Web API invocations with lower energy consumption due to a smaller number of HTTP requests and responses as well as smaller response bodies. Experimental results show that the proposed proxy system and techniques significantly reduce transfer size, response time, and energy consumption of mobile mashup applications.
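The idea behind the API query language can be sketched as a projection the proxy applies before forwarding a response: fetch the full Web API payload, keep only the requested fields. The dot-separated path syntax below is an illustrative assumption, not the paper's actual language.

```python
# Hedged sketch: proxy-side projection of a JSON API response onto
# the fields the mobile mashup application actually requested.
def project(response, field_paths):
    out = {}
    for path in field_paths:
        node, keys = response, path.split(".")
        try:
            for k in keys:
                node = node[k]
        except (KeyError, TypeError):
            continue                     # silently skip missing fields
        cur = out
        for k in keys[:-1]:
            cur = cur.setdefault(k, {})
        cur[keys[-1]] = node             # copy only the requested leaf
    return out

api_response = {"user": {"name": "lee", "avatar_url": "http://...", "bio": "..."},
                "stats": {"followers": 10, "following": 3}}
print(project(api_response, ["user.name", "stats.followers"]))
# {'user': {'name': 'lee'}, 'stats': {'followers': 10}} -- smaller transfer
```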

Proceedings ArticleDOI
12 Mar 2014
TL;DR: An extensive analysis on the developers of Open Source Software (OSS) projects is conducted, finding that a significant ratio of developers share the same affiliation and location in a team for a project that is being developed by remote collaborators.
Abstract: We conduct an extensive analysis of the developers of Open Source Software (OSS) projects. Our goal is to discover trends that govern the developers' behavior in contributing to OSS projects. To achieve our goal, we define and analyze a set of developer and OSS project features. Moreover, we study the behavior of developers in selecting OSS projects to participate in by analyzing the project features that dictate the developers' selection. In addition, we study the difference, in developing social ties, between developers who seek a job and those who do not. We also analyze the developers' affiliation (e.g., corporation, university, institute) and location (e.g., city) statistics. It is found that a significant ratio of developers share the same affiliation and location in a team for a project that is being developed by remote collaborators. We use a dataset collected from Github.com, one of the fastest-growing, large-scale online OSS communities. This study provides a foundation for future work on recommender systems targeting the OSS community.

Book ChapterDOI
13 May 2014
TL;DR: Experimental results show that the models created based on the proposed approach significantly enhance those using the baseline features or patent backward citations, and also exploit trends in temporal patterns of relevant prior patents, which are highly related to patent values.
Abstract: It is a challenging task for firms to assess the importance of a patent and identify valuable patents as early as possible. Counting the number of citations received is a widely used method to assess the value of a patent. However, recently granted patents have few citations received, which makes the use of citation counts infeasible. In this paper, we propose a novel idea to evaluate the value of new or recently granted patents using recommended relevant prior patents. Our approach is to exploit trends in temporal patterns of relevant prior patents, which are highly related to patent values. We evaluate the proposed approach using two patent value evaluation tasks with a large-scale collection of US patents. Experimental results show that the models created based on our idea significantly enhance those using the baseline features or patent backward citations.

Book ChapterDOI
21 Apr 2014
TL;DR: A new index structure and query processing algorithms are proposed for distance-based top-k queries; the index, called SKY R-tree, builds on the strengths of the R-tree and the skyline algorithm to efficiently prune the search space by exploring both spatial proximity and non-spatial attributes.
Abstract: Searches for objects associated with location information and non-spatial attributes have increased significantly over the years. To address this need, a top-k query may be issued by taking into account both the location information and non-spatial attributes. This paper focuses on a distance-based top-k query which retrieves the best objects based on the distance from candidate objects to a query point as well as other non-spatial attributes. In this paper, we propose a new index structure and query processing algorithms for distance-based top-k queries. This new index, called the SKY R-tree, builds on the strengths of the R-tree and the skyline algorithm to efficiently prune the search space by exploring both spatial proximity and non-spatial attributes. Moreover, we propose a variant of the SKY R-tree, called the S2KY R-tree, which incorporates a similarity measure of non-spatial attributes. We demonstrate, through extensive experimentation, that our proposals perform very well in terms of I/O costs and CPU time.
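The query the index accelerates can be sketched without the index itself: score each object by its distance to the query point plus a weighted non-spatial cost, and keep the k smallest. The linear scan below stands in for the SKY R-tree's pruning, and the scoring function is an illustrative assumption.

```python
# Hedged sketch of the distance-based top-k ranking semantics.
import heapq
import math

def score(obj, q, w=0.5):
    # obj: ((x, y), nonspatial_cost); lower combined score is better.
    # Assumes the two cost components are on comparable scales.
    return math.dist(obj[0], q) + w * obj[1]

def topk(objects, q, k):
    return heapq.nsmallest(k, objects, key=lambda o: score(o, q))

objects = [((1, 1), 0.2), ((0.5, 0.5), 0.9), ((5, 5), 0.1)]
print(topk(objects, (0, 0), k=2))
```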

Proceedings ArticleDOI
10 Mar 2014
TL;DR: This paper proposes a recommendation system for missing citations for newly granted patents; based on the patent citation network of a newly granted query patent, it ranks candidate patents via a RankSVM model learned by using path-based relevancy scores as features.
Abstract: The U.S. recently adopted a post-grant opposition procedure to encourage third parties to challenge the validity of newly granted patents by providing relevant prior patents that are missed during patent examination (i.e., missing citations). In this paper, we propose a recommendation system for missing citations for newly granted patents. The recommendation system, based on the patent citation network of a newly granted query patent, focuses on paths that start with the references of the query patent in the network. Our approach is to identify the relevancy of a candidate patent to the query patent by its citation relationship (paths) that are distinguished based on the direction, topology and semantics of the paths in the network. We consider six different types of paths between a candidate patent and a query patent based on their citation relationship and define a relevancy score for each path type. Accordingly, we rank candidate patents via a RankSVM model learned by using those relevancy scores as features. The experimental results show our approach significantly improves the average precision and recall performance compared to two baseline methods, i.e., Katz distance and text similarity.

Book ChapterDOI
16 Jun 2014
TL;DR: A cache-based algorithm is proposed that clusters entities from similar pairs based on the disjoint-set algorithm and is designed for the MapReduce framework; it achieves greater efficiency than previous algorithms on entity resolution and clustering.
Abstract: Entity resolution has been widely used in data mining applications to find similar records. However, the increasing scale and complexity of data have restricted the performance of entity resolution. In this paper, we propose a novel entity resolution framework that clusters large-scale data with a distributed entity resolution method. We model the clustering problem as finding similar connected subgraphs of records. First, our approach finds pairs of records whose similarities are above a given threshold based on the appjoin algorithm, which extends the ppjoin algorithm and is executed on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from similar pairs based on the disjoint-set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms achieve greater efficiency than previous algorithms on entity resolution and clustering.
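The clustering step maps naturally onto a disjoint-set (union-find) structure: every above-threshold pair unions two records, and the resulting components are the entity clusters. The sketch below shows this step on a single machine; the ppjoin-style similarity join, the caching, and the MapReduce partitioning are omitted.

```python
# Hedged sketch: union-find clustering of records from similar pairs.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def cluster(records, similar_pairs):
    parent = {r: r for r in records}
    for a, b in similar_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb             # union the two clusters
    groups = {}
    for r in records:
        groups.setdefault(find(parent, r), []).append(r)
    return list(groups.values())

records = ["r1", "r2", "r3", "r4"]
pairs = [("r1", "r2"), ("r2", "r3")]    # above-threshold similarity
print(cluster(records, pairs))          # [['r1', 'r2', 'r3'], ['r4']]
```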

Journal ArticleDOI
TL;DR: The site-based and area-based approaches are developed for efficiently processing range and k-nearest-neighbor queries on distributed BSDs; an optimal division is proved and a practical heuristic is derived to partition a query and select the best processing site for each partition, achieving even better efficiency.
Abstract: This paper studies the problem of querying Bounded Spatial Datasets (BSDs). A BSD contains objects with known locations, and unknown regions, each of which bounds an unknown number of objects, within a coverage area. We consider applications where each BSD is hosted on a site connected to a communication network and the BSDs overlap in their coverage areas. The challenge is to query the distributed BSDs to retrieve all objects and to minimize the unknown regions which may contain objects satisfying the query, while minimizing the data transmission volume and number of interactions between the query client and the sites. We develop the site-based approach and the area-based approach for efficiently processing range and kNN queries on distributed BSDs. Accordingly, optimal site selection and the corresponding site querying methods are important problems studied in this paper. In the area-based approach, we prove an optimal division and derive a practical heuristic to partition a query and select the best processing site for each partition, hence achieving even better efficiency than the site-based approach. Simulation results based on three real spatial datasets show that our proposed approaches significantly outperform the baseline in terms of data transmission volume and the number of interactions.

Proceedings ArticleDOI
10 Mar 2014
TL;DR: This paper proposes to identify patent technological trends, which carry information about technology evolution and trajectories among patents, to enable more effective and precise patent evaluation, and demonstrates that the identified technological trends are able to capture patent value precisely.
Abstract: Patents are very important intangible assets that protect firm technologies and maintain market competitiveness. Thus, patent evaluation is critical for firm business strategy and innovation management. Currently, patent evaluation mostly relies on some meta information of patents, such as the number of forward/backward citations and the number of claims. In this paper, we propose to identify patent technological trends, which carry information about technology evolution and trajectories among patents, to enable more effective and precise patent evaluation. We explore features to capture both the value of trends and the quality of patents within a trend, and perform patent evaluation to validate the extracted trends and features using patents in the United States Patent and Trademark Office (USPTO) dataset. Experimental results demonstrate that the identified technological trends are able to capture patent value precisely. With the proposed trend-related features extracted from our identified trends, we can improve patent evaluation performance significantly over the baseline using conventional features.

Proceedings ArticleDOI
10 Mar 2014
TL;DR: A framework for business location planning that takes into account both factors of geographical proximity and social influence is proposed and a suite of algorithms based on Targeted Region-oriented strategy is designed to enhance the processing efficiency.
Abstract: Business location planning, critical to the success of many businesses, can be addressed by a reverse nearest neighbors (RNN) query using geographical proximity to the customers as the main metric to find a store location that is the closest to many customers. Nevertheless, we argue that other marketing factors such as social influence could be considered in the process of business location planning. In this paper, we propose a framework for business location planning that takes into account both factors of geographical proximity and social influence. An essential task in this framework is to compute the "influence spread" of RNNs for candidate locations. However, excessive computational overhead and long latency hinder its feasibility for our framework. Thus, we trade storage overhead for processing speed by precomputing and storing the social influences between pairs of customers, and design a suite of algorithms based on a Targeted Region-oriented strategy. Various ordering and pruning techniques have been incorporated in these algorithms to enhance the processing efficiency of our framework. Experiments also show that the proposed algorithms efficiently support the task of location planning under various parameter settings.
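The framework's core quantity can be sketched as follows: a candidate location's score is the summed (precomputed) social influence of its reverse nearest neighbors, i.e., the customers for whom the candidate would be closer than every existing store. The brute-force RNN scan and the influence values below are illustrative; the paper's Targeted Region-oriented algorithms exist precisely to avoid this exhaustive computation.

```python
# Hedged sketch: score a candidate store location by the social
# influence of its reverse nearest neighbors.
import math

def rnn_customers(candidate, stores, customers):
    out = []
    for cid, loc in customers.items():
        d_cand = math.dist(loc, candidate)
        # Customer is an RNN if the candidate beats every existing store.
        if all(d_cand < math.dist(loc, s) for s in stores):
            out.append(cid)
    return out

def location_score(candidate, stores, customers, influence):
    # influence[cid]: precomputed influence spread of that customer.
    return sum(influence[c] for c in rnn_customers(candidate, stores, customers))

customers = {"c1": (0, 0), "c2": (4, 4), "c3": (5, 5)}
stores = [(6, 6)]
influence = {"c1": 3.0, "c2": 1.0, "c3": 0.5}
print(location_score((1, 1), stores, customers, influence))  # 3.0 (only c1)
```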