
Showing papers by "Jiawei Han published in 2010"


Proceedings Article
23 Aug 2010
TL;DR: A novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions, which show better agreement with human summaries than a baseline extractive method.
Abstract: We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.

500 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: The experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method in addressing graph queries on large networks.
Abstract: The dramatic proliferation of sophisticated networks has resulted in a growing need for supporting effective querying and mining methods over such large-scale graph-structured data. At the core of many advanced network operations lies a common and critical graph query primitive: how to search graph structures efficiently within a large network? Unfortunately, the graph query is hard due to the NP-complete nature of subgraph isomorphism. It becomes even more challenging when the network examined is large and diverse. In this paper, we present a high-performance graph indexing mechanism, SPath, to address the graph query problem on large networks. SPath leverages decomposed shortest paths around vertex neighborhoods as basic indexing units, which prove to be both effective in graph search space pruning and highly scalable in index construction and deployment. Via SPath, a graph query is processed and optimized beyond the traditional vertex-at-a-time fashion to a more efficient path-at-a-time way: the query is first decomposed into a set of shortest paths, among which a subset of candidates with good selectivity is picked by a query plan optimizer; candidate paths are then joined together to recover the query graph and finalize the graph query processing. We evaluate SPath against the state-of-the-art GraphQL on both real and synthetic data sets. Our experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method for addressing graph queries on large networks.

359 citations
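The pruning side of the path-at-a-time idea can be sketched in miniature. The code below is a hypothetical simplification, not the SPath implementation: it summarizes each vertex neighborhood by the labels reachable at each BFS distance, and prunes data vertices whose summary cannot cover a query vertex's summary.

```python
from collections import deque, Counter

def neighborhood_signature(adj, labels, v, k):
    """Labels reachable at each BFS distance 1..k from v (a toy stand-in
    for SPath's decomposed-shortest-path indexing units)."""
    sig = {d: Counter() for d in range(1, k + 1)}
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == k:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                sig[d + 1][labels[w]] += 1
                frontier.append((w, d + 1))
    return sig

def candidates(data_adj, data_labels, query_adj, query_labels, qv, k=2):
    """Data vertices that could match query vertex qv: same label, and the
    data signature must cover the query signature at every distance."""
    qsig = neighborhood_signature(query_adj, query_labels, qv, k)
    out = []
    for v in data_adj:
        if data_labels[v] != query_labels[qv]:
            continue
        vsig = neighborhood_signature(data_adj, data_labels, v, k)
        if all(vsig[d][l] >= c for d in qsig for l, c in qsig[d].items()):
            out.append(v)
    return out
```

On a toy labeled path graph, a query edge a-b keeps only the data vertex labeled a that actually has a b-labeled neighbor.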


Journal ArticleDOI
01 Sep 2010
TL;DR: In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly.
Abstract: Recent improvements in positioning technology make massive moving object data widely available. One important analysis is to find the moving objects that travel together. Existing methods impose a strong constraint when defining a moving object cluster: they require the moving objects to stick together for consecutive timestamps. Our key observation is that the moving objects in a cluster may actually diverge temporarily and congregate at certain timestamps. Motivated by this, we propose the concept of swarm, which captures moving objects that move within arbitrarily shaped clusters for certain timestamps that are possibly non-consecutive. The goal of our paper is to find all discriminative swarms, namely closed swarms. While the search space for closed swarms is prohibitively huge, we design a method, ObjectGrowth, to efficiently retrieve the answer. In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly. Empirical studies on real data as well as large synthetic data demonstrate the effectiveness and efficiency of our methods.

344 citations
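For intuition, the closed-swarm definition can be checked by brute force on toy data. This sketch (hypothetical names) enumerates every object subset, which is exactly the exponential search that ObjectGrowth's pruning strategies avoid:

```python
from itertools import combinations

def closed_swarms(cluster_of, objects, timestamps, min_o=2, min_t=2):
    """Brute-force closed-swarm miner. cluster_of[t][o] is o's cluster id
    at time t. A swarm (O, T) requires all of O co-clustered at every t in
    T, with |O| >= min_o and |T| >= min_t (T need not be consecutive);
    closed means no strict superset of O or T still qualifies."""
    found = []
    for r in range(min_o, len(objects) + 1):
        for O in combinations(objects, r):
            # maximal timestamp set for this object set
            T = tuple(t for t in timestamps
                      if len({cluster_of[t][o] for o in O}) == 1)
            if len(T) >= min_t:
                found.append((frozenset(O), frozenset(T)))
    # keep only closed swarms
    return [(O, T) for O, T in found
            if not any((O < O2 and T <= T2) or (O <= O2 and T < T2)
                       for O2, T2 in found if (O, T) != (O2, T2))]
```

With objects a, b always together and c joining them only at times 2 and 3, the closed swarms are ({a,b}, {1,2,3}) and ({a,b,c}, {2,3}); the sub-swarms ({a,c}, {2,3}) and ({b,c}, {2,3}) are pruned as non-closed.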


Proceedings ArticleDOI
25 Jul 2010
TL;DR: A two-stage algorithm, Periodica, is proposed to mine periodic behaviors for moving objects, under the assumption that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations.
Abstract: Periodicity is a frequently happening phenomenon for moving objects. Finding periodic behaviors is essential to understanding object movements. However, periodic behaviors could be complicated, involving multiple interleaving periods, partial time span, and spatiotemporal noises and outliers. In this paper, we address the problem of mining periodic behaviors for moving objects. It involves two sub-problems: how to detect the periods in complex movement, and how to mine periodic movement behaviors. Our main assumption is that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations. Based on this assumption, we propose a two-stage algorithm, Periodica, to solve the problem. At the first stage, the notion of observation spot is proposed to capture the reference locations. Through observation spots, multiple periods in the movement can be retrieved using a method that combines Fourier transform and autocorrelation. At the second stage, a probabilistic model is proposed to characterize the periodic behaviors. For a specific period, periodic behaviors are statistically generalized from partial movement sequences through hierarchical clustering. Empirical studies on both synthetic and real data sets demonstrate the effectiveness of our method.

308 citations
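The first stage's period detection, combining the Fourier transform with autocorrelation, can be sketched as follows. This is a simplified single-spot version with hypothetical names: the periodogram proposes candidate periods and the autocorrelation validates them at exact integer lags, as the paper describes.

```python
import numpy as np

def detect_period(visits, top_k=1):
    """Period detection on one observation spot's in/out sequence:
    `visits` is 0/1 per timestamp (was the object at the spot?)."""
    x = np.asarray(visits, dtype=float) - np.mean(visits)
    n = len(x)
    # periodogram: candidate periods from the dominant frequencies
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(n)
    order = np.argsort(power[1:])[::-1] + 1      # skip the DC component
    candidates = [round(1.0 / freqs[i]) for i in order[:5] if freqs[i] > 0]
    # autocorrelation: rank candidates by their exact-lag correlation
    acf = np.correlate(x, x, mode='full')[n - 1:]
    acf = acf / acf[0]
    best = sorted(set(c for c in candidates if 1 < c < n),
                  key=lambda p: -acf[p])
    return best[:top_k]
```

On a synthetic weekly visit pattern the true period 7 dominates both the spectrum and the autocorrelation.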


Proceedings ArticleDOI
26 Oct 2010
TL;DR: A generative graphical model is proposed which utilizes the heterogeneous link information and the textual content associated with each node in the network to mine topic-level direct influence and a topic- level influence propagation and aggregation algorithm is proposed to derive the indirect influence between nodes.
Abstract: Influence is a complex and subtle force that governs the dynamics of social networks as well as the behaviors of involved users. Understanding influence can benefit various applications such as viral marketing, recommendation, and information retrieval. However, most existing works on social influence analysis have focused on verifying the existence of social influence. Few works systematically investigate how to mine the strength of direct and indirect influence between nodes in heterogeneous networks. To address the problem, we propose a generative graphical model which utilizes the heterogeneous link information and the textual content associated with each node in the network to mine topic-level direct influence. Based on the learned direct influence, a topic-level influence propagation and aggregation algorithm is proposed to derive the indirect influence between nodes. We further study how the discovered topic-level influence can help the prediction of user behaviors. We validate the approach on three different genres of data sets: Twitter, Digg, and citation networks. Qualitatively, our approach can discover interesting influence patterns in heterogeneous networks. Quantitatively, the learned topic-level influence can greatly improve the accuracy of user behavior prediction.

278 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper proposes an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers, and applies the model on both synthetic data and DBLP data sets to demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.
Abstract: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars in a more positive sense), and then shows that well-known baseline approaches that ignore links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.

260 citations


Book ChapterDOI
20 Sep 2010
TL;DR: This paper considers the transductive classification problem on heterogeneous networked data which share a common topic and proposes a novel graph-based regularization framework, GNetMine, to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types.
Abstract: A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for some objects. Learning from such labeled and unlabeled data via transductive classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only some objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetMine, is proposed to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods.

247 citations


Journal ArticleDOI
09 Nov 2010
TL;DR: Different techniques employed to study various aspects of tagging are summarized, including properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags and problems associated with tagging usage.
Abstract: Social tagging on online portals has become a trend. It has emerged as one of the best ways of associating metadata with web objects. With the increase in the kinds of web objects becoming available, collaborative tagging of such objects is also developing along new dimensions. This popularity has led to a vast literature on social tagging. In this survey paper, we summarize the different techniques employed to study various aspects of tagging. Broadly, we discuss the properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags, and problems associated with tagging usage. We cover topics such as why people tag, what influences the choice of tags, how to model the tagging process, kinds of tags, different power laws observed in the tagging domain, how tags are created, and how to choose the right tags for recommendation. We conclude with thoughts on future work in the area.

224 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: A time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function, is proposed, and an efficient learning algorithm is designed to optimize the objective function.
Abstract: Information networks contain abundant knowledge about relationships among people or entities. Unfortunately, such knowledge is often hidden in a network where different kinds of relationships are not explicitly categorized. For example, in a research publication network, the advisor-advisee relationships among researchers are hidden in the coauthor network. Discovery of those relationships can benefit many interesting applications such as expert finding and research community analysis. In this paper, we take a computer science bibliographic network as an example, to analyze the roles of authors and to discover the likely advisor-advisee relationships. In particular, we propose a time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function. We further design an efficient learning algorithm to optimize the objective function. Based on that, our model suggests and ranks probable advisors for every author. Experimental results show that the proposed approach infers advisor-advisee relationships efficiently and achieves state-of-the-art accuracy (80-90%). We also apply the discovered advisor-advisee relationships to bole search, a specific expert-finding task, and an empirical study shows that the search performance can be effectively improved (+4.09% by NDCG@5).

212 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: A family of novel approximate SimRank computation algorithms for static and dynamic information networks are developed and their corresponding theoretical justification and analysis are given.
Abstract: Information networks are ubiquitous in many applications and analysis on such networks has attracted significant attention in the academic communities. One of the most important aspects of information network analysis is to measure similarity between nodes in a network. SimRank is a simple and influential measure of this kind, based on a solid theoretical "random surfer" model. Existing work computes SimRank similarity scores in an iterative mode. We argue that the iterative method can be infeasible and inefficient when, as in many real-world scenarios, the networks change dynamically and frequently. We envision a non-iterative method to bridge the gap. It allows users not only to update the similarity scores incrementally, but also to derive similarity scores for an arbitrary subset of nodes. To enable the non-iterative computation, we propose to rewrite the SimRank equation into a non-iterative form by using the Kronecker product and vectorization operators. Based on this, we develop a family of novel approximate SimRank computation algorithms for static and dynamic information networks, and give their corresponding theoretical justification and analysis. The non-iterative method supports efficient processing of various node analysis tasks, including similarity tracking and centrality tracking, on evolving information networks. The effectiveness and efficiency of our proposed methods are evaluated on synthetic and real data sets.

171 citations
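The Kronecker-product rewriting can be illustrated on the simplified linear recursion S = c·WᵀSW + (1−c)·I, where W is the column-normalized adjacency matrix (the exact SimRank recursion treats the diagonal slightly differently). Vectorizing gives vec(S) = (1−c)·(I − c·Wᵀ⊗Wᵀ)⁻¹ vec(I), solvable directly, though the system is n²×n² and so only feasible for small graphs:

```python
import numpy as np

def simrank_noniterative(A, c=0.6):
    """Closed-form solution of the linearized SimRank recursion via
    Kronecker product and vectorization (a sketch of the paper's idea,
    not its scalable approximate algorithms)."""
    n = A.shape[0]
    cols = A.sum(axis=0, keepdims=True)
    W = np.divide(A, cols, out=np.zeros_like(A, dtype=float),
                  where=cols != 0)            # column-normalized adjacency
    K = np.kron(W.T, W.T)                     # (W^T ⊗ W^T), n^2 x n^2
    vec_s = np.linalg.solve(np.eye(n * n) - c * K,
                            (1 - c) * np.eye(n).flatten())
    return vec_s.reshape(n, n)
```

On a three-node graph where nodes 0 and 1 both link to node 2, the closed form reproduces the expected scores: s(0,0) = 1−c = 0.4 and s(2,2) = 0.4 + c·0.25·(s00+s11) = 0.52.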


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper formally defines the problem of popular event tracking in online communities (PET) and proposes a novel statistical method that models the popularity of events over time, taking into consideration the burstiness of user interest, information diffusion on the network structure, and the evolution of textual topics.
Abstract: User generated information in online communities has been characterized by the mixture of a text stream and a network structure, both changing over time. A good example is a web-blogging community with daily blog posts and a social network of bloggers. An important task in analyzing an online community is to observe and track the popular events, or topics, that evolve over time in the community. Existing approaches usually focus on either the burstiness of topics or the evolution of networks, while ignoring the interplay between textual topics and network structures. In this paper, we formally define the problem of popular event tracking in online communities (PET), focusing on the interplay between texts and networks. We propose a novel statistical method that models the popularity of events over time, taking into consideration the burstiness of user interest, information diffusion on the network structure, and the evolution of textual topics. Specifically, a Gibbs Random Field is defined to model the influence of historic status and the dependency relationships in the graph; thereafter a topic model generates the words in the text content of the event, regularized by the Gibbs Random Field. We prove that two classic models in information diffusion and text burstiness are special cases of our model under certain situations. Empirical experiments with two different communities and datasets (i.e., Twitter and DBLP) show that our approach is effective and outperforms existing approaches.

Book ChapterDOI
01 Apr 2010
TL;DR: An incremental clustering framework for trajectories is proposed, which contains two parts, online micro-cluster maintenance and offline macro-cluster creation, where each trajectory is simplified into a set of directed line segments in order to find clusters of trajectory subparts.
Abstract: Trajectory clustering has played a crucial role in data analysis since it reveals underlying trends of moving objects. Due to their sequential nature, trajectory data are often received incrementally, e.g., continuous new points reported by a GPS system. However, since existing trajectory clustering algorithms are developed for static datasets, they are not suitable for incremental clustering, which has the following two requirements. First, clustering should be processed efficiently since it can be requested frequently. Second, huge amounts of trajectory data must be accommodated, as they will accumulate constantly. An incremental clustering framework for trajectories is proposed in this paper. It contains two parts: online micro-cluster maintenance and offline macro-cluster creation. For the online part, when a new bunch of trajectories arrives, each trajectory is simplified into a set of directed line segments in order to find clusters of trajectory subparts. Micro-clusters are used to store compact summaries of similar trajectory line segments, which take much less space than raw trajectories. When new data are added, micro-clusters are updated incrementally to reflect the changes. For the offline part, when a user requests to see the current clustering result, macro-clustering is performed on the set of micro-clusters rather than on all trajectories over the whole time span. Since the number of micro-clusters is smaller than that of the original trajectories, macro-clusters are generated efficiently to show the clustering result of trajectories. Experimental results on both synthetic and real data sets show that our framework achieves high efficiency as well as high clustering quality.
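The online micro-cluster summary can be sketched as a small additive feature vector over line segments. This is a hypothetical minimal version: the actual framework stores richer statistics (e.g., segment counts per direction and angular information) so that macro-clustering can run on summaries alone.

```python
class MicroCluster:
    """Compact summary of similar trajectory line segments: storing linear
    sums lets the center be maintained incrementally, without keeping the
    raw segments around."""
    def __init__(self):
        self.n = 0
        self.ls = (0.0, 0.0, 0.0, 0.0)   # sums of (x1, y1, x2, y2)

    def add(self, seg):
        x1, y1, x2, y2 = seg
        self.n += 1
        self.ls = tuple(a + b for a, b in zip(self.ls, (x1, y1, x2, y2)))

    def center(self):
        """Representative segment: the component-wise mean."""
        return tuple(s / self.n for s in self.ls)
```

Adding a segment is O(1), which is what makes the online part cheap enough to run as trajectories stream in.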

Proceedings ArticleDOI
09 Aug 2010
TL;DR: This approach estimates link relevance by using a random walk algorithm on an augmented social graph with both attribute and structure information, and outperforms state-of-the-art methods for link recommendation.
Abstract: The phenomenal success of social networking sites, such as Facebook, Twitter and LinkedIn, has revolutionized the way people communicate. This paradigm has attracted the attention of researchers who wish to study the corresponding social and technological problems. Link recommendation is a critical task that not only helps increase the linkage inside the network but also improves the user experience. In an effective link recommendation algorithm it is essential to identify the factors that influence link creation. This paper enumerates several of these intuitive criteria and proposes an approach which satisfies these factors. This approach estimates link relevance by using a random walk algorithm on an augmented social graph with both attribute and structure information. The global and local influences of the attributes are leveraged in the framework as well. Beyond link recommendation, our framework can also rank the attributes in the network. Experiments on DBLP and IMDB data sets demonstrate that our method outperforms state-of-the-art methods for link recommendation.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper takes a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia, revealing the wisdom that is embedded in social multimedia sites for social science applications such as politics, economics, and marketing.
Abstract: Social multimedia hosting and sharing websites, such as Flickr, Facebook, Youtube, Picasa, ImageShack and Photobucket, are increasingly popular around the globe. A major trend in the current studies on social multimedia is using the social media sites as a source of a huge amount of labeled data for solving large-scale computer science problems in computer vision, data mining and multimedia. In this paper, we take a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia. In a sense, each time an image or video is uploaded or viewed, it constitutes an implicit vote for (or against) the subject of the image. This vote carries along with it a rich set of associated data including time and (often) location information. By aggregating such votes across millions of Internet users, we reveal the wisdom that is embedded in social multimedia sites for social science applications such as politics, economics, and marketing. We believe that our work opens a brand new arena for the multimedia research community with a potentially big impact on society and social sciences.

Journal ArticleDOI
TL;DR: This study re-examine a set of null-invariant interestingness measures and finds that they can be expressed as the generalized mathematical mean, leading to a total ordering of them and proposes a new measure called Imbalance Ratio to gauge the degree of skewness of a data set.
Abstract: Numerous interestingness measures have been proposed in statistics and data mining to assess object relationships. This is especially important in recent studies of association or correlation pattern mining. However, it is still not clear whether there is any intrinsic relationship among many proposed measures, and which one is truly effective at gauging object relationships in large data sets. Recent studies have identified a critical property, null-(transaction) invariance, for measuring associations among events in large data sets, but many measures do not have this property. In this study, we re-examine a set of null-invariant interestingness measures and find that they can be expressed as the generalized mathematical mean, leading to a total ordering of them. Such a unified framework provides insights into the underlying philosophy of the measures and helps us understand and select the proper measure for different applications. Moreover, we propose a new measure called Imbalance Ratio to gauge the degree of skewness of a data set. We also discuss the efficient computation of interesting patterns of different null-invariant interestingness measures by proposing an algorithm, GAMiner, which complements previous studies. Experimental evaluation verifies the effectiveness of the unified framework and shows that GAMiner speeds up the state-of-the-art algorithm by an order of magnitude.
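The generalized-mean view can be made concrete from support counts alone. In this sketch, the common null-invariant measures are generalized means M_p of the two confidences P(B|A) and P(A|B), totally ordered as p runs from −∞ (AllConf) through 0 (Cosine) and 1 (Kulczynski) to +∞ (MaxConf); Coherence is omitted since it is a monotone transform of the harmonic mean rather than a plain mean.

```python
def null_invariant_measures(n_a, n_b, n_ab):
    """Null-invariant measures as generalized means of the two confidences.
    n_a, n_b, n_ab are the support counts of A, B, and A-and-B."""
    ca, cb = n_ab / n_a, n_ab / n_b       # P(B|A), P(A|B)
    measures = {
        'all_conf':   min(ca, cb),        # p -> -inf
        'cosine':     (ca * cb) ** 0.5,   # p = 0  (geometric mean)
        'kulczynski': (ca + cb) / 2,      # p = 1  (arithmetic mean)
        'max_conf':   max(ca, cb),        # p -> +inf
    }
    # Imbalance Ratio, the paper's new gauge of skewness between A and B
    measures['imbalance_ratio'] = abs(n_a - n_b) / (n_a + n_b - n_ab)
    return measures
```

A skewed pair (A occurring 100 times, B 1000 times, always together with A) makes the total ordering and the high Imbalance Ratio visible at a glance.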

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work describes how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas, and shows that CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
Abstract: We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
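The line-by-line tag ratio at the heart of CETR can be sketched as below. This is a simplified regex-based version (the paper additionally smooths the resulting histogram before clustering it into content and non-content regions); treating a tagless line as having one tag is an assumption borrowed to keep the ratio finite.

```python
import re

TAG = re.compile(r'<[^>]*>')

def tag_ratios(html):
    """Per-line tag ratio: non-tag characters divided by the number of
    tags on the line. Long plain-text lines get large ratios (likely
    content); tag-heavy template lines get small ones."""
    ratios = []
    for line in html.splitlines():
        tags = TAG.findall(line)
        text = TAG.sub('', line).strip()
        ratios.append(len(text) / max(1, len(tags)))
    return ratios
```

A navigation-style line scores near zero, while a sentence of body text scores high, which is the separation the clustering step exploits.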

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes a parameter-free hierarchical network clustering algorithm SHRINK, which can effectively reveal the embedded hierarchical community structure with multiresolution in large-scale weighted undirected networks, and identify hubs and outliers as well.
Abstract: Community detection is an important task for mining the structure and function of complex networks. Generally, a network contains several different kinds of nodes: cluster nodes densely connected within communities, as well as special nodes such as hubs, which bridge multiple communities, and outliers, which are only marginally connected to a community. In addition, it has been shown that there is a hierarchical structure in complex networks, with communities embedded within other communities. Therefore, a good algorithm should be able not only to detect hierarchical communities, but also to identify hubs and outliers. In this paper, we propose a parameter-free hierarchical network clustering algorithm, SHRINK, which combines the advantages of density-based clustering and modularity optimization methods. Based on structural connectivity information, the proposed algorithm can effectively reveal the embedded hierarchical community structure with multiresolution in large-scale weighted undirected networks, and identify hubs and outliers as well. Moreover, it overcomes the sensitive-threshold problem of density-based clustering algorithms and the resolution limit of modularity-based methods. To illustrate our methodology, we conduct experiments with both real-world and synthetic datasets for community detection, and compare with many other baseline methods. Experimental results demonstrate that SHRINK achieves the best performance with consistent improvements.

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes an adaptive threshold for outlier detection, and proposes a probabilistic approach for novel class detection using discrete Gini Coefficient, and proves its effectiveness both theoretically and empirically.
Abstract: The problem of data stream classification is challenging because of many practical aspects associated with efficient processing and the temporal behavior of the stream. Two such well-studied aspects are infinite length and concept-drift. Since a data stream may be considered a continuous process, which is theoretically infinite in length, it is impractical to store and use all the historical data for training. Data streams also frequently experience concept-drift as a result of changes in the underlying concepts. However, another important characteristic of data streams, namely, concept-evolution, is rarely addressed in the literature. Concept-evolution occurs as a result of new classes evolving in the stream. This paper addresses concept-evolution in addition to the existing challenges of infinite length and concept-drift. In this paper, the concept-evolution phenomenon is studied, and the insights are used to construct superior novel class detection techniques. First, we propose an adaptive threshold for outlier detection, which is a vital part of novel class detection. Second, we propose a probabilistic approach for novel class detection using the discrete Gini Coefficient, and prove its effectiveness both theoretically and empirically. Finally, we address the issue of simultaneous multiple novel class occurrence, and provide an elegant solution to detect more than one novel class at the same time. We also consider feature-evolution in text data streams, which occurs because new features (i.e., words) evolve in the stream. Comparison with state-of-the-art data stream classification techniques establishes the effectiveness of the proposed approach.
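The discrete Gini coefficient itself is straightforward to compute. The formulation below is generic (the paper applies it to the distribution of outlier instances' novelty scores): values concentrated in a few instances yield a high coefficient, a uniform spread a low one.

```python
def discrete_gini(values):
    """Discrete Gini coefficient of a set of non-negative scores, via the
    standard rank-weighted-sum formulation: 0 for a uniform distribution,
    approaching 1 as mass concentrates on a single value."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

For n values the maximum is (n−1)/n, reached when all mass sits on one value, which is the concentrated case a novel-class detector flags.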

Proceedings ArticleDOI
14 Mar 2010
TL;DR: This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input, and cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group.
Abstract: This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input. A user can provide either a photo of the desired scenery or a keyword describing the place of interest, and the system will look into its database for places that share those visual characteristics. To that end, we first cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group. Tourist destination recommendations are produced by comparing the query against the representative tags or representative images, under the premise of "if you like that place, you may also like these places".

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper proposes several link recommendation criteria, based on both user attributes and graph structure, and an approach satisfying them that outperforms state-of-the-art methods based on network structure and node attribute information for link recommendation.
Abstract: With the phenomenal success of networking sites (e.g., Facebook, Twitter and LinkedIn), social networks have drawn substantial attention. On online social networking sites, link recommendation is a critical task that not only helps improve user experience but also plays an essential role in network growth. In this paper we propose several link recommendation criteria, based on both user attributes and graph structure. To discover the candidates that satisfy these criteria, link relevance is estimated using a random walk algorithm on an augmented social graph with both attribute and structure information. The global and local influence of the attributes is leveraged in the framework as well. Besides link recommendation, our framework can also rank attributes in a social network. Experiments on DBLP and IMDB data sets demonstrate that our method outperforms state-of-the-art methods based on network structure and node attribute information for link recommendation.
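The relevance estimator can be sketched as a plain random walk with restart on the augmented graph. The layout here is a hypothetical simplification (user nodes first, attribute nodes after; uniform edge weights): the paper additionally weights edges by the global and local influence of attributes.

```python
import numpy as np

def rwr_scores(adj, source, restart=0.15, iters=100):
    """Random walk with restart on the augmented graph (user nodes plus
    attribute nodes, with friendship and user-attribute edges). Returns
    visit probabilities from `source`."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float),
                  where=deg != 0)            # row-stochastic transitions
    e = np.zeros(n); e[source] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * P.T @ p + restart * e
    return p

def recommend_links(adj, user, n_users, k=1):
    """Rank non-neighbor users by RWR relevance; attribute nodes (indexed
    >= n_users in this sketch) are excluded from the recommendations."""
    p = rwr_scores(adj, user)
    cand = [(p[v], v) for v in range(n_users)
            if v != user and adj[user, v] == 0]
    return [v for _, v in sorted(cand, reverse=True)[:k]]
```

On a toy graph where users 0 and 2 share an attribute node, the walk routes relevance through that shared attribute, so 2 is recommended to 0 ahead of a more distant user.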

Proceedings ArticleDOI
06 Jun 2010
TL;DR: The system, MoveMine, is designed for sophisticated moving object data mining by integrating several attractive functions, including moving object pattern mining and trajectory mining; a user-friendly interface is provided to facilitate interactive exploration of mining results and flexible tuning of the underlying methods.
Abstract: With the maturity of GPS, wireless, and Web technologies, increasing amounts of movement data collected from various moving objects, such as animals, vehicles, mobile devices, and climate radars, have become widely available. Analyzing such data has broad applications, e.g., in ecological study, vehicle control, mobile communication management, and climatological forecast. However, few data mining tools are available for flexible and scalable analysis of massive-scale moving object data. Our system, MoveMine, is designed for sophisticated moving object data mining by integrating several attractive functions, including moving object pattern mining and trajectory mining. We explore state-of-the-art and novel techniques in implementing the selected functions. A user-friendly interface is provided to facilitate interactive exploration of mining results and flexible tuning of the underlying methods. Because MoveMine has been tested on multiple kinds of real data sets, it can help users carry out versatile analyses on such data. At the same time, it can help researchers recognize the importance and limitations of current techniques, as well as potential future studies in moving object data mining.

Proceedings ArticleDOI
Yizhou Sun1, Jie Tang2, Jiawei Han1, Manish Gupta1, Bo Zhao1 
24 Jul 2010
TL;DR: A Dirichlet Process Mixture Model-based generative model is proposed to model the community generations, and the evolution structure can be read from the model, which can help users better understand the birth, split, and death of communities.
Abstract: With the rapid development of all kinds of online databases, the huge heterogeneous information networks derived from them are ubiquitous. Detecting evolutionary communities in these networks can help people better understand the structural evolution of the networks. However, most current community evolution analysis is based on homogeneous networks, while a real community usually involves different types of objects in a heterogeneous network. For example, a research community contains a set of authors, a set of conferences or journals, and a set of terms. In this paper, we study the problem of detecting evolutionary multi-typed communities, defined as net-clusters, in dynamic heterogeneous networks. A Dirichlet Process Mixture Model-based generative model is proposed to model the community generations. At each time stamp, a clustering of communities, with the cluster number that can best explain the current and historical networks, is automatically detected. A Gibbs sampling-based inference algorithm is provided to infer the model parameters. Also, the evolution structure can be read from the model, which can help users better understand the birth, split and death of communities. Experiments on two real datasets, namely DBLP and Delicious.com, have shown the effectiveness of the algorithm.
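The automatic choice of cluster number comes from the Dirichlet Process prior; its behavior can be illustrated by sampling from the equivalent Chinese Restaurant Process (a standard construction, not the paper's full model; the parameters are invented):

```python
import random

def crp_assignments(n, alpha, rng):
    """Sample cluster assignments from a Chinese Restaurant Process prior.
    A Dirichlet Process mixture lets the number of clusters grow with the
    data instead of being fixed in advance."""
    counts = []   # members per cluster
    labels = []
    for i in range(n):
        # New cluster with prob alpha/(i+alpha), else cluster k w.p. counts[k]/(i+alpha)
        r = rng.uniform(0, i + alpha)
        if not counts or r < alpha:
            labels.append(len(counts))
            counts.append(1)
        else:
            r -= alpha
            for k, c in enumerate(counts):
                if r < c:
                    labels.append(k)
                    counts[k] += 1
                    break
                r -= c
            else:  # guard against float edge cases at the interval boundary
                labels.append(len(counts) - 1)
                counts[-1] += 1
    return labels

rng = random.Random(42)
labels = crp_assignments(100, alpha=2.0, rng=rng)
num_clusters = len(set(labels))
```

Larger `alpha` yields more clusters on average; in the paper, the likelihood of the current and historical networks then reweights these prior assignments.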

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes a method called Tru-Alarm, which identifies trustworthy alarms and increases the feasibility of CPS; it estimates the locations of objects causing alarms, constructs an object-alarm graph, and carries out trustworthiness inference based on linked information in the graph.
Abstract: A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational) components to form a situation-integrated analytical system that responds intelligently to dynamic changes in real-world scenarios. One key issue in CPS research is trustworthiness analysis of the observed data: due to technology limitations and environmental influences, CPS data are inherently noisy, which may trigger many false alarms. It is highly desirable to sift meaningful information from a large volume of noisy data. In this paper, we propose a method called Tru-Alarm, which identifies trustworthy alarms and increases the feasibility of CPS. Tru-Alarm estimates the locations of objects causing alarms, constructs an object-alarm graph, and carries out trustworthiness inference based on linked information in the graph. Extensive experiments show that Tru-Alarm filters out noise and false information efficiently while ensuring that no meaningful alarms are missed.
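The object-alarm graph inference can be sketched as a simple alternating trust propagation (a made-up fixed-point scheme for illustration, not Tru-Alarm's actual inference rules; all data are invented):

```python
def infer_trust(alarm_objects, alarm_conf, iters=20):
    """Alternately update object trust and alarm trust on the bipartite
    object-alarm graph: an object is trustworthy if its alarms are, and
    an alarm is trustworthy if some linked object supports it."""
    objects = {o for objs in alarm_objects.values() for o in objs}
    alarm_trust = dict(alarm_conf)           # start from sensor confidence
    obj_trust = {o: 0.5 for o in objects}
    for _ in range(iters):
        for o in objects:
            linked = [a for a, objs in alarm_objects.items() if o in objs]
            obj_trust[o] = sum(alarm_trust[a] for a in linked) / len(linked)
        for a, objs in alarm_objects.items():
            support = max(obj_trust[o] for o in objs)
            alarm_trust[a] = 0.5 * alarm_conf[a] + 0.5 * support
    return alarm_trust

# Two alarms corroborated by the same object, plus one isolated noisy alarm
alarm_objects = {"a1": ["obj1"], "a2": ["obj1"], "a3": ["obj2"]}
alarm_conf = {"a1": 0.9, "a2": 0.8, "a3": 0.2}
trust = infer_trust(alarm_objects, alarm_conf)
```

Alarms whose estimated source object is corroborated by other alarms keep high trust, while isolated low-confidence alarms stay low and can be filtered.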

Proceedings ArticleDOI
03 Nov 2010
TL;DR: The main contribution of the paper is to show a certain data transformation at the client side that helps keep the client data private while not introducing any additional error into model construction.
Abstract: Many participatory sensing applications use data collected by participants to construct a public model of a system or phenomenon. For example, a health application might compute a model relating exercise and diet to amount of weight loss. While the ultimately computed model could be public, the individual input and output data traces used to construct it may be private data of participants (e.g., their individual food intake, lifestyle choices, and resulting weight). This paper proposes and experimentally studies a technique that attempts to keep such input and output data traces private, while allowing accurate model construction. This is significantly different from perturbation-based techniques in that no noise is added. The main contribution of the paper is to show a certain data transformation at the client side that helps keep the client data private while not introducing any additional error into model construction. We particularly focus on linear regression models, which are widely used in participatory sensing applications. We use the data set from a map-based participatory sensing service to evaluate our scheme. The service in question is a green navigation service that constructs regression models from participant data to predict the fuel consumption of vehicles on road segments. We evaluate our proposed mechanism by providing empirical evidence that: i) an individual data trace is generally hard to reconstruct with any reasonable accuracy, and ii) the regression model constructed using the transformed traces has a much smaller error than one based on additive data-perturbation schemes.
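The paper's exact transformation is not reproduced here, but one well-known way to obtain a regression model with no added error while withholding raw traces is for each client to share only aggregate sufficient statistics; this sketch (names and data invented) illustrates that general idea for simple linear regression:

```python
def client_summary(xs, ys):
    """Each client shares only sufficient statistics for y = a*x + b,
    never its raw (x, y) trace."""
    n = len(xs)
    return (n, sum(xs), sum(ys),
            sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys)))

def fit_from_summaries(summaries):
    """The server sums per-client statistics and solves the normal
    equations; the result equals the fit on the pooled raw data."""
    n = sum(s[0] for s in summaries)
    sx = sum(s[1] for s in summaries)
    sy = sum(s[2] for s in summaries)
    sxx = sum(s[3] for s in summaries)
    sxy = sum(s[4] for s in summaries)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Two clients whose pooled data lie on y = 2x + 1
c1 = client_summary([0, 1, 2], [1, 3, 5])
c2 = client_summary([3, 4], [7, 9])
a, b = fit_from_summaries([c1, c2])   # recovers a = 2.0, b = 1.0
```

Because no noise is added anywhere, the server's model is exact, unlike perturbation-based schemes where accuracy trades off against privacy.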

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes a novel density-based network clustering algorithm, called gSkeletonClu (graph-skeleton based clustering), which can find the optimal parameter $\varepsilon$ and detect communities, hubs and outliers in large-scale undirected networks automatically without any user interaction.
Abstract: Community detection is an important task for mining the structure and function of complex networks. Many previous approaches have difficulty detecting communities with arbitrary size and shape, and are unable to identify hubs and outliers. A recently proposed network clustering algorithm, SCAN, is effective and can overcome this difficulty. However, it depends on a sensitive parameter, the minimum similarity threshold $\varepsilon$, and provides no automated way to find it. In this paper, we propose a novel density-based network clustering algorithm, called gSkeletonClu (graph-skeleton based clustering). By projecting a network onto its Core-Connected Maximal Spanning Tree (CCMST), the network clustering problem is converted to finding core-connected components in the CCMST. We discover that all possible values of the parameter $\varepsilon$ lie in the edge weights of the corresponding CCMST. By means of divisive or agglomerative clustering on the tree, our algorithm can find the optimal parameter $\varepsilon$ and detect communities, hubs and outliers in large-scale undirected networks automatically, without any user interaction. Extensive experiments on both real-world and synthetic networks demonstrate the superior performance of gSkeletonClu over the baseline methods.
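The key observation, that every meaningful $\varepsilon$ lies among the tree's edge weights, can be illustrated with a plain maximal spanning tree, a simplification of the paper's core-connected variant (the toy graph is invented):

```python
def maximal_spanning_tree(nodes, edges):
    """Kruskal's algorithm on edges sorted by descending similarity."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.3), ("c", "d", 0.2)]
tree = maximal_spanning_tree(nodes, edges)
# Every meaningful similarity threshold lies among the tree's edge weights:
candidate_eps = sorted({w for _, _, w in tree}, reverse=True)
# Removing tree edges with weight below a chosen eps splits the tree into clusters
```

Because cutting the tree at any threshold between two consecutive edge weights yields the same components, only the tree's edge weights need to be tried when searching for the optimal $\varepsilon$.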

Book ChapterDOI
20 Sep 2010
TL;DR: DXMiner considers the dynamic nature of the feature space and provides an elegant solution for classification and novel class detection when the feature space is dynamic, and outperforms state-of-the-art stream classification techniques in classifying and detecting novel classes in real data streams.
Abstract: Data stream classification poses many challenges, most of which are not addressed by the state-of-the-art. We present DXMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and feature-evolution. Data streams are assumed to be infinite in length, which necessitates single-pass incremental learning techniques. Concept-drift occurs in a data stream when the underlying concept changes over time. Most existing data stream classification techniques address only the infinite length and concept-drift problems. However, concept-evolution and feature-evolution are also major challenges, and these are ignored by most of the existing approaches. Concept-evolution occurs in the stream when novel classes arrive, and feature-evolution occurs when new features emerge in the stream. Our previous work addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Most of the existing data stream classification techniques, including our previous work, assume that the feature space of the data points in the stream is static. This assumption may be impractical for some types of data, for example text data. DXMiner considers the dynamic nature of the feature space and provides an elegant solution for classification and novel class detection when the feature space is dynamic. We show that our approach outperforms state-of-the-art stream classification techniques in classifying and detecting novel classes in real data streams.
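The feature-evolution issue can be illustrated with a simple homogenization sketch that pads mismatched feature spaces with zeros (an invented minimal example; the paper's actual conversion techniques are more involved):

```python
def homogenize(model_features, instance):
    """Project an instance onto the union of the model's feature space and
    its own, padding missing dimensions with zeros."""
    union = sorted(set(model_features) | set(instance))
    vec = [instance.get(f, 0.0) for f in union]
    return union, vec

model_features = {"ball", "goal", "team"}         # features seen at training
instance = {"goal": 2.0, "referee": 1.0}          # stream point with a new feature
union, vec = homogenize(model_features, instance)
# union: ['ball', 'goal', 'referee', 'team'], vec: [0.0, 2.0, 1.0, 0.0]
```

This lets a classifier trained on an earlier feature space score points that carry features it has never seen, instead of silently dropping them.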

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work studies one traditional text mining task on this new form of text, the extraction of meaningful keywords, proposing several intuitive yet useful features and experimenting with various classification models.
Abstract: Today, a huge amount of text is being generated for social purposes on social networking services on the Web. Unlike traditional documents, such text is usually extremely short and tends to be informal. Analysis of such text benefits many applications such as advertising, search, and content filtering. In this work, we study one traditional text mining task on this new form of text: the extraction of meaningful keywords. We propose several intuitive yet useful features and experiment with various classification models. Evaluation is conducted on Facebook data. The performance of various features and models is reported and compared.
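A few such intuitive candidate-keyword features might look like the following sketch (the feature set and example post are invented; the paper's exact features are not reproduced):

```python
def keyword_features(text, candidate):
    """Simple features for judging whether `candidate` is a good keyword
    for a short social-text post."""
    words = text.split()
    lower = [w.lower().strip(".,!?") for w in words]
    cand = candidate.lower()
    tf = lower.count(cand)
    first = lower.index(cand) if cand in lower else -1
    return {
        "tf": tf,                                      # term frequency
        "relative_position": first / len(words) if first >= 0 else 1.0,
        "capitalized": any(w.istitle() and w.lower().strip(".,!?") == cand
                           for w in words),
        "length": len(candidate),
    }

post = "Watching the World Cup final tonight, what a match!"
feats = keyword_features(post, "cup")
```

Feature vectors like these would then be fed to a binary classifier that labels each candidate word as keyword or not.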

Book ChapterDOI
21 Jun 2010
TL;DR: ActMiner extends MineClass to address the limited labeled data problem in addition to the other three problems, and outperforms state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
Abstract: We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
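The active selection idea, labeling only points whose expected classification error is high, can be sketched with margin-based uncertainty sampling (an illustration with invented data, not ActMiner's exact criterion):

```python
def select_for_labeling(probs, budget):
    """Active selection: request labels for the instances whose top-two
    class probabilities are closest, i.e. where the classifier is least sure."""
    def margin(p):
        top = sorted(p, reverse=True)
        return top[0] - top[1]
    ranked = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return ranked[:budget]

# Posterior class probabilities from the current ensemble for 4 points
probs = [
    [0.95, 0.03, 0.02],   # confident -> skip
    [0.40, 0.38, 0.22],   # ambiguous -> label
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most ambiguous -> label first
]
to_label = select_for_labeling(probs, budget=2)   # -> [3, 1]
```

Spending the labeling budget on the most ambiguous points is why a small fraction of labels can match the accuracy of fully supervised training.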

Journal ArticleDOI
TL;DR: It is shown that the inverse classification problem is a powerful and general model encompassing a number of different criteria; it can be used for a variety of decision support applications that have pre-determined task criteria.
Abstract: In this paper, we examine an emerging variation of the classification problem, known as the inverse classification problem. In this problem, we determine the features to be used to create a record that will result in a desired class label. Such an approach is useful in applications in which the objective is to determine a set of actions to be taken in order to guide the data mining application towards a desired solution. This system can be used for a variety of decision support applications which have pre-determined task criteria. We will show that the inverse classification problem is a powerful and general model which encompasses a number of different criteria. We propose a number of algorithms for the inverse classification problem, which use an inverted list as the intermediate data structure for representation and classification. We validate our approach over a number of real datasets.
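A toy version of the idea, filling in the free features so that a desired label becomes plausible, might look like this nearest-neighbor sketch (all data are invented; the paper's inverted-list algorithms are not reproduced):

```python
def inverse_classify(partial, desired_label, training):
    """Fill in unspecified features of `partial` by copying them from the
    nearest training record that already has the desired label."""
    fixed = {f: v for f, v in partial.items() if v is not None}
    def dist(rec):
        return sum((rec[f] - v) ** 2 for f, v in fixed.items())
    same_class = [r for r, label in training if label == desired_label]
    best = min(same_class, key=dist)
    return {f: (v if v is not None else best[f]) for f, v in partial.items()}

training = [
    ({"exercise": 5.0, "calories": 1800.0}, "healthy"),
    ({"exercise": 0.5, "calories": 3100.0}, "at_risk"),
    ({"exercise": 4.0, "calories": 2000.0}, "healthy"),
]
# Given a fixed calorie intake, what exercise level leads to "healthy"?
suggestion = inverse_classify({"exercise": None, "calories": 1950.0},
                              "healthy", training)
```

The suggested feature values are the "set of actions" that steer the record toward the desired class.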

Journal ArticleDOI
TL;DR: In this paper, a graph-based object movement cube is constructed by merging and collapsing nodes and edges according to an application-oriented topological structure, and an efficient cubing algorithm is proposed that performs simultaneous aggregation of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.
Abstract: Massive radio frequency identification (RFID) data sets are expected to become commonplace in supply chain management systems. Warehousing and mining this data is an essential problem with great potential benefits for inventory management, object tracking, and product procurement processes. Since RFID tags can be used to identify each individual item, enormous amounts of location-tracking data are generated. With such data, object movements can be modeled by movement graphs, where nodes correspond to locations and edges record the history of item transitions between locations. In this study, we develop a movement graph model as a compact representation of RFID data sets. Since spatiotemporal as well as item information can be associated with the objects in such a model, the movement graph can be huge, complex, and multidimensional in nature. We show that such a graph can be better organized around gateway nodes, which serve as bridges connecting different regions of the movement graph. A graph-based object movement cube can be constructed by merging and collapsing nodes and edges according to an application-oriented topological structure. Moreover, we propose an efficient cubing algorithm that performs simultaneous aggregation of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.
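Building the movement graph from raw location-tracking data can be sketched as follows (an invented minimal example of the nodes-as-locations, edges-as-transitions model; the paper's gateway-based cubing is not shown):

```python
from collections import defaultdict

def build_movement_graph(readings):
    """Aggregate per-item RFID location readings into a movement graph:
    nodes are locations, edge weights count item transitions."""
    by_item = defaultdict(list)
    for item, loc, t in readings:
        by_item[item].append((t, loc))
    edges = defaultdict(int)
    for trace in by_item.values():
        trace.sort()                       # order each item's trace by time
        for (_, a), (_, b) in zip(trace, trace[1:]):
            if a != b:
                edges[(a, b)] += 1
    return dict(edges)

readings = [
    ("item1", "factory", 1), ("item1", "warehouse", 2), ("item1", "store", 3),
    ("item2", "factory", 1), ("item2", "warehouse", 4),
]
graph = build_movement_graph(readings)
# {('factory', 'warehouse'): 2, ('warehouse', 'store'): 1}
```

Cubing then merges and collapses these nodes and edges along spatiotemporal and item dimensions, with high-traffic gateway nodes guiding the partitioning.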