Showing papers by "Jiawei Han published in 2011"


Journal ArticleDOI
TL;DR: In GNMF, an affinity graph is constructed to encode the geometrical information and a matrix factorization is sought, which respects the graph structure, and the empirical study shows encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-world problems.
Abstract: Matrix factorization techniques have been frequently applied in information retrieval, computer vision, and pattern recognition. Among them, Nonnegative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from the geometric perspective, the data is usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space. One then hopes to find a compact representation, which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Graph Regularized Nonnegative Matrix Factorization (GNMF), for this purpose. In GNMF, an affinity graph is constructed to encode the geometrical information, and we seek a matrix factorization which respects the graph structure. Our empirical study shows encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-world problems.
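A minimal numpy sketch of graph-regularized NMF in the spirit of GNMF, assuming the commonly published objective ||X - UV^T||_F^2 + lam * tr(V^T L V) with graph Laplacian L = D - W and the standard multiplicative updates; the function name, random initialization, and fixed iteration count are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def gnmf(X, k, W, lam=100.0, n_iter=200, eps=1e-10):
    """Graph Regularized NMF sketch: X (m x n, nonnegative) ~ U @ V.T.

    W is an n x n symmetric nonnegative affinity matrix over the data
    points (e.g., from a k-nearest-neighbor graph). Minimizes
    ||X - U V^T||_F^2 + lam * tr(V^T L V), with L = D - W, via the
    standard multiplicative updates.
    """
    m, n = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, k))
    V = rng.random((n, k))
    D = np.diag(W.sum(axis=1))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```

Setting lam = 0 recovers plain NMF; the graph terms pull the low-dimensional representations of points that are neighbors under W toward each other.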

1,870 citations


Journal ArticleDOI
01 Aug 2011
TL;DR: Under the meta path framework, a novel similarity measure called PathSim is defined that is able to find peer objects in the network (e.g., find authors in a similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random walk-based similarity measures.
Abstract: Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks. Different semantic meanings behind paths are not taken into consideration. Thus they cannot be directly applied to heterogeneous networks. In this paper, we study similarity search that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in a network, one could derive various similarity semantics. Therefore, we introduce the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level). No matter whether a user would like to explicitly specify a path combination given sufficient domain knowledge, or choose the best path by experimental trials, or simply provide training examples to learn it, meta path forms a common base for a network-based similarity search engine. In particular, under the meta path framework we define a novel similarity measure called PathSim that is able to find peer objects in the network (e.g., find authors in a similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random walk-based similarity measures. In order to support fast online query processing for PathSim queries, we develop an efficient solution that partially materializes short meta paths and then concatenates them online to compute top-k results. Experiments on real data sets demonstrate the effectiveness and efficiency of our proposed paradigm.
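A small sketch of the PathSim measure under its usual formulation, assuming the commuting matrix M of a symmetric meta path has already been computed (here from a toy author-paper incidence matrix for the author-paper-author path); all names and data are illustrative:

```python
import numpy as np

def pathsim(M):
    """PathSim from a commuting matrix M of a symmetric meta path.

    M[i, j] counts path instances between objects i and j, e.g.
    M = A @ A.T for the author-paper-author (APA) meta path with A the
    author-paper incidence matrix. Returns the similarity matrix
    s(i, j) = 2 * M[i, j] / (M[i, i] + M[j, j]).
    """
    diag = np.diag(M).astype(float)
    denom = diag[:, None] + diag[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, 2.0 * M / denom, 0.0)

# Toy example: 3 authors x 4 papers incidence matrix.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])
print(pathsim(A @ A.T))
```

The normalization by the self-path counts is what makes PathSim favor peers of comparable visibility rather than simply highly connected hubs.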

1,583 citations


Proceedings Article
01 Dec 2011
TL;DR: This work investigates the effectiveness of several unsupervised disaggregation methods on low frequency power measurements collected in real homes and indicates that a conditional factorial hidden semi-Markov model, which integrates additional features related to when and how appliances are used in the home and more accurately represents the power use of individual appliances, outperforms the other unsupervised disaggregation methods.
Abstract: Fear of increasing prices and concern about climate change are motivating residential power conservation efforts. We investigate the effectiveness of several unsupervised disaggregation methods on low frequency power measurements collected in real homes. Specifically, we consider variants of the factorial hidden Markov model. Our results indicate that a conditional factorial hidden semi-Markov model, which integrates additional features related to when and how appliances are used in the home and more accurately represents the power use of individual appliances, outperforms the other unsupervised disaggregation methods. Our results show that unsupervised techniques can provide per-appliance power usage information in a non-invasive manner, which is ideal for enabling power conservation efforts.

596 citations


Proceedings Article
14 Jul 2011
TL;DR: In this paper, a generalized Fisher score is proposed to jointly select features, maximizing the lower bound of the traditional Fisher score by solving a quadratically constrained linear programming (QCLP) problem.
Abstract: Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to its score under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of features that maximizes the lower bound of the traditional Fisher score. The resulting feature selection problem is a mixed integer program, which can be reformulated as a quadratically constrained linear program (QCLP). It is solved by a cutting plane algorithm, in each iteration of which a multiple kernel learning problem is solved alternately by multivariate ridge regression and projected gradient descent. Experiments on benchmark data sets indicate that the proposed method outperforms Fisher score as well as many other state-of-the-art feature selection methods.
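For context, a sketch of the classical per-feature Fisher score that this paper generalizes; it scores each feature independently, which is exactly the suboptimality the generalized version addresses. Names are illustrative:

```python
import numpy as np

def fisher_score(X, y):
    """Classical per-feature Fisher score (the independent baseline).

    X: (n_samples, n_features), y: integer class labels.
    score_j = sum_c n_c * (mu_cj - mu_j)^2 / sum_c n_c * var_cj
    """
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - mu) ** 2
        den += n_c * Xc.var(axis=0)
    return num / np.maximum(den, 1e-12)

# Independent selection: take the k highest-scoring features.
# top_k = np.argsort(-fisher_score(X, y))[:k]
```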

472 citations


Proceedings ArticleDOI
25 Jul 2011
TL;DR: Experiments are presented on a real bibliographic network, the DBLP network, which show that meta path-based heterogeneous topological features can generate more accurate prediction results as compared to homogeneous topological features.
Abstract: The problem of predicting links or interactions between objects in a network is an important task in network analysis. Along this line, link prediction between co-authors in a co-author network is a frequently studied problem. In most of these studies, authors are considered in a homogeneous network, i.e., only one type of objects (author type) and one type of links (co-authorship) exist in the network. However, in a real bibliographic network, there are multiple types of objects (e.g., venues, topics, papers) and multiple types of links among these objects. In this paper, we study the problem of co-author relationship prediction in the heterogeneous bibliographic network, and a new methodology called PathPredict, i.e., a meta path-based relationship prediction model, is proposed to solve this problem. First, meta path-based topological features are systematically extracted from the network. Then, a supervised model is used to learn the best weights associated with different topological features in deciding the co-author relationships. We present experiments on a real bibliographic network, the DBLP network, which show that meta path-based heterogeneous topological features can generate more accurate prediction results as compared to homogeneous topological features. In addition, the level of significance of each topological feature can be learned from the model, which is helpful in understanding the mechanism behind the relationship building.
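A toy sketch of one family of meta path-based topological features, path counts obtained by multiplying incidence matrices along a meta path; the matrices and meta paths below are hypothetical stand-ins for the DBLP network, and the supervised weighting step is not shown:

```python
import numpy as np

# Hypothetical incidence matrices for a toy bibliographic network:
# AP[a, p] = 1 if author a wrote paper p; PV[p, v] = 1 if paper p
# appeared in venue v; PT[p, t] = 1 if paper p mentions topic t.
AP = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
PV = np.array([[1, 0], [0, 1], [0, 1]])
PT = np.array([[1, 0], [1, 1], [0, 1]])

# The path-count feature for a meta path is the product of the
# incidence matrices along it:
APVPA = AP @ PV @ PV.T @ AP.T   # author-paper-venue-paper-author
APTPA = AP @ PT @ PT.T @ AP.T   # author-paper-topic-paper-author

def pair_features(i, j):
    """Feature vector for a candidate author pair: one count per meta path."""
    return np.array([APVPA[i, j], APTPA[i, j]])

print(pair_features(0, 1))  # features fed to a supervised link predictor
```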

456 citations


Journal ArticleDOI
TL;DR: This work proposes a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive.
Abstract: Most existing data stream classification techniques ignore one important aspect of stream data: the arrival of a novel class. We address this issue and propose a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive. The novel class detection problem becomes more challenging in the presence of concept-drift, when the underlying data distributions evolve in streams. In order to determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. A maximum allowable wait time Tc is imposed as a time constraint to classify a test instance. Furthermore, most existing stream classification approaches assume that the true label of a data point can be accessed immediately after the data point is classified. In reality, a time delay Tl is involved in obtaining the true label of a data point, since manual labeling is time consuming. We show how to make fast and correct classification decisions under these constraints and apply them to real benchmark data. Comparison with state-of-the-art stream classification techniques proves the superiority of our approach.

362 citations


Journal ArticleDOI
TL;DR: A new approach, called Locally Consistent Concept Factorization (LCCF), is proposed to extract document concepts that are consistent with the manifold geometry, such that each concept corresponds to a connected component.
Abstract: Previous studies have demonstrated that document clustering performance can be improved significantly in lower dimensional linear subspaces. Recently, matrix factorization-based techniques, such as Nonnegative Matrix Factorization (NMF) and Concept Factorization (CF), have yielded impressive results. However, both of them effectively see only the global Euclidean geometry, whereas the local manifold geometry is not fully considered. In this paper, we propose a new approach to extract the document concepts which are consistent with the manifold geometry, such that each concept corresponds to a connected component. Central to our approach is a graph model which captures the local geometry of the document submanifold. Thus, we call it Locally Consistent Concept Factorization (LCCF). By using the graph Laplacian to smooth the document-to-concept mapping, LCCF can extract concepts with respect to the intrinsic manifold structure, and thus documents associated with the same concept can be well clustered. The experimental results on TDT2 and Reuters-21578 have shown that the proposed approach provides a better representation and achieves better clustering results in terms of accuracy and mutual information.

335 citations


Proceedings ArticleDOI
28 Mar 2011
TL;DR: The results confirm the hypothesis that geographical distributions can help model topics, while topics provide important cues to group different geographical regions.
Abstract: This paper studies the problem of discovering and comparing geographical topics from GPS-associated documents. GPS-associated documents have become popular with the pervasiveness of location-acquisition technologies. For example, in Flickr, the geo-tagged photos are associated with tags and GPS locations. In Twitter, the locations of the tweets can be identified by the GPS locations from smart phones. Many interesting concepts, including cultures, scenes, and product sales, correspond to specialized geographical distributions. In this paper, we are interested in two questions: (1) how to discover different topics of interest that are coherent in geographical regions? (2) how to compare several topics across different geographical locations? To answer these questions, this paper proposes and compares three ways of modeling geographical topics: a location-driven model, a text-driven model, and a novel joint model called LGTA (Latent Geographical Topic Analysis) that combines location and text. To make a fair comparison, we collect several representative datasets from the Flickr website, including Landscape, Activity, Manhattan, National park, Festival, Car, and Food. The results show that the first two methods work in some datasets but fail in others. LGTA works well in all these datasets, not only finding regions of interest but also providing effective comparisons of the topics across different locations. The results confirm our hypothesis that geographical distributions can help model topics, while topics provide important cues to group different geographical regions.

325 citations



Proceedings ArticleDOI
16 Jul 2011
TL;DR: This paper reformulates the subspace learning problem and uses the L2,1-norm on the projection matrix to achieve row-sparsity, which leads to selecting relevant features and learning the transformation simultaneously.
Abstract: Dimensionality reduction is a very important topic in machine learning. It can be generally classified into two categories: feature selection and subspace learning. In the past decades, many methods have been proposed for dimensionality reduction. However, most of these works study feature selection and subspace learning independently. In this paper, we present a framework for joint feature selection and subspace learning. We reformulate the subspace learning problem and use the L2,1-norm on the projection matrix to achieve row-sparsity, which leads to selecting relevant features and learning the transformation simultaneously. We discuss two situations of the proposed framework and present their optimization algorithms. Experiments on benchmark face recognition data sets illustrate that the proposed framework outperforms state-of-the-art methods by a large margin.
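To illustrate why an L2,1-norm penalty on the projection matrix yields row-sparsity (and hence feature selection), here is a small sketch of the norm and its proximal operator, row-wise soft thresholding, the standard building block in proximal-gradient treatments of such penalties; this shows the mechanism only, not the paper's actual optimization algorithms:

```python
import numpy as np

def l21_norm(W):
    """L2,1 norm: sum of the Euclidean norms of the rows of W.
    Penalizing it drives entire rows (i.e., features) to zero."""
    return np.linalg.norm(W, axis=1).sum()

def prox_l21(W, lam):
    """Proximal operator of lam * ||W||_{2,1}: shrink each row's norm
    by lam and zero out rows whose norm falls below lam."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * W

W = np.array([[0.1, -0.2], [2.0, 1.0], [0.0, 0.05]])
print(prox_l21(W, lam=0.5))  # weak rows are zeroed -> feature selection
```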

211 citations


Journal ArticleDOI
TL;DR: This paper introduces a regularized probabilistic model for data clustering based on manifold structure, called the Laplacian regularized Gaussian Mixture Model (LapGMM), in which the data manifold is modeled by a nearest neighbor graph and the graph structure is incorporated into the maximum likelihood objective function.
Abstract: Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering. Each cluster is represented by a Gaussian distribution. The clustering process thus reduces to estimating the parameters of the Gaussian mixture, usually by the Expectation-Maximization algorithm. In this paper, we consider the case where the probability distribution that generates the data is supported on a submanifold of the ambient space. It is natural to assume that if two points are close in the intrinsic geometry of the probability distribution, then their conditional probability distributions are similar. Specifically, we introduce a regularized probabilistic model based on manifold structure for data clustering, called the Laplacian regularized Gaussian Mixture Model (LapGMM). The data manifold is modeled by a nearest neighbor graph, and the graph structure is incorporated in the maximum likelihood objective function. As a result, the obtained conditional probability distribution varies smoothly along the geodesics of the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.

Journal ArticleDOI
01 Feb 2011
TL;DR: Spectral Regression Kernel Discriminant Analysis is presented, which casts discriminant analysis into a regression framework, facilitating both efficient computation and the use of regularization techniques.
Abstract: Linear discriminant analysis (LDA) has been a popular method for dimensionality reduction, which preserves class separability. The projection vectors are commonly obtained by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. LDA can be performed either in the original input space or in the reproducing kernel Hilbert space (RKHS) into which data points are mapped, which leads to kernel discriminant analysis (KDA). When the data are highly nonlinearly distributed, KDA can achieve better performance than LDA. However, computing the projective functions in KDA involves eigen-decomposition of the kernel matrix, which is very expensive when a large number of training samples exist. In this paper, we present a new algorithm for kernel discriminant analysis, called Spectral Regression Kernel Discriminant Analysis (SRKDA). By using spectral graph analysis, SRKDA casts discriminant analysis into a regression framework, which facilitates both efficient computation and the use of regularization techniques. Specifically, SRKDA only needs to solve a set of regularized regression problems, and no eigenvector computation is involved, which is a huge saving in computational cost. The new formulation makes it very easy to develop an incremental version of the algorithm, which can fully utilize the computational results of the existing training samples. Moreover, it is easy to produce sparse projections (Sparse KDA) with an L1-norm regularizer. Extensive experiments on spoken letter, handwritten digit image, and face image data demonstrate the effectiveness and efficiency of the proposed algorithm.
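A rough sketch of the spectral-regression idea: replace the kernel eigen-decomposition with regularized regression solves. The response construction below (centered class indicators) is a simplification of the paper's spectral-graph step, and all names are illustrative:

```python
import numpy as np

def srkda_sketch(K, y, delta=0.01):
    """Spectral-regression-style KDA sketch.

    Instead of eigen-decomposing the n x n kernel matrix K, solve one
    regularized linear system per response vector derived from the
    class labels: (K + delta * I) alpha = Y.
    """
    n = K.shape[0]
    classes = np.unique(y)
    # Centered class-indicator responses, one per class but the last
    # (a simplification of the paper's response construction).
    Y = np.stack([(y == c).astype(float) for c in classes[:-1]], axis=1)
    Y -= Y.mean(axis=0)
    alpha = np.linalg.solve(K + delta * np.eye(n), Y)
    return alpha  # projective functions f(x) = sum_i alpha_i k(x, x_i)
```

The key point is that solving a regularized linear system is far cheaper than a full eigen-decomposition, which is where the claimed computational saving comes from.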

Proceedings ArticleDOI
21 Aug 2011
TL;DR: A novel ranking-based iterative classification framework that not only generates more accurate classes than state-of-the-art classification methods on networked data, but also provides a meaningful ranking of objects within each class, serving as a more informative view of the data than traditional classification.
Abstract: It has been recently recognized that heterogeneous information networks composed of multiple types of nodes and links are prevalent in the real world. Both classification and ranking of the nodes (or data objects) in such networks are essential for network analysis. However, so far these approaches have generally been performed separately. In this paper, we combine ranking and classification in order to perform more accurate analysis of a heterogeneous information network. Our intuition is that highly ranked objects within a class should play more important roles in classification. On the other hand, class membership information is important for determining a quality ranking over a dataset. We believe it is therefore beneficial to integrate classification and ranking in a simultaneous, mutually enhancing process, and to this end, propose a novel ranking-based iterative classification framework, called RankClass. Specifically, we build a graph-based ranking model to iteratively compute the ranking distribution of the objects within each class. At each iteration, according to the current ranking results, the graph structure used in the ranking algorithm is adjusted so that the sub-network corresponding to the specific class is emphasized, while the rest of the network is weakened. As our experiments show, integrating ranking with classification not only generates more accurate classes than state-of-the-art classification methods on networked data, but also provides meaningful ranking of objects within each class, serving as a more informative view of the data than traditional classification.

Proceedings ArticleDOI
12 Jun 2011
TL;DR: This paper addresses the problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers, and provides an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids.
Abstract: Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table's dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals' privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual, and then compute the remaining cuboids from the initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids, such that the maximal noise in all published cuboids will be within a factor (ln|L| + 1)^2 of the optimal, where |L| is the number of cuboids to be published, or the number of precise cuboids will be within a factor (1 - 1/e) of the optimal. We also show how to enforce consistency in the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the amounts of error of different publishing algorithms, and show that our approaches outperform baselines significantly.
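As background for the noise-injection step, a minimal sketch of the Laplace mechanism on a single cuboid of disjoint counts, where one individual changes each cell by at most 1, so noise of scale 1/epsilon suffices for epsilon-DP; the paper's actual contribution, selecting which cuboids to compute directly and deriving the rest, is not shown:

```python
import numpy as np

def publish_cuboid(fact_counts, epsilon):
    """Laplace mechanism sketch: release a cuboid (a table of disjoint
    counts) under epsilon-DP by adding Laplace(1/epsilon) noise to each
    cell, since one individual affects each count by at most 1."""
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=fact_counts.shape)
    return fact_counts + noise

# Toy cuboid: counts grouped by (gender, diagnosis).
cuboid = np.array([[12.0, 7.0], [9.0, 4.0]])
print(publish_cuboid(cuboid, epsilon=0.5))
```

Smaller epsilon means stronger privacy but larger noise, which is why choosing which cuboids to compute directly (and deriving the rest) matters for utility.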

Proceedings ArticleDOI
12 Jun 2011
TL;DR: Graph Cube is introduced, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks and is shown to be a powerful and efficient tool for decision support.
Abstract: We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online Analytical Processing) technology have proven to be effective tools for decision support on relational data. However, they are not well-equipped to handle the new yet important multidimensional networks. In this paper, we introduce Graph Cube, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks. By taking account of both attribute aggregation and structure summarization of the networks, Graph Cube goes beyond the traditional data cube model involved solely with numeric value based group-by's, thus resulting in a more insightful and structure-enriched aggregate network within every possible multidimensional space. Besides traditional cuboid queries, a new class of OLAP queries, crossboid, is introduced that is uniquely useful in multidimensional networks and has not been studied before. We implement Graph Cube by combining special characteristics of multidimensional networks with the existing well-studied data cube techniques. We perform extensive experimental studies on a series of real world data sets and Graph Cube is shown to be a powerful and efficient tool for decision support on large multidimensional networks.

Journal ArticleDOI
TL;DR: A moving object data mining system, MoveMine, is introduced, which integrates multiple data mining functions, including sophisticated pattern mining and trajectory analysis; it will help scientists and other users carry out versatile analysis tasks to analyze object movement regularities and anomalies.
Abstract: With the maturity and wide availability of GPS, wireless, telecommunication, and Web technologies, massive amounts of object movement data have been collected from various moving object targets, such as animals, mobile devices, vehicles, and climate radars. Analyzing such data has deep implications in many applications, such as ecological study, traffic control, mobile communication management, and climatological forecast. In this article, we focus our study on animal movement data analysis and examine advanced data mining methods for discovery of various animal movement patterns. In particular, we introduce a moving object data mining system, MoveMine, which integrates multiple data mining functions, including sophisticated pattern mining and trajectory analysis. In this system, two interesting moving object pattern mining functions are newly developed: (1) periodic behavior mining and (2) swarm pattern mining. For mining periodic behaviors, a reference location-based method is developed, which first detects the reference locations, discovers the periods in complex movements, and then finds periodic patterns by hierarchical clustering. For mining swarm patterns, an efficient method is developed to uncover flexible moving object clusters by relaxing the popularly enforced collective movement constraints. In the MoveMine system, a set of commonly used moving object mining functions are built and a user-friendly interface is provided to facilitate interactive exploration of moving object data mining and flexible tuning of the mining constraints and parameters. MoveMine has been tested on multiple kinds of real datasets, especially for MoveBank applications and other moving object data analysis. The system will benefit scientists and other users in carrying out versatile analysis tasks to analyze object movement regularities and anomalies. Moreover, it will benefit researchers to realize the importance and limitations of current techniques and promote future studies on moving object data mining. As expected, a mastery of animal movement patterns and trends will improve our understanding of the interactions between and the changes of the animal world and the ecosystem and therefore help ensure the sustainability of our ecosystem.

Proceedings ArticleDOI
21 Aug 2011
TL;DR: This paper proposes a novel topic model with biased propagation (TMBP) algorithm to directly incorporate heterogeneous information networks with topic modeling in a unified way, and extensively evaluates the proposed approach against state-of-the-art techniques on several datasets.
Abstract: With the development of Web applications, textual documents are not only getting richer, but also ubiquitously interconnected with users and other objects in various ways, which brings about text-rich heterogeneous information networks. Topic models have been proposed and shown to be useful for document analysis, and the interactions among multi-typed objects play a key role at disclosing the rich semantics of the network. However, most topic models consider only the textual information while ignoring the network structure, or can merely integrate with homogeneous networks; none of them handles heterogeneous information networks well. In this paper, we propose a novel topic model with biased propagation (TMBP) algorithm to directly incorporate heterogeneous information networks with topic modeling in a unified way. The underlying intuition is that multi-typed objects should be treated differently along with their inherent textual information and the rich semantics of the heterogeneous information network. A simple and unbiased topic propagation across such a heterogeneous network does not make much sense. Consequently, we investigate and develop two biased propagation frameworks, the biased random walk framework and the biased regularization framework, for the TMBP algorithm from different perspectives, which can discover latent topics and identify clusters of multi-typed objects simultaneously. We extensively evaluate the proposed approach and compare to the state-of-the-art techniques on several datasets. Experimental results demonstrate that the improvement in our proposed approach is consistent and promising.

Journal ArticleDOI
TL;DR: Four more specific models, including the time-sensitive model, distributed model, hierarchical and multi-dimensional model, and skewed data model, are introduced as extensions of the general data stream model.
Abstract: In many real-world applications, information such as web click data, stock ticker data, sensor network data, phone call records, and traffic monitoring data appear in the form of data streams. Online monitoring of data streams has emerged as an important research undertaking. Estimating the frequency of the items on these streams is an important aggregation and summary technique for both stream mining and data management systems with a broad range of applications. This paper reviews the state-of-the-art progress on methods of identifying frequent items from data streams. It describes different kinds of models for the frequent item mining task. For general models such as the cash register and turnstile models, we classify existing algorithms into sampling-based, counting-based, and hashing-based categories. The processing techniques and data synopsis structure of each algorithm are described and compared by evaluation measures. Accordingly, as extensions of the general data stream model, four more specific models, including the time-sensitive model, distributed model, hierarchical and multi-dimensional model, and skewed data model, are introduced. The characteristics and limitations of the algorithms of each model are presented, and open issues waiting for study and improvement are discussed.
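As a concrete instance of the counting-based category surveyed here, a sketch of the classic Misra-Gries summary: with at most k - 1 counters it retains every item occurring more than n/k times in a stream of length n (false positives are possible, false negatives are not):

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items summary with at most k - 1 counters.

    Every item with true frequency greater than n/k is guaranteed to
    survive in the summary; a second pass (or stored exact counts) is
    needed to verify candidates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries("abcaabbac", k=3))  # candidate heavy hitters
```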

Book
01 Jan 2011
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia, and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of your data.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This paper investigates how to rank the trajectory patterns mined from the uploaded photos with geotags and timestamps to reveal the collective wisdom recorded in the seemingly isolated photos and the individual travel sequences reflected by the geo-tagged photos.
Abstract: Social media, such as the content residing in popular photo-sharing websites, have attracted increasing attention in recent years. As a type of user-generated data, the wisdom of the crowd is embedded inside such social media. In particular, millions of users upload their photos to Flickr, many associated with temporal and geographical information. In this paper, we investigate how to rank the trajectory patterns mined from the uploaded photos with geotags and timestamps. The main objective is to reveal the collective wisdom recorded in the seemingly isolated photos and the individual travel sequences reflected by the geo-tagged photos. Instead of focusing on mining frequent trajectory patterns from geo-tagged social media, we put more effort into ranking the mined trajectory patterns and diversifying the ranking results. Through leveraging the relationships among users, locations, and trajectories, we rank the trajectory patterns. We then use an exemplar-based algorithm to diversify the results in order to discover the representative trajectory patterns. We have evaluated the proposed framework on 12 different cities using a Flickr dataset and demonstrated its effectiveness.

Proceedings ArticleDOI
25 Jul 2011
TL;DR: Empirical studies on both synthetic datasets and a real-life dataset demonstrate the power of merging GPS data and social graph structure, and suggest the method outperforms other methods for friend recommendation in GPS-based cyber-physical social networks.
Abstract: The popularization of GPS-enabled mobile devices gives social network researchers an early taste of the cyber-physical social network. Traditional link prediction methods are designed to find friends relying solely on social network information. With location and trajectory data available, we can generate more accurate and geographically related results, and help web-based social service users find more friends in the real world. Aiming to recommend geographically related friends in a social network, a three-step statistical recommendation approach is proposed for GPS-enabled cyber-physical social networks. By combining GPS information and social network structures, we build a pattern-based heterogeneous information network. Links inside this network reflect both people's geographical information and their social relationships. Our approach estimates link relevance and finds promising geo-friends by employing a random walk process on the heterogeneous information network. Empirical studies on both synthetic datasets and a real-life dataset demonstrate the power of merging GPS data and social graph structure, and suggest our method outperforms other methods for friend recommendation in GPS-based cyber-physical social networks.
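A sketch of the random-walk relevance primitive such an approach builds on, here a plain random walk with restart on a single adjacency matrix; the paper's pattern-based heterogeneous network construction is not reproduced, and all names and data are illustrative:

```python
import numpy as np

def random_walk_with_restart(A, seed, restart=0.15, n_iter=100):
    """Score all nodes by relevance to `seed` via random walk with
    restart: r <- (1 - c) * P @ r + c * e, with P column-stochastic."""
    P = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    e = np.zeros(A.shape[0])
    e[seed] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        r = (1.0 - restart) * (P @ r) + restart * e
    return r  # higher score = more promising candidate friend

# Toy 4-node network: score nodes by relevance to node 0.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(random_walk_with_restart(A, seed=0))
```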

Journal ArticleDOI
TL;DR: A framework for frequent pattern-based classification of trajectories on road networks is presented, contending that (frequent) sequential patterns are good feature candidates since they preserve the order of visited locations.
Abstract: Classification has been used for modeling many kinds of data sets, including sets of items, text documents, graphs, and networks. However, there is a lack of study on a new kind of data: trajectories on road networks. Modeling such data is useful with the emerging GPS and RFID technologies and is important for effective transportation and traffic planning. In this work, we study methods for classifying trajectories on road networks. By analyzing the behavior of trajectories on road networks, we observe that, in addition to the locations where vehicles have visited, the order of these visited locations is crucial for improving classification accuracy. Based on our analysis, we contend that (frequent) sequential patterns are good feature candidates since they preserve this order information. Furthermore, when mining sequential patterns, we propose to confine the length of sequential patterns to ensure high efficiency. Compared with closed sequential patterns, these partial (i.e., length-confined) sequential patterns allow us to significantly improve efficiency almost without losing accuracy. In this paper, we present a framework for frequent pattern-based classification for trajectories on road networks. Our comparative study over a broad range of classification approaches demonstrates that our method significantly improves accuracy over other methods on both synthetic and real trajectory data.

Journal ArticleDOI
01 Aug 2011
TL;DR: This work proposes SocialSpamGuard, a scalable and online social media spam detection system based on data mining for social network security; it employs the GAD clustering algorithm for large-scale clustering and integrates it with a designed active learning algorithm to deal with the scalability and real-time detection challenges.
Abstract: We have entered the era of social media networks represented by Facebook, Twitter, YouTube and Flickr. Internet users now spend more time on social networks than on search engines. Business entities or public figures set up social networking pages to enhance direct interactions with online users. Social media systems heavily depend on users for content contribution and sharing. Information is spread across social networks quickly and effectively. However, at the same time social media networks become susceptible to different types of unwanted and malicious spammer or hacker actions. There is a crucial need in society and industry for security solutions in social media. In this demo, we propose SocialSpamGuard, a scalable and online social media spam detection system based on data mining for social network security. We employ our GAD clustering algorithm for large-scale clustering and integrate it with the designed active learning algorithm to deal with the scalability and real-time detection challenges.

Proceedings ArticleDOI
24 Oct 2011
TL;DR: A matrix-variate Normal prior distribution on the weight vectors of the classifier is introduced to model the label correlation and perform feature selection simultaneously, finding a subset of features based on which the label correlation regularized loss of label ranking is minimized.
Abstract: Multi-label learning studies the problem where each instance is associated with a set of labels. There are two challenges in multi-label learning: (1) the labels are interdependent and correlated, and (2) the data are of high dimensionality. In this paper, we aim to tackle these challenges in one shot. In particular, we propose to learn the label correlation and do feature selection simultaneously. We introduce a matrix-variate Normal prior distribution on the weight vectors of the classifier to model the label correlation. Our goal is to find a subset of features, based on which the label correlation regularized loss of label ranking is minimized. The resulting multi-label feature selection problem is a mixed integer program, which is reformulated as a quadratically constrained linear program (QCLP). It can be solved by a cutting plane algorithm, in each iteration of which a minimax optimization problem is solved alternately by dual coordinate descent and projected sub-gradient descent. Experiments on benchmark data sets illustrate that the proposed methods outperform the single-label feature selection method and many other state-of-the-art multi-label learning methods.

Journal ArticleDOI
01 Aug 2011
TL;DR: SpiderMine is proposed, a novel algorithm to efficiently mine the top-K largest frequent patterns from a single massive network with any user-specified probability 1 - ε.
Abstract: With the ever-growing popularity of social networks, web and bio-networks, mining large frequent patterns from a single huge network has become increasingly important. Yet the existing pattern mining methods cannot offer the efficiency desirable for large pattern discovery. We propose SpiderMine, a novel algorithm to efficiently mine the top-K largest frequent patterns from a single massive network with any user-specified probability 1 - ε. Deviating from the existing edge-by-edge (i.e., incremental) pattern-growth framework, SpiderMine achieves its efficiency by unleashing the power of small patterns of a bounded diameter, which we call "spiders". With the spider structure, our approach adopts a probabilistic mining framework to find the top-K largest patterns by (i) identifying an affordable set of promising growth paths toward large patterns, (ii) generating large patterns with much lower combinatorial complexity, and finally (iii) greatly reducing the cost of graph isomorphism tests with a new graph pattern representation by a multi-set of spiders. Extensive experimental studies on both synthetic and real data sets show that our algorithm outperforms existing methods.

Proceedings ArticleDOI
24 Jul 2011
TL;DR: A probabilistic model in the Conditional Random Fields framework is proposed to predict the replying structure for a threaded online discussion to incorporate the features describing the structural dependency among the discussion content and learn their relationship.
Abstract: Online forum discussions are emerging as a valuable information repository, where knowledge is accumulated through the interaction among users, leading to multiple threads with structure. The replying structure in each thread conveys important information about the discussion content. Unfortunately, not all online forum sites explicitly record such replying relationships, making it hard for both users and computers to digest the information buried in a thread discussion. In this paper, we propose a probabilistic model in the Conditional Random Fields framework to predict the replying structure of a threaded online discussion. Different from previous thread reconstruction methods, most of which fail to consider dependency between the posts, we cast the problem as a supervised structure learning problem to incorporate the features describing the structural dependency among the discussion content and learn their relationship. Experiment results on three different online forums show that the proposed method can well capture the replying structures in online discussion threads, and multiple tasks such as forum search and question answering can benefit from the reconstructed replying structures.

Book ChapterDOI
05 Sep 2011
TL;DR: This paper aims at finding a subset of features, based on which the learnt linear transformation via LDA maximizes the Fisher criterion, and proposes to integrate Fisher score and LDA in a unified framework, namely Linear Discriminant Dimensionality Reduction (LDDR).
Abstract: The Fisher criterion has achieved great success in dimensionality reduction. Two representative methods based on the Fisher criterion are Fisher score and Linear Discriminant Analysis (LDA). The former is developed for feature selection while the latter is designed for subspace learning. In the past decade, these two approaches have often been studied independently. In this paper, based on the observation that Fisher score and LDA are complementary, we propose to integrate Fisher score and LDA in a unified framework, namely Linear Discriminant Dimensionality Reduction (LDDR). We aim at finding a subset of features, based on which the learnt linear transformation via LDA maximizes the Fisher criterion. LDDR inherits the advantages of Fisher score and LDA and is able to do feature selection and subspace learning simultaneously. Both Fisher score and LDA can be seen as special cases of the proposed method. The resultant optimization problem is a mixed integer program, which is difficult to solve. It is relaxed into an L2,1-norm constrained least squares problem and solved by an accelerated proximal gradient descent algorithm. Experiments on benchmark face recognition data sets illustrate that the proposed method outperforms the state-of-the-art methods.

Proceedings ArticleDOI
11 Dec 2011
TL;DR: A novel and principled probabilistic model is proposed which casts this task as a joint inference problem, considering textual documents, social influences, and topic evolution in a unified way, and performs significantly better than existing methods.
Abstract: The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel contents are introduced by documents which adopt the topic. Unlike explicit user behavior (e.g., buying a DVD), both the diffusion paths and the evolutionary process of a topic are implicit, making their discovery challenging. In this paper, we track the evolution of an arbitrary topic and reveal the latent diffusion paths of that topic in a social community. A novel and principled probabilistic model is proposed which casts our task as a joint inference problem, which considers textual documents, social influences, and topic evolution in a unified way. Specifically, a mixture model is introduced to model the generation of text according to the diffusion and the evolution of the topic, while the whole diffusion process is regularized with user-level social influences through a Gaussian Markov Random Field. Experiments on both synthetic data and real world data show that the discovery of topic diffusion and evolution benefits from this joint inference, and the probabilistic model we propose performs significantly better than existing methods.

Proceedings ArticleDOI
11 Dec 2011
TL;DR: This paper proposes a more realistic novel class detection technique, which remembers a class and identifies it as "not novel" when it reappears after a long disappearance, and has shown significant reduction in classification error over state-of-the-art stream classification techniques on several benchmark data streams.
Abstract: Concept-evolution is one of the major challenges in data stream classification, which occurs when a new class evolves in the stream. This problem remains unaddressed by most state-of-the-art techniques. A recurring class is a special case of concept-evolution. This special case takes place when a class appears in the stream, then disappears for a long time, and appears again. Existing data stream classification techniques that address the concept-evolution problem wrongly detect recurring classes as novel classes. This creates two main problems. First, many resources are wasted in detecting a recurring class as a novel class, because novel class detection is much more computationally and memory intensive than simply recognizing an existing class. Second, when a novel class is identified, human experts are involved in collecting and labeling the instances of that class for future modeling. If a recurring class is reported as a novel class, human effort is wasted in finding out whether it is really a novel class. In this paper, we address the recurring class issue and propose a more realistic novel class detection technique, which remembers a class and identifies it as "not novel" when it reappears after a long disappearance. Our approach has shown significant reduction in classification error over state-of-the-art stream classification techniques on several benchmark data streams.

Journal ArticleDOI
18 Oct 2011
TL;DR: This article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers.
Abstract: Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
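A rough sketch of the multipartition, multichunk idea: pool the r most recent chunks, v-fold partition the pooled data, and train one classifier per fold complement so the v classifiers vote. The scikit-learn base learner, majority-vote combination, and absence of ensemble pruning are assumptions for illustration, not the article's exact configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def train_multipartition_ensemble(chunks, r=2, v=3):
    """Train v classifiers from the r most recent data chunks via
    v-fold partitioning: each classifier sees (v - 1)/v of the pooled
    data (one fold complement), yielding a diverse ensemble.

    chunks: list of (X, y) arrays arriving over the stream.
    """
    X = np.vstack([cx for cx, _ in chunks[-r:]])
    y = np.concatenate([cy for _, cy in chunks[-r:]])
    ensemble = []
    for train_idx, _ in KFold(n_splits=v, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        ensemble.append(clf)
    return ensemble

def predict_majority(ensemble, X):
    """Combine the v classifiers by simple majority vote."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

Retraining on only the most recent chunks is what lets such an ensemble track concept-drift as older classifiers are replaced.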