
Showing papers by "Jiawei Han published in 2010"


Proceedings Article
23 Aug 2010
TL;DR: A novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions, which show better agreement with human summaries than a baseline extractive method.
Abstract: We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.

500 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: The experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method in addressing graph queries on large networks.
Abstract: The dramatic proliferation of sophisticated networks has resulted in a growing need for supporting effective querying and mining methods over such large-scale graph-structured data. At the core of many advanced network operations lies a common and critical graph query primitive: how to search graph structures efficiently within a large network? Unfortunately, the graph query is hard due to the NP-complete nature of subgraph isomorphism. It becomes even more challenging when the network examined is large and diverse. In this paper, we present a high-performance graph indexing mechanism, SPath, to address the graph query problem on large networks. SPath leverages decomposed shortest paths around vertex neighborhoods as basic indexing units, which prove to be both effective in graph search space pruning and highly scalable in index construction and deployment. Via SPath, a graph query is processed and optimized beyond the traditional vertex-at-a-time fashion to a more efficient path-at-a-time way: the query is first decomposed into a set of shortest paths, among which a subset of candidates with good selectivity is picked by a query plan optimizer; candidate paths are then joined together to recover the query graph and finalize the graph query processing. We evaluate SPath against the state-of-the-art GraphQL on both real and synthetic data sets. Our experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method for addressing graph queries on large networks.

359 citations
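The pruning side of the path-at-a-time idea can be sketched in miniature. The code below is a hypothetical simplification, not the SPath implementation: it summarizes each vertex neighborhood by the labels reachable at each BFS distance, and prunes data vertices whose summary cannot cover a query vertex's summary.

```python
from collections import deque, Counter

def neighborhood_signature(adj, labels, v, k):
    """Labels reachable at each BFS distance 1..k from v (a toy stand-in
    for SPath's decomposed-shortest-path indexing units)."""
    sig = {d: Counter() for d in range(1, k + 1)}
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == k:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                sig[d + 1][labels[w]] += 1
                frontier.append((w, d + 1))
    return sig

def candidates(data_adj, data_labels, query_adj, query_labels, qv, k=2):
    """Data vertices that could match query vertex qv: same label, and the
    data signature must cover the query signature at every distance."""
    qsig = neighborhood_signature(query_adj, query_labels, qv, k)
    out = []
    for v in data_adj:
        if data_labels[v] != query_labels[qv]:
            continue
        vsig = neighborhood_signature(data_adj, data_labels, v, k)
        if all(vsig[d][l] >= c for d in qsig for l, c in qsig[d].items()):
            out.append(v)
    return out
```

On a toy labeled path graph, a query edge a-b keeps only the data vertex labeled a that actually has a b-labeled neighbor.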


Journal ArticleDOI
01 Sep 2010
TL;DR: In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly.
Abstract: Recent improvements in positioning technology make massive moving object data widely available. One important analysis is to find the moving objects that travel together. Existing methods impose a strong constraint when defining a moving object cluster: they require the moving objects to stick together for consecutive timestamps. Our key observation is that the moving objects in a cluster may actually diverge temporarily and congregate at certain timestamps. Motivated by this, we propose the concept of swarm, which captures moving objects that move within arbitrarily shaped clusters for certain timestamps that are possibly non-consecutive. The goal of our paper is to find all discriminative swarms, namely closed swarms. While the search space for closed swarms is prohibitively huge, we design a method, ObjectGrowth, to efficiently retrieve the answer. In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly. Empirical studies on real data as well as large synthetic data demonstrate the effectiveness and efficiency of our methods.

344 citations
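For intuition, the closed-swarm definition can be checked by brute force on toy data. This sketch (hypothetical names) enumerates every object subset, which is exactly the exponential search that ObjectGrowth's pruning strategies avoid:

```python
from itertools import combinations

def closed_swarms(cluster_of, objects, timestamps, min_o=2, min_t=2):
    """Brute-force closed-swarm miner. cluster_of[t][o] is o's cluster id
    at time t. A swarm (O, T) requires all of O co-clustered at every t in
    T, with |O| >= min_o and |T| >= min_t (T need not be consecutive);
    closed means no strict superset of O or T still qualifies."""
    found = []
    for r in range(min_o, len(objects) + 1):
        for O in combinations(objects, r):
            # maximal timestamp set for this object set
            T = tuple(t for t in timestamps
                      if len({cluster_of[t][o] for o in O}) == 1)
            if len(T) >= min_t:
                found.append((frozenset(O), frozenset(T)))
    # keep only closed swarms
    return [(O, T) for O, T in found
            if not any((O < O2 and T <= T2) or (O <= O2 and T < T2)
                       for O2, T2 in found if (O, T) != (O2, T2))]
```

With objects a, b always together and c joining them only at times 2 and 3, the closed swarms are ({a,b}, {1,2,3}) and ({a,b,c}, {2,3}); the sub-swarms ({a,c}, {2,3}) and ({b,c}, {2,3}) are pruned as non-closed.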


Proceedings ArticleDOI
25 Jul 2010
TL;DR: A two-stage algorithm, Periodica, is proposed to mine periodic behaviors for moving objects, under the assumption that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations.
Abstract: Periodicity is a frequently happening phenomenon for moving objects. Finding periodic behaviors is essential to understanding object movements. However, periodic behaviors could be complicated, involving multiple interleaving periods, partial time span, and spatiotemporal noises and outliers. In this paper, we address the problem of mining periodic behaviors for moving objects. It involves two sub-problems: how to detect the periods in complex movement, and how to mine periodic movement behaviors. Our main assumption is that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations. Based on this assumption, we propose a two-stage algorithm, Periodica, to solve the problem. At the first stage, the notion of observation spot is proposed to capture the reference locations. Through observation spots, multiple periods in the movement can be retrieved using a method that combines Fourier transform and autocorrelation. At the second stage, a probabilistic model is proposed to characterize the periodic behaviors. For a specific period, periodic behaviors are statistically generalized from partial movement sequences through hierarchical clustering. Empirical studies on both synthetic and real data sets demonstrate the effectiveness of our method.

308 citations
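The first stage's period detection, combining the Fourier transform with autocorrelation, can be sketched as follows. This is a simplified single-spot version with hypothetical names: the periodogram proposes candidate periods and the autocorrelation validates them at exact integer lags, as the paper describes.

```python
import numpy as np

def detect_period(visits, top_k=1):
    """Period detection on one observation spot's in/out sequence:
    `visits` is 0/1 per timestamp (was the object at the spot?)."""
    x = np.asarray(visits, dtype=float) - np.mean(visits)
    n = len(x)
    # periodogram: candidate periods from the dominant frequencies
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(n)
    order = np.argsort(power[1:])[::-1] + 1      # skip the DC component
    candidates = [round(1.0 / freqs[i]) for i in order[:5] if freqs[i] > 0]
    # autocorrelation: rank candidates by their exact-lag correlation
    acf = np.correlate(x, x, mode='full')[n - 1:]
    acf = acf / acf[0]
    best = sorted(set(c for c in candidates if 1 < c < n),
                  key=lambda p: -acf[p])
    return best[:top_k]
```

On a synthetic weekly visit pattern the true period 7 dominates both the spectrum and the autocorrelation.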


Proceedings ArticleDOI
26 Oct 2010
TL;DR: A generative graphical model is proposed which utilizes the heterogeneous link information and the textual content associated with each node in the network to mine topic-level direct influence and a topic- level influence propagation and aggregation algorithm is proposed to derive the indirect influence between nodes.
Abstract: Influence is a complex and subtle force that governs the dynamics of social networks as well as the behaviors of involved users. Understanding influence can benefit various applications such as viral marketing, recommendation, and information retrieval. However, most existing works on social influence analysis have focused on verifying the existence of social influence. Few works systematically investigate how to mine the strength of direct and indirect influence between nodes in heterogeneous networks. To address the problem, we propose a generative graphical model which utilizes the heterogeneous link information and the textual content associated with each node in the network to mine topic-level direct influence. Based on the learned direct influence, a topic-level influence propagation and aggregation algorithm is proposed to derive the indirect influence between nodes. We further study how the discovered topic-level influence can help the prediction of user behaviors. We validate the approach on three different genres of data sets: Twitter, Digg, and citation networks. Qualitatively, our approach can discover interesting influence patterns in heterogeneous networks. Quantitatively, the learned topic-level influence can greatly improve the accuracy of user behavior prediction.

278 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper proposes an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers, and applies the model on both synthetic data and DBLP data sets to demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.
Abstract: Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars in a more positive sense), and then shows that well-known baseline approaches that ignore links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both synthetic data and DBLP data sets, and the results demonstrate the importance of this concept, as well as the effectiveness and efficiency of the proposed approach.

260 citations


Book ChapterDOI
20 Sep 2010
TL;DR: This paper considers the transductive classification problem on heterogeneous networked data which share a common topic and proposes a novel graph-based regularization framework, GNetMine, to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types.
Abstract: A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for some objects. Learning from such labeled and unlabeled data via transductive classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only some objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetMine, is proposed to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods.

247 citations


Journal ArticleDOI
09 Nov 2010
TL;DR: Different techniques employed to study various aspects of tagging are summarized, including properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags and problems associated with tagging usage.
Abstract: Social tagging on online portals has become a trend. It has emerged as one of the best ways of associating metadata with web objects. With the increase in the kinds of web objects becoming available, collaborative tagging of such objects is also developing along new dimensions. This popularity has led to a vast literature on social tagging. In this survey paper, we summarize the different techniques employed to study various aspects of tagging. Broadly, we discuss the properties of tag streams, tagging models, tag semantics, generating recommendations using tags, visualizations of tags, applications of tags, and problems associated with tagging usage. We cover topics such as why people tag, what influences the choice of tags, how to model the tagging process, kinds of tags, different power laws observed in the tagging domain, how tags are created, and how to choose the right tags for recommendation. We conclude with thoughts on future work in the area.

224 citations


Proceedings ArticleDOI
25 Jul 2010
TL;DR: A time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function, is proposed, and an efficient learning algorithm is designed to optimize the objective function.
Abstract: Information networks contain abundant knowledge about relationships among people or entities. Unfortunately, such knowledge is often hidden in a network where different kinds of relationships are not explicitly categorized. For example, in a research publication network, the advisor-advisee relationships among researchers are hidden in the coauthor network. Discovery of those relationships can benefit many interesting applications such as expert finding and research community analysis. In this paper, we take a computer science bibliographic network as an example, to analyze the roles of authors and to discover the likely advisor-advisee relationships. In particular, we propose a time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function. We further design an efficient learning algorithm to optimize the objective function. Based on that, our model suggests and ranks probable advisors for every author. Experimental results show that the proposed approach infers advisor-advisee relationships efficiently and achieves state-of-the-art accuracy (80-90%). We also apply the discovered advisor-advisee relationships to bole search, a specific expert-finding task, and an empirical study shows that the search performance can be effectively improved (+4.09% by NDCG@5).

212 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: A family of novel approximate SimRank computation algorithms for static and dynamic information networks are developed and their corresponding theoretical justification and analysis are given.
Abstract: Information networks are ubiquitous in many applications and analysis on such networks has attracted significant attention in the academic communities. One of the most important aspects of information network analysis is to measure similarity between nodes in a network. SimRank is a simple and influential measure of this kind, based on a solid theoretical "random surfer" model. Existing work computes SimRank similarity scores in an iterative mode. We argue that the iterative method can be infeasible and inefficient when, as in many real-world scenarios, the networks change dynamically and frequently. We envision a non-iterative method to bridge the gap. It allows users not only to update the similarity scores incrementally, but also to derive similarity scores for an arbitrary subset of nodes. To enable the non-iterative computation, we propose to rewrite the SimRank equation into a non-iterative form by using the Kronecker product and vectorization operators. Based on this, we develop a family of novel approximate SimRank computation algorithms for static and dynamic information networks, and give their corresponding theoretical justification and analysis. The non-iterative method supports efficient processing of various node analysis tasks, including similarity tracking and centrality tracking, on evolving information networks. The effectiveness and efficiency of our proposed methods are evaluated on synthetic and real data sets.

171 citations
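The Kronecker-product rewriting can be illustrated on the simplified linear recursion S = c·WᵀSW + (1−c)·I, where W is the column-normalized adjacency matrix (the exact SimRank recursion treats the diagonal slightly differently). Vectorizing gives vec(S) = (1−c)·(I − c·Wᵀ⊗Wᵀ)⁻¹ vec(I), solvable directly, though the system is n²×n² and so only feasible for small graphs:

```python
import numpy as np

def simrank_noniterative(A, c=0.6):
    """Closed-form solution of the linearized SimRank recursion via
    Kronecker product and vectorization (a sketch of the paper's idea,
    not its scalable approximate algorithms)."""
    n = A.shape[0]
    cols = A.sum(axis=0, keepdims=True)
    W = np.divide(A, cols, out=np.zeros_like(A, dtype=float),
                  where=cols != 0)            # column-normalized adjacency
    K = np.kron(W.T, W.T)                     # (W^T ⊗ W^T), n^2 x n^2
    vec_s = np.linalg.solve(np.eye(n * n) - c * K,
                            (1 - c) * np.eye(n).flatten())
    return vec_s.reshape(n, n)
```

On a three-node graph where nodes 0 and 1 both link to node 2, the closed form reproduces the expected scores: s(0,0) = 1−c = 0.4 and s(2,2) = 0.4 + c·0.25·(s00+s11) = 0.52.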


Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper formally defines the problem of popular event tracking in online communities (PET) and proposes a novel statistical method that models the popularity of events over time, taking into consideration the burstiness of user interest, information diffusion on the network structure, and the evolution of textual topics.
Abstract: User generated information in online communities has been characterized by the mixture of a text stream and a network structure, both changing over time. A good example is a web-blogging community with daily blog posts and a social network of bloggers. An important task in analyzing an online community is to observe and track the popular events, or topics, that evolve over time in the community. Existing approaches usually focus on either the burstiness of topics or the evolution of networks, while ignoring the interplay between textual topics and network structures. In this paper, we formally define the problem of popular event tracking in online communities (PET), focusing on the interplay between texts and networks. We propose a novel statistical method that models the popularity of events over time, taking into consideration the burstiness of user interest, information diffusion on the network structure, and the evolution of textual topics. Specifically, a Gibbs Random Field is defined to model the influence of historic status and the dependency relationships in the graph; thereafter a topic model generates the words in the text content of the event, regularized by the Gibbs Random Field. We prove that two classic models in information diffusion and text burstiness are special cases of our model under certain situations. Empirical experiments with two different communities and datasets (i.e., Twitter and DBLP) show that our approach is effective and outperforms existing approaches.

Book ChapterDOI
01 Apr 2010
TL;DR: An incremental clustering framework for trajectories is proposed, which contains two parts, online micro-cluster maintenance and offline macro-cluster creation, where each trajectory is simplified into a set of directed line segments in order to find clusters of trajectory subparts.
Abstract: Trajectory clustering has played a crucial role in data analysis since it reveals underlying trends of moving objects. Due to their sequential nature, trajectory data are often received incrementally, e.g., continuous new points reported by a GPS system. However, since existing trajectory clustering algorithms are developed for static datasets, they are not suitable for incremental clustering, which has the following two requirements. First, clustering should be processed efficiently since it can be requested frequently. Second, huge amounts of trajectory data must be accommodated, as they will accumulate constantly. An incremental clustering framework for trajectories is proposed in this paper. It contains two parts: online micro-cluster maintenance and offline macro-cluster creation. For the online part, when a new bunch of trajectories arrives, each trajectory is simplified into a set of directed line segments in order to find clusters of trajectory subparts. Micro-clusters are used to store compact summaries of similar trajectory line segments, which take much less space than raw trajectories. When new data are added, micro-clusters are updated incrementally to reflect the changes. For the offline part, when a user requests to see the current clustering result, macro-clustering is performed on the set of micro-clusters rather than on all trajectories over the whole time span. Since the number of micro-clusters is smaller than that of the original trajectories, macro-clusters are generated efficiently to show the clustering result of trajectories. Experimental results on both synthetic and real data sets show that our framework achieves high efficiency as well as high clustering quality.
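The online micro-cluster summary can be sketched as a small additive feature vector over line segments. This is a hypothetical minimal version: the actual framework stores richer statistics (e.g., segment counts per direction and angular information) so that macro-clustering can run on summaries alone.

```python
class MicroCluster:
    """Compact summary of similar trajectory line segments: storing linear
    sums lets the center be maintained incrementally, without keeping the
    raw segments around."""
    def __init__(self):
        self.n = 0
        self.ls = (0.0, 0.0, 0.0, 0.0)   # sums of (x1, y1, x2, y2)

    def add(self, seg):
        x1, y1, x2, y2 = seg
        self.n += 1
        self.ls = tuple(a + b for a, b in zip(self.ls, (x1, y1, x2, y2)))

    def center(self):
        """Representative segment: the component-wise mean."""
        return tuple(s / self.n for s in self.ls)
```

Adding a segment is O(1), which is what makes the online part cheap enough to run as trajectories stream in.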

Proceedings ArticleDOI
09 Aug 2010
TL;DR: This approach estimates link relevance by using a random walk algorithm on an augmented social graph with both attribute and structure information, and outperforms state-of-the-art methods for link recommendation.
Abstract: The phenomenal success of social networking sites, such as Facebook, Twitter and LinkedIn, has revolutionized the way people communicate. This paradigm has attracted the attention of researchers who wish to study the corresponding social and technological problems. Link recommendation is a critical task that not only helps increase the linkage inside the network but also improves the user experience. In an effective link recommendation algorithm it is essential to identify the factors that influence link creation. This paper enumerates several of these intuitive criteria and proposes an approach which satisfies these factors. This approach estimates link relevance by using a random walk algorithm on an augmented social graph with both attribute and structure information. The global and local influences of the attributes are leveraged in the framework as well. Beyond link recommendation, our framework can also rank the attributes in the network. Experiments on DBLP and IMDB data sets demonstrate that our method outperforms state-of-the-art methods for link recommendation.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper takes a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia, revealing the wisdom that is embedded in social multimedia sites for social science applications such as politics, economics, and marketing.
Abstract: Social multimedia hosting and sharing websites, such as Flickr, Facebook, Youtube, Picasa, ImageShack and Photobucket, are increasingly popular around the globe. A major trend in the current studies on social multimedia is using the social media sites as a source of a huge amount of labeled data for solving large-scale computer science problems in computer vision, data mining and multimedia. In this paper, we take a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia. In a sense, each time an image or video is uploaded or viewed, it constitutes an implicit vote for (or against) the subject of the image. This vote carries along with it a rich set of associated data including time and (often) location information. By aggregating such votes across millions of Internet users, we reveal the wisdom that is embedded in social multimedia sites for social science applications such as politics, economics, and marketing. We believe that our work opens a brand new arena for the multimedia research community with a potentially big impact on society and social sciences.

Journal ArticleDOI
TL;DR: This study re-examine a set of null-invariant interestingness measures and finds that they can be expressed as the generalized mathematical mean, leading to a total ordering of them and proposes a new measure called Imbalance Ratio to gauge the degree of skewness of a data set.
Abstract: Numerous interestingness measures have been proposed in statistics and data mining to assess object relationships. This is especially important in recent studies of association or correlation pattern mining. However, it is still not clear whether there is any intrinsic relationship among many proposed measures, and which one is truly effective at gauging object relationships in large data sets. Recent studies have identified a critical property, null-(transaction) invariance, for measuring associations among events in large data sets, but many measures do not have this property. In this study, we re-examine a set of null-invariant interestingness measures and find that they can be expressed as the generalized mathematical mean, leading to a total ordering of them. Such a unified framework provides insights into the underlying philosophy of the measures and helps us understand and select the proper measure for different applications. Moreover, we propose a new measure called Imbalance Ratio to gauge the degree of skewness of a data set. We also discuss the efficient computation of interesting patterns of different null-invariant interestingness measures by proposing an algorithm, GAMiner, which complements previous studies. Experimental evaluation verifies the effectiveness of the unified framework and shows that GAMiner speeds up the state-of-the-art algorithm by an order of magnitude.
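The generalized-mean view can be made concrete from support counts alone. In this sketch, the common null-invariant measures are generalized means M_p of the two confidences P(B|A) and P(A|B), totally ordered as p runs from −∞ (AllConf) through 0 (Cosine) and 1 (Kulczynski) to +∞ (MaxConf); Coherence is omitted since it is a monotone transform of the harmonic mean rather than a plain mean.

```python
def null_invariant_measures(n_a, n_b, n_ab):
    """Null-invariant measures as generalized means of the two confidences.
    n_a, n_b, n_ab are the support counts of A, B, and A-and-B."""
    ca, cb = n_ab / n_a, n_ab / n_b       # P(B|A), P(A|B)
    measures = {
        'all_conf':   min(ca, cb),        # p -> -inf
        'cosine':     (ca * cb) ** 0.5,   # p = 0  (geometric mean)
        'kulczynski': (ca + cb) / 2,      # p = 1  (arithmetic mean)
        'max_conf':   max(ca, cb),        # p -> +inf
    }
    # Imbalance Ratio, the paper's new gauge of skewness between A and B
    measures['imbalance_ratio'] = abs(n_a - n_b) / (n_a + n_b - n_ab)
    return measures
```

A skewed pair (A occurring 100 times, B 1000 times, always together with A) makes the total ordering and the high Imbalance Ratio visible at a glance.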

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work describes how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas, and shows that CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
Abstract: We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
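The line-by-line tag ratio at the heart of CETR can be sketched as below. This is a simplified regex-based version (the paper additionally smooths the resulting histogram before clustering it into content and non-content regions); treating a tagless line as having one tag is an assumption borrowed to keep the ratio finite.

```python
import re

TAG = re.compile(r'<[^>]*>')

def tag_ratios(html):
    """Per-line tag ratio: non-tag characters divided by the number of
    tags on the line. Long plain-text lines get large ratios (likely
    content); tag-heavy template lines get small ones."""
    ratios = []
    for line in html.splitlines():
        tags = TAG.findall(line)
        text = TAG.sub('', line).strip()
        ratios.append(len(text) / max(1, len(tags)))
    return ratios
```

A navigation-style line scores near zero, while a sentence of body text scores high, which is the separation the clustering step exploits.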

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes a parameter-free hierarchical network clustering algorithm SHRINK, which can effectively reveal the embedded hierarchical community structure with multiresolution in large-scale weighted undirected networks, and identify hubs and outliers as well.
Abstract: Community detection is an important task for mining the structure and function of complex networks. Generally, a network contains several different kinds of nodes: cluster nodes densely connected within communities, as well as special nodes such as hubs, which bridge multiple communities, and outliers, which are only marginally connected to a community. In addition, it has been shown that there is a hierarchical structure in complex networks, with communities embedded within other communities. Therefore, a good algorithm should be able not only to detect hierarchical communities, but also to identify hubs and outliers. In this paper, we propose a parameter-free hierarchical network clustering algorithm, SHRINK, which combines the advantages of density-based clustering and modularity optimization methods. Based on structural connectivity information, the proposed algorithm can effectively reveal the embedded hierarchical community structure with multiresolution in large-scale weighted undirected networks, and identify hubs and outliers as well. Moreover, it overcomes the sensitive-threshold problem of density-based clustering algorithms and the resolution limit of modularity-based methods. To illustrate our methodology, we conduct experiments with both real-world and synthetic datasets for community detection, and compare with many other baseline methods. Experimental results demonstrate that SHRINK achieves the best performance with consistent improvements.

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes an adaptive threshold for outlier detection, and proposes a probabilistic approach for novel class detection using discrete Gini Coefficient, and proves its effectiveness both theoretically and empirically.
Abstract: The problem of data stream classification is challenging because of many practical aspects associated with efficient processing and the temporal behavior of the stream. Two such well-studied aspects are infinite length and concept-drift. Since a data stream may be considered a continuous process, which is theoretically infinite in length, it is impractical to store and use all the historical data for training. Data streams also frequently experience concept-drift as a result of changes in the underlying concepts. However, another important characteristic of data streams, namely, concept-evolution, is rarely addressed in the literature. Concept-evolution occurs as a result of new classes evolving in the stream. This paper addresses concept-evolution in addition to the existing challenges of infinite length and concept-drift. In this paper, the concept-evolution phenomenon is studied, and the insights are used to construct superior novel class detection techniques. First, we propose an adaptive threshold for outlier detection, which is a vital part of novel class detection. Second, we propose a probabilistic approach for novel class detection using the discrete Gini Coefficient, and prove its effectiveness both theoretically and empirically. Finally, we address the issue of simultaneous multiple novel class occurrence, and provide an elegant solution to detect more than one novel class at the same time. We also consider feature-evolution in text data streams, which occurs because new features (i.e., words) evolve in the stream. Comparison with state-of-the-art data stream classification techniques establishes the effectiveness of the proposed approach.
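The discrete Gini coefficient itself is straightforward to compute. The formulation below is generic (the paper applies it to the distribution of outlier instances' novelty scores): values concentrated in a few instances yield a high coefficient, a uniform spread a low one.

```python
def discrete_gini(values):
    """Discrete Gini coefficient of a set of non-negative scores, via the
    standard rank-weighted-sum formulation: 0 for a uniform distribution,
    approaching 1 as mass concentrates on a single value."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

For n values the maximum is (n−1)/n, reached when all mass sits on one value, which is the concentrated case a novel-class detector flags.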

Proceedings ArticleDOI
14 Mar 2010
TL;DR: This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input, and cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group.
Abstract: This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input. A user can provide either a photo of the desired scenery or a keyword describing the place of interest, and the system will look into its database for places that share those visual characteristics. To that end, we first cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group. Tourist destination recommendations are produced by comparing the query against the representative tags or representative images, under the premise of "if you like that place, you may also like these places".

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper proposes several link recommendation criteria, based on both user attributes and graph structure, and an approach satisfying them that outperforms state-of-the-art methods based on network structure and node attribute information for link recommendation.
Abstract: With the phenomenal success of networking sites (e.g., Facebook, Twitter and LinkedIn), social networks have drawn substantial attention. On online social networking sites, link recommendation is a critical task that not only helps improve user experience but also plays an essential role in network growth. In this paper we propose several link recommendation criteria, based on both user attributes and graph structure. To discover the candidates that satisfy these criteria, link relevance is estimated using a random walk algorithm on an augmented social graph with both attribute and structure information. The global and local influence of the attributes is leveraged in the framework as well. Besides link recommendation, our framework can also rank attributes in a social network. Experiments on DBLP and IMDB data sets demonstrate that our method outperforms state-of-the-art methods based on network structure and node attribute information for link recommendation.
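The relevance estimator can be sketched as a plain random walk with restart on the augmented graph. The layout here is a hypothetical simplification (user nodes first, attribute nodes after; uniform edge weights): the paper additionally weights edges by the global and local influence of attributes.

```python
import numpy as np

def rwr_scores(adj, source, restart=0.15, iters=100):
    """Random walk with restart on the augmented graph (user nodes plus
    attribute nodes, with friendship and user-attribute edges). Returns
    visit probabilities from `source`."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float),
                  where=deg != 0)            # row-stochastic transitions
    e = np.zeros(n); e[source] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * P.T @ p + restart * e
    return p

def recommend_links(adj, user, n_users, k=1):
    """Rank non-neighbor users by RWR relevance; attribute nodes (indexed
    >= n_users in this sketch) are excluded from the recommendations."""
    p = rwr_scores(adj, user)
    cand = [(p[v], v) for v in range(n_users)
            if v != user and adj[user, v] == 0]
    return [v for _, v in sorted(cand, reverse=True)[:k]]
```

On a toy graph where users 0 and 2 share an attribute node, the walk routes relevance through that shared attribute, so 2 is recommended to 0 ahead of a more distant user.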

Proceedings ArticleDOI
06 Jun 2010
TL;DR: The system, MoveMine, is designed for sophisticated moving object data mining by integrating several attractive functions, including moving object pattern mining and trajectory mining; a user-friendly interface is provided to facilitate interactive exploration of mining results and flexible tuning of the underlying methods.
Abstract: With the maturity of GPS, wireless, and Web technologies, increasing amounts of movement data collected from various moving objects, such as animals, vehicles, mobile devices, and climate radars, have become widely available. Analyzing such data has broad applications, e.g., in ecological study, vehicle control, mobile communication management, and climatological forecast. However, few data mining tools are available for flexible and scalable analysis of massive-scale moving object data. Our system, MoveMine, is designed for sophisticated moving object data mining by integrating several attractive functions, including moving object pattern mining and trajectory mining. We explore state-of-the-art and novel techniques in implementing the selected functions. A user-friendly interface is provided to facilitate interactive exploration of mining results and flexible tuning of the underlying methods. Because MoveMine has been tested on multiple kinds of real data sets, it can help users carry out versatile analyses on such data. At the same time, it can help researchers recognize the importance and limitations of current techniques, as well as potential future studies in moving object data mining.

Proceedings ArticleDOI
Yizhou Sun1, Jie Tang2, Jiawei Han1, Manish Gupta1, Bo Zhao1 
24 Jul 2010
TL;DR: A Dirichlet Process Mixture Model-based generative model is proposed to model the community generations, and the evolution structure can be read from the model, which can help users better understand the birth, split, and death of communities.
Abstract: With the rapid development of all kinds of online databases, the huge heterogeneous information networks derived from them are ubiquitous. Detecting evolutionary communities in these networks can help people better understand the structural evolution of the networks. However, most current community evolution analysis is based on homogeneous networks, while a real community usually involves different types of objects in a heterogeneous network. For example, a research community contains a set of authors, a set of conferences or journals, and a set of terms. In this paper, we study the problem of detecting evolutionary multi-typed communities, defined as net-clusters, in dynamic heterogeneous networks. A Dirichlet Process Mixture Model-based generative model is proposed to model the community generations. At each time stamp, a clustering of communities, with the cluster number that can best explain the current and historical networks, is automatically detected. A Gibbs sampling-based inference algorithm is provided to infer the model parameters. Also, the evolution structure can be read from the model, which can help users better understand the birth, split and death of communities. Experiments on two real datasets, namely DBLP and Delicious.com, have shown the effectiveness of the algorithm.
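The automatic choice of cluster number comes from the Dirichlet Process prior; its behavior can be illustrated by sampling from the equivalent Chinese Restaurant Process (a standard construction, not the paper's full model; the parameters are invented):

```python
import random

def crp_assignments(n, alpha, rng):
    """Sample cluster assignments from a Chinese Restaurant Process prior.
    A Dirichlet Process mixture lets the number of clusters grow with the
    data instead of being fixed in advance."""
    counts = []   # members per cluster
    labels = []
    for i in range(n):
        # New cluster with prob alpha/(i+alpha), else cluster k w.p. counts[k]/(i+alpha)
        r = rng.uniform(0, i + alpha)
        if not counts or r < alpha:
            labels.append(len(counts))
            counts.append(1)
        else:
            r -= alpha
            for k, c in enumerate(counts):
                if r < c:
                    labels.append(k)
                    counts[k] += 1
                    break
                r -= c
            else:  # guard against float edge cases at the interval boundary
                labels.append(len(counts) - 1)
                counts[-1] += 1
    return labels

rng = random.Random(42)
labels = crp_assignments(100, alpha=2.0, rng=rng)
num_clusters = len(set(labels))
```

Larger `alpha` yields more clusters on average; in the paper, the likelihood of the current and historical networks then reweights these prior assignments.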

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes a method called Tru-Alarm, which identifies trustworthy alarms and increases the feasibility of CPS; it estimates the locations of objects causing alarms, constructs an object-alarm graph, and carries out trustworthiness inference based on linked information in the graph.
Abstract: A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational) components to form a situation-integrated analytical system that responds intelligently to dynamic changes in real-world scenarios. One key issue in CPS research is trustworthiness analysis of the observed data: due to technology limitations and environmental influences, CPS data are inherently noisy, which may trigger many false alarms. It is highly desirable to sift meaningful information from a large volume of noisy data. In this paper, we propose a method called Tru-Alarm, which identifies trustworthy alarms and increases the feasibility of CPS. Tru-Alarm estimates the locations of objects causing alarms, constructs an object-alarm graph, and carries out trustworthiness inference based on linked information in the graph. Extensive experiments show that Tru-Alarm filters out noise and false information efficiently while ensuring that no meaningful alarms are missed.
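The object-alarm graph inference can be sketched as a simple alternating trust propagation (a made-up fixed-point scheme for illustration, not Tru-Alarm's actual inference rules; all data are invented):

```python
def infer_trust(alarm_objects, alarm_conf, iters=20):
    """Alternately update object trust and alarm trust on the bipartite
    object-alarm graph: an object is trustworthy if its alarms are, and
    an alarm is trustworthy if some linked object supports it."""
    objects = {o for objs in alarm_objects.values() for o in objs}
    alarm_trust = dict(alarm_conf)           # start from sensor confidence
    obj_trust = {o: 0.5 for o in objects}
    for _ in range(iters):
        for o in objects:
            linked = [a for a, objs in alarm_objects.items() if o in objs]
            obj_trust[o] = sum(alarm_trust[a] for a in linked) / len(linked)
        for a, objs in alarm_objects.items():
            support = max(obj_trust[o] for o in objs)
            alarm_trust[a] = 0.5 * alarm_conf[a] + 0.5 * support
    return alarm_trust

# Two alarms corroborated by the same object, plus one isolated noisy alarm
alarm_objects = {"a1": ["obj1"], "a2": ["obj1"], "a3": ["obj2"]}
alarm_conf = {"a1": 0.9, "a2": 0.8, "a3": 0.2}
trust = infer_trust(alarm_objects, alarm_conf)
```

Alarms whose estimated source object is corroborated by other alarms keep high trust, while isolated low-confidence alarms stay low and can be filtered.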

Proceedings ArticleDOI
03 Nov 2010
TL;DR: The main contribution of the paper is to show a certain data transformation at the client side that helps keep the client data private while not introducing any additional error into model construction.
Abstract: Many participatory sensing applications use data collected by participants to construct a public model of a system or phenomenon. For example, a health application might compute a model relating exercise and diet to amount of weight loss. While the ultimately computed model could be public, the individual input and output data traces used to construct it may be private data of participants (e.g., their individual food intake, lifestyle choices, and resulting weight). This paper proposes and experimentally studies a technique that attempts to keep such input and output data traces private, while allowing accurate model construction. This is significantly different from perturbation-based techniques in that no noise is added. The main contribution of the paper is to show a certain data transformation at the client side that helps keep the client data private while not introducing any additional error into model construction. We particularly focus on linear regression models, which are widely used in participatory sensing applications. We use the data set from a map-based participatory sensing service to evaluate our scheme. The service in question is a green navigation service that constructs regression models from participant data to predict the fuel consumption of vehicles on road segments. We evaluate our proposed mechanism by providing empirical evidence that: i) an individual data trace is generally hard to reconstruct with any reasonable accuracy, and ii) the regression model constructed using the transformed traces has a much smaller error than one based on additive data-perturbation schemes.
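The paper's exact transformation is not reproduced here, but one well-known way to obtain a regression model with no added error while withholding raw traces is for each client to share only aggregate sufficient statistics; this sketch (names and data invented) illustrates that general idea for simple linear regression:

```python
def client_summary(xs, ys):
    """Each client shares only sufficient statistics for y = a*x + b,
    never its raw (x, y) trace."""
    n = len(xs)
    return (n, sum(xs), sum(ys),
            sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys)))

def fit_from_summaries(summaries):
    """The server sums per-client statistics and solves the normal
    equations; the result equals the fit on the pooled raw data."""
    n = sum(s[0] for s in summaries)
    sx = sum(s[1] for s in summaries)
    sy = sum(s[2] for s in summaries)
    sxx = sum(s[3] for s in summaries)
    sxy = sum(s[4] for s in summaries)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Two clients whose pooled data lie on y = 2x + 1
c1 = client_summary([0, 1, 2], [1, 3, 5])
c2 = client_summary([3, 4], [7, 9])
a, b = fit_from_summaries([c1, c2])   # recovers a = 2.0, b = 1.0
```

Because no noise is added anywhere, the server's model is exact, unlike perturbation-based schemes where accuracy trades off against privacy.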

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper proposes a novel density-based network clustering algorithm, called gSkeletonClu (graph-skeleton based clustering), which can find the optimal parameter $\varepsilon$ and detect communities, hubs and outliers in large-scale undirected networks automatically without any user interaction.
Abstract: Community detection is an important task for mining the structure and function of complex networks. Many previous approaches have difficulty detecting communities with arbitrary size and shape, and are unable to identify hubs and outliers. A recently proposed network clustering algorithm, SCAN, is effective and can overcome this difficulty. However, it depends on a sensitive parameter, the minimum similarity threshold $\varepsilon$, and provides no automated way to find it. In this paper, we propose a novel density-based network clustering algorithm, called gSkeletonClu (graph-skeleton based clustering). By projecting a network onto its Core-Connected Maximal Spanning Tree (CCMST), the network clustering problem is converted to finding core-connected components in the CCMST. We discover that all possible values of the parameter $\varepsilon$ lie in the edge weights of the corresponding CCMST. By means of divisive or agglomerative clustering on the tree, our algorithm can find the optimal parameter $\varepsilon$ and detect communities, hubs and outliers in large-scale undirected networks automatically, without any user interaction. Extensive experiments on both real-world and synthetic networks demonstrate the superior performance of gSkeletonClu over the baseline methods.
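The key observation, that every meaningful $\varepsilon$ lies among the tree's edge weights, can be illustrated with a plain maximal spanning tree, a simplification of the paper's core-connected variant (the toy graph is invented):

```python
def maximal_spanning_tree(nodes, edges):
    """Kruskal's algorithm on edges sorted by descending similarity."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.3), ("c", "d", 0.2)]
tree = maximal_spanning_tree(nodes, edges)
# Every meaningful similarity threshold lies among the tree's edge weights:
candidate_eps = sorted({w for _, _, w in tree}, reverse=True)
# Removing tree edges with weight below a chosen eps splits the tree into clusters
```

Because cutting the tree at any threshold between two consecutive edge weights yields the same components, only the tree's edge weights need to be tried when searching for the optimal $\varepsilon$.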

Book ChapterDOI
20 Sep 2010
TL;DR: DXMiner considers the dynamic nature of the feature space and provides an elegant solution for classification and novel class detection when the feature space is dynamic, and outperforms state-of-the-art stream classification techniques in classifying and detecting novel classes in real data streams.
Abstract: Data stream classification poses many challenges, most of which are not addressed by the state-of-the-art. We present DXMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and feature-evolution. Data streams are assumed to be infinite in length, which necessitates single-pass incremental learning techniques. Concept-drift occurs in a data stream when the underlying concept changes over time. Most existing data stream classification techniques address only the infinite length and concept-drift problems. However, concept-evolution and feature-evolution are also major challenges, and these are ignored by most of the existing approaches. Concept-evolution occurs in the stream when novel classes arrive, and feature-evolution occurs when new features emerge in the stream. Our previous work addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Most of the existing data stream classification techniques, including our previous work, assume that the feature space of the data points in the stream is static. This assumption may be impractical for some types of data, for example text data. DXMiner considers the dynamic nature of the feature space and provides an elegant solution for classification and novel class detection when the feature space is dynamic. We show that our approach outperforms state-of-the-art stream classification techniques in classifying and detecting novel classes in real data streams.
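The feature-evolution issue can be illustrated with a simple homogenization sketch that pads mismatched feature spaces with zeros (an invented minimal example; the paper's actual conversion techniques are more involved):

```python
def homogenize(model_features, instance):
    """Project an instance onto the union of the model's feature space and
    its own, padding missing dimensions with zeros."""
    union = sorted(set(model_features) | set(instance))
    vec = [instance.get(f, 0.0) for f in union]
    return union, vec

model_features = {"ball", "goal", "team"}         # features seen at training
instance = {"goal": 2.0, "referee": 1.0}          # stream point with a new feature
union, vec = homogenize(model_features, instance)
# union: ['ball', 'goal', 'referee', 'team'], vec: [0.0, 2.0, 1.0, 0.0]
```

This lets a classifier trained on an earlier feature space score points that carry features it has never seen, instead of silently dropping them.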

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work studies one traditional text mining task on this new form of text, the extraction of meaningful keywords, proposing several intuitive yet useful features and experimenting with various classification models.
Abstract: Today, a huge amount of text is being generated for social purposes on social networking services on the Web. Unlike traditional documents, such text is usually extremely short and tends to be informal. Analysis of such text benefits many applications such as advertising, search, and content filtering. In this work, we study one traditional text mining task on this new form of text: the extraction of meaningful keywords. We propose several intuitive yet useful features and experiment with various classification models. Evaluation is conducted on Facebook data. The performance of various features and models is reported and compared.
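A few such intuitive candidate-keyword features might look like the following sketch (the feature set and example post are invented; the paper's exact features are not reproduced):

```python
def keyword_features(text, candidate):
    """Simple features for judging whether `candidate` is a good keyword
    for a short social-text post."""
    words = text.split()
    lower = [w.lower().strip(".,!?") for w in words]
    cand = candidate.lower()
    tf = lower.count(cand)
    first = lower.index(cand) if cand in lower else -1
    return {
        "tf": tf,                                      # term frequency
        "relative_position": first / len(words) if first >= 0 else 1.0,
        "capitalized": any(w.istitle() and w.lower().strip(".,!?") == cand
                           for w in words),
        "length": len(candidate),
    }

post = "Watching the World Cup final tonight, what a match!"
feats = keyword_features(post, "cup")
```

Feature vectors like these would then be fed to a binary classifier that labels each candidate word as keyword or not.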

Book ChapterDOI
21 Jun 2010
TL;DR: ActMiner extends MineClass to address the limited labeled data problem in addition to the other three problems, and outperforms state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
Abstract: We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
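The active selection idea, labeling only points whose expected classification error is high, can be sketched with margin-based uncertainty sampling (an illustration with invented data, not ActMiner's exact criterion):

```python
def select_for_labeling(probs, budget):
    """Active selection: request labels for the instances whose top-two
    class probabilities are closest, i.e. where the classifier is least sure."""
    def margin(p):
        top = sorted(p, reverse=True)
        return top[0] - top[1]
    ranked = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return ranked[:budget]

# Posterior class probabilities from the current ensemble for 4 points
probs = [
    [0.95, 0.03, 0.02],   # confident -> skip
    [0.40, 0.38, 0.22],   # ambiguous -> label
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most ambiguous -> label first
]
to_label = select_for_labeling(probs, budget=2)   # -> [3, 1]
```

Spending the labeling budget on the most ambiguous points is why a small fraction of labels can match the accuracy of fully supervised training.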

Journal ArticleDOI
TL;DR: It is shown that the inverse classification problem is a powerful and general model encompassing a number of different criteria; it can be used for a variety of decision support applications that have pre-determined task criteria.
Abstract: In this paper, we examine an emerging variation of the classification problem, known as the inverse classification problem. In this problem, we determine the features to be used to create a record that will result in a desired class label. Such an approach is useful in applications in which the objective is to determine a set of actions to be taken in order to guide the data mining application towards a desired solution. This system can be used for a variety of decision support applications which have pre-determined task criteria. We will show that the inverse classification problem is a powerful and general model which encompasses a number of different criteria. We propose a number of algorithms for the inverse classification problem, which use an inverted list as the intermediate data structure for representation and classification. We validate our approach over a number of real datasets.
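A toy version of the idea, filling in the free features so that a desired label becomes plausible, might look like this nearest-neighbor sketch (all data are invented; the paper's inverted-list algorithms are not reproduced):

```python
def inverse_classify(partial, desired_label, training):
    """Fill in unspecified features of `partial` by copying them from the
    nearest training record that already has the desired label."""
    fixed = {f: v for f, v in partial.items() if v is not None}
    def dist(rec):
        return sum((rec[f] - v) ** 2 for f, v in fixed.items())
    same_class = [r for r, label in training if label == desired_label]
    best = min(same_class, key=dist)
    return {f: (v if v is not None else best[f]) for f, v in partial.items()}

training = [
    ({"exercise": 5.0, "calories": 1800.0}, "healthy"),
    ({"exercise": 0.5, "calories": 3100.0}, "at_risk"),
    ({"exercise": 4.0, "calories": 2000.0}, "healthy"),
]
# Given a fixed calorie intake, what exercise level leads to "healthy"?
suggestion = inverse_classify({"exercise": None, "calories": 1950.0},
                              "healthy", training)
```

The suggested feature values are the "set of actions" that steer the record toward the desired class.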

Journal ArticleDOI
TL;DR: In this paper, a graph-based object movement cube is constructed by merging and collapsing nodes and edges according to an application-oriented topological structure, and an efficient cubing algorithm is proposed that performs simultaneous aggregation of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.
Abstract: Massive radio frequency identification (RFID) data sets are expected to become commonplace in supply chain management systems. Warehousing and mining this data is an essential problem with great potential benefits for inventory management, object tracking, and product procurement processes. Since RFID tags can be used to identify each individual item, enormous amounts of location-tracking data are generated. With such data, object movements can be modeled by movement graphs, where nodes correspond to locations and edges record the history of item transitions between locations. In this study, we develop a movement graph model as a compact representation of RFID data sets. Since spatiotemporal as well as item information can be associated with the objects in such a model, the movement graph can be huge, complex, and multidimensional in nature. We show that such a graph can be better organized around gateway nodes, which serve as bridges connecting different regions of the movement graph. A graph-based object movement cube can be constructed by merging and collapsing nodes and edges according to an application-oriented topological structure. Moreover, we propose an efficient cubing algorithm that performs simultaneous aggregation of both spatiotemporal and item dimensions on a partitioned movement graph, guided by such a topological structure.
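Building the movement graph from raw location-tracking data can be sketched as follows (an invented minimal example of the nodes-as-locations, edges-as-transitions model; the paper's gateway-based cubing is not shown):

```python
from collections import defaultdict

def build_movement_graph(readings):
    """Aggregate per-item RFID location readings into a movement graph:
    nodes are locations, edge weights count item transitions."""
    by_item = defaultdict(list)
    for item, loc, t in readings:
        by_item[item].append((t, loc))
    edges = defaultdict(int)
    for trace in by_item.values():
        trace.sort()                       # order each item's trace by time
        for (_, a), (_, b) in zip(trace, trace[1:]):
            if a != b:
                edges[(a, b)] += 1
    return dict(edges)

readings = [
    ("item1", "factory", 1), ("item1", "warehouse", 2), ("item1", "store", 3),
    ("item2", "factory", 1), ("item2", "warehouse", 4),
]
graph = build_movement_graph(readings)
# {('factory', 'warehouse'): 2, ('warehouse', 'store'): 1}
```

Cubing then merges and collapses these nodes and edges along spatiotemporal and item dimensions, with high-traffic gateway nodes guiding the partitioning.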