
Showing papers by "Pang-Ning Tan published in 2011"


Journal Article
TL;DR: It is shown that an ontology can be used to greatly reduce the number of features needed for document clustering, and that by using core semantic features for clustering one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
Abstract: Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.

98 citations
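The feature-reduction idea in the abstract can be illustrated with a minimal sketch. The paper selects core semantic features (disambiguated polysemous and synonymous nouns) using an ontology and an information-gain measure; the toy below instead keeps only high-variance terms as a crude stand-in and clusters on the reduced space. The corpus, the 25% threshold, and the variance criterion are all illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tiny toy corpus with two themes (finance vs. rivers); the noun
# "bank" is polysemous across them, as in the paper's motivation.
docs = [
    "the bank approved the loan and the mortgage",
    "interest rates at the bank rose sharply",
    "the river bank flooded after heavy rain",
    "fish swim near the river bank and the shore",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Crude stand-in for "core semantic features": keep the top 25% of
# terms by variance (the paper instead selects disambiguated nouns
# via an ontology, which is what makes 90%+ reduction possible).
n_keep = max(1, X.shape[1] // 4)
keep = np.argsort(X.var(axis=0))[-n_keep:]
X_reduced = X[:, keep]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(X.shape[1], X_reduced.shape[1], sorted(set(map(int, labels))))
```

Clustering on the reduced matrix still separates the corpus into two groups, which is the shape of the empirical claim in the abstract.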


Journal Article
TL;DR: In this article, a review of approaches to climate downscaling is presented, focusing on three broad categories: dynamic, empirical-dynamic and disaggregation methods, and the fundamental considerations of different methods are highlighted and explained for non-climatologists.
Abstract: The majority of climate change impact assessments focus on potential impacts at the local/regional scale. Climate change scenarios with a fine spatial resolution are essential components of these assessments. Scenarios must be designed with the goals of the assessment in mind. Often the scientists and stakeholders leading, or participating in, impact assessments are unaware of the challenging and time-consuming nature of climate scenario development. The intent of this review, presented in two parts, is to strengthen the communication between the developers and users of climate scenarios and ultimately to improve the utility of climate impact assessments. In Part I, approaches to climate downscaling are grouped into three broad categories (dynamic downscaling, empirical-dynamic downscaling and disaggregation methods), and the fundamental considerations of the different methods are highlighted and explained for non-climatologists. Part II focuses on the application of climate change scenarios.

50 citations


Journal Article
TL;DR: In this paper, the authors review for non-climate scientists a number of practical considerations when utilizing climate change scenarios, including sources of observational data for scenario evaluation, the advantages of scenario ensembles, adjusting for scenario biases, and the availability of archived downscaled scenarios.
Abstract: Although downscaling methods for deriving local/regional climate change scenarios have been extensively studied, little guidance exists on how to use the downscaled scenarios in applications such as impact assessments. In this second part of a two-part communication, we review for non-climate scientists a number of practical considerations when utilizing climate change scenarios. The issues discussed are drawn from questions frequently asked by our colleagues on assessment teams and include sources of observational data for scenario evaluation, the advantages of scenario ensembles, adjusting for scenario biases, and the availability of archived downscaled scenarios. Together with Part I, which reviews various downscaling methods, Part II is intended to improve the communication between suppliers and users of local/regional climate change scenarios, with the overall goal of improving the utility of climate impact assessments through a better understanding by all assessment team members of the strengths and limitations of local/regional climate change scenarios.

40 citations


Proceedings Article
21 Aug 2011
TL;DR: This paper argues that, even in the face of encrypted traffic flows, botnets can still be detected by examining the set of server IP addresses visited by a client machine in the past, and presents a novel incremental LS-SVM algorithm that is adaptive to both changes in the feature set and class labels of training instances.
Abstract: As botnets continue to proliferate and grow in sophistication, so does the need for more advanced security solutions to effectively detect and defend against such attacks. In particular, botnets such as Conficker have been known to encrypt the communication packets exchanged between bots and their command-and-control server, making it costly for existing botnet detection systems that rely on deep packet inspection (DPI) methods to identify compromised machines. In this paper, we argue that, even in the face of encrypted traffic flows, botnets can still be detected by examining the set of server IP addresses visited by a client machine in the past. However, there are several challenges that must be addressed. First, the set of server IP addresses visited by client machines may evolve dynamically. Second, the set of client machines used for training and their class labels may also change over time. To overcome these challenges, this paper presents a novel incremental LS-SVM algorithm that is adaptive to both changes in the feature set and the class labels of training instances. To evaluate the performance of our algorithm, we performed experiments on two large-scale datasets, including real-time data collected from peering routers at a large Tier-1 ISP. Experimental results showed that the proposed algorithm produces classification accuracy comparable to its batch counterpart while consuming significantly fewer computational resources.

18 citations
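A batch LS-SVM of the kind the incremental algorithm extends can be sketched in a few lines. The abstract's features are the set of server IP addresses a client has visited, which can be encoded as binary indicator columns. Everything below (the toy traffic matrix, the value γ = 10, the linear kernel, and the function-estimation form of the LS-SVM system) is an illustrative assumption; the paper's actual contribution, the incremental update under changing features and labels, is not shown.

```python
import numpy as np

# Toy client x destination-IP indicator matrix: entry (i, j) is 1 if
# client i contacted server IP j (illustrative data, not the paper's).
X = np.array([
    [1, 1, 0, 0, 1],   # bot-like clients sharing C&C destinations
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],   # benign clients
    [0, 1, 0, 1, 1],
], dtype=float)
y = np.array([1.0, 1.0, -1.0, -1.0])   # +1 = bot, -1 = benign

gamma = 10.0          # regularization strength (assumed value)
K = X @ X.T           # linear kernel between clients
n = len(y)

# Batch LS-SVM (function-estimation form): solve the KKT linear system
#   [ 0   1^T         ] [b]     = [0]
#   [ 1   K + I/gamma ] [alpha]   [y]
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.eye(n) / gamma
sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
b, alpha = sol[0], sol[1:]

def predict(x_new):
    """Sign of the LS-SVM decision function for a new client vector."""
    return np.sign(alpha @ (X @ x_new) + b)

print(predict(np.array([1.0, 1, 0, 0, 1])))   # → 1.0 (bot-like)
```

Because training reduces to one linear solve, adding or removing a feature column or a labeled client changes `K` by a low-rank term, which is what makes an incremental variant attractive compared with re-solving from scratch.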


Proceedings Article
11 Dec 2011
TL;DR: This paper designs a boosting algorithm to minimize the loss function and presents an approach to scale up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition.
Abstract: Link prediction is a challenging task due to the inherent skewness of network data. Typical link prediction methods can be categorized as either local or global. Local methods consider the link structure in the immediate neighborhood of a node pair to determine the presence or absence of a link, whereas global methods utilize information from the whole network. This paper presents a community (cluster) level link prediction method without the need to explicitly identify the communities in a network. Specifically, a variable-cost loss function is defined to address the data skewness problem. We provide a theoretical proof that shows the equivalence between maximizing the well-known modularity measure used in community detection and minimizing a special case of the proposed loss function. As a result, any link prediction method designed to optimize the loss function would result in more links being predicted within a community than between communities. We design a boosting algorithm to minimize the loss function and present an approach to scale up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition. Experimental results show that our proposed LinkBoost algorithm consistently performs as well as or better than many existing methods when evaluated on four real-world network datasets.

17 citations
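The modularity measure that the paper's equivalence proof ties to its variable-cost loss can be computed directly. The snippet below is just Newman's modularity on a toy two-community graph, not the LinkBoost algorithm itself; the graph and partitions are illustrative assumptions.

```python
import numpy as np

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def modularity(A, communities):
    """Newman modularity: Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) [c_i == c_j]."""
    k = A.sum(axis=1)                      # node degrees
    two_m = k.sum()                        # 2 * number of edges
    c = np.asarray(communities)
    same = c[:, None] == c[None, :]        # indicator of same community
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

q_good = modularity(A, [0, 0, 0, 1, 1, 1])    # the two triangles
q_merged = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one block
print(round(q_good, 3))   # → 0.357
```

Maximizing Q rewards placing more edges inside communities than a degree-based null model expects, which matches the abstract's claim that optimizing the loss predicts more links within communities than between them.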


Book Chapter
01 Jan 2011
TL;DR: This chapter presents an overview of the different techniques in network mining and suggests future research possibilities in the direction of graph theory.
Abstract: Network mining is a growing area of research within the data mining community that uses metrics and algorithms from graph theory. In this chapter we present an overview of the different techniques in network mining and suggest future research possibilities in the direction of graph theory.

9 citations


Proceedings Article
01 Jan 2011
TL;DR: A statistical downscaling framework that focuses on the accurate projection of future extreme values by directly estimating the conditional quantiles of the response variable is presented and extended to a semi-supervised learning setting, and its efficacy is demonstrated in terms of inferring the magnitude, frequency, and timing of extreme climate events.
Abstract: Statistical downscaling is commonly used in climate modeling to obtain high-resolution spatial projections of future climate scenarios from the coarse-resolution outputs projected by global climate models. Unfortunately, most statistical downscaling approaches using standard regression methods tend to emphasize projecting the conditional mean of the data while paying scant attention to the extreme values that are rare in occurrence yet critical for climate impact assessment and adaptation studies. This paper presents a statistical downscaling framework that focuses on the accurate projection of future extreme values by directly estimating the conditional quantiles of the response variable. We also extend the proposed framework to a semi-supervised learning setting and demonstrate its efficacy in terms of inferring the magnitude, frequency, and timing of extreme climate events. The proposed approach outperformed baseline statistical downscaling approaches at 85% of the 37 stations evaluated, in terms of the accuracy of the magnitude projected for extreme data points.

2 citations
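Direct estimation of conditional quantiles, the core of the framework described above, is available in standard libraries via the pinball (quantile) loss. The sketch below contrasts a mean regressor with a 0.95-quantile regressor on synthetic heavy-tailed data; the data-generating process and the model choice are illustrative assumptions, and the paper's semi-supervised extension is not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for downscaling: x mimics a coarse-resolution
# predictor, y a local response with a heavy right tail (extremes).
x = rng.uniform(0, 10, size=500).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.gamma(shape=2.0, scale=1.0 + 0.3 * x.ravel())

# Mean regression (squared loss) vs. direct estimation of the
# 0.95 conditional quantile (pinball loss).
mean_model = GradientBoostingRegressor(random_state=0).fit(x, y)
q95_model = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                      random_state=0).fit(x, y)

x_test = np.linspace(0, 10, 50).reshape(-1, 1)
frac_above = (q95_model.predict(x_test) > mean_model.predict(x_test)).mean()
print(frac_above)   # the quantile model sits above the mean model
```

A mean regressor systematically undershoots the rare large values, which is the abstract's criticism of standard regression-based downscaling; fitting the upper quantile targets exactly those extremes.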