Journal ISSN: 1932-1864

Statistical Analysis and Data Mining 

Wiley-Blackwell
About: Statistical Analysis and Data Mining is an academic journal published by Wiley-Blackwell. The journal publishes mainly in the areas of computer science and cluster analysis. It has the ISSN identifier 1932-1864. Over its lifetime, the journal has published 453 papers, which have received 9,378 citations.


Papers
Journal Article
TL;DR: This survey article discusses some important aspects of the ‘curse of dimensionality’ in detail and surveys specialized algorithms for outlier detection from both categories.
Abstract: High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term 'curse of dimensionality', more concrete aspects being the so-called 'distance concentration effect', the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, namely those that do and those that do not consider subspaces (subsets of attributes) for the definition of outliers. The former specifically address the presence of irrelevant attributes; the latter consider irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this survey article, we discuss some important aspects of the 'curse of dimensionality' in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012

699 citations
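The 'distance concentration effect' named in the abstract above can be demonstrated in a few lines. The sketch below is a hypothetical illustration (not code from the survey): it samples uniform points in growing dimensions and shows how the relative contrast between the nearest and farthest neighbour of a query point collapses.

```python
# Illustrative sketch of the distance concentration effect: as dimensionality d
# grows, the gap between the nearest and farthest neighbour of a query point
# shrinks relative to the nearest distance.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(max_dist - min_dist) / min_dist for n uniform points in [0, 1]^d."""
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative contrast = {relative_contrast(d):.3f}")
# The printed contrast drops sharply with d, which is why plain distance-based
# outlier scores lose discriminative power in high-dimensional spaces.
```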

Journal Article
TL;DR: RF imputation is revealed to be generally robust, with performance improving as correlation increases; performance was good under moderate to high missingness, and even when data were missing not at random.
Abstract: Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.

368 citations
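As a rough illustration of missForest-style random forest imputation, the sketch below uses scikit-learn's IterativeImputer with a RandomForestRegressor estimator. This only approximates the RF algorithms compared in the paper, and the synthetic data and 20% masking rate are made up for demonstration.

```python
# missForest-style imputation sketch: iteratively regress each column with
# missing values on the other columns using a random forest.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]                  # correlated columns help imputation
mask = rng.random(X.shape) < 0.2          # ~20% of entries missing at random
X_missing = np.where(mask, np.nan, X)

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

# Compare imputed values against the ground truth on the masked entries.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"imputation RMSE on the masked entries: {rmse:.3f}")
```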

Journal Article
TL;DR: The aim of this survey is to provide a ‘user manual’ for the community discovery problem and to organize the main categories of community discovery methods based on the definition of community they adopt.
Abstract: Many real-world networks are intimately organized according to a community structure. Much research effort has been devoted to developing methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature on what is today called community detection. Since network representations can be very complex and can contain different variants of the traditional graph model, each algorithm in the literature focuses on some of these properties and establishes, explicitly or implicitly, its own definition of community. According to this definition, each proposed algorithm then extracts the communities, which typically reflect only part of the features of real communities. The aim of this survey is to provide a 'user manual' for the community discovery problem. Given a meta definition of what a community in a social network is, our aim is to organize the main categories of community discovery methods based on the definition of community they adopt. Given a desired definition of community and the features of a problem (size of network, direction of edges, multidimensionality, and so on), this review paper is designed to provide a set of approaches that researchers could focus on. The proposed classification of community discovery methods is also useful for putting into perspective the many open directions for further research. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 512–546, 2011

342 citations
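The sketch below illustrates the community discovery task itself, using networkx's greedy modularity optimization on the classic karate-club graph. It is one arbitrary choice among the many method families the survey classifies, not an algorithm proposed in the paper.

```python
# Modularity-based community discovery on a small benchmark graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)   # list of frozensets of nodes

for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
# A different definition of 'community' (overlapping, directed, multidimensional)
# would call for a different algorithm, which is exactly what the survey's
# classification is meant to help navigate.
```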

Journal Issue
TL;DR: An alternative, possibly complementary methodology for comparing clustering validity criteria is described and an extensive comparison of the performances of 40 criteria over a collection of 962,928 partitions derived from five well-known clustering algorithms and 1080 different data sets of a given class of interest is made.
Abstract: Many different relative clustering validity criteria exist that are very useful in practice as quantitative measures for evaluating the quality of data partitions, and new criteria are still proposed from time to time. These criteria are endowed with particular features that may make each of them able to outperform others in specific classes of problems. In addition, they may have completely different computational requirements. It is therefore a hard task for the user to choose a specific criterion when facing such a variety of possibilities. For this reason, a relevant issue within the field of clustering analysis consists of comparing the performances of existing validity criteria and, eventually, that of a new criterion to be proposed. In spite of this, the comparison paradigm traditionally adopted in the literature is subject to some conceptual limitations. The present paper describes an alternative, possibly complementary methodology for comparing clustering validity criteria and uses it to make an extensive comparison of the performances of 40 criteria over a collection of 962,928 partitions derived from five well-known clustering algorithms and 1080 different data sets of a given class of interest. A detailed review of the relative criteria under investigation is also provided, including an original comparative asymptotic analysis of their computational complexities. This work is intended to be a complement to the classic study reported in 1985 by Milligan and Cooper, as well as a thorough extension of a preliminary paper by the authors themselves. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 209-235, 2010

243 citations
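As a small illustration of relative validity criteria in use, the sketch below scores k-means partitions of a synthetic data set with three criteria available in scikit-learn (silhouette, Calinski-Harabasz, Davies-Bouldin). These merely stand in for the 40 criteria compared in the paper, and the data set is invented for demonstration.

```python
# Score candidate partitions (k-means with varying k) with several relative
# clustering validity criteria.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(
        f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
        f"CH={calinski_harabasz_score(X, labels):.1f}  "
        f"DB={davies_bouldin_score(X, labels):.3f}"
    )
# Different criteria can disagree on the 'best' partition, which is why a
# systematic comparison methodology such as the paper's is needed.
```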

Journal Article
TL;DR: This paper provides a change-point detection algorithm based on direct density-ratio estimation that can be computed very efficiently in an online manner and avoids nonparametric density estimation, which is known to be a difficult problem.
Abstract: Change-point detection is the problem of discovering time points at which properties of time-series data change. This covers a broad range of real-world problems and has been actively discussed in the communities of statistics and data mining. In this paper, we present a novel nonparametric approach to detecting changes in the probability distributions of sequence data. Our key idea is to estimate the ratio of probability densities, not the probability densities themselves. This formulation allows us to avoid nonparametric density estimation, which is known to be a difficult problem. We provide a change-point detection algorithm based on direct density-ratio estimation that can be computed very efficiently in an online manner. The usefulness of the proposed method is demonstrated through experiments using artificial and real-world datasets. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2011

201 citations
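A rough sketch of the density-ratio idea is given below, with loud assumptions: the ratio between the post- and pre-window distributions is approximated with a logistic-regression classifier (a common proxy for density-ratio estimation), whereas the paper uses a dedicated direct estimator, and the toy series with a mean shift at t = 300 is invented for illustration.

```python
# Sliding-window change score from an (approximate) density ratio: for each
# candidate t, compare the window after t against the window before t.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy series: mean shift from 0 to 3 at t = 300.
series = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])

def change_score(series, t, width=50):
    """Mean log density ratio of the window after t versus the window before t."""
    ref = series[t - width:t].reshape(-1, 1)
    test = series[t:t + width].reshape(-1, 1)
    X = np.vstack([ref, test])
    y = np.concatenate([np.zeros(width), np.ones(width)])
    clf = LogisticRegression().fit(X, y)
    # With equal window sizes, the classifier's log-odds approximate
    # log p_test(x) / p_ref(x); average it over the test window.
    return np.mean(clf.decision_function(test))

scores = [change_score(series, t) for t in range(50, len(series) - 50)]
print("highest change score near t =", 50 + int(np.argmax(scores)))  # ~300
```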

Performance Metrics

Number of papers from the journal in previous years:

Year | Papers
2023 | 23
2022 | 38
2021 | 14
2020 | 8
2019 | 10
2018 | 8