Journal ISSN: 1932-1864

Statistical Analysis and Data Mining 

Wiley-Blackwell
About: Statistical Analysis and Data Mining is an academic journal published by Wiley-Blackwell. The journal publishes mainly in the areas of computer science and cluster analysis. It has the ISSN identifier 1932-1864. Over its lifetime, the journal has published 453 papers, which have received 9,378 citations.


Papers
Journal Article
TL;DR: This survey article discusses some important aspects of the ‘curse of dimensionality’ in detail and surveys specialized algorithms for outlier detection from both categories.
Abstract: High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term 'curse of dimensionality', more concrete aspects being the so-called 'distance concentration effect', the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, namely those that do and those that do not consider subspaces (subsets of attributes) for the definition of outliers. The former specifically address the presence of irrelevant attributes; the latter consider irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this survey article, we discuss some important aspects of the 'curse of dimensionality' in detail and survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012

699 citations
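The 'distance concentration effect' named in the abstract above can be demonstrated in a few lines. The sketch below is a hypothetical illustration (not code from the survey): it samples uniform points in growing dimensions and shows how the relative contrast between the nearest and farthest neighbour of a query point collapses.

```python
# Illustrative sketch of the distance concentration effect: as dimensionality d
# grows, the gap between the nearest and farthest neighbour of a query point
# shrinks relative to the nearest distance.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(max_dist - min_dist) / min_dist for n uniform points in [0, 1]^d."""
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative contrast = {relative_contrast(d):.3f}")
# The printed contrast drops sharply with d, which is why plain distance-based
# outlier scores lose discriminative power in high-dimensional spaces.
```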

Journal Article
TL;DR: RF imputation is revealed to be generally robust, with performance improving as correlation increases; performance was good under moderate to high missingness, and even when data were missing not at random.
Abstract: Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.

368 citations
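As a rough illustration of missForest-style random forest imputation, the sketch below uses scikit-learn's IterativeImputer with a RandomForestRegressor estimator. This only approximates the RF algorithms compared in the paper, and the synthetic data and 20% masking rate are made up for demonstration.

```python
# missForest-style imputation sketch: iteratively regress each column with
# missing values on the other columns using a random forest.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]                  # correlated columns help imputation
mask = rng.random(X.shape) < 0.2          # ~20% of entries missing at random
X_missing = np.where(mask, np.nan, X)

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

# Compare imputed values against the ground truth on the masked entries.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"imputation RMSE on the masked entries: {rmse:.3f}")
```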

Journal Article
TL;DR: The aim of this survey is to provide a ‘user manual’ for the community discovery problem and to organize the main categories of community discovery methods based on the definition of community they adopt.
Abstract: Many real-world networks are intimately organized according to a community structure. Much research effort has been devoted to developing methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature on what is today called community detection. Since network representations can be very complex and can contain different variants of the traditional graph model, each algorithm in the literature focuses on some of these properties and establishes, explicitly or implicitly, its own definition of community. According to this definition, each proposed algorithm then extracts the communities, which typically reflect only part of the features of real communities. The aim of this survey is to provide a 'user manual' for the community discovery problem. Given a meta definition of what a community in a social network is, our aim is to organize the main categories of community discovery methods based on the definition of community they adopt. Given a desired definition of community and the features of a problem (size of network, direction of edges, multidimensionality, and so on), this review paper is designed to provide a set of approaches that researchers could focus on. The proposed classification of community discovery methods is also useful for putting into perspective the many open directions for further research. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 512–546, 2011

342 citations
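The sketch below illustrates the community discovery task itself, using networkx's greedy modularity optimization on the classic karate-club graph. It is one arbitrary choice among the many method families the survey classifies, not an algorithm proposed in the paper.

```python
# Modularity-based community discovery on a small benchmark graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)   # list of frozensets of nodes

for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
# A different definition of 'community' (overlapping, directed, multidimensional)
# would call for a different algorithm, which is exactly what the survey's
# classification is meant to help navigate.
```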

Journal Issue
TL;DR: An alternative, possibly complementary methodology for comparing clustering validity criteria is described and an extensive comparison of the performances of 40 criteria over a collection of 962,928 partitions derived from five well-known clustering algorithms and 1080 different data sets of a given class of interest is made.
Abstract: Many different relative clustering validity criteria exist that are very useful in practice as quantitative measures for evaluating the quality of data partitions, and new criteria are still proposed from time to time. These criteria are endowed with particular features that may make each of them able to outperform others in specific classes of problems. In addition, they may have completely different computational requirements. It is therefore a hard task for the user to choose a specific criterion when facing such a variety of possibilities. For this reason, a relevant issue within the field of clustering analysis consists of comparing the performances of existing validity criteria and, eventually, that of a new criterion to be proposed. In spite of this, the comparison paradigm traditionally adopted in the literature is subject to some conceptual limitations. The present paper describes an alternative, possibly complementary methodology for comparing clustering validity criteria and uses it to make an extensive comparison of the performances of 40 criteria over a collection of 962,928 partitions derived from five well-known clustering algorithms and 1080 different data sets of a given class of interest. A detailed review of the relative criteria under investigation is also provided, including an original comparative asymptotic analysis of their computational complexities. This work is intended to be a complement to the classic study reported in 1985 by Milligan and Cooper, as well as a thorough extension of a preliminary paper by the authors themselves. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 209-235, 2010

243 citations
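As a small illustration of relative validity criteria in use, the sketch below scores k-means partitions of a synthetic data set with three criteria available in scikit-learn (silhouette, Calinski-Harabasz, Davies-Bouldin). These merely stand in for the 40 criteria compared in the paper, and the data set is invented for demonstration.

```python
# Score candidate partitions (k-means with varying k) with several relative
# clustering validity criteria.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(
        f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
        f"CH={calinski_harabasz_score(X, labels):.1f}  "
        f"DB={davies_bouldin_score(X, labels):.3f}"
    )
# Different criteria can disagree on the 'best' partition, which is why a
# systematic comparison methodology such as the paper's is needed.
```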

Journal Article
TL;DR: This paper provides a change-point detection algorithm based on direct density-ratio estimation that can be computed very efficiently in an online manner and avoids nonparametric density estimation, which is known to be a difficult problem.
Abstract: Change-point detection is the problem of discovering time points at which properties of time-series data change. This covers a broad range of real-world problems and has been actively discussed in the communities of statistics and data mining. In this paper, we present a novel nonparametric approach to detecting changes in the probability distributions of sequence data. Our key idea is to estimate the ratio of probability densities, not the probability densities themselves. This formulation allows us to avoid nonparametric density estimation, which is known to be a difficult problem. We provide a change-point detection algorithm based on direct density-ratio estimation that can be computed very efficiently in an online manner. The usefulness of the proposed method is demonstrated through experiments using artificial and real-world datasets. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2011

201 citations
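A rough sketch of the density-ratio idea is given below, with loud assumptions: the ratio between the post- and pre-window distributions is approximated with a logistic-regression classifier (a common proxy for density-ratio estimation), whereas the paper uses a dedicated direct estimator, and the toy series with a mean shift at t = 300 is invented for illustration.

```python
# Sliding-window change score from an (approximate) density ratio: for each
# candidate t, compare the window after t against the window before t.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy series: mean shift from 0 to 3 at t = 300.
series = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])

def change_score(series, t, width=50):
    """Mean log density ratio of the window after t versus the window before t."""
    ref = series[t - width:t].reshape(-1, 1)
    test = series[t:t + width].reshape(-1, 1)
    X = np.vstack([ref, test])
    y = np.concatenate([np.zeros(width), np.ones(width)])
    clf = LogisticRegression().fit(X, y)
    # With equal window sizes, the classifier's log-odds approximate
    # log p_test(x) / p_ref(x); average it over the test window.
    return np.mean(clf.decision_function(test))

scores = [change_score(series, t) for t in range(50, len(series) - 50)]
print("highest change score near t =", 50 + int(np.argmax(scores)))  # ~300
```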

Performance Metrics

Number of papers from the journal in previous years:

Year | Papers
2023 | 23
2022 | 38
2021 | 14
2020 | 8
2019 | 10
2018 | 8