Author

Guilherme Oliveira Campos

Bio: Guilherme Oliveira Campos is an academic researcher from Universidade Federal de Minas Gerais. The author has contributed to research in topics: Anomaly detection & Bipartite graph. The author has an h-index of 4, co-authored 7 publications receiving 427 citations. Previous affiliations of Guilherme Oliveira Campos include University of São Paulo & University of Southern Denmark.

Papers
Journal ArticleDOI
TL;DR: An extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose, which also provides a characterization of the datasets themselves.
Abstract: The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. In this paper, we perform an extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results.

552 citations
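
The study above essentially runs many kNN-based detectors over labeled benchmark datasets and compares them with ranking measures such as ROC AUC. A minimal Python sketch of that kind of benchmark loop, using scikit-learn detectors and a synthetic labeled dataset (both are illustrative stand-ins, not the paper's actual methods or data):

import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),    # inliers
               rng.uniform(-6.0, 6.0, size=(15, 2))])  # planted outliers
y = np.r_[np.zeros(300), np.ones(15)]                  # ground truth: 1 = outlier

k = 10

# kNN outlier score: distance to the k-th nearest neighbor (larger = more outlying).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # +1 because each point is its own 1-NN
kdist = nn.kneighbors(X)[0][:, -1]

# LOF: scikit-learn stores the negated factor, so flip the sign to get a score.
lof = LocalOutlierFactor(n_neighbors=k).fit(X)
lof_score = -lof.negative_outlier_factor_

for name, score in (("kNN distance", kdist), ("LOF", lof_score)):
    print(name, "ROC AUC:", round(roc_auc_score(y, score), 3))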

Book ChapterDOI
03 Jun 2018
TL;DR: This work proposes a boosting strategy for combining individual results into an ensemble, showing improvements on benchmark datasets, and designs smaller ensembles out of a wealth of possible ensemble members to improve the diversity and accuracy of the ensemble.
Abstract: Ensemble techniques have been applied to the unsupervised outlier detection problem in some scenarios. Challenges are the generation of diverse ensemble members and the combination of individual results into an ensemble. For the latter challenge, some methods tried to design smaller ensembles out of a wealth of possible ensemble members, to improve the diversity and accuracy of the ensemble (relating to the ensemble selection problem in classification). We propose a boosting strategy for combinations showing improvements on benchmark datasets.

23 citations
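
The two challenges named in the abstract are member diversity and result combination. A hedged sketch of one plausible combination step: rank-normalize each member's score vector so scales are comparable, then greedily keep members that are least correlated with the ensemble built so far. The correlation criterion is an illustrative stand-in for a diversity measure, not the chapter's actual selection rule:

import numpy as np

def rank_normalize(s):
    # Map scores to [0, 1] by rank so members with different scales are comparable.
    return np.argsort(np.argsort(s)) / (len(s) - 1)

def greedy_diverse_ensemble(members, n_select):
    members = [rank_normalize(m) for m in members]
    selected = [members.pop(0)]                  # seed with the first member
    while members and len(selected) < n_select:
        combined = np.mean(selected, axis=0)
        # Keep the candidate whose scores correlate least with the ensemble so far.
        corrs = [abs(np.corrcoef(combined, m)[0, 1]) for m in members]
        selected.append(members.pop(int(np.argmin(corrs))))
    return np.mean(selected, axis=0)             # combine kept members by averaging

members = [np.random.default_rng(i).random(100) for i in range(6)]
ensemble_score = greedy_diverse_ensemble(members, n_select=3)
print(ensemble_score[:5])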

01 Jan 2016
TL;DR: An extensive experimental study on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose, which also provides a characterization of the datasets themselves and discusses their suitability as outlier detection benchmark sets.
Abstract: The evaluation of unsupervised outlier detection algorithms is a constant challenge in data mining research. Little is known regarding the strengths and weaknesses of different standard outlier detection models, and the impact of parameter choices for these algorithms. The scarcity of appropriate benchmark datasets with ground truth annotation is a significant impediment to the evaluation of outlier methods. Even when labeled datasets are available, their suitability for the outlier detection task is typically unknown. Furthermore, the biases of commonly-used evaluation measures are not fully understood. It is thus difficult to ascertain the extent to which newly-proposed outlier detection methods improve over established methods. We performed an extensive experimental study [1] on the performance of a representative set of standard k nearest neighborhood-based methods for unsupervised outlier detection, across a wide variety of datasets prepared for this purpose. Based on the overall performance of the outlier detection methods, we provide a characterization of the datasets themselves, and discuss their suitability as outlier detection benchmark sets. We also examine the most commonly-used measures for comparing the performance of different methods, and suggest adaptations that are more suitable for the evaluation of outlier detection results. We present the results from our previous publication [1] as well as additional observations and measures added to the online repository.

7 citations
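
Among the measure adaptations the abstract alludes to are chance-adjusted versions of ranking measures such as precision at n. A sketch of one such adjustment, under the assumption that a random ranking should score near 0 and a perfect ranking 1 (consult the paper and repository for the exact definitions used there):

import numpy as np

def precision_at_n(y_true, scores, n):
    # Fraction of true outliers among the n highest-scored points.
    top_n = np.argsort(scores)[::-1][:n]
    return float(np.asarray(y_true)[top_n].mean())

def adjusted_precision_at_n(y_true, scores, n):
    # Adjust for chance: a random ranking scores ~0, a perfect ranking scores 1.
    p = precision_at_n(y_true, scores, n)
    expected = float(np.mean(y_true))            # expected precision of a random ranking
    return (p - expected) / (1.0 - expected)

y = np.r_[np.zeros(95), np.ones(5)]              # 5 outliers among 100 points
s = np.random.default_rng(1).random(100)         # an (uninformative) score vector
print(round(adjusted_precision_at_n(y, s, n=5), 3))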

Proceedings ArticleDOI
25 Jun 2018
TL;DR: It is shown that assessing the similarity between graphs can guide the choice of effective combinations, as less similar graphs are complementary with respect to the outlier information they provide and lead to better outlier detection.
Abstract: Various previous works proposed techniques to detect outliers in graph data. Usually, some complex dataset is modeled as a graph and a technique for detecting outliers in graphs is applied. The impact of the graph model on the outlier detection capabilities of any method has so far been ignored. Here we assess the impact of the graph model on outlier detection performance and the gains that may be achieved by using multiple graph models and combining the results obtained by these models. We show that assessing the similarity between graphs can guide the choice of effective combinations, as less similar graphs are complementary with respect to the outlier information they provide and lead to better outlier detection.

6 citations
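
The pipeline described above can be pictured as: derive several graph models from the same data, score outliers on each, and combine the results, using graph similarity to decide which models complement each other. A sketch with two illustrative graph models (kNN and epsilon-neighborhood), a toy degree-based detector, and edge-set Jaccard as the similarity; none of these choices are claimed to match the paper's:

import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X = np.random.default_rng(2).normal(size=(200, 2))

g_knn = kneighbors_graph(X, n_neighbors=5)       # graph model 1: kNN graph
g_eps = radius_neighbors_graph(X, radius=0.4)    # graph model 2: epsilon graph

def edge_set(g):
    rows, cols = g.nonzero()
    return {frozenset((int(a), int(b))) for a, b in zip(rows, cols) if a != b}

def degree_outlier_score(g):
    # Toy detector: weakly connected vertices look more outlying.
    deg = np.asarray(g.sum(axis=1)).ravel()
    return 1.0 / (1.0 + deg)

edges = [edge_set(g_knn), edge_set(g_eps)]
similarity = len(edges[0] & edges[1]) / len(edges[0] | edges[1])  # Jaccard on edges

scores = [degree_outlier_score(g_knn), degree_outlier_score(g_eps)]
# Less similar graphs carry complementary outlier information, so combine them.
combined = np.mean(scores, axis=0) if similarity < 0.5 else scores[0]
print("graph similarity:", round(similarity, 3))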

Proceedings Article
01 Jan 2018
TL;DR: This paper proposes a boosting strategy to solve the ensemble selection problem, called BoostSelect, and evaluates it over a large benchmark of datasets for outlier detection, showing improvements over baseline approaches.
Abstract: Ensemble techniques have been applied to the unsupervised outlier detection problem in some scenarios. Challenges are the generation of diverse ensemble members and the combination of individual results into an ensemble. For the latter challenge, some methods tried to design smaller ensembles out of a wealth of possible ensemble members, to improve the diversity and accuracy of the ensemble (relating to the ensemble selection problem in classification). In this paper, we propose a boosting strategy to solve the ensemble selection problem, called BoostSelect. We evaluate BoostSelect over a large benchmark of datasets for outlier detection, showing improvements over baseline approaches.

5 citations
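
The published BoostSelect algorithm is defined in the paper; the sketch below only conveys the boosting flavor of ensemble selection: derive a pseudo-target from all members, then repeatedly add the candidate that best fixes the points the current ensemble still misranks, upweighting those points after each round. All details here are assumptions for illustration:

import numpy as np

def rank_normalize(s):
    return np.argsort(np.argsort(s)) / (len(s) - 1)

def boost_flavored_select(members, n_select, top_fraction=0.05):
    members = [rank_normalize(m) for m in members]
    consensus = np.mean(members, axis=0)
    # Pseudo-target: points the full ensemble ranks in the top fraction.
    target = (consensus >= np.quantile(consensus, 1.0 - top_fraction)).astype(float)
    weights = np.ones_like(target)               # per-point weights, boosted on errors
    selected, pool = [], list(members)
    for _ in range(min(n_select, len(pool))):
        # Add the candidate with the smallest weighted disagreement with the target.
        gaps = [np.sum(weights * np.abs(m - target)) for m in pool]
        selected.append(pool.pop(int(np.argmin(gaps))))
        combined = np.mean(selected, axis=0)
        weights *= 1.0 + np.abs(combined - target)   # upweight still-misranked points
    return np.mean(selected, axis=0)

members = [np.random.default_rng(i).random(50) for i in range(8)]
print(boost_flavored_select(members, n_select=3)[:5])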


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, covering algorithmic and structural questions and touching on newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

Journal ArticleDOI
TL;DR: A survey of deep anomaly detection with a comprehensive taxonomy is presented in this paper, covering advancements in 3 high-level categories and 11 fine-grained categories of the methods.
Abstract: Anomaly detection, a.k.a. outlier detection or novelty detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This article surveys the research of deep anomaly detection with a comprehensive taxonomy, covering advancements in 3 high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages, and disadvantages and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges.

560 citations
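
One fine-grained category such surveys cover is reconstruction-based anomaly detection: train a model to reconstruct normal data and flag points with high reconstruction error. A compact stand-in using scikit-learn's MLPRegressor as a bottleneck autoencoder (a real deep detector would use a dedicated deep-learning framework; the data here is synthetic):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X_train = rng.normal(0.0, 1.0, size=(500, 8))            # assumed mostly normal data
X_test = np.vstack([rng.normal(0.0, 1.0, size=(20, 8)),  # normal test points
                    rng.normal(6.0, 1.0, size=(5, 8))])  # 5 planted anomalies

# A 4-unit bottleneck forces the network to learn a compressed view of normality.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)                                 # learn to reproduce the input

reconstruction = ae.predict(X_test)
score = np.mean((X_test - reconstruction) ** 2, axis=1)  # anomaly score = reconstruction error
print(np.round(score, 2))                                # the last 5 scores should be largest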

01 Jan 1981
TL;DR: A mathematical statistics textbook covering probability, random variables, special distributions, estimation, hypothesis testing, two-sample inference, goodness-of-fit tests, regression, analysis of variance, randomized block designs, and nonparametric statistics.
Abstract: Table of contents:
1. Introduction: 1.1 An Overview; 1.2 Some Examples; 1.3 A Brief History; 1.4 A Chapter Summary
2. Probability: 2.1 Introduction; 2.2 Sample Spaces and the Algebra of Sets; 2.3 The Probability Function; 2.4 Conditional Probability; 2.5 Independence; 2.6 Combinatorics; 2.7 Combinatorial Probability; 2.8 Taking a Second Look at Statistics (Monte Carlo Techniques)
3. Random Variables: 3.1 Introduction; 3.2 Binomial and Hypergeometric Probabilities; 3.3 Discrete Random Variables; 3.4 Continuous Random Variables; 3.5 Expected Values; 3.6 The Variance; 3.7 Joint Densities; 3.8 Transforming and Combining Random Variables; 3.9 Further Properties of the Mean and Variance; 3.10 Order Statistics; 3.11 Conditional Densities; 3.12 Moment-Generating Functions; 3.13 Taking a Second Look at Statistics (Interpreting Means); Appendix 3.A.1 MINITAB Applications
4. Special Distributions: 4.1 Introduction; 4.2 The Poisson Distribution; 4.3 The Normal Distribution; 4.4 The Geometric Distribution; 4.5 The Negative Binomial Distribution; 4.6 The Gamma Distribution; 4.7 Taking a Second Look at Statistics (Monte Carlo Simulations); Appendix 4.A.1 MINITAB Applications; Appendix 4.A.2 A Proof of the Central Limit Theorem
5. Estimation: 5.1 Introduction; 5.2 Estimating Parameters: The Method of Maximum Likelihood and the Method of Moments; 5.3 Interval Estimation; 5.4 Properties of Estimators; 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound; 5.6 Sufficient Estimators; 5.7 Consistency; 5.8 Bayesian Estimation; 5.9 Taking a Second Look at Statistics (Beyond Classical Estimation); Appendix 5.A.1 MINITAB Applications
6. Hypothesis Testing: 6.1 Introduction; 6.2 The Decision Rule; 6.3 Testing Binomial Data (H0: p = p0); 6.4 Type I and Type II Errors; 6.5 A Notion of Optimality: The Generalized Likelihood Ratio; 6.6 Taking a Second Look at Statistics (Statistical Significance versus "Practical" Significance)
7. Inferences Based on the Normal Distribution: 7.1 Introduction; 7.2 Comparing (Ȳ − μ)/(σ/√n) and (Ȳ − μ)/(S/√n); 7.3 Deriving the Distribution of (Ȳ − μ)/(S/√n); 7.4 Drawing Inferences About μ; 7.5 Drawing Inferences About σ²; 7.6 Taking a Second Look at Statistics (Type II Error); Appendix 7.A.1 MINITAB Applications; Appendix 7.A.2 Some Distribution Results for Ȳ and S²; Appendix 7.A.3 A Proof that the One-Sample t Test is a GLRT; Appendix 7.A.4 A Proof of Theorem 7.5.2
8. Types of Data: A Brief Overview: 8.1 Introduction; 8.2 Classifying Data; 8.3 Taking a Second Look at Statistics (Samples Are Not "Valid"!)
9. Two-Sample Inferences: 9.1 Introduction; 9.2 Testing H0: μX = μY; 9.3 Testing H0: σ²X = σ²Y (The F Test); 9.4 Binomial Data: Testing H0: pX = pY; 9.5 Confidence Intervals for the Two-Sample Problem; 9.6 Taking a Second Look at Statistics (Choosing Samples); Appendix 9.A.1 A Derivation of the Two-Sample t Test (A Proof of Theorem 9.2.2); Appendix 9.A.2 MINITAB Applications
10. Goodness-of-Fit Tests: 10.1 Introduction; 10.2 The Multinomial Distribution; 10.3 Goodness-of-Fit Tests: All Parameters Known; 10.4 Goodness-of-Fit Tests: Parameters Unknown; 10.5 Contingency Tables; 10.6 Taking a Second Look at Statistics (Outliers); Appendix 10.A.1 MINITAB Applications
11. Regression: 11.1 Introduction; 11.2 The Method of Least Squares; 11.3 The Linear Model; 11.4 Covariance and Correlation; 11.5 The Bivariate Normal Distribution; 11.6 Taking a Second Look at Statistics (How Not to Interpret the Sample Correlation Coefficient); Appendix 11.A.1 MINITAB Applications; Appendix 11.A.2 A Proof of Theorem 11.3.3
12. The Analysis of Variance: 12.1 Introduction; 12.2 The F Test; 12.3 Multiple Comparisons: Tukey's Method; 12.4 Testing Subhypotheses with Contrasts; 12.5 Data Transformations; 12.6 Taking a Second Look at Statistics (Putting the Subject of Statistics Together: the Contributions of Ronald A. Fisher); Appendix 12.A.1 MINITAB Applications; Appendix 12.A.2 A Proof of Theorem 12.2.2; Appendix 12.A.3 The Distribution of [SSTR/(k−1)]/[SSE/(n−k)] When H1 Is True
13. Randomized Block Designs: 13.1 Introduction; 13.2 The F Test for a Randomized Block Design; 13.3 The Paired t Test; 13.4 Taking a Second Look at Statistics (Choosing between a Two-Sample t Test and a Paired t Test); Appendix 13.A.1 MINITAB Applications
14. Nonparametric Statistics: 14.1 Introduction; 14.2 The Sign Test; 14.3 Wilcoxon Tests; 14.4 The Kruskal-Wallis Test; 14.5 The Friedman Test; 14.6 Testing for Randomness; 14.7 Taking a Second Look at Statistics (Comparing Parametric and Nonparametric Procedures); Appendix 14.A.1 MINITAB Applications
Back matter: Appendix: Statistical Tables; Answers to Selected Odd-Numbered Questions; Bibliography; Index

524 citations

Journal ArticleDOI
TL;DR: This article surveys the research of deep anomaly detection with a comprehensive taxonomy, covering advancements in 3 high-level categories and 11 fine-grained categories of the methods and discusses how they address the aforementioned challenges.
Abstract: Anomaly detection, a.k.a. outlier detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This paper reviews the research of deep anomaly detection with a comprehensive taxonomy of detection methods, covering advancements in three high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages and disadvantages, and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges.

385 citations

Journal Article
TL;DR: In this article, the authors propose a measure of local outlierness based on a symmetric neighborhood relationship, which considers both the neighbors and reverse neighbors of an object when estimating its density distribution.
Abstract: Mining outliers in a database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., outliers that have a density distribution significantly different from their neighborhood. The estimation of the density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2,11]. However, when outliers are in locations where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in a wrong estimation. To avoid this problem, here we propose a simple but effective measure of local outlierness based on a symmetric neighborhood relationship. The proposed measure considers both the neighbors and reverse neighbors of an object when estimating its density distribution. As a result, the outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.

321 citations
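
A sketch of the symmetric-neighborhood idea described above: estimate each point's density from its k-distance, then relate it to the densities over both its k-nearest neighbors and its reverse k-nearest neighbors. The exact formula in the paper may differ; this only illustrates the mechanism:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(4).normal(size=(150, 2))
k = 8

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, idx = nn.kneighbors(X)                     # column 0 is the point itself
density = 1.0 / dist[:, -1]                      # density estimate: 1 / k-distance

knn = [set(map(int, row[1:])) for row in idx]    # k-nearest neighbors of each point
rknn = [set() for _ in range(len(X))]            # reverse k-nearest neighbors
for p, neighbors in enumerate(knn):
    for q in neighbors:
        rknn[q].add(p)

scores = np.empty(len(X))
for p in range(len(X)):
    influence = knn[p] | rknn[p]                 # symmetric neighborhood of p
    # Outlierness: average neighborhood density relative to p's own density.
    scores[p] = np.mean([density[q] for q in influence]) / density[p]

print("top-5 outlier indices:", np.argsort(scores)[::-1][:5])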