scispace - formally typeset
Journal ArticleDOI

Efficient algorithms for mining outliers from large data sets

Sridhar Ramaswamy, +2 more
- Vol. 29, Iss: 2, pp 427-438
Reads0
Chats0
TLDR
A novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor is proposed and the top n points in this ranking are declared to be outliers.
Abstract
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

read more

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI

Anomaly detection: A survey

TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Journal ArticleDOI

LOF: identifying density-based local outliers

TL;DR: This paper contends that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier, called the local outlier factor (LOF), and gives a detailed formal analysis showing that LOF enjoys many desirable properties.
Journal ArticleDOI

A Survey of Outlier Detection Methodologies

TL;DR: A survey of contemporary techniques for outlier detection is introduced and their respective motivations are identified and distinguish their advantages and disadvantages in a comparative review.
Book ChapterDOI

A Survey of Clustering Data Mining Techniques

TL;DR: This survey concentrates on clustering algorithms from a data mining perspective as a data modeling technique that provides for concise summaries of the data.
Journal ArticleDOI

An overview of anomaly detection techniques: Existing solutions and latest technological trends

TL;DR: This paper provides a comprehensive survey of anomaly detection systems and hybrid intrusion detection systems of the recent past and present and discusses recent technological trends in anomaly detection and identifies open problems and challenges in this area.
References
More filters

Computational geometry. an introduction

TL;DR: This book offers a coherent treatment, at the graduate textbook level, of the field that has come to be known in the last decade or so as computational geometry.
Journal ArticleDOI

LOF: identifying density-based local outliers

TL;DR: This paper contends that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier, called the local outlier factor (LOF), and gives a detailed formal analysis showing that LOF enjoys many desirable properties.
Proceedings ArticleDOI

The R*-tree: an efficient and robust access method for points and rectangles

TL;DR: The R*-tree is designed which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory which clearly outperforms the existing R-tree variants.
Trending Questions (1)
When are the NBA Live 19 servers shutting down?

The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database.