scispace - formally typeset
Search or ask a question
Author

Sridhar Ramaswamy

Other affiliations: Telcordia Technologies
Bio: Sridhar Ramaswamy is an academic researcher from Alcatel-Lucent. The author has contributed to research in topics: Association rule learning & Data set. The author has an hindex of 8, co-authored 10 publications receiving 2004 citations. Previous affiliations of Sridhar Ramaswamy include Telcordia Technologies.

Papers
More filters
Journal ArticleDOI
16 May 2000
TL;DR: A novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor is proposed and the top n points in this ranking are declared to be outliers.
Abstract: In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

1,871 citations

Journal ArticleDOI
01 Jun 1999
TL;DR: This paper examines selectivity estimation in the context of Geographic Information Systems, which manage spatial data such as points, lines, poly-lines and polygons, and identifies a BSP based partitioning that is consistently provides the most accurate selectivity estimates for spatial queries.
Abstract: Selectivity estimation of queries is an important and well-studied problem in relational database systems. In this paper, we examine selectivity estimation in the context of Geographic Information Systems, which manage spatial data such as points, lines, poly-lines and polygons. In particular, we focus on point and range queries over two-dimensional rectangular data. We propose several techniques based on using spatial indices, histograms, binary space partitionings (BSPs), and the novel notion of spatial skew. Our techniques carefully partition the input rectangles into subsets and approximate each partition accurately. We present a detailed experimental study comparing the proposed techniques and the best known sampling and parametric techniques. We evaluate them using synthetic as well as real-life TIGER datasets. Based on our experiments, we identify a BSP based partitioning that we call Min-Skew which consistently provides the most accurate selectivity estimates for spatial queries. The Min-Skew partitioning can be constructed efficiently, occupies very little space, and provides accurate selectivity estimates over a broad range of spatial queries.

235 citations

Proceedings ArticleDOI
22 May 1995
TL;DR: This work has developed a technique, called indexing by class-division (CD), which it is believed can be used as a practical alternative to CH and an optimized implementation and experimental validation of CD's average-case performance are presented.
Abstract: Indexing a class hierarchy, in order to efficiently search or update the objects of a class according to a (range of) value(s) of an attribute, impacts OODB performance heavily. For this indexing problem, most systems use the class hierarchy index (CH) technique of [15] implemented using B+-trees. Other techniques, such as those of [14, 18,31], can lead to improved average-case performance but involve the implementation of new data-structures. As a special form of external dynamic two-dimensional range searching, this OODB indexing problem is solvable within reasonable worst-case bounds [12]. Based on this insight, we have developed a technique, called indexing by class-division (CD), which we believe can be used as a practical alternative to CH. We present an optimized implementation and experimental validation of CD's average-case performance. The main advantages of the CD technique are: (1) CD is an extension of CH that provides a significant speed-up over CH for a wide spectrum of range queries--this speed-up is at least linear in the number of classes queried for uniform data and larger otherwise; and (2) CD queries, updates and concurrent use are implementable using existing B+-tree technology. The basic idea of class-division involves a time-space tradeoff and CD requires some space and update overhead in comparison to CH. In practice, this overhead is a small factor (2 to 3) and, in worst-case, is bounded by the depth of the hierarchy and the logarithm of its size.

49 citations

Patent
16 Feb 1999
TL;DR: In this paper, a system and method for discovering association rules that display regular cyclic variation over time is disclosed. But the method is based on the interaction between association rules and time, which reduces the amount of time needed to find cyclic association rules.
Abstract: A system and method for discovering association rules that display regular cyclic variation over time is disclosed. Such association rules may apply over daily, weekly or monthly (or other) cycles of sales data or the like. A first technique, referred to as the sequential algorithm, treats association rules and cycles relatively independently. Based on the interaction between association rules and time, we employ a new technique called cycle pruning, which reduces the amount of time needed to find cyclic association rules. A second algorithm, the interleaved algorithm, uses cycle pruning and other optimization techniques for discovering cyclic association rules with reduced overhead.

26 citations

Patent
18 Nov 1999
TL;DR: In this article, a new method for identifying a predetermined number of data points of interest in a large data set is proposed, which is ranked in relation to the distance to their neighboring points.
Abstract: A new method for identifying a predetermined number of data points of interest in a large data set. The data points of interest are ranked in relation to the distance to their neighboring points. The method employs partition-based detection algorithms to partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to eliminate those partitions that do contain the predetermined number of data points of interest. The data points of interest are then computed from the remaining partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as the points of interest, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.

18 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations

Journal ArticleDOI
16 May 2000
TL;DR: This paper contends that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier, called the local outlier factor (LOF), and gives a detailed formal analysis showing that LOF enjoys many desirable properties.
Abstract: For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms we show that our approach of finding local outliers can be practical.

5,248 citations

Journal ArticleDOI
TL;DR: A survey of contemporary techniques for outlier detection is introduced and their respective motivations are identified and distinguish their advantages and disadvantages in a comparative review.
Abstract: Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

3,235 citations

Book ChapterDOI
Pavel Berkhin1
01 Jan 2006
TL;DR: This survey concentrates on clustering algorithms from a data mining perspective as a data modeling technique that provides for concise summaries of the data.
Abstract: Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.

3,047 citations

Journal ArticleDOI
TL;DR: This paper provides a comprehensive survey of anomaly detection systems and hybrid intrusion detection systems of the recent past and present and discusses recent technological trends in anomaly detection and identifies open problems and challenges in this area.

1,433 citations