Author

Mihael Ankerst

Bio: Mihael Ankerst is an academic researcher from Ludwig Maximilian University of Munich. The author has contributed to research in topics: Nearest neighbor search & Similarity (network science). The author has an h-index of 12 and has co-authored 14 publications receiving 5,497 citations.

Papers
Journal ArticleDOI
01 Jun 1999
TL;DR: A new algorithm is introduced for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure.
Abstract: Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only 'traditional' clustering information (e.g. representative points, arbitrarily shaped clusters), but also the intrinsic clustering structure. For medium-sized data sets, the cluster-ordering can be represented graphically, and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure, offering additional insights into the distribution and correlation of the data.
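The algorithm described here is OPTICS. As a rough sketch of the cluster-ordering idea, the following uses scikit-learn's OPTICS implementation (a later re-implementation, not the authors' original code); the data and parameter values are illustrative assumptions.

```python
# Minimal sketch: density-based cluster ordering via scikit-learn's OPTICS.
# Not the authors' code; data and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two blobs of different density plus uniform background noise.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(4.0, 1.0, size=(100, 2)),
    rng.uniform(-2.0, 6.0, size=(20, 2)),
])

optics = OPTICS(min_samples=10)   # MinPts; max_eps defaults to infinity
optics.fit(X)

# The ordering and reachability distances together encode the density-based
# clusterings for a whole range of epsilon values at once.
order = optics.ordering_
reachability = optics.reachability_[order]
# Plotting `reachability` against position in `order` yields the
# reachability plot: valleys correspond to clusters.
```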

4,020 citations

Book ChapterDOI
TL;DR: 3D shape histograms are introduced as an intuitive and powerful similarity model for 3D objects; a filter-refinement architecture efficiently supports similarity search based on quadratic forms, and the approach achieves high classification accuracy with good performance.
Abstract: Classification is one of the basic tasks of data mining in modern database applications including molecular biology, astronomy, mechanical engineering, medical imaging, and meteorology. The underlying models have to consider spatial properties such as shape or extension as well as thematic attributes. We introduce 3D shape histograms as an intuitive and powerful similarity model for 3D objects. Particular flexibility is provided by using quadratic form distance functions in order to account for errors of measurement, sampling, and numerical rounding, all of which may result in small displacements and rotations of shapes. For query processing, a general filter-refinement architecture is employed that efficiently supports similarity search based on quadratic forms. An experimental evaluation in the context of molecular biology demonstrates both the high classification accuracy of more than 90% and the good performance of the approach.
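As a sketch of the quadratic form distance mentioned above, the snippet below computes d_A(x, y) = sqrt((x - y)^T A (x - y)) between two histograms; the bin-similarity matrix A chosen here is illustrative, not the paper's exact construction.

```python
# Minimal sketch of a quadratic form distance between histograms.
# The similarity matrix A below is an assumption for illustration.
import numpy as np

def quadratic_form_distance(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)).

    A[i, j] encodes how similar bins i and j are, so small shape
    displacements that shift mass into neighboring bins incur
    only a small distance.
    """
    d = x - y
    return float(np.sqrt(d @ A @ d))

# Histograms over n bins; similarity decays with bin distance,
# which makes A symmetric positive definite.
n = 16
i, j = np.indices((n, n))
A = np.exp(-np.abs(i - j) / 2.0)

h1 = np.random.dirichlet(np.ones(n))
h2 = np.random.dirichlet(np.ones(n))
print(quadratic_form_distance(h1, h2, A))
```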

608 citations

Proceedings ArticleDOI
19 Oct 1998
TL;DR: A systematic approach is described for arranging the dimensions according to their similarity, based on heuristic algorithms; an empirical evaluation shows the high impact of similarity clustering of dimensions on the visualization results.
Abstract: The order and arrangement of dimensions (variates) is crucial for the effectiveness of a large number of visualization techniques such as parallel coordinates, scatterplots, recursive pattern, and many others. We describe a systematic approach to arranging the dimensions according to their similarity. The basic idea is to rearrange the data dimensions such that dimensions showing a similar behavior are positioned next to each other. For the similarity clustering of dimensions, we need to define similarity measures which determine the partial or global similarity of dimensions. We then consider the problem of finding an optimal one- or two-dimensional arrangement of the dimensions based on their similarity. Theoretical considerations show that both the one- and the two-dimensional arrangement problems are surprisingly hard, i.e. NP-complete. Our solution is therefore based on heuristic algorithms. An empirical evaluation using a number of different visualization techniques shows the high impact of our similarity clustering of dimensions on the visualization results.
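Since the exact arrangement problem is NP-complete, a heuristic must be used; the sketch below is one simple greedy stand-in (not necessarily one of the paper's algorithms), using absolute Pearson correlation as an assumed similarity measure.

```python
# Minimal sketch: greedy similarity-based dimension ordering.
# The similarity measure (absolute correlation) and the greedy
# heuristic are assumptions for illustration.
import numpy as np

def greedy_dimension_order(X):
    """Order the columns of X so similar dimensions end up adjacent."""
    sim = np.abs(np.corrcoef(X, rowvar=False))  # dims x dims similarity
    n = sim.shape[0]
    order = [0]                                 # arbitrarily start at dim 0
    remaining = set(range(1, n))
    while remaining:
        last = order[-1]
        # Place next the unplaced dimension most similar to the last one.
        nxt = max(remaining, key=lambda d: sim[last, d])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Usage: reorder columns before rendering, e.g., parallel coordinates.
X = np.random.rand(500, 8)
print(greedy_dimension_order(X))
```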

268 citations

Proceedings ArticleDOI
29 Oct 1995
TL;DR: This paper proposes a new visualization technique called a 'recursive pattern', which has been developed for visualizing large amounts of multidimensional data, based on a generic recursive scheme which generalizes a wide range of pixel-oriented arrangements for displaying large data sets.
Abstract: An important goal of visualization technology is to support the exploration and analysis of very large amounts of data. In this paper, we propose a new visualization technique called the 'recursive pattern', which has been developed for visualizing large amounts of multidimensional data. The technique is based on a generic recursive scheme which generalizes a wide range of pixel-oriented arrangements for displaying large data sets. By instantiating the technique with adequate data- and application-dependent parameters, the user may greatly influence the structure of the resulting visualizations. Since the technique uses one pixel for presenting each data value, the amount of data which can be displayed is limited only by the resolution of current display technology and by the limits of human perception. Besides describing the basic idea of the 'recursive pattern' technique, we provide several examples of useful parameter settings for the various recursion levels. We further show that our 'recursive pattern' technique is particularly advantageous for the large class of data sets which have a natural order according to one dimension (e.g. time series data). We demonstrate the usefulness of our technique with a stock market application.
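As a rough sketch of the recursive scheme, the function below maps the i-th data value to a pixel position, with each recursion level arranging w[level] x h[level] blocks of the level below in a back-and-forth order; the parameter names and the exact snake order are illustrative assumptions, not the paper's precise layout.

```python
# Minimal sketch of a recursive, pixel-oriented layout in the spirit of
# the 'recursive pattern' technique; details are illustrative assumptions.
def recursive_pattern_xy(index, w, h):
    """Map the index-th data value to an (x, y) pixel position.

    w[i], h[i] give the width and height (in sub-blocks) of recursion
    level i; rows at each level alternate direction (snake order).
    """
    x = y = 0
    block_w = block_h = 1
    for wi, hi in zip(w, h):
        cells = wi * hi
        pos = index % cells          # position within this level's block
        index //= cells
        row, col = divmod(pos, wi)
        if row % 2 == 1:             # reverse every other row
            col = wi - 1 - col
        x += col * block_w
        y += row * block_h
        block_w *= wi
        block_h *= hi
    return x, y

# Usage: e.g., daily values laid out as 7x1 weeks, 5 weeks per block.
w, h = [7, 1], [1, 5]
coords = [recursive_pattern_xy(i, w, h) for i in range(35)]
print(coords[:8])
```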

264 citations

01 Jan 1996
TL;DR: The first results show that the 'circle segments' technique is very powerful for visualizing large amounts of data, providing more expressive visualizations than other well-known techniques such as the 'recursive pattern' technique and traditional line graphs.
Abstract: In this paper, we describe a novel technique for visualizing large amounts of high-dimensional data, called 'circle segments'. The technique uses one colored pixel per data value and can therefore be classified as a pixel-per-value technique [Kei 96]. The basic idea of the 'circle segments' visualization technique is to display the data dimensions as segments of a circle. If the data consists of k dimensions, the circle is partitioned into k segments, each representing one data dimension. Inside the segments, the data values belonging to one dimension are arranged from the center of the circle to the outside in a back-and-forth manner orthogonal to the line that halves the segment. Our first results show that the 'circle segments' technique is very powerful for visualizing large amounts of data, providing more expressive visualizations than other well-known techniques such as the 'recursive pattern' technique and traditional line graphs.
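As a simplified sketch of the circle-segments layout, the code below assigns each dimension an angular wedge and colors one point per data value; it spreads values outward from the center but does not reproduce the exact back-and-forth pixel arrangement, so treat it as an approximation.

```python
# Simplified sketch of the 'circle segments' idea: k dimensions become
# k wedges of a circle, one colored point per data value. The exact
# back-and-forth pixel arrangement is approximated, not reproduced.
import numpy as np
import matplotlib.pyplot as plt

def circle_segments(X, values_per_ring=32):
    n, k = X.shape                      # n records, k dimensions
    wedge = 2 * np.pi / k
    _, ax = plt.subplots(subplot_kw={"projection": "polar"})
    idx = np.arange(n)
    for dim in range(k):
        # Sweep across the wedge, moving one ring outward per sweep.
        theta = dim * wedge + (idx % values_per_ring) / values_per_ring * wedge
        r = 1 + idx // values_per_ring
        ax.scatter(theta, r, c=X[:, dim], s=2, cmap="viridis")
    ax.set_axis_off()
    plt.show()

circle_segments(np.random.rand(2000, 6))
```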

220 citations


Cited by
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia, and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, and Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both the tried-and-true techniques of today and methods at the leading edge of contemporary research. * Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects * Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods * Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, and visualization.

20,196 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

01 Jan 2002

9,314 citations

Journal ArticleDOI
16 May 2000
TL;DR: This paper contends that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier, called the local outlier factor (LOF), and gives a detailed formal analysis showing that LOF enjoys many desirable properties.
Abstract: For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
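The method described here is LOF. A minimal sketch using scikit-learn's LocalOutlierFactor (a later implementation of the idea, not the authors' code) follows; the data and the n_neighbors value are illustrative.

```python
# Minimal sketch: local outlier factor scores via scikit-learn.
# Data and the n_neighbors value are illustrative assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),    # dense inlier cluster
    np.array([[6.0, 6.0], [7.0, -5.0]]),    # two isolated points
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 marks predicted outliers

# negative_outlier_factor_ is -LOF: lower means more outlying.
scores = -lof.negative_outlier_factor_
print(scores[-2:])   # the isolated points score well above 1
```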

5,248 citations