Showing papers on "k-nearest neighbors algorithm published in 2000"

PDF

Open Access

Journal Article•DOI•

Efficient algorithms for mining outliers from large data sets

[...]

Sridhar Ramaswamy, Rajeev Rastogi¹, Kyuseok Shim²•Institutions (2)

16 May 2000

TL;DR: A novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor is proposed and the top n points in this ranking are declared to be outliers.

...read moreread less

Abstract: In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

...read moreread less

1,871 citations

Proceedings Article•

Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning

[...]

Mark Hall¹•Institutions (1)

University of Waikato¹

29 Jun 2000

TL;DR: In this article, a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems is described, which often outperforms the ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees.

...read moreread less

Abstract: Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees. It performs more feature selection than ReliefF does—reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

...read moreread less

1,511 citations

Journal Article•DOI•

Dimensionality reduction using genetic algorithms

[...]

Michael L. Raymer¹, William F. Punch¹, Erik D. Goodman¹, Leslie A. Kuhn¹, Anil K. Jain¹ - Show less +1 more•Institutions (1)

Michigan State University¹

01 Jul 2000-IEEE Transactions on Evolutionary Computation

TL;DR: This work presents a new approach to feature extraction in which feature selection and extraction and classifier training are performed simultaneously using a genetic algorithm, and employs this technique in combination with the k nearest neighbor classification rule.

...read moreread less

Abstract: Pattern recognition generally requires that objects be described in terms of a set of measurable features. The selection and quality of the features representing each pattern affect the success of subsequent classification. Feature extraction is the process of deriving new features from original features to reduce the cost of feature measurement, increase classifier efficiency, and allow higher accuracy. Many feature extraction techniques involve linear transformations of the original pattern vectors to new vectors of lower dimensionality. While this is useful for data visualization and classification efficiency, it does not necessarily reduce the number of features to be measured since each new feature may be a linear combination of all of the features in the original pattern vector. Here, we present a new approach to feature extraction in which feature selection and extraction and classifier training are performed simultaneously using a genetic algorithm. The genetic algorithm optimizes a feature weight vector used to scale the individual features in the original pattern vectors. A masking vector is also employed for simultaneous selection of a feature subset. We employ this technique in combination with the k nearest neighbor classification rule, and compare the results with classical feature selection and extraction techniques, including sequential floating forward feature selection, and linear discriminant analysis. We also present results for the identification of favorable water-binding sites on protein surfaces.

...read moreread less

849 citations

Journal Article•DOI•

Second nearest-neighbor modified embedded-atom-method potential

[...]

Byeong-Joo Lee¹, Michael I. Baskes²•Institutions (2)

Korea Research Institute of Standards and Science¹, Los Alamos National Laboratory²

01 Oct 2000-Physical Review B

TL;DR: The modified embedded-atom method, a first nearest-neighbor semi-empirical model for atomic potentials, can describe the physical properties of a wide range of elements and alloys with various lattice structures.

...read moreread less

Abstract: The modified embedded-atom method, a first nearest-neighbor semiempirical model for atomic potentials, can describe the physical properties of a wide range of elements and alloys with various lattice structures. However, the model is not quite successful for bcc metals in that it predicts the order among the size of low index surface energies incorrectly and that it generates a structure more stable than bcc for some bcc metals. In order to remove the problems, the formalism has been extended so that the second nearest neighbor interactions are taken into consideration. New parameters for Fe and comparisons between calculated and experimental physical properties of Fe are presented.

...read moreread less

516 citations

Proceedings Article•

What Is the Nearest Neighbor in High Dimensional Spaces

[...]

Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim

10 Sep 2000

TL;DR: A new generalized notion of nearest neighbor search is identified as the relevant problem in high dimensional space and a quality criterion is used to select relevant dimensions (projections) with respect to the given query.

...read moreread less

Abstract: Nearest neighbor search in high dimensional spaces is an interesting and important problem which is relevant for a wide variety of novel database applications. As recent results show, however, the problem is a very di cult one, not only with regards to the performance issue but also to the quality issue. In this paper, we discuss the quality issue and identify a new generalized notion of nearest neighbor search as the relevant problem in high dimensional space. In contrast to previous approaches, our new notion of nearest neighbor search does not treat all dimensions equally but uses a quality criterion to select relevant dimensions (projections) with respect to the given query. As an example for a useful quality criterion, we rate how well the data is clustered around the query point within the selected projection. We then propose an e cient and e ective algorithm to solve the generalized nearest neighbor problem. Our experiments based on a number of real and synthetic data sets show that our new approach provides new insights into the nature of nearest neighbor search on high

...read moreread less

505 citations

Journal Article•DOI•

Influence sets based on reverse nearest neighbor queries

[...]

Flip Korn¹, S. Muthukrishnan¹•Institutions (1)

AT&T Labs¹

16 May 2000

TL;DR: This paper formalizes a novel notion of influence based on reverse neighbor queries and its variants, and presents a general approach for solving RNN queries and an efficient R-tree based method for large data sets, based on this approach.

...read moreread less

Abstract: Inherent in the operation of many decision support and continuous referral systems is the notion of the “influence” of a data point on the database. This notion arises in examples such as finding the set of customers affected by the opening of a new store outlet location, notifying the subset of subscribers to a digital library who will find a newly added document most relevant, etc. Standard approaches to determining the influence set of a data point involve range searching and nearest neighbor queries.In this paper, we formalize a novel notion of influence based on reverse neighbor queries and its variants. Since the nearest neighbor relation is not symmetric, the set of points that are closest to a query point (i.e., the nearest neighbors) differs from the set of points that have the query point as their nearest neighbor (called the reverse nearest neighbors). Influence sets based on reverse nearest neighbor (RNN) queries seem to capture the intuitive notion of influence from our motivating examples.We present a general approach for solving RNN queries and an efficient R-tree based method for large data sets, based on this approach. Although the RNN query appears to be natural, it has not been studied previously. RNN queries are of independent interest, and as such should be part of the suite of available queries for processing spatial and multimedia data. In our experiments with real geographical data, the proposed method appears to scale logarithmically, whereas straightforward sequential scan scales linearly. Our experimental study also shows that approaches based on range searching or nearest neighbors are ineffective at finding influence sets of our interest.

...read moreread less

486 citations

Journal Article•DOI•

Novel variable selection quantitative structure--property relationship approach based on the k-nearest-neighbor principle

[...]

Weifan Zheng¹, Alexander Tropsha¹•Institutions (1)

University of North Carolina at Chapel Hill¹

01 Jan 2000-Journal of Chemical Information and Computer Sciences

TL;DR: A novel automated variable selection quantitative structure-activity relationship (QSAR) method, based on the kappa-nearest neighbor principle (kNN-QSar) has been developed, which implies that similar compounds display similar profiles of pharmacological activities.

...read moreread less

Abstract: A novel automated variable selection quantitative structure−activity relationship (QSAR) method, based on the K-nearest neighbor principle (kNN-QSAR) has been developed. The kNN-QSAR method explore...

...read moreread less

418 citations

Journal Article•DOI•

Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces

[...]

Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani

04 Apr 2000-SIAM Journal on Computing

TL;DR: Significantly improving and extending recent results of Kleinberg, data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension are constructed.

...read moreread less

Abstract: We address the problem of designing data structures that allow efficient search for approximate nearest neighbors. More specifically, given a database consisting of a set of vectors in some high dimensional Euclidean space, we want to construct a space-efficient data structure that would allow us to search, given a query vector, for the closest or nearly closest vector in the database. We also address this problem when distances are measured by the L1 norm and in the Hamming cube. Significantly improving and extending recent results of Kleinberg, we construct data structures whose size is polynomial in the size of the database and search algorithms that run in time nearly linear or nearly quadratic in the dimension. (Depending on the case, the extra factors are polylogarithmic in the size of the database.)

...read moreread less

300 citations

Proceedings Article•

Reverse Nearest Neighbor Queries for Dynamic Databases.

[...]

Ioana Stanoi, Divyakant Agrawal, Amr El Abbadi

01 Jan 2000

TL;DR: An algorithmic approach that is exible enough to support a larger class of RNN queries is proposed and the current method of nearest neighbor search is extended to that of conditional nearest neighbor.

...read moreread less

Abstract: In this paper we propose an algorithm for answering reverse nearest neighbor RNN queries a problem formulated only recently This class of queries is strongly related to that of nearest neighbor NN queries although the two are not necessarily complementary Unlike nearest neighbor queries RNN queries nd the set of database points that have the query point as the nearest neighbor There is no other proposal we are aware of that provides an algorithmic approach to answer RNN queries The earlier approach for RNN queries KM is based on the pre computation of neighborhood information that is organized in terms of auxiliary data structures It can be argued that the pre computation of the RNN information for all points in the database can be too restrictive In the case of dynamic databases insert and update operations are expensive and can lead to modi cations of large parts of the auxiliary data structures Also answers to RNN queries for a set of data points depend on the number of dimensions taken in considerations when initializing the data structures We propose an algorithmic approach that is exible enough to support a larger class of RNN queries and in order to support them we also extend the current method of nearest neighbor search to that of conditional nearest neighbor

...read moreread less

220 citations

Journal Article•DOI•

Content-based audio classification and retrieval using the nearest feature line method

[...]

Stan Z. Li¹•Institutions (1)

Microsoft¹

01 Sep 2000-IEEE Transactions on Speech and Audio Processing

TL;DR: The results show that the NFL-based method produces consistently better results than the NN-based and other methods.

...read moreread less

Abstract: A method is presented for content-based audio classification and retrieval. It is based on a new pattern classification method called the nearest feature line (NFL). In the NFL, information provided by multiple prototypes per class is explored. This contrasts to the nearest neighbor (NN) classification in which the query is compared to each prototype individually. Regarding audio representation, perceptual and cepstral features and their combinations are considered. Extensive experiments are performed to compare various classification methods and feature sets. The results show that the NFL-based method produces consistently better results than the NN-based and other methods. A system resulting from this work has achieved the error rate of 9.78%, as compared to that of 18.34% of a compelling existing system, as tested on a common audio database.

...read moreread less

203 citations

Proceedings Article•DOI•

PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces

[...]

Paolo Ciaccia¹, Marco Patella¹•Institutions (1)

University of Bologna¹

29 Feb 2000

TL;DR: This paper describes sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition that respects the error bound, and provides experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up the accuracy of the result.

...read moreread less

Abstract: In high-dimensional and complex metric spaces, determining the nearest neighbor (NN) of a query object q can be a very expensive task, because of the poor partitioning operated by index structures-the so-called "curse of dimensionality". This also affects approximately correct (AC) algorithms, which return as results a point whose distance from q is less than (1+/spl epsiv/) times the distance between q and its true NN. In this paper we introduce a new approach to approximate similarity search, called PAC-NN queries, where the error bound /spl epsiv/ can be exceeded with probability /spl delta/ and both /spl epsiv/ and /spl delta/ parameters can be tuned at query time to trade the quality of the result for the cost of the search. We describe sequential and index-based PAC-NN algorithms that exploit the distance distribution of the query object in order to determine a stopping condition that respects the error bound. Analysis and experimental evaluation of the sequential algorithm confirm that, for moderately large data sets and suitable /spl epsiv/ and /spl delta/ values, PAC-NN queries can be efficiently solved and the error controlled. Then, we provide experimental evidence that indexing can further speed-up the retrieval process by up to 1-2 orders of magnitude without giving up the accuracy of the result.

...read moreread less

Journal Article•DOI•

Finding curvilinear features in spatial point patterns: principal curve clustering with noise

[...]

Derek C. Stanford¹, Adrian E. Raftery²•Institutions (2)

Mathsoft¹, University of Washington²

01 Jun 2000-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The algorithm for principal curve clustering is in two steps: the first is hierarchical and agglomerative (HPCC) and the second consists of iterative relocation based on the classification EM algorithm (CEM-PCC), which is used to combine potential feature clusters and refines the results and deals with background noise.

...read moreread less

Abstract: Clustering about principal curves combines parametric modeling of noise with nonparametric modeling of feature shape. This is useful for detecting curvilinear features in spatial point patterns, with or without background noise. Applications include the detection of curvilinear minefields from reconnaissance images, some of the points in which represent false detections, and the detection of seismic faults from earthquake catalogs. Our algorithm for principal curve clustering is in two steps: The first is hierarchical and agglomerative (HPCC) and the second consists of iterative relocation based on the classification EM algorithm (CEM-PCC). HPCC is used to combine potential feature clusters, while CEM-PCC refines the results and deals with background noise. It is important to have a good starting point for the algorithm: This can be found manually or automatically using, for example, nearest neighbor clutter removal or model-based clustering. We choose the number of features and the amount of smoothing simultaneously, using approximate Bayes factors.

...read moreread less

Book•

High-dimensional computational geometry

[...]

Rajeev Motwani, Piotr Indyk

01 Jan 2000

TL;DR: This thesis shows that it is in fact possible to obtain efficient algorithms for the nearest neighbor problem and a wide range of metrics, including Euclidean, Manhattan or maximum norms and Hausdorff metrics; some of the results hold even for general metrics.

...read moreread less

Abstract: Consider the following problem: given a database of points in a multidimensional space, construct a data structure which given any query point finds the database point(s) closest to it. This problem, called nearest neighbor search has been of great importance to several areas of computer science, including pattern recognition, databases, vector compression, computational statistics and data mining. Many of the above applications involve data sets whose size and dimensionality are very large. Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearest neighbor problem is an example of a large class of proximity problems. Apart from the nearest neighbor, the class contains problems like closest pair, diameter (or furthest pair), minimum spanning tree and clustering problems. In the latter case the goal is to find a partition of points into k clusters, in order to minimize a certain function. Example functions are: the sum of the distances from each point to its nearest cluster representative (this problem is called k-median), the maximum such distance (k-center), the sum of all distances between points in same clusters (k-clustering), etc. Since these problems are ubiquitous, they have been investigated in computer science for a long while (e.g., in computational geometry). As a result of this research effort, many efficient solutions have been discovered for the case when the points lie in a space of small dimension. Unfortunately, their running time grows exponentially with the dimension. In this thesis we show that it is in fact possible to obtain efficient algorithms for the aforementioned problems, if we are satisfied with answers which are approximate. The running time of our algorithms for the aforementioned problems has only polynomial dependence on the dimension, and sublinear (for the nearest neighbor problem) or subquadratic (for the closest pair, minimum spanning tree, clustering etc.) dependence on the number of input points. These results hold for a wide range of metrics, including Euclidean, Manhattan or maximum norms and Hausdorff metrics; some of the results hold even for general metrics. We support our theoretical results with their experimental evaluation.

...read moreread less

Proceedings Article•DOI•

Deflating the dimensionality curse using multiple fractal dimensions

[...]

B.-U. Pagel, Flip Korn¹, Christos Faloutsos²•Institutions (2)

AT&T Labs¹, Carnegie Mellon University²

01 Feb 2000

TL;DR: The theoretical and empirical results show that previous worst-case analysis of nearest neighbor search in high dimensions are over-pessimistic, to the point of being unrealistic, and the performance depends critically on the intrinsic ("fractal") dimensionality as opposed to the embedding dimension that the uniformity assumption incorrectly implies.

...read moreread less

Abstract: Nearest neighbor queries are important in many settings, including spatial databases (find the k closet cities) and multimedia databases (find the k most similar images). Previous analyses have concluded that nearest neighbor search is hopeless in high dimensions, due to the notorious "curse of dimensionality". However, their precise analysis over real data sets is still an open problem. The typical and often implicit assumption in previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. We show how the Hausdorff and correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict I/O performance to within one standard deviation. The practical contributions of this work are our accurate formulas which can be used for query optimization in spatial and multimedia databases. The theoretical contribution is the 'deflation' of the dimensionality curse. Our theoretical and empirical results show that previous worst-case analysis of nearest neighbor search in high dimensions are over-pessimistic, to the point of being unrealistic. The performance depends critically on the intrinsic ("fractal") dimensionality as opposed to the embedding dimension that the uniformity assumption incorrectly implies.

...read moreread less

Journal Article•DOI•

Appearance-based hand sign recognition from intensity image sequences

[...]

Yuntao Cui¹, Juyang Weng²•Institutions (2)

Princeton University¹, Michigan State University²

01 May 2000-Computer Vision and Image Understanding

TL;DR: This approach uses multiclass, multidimensional discriminant analysis to automatically select the most discriminating linear features for gesture classification, and a recursive partition tree approximator is proposed to do classification.

...read moreread less

Journal Article•DOI•

Different Myofilament Nearest-Neighbor Interactions Have Distinctive Effects on Contractile Behavior

[...]

Maria V. Razumova¹, Maria V. Razumova², Anna E. Bukatina², Anna E. Bukatina³, Kenneth B. Campbell² - Show less +1 more•Institutions (3)

Moscow State University¹, Washington State University², Russian Academy of Sciences³

01 Jun 2000-Biophysical Journal

TL;DR: Variations in all three kinds of nearest-neighbor interactions may be responsible for a wide variety of currently unexplained observations of myofilament contractile behavior.

...read moreread less

Journal Article•DOI•

Noisy replication in skewed binary classification

[...]

Sauchi Stephen Lee¹•Institutions (1)

University of Idaho¹

28 Aug 2000-Computational Statistics & Data Analysis

TL;DR: This article showed that adding small normal noise to replicate the success in the training set could slightly improve estimates in several common classification models, namely, nearest neighbor, neural networks, classification trees, and quadratic discriminant.

...read moreread less

Content-based Classification and Retrieval of Audio Using the Nearest Feature Line Method

[...]

Stan Z. Li

01 Sep 2000

TL;DR: In this article, a method for content-based audio classification and retrieval is presented based on a new pattern classification method called the nearest feature line (NFL) in which information provided by multiple prototypes per class is explored This contrasts to the nearest neighbor (NN) classification in which the query is compared to each prototype individually.

...read moreread less

Abstract: A method is presented for content-based audio classification and retrieval It is based on a new pattern classification method called the nearest Feature Line (NFL) In the NFL, information provided by multiple prototypes per class is explored This contrasts to the nearest neighbor (NN) classification in which the query is compared to each prototype individually Regarding audio representation, perceptual and cepstral features and their combinations are considered Extensive experiments are performed to compare various classification methods and feature sets The results show that the NFL-based method produces consistently better results than the NN-based and other methods A system resulting from this work has achieved the error rate of 978%, as compared to that of 1834% of a compelling existing system, as tested on a common audio database

...read moreread less

Journal Article•DOI•

Overcoming time-varying co-channel interference using type-2 fuzzy adaptive filters

[...]

Qilian Liang¹, Jerry M. Mendel¹•Institutions (1)

University of Southern California¹

01 Dec 2000-IEEE Transactions on Circuits and Systems Ii: Analog and Digital Signal Processing

TL;DR: A method for overcoming time-varying co-channel interference (CCI) using type-2 fuzzy adaptive filters (FAF) and uses transversal equalizer and decision feedback equalizer structures to eliminate the CCI.

...read moreread less

Abstract: This paper presents a method for overcoming time-varying co-channel interference (CCI) using type-2 fuzzy adaptive filters (FAF). The type-2 FAF is realized using an unnormalized type-2 Takagi-Sugeno-Kang fuzzy logic system. A clustering method is used to adaptively design the parameters of the FAF. We use transversal equalizer and decision feedback equalizer structures to eliminate the CCI. Simulation results show that the equalizers based on type-2 FAFs perform better than the nearest neighbor classifiers or the equalizers based on type-1 FAFs when the number of co-channels is much large than 1.

...read moreread less

Proceedings Article•

Application of K-nearest neighbors algorithm on breast cancer diagnosis problem.

[...]

M. Sarkar¹, Tze-Yun Leong•Institutions (1)

National University of Singapore¹

01 Jan 2000

TL;DR: This paper addresses the Breast Cancer diagnosis problem as a pattern classification problem using the Wisconsin-Madison Breast Cancer data set and the K-nearest neighbors algorithm is employed as the classifier.

...read moreread less

Abstract: This paper addresses the Breast Cancer diagnosis problem as a pattern classification problem. Specifically, this problem is studied using the Wisconsin-Madison Breast Cancer data set. The K-nearest neighbors algorithm is employed as the classifier. Conceptually and implementation-wise, the K-nearest neighbors algorithm is simpler than the other techniques that have been applied to this problem. In addition, the Knearest neighbors algorithm produces the overall classification result 1.17% better than the best result known for this problem.

...read moreread less

Proceedings Article•DOI•

Data condensation in large databases by incremental learning with support vector machines

[...]

Pabitra Mitra, C. A. Murthy¹, Sankar K. Pal¹•Institutions (1)

Indian Statistical Institute¹

01 Sep 2000

TL;DR: Experimental results presented show that such active incremental learning enjoy superiority in terms of computation time and condensation ratio, over related methods.

...read moreread less

Abstract: An algorithm for data condensation using support vector machines (SVM) is presented. The algorithm extracts data points lying close to the class boundaries, which form a much reduced but critical set for classification. The problem of large memory requirements for training SVM in batch mode is circumvented by adopting an active incremental learning algorithm. The learning strategy is motivated from the condensed nearest neighbor classification technique. Experimental results presented show that such active incremental learning enjoy superiority in terms of computation time and condensation ratio, over related methods.

...read moreread less

Journal Article•DOI•

An integrated instance-based learning algorithm

[...]

D. Randall Wilson¹, Tony Martinez¹•Institutions (1)

Brigham Young University¹

01 Feb 2000

TL;DR: A comprehensive learning system called the Integrated Decremental Instance‐Based Learning Algorithm (IDIBL) that seeks to reduce storage, improve execution speed, and increase generalization accuracy, when compared to the basic nearest neighbor algorithm and other learning models is proposed.

...read moreread less

Abstract: The basic nearest-neighbor rule generalizes well in many domains but has several shortcomings, including inappropriate distance functions, large storage requirements, slow execution time, sensitivity to noise, and an inability to adjust its decision boundaries after storing the training data. This paper proposes methods for overcoming each of these weaknesses and combines the methods into a comprehensive learning system called the Integrated Decremental Instance-Based Learning Algorithm (IDIBL) that seeks to reduce storage, improve execution speed, and increase generalization accuracy, when compared to the basic nearest neighbor algorithm and other learning models. IDIBL tunes its own parameters using a new measure of fitness that combines confidence and cross-validation accuracy in order to avoid discretization problems with more traditional leave-one-out cross-validation. In our experiments IDIBL achieves higher generalization accuracy than other less comprehensive instance-based learning algorithms, while requiring less than one-fourth the storage of the nearest neighbor algorithm and improving execution speed by a corresponding factor. In experiments on twenty-one data sets, IDIBL also achieves higher generalization accuracy than that reported for sixteen major machine learning and neural network models.

...read moreread less

Journal Article•DOI•

A class-dependent weighted dissimilarity measure for nearest neighbor classification problems

[...]

Roberto Paredes¹, Enrique Vidal¹•Institutions (1)

Polytechnic University of Valencia¹

01 Nov 2000-Pattern Recognition Letters

TL;DR: A class-dependent weighted dissimilarity measure in vector spaces is proposed to improve the performance of the nearest neighbor (NN) classifier and an approach based on Fractional Programming is presented.

...read moreread less

Proceedings Article•DOI•

Key-frame extraction and shot retrieval using nearest feature line (NFL)

[...]

Li Zhao¹, Wei Qi², Stan Z. Li², Shiqiang Yang¹, Hong-Jiang Zhang² - Show less +1 more•Institutions (2)

Tsinghua University¹, Microsoft²

04 Nov 2000

TL;DR: This paper proposes to use the “breakpoints” of feature trajectory of a video shot as the key frames and use the lines passing through these points to represent the shot to achieve a better performance.

...read moreread less

Abstract: Query by key frame or video example is a convenient and often effective way to search in video database. This paper proposes a new approach to support such searches. The main contribution of the proposed approach is the consideration of both feature extraction and distance computation as a whole process. With a video shot represented by key-frames corresponding to feature points in a feature space, a new metric is defined to measure the distance between a query image and a shot based on the concept of Nearest Feature Line (NFL). We propose to use the “breakpoints” of feature trajectory of a video shot as the key frames and use the lines passing through these points to represent the shot. When combined with the NFL method, it helps to achieve a better performance, as evidenced by experiments.

...read moreread less

Proceedings Article•

Complete Cross-Validation for Nearest Neighbor Classifiers

[...]

Matthew D. Mullin, Rahul Sukthankar

29 Jun 2000

TL;DR: This work presents a technique for calculating the complete cross-validation for nearest-neighbor classifiers: i.e., averaging over all desired test/train partitions of data, though in effect it averages an exponential number of trials.

...read moreread less

Abstract: Cross-validation is an established technique for estimating the accuracy of a classifier and is normally performed either using a number of random test/train partitions of the data, or using kfold cross-validation. We present a technique for calculating the complete cross-validation for nearest-neighbor classifiers: i.e., averaging over all desired test/train partitions of data. This technique is applied to several common classifier variants such as K-nearest-neighbor, stratified data partitioning and arbitrary loss functions. We demonstrate, with complexity analysis and experimental timing results, that the technique can be performed in time comparable to k-fold cross-validation, though in effect it averages an exponential number of trials. We show that the results of complete cross-validation are biased equally compared to subsampling and kfold cross-validation, and there is some reduction in variance. This algorithm offers significant benefits both in terms of time and accuracy.

...read moreread less

A Nearest Trajectory Strategy for Time Series Prediction

[...]

James McNames¹•Institutions (1)

Stanford University¹

01 Jan 2000

TL;DR: A method of local modeling for predicting time series generated by nonlinear dynamic systems is proposed that incorporates a weighted Euclidean metric and a novel ρ-steps ahead crossvalidation error to assess model accuracy.

...read moreread less

Abstract: A method of local modeling for predicting time series generated by nonlinear dynamic systems is proposed that incorporates a weighted Euclidean metric and a novel ρ-steps ahead crossvalidation error to assess model accuracy. The tradeoff between the cost of computation and model accuracy is discussed in the context of optimizing model parameters. A fast nearest neighbor algorithm and a novel modification to find neighboring trajectory segments are described.

...read moreread less

Journal Article•DOI•

A new edited k-nearest neighbor rule in the pattern classification problem

[...]

Kazuo Hattori¹, Masahito Takahashi¹•Institutions (1)

Toyohashi University of Technology¹

01 Mar 2000-Pattern Recognition

TL;DR: It is shown that the rule proposed will yield good results in many pattern classi"cation problems and has been investigated using three classi'cation examples.

...read moreread less

Journal Article•DOI•

k-nearest neighbors directed noise injection in multilayer perceptron training

[...]

Marina Skurichina¹, Sarunas Raudys, Robert P. W. Duin¹•Institutions (1)

Delft University of Technology¹

01 Mar 2000-IEEE Transactions on Neural Networks

TL;DR: By both empirical as well as theoretical studies, it is shown that the -nearest neighbors directed noise injection is preferable over the Gaussian spherical noise injection for data with low intrinsic dimensionality.

...read moreread less

Abstract: The relation between classifier complexity and learning set size is very important in discriminant analysis. One of the ways to overcome the complexity control problem is to add noise to the training objects, increasing in this way the size of the training set. Both the amount and the directions of noise injection are important factors which determine the effectiveness for classifier training. In this paper the effect is studied of the injection of Gaussian spherical noise and k-nearest neighbors directed noise on the performance of multilayer perceptrons. As it is impossible to provide an analytical investigation for multilayer perceptrons, a theoretical analysis is made for statistical classifiers. The goal is to get a better understanding of the effect of noise injection on the accuracy of sample-based classifiers. By both empirical as well as theoretical studies, it is shown that the k-nearest neighbors directed noise injection is preferable over the Gaussian spherical noise injection for data with low intrinsic dimensionality.

...read moreread less

Proceedings Article•DOI•

Web page classification based on k-nearest neighbor approach

[...]

Oh-Woog Kwon¹, Jong-Hyeok Lee¹•Institutions (1)

Pohang University of Science and Technology¹

01 Nov 2000

TL;DR: To improve the performance of k-NN approach, it is supplemented with a feature selection method and a term-weighting scheme using markup tags, and reform document-document similarity measure used in vector space model.

...read moreread less

Abstract: Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web. In this paper, we propose a Web page classifier based on an adaptation of k-Nearest Neighbor (k-NN) approach. To improve the performance of k-NN approach, we supplement k-NN approach with a feature selection method and a term-weighting scheme using markup tags, and reform document-document similarity measure used in vector space model. In our experiments on a Korean commercial Web directory, our proposed methods in k-NN approach for Web page classification improved the performance of classification.

...read moreread less

Journal Article•DOI•

Fast nearest-neighbor searching for nonlinear signal processing

[...]

Christian Merkwirth¹, Ulrich Parlitz¹, Werner Lauterborn¹•Institutions (1)

University of Göttingen¹

01 Aug 2000-Physical Review E

TL;DR: A fast algorithm for exact and approximate nearest-neighbor searching is presented that is suitable for tasks encountered in nonlinear signal processing and compares the running time of the algorithm with those of two previously proposed algorithms.

...read moreread less

Abstract: A fast algorithm for exact and approximate nearest-neighbor searching is presented that is suitable for tasks encountered in nonlinear signal processing. Empirical benchmarks show that the algorithm's performance depends mainly on the (fractal) dimension D(d) of the data set, which is usually smaller than the dimension D(s) of the vector space in which the data points are embedded. We also compare the running time of our algorithm with those of two previously proposed algorithms for nearest-neighbor searching.

...read moreread less