
Showing papers on "k-nearest neighbors algorithm" published in 2003


Journal ArticleDOI
TL;DR: A theoretical explanation for the observed behavior of the Vicsek model, which proves to be a graphic example of a switched linear system which is stable, but for which there does not exist a common quadratic Lyapunov function.
Abstract: In a recent Physical Review Letters article, Vicsek et al. propose a simple but compelling discrete-time model of n autonomous agents (i.e., points or particles) all moving in the plane with the same speed but with different headings. Each agent's heading is updated using a local rule based on the average of its own heading plus the headings of its "neighbors." In their paper, Vicsek et al. provide simulation results which demonstrate that the nearest neighbor rule they are studying can cause all agents to eventually move in the same direction despite the absence of centralized coordination and despite the fact that each agent's set of nearest neighbors changes with time as the system evolves. This paper provides a theoretical explanation for this observed behavior. In addition, convergence results are derived for several other similarly inspired models. The Vicsek model proves to be a graphic example of a switched linear system which is stable, but for which there does not exist a common quadratic Lyapunov function.

8,233 citations
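
A minimal simulation sketch of the nearest neighbor rule described above. The radius, speed, agent count, and periodic box are illustrative choices rather than values from the paper, and the circular mean used below is one common way to average headings (the model analyzed in the paper averages heading angles directly).

```python
import numpy as np

def vicsek_step(pos, theta, r=1.0, v=0.03, box=5.0):
    """One synchronous update: each agent adopts the average heading
    of all agents within radius r (its 'neighbors', itself included)."""
    new_theta = np.empty(len(pos))
    for i in range(len(pos)):
        neighbors = np.linalg.norm(pos - pos[i], axis=1) <= r
        # Average via the angle of the mean unit vector, which avoids
        # wrap-around problems at +/- pi.
        new_theta[i] = np.arctan2(np.sin(theta[neighbors]).mean(),
                                  np.cos(theta[neighbors]).mean())
    vel = v * np.column_stack([np.cos(new_theta), np.sin(new_theta)])
    return (pos + vel) % box, new_theta  # periodic boundary

rng = np.random.default_rng(0)
pos = rng.uniform(0, 5.0, size=(50, 2))
theta = rng.uniform(-np.pi, np.pi, size=50)
for _ in range(200):
    pos, theta = vicsek_step(pos, theta)
print("heading spread after 200 steps:", theta.std())  # small spread = alignment
```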


Proceedings ArticleDOI
13 Oct 2003
TL;DR: A new algorithm is introduced that learns a set of hashing functions that efficiently index examples relevant to a particular estimation task, and can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.
Abstract: Example-based methods are effective for parameter estimation problems when the underlying system is simple or the dimensionality of the input is low. For complex and high-dimensional problems such as pose estimation, the number of required examples and the computational complexity rapidly become prohibitively high. We introduce a new algorithm that learns a set of hashing functions that efficiently index examples relevant to a particular estimation task. Our algorithm extends locality-sensitive hashing, a recently developed method to find approximate neighbors in time sublinear in the number of examples. This method depends critically on the choice of hash functions that are optimally relevant to a particular estimation problem. Experiments demonstrate that the resulting algorithm, which we call parameter-sensitive hashing, can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.

929 citations
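
The paper's contribution is to learn hash functions matched to the estimation task; the sketch below shows only the unlearned locality-sensitive hashing machinery it extends (random-hyperplane hashes over a single table), to illustrate how bucketed indexing avoids scanning the whole example database. Class and parameter names are illustrative.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Hash key = sign pattern of projections onto random hyperplanes."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.table = defaultdict(list)

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def index(self, data):
        self.data = np.asarray(data)
        for i, x in enumerate(self.data):
            self.table[self._key(x)].append(i)

    def query(self, q):
        # Only the bucket sharing q's sign pattern is searched;
        # fall back to brute force if that bucket is empty.
        cand = np.asarray(list(self.table.get(self._key(q),
                                              range(len(self.data)))))
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return int(cand[d.argmin()])
```

In practice several tables with independent hyperplanes are used to boost recall; parameter-sensitive hashing replaces the random hyperplanes with learned, task-relevant ones.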


Journal ArticleDOI
TL;DR: An empirical study is conducted to examine the pros and cons of these search methods, give some guidelines on choosing a search method, and compare the classifier error rates before and after feature selection.

846 citations


Proceedings Article
01 Jan 2003
TL;DR: A novel clustering technique in which a shared nearest neighbor definition of similarity addresses problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size; a number of optimizations that allow the algorithm to handle large data sets are also discussed.
Abstract: Finding clusters in data, especially high dimensional data, is challenging when the clusters are of widely differing shapes, sizes, and densities, and when the data contains noise and outliers. We present a novel clustering technique that addresses these issues. Our algorithm first finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this definition of similarity, our algorithm identifies core points and then builds clusters around the core points. The use of a shared nearest neighbor definition of similarity alleviates problems with varying densities and high dimensionality, while the use of core points handles problems with shape and size. While our algorithm can find the "dense" clusters that other clustering algorithms find, it also finds clusters that these approaches overlook, i.e., clusters of low or medium density which represent relatively uniform regions "surrounded" by non-uniform or higher density areas. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means, DBSCAN, CURE) on a variety of data sets: KDD Cup '99 network intrusion data, NASA Earth science time series data, two-dimensional point sets, and documents. The run-time complexity of our technique is O(n²) if the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large data sets efficiently.

715 citations
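
A bare-bones sketch of the shared nearest neighbor similarity the algorithm is built on: two points are similar to the extent that their k-nearest-neighbor lists overlap. The choice of k is illustrative, and the quadratic loop mirrors the O(n²) cost the abstract mentions for constructing the similarity matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    """similarity[i, j] = number of k-nearest neighbors shared by i and j."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self-neighbor
    neighbor_sets = [set(row) for row in idx]
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(neighbor_sets[i] & neighbor_sets[j])
    return sim
```

Core points are then those with many high-SNN-similarity neighbors (high SNN density), and clusters are grown around them, as the abstract describes.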


Journal ArticleDOI
TL;DR: There is additional evidence that there exists no correlation between the values of q² for the training set and accuracy of prediction (R²) for the test set, and it is argued that this observation is a general property of any QSAR model developed with LOO cross-validation.
Abstract: Quantitative Structure–Activity Relationship (QSAR) models are used increasingly to screen chemical databases and/or virtual chemical libraries for potentially bioactive molecules. These developments emphasize the importance of rigorous model validation to ensure that the models have acceptable predictive power. Using the k nearest neighbors (kNN) variable selection QSAR method for the analysis of several datasets, we have demonstrated recently that the widely accepted leave-one-out (LOO) cross-validated R² (q²) is an inadequate characteristic to assess the predictive ability of the models [Golbraikh, A., Tropsha, A. Beware of q²! J. Mol. Graphics Mod. 20, 269-276 (2002)]. Herein, we provide additional evidence that there exists no correlation between the values of q² for the training set and the accuracy of prediction (R²) for the test set, and argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to the division of experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. We formulate a set of general criteria for the evaluation of predictive power of QSAR models.

591 citations
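
Since the contrast between LOO q² and external R² is the crux of the argument, here is a worked sketch of both statistics, with a generic kNN regressor standing in for the paper's kNN variable-selection QSAR models; function names and k are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor

def loo_q2(X, y, k=3):
    """q^2 = 1 - PRESS / SS over leave-one-out predictions (X, y: NumPy arrays)."""
    preds = np.empty(len(y))
    for train, test in LeaveOneOut().split(X):
        model = KNeighborsRegressor(n_neighbors=k).fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    press = np.sum((y - preds) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def external_r2(model, X_test, y_test):
    """R^2 of a trained model on a held-out, rationally selected test set."""
    resid = y_test - model.predict(X_test)
    return 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_test.mean()) ** 2)
```

The paper's point is that a high loo_q2 on the training set says little about external_r2 on the test set.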


Journal ArticleDOI
TL;DR: The results of handwritten digit recognition on well-known image databases using state-of-the-art feature extraction and classification techniques are competitive with the best previously reported on the same databases.

545 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: This work proposes a novel approach to performing efficient similarity search and classification in high dimensional data and proves that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor.
Abstract: We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors. One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. Also, we extend our approach to deal with the k nearest neighbors. We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed---similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency.

442 citations
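
A sketch of the instantiation described in the abstract, assuming its simplest form: every voter ranks the entire database by proximity along its own random projection line, and the element with the best median rank wins. The real algorithm consumes the rankings incrementally (which is what lets it touch only a few percent of the data) rather than materializing them as below.

```python
import numpy as np

def median_rank_nn(db, q, n_voters=20, seed=0):
    """db: (n, d) array of database vectors; q: (d,) query vector."""
    rng = np.random.default_rng(seed)
    n, d = db.shape
    ranks = np.empty((n_voters, n))
    for v in range(n_voters):
        line = rng.normal(size=d)                    # this voter's random line
        dist_on_line = np.abs(db @ line - q @ line)  # proximity of projections
        order = np.argsort(dist_on_line)
        ranks[v, order] = np.arange(n)               # rank of each element
    return int(np.argmin(np.median(ranks, axis=0)))  # best median rank
```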


Journal ArticleDOI
TL;DR: The experimental results indicate that the classification accuracy is increased significantly under parallel feature fusion and also demonstrate that the developed parallel fusion is more effective than the classical serial feature fusion.

418 citations


Journal ArticleDOI
TL;DR: An abstract framework for integrating multiple feature spaces in the k-means clustering algorithm is presented and the effectiveness of feature weighting in clustering on several different application domains is demonstrated.
Abstract: Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.

414 citations
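
A sketch of the convex combination at the heart of steps (ii)-(iv): each feature space contributes its own distortion, and the distortions are mixed with convex weights. Squared Euclidean distortion in every space and fixed weights are illustrative simplifications; the paper assigns per-space distortion measures and searches over the weighting.

```python
import numpy as np

def convex_kmeans_assign(spaces, centers, alpha):
    """spaces:  list of (n, d_m) arrays, one per feature space.
    centers: list of (k, d_m) arrays of cluster centers, one per space.
    alpha:   convex weights (non-negative, summing to 1), one per space."""
    n, k = spaces[0].shape[0], centers[0].shape[0]
    total = np.zeros((n, k))
    for X, C, a in zip(spaces, centers, alpha):
        # squared Euclidean distortion within this feature space
        total += a * ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return total.argmin(axis=1)  # each object joins its cheapest cluster
```

The usual k-means alternation then follows: recompute per-space centers from the assignments, and (in the paper) pick the weighting whose clustering best trades within-cluster dispersion against between-cluster dispersion.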



Journal ArticleDOI
TL;DR: A feature extraction method is presented that uses an error estimation equation based on the Bhattacharyya distance: classification errors in the transformed feature space, estimated with this equation, serve as the criterion for feature extraction.

Journal ArticleDOI
TL;DR: The minimum classification error (MCE) training algorithm (originally proposed for optimizing classifiers) is investigated for feature extraction, and a generalized MCE (GMCE) training algorithm is proposed to remedy the shortcomings of the MCE training algorithm.

Journal ArticleDOI
TL;DR: In this article, the kth nearest neighbor distances between the n sample points, where k (< n − 1) is a fixed positive integer, are used to estimate the entropy of internal rotation in the methanol molecule and of diethyl ether.
Abstract: Motivated by problems in the molecular sciences, we introduce new nonparametric estimators of entropy which are based on the kth nearest neighbor distances between the n sample points, where k (< n – 1) is a fixed positive integer. These provide competing estimators to an estimator proposed by Kozachenko and Leonenko (1987), which is based on the first nearest neighbor distances of the sample points. These estimators are helpful in the evaluation of entropies of random vectors. We establish the asymptotic unbiasedness and consistency of the proposed estimators. For some standard distributions, we also investigate their performance for finite sample sizes using Monte Carlo simulations. The proposed estimators are applied to estimate the entropy of internal rotation in the methanol molecule, which can be characterized by a one-dimensional random vector, and of diethyl ether, which is described by a four-dimensional random vector.
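
A sketch of a kth nearest neighbor entropy estimator in the spirit of the paper. This is one common form of such an estimator; normalization conventions vary across the literature, so treat the constants as illustrative rather than as the paper's exact proposal.

```python
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def knn_entropy(X, k=3):
    """Entropy (nats) of a sample X of shape (n, d) from kth-NN distances."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    eps = dist[:, k]                                       # distance to kth neighbor
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log unit-ball volume
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

# Sanity check against a distribution with known entropy:
rng = np.random.default_rng(0)
print(knn_entropy(rng.normal(size=(5000, 1))))  # ~0.5*ln(2*pi*e) = 1.4189
```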

Proceedings ArticleDOI
19 Jun 2003
TL;DR: This paper compares two algorithms for Multiple Target Tracking, using the Global Nearest Neighbor (GNN) and Suboptimal Nearest Neighbor (SNN) approaches respectively, and results reveal that in some cases the GNN approach gives a better solution than the SNN approach.
Abstract: This paper compares two algorithms for Multiple Target Tracking (MTT), using the Global Nearest Neighbor (GNN) and Suboptimal Nearest Neighbor (SNN) approach respectively. For both algorithms the observations are divided into clusters to reduce computational effort. For each cluster the assignment problem is solved using the Munkres algorithm or according to SNN rules. Results reveal that in some cases the GNN approach gives a better solution than the SNN approach. The computational time needed to solve the assignment problem using the Munkres algorithm is studied, and the results show that it is suitable for real-time implementations.
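
A sketch of the GNN data association step, using SciPy's Hungarian/Munkres implementation to solve the per-cluster assignment problem. The Euclidean cost matrix is an illustrative choice; real trackers typically use gated statistical distances.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gnn_assign(tracks, observations):
    """tracks: (t, d) predicted track positions; observations: (o, d)
    measurements in the same cluster. Returns (track, observation) pairs."""
    cost = np.linalg.norm(tracks[:, None, :] - observations[None, :, :], axis=2)
    row, col = linear_sum_assignment(cost)  # Munkres: minimize total cost
    return list(zip(row, col))
```

The suboptimal SNN alternative would instead greedily pair each track with its nearest unassigned observation, which is cheaper but can miss the globally best assignment, matching the comparison in the abstract.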

Journal ArticleDOI
TL;DR: An embedding technique to transform a road network to a high dimensional space in order to utilize computationally simple Minkowski metrics for distance measurement is applied and the Chessboard distance metric (L∞) in the embedding space preserves the ordering of the distances between a point and its neighbors more precisely.
Abstract: A very important class of queries in GIS applications is the class of K-nearest neighbor queries. Most of the current studies on the K-nearest neighbor queries utilize spatial index structures and hence are based on the Euclidean distances between the points. In real-world road networks, however, the shortest distance between two points depends on the actual path connecting the points and cannot be computed accurately using one of the Minkowski metrics. Thus, the Euclidean distance may not properly approximate the real distance. In this paper, we apply an embedding technique to transform a road network to a high dimensional space in order to utilize computationally simple Minkowski metrics for distance measurement. Subsequently, we extend our approach to dynamically transform new points into the embedding space. Finally, we propose an efficient technique that can find the actual shortest path between two points in the original road network using only the embedding space. Our empirical experiments indicate that the Chessboard distance metric (L∞) in the embedding space preserves the ordering of the distances between a point and its neighbors more precisely as compared to the Euclidean distance in the original road network.
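
A sketch of the general idea, assuming a simple Lipschitz-style embedding: each node's coordinates are its network distances to a few reference nodes, and points are then compared with the Chessboard (L∞) metric in the embedding space. The paper's actual construction and its handling of newly inserted points are more elaborate; the reference-node choice here is illustrative.

```python
import networkx as nx
import numpy as np

def embed(G, nodes, refs):
    """Coordinates of each node = shortest-path distances to reference nodes."""
    coords = {v: [] for v in nodes}
    for r in refs:
        dist = nx.single_source_dijkstra_path_length(G, r, weight="weight")
        for v in nodes:
            coords[v].append(dist[v])
    return {v: np.array(c) for v, c in coords.items()}

def chessboard(u, v, coords):
    """L-infinity distance in the embedding space."""
    return float(np.max(np.abs(coords[u] - coords[v])))
```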

Posted Content
TL;DR: An improved kNN algorithm is proposed, which uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories, and is promising for cases where estimating the parameter k via cross-validation is not allowed.
Abstract: k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k nearest documents to the test one in the training set are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class distribution in the training set is uneven; some classes may have more samples than others. Therefore, the system performance is very sensitive to the choice of the parameter k, and it is very likely that a fixed k value will result in a bias toward large categories. To deal with these problems, we propose an improved kNN algorithm, which uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories. More samples (nearest neighbors) will be used for deciding whether a test document should be classified to a category which has more samples in the training set. Preliminary experiments on Chinese text categorization show that our method is less sensitive to the parameter k than the traditional one, and that it can properly classify documents belonging to smaller classes with a large k. The method is promising for cases where estimating the parameter k via cross-validation is not allowed.
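
A sketch of the core idea: give each category its own neighbor quota, larger for better-represented categories, instead of one fixed k. The proportional quota and the per-category score below are illustrative choices, not the paper's exact formulas.

```python
from collections import Counter

def adaptive_knn_predict(neighbor_labels, train_label_counts, k_max=50):
    """neighbor_labels: labels of the k_max nearest training documents,
    closest first. train_label_counts: {category: #training samples}."""
    n_train = sum(train_label_counts.values())
    scores = {}
    for c, n_c in train_label_counts.items():
        k_c = max(1, round(k_max * n_c / n_train))  # per-category quota
        # score = share of category c among its own k_c nearest neighbors
        scores[c] = Counter(neighbor_labels[:k_c])[c] / k_c
    return max(scores, key=scores.get)
```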

Journal ArticleDOI
TL;DR: It is observed that when the authors seek a linear representation adapted to improve NN performance, what they obtain is, not surprisingly, quite close to NDA.

Journal ArticleDOI
TL;DR: In this article, the Fierz-Pauli Lagrangian model with multiple interacting massive gravitons is considered, and it is shown that any model with only nearest neighbor interactions is doomed.
Abstract: It would be extremely useful to know whether a particular low energy effective theory might have come from a compactification of a higher dimensional space. Here, this problem is approached from the ground up by considering theories with multiple interacting massive gravitons. It is actually very difficult to construct discrete gravitational dimensions which have a local continuum limit. In fact, any model with only nearest neighbor interactions is doomed. If we could find a non-linear extension for the Fierz-Pauli Lagrangian for a graviton of mass $m_g$ which does not break down until the scale $\Lambda_2 = \sqrt{m_g M_{\mathrm{Pl}}}$, this could be used to construct a large class of models whose continuum limit is local in the extra dimension. But this is shown to be impossible: a theory with a single graviton must break down by $\Lambda_3 = (m_g^2 M_{\mathrm{Pl}})^{1/3}$. Next, we look at how the discretization prescribed by the truncation of the Kaluza-Klein tower of an honest extra dimension raises the scale of strong coupling. It dictates an intricate set of interactions among various fields which conspire to soften the strongest scattering amplitudes and allow for a local continuum limit, at least at the tree level. A number of candidate symmetries associated with locality in the discretized dimension are also discussed.

Book ChapterDOI
24 Jul 2003
TL;DR: The paper describes a fast system for appearance based image recognition that uses local invariant descriptors and efficient nearest neighbor search to overcome the drawbacks of most binary tree-like indexing techniques, namely the high complexity in high dimensional data sets and the boundary problem.
Abstract: The paper describes a fast system for appearance based image recognition. It uses local invariant descriptors and efficient nearest neighbor search. First, local affine invariant regions are found nested at multiscale intensity extrema. These regions are characterized by nine generalized color moment invariants. A novel method called HPAT (hyper-polyhedron with adaptive threshold) is introduced for efficient localization of the nearest neighbor in feature space. The invariants make the method robust against changing illumination and viewpoint. The locality helps to resolve occlusions. The proposed indexing method overcomes the drawbacks of most binary tree-like indexing techniques, namely the high complexity in high dimensional data sets and the boundary problem. The database representation is very compact and retrieval is close to real time on a standard PC. The performance of the proposed method is demonstrated on a public database containing 1005 images of urban scenes. Experiments with an image database containing objects are also presented.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed nearest-neighbor chain (NNC) based approach achieves improved accuracy in estimating document image skew angle and has the advantage of being language independent.

Patent
09 Jul 2003
TL;DR: In this paper, audio/visual data is classified into semantic classes such as News, Sports, Music video or the like by providing class models for each class and comparing input audio/visual data to the models.
Abstract: Audio/visual data is classified into semantic classes such as News, Sports, Music video or the like by providing class models for each class and comparing input audio/visual data to the models. The class models are generated by extracting feature vectors from training samples, and then subjecting the feature vectors to kernel discriminant analysis or principal component analysis to give discriminatory basis vectors. These vectors are then used to obtain further feature vectors of much lower dimension than the original feature vectors, which may then be used directly as a class model, or used to train a Gaussian Mixture Model or the like. During classification of unknown input data, the same feature extraction and analysis steps are performed to obtain the low-dimensional feature vectors, which are then fed into the previously created class models to identify the data genre.
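
A sketch of the described pipeline, assuming PCA as the dimensionality reduction step and one Gaussian mixture model per genre; extraction of the raw audio/visual feature vectors is taken as already done, and all parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_genre_models(features_by_class, n_components=10, n_mix=4):
    """features_by_class: {genre: (n_i, d) array of training feature vectors}."""
    pca = PCA(n_components=n_components).fit(
        np.vstack(list(features_by_class.values())))
    models = {c: GaussianMixture(n_components=n_mix).fit(pca.transform(f))
              for c, f in features_by_class.items()}
    return pca, models

def classify(pca, models, feats):
    z = pca.transform(feats)  # same extraction/analysis steps as in training
    # pick the genre whose model assigns the clip the highest likelihood
    return max(models, key=lambda c: models[c].score(z))
```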

Proceedings ArticleDOI
20 May 2003
TL;DR: This paper adopts two techniques, a matrix conversion method for similarity measurement and an instance selection method, and presents an improved collaborative filtering algorithm based on them that shows satisfactory accuracy and performance.
Abstract: Collaborative filtering has been very successful in both research and applications such as information filtering and E-commerce. The k-Nearest Neighbor (KNN) method is a popular way of realizing it. Its key technique is to find the k nearest neighbors of a given user to predict his interests. However, this method suffers from two fundamental problems: sparsity and scalability. In this paper, we present our solutions to these two problems. We adopt two techniques: a matrix conversion method for similarity measurement and an instance selection method. We then present an improved collaborative filtering algorithm based on these two methods. In contrast with existing collaborative filtering algorithms, our method shows satisfactory accuracy and performance.
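
For reference, a sketch of the plain user-based KNN prediction such systems build on: predict a rating as a similarity-weighted average over the k most similar users who rated the item. The paper's matrix conversion and instance selection refinements are not reproduced here.

```python
import numpy as np

def predict_rating(ratings, user, item, k=5):
    """ratings: (n_users, n_items) array, 0 meaning 'unrated'."""
    rated = ratings[:, item] > 0
    rated[user] = False
    candidates = np.where(rated)[0]
    # cosine similarity between the target user and each candidate
    sims = np.array([
        ratings[user] @ ratings[v]
        / (np.linalg.norm(ratings[user]) * np.linalg.norm(ratings[v]) + 1e-12)
        for v in candidates
    ])
    order = np.argsort(sims)[-k:]  # the k most similar raters
    top, w = candidates[order], sims[order]
    return float(w @ ratings[top, item] / (w.sum() + 1e-12))
```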

Journal ArticleDOI
TL;DR: A new generalization of the rank nearest neighbor (RNN) rule for multivariate data for diagnosis of breast cancer is proposed; the computational complexity of the proposed k-RNN is much less than that of the conventional k-NN rule.

Proceedings ArticleDOI
27 Dec 2003
TL;DR: This paper proposes a simple yet highly accurate system for the recognition of unconstrained handwritten numerals and illustrates how the basic CL implementation can be extended and used in conjunction with a multilayer perceptron neural network classifier to increase the recognition rate to 98%.
Abstract: This paper proposes a simple yet highly accurate system for the recognition of unconstrained handwritten numerals. It starts with an examination of the basic characteristic loci (CL) features used along with a nearest neighbor classifier, achieving a recognition rate of 90.5%. We then illustrate how the basic CL implementation can be extended and used in conjunction with a multilayer perceptron neural network classifier to increase the recognition rate to 98%. This proposed recognition system was tested on a totally unconstrained handwritten numeral database while training it with only 600 samples exclusive from the test set. An accuracy exceeding 98% is also expected if a larger training set is used. Lastly, to demonstrate the effectiveness of the system, its performance is also compared to that of some other common recognition schemes. These systems use moment invariants as features along with nearest neighbor classification schemes.

Journal ArticleDOI
TL;DR: This work uses local support vector machine learning to estimate an effective metric for producing neighborhoods that are elongated along less discriminant feature dimensions and constricted along most discriminant ones, whereby better classification performance can be achieved.
Abstract: Nearest neighbor (NN) classification relies on the assumption that class conditional probabilities are locally constant. This assumption becomes false in high dimensions with finite samples due to the curse of dimensionality. The NN rule introduces severe bias under these conditions. We propose a locally adaptive neighborhood morphing classification method to try to minimize bias. We use local support vector machine learning to estimate an effective metric for producing neighborhoods that are elongated along less discriminant feature dimensions and constricted along most discriminant ones. As a result, the class conditional probabilities can be expected to be approximately constant in the modified neighborhoods, whereby better classification performance can be achieved. The efficacy of our method is validated and compared against other competing techniques using a number of datasets.
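
A sketch of the neighborhood-morphing idea under simplifying assumptions: fit a linear SVM in a generous region around the query, treat its weight magnitudes as local feature relevance, and run the final kNN vote under that weighted metric. The paper's actual metric construction differs, and this sketch assumes both classes appear among the local points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def adaptive_knn_classify(X, y, query, k_local=50, k=5):
    # 1. collect a generous local region around the query
    idx = NearestNeighbors(n_neighbors=k_local).fit(X).kneighbors(
        query[None, :], return_distance=False)[0]
    # 2. local feature relevance from the linear SVM's weight magnitudes:
    #    large |w_j| marks a discriminant direction, so distances grow along it
    svm = LinearSVC(dual=False).fit(X[idx], y[idx])
    w = np.abs(svm.coef_).mean(axis=0) + 1e-6
    # 3. kNN vote under the locally weighted (morphed) metric
    d = np.sqrt((((X[idx] - query) * w) ** 2).sum(axis=1))
    nearest = idx[np.argsort(d)[:k]]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[counts.argmax()]
```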

Proceedings ArticleDOI
07 Nov 2003
TL;DR: This paper addresses the problem of finding the in-route nearest neighbor (IRNN) for a query object tuple which consists of a given route with a destination and a current location on it and addresses four alternative solution methods.
Abstract: Nearest neighbor query is one of the most important operations in spatial databases and their application domains, e.g., location-based services, advanced traveler information systems, etc. This paper addresses the problem of finding the in-route nearest neighbor (IRNN) for a query object tuple which consists of a given route with a destination and a current location on it. The IRNN is the facility instance via which the detour from the original route on the way to the destination is smallest. This paper presents four alternative solution methods. Comparisons among them are presented using an experimental framework. Several experiments using real road map datasets are conducted to examine the behavior of the solutions in terms of three parameters affecting the performance. Our experiments show that the computation costs for all methods except the precomputed zone-based method increase with increases in the road map size and the query route length but decrease with increases in the facility density. The precomputed zone-based method is the most efficient when there are no updates on the road map.
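
A heavily simplified sketch of the IRNN definition itself, not of the paper's four solution methods: a facility's detour is the extra travel incurred by leaving the route at some node, visiting the facility, and continuing to the destination. It assumes consecutive route nodes are adjacent edges in the graph; the repeated shortest-path queries are exactly what the paper's precomputation strategies aim to avoid.

```python
import networkx as nx

def in_route_nn(G, route, dest, facilities):
    """G: weighted road graph; route: node list ending at dest;
    facilities: candidate facility nodes. Returns the IRNN facility."""
    # remaining on-route distance from each route node to the destination
    remaining = {route[-1]: 0.0}
    acc = 0.0
    for a, b in zip(reversed(route[:-1]), reversed(route[1:])):
        acc += G[a][b]["weight"]
        remaining[a] = acc
    best, best_detour = None, float("inf")
    for f in facilities:
        to_dest = nx.shortest_path_length(G, f, dest, weight="weight")
        for v in route:  # candidate exit point from the route
            detour = (nx.shortest_path_length(G, v, f, weight="weight")
                      + to_dest - remaining[v])
            if detour < best_detour:
                best, best_detour = f, detour
    return best
```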

Proceedings ArticleDOI
18 Jun 2003
TL;DR: The linear programming technique used in this paper, which is called feature selection via linear programming (FSLP), can determine the number of features and which features to use in the resulting classification function based on recent results in optimization.
Abstract: A linear programming technique is introduced that jointly performs feature selection and classifier training so that a subset of features is optimally selected together with the classifier. Traditional classification methods in computer vision have used a two-step approach, feature selection followed by classifier training, so feature selection has often been ad hoc, relying on heuristics or a time-consuming forward and backward search process. Moreover, it is difficult to determine which features to use and how many features to use when these two steps are separated. The linear programming technique used in this paper, which we call feature selection via linear programming (FSLP), can determine the number of features and which features to use in the resulting classification function based on recent results in optimization. We analyze why FSLP can avoid the curse of dimensionality problem based on margin analysis. As one demonstration of the performance of this FSLP technique for computer vision tasks, we apply it to the problem of face expression recognition. Recognition accuracy is compared with results using support vector machines, the AdaBoost algorithm, and a Bayes classifier.
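
A sketch of a generic 1-norm support vector LP in the spirit of FSLP, not the paper's exact program: minimizing the l1 norm of the weight vector plus hinge slacks drives many weights to exactly zero, so which features to use (and how many) falls out of the same optimization that trains the classifier.

```python
import numpy as np
from scipy.optimize import linprog

def fslp_fit(X, y, C=1.0):
    """X: (n, p) features; y: labels in {-1, +1}. Returns (w, b);
    the selected features are those with w != 0."""
    n, p = X.shape
    # variables z = [w_pos (p), w_neg (p), b_pos, b_neg, xi (n)], all >= 0
    c = np.concatenate([np.ones(2 * p), [0.0, 0.0], C * np.ones(n)])
    # margin constraints y_i (w.x_i + b) + xi_i >= 1, rewritten as A z <= -1
    A = np.hstack([-(y[:, None] * X), y[:, None] * X,
                   -y[:, None], y[:, None], -np.eye(n)])
    res = linprog(c, A_ub=A, b_ub=-np.ones(n), bounds=(0, None), method="highs")
    w = res.x[:p] - res.x[p:2 * p]
    b = res.x[2 * p] - res.x[2 * p + 1]
    return w, b
```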

Dissertation
01 Jan 2003
TL;DR: The embeddings are shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.
Abstract: Sequences represent a large class of fundamental objects in Computer Science: sets, strings, vectors and permutations are all considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems including: fast computation of short sketches in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences; approximate nearest neighbor and clustering problems, significantly faster than the naive exact solutions; algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time; and efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication. Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.
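
A sketch of the flavor of embedding the thesis studies, in its simplest vector case: a small random-projection sketch (Johnson-Lindenstrauss style) lets two long vectors be compared using only their constant-size summaries, with Euclidean distance approximately preserved. The sketch size is an illustrative accuracy/space trade-off.

```python
import numpy as np

def make_sketcher(dim, sketch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(sketch_size, dim)) / np.sqrt(sketch_size)
    return lambda x: R @ x  # constant-size summary of a length-dim vector

sketch = make_sketcher(dim=100_000)
rng = np.random.default_rng(1)
a, b = rng.normal(size=100_000), rng.normal(size=100_000)
est = np.linalg.norm(sketch(a) - sketch(b))   # compares 64 numbers...
true = np.linalg.norm(a - b)                  # ...instead of 100,000
print(f"estimated distance {est:.1f} vs true {true:.1f}")
```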

Book ChapterDOI
04 Jun 2003
TL;DR: This paper investigates its extension, called supervised locally linear embedding (SLLE), using class labels of data points in their mapping into a low-dimensional space, and derives an efficient eigendecomposition scheme for SLLE.
Abstract: The dimensionality of the input data often far exceeds their intrinsic dimensionality. As a result, it may be difficult to recognize multidimensional data, especially if the number of samples in a dataset is not large. In addition, the more dimensions the data have, the longer the recognition time is. This leads to the necessity of performing dimensionality reduction before pattern recognition. Locally linear embedding (LLE) is one of the methods intended for this task. In this paper, we investigate its extension, called supervised locally linear embedding (SLLE), which uses class labels of data points in their mapping into a low-dimensional space. An efficient eigendecomposition scheme for SLLE is derived. Two variants of SLLE are analyzed, coupled with a k nearest neighbor classifier and tested on real-world images. Preliminary results demonstrate that both variants yield identical best accuracy, despite being conceptually different.
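
A sketch of the supervision step that distinguishes SLLE from plain LLE, under the usual formulation: inter-class distances are inflated before the neighborhood search, biasing each point's neighbors toward its own class, and the modified distances then drive standard LLE. Here alpha = 1 corresponds to a fully supervised variant and 0 < alpha < 1 to a partially supervised one; the value used is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def slle_distances(X, labels, alpha=0.5):
    """X: (n, d) data; labels: (n,) class labels. Returns the distance
    matrix whose k smallest entries per row define SLLE neighborhoods."""
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    different = labels[:, None] != labels[None, :]
    return D + alpha * D.max() * different    # push other-class points away
```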

Proceedings Article
09 Aug 2003
TL;DR: A parametric method for metric learning based on class label information is proposed, which performs parametric learning to find a regression mapping from the input space to a feature space, such that the dissimilarity between patterns in the input space is approximated by the Euclidean distance between points in the feature space.
Abstract: Distance-based methods in pattern recognition and machine learning have to rely on a similarity or dissimilarity measure between patterns in the input space. For many applications, Euclidean distance in the input space is not a good choice and hence more complicated distance metrics have to be used. In this paper, we propose a parametric method for metric learning based on class label information. We first define a dissimilarity measure that can be proved to be metric. It has the favorable property that between-class dissimilarity is always larger than within-class dissimilarity. We then perform parametric learning to find a regression mapping from the input space to a feature space, such that the dissimilarity between patterns in the input space is approximated by the Euclidean distance between points in the feature space. Parametric learning is performed using the iterative majorization algorithm. Experimental results on real-world benchmark data sets show that this approach is promising.
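
A sketch of the paper's goal rather than its solver: learn a map (here linear, trained by plain gradient descent instead of iterative majorization) so that Euclidean distances after the map approximate a target dissimilarity matrix built from class labels, e.g. a small constant for within-class pairs and a larger one for between-class pairs. All names and parameters are illustrative.

```python
import numpy as np

def learn_metric_map(X, delta, out_dim=2, lr=0.01, steps=500, seed=0):
    """X: (n, d) inputs; delta: (n, n) target dissimilarities (zero diagonal).
    Returns L such that ||(x_i - x_j) @ L|| ~ delta[i, j]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = rng.normal(scale=0.1, size=(d, out_dim))
    for _ in range(steps):
        Z = X @ L
        diff = Z[:, None, :] - Z[None, :, :]        # pairwise differences
        dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-9)
        err = dist - delta                          # stress residuals
        # gradient of 0.5 * sum_ij (dist_ij - delta_ij)^2 with respect to L
        G = (err / dist)[:, :, None] * diff
        L -= lr * np.einsum('id,ijo->do', X, G) / n
    return L
```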