
Showing papers on "k-nearest neighbors algorithm published in 2004"


Journal ArticleDOI
TL;DR: Two classes of improved estimators for mutual information M(X,Y), computed from samples of random points distributed according to some joint probability density mu(x,y) and based on entropy estimates from k-nearest neighbor distances, are presented.
Abstract: We present two classes of improved estimators for mutual information M(X,Y), from samples of random points distributed according to some joint probability density mu(x,y). In contrast to conventional estimators based on binnings, they are based on entropy estimates from k-nearest neighbor distances. This means that they are data efficient (with k=1 we resolve structures down to the smallest possible scales), adaptive (the resolution is higher where data are more numerous), and have minimal bias. Indeed, the bias of the underlying entropy estimates is mainly due to nonuniformity of the density at the smallest resolved scale, giving typically systematic errors which scale as functions of k/N for N points. Numerically, we find that both families become exact for independent distributions, i.e. the estimator M(X,Y) vanishes (up to statistical fluctuations) if mu(x,y)=mu(x)mu(y). This holds for all tested marginal distributions and for all dimensions of x and y. In addition, we give estimators for redundancies between more than two random variables. We compare our algorithms in detail with existing algorithms. Finally, we demonstrate the usefulness of our estimators for assessing the actual independence of components obtained from independent component analysis (ICA), for improving ICA, and for estimating the reliability of blind source separation.
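For readers who want to try the idea, here is a minimal Python sketch of the first (digamma-based) k-nearest-neighbor estimator variant described above; it assumes NumPy/SciPy, uses a small tolerance to mimic the strict-inequality neighbor counts, and is not the authors' reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=3):
    """Rough sketch of the KSG k-NN mutual information estimator (variant 1)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    xy = np.hstack([x, y])
    # distance to the k-th neighbor in the joint space, max-norm (first hit is the point itself)
    d, _ = cKDTree(xy).query(xy, k=k + 1, p=np.inf)
    eps = d[:, -1]
    # count marginal neighbors strictly inside eps (tiny shrink approximates the strict inequality)
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```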

3,224 citations


Book ChapterDOI
31 Aug 2004
TL;DR: This paper proposes a novel approach to efficiently and accurately evaluate KNN queries in spatial network databases using a first-order Voronoi diagram; it outperforms approaches based on on-line distance computation by up to one order of magnitude and provides a factor-of-four improvement in the selectivity of the filter step compared to index-based approaches.
Abstract: A frequent type of query in spatial networks (e.g., road networks) is to find the K nearest neighbors (KNN) of a given query object. With these networks, the distances between objects depend on their network connectivity and it is computationally expensive to compute the distances (e.g., shortest paths) between objects. In this paper, we propose a novel approach to efficiently and accurately evaluate KNN queries in spatial network databases using a first-order Voronoi diagram. This approach is based on partitioning a large network into small Voronoi regions and then pre-computing distances both within and across the regions. By localizing the pre-computation within the regions, we save on both storage and computation, and by performing across-the-network computation for only the border points of the neighboring regions, we avoid global pre-computation between every node-pair. Our empirical experiments with several real-world data sets show that our proposed solution outperforms approaches that are based on on-line distance computation by up to one order of magnitude, and provides a factor of four improvement in the selectivity of the filter step as compared to the index-based approaches.
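To make the partitioning step concrete, the following is a rough, self-contained sketch (not the authors' VN3 code) of assigning every road-network node to its nearest point of interest by network distance using a multi-source Dijkstra; the adjacency-dict format and node names are illustrative assumptions.

```python
import heapq

def network_voronoi(adj, generators):
    """Partition a weighted graph into network Voronoi regions.

    adj: {node: [(neighbor, edge_weight), ...]} -- assumed adjacency format.
    generators: iterable of point-of-interest nodes (the Voronoi generators).
    Returns {node: (nearest_generator, network_distance)}.
    """
    settled = {}
    heap = [(0.0, g, g) for g in generators]   # (distance, node, owning generator)
    while heap:
        d, node, gen = heapq.heappop(heap)
        if node in settled:
            continue                           # already claimed by a closer generator
        settled[node] = (gen, d)
        for nbr, w in adj.get(node, []):
            if nbr not in settled:
                heapq.heappush(heap, (d + w, nbr, gen))
    return settled

# toy usage: the first nearest neighbor of a query node is simply the generator owning its region
regions = network_voronoi({"a": [("b", 2.0)], "b": [("a", 2.0), ("c", 1.0)], "c": [("b", 1.0)]},
                          generators=["a", "c"])
```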

520 citations


Proceedings Article
01 Dec 2004
TL;DR: This paper asks whether earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, can be altered to provide approximate answers to proximity queries and, if so, how; it introduces a new kind of metric tree that allows overlap.
Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high-dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same random-projection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high-dimensional datasets, which shows up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels.
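A toy sketch of the two ingredients, an overlapping ("spill") split and a defeatist, no-backtracking search, is given below; the pivot choice, the overlap fraction tau, and the leaf size are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

class SpillTree:
    """Tiny sketch of a metric tree with overlapping splits and defeatist search."""

    def __init__(self, points, tau=0.1, leaf_size=20):
        self.points = np.asarray(points, dtype=float)
        self.tau = tau
        self.leaf_size = leaf_size
        self.root = self._build(np.arange(len(self.points)))

    def _build(self, idx):
        if len(idx) <= self.leaf_size:
            return {"leaf": idx}
        pts = self.points[idx]
        # crude pivot choice: two points far apart define the split direction
        center = pts.mean(axis=0)
        a = pts[np.argmax(np.linalg.norm(pts - center, axis=1))]
        b = pts[np.argmax(np.linalg.norm(pts - a, axis=1))]
        axis = b - a
        proj = pts @ axis
        mid = np.median(proj)
        buf = self.tau * (proj.max() - proj.min())   # overlap buffer around the median
        left = idx[proj <= mid + buf]                 # points near the boundary go to both children
        right = idx[proj >= mid - buf]
        if len(left) == len(idx) or len(right) == len(idx):
            return {"leaf": idx}                      # degenerate split: stop here
        return {"axis": axis, "mid": mid,
                "left": self._build(left), "right": self._build(right)}

    def query(self, q, k=1):
        node = self.root
        while "leaf" not in node:                     # defeatist descent: never backtrack
            node = node["left"] if q @ node["axis"] <= node["mid"] else node["right"]
        cand = node["leaf"]
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[np.argsort(d)[:k]]                # approximate k nearest neighbors
```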

487 citations


Proceedings ArticleDOI
01 Jan 2004
TL;DR: It is indicated that connectivity seems to have an adverse effect on controllability, and it is formally shown why a path is controllable while a complete graph is not.
Abstract: In this paper we derive necessary and sufficient conditions for a group of systems interconnected via nearest neighbor rules to be controllable by one of them acting as a leader. It is indicated that connectivity seems to have an adverse effect on controllability, and it is formally shown why a path is controllable while a complete graph is not. The dependence of the graph controllability property on the size of the graph and its connectivity is investigated in simulation. Results suggest analytical means of selecting the right leader and/or the appropriate topology to be able to control an interconnected system with nearest neighbor interaction rules.
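As a quick numerical illustration of the path-versus-complete-graph observation, one can run a Kalman rank test on the follower dynamics obtained from the graph Laplacian; the Laplacian-based leader-follower model below is a standard formulation assumed for this sketch, not a reproduction of the paper's derivation.

```python
import numpy as np

def leader_follower_controllable(L, leader):
    """Kalman rank test for the follower subsystem x' = A x + B u, where A and B come
    from deleting the leader's row/column of the graph Laplacian L (assumed model)."""
    followers = [i for i in range(len(L)) if i != leader]
    A = -L[np.ix_(followers, followers)]
    B = -L[np.ix_(followers, [leader])]
    n = len(followers)
    C = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
    return np.linalg.matrix_rank(C) == n

# path 0-1-2 with the leader at one end: controllable
L_path = np.array([[1, -1, 0], [-1, 2, -1], [0, -1, 1]], dtype=float)
# complete graph on 3 nodes: not controllable from a single leader
L_complete = np.array([[2, -1, -1], [-1, 2, -1], [-1, -1, 2]], dtype=float)
print(leader_follower_controllable(L_path, leader=2))      # True
print(leader_follower_controllable(L_complete, leader=2))  # False
```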

469 citations


Journal ArticleDOI
TL;DR: A new method for estimating the area of home ranges and constructing utilization distributions (UDs) from spatial data is described and a minimum spurious hole covering (MSHC) rule is proposed for selecting k and interpreted in terms of type I and type II statistical errors.
Abstract: We describe a new method for estimating the area of home ranges and constructing utilization distributions (UDs) from spatial data. We compare our method with bivariate kernel and alpha-hull methods, using both randomly distributed and highly aggregated data to test the accuracy of area estimates and UD isopleth construction. The data variously contain holes, corners, and corridors linking high use areas. Our method is based on taking the union of the minimum convex polygons (MCP) associated with the k-1 nearest neighbors of each point in the data and, as such, has one free parameter k. We propose a minimum spurious hole covering (MSHC) rule for selecting k and interpret its application in terms of type I and type II statistical errors. Our MSHC rule provides estimates within 12% of true area values for all 5 data sets, while kernel methods are worse in all cases: in one case overestimating area by a factor of 10 and in another case underestimating area by a factor of 50. Our method also constructs much better estimates for the density isopleths of the UDs than kernel methods. The alpha-hull method does not lead directly to the construction of isopleths and also does not always include all points in the constructed home range. Finally, we demonstrate that kernel methods, unlike our method and the alpha-hull method, do not converge to the true area represented by the data as the number of data points increases.
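The core construction (the union of the convex hulls spanned by each point and its k-1 nearest neighbors) can be sketched in a few lines; this version leans on SciPy and Shapely, omits the MSHC rule for choosing k, and is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from shapely.geometry import MultiPoint
from shapely.ops import unary_union

def knn_hull_home_range(points, k=10):
    """Union of local convex hulls: each point plus its k-1 nearest neighbors forms one hull."""
    pts = np.asarray(points, dtype=float)
    _, nbrs = cKDTree(pts).query(pts, k=k)      # each row: the point itself and its k-1 neighbors
    hulls = [MultiPoint([tuple(p) for p in pts[row]]).convex_hull for row in nbrs]
    region = unary_union(hulls)                  # the estimated home range polygon(s)
    return region, region.area
```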

413 citations


DOI
01 Jan 2004
TL;DR: This paper presents an extended version of k-nearest neighbor classification, where the distances of the nearest neighbors can be taken into account, and shows possibilities to use nearest neighbor for classification in the case of an ordinal class structure.
Abstract: In the field of statistical discrimination k-nearest neighbor classification is a well-known, easy and successful method. In this paper we present an extended version of this technique, where the distances of the nearest neighbors can be taken into account. In this sense there is a close connection to LOESS, a local regression technique. In addition we show possibilities to use nearest neighbor for classification in the case of an ordinal class structure. Empirical studies show the advantages of the new techniques.
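As an illustration of letting the neighbor distances enter the vote, here is a minimal distance-weighted k-NN classifier; the inverse-distance weighting used below is one common choice and not necessarily the exact scheme studied in the paper.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-9):
    """Classify x by a distance-weighted vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + eps)   # closer neighbors get larger weights
    return max(votes, key=votes.get)
```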

332 citations


Journal ArticleDOI
TL;DR: A comparison of normalization functions shows that moment-based functions outperform dimension-based ones and that the aspect ratio mapping is influential; a comparison of feature vectors shows that the improved feature extraction strategies outperform their baseline counterparts.

305 citations


Book ChapterDOI
31 Aug 2004
TL;DR: The proposed algorithms for exact processing of RkNN with arbitrary values of k on dynamic multidimensional datasets utilize a conventional data-partitioning index on the dataset and do not require any pre-computation.
Abstract: Given a point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: (i) they do not support arbitrary values of k, (ii) they cannot deal efficiently with database updates, (iii) they are applicable only to 2D data (but not to higher dimensionality), and (iv) they retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact processing of RkNN queries with arbitrary values of k on dynamic multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any pre-computation. In addition to their flexibility, we experimentally verify that the proposed algorithms outperform the existing ones even in their restricted focus.
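To pin down the query semantics, here is a brute-force reverse k-NN in a few lines; it runs in quadratic time and does not attempt the paper's index-based processing.

```python
import numpy as np

def reverse_knn(points, q, k=1):
    """Return indices of data points that have q among their k nearest neighbors."""
    pts = np.asarray(points, dtype=float)
    q = np.asarray(q, dtype=float)
    result = []
    for i, p in enumerate(pts):
        d_to_others = np.linalg.norm(np.delete(pts, i, axis=0) - p, axis=1)
        d_to_q = np.linalg.norm(p - q)
        # q is one of p's k nearest neighbors iff fewer than k other points are closer to p
        if np.sum(d_to_others < d_to_q) < k:
            result.append(i)
    return result
```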

294 citations


Proceedings ArticleDOI
14 Mar 2004
TL;DR: This paper compares two commonly used distance measures in vector models, namely, Euclidean distance (EUD) and cosine angle distance (CAD), for nearest neighbor (NN) queries in high dimensional data spaces and shows that CAD works no worse than EUD.
Abstract: Understanding the relationship among different distance measures is helpful in choosing a proper one for a particular application. In this paper, we compare two commonly used distance measures in vector models, namely, Euclidean distance (EUD) and cosine angle distance (CAD), for nearest neighbor (NN) queries in high dimensional data spaces. Using theoretical analysis and experimental results, we show that the retrieval results based on EUD are similar to those based on CAD when dimension is high. We have applied CAD for content based image retrieval (CBIR). Retrieval results show that CAD works no worse than EUD, which is a commonly used distance measure for CBIR, while providing other advantages, such as naturally normalized distance.
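A tiny experiment of the kind discussed: compare the nearest-neighbor rankings produced by Euclidean distance and cosine angle distance on synthetic high-dimensional vectors (the data and dimensionality below are placeholders, only meant to show how such an overlap measurement could be set up).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 64))          # synthetic high-dimensional data points
q = rng.random(64)                  # a synthetic query vector

eud = np.linalg.norm(X - q, axis=1)                                   # Euclidean distance
cad = 1.0 - (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))  # cosine angle distance

k = 10
top_eud = set(np.argsort(eud)[:k])
top_cad = set(np.argsort(cad)[:k])
print("overlap of top-10 neighbors:", len(top_eud & top_cad) / k)
```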

281 citations


Proceedings ArticleDOI
30 Mar 2004
TL;DR: This work proposes several algorithms for finding the group nearest neighbors efficiently and extends their techniques for situations where Q cannot fit in memory, covering both indexed and nonindexed query points.
Abstract: Given two sets of points P and Q, a group nearest neighbor (GNN) query retrieves the point(s) of P with the smallest sum of distances to all points in Q. Consider, for instance, three users at locations q_1, q_2 and q_3 that want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances |pq_i| for 1 ≤ i ≤ 3. Assuming that Q fits in memory and P is indexed by an R-tree, we propose several algorithms for finding the group nearest neighbors efficiently. As a second step, we extend our techniques for situations where Q cannot fit in memory, covering both indexed and nonindexed query points. An experimental evaluation identifies the best alternative based on the data and query properties.
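The query itself is easy to state in code; a brute-force version that scans all of P (no R-tree, so none of the paper's pruning) looks like this:

```python
import numpy as np

def group_nearest_neighbor(P, Q):
    """Return the point of P minimizing the sum of Euclidean distances to all points of Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # total[i] = sum over j of ||P[i] - Q[j]||
    total = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).sum(axis=1)
    return P[np.argmin(total)]
```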

270 citations


Journal ArticleDOI
TL;DR: In this paper, a simple Gabor feature space is proposed for invariant object recognition; it has been successfully applied, e.g., to invariant face detection for extracting facial features in demanding environments.

Journal ArticleDOI
TL;DR: In this article, a look-ahead algorithm for selective sampling of examples for nearest neighbor classifiers is proposed; the algorithm looks for the example with the highest utility, taking its effect on the resulting classifier into account.
Abstract: Most existing inductive learning algorithms work under the assumption that their training examples are already tagged. There are domains, however, where the tagging procedure requires significant computation resources or manual labor. In such cases, it may be beneficial for the learner to be active, intelligently selecting the examples for labeling with the goal of reducing the labeling cost. In this paper we present LSS, a lookahead algorithm for selective sampling of examples for nearest neighbor classifiers. The algorithm looks for the example with the highest utility, taking its effect on the resulting classifier into account. Computing the expected utility of an example requires estimating the probability of its possible labels. We propose to use the random field model for this estimation. The LSS algorithm was evaluated empirically on seven real and artificial data sets, and its performance was compared to other selective sampling algorithms. The experiments show that the proposed algorithm outperforms other methods in terms of average error rate and stability.

Proceedings Article
01 Dec 2004
TL;DR: A framework for learning an object classifier from a single example, which emphasizes the relevant dimensions for classification using available examples of related classes and makes use of a kernel-based metric learning algorithm.
Abstract: We describe a framework for learning an object classifier from a single example. This goal is achieved by emphasizing the relevant dimensions for classification using available examples of related classes. Learning to accurately classify objects from a single training example is often infeasible due to overfitting effects. However, if the instance representation ensures that the distance between any two instances of the same class is smaller than the distance between any two instances from different classes, then a nearest neighbor classifier could achieve perfect performance with a single training example. We therefore suggest a two-stage strategy. First, learn a metric over the instances that achieves the distance criterion mentioned above, from available examples of other related classes. Then, using the single example, define a nearest neighbor classifier where distance is evaluated by the learned class relevance metric. Finding a metric that emphasizes the relevant dimensions for classification might not be possible when restricted to linear projections. We therefore make use of a kernel-based metric learning algorithm. Our setting encodes object instances as sets of locality-based descriptors and adopts an appropriate image kernel for the class relevance metric learning. The proposed framework for learning from a single example is demonstrated in a synthetic setting and on a character classification task.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms.
Abstract: This paper explores feature scoring and selection based on weights from linear classification models. It investigates how these methods combine with various learning models. Our comparative analysis includes three learning algorithms: Naive Bayes, Perceptron, and Support Vector Machines (SVM) in combination with three feature weighting methods: Odds Ratio, Information Gain, and weights from linear models, the linear SVM and Perceptron. Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms. The results support the conjecture that it is the sophistication of the feature weighting method rather than its apparent compatibility with the learning algorithm that improves classification performance.
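In scikit-learn terms, the SVM-weight criterion amounts to ranking features by the magnitude of a linear SVM's coefficients; a short sketch under that assumption (dataset, regularization constant, and cutoff are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_feature_ranking(X, y, C=1.0):
    """Rank features by the absolute weight a linear SVM assigns to them."""
    clf = LinearSVC(C=C).fit(X, y)
    importance = np.abs(clf.coef_).sum(axis=0)   # sum over classes for multi-class problems
    return np.argsort(importance)[::-1]          # feature indices, most important first

# e.g., keep only the top 1000 features before training the final classifier
# top_features = svm_feature_ranking(X_train, y_train)[:1000]
```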

Journal ArticleDOI
TL;DR: A cluster-based tree algorithm to accelerate k-NN classification without any presuppositions about the metric form and properties of a dissimilarity measure is proposed.
Abstract: Most fast k-nearest neighbor (k-NN) algorithms exploit metric properties of distance measures for reducing computation cost and a few can work effectively on both metric and nonmetric measures. We propose a cluster-based tree algorithm to accelerate k-NN classification without any presuppositions about the metric form and properties of a dissimilarity measure. A mechanism of early decision making and minimal side-operations for choosing searching paths largely contribute to the efficiency of the algorithm. The algorithm is evaluated through extensive experiments over standard NIST and MNIST databases.

Proceedings ArticleDOI
23 Aug 2004
TL;DR: An on-line recognition method for hand-sketched symbols is presented that is independent of stroke-order, -number, and -direction, as well as invariant to scaling, translation, rotation and reflection of symbols.
Abstract: We present an on-line recognition method for hand-sketched symbols. The method is independent of stroke-order, -number, and -direction, as well as invariant to scaling, translation, rotation and reflection of symbols. Zernike moment descriptors are used to represent symbols and three different classification techniques are compared: support vector machines (SVM), minimum mean distance (MMD), and nearest neighbor (NN). We have obtained a 97% recognition accuracy rate on a dataset consisting of 7,410 sketched symbols using Zernike moment features and a SVM classifier.

Journal ArticleDOI
TL;DR: A new metaheuristic approach called the ACOMAC algorithm for solving the traveling salesman problem (TSP) is presented; it introduces the concept of multiple ant clans from parallel genetic algorithms to search the solution space using multiple islands, avoiding local minima and thus yielding the global minimum for the solved TSPs.

Journal ArticleDOI
TL;DR: A probabilistic active learning strategy for support vector machine (SVM) design in large data applications that queries for a set of points according to a distribution as determined by the current separating hyperplane and a newly defined concept of an adaptive confidence factor.
Abstract: The paper describes a probabilistic active learning strategy for support vector machine (SVM) design in large data applications. The learning strategy is motivated by the statistical query model. While most existing methods of active SVM learning query for points based on their proximity to the current separating hyperplane, the proposed method queries for a set of points according to a distribution as determined by the current separating hyperplane and a newly defined concept of an adaptive confidence factor. This enables the algorithm to have more robust and efficient learning capabilities. The confidence factor is estimated from local information using the k nearest neighbor principle. The effectiveness of the method is demonstrated on real-life data sets in terms of generalization performance, query complexity, and training time.

Journal ArticleDOI
TL;DR: Two novel classifiers based on a locally nearest neighborhood rule, called nearest neighbor line and nearest neighbor plane, are presented for pattern classification; they incur much lower computation cost and achieve competitive performance.

Journal ArticleDOI
TL;DR: Experiments show that the improved kNN strategy, in which different numbers of nearest neighbors for different categories are used instead of a fixed number across all categories, is especially applicable and promising for cases where estimating the parameter k via cross-validation is not possible and the class distribution of a training set is skewed.
Abstract: k is the most important parameter in a text categorization system based on the k-nearest neighbor algorithm (kNN). To classify a new document, the k-nearest documents in the training set are determined first. The prediction of categories for this document can then be made according to the category distribution among the k nearest neighbors. Generally speaking, the class distribution in a training set is not even; some classes may have more samples than others. The system's performance is very sensitive to the choice of the parameter k. And it is very likely that a fixed k value will result in a bias for large categories, and will not make full use of the information in the training set. To deal with these problems, an improved kNN strategy, in which different numbers of nearest neighbors for different categories are used instead of a fixed number across all categories, is proposed in this article. More samples (nearest neighbors) will be used to decide whether a test document should be classified in a category that has more samples in the training set. The numbers of nearest neighbors selected for different categories are adaptive to their sample size in the training set. Experiments on two different datasets show that our methods are less sensitive to the parameter k than the traditional ones, and can properly classify documents belonging to smaller classes with a large k. The strategy is especially applicable and promising for cases where estimating the parameter k via cross-validation is not possible and the class distribution of a training set is skewed.
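A rough sketch of the adaptive idea: give each category a neighbor budget proportional to its share of the training set and score a test document against each category's own budget; the proportionality rule and the scoring below are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np
from collections import Counter

def adaptive_knn_classify(X_train, y_train, x, k=30):
    """k-NN where each category c uses its own neighbor count k_c ~ k * |c| / N."""
    X_train = np.asarray(X_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    order = np.argsort(d)                          # training documents, nearest first
    counts = Counter(y_train)
    n = len(y_train)
    scores = {}
    for c, n_c in counts.items():
        k_c = max(1, int(round(k * n_c / n)))      # larger categories get more neighbors
        top = order[:k_c]                          # that category's own neighborhood
        # fraction of class-c documents among the k_c nearest neighbors
        scores[c] = float(np.mean([y_train[i] == c for i in top]))
    return max(scores, key=scores.get)
```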

Proceedings ArticleDOI
14 Mar 2004
TL;DR: Extensive experiments on internet newsgroup datasets using the K-means clustering algorithm with kNN consistency enhancement show that kNN/kMN consistency can be improved significantly (about 100% for 1MN and 1NN consistencies) while the clustering accuracy is improved simultaneously, indicating that the local consistency information helps the global cluster objective function optimization.
Abstract: Nearest neighbor consistency is a central concept in statistical pattern recognition, especially the kNN classification methods and its strong theoretical foundation. In this paper, we extend this concept to data clustering, requiring that for any data point in a cluster, its k-nearest neighbors and mutual nearest neighbors should also be in the same cluster. We study properties of the cluster k-nearest neighbor consistency and propose kNN and kMN consistency enforcing and improving algorithms. Extensive experiments on internet newsgroup datasets using the K-means clustering algorithm with kNN consistency enhancement show that kNN / kMN consistency can be improved significantly (about 100% for 1MN and 1NN consistencies) while the clustering accuracy is improved simultaneously. This indicates the local consistency information helps the global cluster objective function optimization.
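The consistency notion itself is cheap to measure; the sketch below reports, for a given clustering, the fraction of points whose k nearest neighbors all share the point's cluster label (the paper's enforcement and improvement algorithms are not reproduced here).

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_consistency(X, labels, k=1):
    """Fraction of points whose k nearest neighbors lie in the same cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    _, nbrs = cKDTree(X).query(X, k=k + 1)        # column 0 is the point itself
    same = labels[nbrs[:, 1:]] == labels[:, None]  # neighbor labels vs. own label
    return float(np.mean(same.all(axis=1)))
```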

Journal ArticleDOI
TL;DR: Experiments show that two different feature selection mechanisms, feature scaling with Diverse Density and feature reduction with principal component analysis, can significantly improve the performance of BP-MIP.
Abstract: Multi-instance learning is regarded as a new learning framework where the training examples are bags composed of instances without labels, and the task is to predict the labels of unseen bags through analyzing the training bags with known labels. Recently, a multi-instance neural network BP-MIP was proposed. In this paper, BP-MIP is improved through adopting two different feature selection techniques, i.e. feature scaling with Diverse Density and feature reduction with principal component analysis. In detail, before feature vectors are fed to a BP-MIP neural network, they are scaled by the feature weights found by running Diverse Density on the training data, or projected by a linear transformation matrix formed by principal component analysis. Experiments show that these feature selection mechanisms can significantly improve the performance of BP-MIP.

Journal ArticleDOI
TL;DR: This paper investigates how to improve the accuracy of recognition based on non-negative matrix factorization from two viewpoints, including adopting a Riemannian-metric-like distance for the learned feature vectors instead of the Euclidean distance.

Book ChapterDOI
01 Jan 2004
TL;DR: The chapter reviews the most important methods for obtaining transformed signal characteristics such as principal component analysis, the discrete Fourier transform, and the discrete cosine and sine transform.
Abstract: This chapter gives an overview of the most relevant feature selection and extraction methods for biomedical image processing. Besides the traditional transformed and non-transformed signal characteristics and texture, feature extraction methods encompass structural and graph descriptors. The feature selection methods described in this chapter are the exhaustive search, branch and bound algorithm, max-min feature selection, sequential forward and backward selection, and also Fisher's linear discriminant. Feature extraction and selection in pattern recognition are based on finding mathematical methods for reducing dimensionality of pattern representation. A lower-dimensional representation based on pattern descriptors is a so-called feature. It plays a crucial role in determining the separating properties of pattern classes. The choice of features, attributes, or measurements has an important influence on: the accuracy of classification, the time needed for classification, the number of examples needed for learning, and the cost of performing classification. The chapter reviews the most important methods for obtaining transformed signal characteristics such as principal component analysis, the discrete Fourier transform, and the discrete cosine and sine transform. The basic idea employed in transformed signal characteristics is to find such transform-based features with a low redundancy and a high information density of the original input.

Journal ArticleDOI
TL;DR: This article proposes an alternative method that captures the performance of nearest neighbor queries using approximation; it involves closed formulae that are very efficient to compute and accurate for up to 10 dimensions.
Abstract: Existing models for nearest neighbor search in multidimensional spaces are not appropriate for query optimization because they either lead to erroneous estimation or involve complex equations that are expensive to evaluate in real-time. This article proposes an alternative method that captures the performance of nearest neighbor queries using approximation. For uniform data, our model involves closed formulae that are very efficient to compute and accurate for up to 10 dimensions. Further, the proposed equations can be applied on nonuniform data with the aid of histograms. We demonstrate the effectiveness of the model by using it to solve several optimization problems related to nearest neighbor search.

Proceedings ArticleDOI
23 Aug 2004
TL;DR: In empirical evaluation, these PCA-based classification schemes are found to compare favorably with nearest neighbour classification.
Abstract: In this paper, principal component analysis (PCA) is applied to the problem of online handwritten character recognition in the Tamil script. The input is a temporally ordered sequence of (x,y) pen coordinates corresponding to an isolated character obtained from a digitizer. The input is converted into a feature vector of constant dimensions following smoothing and normalization. PCA is used to find the basis vectors of each class subspace, and the orthogonal distance to the subspaces is used for classification. Pre-clustering of the training data and modification of the distance measure are explored to overcome some common problems in the traditional subspace method. In empirical evaluation, these PCA-based classification schemes are found to compare favorably with nearest neighbour classification.
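The classification rule described (orthogonal distance to each class's PCA subspace) can be sketched in plain NumPy; the number of components and the preprocessing are assumptions.

```python
import numpy as np

def fit_class_subspaces(X, y, n_components=10):
    """For each class, store its mean and the top principal directions."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
        models[c] = (mean, Vt[:n_components])
    return models

def subspace_classify(models, x):
    """Assign x to the class whose PCA subspace it is closest to (smallest residual)."""
    best, best_err = None, np.inf
    for c, (mean, V) in models.items():
        r = x - mean
        residual = r - V.T @ (V @ r)              # component orthogonal to the class subspace
        err = np.linalg.norm(residual)
        if err < best_err:
            best, best_err = c, err
    return best
```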

01 Jan 2004
TL;DR: In this paper, the authors propose two techniques to address continuous KNN (C-KNN) queries in SNDB: intersection examination (IE) and the upper bound algorithm (UBA); with IE, they find the KNNs of all nodes on a path and then, for those adjacent nodes whose nearest neighbors are different, find the intermediate split points.
Abstract: Continuous K nearest neighbor queries (C-KNN) are defined as finding the nearest points of interest along an entire path (e.g., finding the three nearest gas stations to a moving car on any point of a pre-specified path). The result of this type of query is a set of intervals (or split points) and their corresponding KNNs, such that the KNNs of all points within each interval are the same. The current studies on C-KNN focus on vector spaces where the distance between two objects is a function of their spatial attributes (e.g., Euclidean distance metric). These studies are not applicable to spatial network databases (SNDB) where the distance between two objects is a function of the network connectivity (e.g., shortest path between two objects). In this paper, we propose two techniques to address C-KNN queries in SNDB: Intersection Examination (IE) and Upper Bound Algorithm (UBA). With IE, we first find the KNNs of all nodes on a path and then, for those adjacent nodes whose nearest neighbors are different, we find the intermediate split points. Finally, we compute the KNNs of the split points using the KNNs of the surrounding nodes. The intuition behind UBA is that the performance of IE can be improved by determining the adjacent nodes that cannot have any split points in between, and consequently eliminating the computation of KNN queries for those nodes. Our empirical experiments show that the UBA approach outperforms IE, especially when the points of interest are sparsely distributed in the network.

Journal ArticleDOI
TL;DR: This paper examines their applicability to the classification of phonemes in a phonological awareness drilling software package and finds, in most cases, that the transformations have a beneficial effect on the classification performance.
Abstract: Kernel-based nonlinear feature extraction and classification algorithms are a popular new research direction in machine learning. This paper examines their applicability to the classification of phonemes in a phonological awareness drilling software package. We first give a concise overview of the nonlinear feature extraction methods such as kernel principal component analysis (KPCA), kernel independent component analysis (KICA), kernel linear discriminant analysis (KLDA), and kernel springy discriminant analysis (KSDA). The overview deals with all the methods in a unified framework, regardless of whether they are unsupervised or supervised. The effect of the transformations on a subsequent classification is tested in combination with learning algorithms such as Gaussian mixture modeling (GMM), artificial neural nets (ANN), projection pursuit learning (PPL), decision tree-based classification (C4.5), and support vector machines (SVMs). We found, in most cases, that the transformations have a beneficial effect on the classification performance. Furthermore, the nonlinear supervised algorithms yielded the best results.

Journal ArticleDOI
TL;DR: In this article, a generalization of variational cluster perturbation theory to extended Hubbard models at half filling with repulsive nearest neighbor interaction is proposed, which takes into account short-range correlations correctly by the exact diagonalisation of clusters of finite size, whereas long-range order beyond the size of the clusters is treated on a mean-field level.
Abstract: We present a generalization of the recently proposed variational cluster perturbation theory to extended Hubbard models at half filling with repulsive nearest neighbor interaction. The method takes into account short-range correlations correctly by the exact diagonalisation of clusters of finite size, whereas long-range order beyond the size of the clusters is treated on a mean-field level. For one dimension, we show that quantum Monte Carlo and density-matrix renormalization-group results can be reproduced with very good accuracy. Moreover, we apply the method to the two-dimensional extended Hubbard model on a square lattice. In contrast to the one-dimensional case, a first-order phase transition between the spin density wave phase and the charge density wave phase is found as a function of the nearest-neighbor interaction at on-site interactions U >= 3t. The single-particle spectral function is calculated for both the one-dimensional and the two-dimensional system.