
Showing papers on "k-nearest neighbors algorithm published in 2006"


Proceedings ArticleDOI
21 Oct 2006
TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound for hashing-based algorithms recently obtained in [27].
Abstract: We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))). This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-efficient version of the algorithm, which uses dn + n log^(O(1)) n space, with a query time of dn^(O(1/c^2)). Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice.

1,486 citations
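
The sketch below is not the authors' Leech-lattice construction; it is a minimal example of the classical p-stable (Gaussian projection) LSH family for Euclidean space that this line of hashing-based work builds on. The table count, hash length, and bucket width are illustrative parameters, not values from the paper.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Minimal p-stable LSH index for approximate NN in Euclidean space.
    h(x) = floor((a.x + b) / w) with Gaussian a and uniform b in [0, w)."""

    def __init__(self, dim, n_tables=10, n_bits=8, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.A = rng.standard_normal((n_tables, n_bits, dim))
        self.B = rng.uniform(0, width, size=(n_tables, n_bits))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = None

    def _keys(self, x):
        # One integer-tuple key per table.
        h = np.floor((self.A @ x + self.B) / self.width).astype(int)
        return [tuple(row) for row in h]

    def fit(self, X):
        self.points = np.asarray(X, dtype=float)
        for i, x in enumerate(self.points):
            for t, key in enumerate(self._keys(x)):
                self.tables[t][key].append(i)
        return self

    def query(self, q):
        # Collect candidates from all tables, then check true distances.
        cand = set()
        for t, key in enumerate(self._keys(np.asarray(q, dtype=float))):
            cand.update(self.tables[t].get(key, []))
        if not cand:
            return None
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[np.argmin(d)]

# toy usage
X = np.random.default_rng(1).normal(size=(2000, 32))
index = EuclideanLSH(dim=32).fit(X)
print("approx NN of X[0]:", index.query(X[0]))
```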


Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work considers visual category recognition in the framework of measuring similarities, or equivalently perceptual distances, to prototype examples of categories and proposes a hybrid of these two methods which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice.
Abstract: We consider visual category recognition in the framework of measuring similarities, or equivalently perceptual distances, to prototype examples of categories. This approach is quite flexible, and permits recognition based on color, texture, and particularly shape, in a homogeneous framework. While nearest neighbor classifiers are natural in this setting, they suffer from the problem of high variance (in bias-variance decomposition) in the case of limited sampling. Alternatively, one could use support vector machines but they involve time-consuming optimization and computation of pairwise distances. We propose a hybrid of these two methods which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice. The basic idea is to find close neighbors to a query sample and train a local support vector machine that preserves the distance function on the collection of neighbors. Our method can be applied to large, multiclass data sets for which it outperforms nearest neighbor and support vector machines, and remains efficient when the problem becomes intractable for support vector machines. A wide variety of distance functions can be used and our experiments show state-of-the-art performance on a number of benchmark data sets for shape and texture classification (MNIST, USPS, CUReT) and object recognition (Caltech-101). On Caltech-101 we achieved a correct classification rate of 59.05% (±0.56%) at 15 training images per class, and 66.23% (±0.48%) at 30 training images.

1,265 citations
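
A minimal sketch of the "train a local SVM on the query's nearest neighbors" idea, assuming scikit-learn. The value of k, the RBF kernel, and the fallback to plain voting when the neighborhood contains a single class are illustrative choices, not the paper's exact procedure (which trains the local SVM on pairwise distances).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def svm_knn_predict(X_train, y_train, X_query, k=20):
    """For each query: find its k nearest training points, train a local
    SVM on just those neighbors, and predict with it (falls back to the
    neighborhood label if only one class is present)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = []
    for x in np.atleast_2d(X_query):
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        Xl, yl = X_train[idx], y_train[idx]
        if len(np.unique(yl)) == 1:
            preds.append(yl[0])            # neighborhood is pure
        else:
            local_svm = SVC(kernel="rbf", C=10.0).fit(Xl, yl)
            preds.append(local_svm.predict(x.reshape(1, -1))[0])
    return np.array(preds)
```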


Proceedings ArticleDOI
25 Jun 2006
TL;DR: A tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points) that shows speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.
Abstract: We present a tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points). The data structure requires O(n) space regardless of the metric's structure yet maintains all performance properties of a navigating net (Krauthgamer & Lee, 2004b). If the point set has a bounded expansion constant c, which is a measure of the intrinsic dimensionality, as defined in (Karger & Ruhl, 2002), the cover tree data structure can be constructed in O(c^6 n log n) time. Furthermore, nearest neighbor queries require time only logarithmic in n, in particular O(c^12 log n) time. Our experimental results show speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.

896 citations
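
No cover tree is implemented here; as a stand-in, the sketch below shows the same usage pattern the abstract describes (build a metric tree once over n points, then answer k-NN queries in sublinear time under a supported metric) using scikit-learn's BallTree, which is a different metric tree with a comparable interface.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Build a metric tree once, then answer many kNN queries without a full scan.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))

tree = BallTree(X, metric="euclidean")     # any metric supported by BallTree works
dist, idx = tree.query(X[:5], k=3)          # 3 nearest neighbors of 5 query points
print(idx)
```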


Book
01 Mar 2006
TL;DR: This volume presents theoretical and practical discussions of nearest-neighbor (NN) methods in machine learning and examines computer vision as an application domain in which the benefit of these advanced methods is often dramatic.
Abstract: Regression and classification methods based on similarity of the input to stored examples have not been widely used in applications involving very large sets of high-dimensional data. Recent advances in computational geometry and machine learning, however, may alleviate the problems in using these methods on large data sets. This volume presents theoretical and practical discussions of nearest-neighbor (NN) methods in machine learning and examines computer vision as an application domain in which the benefit of these advanced methods is often dramatic. It brings together contributions from researchers in theory of computation, machine learning, and computer vision with the goals of bridging the gaps between disciplines and presenting state-of-the-art methods for emerging applications. The contributors focus on the importance of designing algorithms for NN search, and for the related classification, regression, and retrieval tasks, that remain efficient even as the number of points or the dimensionality of the data grows very large. The book begins with two theoretical chapters on computational geometry and then explores ways to make the NN approach practicable in machine learning applications where the dimensionality of the data and the size of the data sets make the naive methods for NN search prohibitively expensive. The final chapters describe successful applications of an NN algorithm, locality-sensitive hashing (LSH), to vision tasks.

526 citations


Proceedings Article
01 Jan 2006
TL;DR: A practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space and achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over the previous baseline.
Abstract: This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional “PCA space” and a high-dimensional “PCA-complement space.” After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement.

461 citations
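
A minimal numpy sketch of the core WCCN transform only; the paper's full recipe additionally splits the space with PCA and reweights the PCA-complement, which is omitted here. The regularization constant and the assumption of dense, moderately sized features are illustrative.

```python
import numpy as np

def wccn_projection(X, y, reg=1e-6):
    """Within-class covariance normalization (WCCN).
    Returns B such that the mapped feature B.T @ x realizes the generalized
    linear kernel k(x, z) = x.T @ inv(W) @ z, where W is the expected
    within-class covariance."""
    classes = np.unique(y)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        W += np.cov(X[y == c], rowvar=False, bias=True)
    W /= len(classes)
    W += reg * np.eye(d)                 # regularize before inversion
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B

# usage sketch: project train and test vectors, then feed them to a linear SVM
# B = wccn_projection(X_train, y_train); X_train_wccn = X_train @ B
```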


Journal ArticleDOI
TL;DR: A neural network particle finding algorithm and a new four-frame predictive tracking algorithm are proposed for three-dimensional Lagrangian particle tracking (LPT) and the best algorithms are verified to work in a real experimental environment.
Abstract: A neural network particle finding algorithm and a new four-frame predictive tracking algorithm are proposed for three-dimensional Lagrangian particle tracking (LPT). A quantitative comparison of these and other algorithms commonly used in three-dimensional LPT is presented. Weighted averaging, one-dimensional and two-dimensional Gaussian fitting, and the neural network scheme are considered for determining particle centers in digital camera images. When the signal to noise ratio is high, the one-dimensional Gaussian estimation scheme is shown to achieve a good combination of accuracy and efficiency, while the neural network approach provides greater accuracy when the images are noisy. The effect of camera placement on both the yield and accuracy of three-dimensional particle positions is investigated, and it is shown that at least one camera must be positioned at a large angle with respect to the other cameras to minimize errors. Finally, the problem of tracking particles in time is studied. The nearest neighbor algorithm is compared with a three-frame predictive algorithm and two four-frame algorithms. These four algorithms are applied to particle tracks generated by direct numerical simulation both with and without a method to resolve tracking conflicts. The new four-frame predictive algorithm with no conflict resolution is shown to give the best performance. Finally, the best algorithms are verified to work in a real experimental environment.

439 citations
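
A minimal sketch of the baseline nearest-neighbor frame-to-frame linking step that the predictive trackers in this paper are compared against; the greedy closest-pairs-first conflict resolution and the displacement threshold are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_link(frame_a, frame_b, max_disp=2.0):
    """Greedily link particles in frame_a to their nearest neighbors in
    frame_b, enforcing one-to-one matches within a maximum displacement.
    Returns a list of (index_in_a, index_in_b) pairs."""
    tree = cKDTree(frame_b)
    dists, idx = tree.query(frame_a, k=1)
    order = np.argsort(dists)            # resolve conflicts: closest pairs first
    used_b, links = set(), []
    for i in order:
        j = idx[i]
        if dists[i] <= max_disp and j not in used_b:
            links.append((int(i), int(j)))
            used_b.add(j)
    return links

# toy usage: particles drift by a small random amount between frames
rng = np.random.default_rng(0)
a = rng.uniform(0, 100, size=(50, 3))
b = a + rng.normal(scale=0.3, size=a.shape)
print(len(nearest_neighbor_link(a, b)), "links")
```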


Journal ArticleDOI
01 Feb 2006
TL;DR: This paper presents an online feature selection algorithm using genetic programming (GP) that simultaneously selects a good subset of features and constructs a classifier using the selected features and produces a feature ranking scheme.
Abstract: This paper presents an online feature selection algorithm using genetic programming (GP). The proposed GP methodology simultaneously selects a good subset of features and constructs a classifier using the selected features. For a c-class problem, it provides a classifier having c trees. In this context, we introduce two new crossover operations to suit the feature selection process. As a byproduct, our algorithm produces a feature ranking scheme. We tested our method on several data sets having dimensions varying from 4 to 7129. We compared the performance of our method with results available in the literature and found that the proposed method produces consistently good results. To demonstrate the robustness of the scheme, we studied its effectiveness on data sets with known (synthetically added) redundant/bad features.

313 citations


Journal ArticleDOI
TL;DR: A fast agglomerative clustering method using an approximate nearest neighbor graph for reducing the number of distance calculations and a relatively small neighborhood size is sufficient to maintain the quality close to that of the full search.
Abstract: We propose a fast agglomerative clustering method using an approximate nearest neighbor graph for reducing the number of distance calculations. The time complexity of the algorithm is improved from O(τN²) to O(τN log N) at the cost of a slight increase in distortion; here, τ denotes the number of nearest neighbor updates required at each iteration. According to the experiments, a relatively small neighborhood size is sufficient to maintain the quality close to that of the full search.

287 citations


Journal ArticleDOI
TL;DR: In this article, the existence of a compact global random attractor within the set of tempered random bounded sets is proved, and pulled-back bounded sets of initial data are shown to converge under the forward flow to a random compact invariant set.
Abstract: We consider a one-dimensional lattice with diffusive nearest neighbor interaction, a dissipative nonlinear reaction term and additive independent white noise at each node. We prove the existence of a compact global random attractor within the set of tempered random bounded sets. An interesting feature of this is that, even though the spatial domain is unbounded and the solution operator is not smoothing or compact, pulled back bounded sets of initial data converge under the forward flow to a random compact invariant set.

275 citations


Proceedings Article
04 Dec 2006
TL;DR: This paper proposes a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin, and reduces the risks of overfitting.
Abstract: Metric learning has been shown to significantly improve the accuracy of k-nearest neighbor (kNN) classification. In problems involving thousands of features, distance learning algorithms cannot be used due to overfitting and high computational complexity. In such cases, previous work has relied on a two-step solution: first apply dimensionality reduction methods to the data, and then learn a metric in the resulting low-dimensional subspace. In this paper we show that better classification performance can be achieved by unifying the objectives of dimensionality reduction and metric learning. We propose a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin. This projection is defined by a significantly smaller number of parameters than metrics learned in input space, and thus our optimization reduces the risks of overfitting. Theory and results are presented for both a linear as well as a kernelized version of the algorithm. Overall, we achieve classification rates similar, and in several cases superior, to those of support vector machines.

257 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: This paper studies k-NN monitoring in road networks, where the distance between a query and a data object is determined by the length of the shortest path connecting them, and proposes two methods that can handle arbitrary object and query moving patterns, as well as fluctuations of edge weights.
Abstract: Recent research has focused on continuous monitoring of nearest neighbors (NN) in highly dynamic scenarios, where the queries and the data objects move frequently and arbitrarily. All existing methods, however, assume the Euclidean distance metric. In this paper we study k-NN monitoring in road networks, where the distance between a query and a data object is determined by the length of the shortest path connecting them. We propose two methods that can handle arbitrary object and query moving patterns, as well as fluctuations of edge weights. The first one maintains the query results by processing only updates that may invalidate the current NN sets. The second method follows the shared execution paradigm to reduce the processing time. In particular, it groups together the queries that fall in the path between two consecutive intersections in the network, and produces their results by monitoring the NN sets of these intersections. We experimentally verify the applicability of the proposed techniques to continuous monitoring of large data and query sets.
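
A minimal sketch of a single-snapshot k-NN computation by network distance (incremental Dijkstra expansion from the query vertex), which is the primitive that continuous-monitoring methods such as these maintain under object and edge-weight updates; networkx and the toy grid "road network" are illustrative assumptions, not the paper's system.

```python
import heapq
import networkx as nx

def knn_in_road_network(G, query_node, object_nodes, k=3, weight="length"):
    """Expand from the query with Dijkstra and report the first k data
    objects reached; distance is shortest-path length, not Euclidean."""
    objects = set(object_nodes)
    dist = {query_node: 0.0}
    heap = [(0.0, query_node)]
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u in objects:
            result.append((u, d))
            objects.discard(u)
        for v, edata in G[u].items():
            nd = d + edata.get(weight, 1.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# toy usage on a grid-like "road network"
G = nx.grid_2d_graph(5, 5)
nx.set_edge_attributes(G, 1.0, "length")
print(knn_in_road_network(G, (2, 2), [(0, 0), (4, 4), (2, 4), (1, 1)], k=2))
```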

Journal ArticleDOI
01 Sep 2006
TL;DR: This paper proposes algorithms for k nearest and reverse k nearest neighbor queries on the current and anticipated future positions of points moving continuously in the plane based on the indexing of object positions represented as linear functions of time.
Abstract: With the continued proliferation of wireless communications and advances in positioning technologies, algorithms for efficiently answering queries about large populations of moving objects are gaining interest. This paper proposes algorithms for k nearest and reverse k nearest neighbor queries on the current and anticipated future positions of points moving continuously in the plane. The former type of query returns k objects nearest to a query object for each time point during a time interval, while the latter returns the objects that have a specified query object as one of their k closest neighbors, again for each time point during a time interval. In addition, algorithms for so-called persistent and continuous variants of these queries are provided. The algorithms are based on the indexing of object positions represented as linear functions of time. The results of empirical performance experiments are reported.

Journal ArticleDOI
TL;DR: In order to optimize the accuracy of the nearest-neighbor classification rule, a weighted distance is proposed, along with algorithms to automatically learn the corresponding weights, which are specific for each class and feature.
Abstract: In order to optimize the accuracy of the nearest-neighbor classification rule, a weighted distance is proposed, along with algorithms to automatically learn the corresponding weights. These weights may be specific for each class and feature, for each individual prototype, or for both. The learning algorithms are derived by (approximately) minimizing the leaving-one-out classification error of the given training set. The proposed approach is assessed through a series of experiments with UCI/STATLOG corpora, as well as with a more specific task of text classification which entails very sparse data representation and huge dimensionality. In all these experiments, the proposed approach shows a uniformly good behavior, with results comparable to or better than state-of-the-art results published with the same data so far.
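
A minimal sketch of classification with a class- and feature-dependent weighted distance. Here the weight matrix is set by hand; the paper learns it by (approximately) minimizing leave-one-out error, which is not reproduced.

```python
import numpy as np

def weighted_nn_classify(X_train, y_train, x, class_feature_weights):
    """1-NN with a per-class, per-feature weighted distance:
    d(x, p) = sqrt(sum_j (w[class(p), j] * (x_j - p_j))^2)."""
    w = class_feature_weights[y_train]          # (n_train, n_features)
    d = np.sqrt(((w * (X_train - x)) ** 2).sum(axis=1))
    return y_train[np.argmin(d)]

# toy usage with 2 classes and 3 features; the weights would normally be learned
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
W = np.array([[2.0, 0.5, 0.5],    # class 0 emphasizes feature 0
              [2.0, 0.5, 0.5]])   # class 1 likewise
print(weighted_nn_classify(X, y, np.array([0.3, -1.0, 0.2]), W))
```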

Journal ArticleDOI
TL;DR: It is shown that the kNN method inherently introduces a systematic error in melting point prediction, and much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well-captured by molecular descriptors.
Abstract: We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE) = 46.3 ...
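
A minimal sketch of the exponential weighting scheme for k-NN regression (the prediction is an exponentially distance-weighted average of the neighbors' values), assuming scikit-learn; the random "descriptors", k, and the decay parameter alpha are placeholders, not the paper's descriptors or optimized settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_exponential_regression(X_train, y_train, X_query, k=5, alpha=1.0):
    """Predict y(q) as an exponentially distance-weighted average of the
    k nearest neighbors: w_i = exp(-alpha * d_i), y_hat = sum(w_i y_i) / sum(w_i)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dists, idx = nn.kneighbors(np.atleast_2d(X_query))
    w = np.exp(-alpha * dists)
    return (w * y_train[idx]).sum(axis=1) / w.sum(axis=1)

# toy usage: stand-in "descriptors" X and "melting points" y
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 40 + 300 + rng.normal(scale=5, size=500)
print(knn_exponential_regression(X, y, X[:3], k=5))
```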

Journal ArticleDOI
TL;DR: A novel method is proposed to compute the cluster radius threshold and a powerful clustering-based method is presented for the unsupervised intrusion detection (CBUID).

Proceedings ArticleDOI
27 Jun 2006
TL;DR: This paper proposes the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time and uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits.
Abstract: The reverse k-nearest neighbor (RkNN) problem, i.e. finding all objects in a data set the k-nearest neighbors of which include a specified query object, is a generalization of the reverse 1-nearest neighbor problem which has received increasing attention recently. Many industrial and scientific applications call for solutions of the RkNN problem in arbitrary metric spaces where the data objects are not Euclidean and only a metric distance function is given for specifying object similarity. Usually, these applications need a solution for the generalized problem where the value of k is not known in advance and may change from query to query. However, existing approaches, except one, are designed for the specific R1NN problem. In addition - to the best of our knowledge - all previously proposed methods, especially the one for generalized RkNN search, are only applicable to Euclidean vector data but not for general metric objects. In this paper, we propose the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time. Our approach uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits. In particular, we approximate the k-nearest neighbor distance for each data object by upper and lower bounds using two functions of only two parameters each. Thus, our method does not generate any considerable storage overhead. We show in a broad experimental evaluation on real-world data the scalability and the usability of our novel approach.
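
For reference, a brute-force sketch of the RkNN query semantics in an arbitrary metric space. The paper's contribution is precisely to avoid this O(n²) computation using conservative and progressive distance approximations in a metric index, which is not reproduced here.

```python
import numpy as np

def reverse_knn(data, dist, query, k=2):
    """Brute-force RkNN: return indices i such that the query is among the
    k nearest neighbors of data[i] (with respect to the other data objects
    plus the query). Semantics only, not an efficient algorithm."""
    n = len(data)
    result = []
    for i in range(n):
        d_to_query = dist(data[i], query)
        # count data objects strictly closer to data[i] than the query is
        closer = sum(1 for j in range(n)
                     if j != i and dist(data[i], data[j]) < d_to_query)
        if closer < k:
            result.append(i)
    return result

# toy usage with an arbitrary metric (here, Manhattan distance)
rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
manhattan = lambda a, b: np.abs(a - b).sum()
print(reverse_knn(pts, manhattan, np.array([0.5, 0.5]), k=3))
```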

Proceedings ArticleDOI
21 May 2006
TL;DR: A new data structure is presented that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension and finds a (1+ε)-approximate nearest neighbor in time O(log n) + (1/ε)^O(1).
Abstract: We present a new data structure that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension. Our data structure has linear size and supports insertions and deletions in O(log n) time, and finds a (1+ε)-approximate nearest neighbor in time O(log n) + (1/ε)^O(1). The search and update times hide multiplicative factors that depend on the doubling dimension; the space does not. These performance times are independent of the aspect ratio (or spread) of the points.

Journal ArticleDOI
TL;DR: The k-NN technique is a competitive alternative to other techniques to develop pedotransfer functions (PTFs), especially since redevelopment of PTFs is not necessarily needed as new data become available.
Abstract: Nonparametric approaches are being used in various fields to address classification type problems, as well as to estimate continuous variables. One type of nonparametric lazy learning algorithm, the k-nearest neighbor (k-NN) algorithm, has been applied to estimate water retention at −33- and −1500-kPa matric potentials. Performance of the algorithm has subsequently been tested against estimations made by a neural network (NNet) model, developed using the same data and input soil attributes. We used a hierarchical set of inputs using soil texture, bulk density (Db), and organic matter (OM) content to avoid possible bias toward one set of inputs, and varied the size of the data set used to develop the NNet models and to run the k-NN estimation algorithms. Different ‘design-parameter’ settings, analogous to model parameters, have been optimized. The k-NN technique showed little sensitivity to potential suboptimal settings in terms of how many nearest soils were selected and how those were weighed while formulating the output of the algorithm, as long as extremes were avoided. The optimal settings were, however, dependent on the size of the development/reference data set. The nonparametric k-NN technique performed mostly equally well with the NNet models, in terms of root-mean-squared residuals (RMSRs) and mean residuals (MRs). Gradual reduction of the data set size from 1600 to 100 resulted in only a slight loss of accuracy for both the k-NN and NNet approaches. The k-NN technique is a competitive alternative to other techniques to develop pedotransfer functions (PTFs), especially since redevelopment of PTFs is not necessarily needed as new data become available.

Journal ArticleDOI
TL;DR: A composite power law behavior for both the average nearest neighbor's degree and average clustering coefficient as a function of the vertex degree is found, which implies the existence of different functional classes of vertices.
Abstract: We investigate the nature of written human language within the framework of complex network theory. In particular, we analyze the topology of Orwell's 1984 focusing on the local properties of the network, such as the properties of the nearest neighbors and the clustering coefficient. We find a composite power law behavior for both the average nearest neighbor's degree and average clustering coefficient as a function of the vertex degree. This implies the existence of different functional classes of vertices. Furthermore, we find that the second order vertex correlations are an essential component of the network architecture. To model our empirical results we extend a previously introduced model for language due to Dorogovtsev and Mendes. We propose an accelerated growing network model that contains three growth mechanisms: linear preferential attachment, local preferential attachment, and the random growth of a predetermined small finite subset of initial vertices. We find that with these elementary stochastic rules we are able to produce a network showing syntactic-like structures.
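
A minimal networkx sketch of the two local quantities studied here (average nearest-neighbor degree and average clustering coefficient as functions of vertex degree); the Barabási–Albert graph is only a stand-in for the word network of 1984.

```python
from collections import defaultdict
import networkx as nx

def local_properties_by_degree(G):
    """Average nearest-neighbor degree k_nn(k) and clustering C(k),
    averaged over all vertices of degree k."""
    knn = nx.average_neighbor_degree(G)    # per-node average neighbor degree
    clus = nx.clustering(G)                # per-node clustering coefficient
    by_k = defaultdict(lambda: [0.0, 0.0, 0])
    for v, k in G.degree():
        by_k[k][0] += knn[v]
        by_k[k][1] += clus[v]
        by_k[k][2] += 1
    return {k: (s_knn / n, s_c / n) for k, (s_knn, s_c, n) in sorted(by_k.items())}

# toy usage on a scale-free stand-in graph
G = nx.barabasi_albert_graph(2000, 2, seed=0)
for k, (knn_k, c_k) in list(local_properties_by_degree(G).items())[:5]:
    print(f"k={k:3d}  k_nn={knn_k:6.2f}  C={c_k:.4f}")
```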

Proceedings Article
04 Dec 2006
TL;DR: A novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors is introduced.
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This work shows how to convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy.
Abstract: For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
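
A minimal sketch of the anytime idea: scan the training set, keep the best-so-far nearest neighbor, and return its label whenever the time budget expires. The random scan order is an illustrative placeholder for the ordering heuristics such a method would actually rely on.

```python
import time
import numpy as np

def anytime_1nn(X_train, y_train, x, budget_seconds, order=None):
    """Scan training examples until the time budget expires; at any point
    the prediction is the label of the closest example seen so far."""
    if order is None:
        order = np.random.default_rng(0).permutation(len(X_train))
    deadline = time.perf_counter() + budget_seconds
    best_d, best_label = float("inf"), None
    examined = 0
    for i in order:
        d = np.linalg.norm(X_train[i] - x)
        if d < best_d:
            best_d, best_label = d, y_train[i]
        examined += 1
        if time.perf_counter() >= deadline:
            break                          # interrupted: return best so far
    return best_label, examined

# toy usage: a tight budget examines only part of the data, a larger one scans more
rng = np.random.default_rng(1)
X = rng.normal(size=(200_000, 16))
y = (X[:, 0] > 0).astype(int)
print(anytime_1nn(X, y, X[0] + 0.01, budget_seconds=0.005))
print(anytime_1nn(X, y, X[0] + 0.01, budget_seconds=1.0))
```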

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A new data reduction algorithm that iteratively selects some samples and ignores others that can be absorbed, or represented, by those selected, and can achieve consistency, or asymptotic Bayes-risk efficiency, under certain conditions.
Abstract: In this paper, we propose a new data reduction algorithm that iteratively selects some samples and ignores others that can be absorbed, or represented, by those selected. This algorithm differs from the condensed nearest neighbor (CNN) rule in its employment of a strong absorption criterion, in contrast to the weak criterion employed by CNN; hence, it is called the generalized CNN (GCNN) algorithm. The new criterion allows GCNN to incorporate CNN as a special case, and can achieve consistency, or asymptotic Bayes-risk efficiency, under certain conditions. GCNN, moreover, can yield significantly better accuracy than other instance-based data reduction methods. We demonstrate the last claim through experiments on five datasets, some of which contain a very large number of samples.
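
The abstract does not spell out GCNN's strong absorption criterion, so the sketch below shows the classic weak-criterion CNN rule that it generalizes: a sample is absorbed if the prototypes collected so far already classify it correctly.

```python
import numpy as np

def condensed_nn(X, y, rng=None):
    """Hart's condensed nearest neighbor (CNN) rule, the weak-absorption
    baseline that GCNN generalizes: add a sample to the prototype set only
    if the current prototypes misclassify it; iterate until stable."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(len(X))
    proto = [order[0]]                        # seed with one sample
    changed = True
    while changed:
        changed = False
        for i in order:
            P = X[proto]
            nearest = proto[int(np.argmin(np.linalg.norm(P - X[i], axis=1)))]
            if y[nearest] != y[i]:            # not absorbed -> keep it
                proto.append(i)
                changed = True
    return np.array(sorted(set(proto)))

# toy usage: two well-separated Gaussian classes condense to a few prototypes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (300, 2)), rng.normal(2, 1, (300, 2))])
y = np.array([0] * 300 + [1] * 300)
print("kept", len(condensed_nn(X, y)), "of", len(X), "samples")
```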

01 Jan 2006
TL;DR: This work proposes ClustKnn, a simple and intuitive algorithm that is well suited for large data sets; comparison with a number of other popular CF algorithms shows that, apart from being highly scalable and intuitive, ClustKnn provides very good recommendation accuracy as well.
Abstract: Collaborative Filtering (CF)-based recommender systems are indispensable tools to find items of interest from the unmanageable number of available items. Moreover, companies who deploy a CF-based recommender system may be able to increase revenue by drawing customers’ attention to items that they are likely to buy. However, the sheer number of customers and items typical in e-commerce systems demand specially designed CF algorithms that can gracefully cope with the vast size of the data. Many algorithms proposed thus far, where the principal concern is recommendation quality, may be too expensive to operate in a large-scale system. We propose ClustKnn, a simple and intuitive algorithm that is well suited for large data sets. The method first compresses data tremendously by building a straightforward but efficient clustering model. Recommendations are then generated quickly by using a simple nearest-neighbor-based approach. We demonstrate the feasibility of ClustKnn both analytically and empirically. We also show, by comparing with a number of other popular CF algorithms, that apart from being highly scalable and intuitive, ClustKnn provides very good recommendation accuracy as well.

Journal ArticleDOI
TL;DR: A strong correlation is found between the predicted melting temperatures of RNA sequences and the optimal growth temperatures of the host organism, indicating that organisms that live at higher temperatures have evolved RNA sequences with higher melting temperatures.
Abstract: A complete set of nearest neighbor parameters to predict the enthalpy change of RNA secondary structure formation was derived. These parameters can be used with available free energy nearest neighbor parameters to extend the secondary structure prediction of RNA sequences to temperatures other than 37°C. The parameters were tested by predicting the secondary structures of sequences with known secondary structure that are from organisms with known optimal growth temperatures. Compared with the previous set of enthalpy nearest neighbor parameters, the sensitivity of base pair prediction improved from 65.2 to 68.9% at optimal growth temperatures ranging from 10 to 60°C. Base pair probabilities were predicted with a partition function and the positive predictive value of structure prediction is 90.4% when considering the base pairs in the lowest free energy structure with pairing probability of 0.99 or above. Moreover, a strong correlation is found between the predicted melting temperatures of RNA sequences and the optimal growth temperatures of the host organism. This indicates that organisms that live at higher temperatures have evolved RNA sequences with higher melting temperatures.

Journal ArticleDOI
TL;DR: This paper considers the ranges as (hyper)rectangles, proposes efficient in-memory processing and secondary-memory pruning techniques for RNN queries in both 2D and high-dimensional spaces, and devises an auxiliary solution-based index, EXO-tree, to speed up any type of NN query.
Abstract: A range nearest-neighbor (RNN) query retrieves the nearest neighbor (NN) for every point in a range. It is a natural generalization of point and continuous nearest-neighbor queries and has many applications. In this paper, we consider the ranges as (hyper)rectangles and propose efficient in-memory processing and secondary memory pruning techniques for RNN queries in both 2D and high-dimensional spaces. These techniques are generalized for kRNN queries, which return the k nearest neighbors for every point in the range. In addition, we devise an auxiliary solution-based index EXO-tree to speed up any type of NN query. EXO-tree is orthogonal to any existing NN processing algorithm and, thus, can be transparently integrated. An extensive empirical study was conducted to evaluate the CPU and I/O performance of these techniques, and the study showed that they are efficient and robust under various data sets, query ranges, numbers of nearest neighbors, dimensions, and cache sizes.

Journal ArticleDOI
01 Aug 2006
TL;DR: BD-PCA supplemented with an assembled matrix distance (AMD) metric is proposed, which can be used for image feature extraction by reducing the dimensionality in both column and row directions.
Abstract: Principal component analysis (PCA) has been very successful in image recognition. Recent research on PCA-based methods has mainly concentrated on two issues, namely: 1) feature extraction and 2) classification. This paper proposes to deal with these two issues simultaneously by using bidirectional PCA (BD-PCA) supplemented with an assembled matrix distance (AMD) metric. For feature extraction, BD-PCA is proposed, which can be used for image feature extraction by reducing the dimensionality in both column and row directions. For classification, an AMD metric is presented to calculate the distance between two feature matrices and then the nearest neighbor and nearest feature line classifiers are used for image recognition. The results of the experiments show the efficiency of BD-PCA with AMD metric in image recognition.

Proceedings ArticleDOI
28 Mar 2006
TL;DR: Compression-based methods are not a "parameter free" magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms.
Abstract: The use of compression algorithms in machine learning tasks such as clustering and classification has appeared in a variety of fields, sometimes with the promise of reducing problems of explicit feature selection. The theoretical justification for such methods has been founded on an upper bound on Kolmogorov complexity and an idealized information space. An alternate view shows compression algorithms implicitly map strings into implicit feature space vectors, and compression-based similarity measures compute similarity within these feature spaces. Thus, compression-based methods are not a "parameter free" magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms. To underscore this point, we find theoretical and empirical connections between traditional machine learning vector models and compression, encouraging cross-fertilization in future work.
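
A minimal sketch of the kind of compression-based similarity measure discussed here: the normalized compression distance computed with zlib, used inside a 1-NN decision. The toy "language identification" data are illustrative, not the paper's experiments.

```python
import zlib

def c(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: a practical stand-in for similarity
    in the implicit feature space induced by the compressor."""
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def nn_classify(query: bytes, labeled):
    """1-NN under NCD over (text, label) pairs."""
    return min(labeled, key=lambda item: ncd(query, item[0]))[1]

# toy usage: classify a string by which labeled example it compresses best with
train = [(b"the cat sat on the mat and the dog sat too", "english"),
         (b"der hund sitzt auf der matte und die katze auch", "german")]
print(nn_classify(b"the dog and the cat sat on the mat", train))
```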

Journal ArticleDOI
TL;DR: A new method for neighborhood size selection is proposed, based on the concept of statistical confidence, which locally adjusts the number of nearest neighbors until a satisfactory level of confidence is reached.

Journal ArticleDOI
TL;DR: This paper presents the first algorithms for efficient RNN search in generic metric spaces that require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality.
Abstract: Given a set D of objects, a reverse nearest neighbor (RNN) query returns the objects o in D such that o is closer to a query object q than to any other object in D, according to a certain similarity metric. The existing RNN solutions are not sufficient because they either 1) rely on precomputed information that is expensive to maintain in the presence of updates or 2) are applicable only when the data consists of "Euclidean objects" and similarity is measured using the L2 norm. In this paper, we present the first algorithms for efficient RNN search in generic metric spaces. Our techniques require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality. We confirm the effectiveness of the proposed methods with extensive experiments.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: A method for divergence estimation between multidimensional distributions based on nearest neighbor distances is proposed, and both the bias and the variance of this estimator are proven to vanish as sample sizes go to infinity.
Abstract: A method for divergence estimation between multidimensional distributions based on nearest neighbor distances is proposed. Given i.i.d. samples, both the bias and the variance of this estimator are proven to vanish as sample sizes go to infinity. In experiments on high-dimensional data, the nearest neighbor approach generally exhibits faster convergence compared to previous algorithms based on partitioning.
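
A sketch of a nearest-neighbor-distance divergence estimator of the commonly cited Wang/Kulkarni/Verdú form, given as an assumption to illustrate the general approach; the paper proposes its own NN-based estimator with proven bias and variance properties, and its exact form is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(X, Y, k=1):
    """k-NN estimate of KL(P||Q) from samples X ~ P and Y ~ Q
    (Wang/Kulkarni/Verdu-style estimator; an assumption, not the estimator
    defined in this particular paper):
        D_hat = (d/n) * sum_i log(nu_k(i) / rho_k(i)) + log(m / (n - 1))
    where rho_k is the k-NN distance of x_i within X (excluding x_i itself)
    and nu_k is the k-NN distance from x_i to Y."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, d = X.shape
    m = len(Y)
    # k+1 neighbors because the nearest point to x_i inside X is x_i itself
    rho = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    nu = cKDTree(Y).query(X, k=k)[0]
    nu = nu[:, -1] if nu.ndim > 1 else nu
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

# toy check: KL between N(0,1) and N(1,1) is 0.5
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(5000, 1))
Y = rng.normal(1.0, 1.0, size=(5000, 1))
print(knn_kl_divergence(X, Y, k=5))
```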