
Showing papers on "k-nearest neighbors algorithm published in 2006"


Proceedings ArticleDOI
21 Oct 2006
TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound for hashing-based algorithms recently obtained in [27].
Abstract: We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))). This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-efficient version of the algorithm, which uses dn + n log^(O(1)) n space, with a query time of dn^(O(1/c^2)). Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice.

1,486 citations
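
The sketch below is not the authors' Leech-lattice construction; it is a minimal example of the classical p-stable (Gaussian projection) LSH family for Euclidean space that this line of hashing-based work builds on. The table count, hash length, and bucket width are illustrative parameters, not values from the paper.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Minimal p-stable LSH index for approximate NN in Euclidean space.
    h(x) = floor((a.x + b) / w) with Gaussian a and uniform b in [0, w)."""

    def __init__(self, dim, n_tables=10, n_bits=8, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.A = rng.standard_normal((n_tables, n_bits, dim))
        self.B = rng.uniform(0, width, size=(n_tables, n_bits))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = None

    def _keys(self, x):
        # One integer-tuple key per table.
        h = np.floor((self.A @ x + self.B) / self.width).astype(int)
        return [tuple(row) for row in h]

    def fit(self, X):
        self.points = np.asarray(X, dtype=float)
        for i, x in enumerate(self.points):
            for t, key in enumerate(self._keys(x)):
                self.tables[t][key].append(i)
        return self

    def query(self, q):
        # Collect candidates from all tables, then check true distances.
        cand = set()
        for t, key in enumerate(self._keys(np.asarray(q, dtype=float))):
            cand.update(self.tables[t].get(key, []))
        if not cand:
            return None
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[np.argmin(d)]

# toy usage
X = np.random.default_rng(1).normal(size=(2000, 32))
index = EuclideanLSH(dim=32).fit(X)
print("approx NN of X[0]:", index.query(X[0]))
```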


Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work considers visual category recognition in the framework of measuring similarities, or equivalently perceptual distances, to prototype examples of categories and proposes a hybrid of these two methods which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice.
Abstract: We consider visual category recognition in the framework of measuring similarities, or equivalently perceptual distances, to prototype examples of categories. This approach is quite flexible, and permits recognition based on color, texture, and particularly shape, in a homogeneous framework. While nearest neighbor classifiers are natural in this setting, they suffer from the problem of high variance (in bias-variance decomposition) in the case of limited sampling. Alternatively, one could use support vector machines but they involve time-consuming optimization and computation of pairwise distances. We propose a hybrid of these two methods which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice. The basic idea is to find close neighbors to a query sample and train a local support vector machine that preserves the distance function on the collection of neighbors. Our method can be applied to large, multiclass data sets for which it outperforms nearest neighbor and support vector machines, and remains efficient when the problem becomes intractable for support vector machines. A wide variety of distance functions can be used and our experiments show state-of-the-art performance on a number of benchmark data sets for shape and texture classification (MNIST, USPS, CUReT) and object recognition (Caltech-101). On Caltech-101 we achieved a correct classification rate of 59.05% (±0.56%) at 15 training images per class, and 66.23% (±0.48%) at 30 training images.

1,265 citations
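
A minimal sketch of the "train a local SVM on the query's nearest neighbors" idea, assuming scikit-learn. The value of k, the RBF kernel, and the fallback to plain voting when the neighborhood contains a single class are illustrative choices, not the paper's exact procedure (which trains the local SVM on pairwise distances).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def svm_knn_predict(X_train, y_train, X_query, k=20):
    """For each query: find its k nearest training points, train a local
    SVM on just those neighbors, and predict with it (falls back to the
    neighborhood label if only one class is present)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = []
    for x in np.atleast_2d(X_query):
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        Xl, yl = X_train[idx], y_train[idx]
        if len(np.unique(yl)) == 1:
            preds.append(yl[0])            # neighborhood is pure
        else:
            local_svm = SVC(kernel="rbf", C=10.0).fit(Xl, yl)
            preds.append(local_svm.predict(x.reshape(1, -1))[0])
    return np.array(preds)
```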


Proceedings ArticleDOI
25 Jun 2006
TL;DR: A tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points) that shows speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.
Abstract: We present a tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points). The data structure requires O(n) space regardless of the metric's structure yet maintains all performance properties of a navigating net (Krauthgamer & Lee, 2004b). If the point set has a bounded expansion constant c, which is a measure of the intrinsic dimensionality, as defined in (Karger & Ruhl, 2002), the cover tree data structure can be constructed in O(c^6 n log n) time. Furthermore, nearest neighbor queries require time only logarithmic in n, in particular O(c^12 log n) time. Our experimental results show speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.

896 citations
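
No cover tree is implemented here; as a stand-in, the sketch below shows the same usage pattern the abstract describes (build a metric tree once over n points, then answer k-NN queries in sublinear time under a supported metric) using scikit-learn's BallTree, which is a different metric tree with a comparable interface.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Build a metric tree once, then answer many kNN queries without a full scan.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))

tree = BallTree(X, metric="euclidean")     # any metric supported by BallTree works
dist, idx = tree.query(X[:5], k=3)          # 3 nearest neighbors of 5 query points
print(idx)
```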


Book
01 Mar 2006
TL;DR: This volume presents theoretical and practical discussions of nearest-neighbor (NN) methods in machine learning and examines computer vision as an application domain in which the benefit of these advanced methods is often dramatic.
Abstract: Regression and classification methods based on similarity of the input to stored examples have not been widely used in applications involving very large sets of high-dimensional data. Recent advances in computational geometry and machine learning, however, may alleviate the problems in using these methods on large data sets. This volume presents theoretical and practical discussions of nearest-neighbor (NN) methods in machine learning and examines computer vision as an application domain in which the benefit of these advanced methods is often dramatic. It brings together contributions from researchers in theory of computation, machine learning, and computer vision with the goals of bridging the gaps between disciplines and presenting state-of-the-art methods for emerging applications. The contributors focus on the importance of designing algorithms for NN search, and for the related classification, regression, and retrieval tasks, that remain efficient even as the number of points or the dimensionality of the data grows very large. The book begins with two theoretical chapters on computational geometry and then explores ways to make the NN approach practicable in machine learning applications where the dimensionality of the data and the size of the data sets make the naive methods for NN search prohibitively expensive. The final chapters describe successful applications of an NN algorithm, locality-sensitive hashing (LSH), to vision tasks.

526 citations


Proceedings Article
01 Jan 2006
TL;DR: A practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space and achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over the previous baseline.
Abstract: This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional “PCA space” and a high-dimensional “PCA-complement space.” After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement.

461 citations
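
A minimal numpy sketch of the core WCCN transform only; the paper's full recipe additionally splits the space with PCA and reweights the PCA-complement, which is omitted here. The regularization constant and the assumption of dense, moderately sized features are illustrative.

```python
import numpy as np

def wccn_projection(X, y, reg=1e-6):
    """Within-class covariance normalization (WCCN).
    Returns B such that the mapped feature B.T @ x realizes the generalized
    linear kernel k(x, z) = x.T @ inv(W) @ z, where W is the expected
    within-class covariance."""
    classes = np.unique(y)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        W += np.cov(X[y == c], rowvar=False, bias=True)
    W /= len(classes)
    W += reg * np.eye(d)                 # regularize before inversion
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B

# usage sketch: project train and test vectors, then feed them to a linear SVM
# B = wccn_projection(X_train, y_train); X_train_wccn = X_train @ B
```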


Journal ArticleDOI
TL;DR: A neural network particle finding algorithm and a new four-frame predictive tracking algorithm are proposed for three-dimensional Lagrangian particle tracking (LPT) and the best algorithms are verified to work in a real experimental environment.
Abstract: A neural network particle finding algorithm and a new four-frame predictive tracking algorithm are proposed for three-dimensional Lagrangian particle tracking (LPT). A quantitative comparison of these and other algorithms commonly used in three-dimensional LPT is presented. Weighted averaging, one-dimensional and two-dimensional Gaussian fitting, and the neural network scheme are considered for determining particle centers in digital camera images. When the signal to noise ratio is high, the one-dimensional Gaussian estimation scheme is shown to achieve a good combination of accuracy and efficiency, while the neural network approach provides greater accuracy when the images are noisy. The effect of camera placement on both the yield and accuracy of three-dimensional particle positions is investigated, and it is shown that at least one camera must be positioned at a large angle with respect to the other cameras to minimize errors. Finally, the problem of tracking particles in time is studied. The nearest neighbor algorithm is compared with a three-frame predictive algorithm and two four-frame algorithms. These four algorithms are applied to particle tracks generated by direct numerical simulation both with and without a method to resolve tracking conflicts. The new four-frame predictive algorithm with no conflict resolution is shown to give the best performance. Finally, the best algorithms are verified to work in a real experimental environment.

439 citations
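
A minimal sketch of the baseline nearest-neighbor frame-to-frame linking step that the predictive trackers in this paper are compared against; the greedy closest-pairs-first conflict resolution and the displacement threshold are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_link(frame_a, frame_b, max_disp=2.0):
    """Greedily link particles in frame_a to their nearest neighbors in
    frame_b, enforcing one-to-one matches within a maximum displacement.
    Returns a list of (index_in_a, index_in_b) pairs."""
    tree = cKDTree(frame_b)
    dists, idx = tree.query(frame_a, k=1)
    order = np.argsort(dists)            # resolve conflicts: closest pairs first
    used_b, links = set(), []
    for i in order:
        j = idx[i]
        if dists[i] <= max_disp and j not in used_b:
            links.append((int(i), int(j)))
            used_b.add(j)
    return links

# toy usage: particles drift by a small random amount between frames
rng = np.random.default_rng(0)
a = rng.uniform(0, 100, size=(50, 3))
b = a + rng.normal(scale=0.3, size=a.shape)
print(len(nearest_neighbor_link(a, b)), "links")
```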


Journal ArticleDOI
01 Feb 2006
TL;DR: This paper presents an online feature selection algorithm using genetic programming (GP) that simultaneously selects a good subset of features and constructs a classifier using the selected features and produces a feature ranking scheme.
Abstract: This paper presents an online feature selection algorithm using genetic programming (GP). The proposed GP methodology simultaneously selects a good subset of features and constructs a classifier using the selected features. For a c-class problem, it provides a classifier having c trees. In this context, we introduce two new crossover operations to suit the feature selection process. As a byproduct, our algorithm produces a feature ranking scheme. We tested our method on several data sets having dimensions varying from 4 to 7129. We compared the performance of our method with results available in the literature and found that the proposed method produces consistently good results. To demonstrate the robustness of the scheme, we studied its effectiveness on data sets with known (synthetically added) redundant/bad features.

313 citations


Journal ArticleDOI
TL;DR: A fast agglomerative clustering method using an approximate nearest neighbor graph for reducing the number of distance calculations and a relatively small neighborhood size is sufficient to maintain the quality close to that of the full search.
Abstract: We propose a fast agglomerative clustering method using an approximate nearest neighbor graph for reducing the number of distance calculations. The time complexity of the algorithm is improved from O(τN²) to O(τN log N) at the cost of a slight increase in distortion; here, τ denotes the number of nearest neighbor updates required at each iteration. According to the experiments, a relatively small neighborhood size is sufficient to maintain the quality close to that of the full search.

287 citations


Journal ArticleDOI
TL;DR: In this article, the existence of a compact global random attractor within the set of tempered random bounded sets is proved, and pulled-back bounded sets of initial data are shown to converge under the forward flow to a random compact invariant set.
Abstract: We consider a one-dimensional lattice with diffusive nearest neighbor interaction, a dissipative nonlinear reaction term and additive independent white noise at each node. We prove the existence of a compact global random attractor within the set of tempered random bounded sets. An interesting feature of this is that, even though the spatial domain is unbounded and the solution operator is not smoothing or compact, pulled back bounded sets of initial data converge under the forward flow to a random compact invariant set.

275 citations


Proceedings Article
04 Dec 2006
TL;DR: This paper proposes a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin, and reduces the risks of overfitting.
Abstract: Metric learning has been shown to significantly improve the accuracy of k-nearest neighbor (kNN) classification. In problems involving thousands of features, distance learning algorithms cannot be used due to overfitting and high computational complexity. In such cases, previous work has relied on a two-step solution: first apply dimensionality reduction methods to the data, and then learn a metric in the resulting low-dimensional subspace. In this paper we show that better classification performance can be achieved by unifying the objectives of dimensionality reduction and metric learning. We propose a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin. This projection is defined by a significantly smaller number of parameters than metrics learned in input space, and thus our optimization reduces the risks of overfitting. Theory and results are presented for both a linear as well as a kernelized version of the algorithm. Overall, we achieve classification rates similar, and in several cases superior, to those of support vector machines.

257 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: This paper studies k-NN monitoring in road networks, where the distance between a query and a data object is determined by the length of the shortest path connecting them, and proposes two methods that can handle arbitrary object and query moving patterns, as well as fluctuations of edge weights.
Abstract: Recent research has focused on continuous monitoring of nearest neighbors (NN) in highly dynamic scenarios, where the queries and the data objects move frequently and arbitrarily. All existing methods, however, assume the Euclidean distance metric. In this paper we study k-NN monitoring in road networks, where the distance between a query and a data object is determined by the length of the shortest path connecting them. We propose two methods that can handle arbitrary object and query moving patterns, as well as fluctuations of edge weights. The first one maintains the query results by processing only updates that may invalidate the current NN sets. The second method follows the shared execution paradigm to reduce the processing time. In particular, it groups together the queries that fall in the path between two consecutive intersections in the network, and produces their results by monitoring the NN sets of these intersections. We experimentally verify the applicability of the proposed techniques to continuous monitoring of large data and query sets.
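
A minimal sketch of a single-snapshot k-NN computation by network distance (incremental Dijkstra expansion from the query vertex), which is the primitive that continuous-monitoring methods such as these maintain under object and edge-weight updates; networkx and the toy grid "road network" are illustrative assumptions, not the paper's system.

```python
import heapq
import networkx as nx

def knn_in_road_network(G, query_node, object_nodes, k=3, weight="length"):
    """Expand from the query with Dijkstra and report the first k data
    objects reached; distance is shortest-path length, not Euclidean."""
    objects = set(object_nodes)
    dist = {query_node: 0.0}
    heap = [(0.0, query_node)]
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        if u in objects:
            result.append((u, d))
            objects.discard(u)
        for v, edata in G[u].items():
            nd = d + edata.get(weight, 1.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# toy usage on a grid-like "road network"
G = nx.grid_2d_graph(5, 5)
nx.set_edge_attributes(G, 1.0, "length")
print(knn_in_road_network(G, (2, 2), [(0, 0), (4, 4), (2, 4), (1, 1)], k=2))
```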

Journal ArticleDOI
01 Sep 2006
TL;DR: This paper proposes algorithms for k nearest and reverse k nearest neighbor queries on the current and anticipated future positions of points moving continuously in the plane based on the indexing of object positions represented as linear functions of time.
Abstract: With the continued proliferation of wireless communications and advances in positioning technologies, algorithms for efficiently answering queries about large populations of moving objects are gaining interest. This paper proposes algorithms for k nearest and reverse k nearest neighbor queries on the current and anticipated future positions of points moving continuously in the plane. The former type of query returns k objects nearest to a query object for each time point during a time interval, while the latter returns the objects that have a specified query object as one of their k closest neighbors, again for each time point during a time interval. In addition, algorithms for so-called persistent and continuous variants of these queries are provided. The algorithms are based on the indexing of object positions represented as linear functions of time. The results of empirical performance experiments are reported.

Journal ArticleDOI
TL;DR: In order to optimize the accuracy of the nearest-neighbor classification rule, a weighted distance is proposed, along with algorithms to automatically learn the corresponding weights, which are specific for each class and feature.
Abstract: In order to optimize the accuracy of the nearest-neighbor classification rule, a weighted distance is proposed, along with algorithms to automatically learn the corresponding weights. These weights may be specific for each class and feature, for each individual prototype, or for both. The learning algorithms are derived by (approximately) minimizing the leaving-one-out classification error of the given training set. The proposed approach is assessed through a series of experiments with UCI/STATLOG corpora, as well as with a more specific task of text classification which entails very sparse data representation and huge dimensionality. In all these experiments, the proposed approach shows a uniformly good behavior, with results comparable to or better than state-of-the-art results published with the same data so far.
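
A minimal sketch of classification with a class- and feature-dependent weighted distance. Here the weight matrix is set by hand; the paper learns it by (approximately) minimizing leave-one-out error, which is not reproduced.

```python
import numpy as np

def weighted_nn_classify(X_train, y_train, x, class_feature_weights):
    """1-NN with a per-class, per-feature weighted distance:
    d(x, p) = sqrt(sum_j (w[class(p), j] * (x_j - p_j))^2)."""
    w = class_feature_weights[y_train]          # (n_train, n_features)
    d = np.sqrt(((w * (X_train - x)) ** 2).sum(axis=1))
    return y_train[np.argmin(d)]

# toy usage with 2 classes and 3 features; the weights would normally be learned
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
W = np.array([[2.0, 0.5, 0.5],    # class 0 emphasizes feature 0
              [2.0, 0.5, 0.5]])   # class 1 likewise
print(weighted_nn_classify(X, y, np.array([0.3, -1.0, 0.2]), W))
```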

Journal ArticleDOI
TL;DR: It is shown that the kNN method inherently introduces a systematic error in melting point prediction, and much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well-captured by molecular descriptors.
Abstract: We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE) = 46.3 ...
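
A minimal sketch of the exponential weighting scheme for k-NN regression (the prediction is an exponentially distance-weighted average of the neighbors' values), assuming scikit-learn; the random "descriptors", k, and the decay parameter alpha are placeholders, not the paper's descriptors or optimized settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_exponential_regression(X_train, y_train, X_query, k=5, alpha=1.0):
    """Predict y(q) as an exponentially distance-weighted average of the
    k nearest neighbors: w_i = exp(-alpha * d_i), y_hat = sum(w_i y_i) / sum(w_i)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dists, idx = nn.kneighbors(np.atleast_2d(X_query))
    w = np.exp(-alpha * dists)
    return (w * y_train[idx]).sum(axis=1) / w.sum(axis=1)

# toy usage: stand-in "descriptors" X and "melting points" y
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 40 + 300 + rng.normal(scale=5, size=500)
print(knn_exponential_regression(X, y, X[:3], k=5))
```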

Journal ArticleDOI
TL;DR: A novel method is proposed to compute the cluster radius threshold and a powerful clustering-based method is presented for the unsupervised intrusion detection (CBUID).

Proceedings ArticleDOI
27 Jun 2006
TL;DR: This paper proposes the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time and uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits.
Abstract: The reverse k-nearest neighbor (RkNN) problem, i.e. finding all objects in a data set the k-nearest neighbors of which include a specified query object, is a generalization of the reverse 1-nearest neighbor problem which has received increasing attention recently. Many industrial and scientific applications call for solutions of the RkNN problem in arbitrary metric spaces where the data objects are not Euclidean and only a metric distance function is given for specifying object similarity. Usually, these applications need a solution for the generalized problem where the value of k is not known in advance and may change from query to query. However, existing approaches, except one, are designed for the specific R1NN problem. In addition - to the best of our knowledge - all previously proposed methods, especially the one for generalized RkNN search, are only applicable to Euclidean vector data but not for general metric objects. In this paper, we propose the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time. Our approach uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits. In particular, we approximate the k-nearest neighbor distance for each data object by upper and lower bounds using two functions of only two parameters each. Thus, our method does not generate any considerable storage overhead. We show in a broad experimental evaluation on real-world data the scalability and the usability of our novel approach.
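
For reference, a brute-force sketch of the RkNN query semantics in an arbitrary metric space. The paper's contribution is precisely to avoid this O(n²) computation using conservative and progressive distance approximations in a metric index, which is not reproduced here.

```python
import numpy as np

def reverse_knn(data, dist, query, k=2):
    """Brute-force RkNN: return indices i such that the query is among the
    k nearest neighbors of data[i] (with respect to the other data objects
    plus the query). Semantics only, not an efficient algorithm."""
    n = len(data)
    result = []
    for i in range(n):
        d_to_query = dist(data[i], query)
        # count data objects strictly closer to data[i] than the query is
        closer = sum(1 for j in range(n)
                     if j != i and dist(data[i], data[j]) < d_to_query)
        if closer < k:
            result.append(i)
    return result

# toy usage with an arbitrary metric (here, Manhattan distance)
rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
manhattan = lambda a, b: np.abs(a - b).sum()
print(reverse_knn(pts, manhattan, np.array([0.5, 0.5]), k=3))
```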

Proceedings ArticleDOI
21 May 2006
TL;DR: A new data structure is presented that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension and finds a (1+ε)-approximate nearest neighbor in time O(log n) + (1/ε)^O(1).
Abstract: We present a new data structure that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension. Our data structure has linear size and supports insertions and deletions in O(log n) time, and finds a (1+ε)-approximate nearest neighbor in time O(log n) + (1/ε)^O(1). The search and update times hide multiplicative factors that depend on the doubling dimension; the space does not. These performance times are independent of the aspect ratio (or spread) of the points.

Journal ArticleDOI
TL;DR: The k-NN technique is a competitive alternative to other techniques to develop pedotransfer functions (PTFs), especially since redevelopment of PTFs is not necessarily needed as new data become available.
Abstract: Nonparametric approaches are being used in various fields to address classification type problems, as well as to estimate continuous variables. One type of nonparametric lazy learning algorithm, the k-nearest neighbor (k-NN) algorithm, has been applied to estimate water retention at −33- and −1500-kPa matric potentials. Performance of the algorithm has subsequently been tested against estimations made by a neural network (NNet) model, developed using the same data and input soil attributes. We used a hierarchical set of inputs using soil texture, bulk density (Db), and organic matter (OM) content to avoid possible bias toward one set of inputs, and varied the size of the data set used to develop the NNet models and to run the k-NN estimation algorithms. Different ‘design-parameter’ settings, analogous to model parameters, have been optimized. The k-NN technique showed little sensitivity to potential suboptimal settings in terms of how many nearest soils were selected and how those were weighed while formulating the output of the algorithm, as long as extremes were avoided. The optimal settings were, however, dependent on the size of the development/reference data set. The nonparametric k-NN technique performed mostly equally well with the NNet models, in terms of root-mean-squared residuals (RMSRs) and mean residuals (MRs). Gradual reduction of the data set size from 1600 to 100 resulted in only a slight loss of accuracy for both the k-NN and NNet approaches. The k-NN technique is a competitive alternative to other techniques to develop pedotransfer functions (PTFs), especially since redevelopment of PTFs is not necessarily needed as new data become available.

Journal ArticleDOI
TL;DR: A composite power law behavior for both the average nearest neighbor's degree and average clustering coefficient as a function of the vertex degree is found, which implies the existence of different functional classes of vertices.
Abstract: We investigate the nature of written human language within the framework of complex network theory. In particular, we analyze the topology of Orwell's 1984 focusing on the local properties of the network, such as the properties of the nearest neighbors and the clustering coefficient. We find a composite power law behavior for both the average nearest neighbor's degree and average clustering coefficient as a function of the vertex degree. This implies the existence of different functional classes of vertices. Furthermore, we find that the second order vertex correlations are an essential component of the network architecture. To model our empirical results we extend a previously introduced model for language due to Dorogovtsev and Mendes. We propose an accelerated growing network model that contains three growth mechanisms: linear preferential attachment, local preferential attachment, and the random growth of a predetermined small finite subset of initial vertices. We find that with these elementary stochastic rules we are able to produce a network showing syntactic-like structures.
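
A minimal networkx sketch of the two local quantities studied here (average nearest-neighbor degree and average clustering coefficient as functions of vertex degree); the Barabási–Albert graph is only a stand-in for the word network of 1984.

```python
from collections import defaultdict
import networkx as nx

def local_properties_by_degree(G):
    """Average nearest-neighbor degree k_nn(k) and clustering C(k),
    averaged over all vertices of degree k."""
    knn = nx.average_neighbor_degree(G)    # per-node average neighbor degree
    clus = nx.clustering(G)                # per-node clustering coefficient
    by_k = defaultdict(lambda: [0.0, 0.0, 0])
    for v, k in G.degree():
        by_k[k][0] += knn[v]
        by_k[k][1] += clus[v]
        by_k[k][2] += 1
    return {k: (s_knn / n, s_c / n) for k, (s_knn, s_c, n) in sorted(by_k.items())}

# toy usage on a scale-free stand-in graph
G = nx.barabasi_albert_graph(2000, 2, seed=0)
for k, (knn_k, c_k) in list(local_properties_by_degree(G).items())[:5]:
    print(f"k={k:3d}  k_nn={knn_k:6.2f}  C={c_k:.4f}")
```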

Proceedings Article
04 Dec 2006
TL;DR: A novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors is introduced.
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This work shows how to convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy.
Abstract: For many real world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.
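
A minimal sketch of the anytime idea: scan the training set, keep the best-so-far nearest neighbor, and return its label whenever the time budget expires. The random scan order is an illustrative placeholder for the ordering heuristics such a method would actually rely on.

```python
import time
import numpy as np

def anytime_1nn(X_train, y_train, x, budget_seconds, order=None):
    """Scan training examples until the time budget expires; at any point
    the prediction is the label of the closest example seen so far."""
    if order is None:
        order = np.random.default_rng(0).permutation(len(X_train))
    deadline = time.perf_counter() + budget_seconds
    best_d, best_label = float("inf"), None
    examined = 0
    for i in order:
        d = np.linalg.norm(X_train[i] - x)
        if d < best_d:
            best_d, best_label = d, y_train[i]
        examined += 1
        if time.perf_counter() >= deadline:
            break                          # interrupted: return best so far
    return best_label, examined

# toy usage: a tight budget examines only part of the data, a larger one scans more
rng = np.random.default_rng(1)
X = rng.normal(size=(200_000, 16))
y = (X[:, 0] > 0).astype(int)
print(anytime_1nn(X, y, X[0] + 0.01, budget_seconds=0.005))
print(anytime_1nn(X, y, X[0] + 0.01, budget_seconds=1.0))
```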

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A new data reduction algorithm that iteratively selects some samples and ignores others that can be absorbed, or represented, by those selected, and can achieve consistency, or asymptotic Bayes-risk efficiency, under certain conditions.
Abstract: In this paper, we propose a new data reduction algorithm that iteratively selects some samples and ignores others that can be absorbed, or represented, by those selected. This algorithm differs from the condensed nearest neighbor (CNN) rule in its employment of a strong absorption criterion, in contrast to the weak criterion employed by CNN; hence, it is called the generalized CNN (GCNN) algorithm. The new criterion allows GCNN to incorporate CNN as a special case, and can achieve consistency, or asymptotic Bayes-risk efficiency, under certain conditions. GCNN, moreover, can yield significantly better accuracy than other instance-based data reduction methods. We demonstrate the last claim through experiments on five datasets, some of which contain a very large number of samples.
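
The abstract does not spell out GCNN's strong absorption criterion, so the sketch below shows the classic weak-criterion CNN rule that it generalizes: a sample is absorbed if the prototypes collected so far already classify it correctly.

```python
import numpy as np

def condensed_nn(X, y, rng=None):
    """Hart's condensed nearest neighbor (CNN) rule, the weak-absorption
    baseline that GCNN generalizes: add a sample to the prototype set only
    if the current prototypes misclassify it; iterate until stable."""
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(len(X))
    proto = [order[0]]                        # seed with one sample
    changed = True
    while changed:
        changed = False
        for i in order:
            P = X[proto]
            nearest = proto[int(np.argmin(np.linalg.norm(P - X[i], axis=1)))]
            if y[nearest] != y[i]:            # not absorbed -> keep it
                proto.append(i)
                changed = True
    return np.array(sorted(set(proto)))

# toy usage: two well-separated Gaussian classes condense to a few prototypes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (300, 2)), rng.normal(2, 1, (300, 2))])
y = np.array([0] * 300 + [1] * 300)
print("kept", len(condensed_nn(X, y)), "of", len(X), "samples")
```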

01 Jan 2006
TL;DR: This work proposes ClustKnn, a simple and intuitive algorithm that is well suited for large data sets; comparison with a number of other popular CF algorithms shows that, apart from being highly scalable and intuitive, ClustKnn provides very good recommendation accuracy as well.
Abstract: Collaborative Filtering (CF)-based recommender systems are indispensable tools to find items of interest from the unmanageable number of available items. Moreover, companies who deploy a CF-based recommender system may be able to increase revenue by drawing customers’ attention to items that they are likely to buy. However, the sheer number of customers and items typical in e-commerce systems demand specially designed CF algorithms that can gracefully cope with the vast size of the data. Many algorithms proposed thus far, where the principal concern is recommendation quality, may be too expensive to operate in a large-scale system. We propose ClustKnn, a simple and intuitive algorithm that is well suited for large data sets. The method first compresses data tremendously by building a straightforward but efficient clustering model. Recommendations are then generated quickly by using a simple nearest-neighbor-based approach. We demonstrate the feasibility of ClustKnn both analytically and empirically. We also show, by comparing with a number of other popular CF algorithms, that apart from being highly scalable and intuitive, ClustKnn provides very good recommendation accuracy as well.

Journal ArticleDOI
TL;DR: A strong correlation is found between the predicted melting temperatures of RNA sequences and the optimal growth temperatures of the host organism, indicating that organisms that live at higher temperatures have evolved RNA sequences with higher melting temperatures.
Abstract: A complete set of nearest neighbor parameters to predict the enthalpy change of RNA secondary structure formation was derived. These parameters can be used with available free energy nearest neighbor parameters to extend the secondary structure prediction of RNA sequences to temperatures other than 37°C. The parameters were tested by predicting the secondary structures of sequences with known secondary structure that are from organisms with known optimal growth temperatures. Compared with the previous set of enthalpy nearest neighbor parameters, the sensitivity of base pair prediction improved from 65.2 to 68.9% at optimal growth temperatures ranging from 10 to 60°C. Base pair probabilities were predicted with a partition function and the positive predictive value of structure prediction is 90.4% when considering the base pairs in the lowest free energy structure with pairing probability of 0.99 or above. Moreover, a strong correlation is found between the predicted melting temperatures of RNA sequences and the optimal growth temperatures of the host organism. This indicates that organisms that live at higher temperatures have evolved RNA sequences with higher melting temperatures.

Journal ArticleDOI
TL;DR: This paper considers the ranges as (hyper)rectangles, proposes efficient in-memory processing and secondary-memory pruning techniques for RNN queries in both 2D and high-dimensional spaces, and devises an auxiliary solution-based index, EXO-tree, to speed up any type of NN query.
Abstract: A range nearest-neighbor (RNN) query retrieves the nearest neighbor (NN) for every point in a range. It is a natural generalization of point and continuous nearest-neighbor queries and has many applications. In this paper, we consider the ranges as (hyper)rectangles and propose efficient in-memory processing and secondary memory pruning techniques for RNN queries in both 2D and high-dimensional spaces. These techniques are generalized for kRNN queries, which return the k nearest neighbors for every point in the range. In addition, we devise an auxiliary solution-based index EXO-tree to speed up any type of NN query. EXO-tree is orthogonal to any existing NN processing algorithm and, thus, can be transparently integrated. An extensive empirical study was conducted to evaluate the CPU and I/O performance of these techniques, and the study showed that they are efficient and robust under various data sets, query ranges, numbers of nearest neighbors, dimensions, and cache sizes.

Journal ArticleDOI
01 Aug 2006
TL;DR: BD-PCA supplemented with an assembled matrix distance (AMD) metric is proposed, which can be used for image feature extraction by reducing the dimensionality in both column and row directions.
Abstract: Principal component analysis (PCA) has been very successful in image recognition. Recent research on PCA-based methods has mainly concentrated on two issues, namely: 1) feature extraction and 2) classification. This paper proposes to deal with these two issues simultaneously by using bidirectional PCA (BD-PCA) supplemented with an assembled matrix distance (AMD) metric. For feature extraction, BD-PCA is proposed, which can be used for image feature extraction by reducing the dimensionality in both column and row directions. For classification, an AMD metric is presented to calculate the distance between two feature matrices and then the nearest neighbor and nearest feature line classifiers are used for image recognition. The results of the experiments show the efficiency of BD-PCA with AMD metric in image recognition.

Proceedings ArticleDOI
28 Mar 2006
TL;DR: Compression-based methods are not a "parameter free" magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms.
Abstract: The use of compression algorithms in machine learning tasks such as clustering and classification has appeared in a variety of fields, sometimes with the promise of reducing problems of explicit feature selection. The theoretical justification for such methods has been founded on an upper bound on Kolmogorov complexity and an idealized information space. An alternate view shows compression algorithms implicitly map strings into implicit feature space vectors, and compression-based similarity measures compute similarity within these feature spaces. Thus, compression-based methods are not a "parameter free" magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms. To underscore this point, we find theoretical and empirical connections between traditional machine learning vector models and compression, encouraging cross-fertilization in future work.
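
A minimal sketch of the kind of compression-based similarity measure discussed here: the normalized compression distance computed with zlib, used inside a 1-NN decision. The toy "language identification" data are illustrative, not the paper's experiments.

```python
import zlib

def c(s: bytes) -> int:
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: a practical stand-in for similarity
    in the implicit feature space induced by the compressor."""
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def nn_classify(query: bytes, labeled):
    """1-NN under NCD over (text, label) pairs."""
    return min(labeled, key=lambda item: ncd(query, item[0]))[1]

# toy usage: classify a string by which labeled example it compresses best with
train = [(b"the cat sat on the mat and the dog sat too", "english"),
         (b"der hund sitzt auf der matte und die katze auch", "german")]
print(nn_classify(b"the dog and the cat sat on the mat", train))
```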

Journal ArticleDOI
TL;DR: A new method for neighborhood size selection is proposed, based on the concept of statistical confidence, which locally adjusts the number of nearest neighbors until a satisfactory level of confidence is reached.

Journal ArticleDOI
TL;DR: This paper presents the first algorithms for efficient RNN search in generic metric spaces that require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality.
Abstract: Given a set D of objects, a reverse nearest neighbor (RNN) query returns the objects o in D such that o is closer to a query object q than to any other object in D, according to a certain similarity metric. The existing RNN solutions are not sufficient because they either 1) rely on precomputed information that is expensive to maintain in the presence of updates or 2) are applicable only when the data consists of "Euclidean objects" and similarity is measured using the L2 norm. In this paper, we present the first algorithms for efficient RNN search in generic metric spaces. Our techniques require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality. We confirm the effectiveness of the proposed methods with extensive experiments.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: A method for divergence estimation between multidimensional distributions based on nearest neighbor distances is proposed, and both the bias and the variance of this estimator are proven to vanish as sample sizes go to infinity.
Abstract: A method for divergence estimation between multidimensional distributions based on nearest neighbor distances is proposed. Given i.i.d. samples, both the bias and the variance of this estimator are proven to vanish as sample sizes go to infinity. In experiments on high-dimensional data, the nearest neighbor approach generally exhibits faster convergence compared to previous algorithms based on partitioning.
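
A sketch of a nearest-neighbor-distance divergence estimator of the commonly cited Wang/Kulkarni/Verdú form, given as an assumption to illustrate the general approach; the paper proposes its own NN-based estimator with proven bias and variance properties, and its exact form is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(X, Y, k=1):
    """k-NN estimate of KL(P||Q) from samples X ~ P and Y ~ Q
    (Wang/Kulkarni/Verdu-style estimator; an assumption, not the estimator
    defined in this particular paper):
        D_hat = (d/n) * sum_i log(nu_k(i) / rho_k(i)) + log(m / (n - 1))
    where rho_k is the k-NN distance of x_i within X (excluding x_i itself)
    and nu_k is the k-NN distance from x_i to Y."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, d = X.shape
    m = len(Y)
    # k+1 neighbors because the nearest point to x_i inside X is x_i itself
    rho = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    nu = cKDTree(Y).query(X, k=k)[0]
    nu = nu[:, -1] if nu.ndim > 1 else nu
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

# toy check: KL between N(0,1) and N(1,1) is 0.5
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(5000, 1))
Y = rng.normal(1.0, 1.0, size=(5000, 1))
print(knn_kl_divergence(X, Y, k=5))
```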