
Showing papers on "k-nearest neighbors algorithm published in 2008"


Journal ArticleDOI
TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound recently obtained for hashing-based algorithms.
Abstract: In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
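The core problem statement above has a one-function brute-force baseline: scan every dataset object and return the closest one. The sketch below, in Python/NumPy, is purely illustrative (the surveyed algorithms exist precisely to avoid this linear scan), and the flat vector representation of the objects is an assumption.

```python
import numpy as np

def nearest_neighbor(query, dataset):
    """Exact 1-NN by linear scan: O(n*d) per query.

    dataset is an (n, d) array of feature vectors, query a (d,) vector.
    Approximate schemes such as locality-sensitive hashing aim to avoid
    touching every row while returning a c-approximate neighbor.
    """
    dists = np.linalg.norm(dataset - query, axis=1)  # Euclidean distances
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Toy usage with random 64-dimensional "image descriptors".
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))
q = rng.normal(size=64)
print(nearest_neighbor(q, data))
```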

1,759 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: It is argued that two practices commonly used in image classification methods have led to the inferior performance of NN-based image classifiers: quantization of local image descriptors (used to generate "bags-of-words", codebooks) and computation of 'image-to-image' distance instead of 'image-to-class' distance.
Abstract: State-of-the-art image classification methods require an intensive learning/training stage (using SVM, Boosting, etc.). In contrast, non-parametric nearest-neighbor (NN) based image classifiers require no training time and have other favorable properties. However, the large performance gap between these two families of approaches rendered NN-based image classifiers useless. We claim that the effectiveness of non-parametric NN-based image classification has been considerably undervalued. We argue that two practices commonly used in image classification methods have led to the inferior performance of NN-based image classifiers: (i) Quantization of local image descriptors (used to generate "bags-of-words", codebooks). (ii) Computation of 'image-to-image' distance, instead of 'image-to-class' distance. We propose a trivial NN-based classifier - NBNN (Naive-Bayes nearest-neighbor) - which employs NN-distances in the space of the local image descriptors (and not in the space of images). NBNN computes direct 'image-to-class' distances without descriptor quantization. We further show that under the Naive-Bayes assumption, the theoretically optimal image classifier can be accurately approximated by NBNN. Although NBNN is extremely simple, efficient, and requires no learning/training phase, its performance ranks among the top leading learning-based image classifiers. Empirical comparisons are shown on several challenging databases (Caltech-101, Caltech-256 and Graz-01).
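A minimal sketch of the NBNN decision rule described above, assuming brute-force nearest-neighbor search and in-memory descriptor pools: for every local descriptor of the query image, find its nearest descriptor in each class's pooled training descriptors, sum the squared distances per class, and return the class with the smallest total. No codebook or quantization step is involved.

```python
import numpy as np

def nbnn_classify(query_descriptors, class_descriptor_pools):
    """Naive-Bayes Nearest Neighbor: direct image-to-class distances.

    query_descriptors: (m, d) local descriptors of the query image.
    class_descriptor_pools: dict {label: (n_c, d) descriptors pooled from
    all training images of that class}.
    """
    totals = {}
    for label, pool in class_descriptor_pools.items():
        # squared distance from each query descriptor to its NN in this class
        d2 = ((query_descriptors[:, None, :] - pool[None, :, :]) ** 2).sum(-1)
        totals[label] = d2.min(axis=1).sum()
    return min(totals, key=totals.get)  # class with the smallest image-to-class distance
```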

1,228 citations


01 Jan 2008
TL;DR: In this paper, the problem of classifying an unseen pattern on the basis of its nearest neighbors in a recorded data set is addressed from the point of view of Dempster-Shafer theory to provide a global treatment of such issues as ambiguity and distance rejection, and imperfect knowledge regarding the class membership of training patterns.
Abstract: In this paper, the problem of classifying an unseen pattern on the basis of its nearest neighbors in a recorded data set is addressed from the point of view of Dempster-Shafer theory. Each neighbor of a sample to be classified is considered as an item of evidence that supports certain hypotheses regarding the class membership of that pattern. The degree of support is defined as a function of the distance between the two vectors. The evidence of the k nearest neighbors is then pooled by means of Dempster's rule of combination. This approach provides a global treatment of such issues as ambiguity and distance rejection, and imperfect knowledge regarding the class membership of training patterns. The effectiveness of this classification scheme as compared to the voting and distance-weighted k-NN procedures is demonstrated using several sets of simulated and real-world data.
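A compact sketch of the evidence-theoretic scheme described above: each of the k nearest neighbors contributes a basic belief assignment whose mass on its own class decays with distance, and the assignments are pooled with Dempster's rule. The exponential support function and its parameters alpha and gamma are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule for mass functions given as {frozenset_of_classes: mass}."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

def evidential_knn(x, X_train, y_train, k=5, alpha=0.95, gamma=1.0):
    classes = frozenset(y_train)
    dists = np.linalg.norm(X_train - x, axis=1)
    m = {classes: 1.0}                                  # start from total ignorance
    for i in np.argsort(dists)[:k]:
        s = alpha * np.exp(-gamma * dists[i] ** 2)      # support decays with distance
        m = dempster_combine(m, {frozenset([y_train[i]]): s, classes: 1.0 - s})
    singletons = {next(iter(a)): v for a, v in m.items() if len(a) == 1}
    return max(singletons, key=singletons.get)          # class with the largest mass
```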

806 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: This paper proposes a discriminant learning framework for problems in which data consist of linear subspaces instead of vectors, treats each subspace as a point in the Grassmann space, and performs feature extraction and classification in the same space.
Abstract: In this paper we propose a discriminant learning framework for problems in which data consist of linear subspaces instead of vectors. By treating subspaces as basic elements, we can make learning algorithms adapt naturally to the problems with linear invariant structures. We propose a unifying view on the subspace-based learning method by formulating the problems on the Grassmann manifold, which is the set of fixed-dimensional linear subspaces of a Euclidean space. Previous methods on the problem typically adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean distances are used. In our approach, we treat each subspace as a point in the Grassmann space, and perform feature extraction and classification in the same space. We show the feasibility of the approach by using Grassmann kernel functions such as the projection kernel and the Binet-Cauchy kernel. Experiments with real image databases show that the proposed method performs well compared with state-of-the-art algorithms.
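The projection kernel mentioned above has a particularly simple closed form: for two subspaces represented by orthonormal basis matrices Y1 and Y2, it is the squared Frobenius norm of Y1ᵀY2, which is invariant to the choice of basis. A short sketch; obtaining the subspace bases by a thin SVD of each image set is an assumption about the preprocessing, and the resulting kernel matrix could be fed to any kernel classifier.

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """Grassmann projection kernel between two m-dimensional subspaces of R^D.

    Y1, Y2: (D, m) matrices with orthonormal columns spanning each subspace.
    """
    return float(np.linalg.norm(Y1.T @ Y2, "fro") ** 2)

def subspace_basis(X, m):
    """Orthonormal basis of the m-dimensional principal subspace of a sample
    set X with rows in R^D (e.g., an image set), via thin SVD."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:m].T    # (D, m)
```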

635 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: A CUDA implementation of the "brute force" kNN search is presented, showing a speed increase on synthetic and real data of up to one or two orders of magnitude depending on the data, with a quasi-linear behavior with respect to the data size in a given, practical range.
Abstract: Statistical measures coming from information theory represent interesting bases for image and video processing tasks such as image retrieval and video object tracking. For example, let us mention the entropy and the Kullback-Leibler divergence. Accurate estimation of these measures requires adapting to the local sample density, especially if the data are high-dimensional. The k nearest neighbor (kNN) framework has been used to define efficient variable-bandwidth kernel-based estimators with such a locally adaptive property. Unfortunately, these estimators are computationally intensive since they rely on searching neighbors among large sets of d-dimensional vectors. This computational burden can be reduced by pre-structuring the data, e.g. using binary trees as proposed by the approximated nearest neighbor (ANN) library. Yet, the recent opening of graphics processing units (GPU) to general-purpose computation by means of the NVIDIA CUDA API offers the image and video processing community a powerful platform with parallel calculation capabilities. In this paper, we propose a CUDA implementation of the "brute force" kNN search and we compare its performance to several CPU-based implementations including an equivalent brute force algorithm and ANN. We show a speed increase on synthetic and real data by up to one or two orders of magnitude depending on the data, with a quasi-linear behavior with respect to the data size in a given, practical range.
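On the CPU side, the brute-force kNN search that the paper ports to CUDA is just a dense distance computation followed by a partial sort per query. A NumPy sketch of that reference computation (array shapes are assumptions; the GPU kernel itself is not reproduced here):

```python
import numpy as np

def brute_force_knn(queries, refs, k):
    """All-pairs Euclidean distances, then the k smallest per query.

    queries: (q, d) array, refs: (n, d) array.
    Returns (q, k) arrays of indices and distances; this is the O(q*n*d)
    computation that the paper parallelizes on the GPU.
    """
    # squared distances via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    d2 = (queries ** 2).sum(1)[:, None] - 2.0 * queries @ refs.T + (refs ** 2).sum(1)[None, :]
    np.maximum(d2, 0.0, out=d2)                       # guard against tiny negatives
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]   # k smallest, unordered
    part = np.take_along_axis(d2, idx, axis=1)
    order = np.argsort(part, axis=1)                  # sort only the k candidates
    idx = np.take_along_axis(idx, order, axis=1)
    return idx, np.sqrt(np.take_along_axis(part, order, axis=1))
```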

509 citations


Journal ArticleDOI
TL;DR: The experimental results show that the neighborhood-based feature selection algorithm is able to delete most of the redundant and irrelevant features, and that the classification accuracy of the neighborhood classifier is superior to that of K-NN and CART in both the original feature spaces and the reduced feature subspaces, and a little weaker than that of SVM.
Abstract: The K nearest neighbor classifier (K-NN) is widely discussed and applied in pattern recognition and machine learning; however, little has been reported on the neighborhood classifier, a similar lazy classifier that uses local information to recognize a new test sample. In this paper, we introduce the neighborhood rough set model as a uniform framework to understand and implement neighborhood classifiers. This algorithm integrates an attribute reduction technique with classification learning. We study the influence of the three norms on attribute reduction and classification, and compare the neighborhood classifier with KNN, CART and SVM. The experimental results show that the neighborhood-based feature selection algorithm is able to delete most of the redundant and irrelevant features. The classification accuracy of the neighborhood classifier is superior to that of K-NN and CART in both the original feature spaces and the reduced feature subspaces, and a little weaker than that of SVM.
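A minimal sketch of the neighborhood classifier idea discussed above: rather than a fixed k, take every training sample that falls inside a delta-neighborhood of the query under a chosen norm and vote. The fallback to the single nearest neighbor when the neighborhood is empty is an assumption of this sketch.

```python
import numpy as np
from collections import Counter

def neighborhood_classify(x, X_train, y_train, delta=0.5, p=2):
    """Majority vote over training samples within distance delta of x.

    p selects the norm (1, 2 or np.inf), mirroring the study of different
    norms; falls back to 1-NN when the delta-neighborhood is empty.
    """
    dists = np.linalg.norm(X_train - x, ord=p, axis=1)
    inside = np.where(dists <= delta)[0]
    if len(inside) == 0:
        inside = [int(np.argmin(dists))]
    votes = Counter(y_train[i] for i in inside)
    return votes.most_common(1)[0][0]
```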

406 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: A highly efficient solver for the particular instance of semidefinite programming that arises in LMNN classification is described; this solver can handle problems with billions of large margin constraints in a few hours.
Abstract: In this paper we study how to improve nearest neighbor classification by learning a Mahalanobis distance metric. We build on a recently proposed framework for distance metric learning known as large margin nearest neighbor (LMNN) classification. Our paper makes three contributions. First, we describe a highly efficient solver for the particular instance of semidefinite programming that arises in LMNN classification; our solver can handle problems with billions of large margin constraints in a few hours. Second, we show how to reduce both training and testing times using metric ball trees; the speedups from ball trees are further magnified by learning low dimensional representations of the input space. Third, we show how to learn different Mahalanobis distance metrics in different parts of the input space. For large data sets, the use of locally adaptive distance metrics leads to even lower error rates.
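LMNN learns a Mahalanobis metric d(x, y)² = (x − y)ᵀM(x − y) with M = LᵀL, which is equivalent to running ordinary Euclidean kNN on the transformed points Lx. The sketch below shows only how a learned L would be used at prediction time; the semidefinite-programming solver that is the paper's contribution is not reproduced, and L here is a stand-in.

```python
import numpy as np
from collections import Counter

def mahalanobis_knn_predict(x, X_train, y_train, L, k=3):
    """kNN prediction under the metric induced by M = L.T @ L.

    Equivalent to Euclidean kNN in the space z = L @ x, so a low-rank L
    (fewer rows than columns) also acts as supervised dimensionality reduction.
    """
    Z_train = X_train @ L.T           # transform the training set once
    z = L @ x                         # transform the query
    dists = np.linalg.norm(Z_train - z, axis=1)
    nn = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nn).most_common(1)[0][0]

# L = np.eye(d) recovers plain Euclidean kNN; locally adaptive variants would
# select a different L depending on the region of the input space.
```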

295 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider two models, Poisson and Binomial, for the training samples, and show that the risk of misclassification is asymptotically equivalent to first order.
Abstract: The $k$th-nearest neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$; and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$.
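The paper motivates principled rules for choosing k; a common empirical baseline (not the authors' asymptotic analysis) is simply to pick k by cross-validation. A hedged sketch using scikit-learn with synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

scores = {}
for k in range(1, 31, 2):            # odd k avoids voting ties in the two-class case
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```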

276 citations


Journal ArticleDOI
TL;DR: A new variant of the k-nearest neighbor (kNN) classifier based on the maximal margin principle is presented, characterized by resulting global decision boundaries of the piecewise linear type.
Abstract: In this paper, we present a new variant of the k-nearest neighbor (kNN) classifier based on the maximal margin principle. The proposed method relies on classifying a given unlabeled sample by first finding its k-nearest training samples. A local partition of the input feature space is then carried out by means of local support vector machine (SVM) decision boundaries determined after training a multiclass SVM classifier on the considered k training samples. The labeling of the unknown sample is done by looking at the local decision region to which it belongs. The method is characterized by resulting global decision boundaries of the piecewise linear type. However, the entire process can be kernelized through the determination of the k-nearest training samples in the transformed feature space by using a distance function simply reformulated on the basis of the adopted kernel. To illustrate the performance of the proposed method, an experimental analysis on three different remote sensing datasets is reported and discussed.
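A sketch of the decision procedure described above: find the k nearest training samples to the query, train a local multiclass SVM on just those samples, and label the query with that local model. scikit-learn's SVC is used here as a stand-in solver, and the remote sensing features and parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def knn_svm_predict(x, X_train, y_train, k=20, kernel="linear", C=1.0):
    """kNN-SVM: a local SVM is trained per query on its k nearest neighbors."""
    nn = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    X_loc, y_loc = X_train[nn], y_train[nn]
    if len(np.unique(y_loc)) == 1:        # all neighbors agree: no local SVM needed
        return y_loc[0]
    svm = SVC(kernel=kernel, C=C).fit(X_loc, y_loc)   # local decision boundaries
    return svm.predict(x.reshape(1, -1))[0]
```

With a linear kernel the resulting global decision boundary is piecewise linear, as noted in the abstract; a nonlinear kernel corresponds to the kernelized variant.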

241 citations


Journal ArticleDOI
TL;DR: A greedy attribute reduction algorithm is constructed based on Pawlak's rough set model, where the objects with numerical attributes are granulated with δ neighborhood relations or k-nearest-neighbor relations, while objects with categorical features are granulated with equivalence relations.
Abstract: Feature subset selection presents a common challenge for the applications where data with tens or hundreds of features are available. Existing feature selection algorithms are mainly designed for dealing with numerical or categorical attributes. However, data usually comes with a mixed format in real-world applications. In this paper, we generalize Pawlak's rough set model into a δ neighborhood rough set model and a k-nearest-neighbor rough set model, where the objects with numerical attributes are granulated with δ neighborhood relations or k-nearest-neighbor relations, while objects with categorical features are granulated with equivalence relations. Then the induced information granules are used to approximate the decision with lower and upper approximations. We compute the lower approximations of the decision to measure the significance of attributes. Based on the proposed models, we give the definition of the significance of mixed features and construct a greedy attribute reduction algorithm. We compare the proposed algorithm with others in terms of the number of selected features and classification performance. Experiments show the proposed technique is effective.
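A simplified sketch of greedy attribute reduction with a δ-neighborhood rough set, under stated assumptions: all attributes here are numerical, the neighborhood relation uses the Euclidean distance on the candidate attribute subset, and the dependency measure is the fraction of samples whose neighborhood is pure in the decision class (i.e., the size of the lower approximation). The mixed categorical/numerical case of the paper is not handled.

```python
import numpy as np

def dependency(X, y, features, delta=0.2):
    """Fraction of samples whose delta-neighborhood (on the chosen features)
    is consistent with a single decision class: the lower approximation."""
    Xs = X[:, features]
    consistent = 0
    for i in range(len(Xs)):
        nbr = np.linalg.norm(Xs - Xs[i], axis=1) <= delta
        consistent += int(np.all(y[nbr] == y[i]))
    return consistent / len(Xs)

def greedy_reduct(X, y, delta=0.2, eps=1e-3):
    """Forward greedy selection: add the attribute with the largest dependency gain."""
    selected, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {f: dependency(X, y, selected + [f], delta) for f in remaining}
        f_star = max(gains, key=gains.get)
        if gains[f_star] - best <= eps:   # no significant gain: stop
            break
        selected.append(f_star)
        remaining.remove(f_star)
        best = gains[f_star]
    return selected
```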

214 citations


15 Sep 2008
TL;DR: In this article, the authors investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas, e.g., e-mail filtering and drug discovery.
Abstract: Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.

Posted Content
TL;DR: In this article, the authors considered a more realistic network model where a finite number of nodes are uniformly randomly distributed in a general d-dimensional ball of radius R and characterised the distribution of Euclidean distances in the system.
Abstract: In wireless networks, the knowledge of nodal distances is essential for several areas such as system configuration, performance analysis and protocol design. In order to evaluate distance distributions in random networks, the underlying nodal arrangement is almost universally taken to be an infinite Poisson point process. While this assumption is valid in some cases, there are also certain impracticalities to this model. For example, practical networks are non-stationary, and the number of nodes in disjoint areas are not independent. This paper considers a more realistic network model where a finite number of nodes are uniformly randomly distributed in a general d-dimensional ball of radius R and characterizes the distribution of Euclidean distances in the system. The key result is that the probability density function of the distance from the center of the network to its nth nearest neighbor follows a generalized beta distribution. This finding is applied to study network characteristics such as energy consumption, interference, outage and connectivity.
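One way to sanity-check the stated result numerically: for N nodes placed uniformly at random in a d-dimensional ball of radius R, the normalized quantity (r_n/R)^d of the distance r_n from the center to the nth nearest node should follow a Beta(n, N − n + 1) law, a special case consistent with the generalized beta distribution claimed above. The simulation below is an illustration only; the specific parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

def sample_uniform_ball(n_points, d, R, rng):
    """Uniform points in a d-ball: uniform direction, radius R * U**(1/d)."""
    v = rng.normal(size=(n_points, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return v * (R * rng.random(n_points) ** (1.0 / d))[:, None]

d, R, N, n, trials = 3, 1.0, 50, 5, 20000
rng = np.random.default_rng(1)
r_n = np.array([np.sort(np.linalg.norm(sample_uniform_ball(N, d, R, rng), axis=1))[n - 1]
                for _ in range(trials)])

u = (r_n / R) ** d                      # should be Beta(n, N - n + 1) distributed
print(stats.kstest(u, "beta", args=(n, N - n + 1)))
```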

Proceedings ArticleDOI
20 Jul 2008
TL;DR: This paper proposes a K-Nearest Neighbor (KNN) method for query-dependent ranking, and proves a theory which indicates that the approximations are accurate in terms of difference in loss of prediction, if the learning algorithm used is stable with respect to minor changes in training examples.
Abstract: Many ranking models have been proposed in information retrieval, and recently machine learning techniques have also been applied to ranking model construction. Most of the existing methods do not take into consideration the fact that significant differences exist between queries, and only resort to a single function in ranking of documents. In this paper, we argue that it is necessary to employ different ranking models for different queries and conduct what we call query-dependent ranking. As the first such attempt, we propose a K-Nearest Neighbor (KNN) method for query-dependent ranking. We first consider an online method which creates a ranking model for a given query by using the labeled neighbors of the query in the query feature space and then ranks the documents with respect to the query using the created model. Next, we give two offline approximations of the method, which create the ranking models in advance to enhance the efficiency of ranking. We also prove a theory which indicates that the approximations are accurate in terms of difference in loss of prediction, if the learning algorithm used is stable with respect to minor changes in training examples. Our experimental results show that the proposed online and offline methods both outperform the baseline method of using a single ranking function.

Proceedings ArticleDOI
19 May 2008
TL;DR: The proposed algorithm is based on a radially bounded nearest neighbor strategy and requires only two parameters and yields deterministic, repeatable results and does not depend on any initialization procedure.
Abstract: In this paper we present a novel method for the efficient segmentation of 3D laser range data. The proposed algorithm is based on a radially bounded nearest neighbor strategy and requires only two parameters. It yields deterministic, repeatable results and does not depend on any initialization procedure. The efficiency of the method is verified with synthetic and real 3D data.
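A sketch of a radially bounded nearest neighbor segmentation in the spirit of the abstract: clusters are grown by repeatedly absorbing all points within radius r of already-assigned points, and clusters smaller than min_size are discarded, so r and min_size are the only parameters. The KD-tree backend and the flood-fill growing order are implementation assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def rbnn_segment(points, r=0.3, min_size=10):
    """Segment a 3D point cloud by flood-filling radius-r neighborhoods.

    points: (n, 3) array. Returns cluster labels (>= 0) or -2 for points in
    discarded small clusters. Deterministic: no random initialization involved.
    """
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)      # -1 = not yet visited
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        stack, members = [seed], []
        labels[seed] = current
        while stack:
            i = stack.pop()
            members.append(i)
            for j in tree.query_ball_point(points[i], r):   # radially bounded neighbors
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        if len(members) < min_size:
            labels[np.array(members)] = -2            # too small: mark as discarded
        else:
            current += 1
    return labels
```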

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This work proposes a general framework for stable feature selection which emphasizes both good generalization and stability of feature selection results, and identifies dense feature groups based on kernel density estimation and treats features in each dense group as a coherent entity for feature selection.
Abstract: Many feature selection algorithms have been proposed in the past focusing on improving classification accuracy. In this work, we point out the importance of stable feature selection for knowledge discovery from high-dimensional data, and identify two causes of instability of feature selection algorithms: selection of a minimum subset without redundant features and small sample size. We propose a general framework for stable feature selection which emphasizes both good generalization and stability of feature selection results. The framework identifies dense feature groups based on kernel density estimation and treats features in each dense group as a coherent entity for feature selection. An efficient algorithm DRAGS (Dense Relevant Attribute Group Selector) is developed under this framework. We also introduce a general measure for assessing the stability of feature selection algorithms. Our empirical study based on microarray data verifies that dense feature groups remain stable under random sample hold out, and the DRAGS algorithm is effective in identifying a set of feature groups which exhibit both high classification accuracy and stability.

Journal ArticleDOI
Nojun Kwak1
TL;DR: The experimental results show that the proposed method performs well for face recognition problems, compared with conventional methods such as the principal component analysis (PCA), Fisher's linear discriminant (FLD), etc.

Journal ArticleDOI
01 Aug 2008
TL;DR: This paper adopts a general uncertainty model allowing for data and query uncertainty, and defines new query semantics, and provides several efficient evaluation algorithms to address the cost factors involved in query evaluation.
Abstract: Uncertainty pervades many domains in our lives. Current real-life applications, e.g., location tracking using GPS devices or cell phones, multimedia feature extraction, and sensor data management, deal with different kinds of uncertainty. Finding the nearest neighbor objects to a given query point is an important query type in these applications. In this paper, we study the problem of finding objects with the highest marginal probability of being the nearest neighbors to a query object. We adopt a general uncertainty model allowing for data and query uncertainty. Under this model, we define new query semantics, and provide several efficient evaluation algorithms. We analyze the cost factors involved in query evaluation, and present novel techniques to address the trade-offs among these factors. We give multiple extensions to our techniques including handling dependencies among data objects, and answering threshold queries. We conduct an extensive experimental study to evaluate our techniques on both real and synthetic data.

Book ChapterDOI
01 Jan 2008
TL;DR: The objective of this study was to evaluate SVMs, as a modern computational intelligence method, for their effectiveness and prospects in object-based image analysis; the SVM methodology seems very promising for Object-Based Image Analysis.
Abstract: The Support Vector Machine is a theoretically superior machine learning methodology with great results in pattern recognition, especially for supervised classification of high-dimensional datasets, and has been found competitive with the best machine learning algorithms. In the past, SVMs were tested and evaluated only as pixel-based image classifiers. During recent years, advances in Remote Sensing occurred in the field of Object-Based Image Analysis (OBIA), combining low-level and high-level computer vision techniques. Moving from pixel-based techniques towards object-based representation, the dimensionality of the remote sensing imagery feature space increases significantly. This results in increased complexity of the classification process and causes problems for traditional classification schemes. The objective of this study was to evaluate SVMs for their effectiveness and prospects for object-based image analysis as a modern computational intelligence method. Here, an SVM approach for multi-class classification was followed, based on primitive image objects provided by a multi-resolution segmentation algorithm. Then, a feature selection step took place in order to provide the features for classification, which involved spectral, texture and shape information. After the feature selection step, a module that integrated an SVM classifier and the segmentation algorithm was developed in C++. For training the SVM, sample image objects derived from the segmentation procedure were used. The proposed classification procedure was then applied, resulting in the final object classification. The classification results were compared to the Nearest Neighbor object-based classifier results and were found satisfactory. The SVM methodology seems very promising for Object-Based Image Analysis, and future work will focus on integrating SVM classifiers with rule-based classifiers.

Journal ArticleDOI
TL;DR: A general procedure to find Euclidean metrics in a low-dimensional space whose main characteristic is to minimize the variance of a given class label of all those pairs of points whose distance is less than a predefined value is proposed.
Abstract: Nearest neighbor (NN) techniques are commonly used in remote sensing, pattern recognition, and statistics to classify objects into a predefined number of categories based on a given set of predictors. These techniques are particularly useful in those cases exhibiting a highly nonlinear relationship between variables. In most studies, the distance measure is adopted a priori. In contrast, we propose a general procedure to find Euclidean metrics in a low-dimensional space (i.e., one in which the number of dimensions is less than the number of predictor variables) whose main characteristic is to minimize the variance of a given class label of all those pairs of points whose distance is less than a predefined value. k-NN is used in each embedded space to determine the possibility that a query belongs to a given class label. The class estimation is carried out by an ensemble of predictions. To illustrate the application of this technique, a typical land cover classification using a Landsat-5 Thematic Mapper scene is presented. Experimental results indicate substantial improvement with regard to the classification accuracy as compared with approaches such as maximum likelihood, linear discriminant analysis, standard k-NN, and adaptive quasi-conformal kernel k-NN.

Proceedings ArticleDOI
15 Dec 2008
TL;DR: The class information is incorporated into the framework of CCA, and a novel method of combined feature extraction for multimodal recognition, called discriminative canonical correlation analysis (DCCA), is proposed; the experiments show that DCCA outperforms some related methods of both unimodal recognition and multimodal recognition.
Abstract: Multimodal recognition is an emerging technique to overcome the non-robustness of unimodal recognition in real applications. Canonical correlation analysis (CCA) has been employed as a powerful tool for feature fusion in the realization of such multimodal systems. However, CCA is an unsupervised feature extraction method and does not utilize the class information of the samples, which constrains the recognition performance. In this paper, the class information is incorporated into the framework of CCA for combined feature extraction, and a novel method of combined feature extraction for multimodal recognition, called discriminative canonical correlation analysis (DCCA), is proposed. The experiments show that DCCA outperforms some related methods of both unimodal recognition and multimodal recognition.

Proceedings ArticleDOI
07 Jul 2008
TL;DR: This paper reformulates the multiple feature fusion as a general subspace learning problem to find a general linear subspace in which the cumulative pairwise canonical correlation between every pair of feature sets is maximized after the dimension normalization and subspace projection.
Abstract: Since the emergence of extensive multimedia data, feature fusion has been more and more important for image and video retrieval, indexing and annotation. Existing feature fusion techniques simply concatenate a pair of different features or use canonical correlation analysis based methods for joint dimensionality reduction in the feature space. However, how to fuse multiple features in a generalized way is still an open problem. In this paper, we reformulate the multiple feature fusion as a general subspace learning problem. The objective of the framework is to find a general linear subspace in which the cumulative pairwise canonical correlation between every pair of feature sets is maximized after the dimension normalization and subspace projection. The learned subspace couples dimensionality reduction and feature fusion together, which can be applied to both unsupervised and supervised learning cases. In the supervised case, the pairwise canonical correlations of feature sets within the same classes are also counted in the objective function for maximization. To better model the high-order feature structure and overcome the computational difficulty, the features extracted from the same pattern source are represented by a single 2D tensor. The tensor-based dimensionality reduction methods are used to further extract low-dimensional discriminative features from the fused feature ensemble. Extensive experiments on visual data classification demonstrate the effectiveness and robustness of the proposed methods.

Journal ArticleDOI
01 Apr 2008
TL;DR: There is significant redundancy between and within feature schemes commonly used in practice, suggesting that further feature analysis research is necessary in order to optimize feature selection and achieve better results for the instrument recognition problem.
Abstract: In tackling data mining and pattern recognition tasks, finding a compact but effective set of features has often been found to be a crucial step in the overall problem-solving process. In this paper, we present an empirical study on feature analysis for recognition of classical instruments, using machine learning techniques to select and evaluate features extracted from a number of different feature schemes. It is revealed that there is significant redundancy between and within feature schemes commonly used in practice. Our results suggest that further feature analysis research is necessary in order to optimize feature selection and achieve better results for the instrument recognition problem.

Journal ArticleDOI
TL;DR: Both the depth-first and best-first k-nearest neighbor algorithms are modified to use MaxNearestDist, which is shown to enhance both algorithms by overcoming their shortcomings.
Abstract: Similarity searching often reduces to finding the k nearest neighbors to a query object. Finding the k nearest neighbors is achieved by applying either a depth-first or a best-first algorithm to the search hierarchy containing the data. These algorithms are generally applicable to any index based on hierarchical clustering. The idea is that the data is partitioned into clusters that are aggregated to form other clusters, with the total aggregation being represented as a tree. These algorithms have traditionally used a lower bound corresponding to the minimum distance at which a nearest neighbor can be found (termed MINDIST) to prune the search process by avoiding the processing of some of the clusters, as well as individual objects when they can be shown to be farther from the query object q than all of the current k nearest neighbors of q. An alternative pruning technique that uses an upper bound corresponding to the maximum possible distance at which a nearest neighbor is guaranteed to be found (termed MaxNearestDist) is described. The MaxNearestDist upper bound is adapted to enable its use for finding the k nearest neighbors instead of just the nearest neighbor (that is, k = 1) as in its previous uses. Both the depth-first and best-first k-nearest neighbor algorithms are modified to use MaxNearestDist, which is shown to enhance both algorithms by overcoming their shortcomings. In particular, for the depth-first algorithm, the number of clusters in the search hierarchy that must be examined is not increased, thereby potentially lowering its execution time, while for the best-first algorithm, the number of clusters in the search hierarchy that must be retained in the priority queue used to control the ordering of processing of the clusters is also not increased, thereby potentially lowering its storage requirements.
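A simplified sketch of the two bounds on a flat collection of clusters (one level of the search hierarchy): MINDIST(q, cluster) = max(0, ||q − c|| − radius) lower-bounds the distance to every object in the cluster, while MaxNearestDist(q, cluster) = ||q − c|| + radius upper-bounds the distance to the nearest object in it. The best-first traversal below and the use of the kth smallest MaxNearestDist as an initial pruning distance are assumptions about how the idea could be coded, not the article's exact algorithms.

```python
import heapq
import numpy as np

def knn_with_bounds(q, clusters, k):
    """clusters: list of (center, radius, points) with points an (m, d) array."""
    mindist = lambda c, r: max(0.0, float(np.linalg.norm(q - c)) - r)
    maxnearest = lambda c, r: float(np.linalg.norm(q - c)) + r

    # Every non-empty cluster holds an object within its MaxNearestDist, so the
    # k-th smallest such bound is a valid initial pruning distance d_k.
    upper = sorted(maxnearest(c, r) for c, r, pts in clusters if len(pts))
    d_k = upper[k - 1] if len(upper) >= k else float("inf")

    heap = [(mindist(c, r), i) for i, (c, r, _) in enumerate(clusters)]
    heapq.heapify(heap)
    best = []                                     # max-heap of (-distance, id)
    while heap:
        lb, i = heapq.heappop(heap)
        if lb > d_k:                              # no remaining cluster can improve the kNN
            break
        for j, p in enumerate(clusters[i][2]):
            d = float(np.linalg.norm(q - p))
            if len(best) < k:
                heapq.heappush(best, (-d, (i, j)))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, (i, j)))
            if len(best) == k:
                d_k = min(d_k, -best[0][0])       # tighten the pruning distance
    return sorted((-negd, idx) for negd, idx in best)
```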

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed local patch-based occlusion detection technique works well and the S-LNMF method shows superior performance to other conventional approaches.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: A novel formulation is presented, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
Abstract: A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation, that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.

Journal ArticleDOI
TL;DR: This paper introduces some nearest neighbor based portfolio selectors that are log-optimal for the very general class of stationary and ergodic random processes and shows very good finite-horizon performance when applied to different markets with different dimensionality or scales.
Abstract: In recent years optimal portfolio selection strategies for sequential investment have been shown to exist. Although their asymptotical optimality is well established, finite sample properties do need the adjustment of parameters that depend on dimensionality and scale. In this paper we introduce some nearest neighbor based portfolio selectors that solve these problems, and we show that they are also log-optimal for the very general class of stationary and ergodic random processes. The newly proposed algorithm shows very good finite-horizon performance when applied to different markets with different dimensionality or scales without any change: we see it as a very robust strategy.

Journal ArticleDOI
01 Feb 2008
TL;DR: A new classification method is proposed, which performs a classification task based on the local probabilistic centers of each class, and shows that this method substantially improves the classification performance of the nearest neighbor algorithm.
Abstract: When classes are nonseparable or overlapping, training samples in a local neighborhood may come from different classes. In this situation, the samples with different class labels may be comparable in the neighborhood of the query. As a consequence, the conventional nearest neighbor classifier, such as the k-nearest neighbor scheme, may produce a wrong prediction. To address this issue, in this paper, we propose a new classification method, which performs a classification task based on the local probabilistic centers of each class. This method works by reducing the number of negative contributing points, which are the known samples falling on the wrong side of the ideal decision boundary, in a training set and by restricting their influence regions. In classification, this method classifies the query sample by using two measures, of which one is the distance between the query and the local categorical probability centers, and the other is the computed posterior probability of the query. Although both measures are effective, the experiments show that the second one achieves the smaller classification error. Meanwhile, the theoretical analyses of the suggested methods are investigated, and some experiments are conducted on the basis of both constructed and real datasets. The investigation results show that this method substantially improves the classification performance of the nearest neighbor algorithm.

Journal ArticleDOI
TL;DR: Investigation of the imbalanced data problem in the classification of different types of weld flaws, a multi-class classification problem, indicates that the most difficult weld flaw type to recognize is crack.
Abstract: This paper presents research results of our investigation of the imbalanced data problem in the classification of different types of weld flaws, a multi-class classification problem. The one-against-all scheme is adopted to carry out multi-class classification and three algorithms including minimum distance, nearest neighbors, and fuzzy nearest neighbors are employed as the classifiers. The effectiveness of 22 data preprocessing methods for dealing with imbalanced data is evaluated in terms of eight evaluation criteria to determine whether any method would emerge to dominate the others. The test results indicate that: (1) nearest neighbor classifiers outperform the minimum distance classifier; (2) some data preprocessing methods do not improve any criterion and they vary from one classifier to another; (3) the combination of using the AHC_KM data preprocessing method with the 1-NN classifier is the best because they together produce the best performance in six of eight evaluation criteria; and (4) the most difficult weld flaw type to recognize is crack.

Journal ArticleDOI
TL;DR: KDF-WKNN is much better than the original KNN and the distance-weighted KNN methods, and is comparable to or better than several state-of-the-art methods in terms of classification accuracy.
Abstract: The nearest neighbor (NN) rule is one of the simplest and most important methods in pattern recognition. In this paper, we propose a kernel difference-weighted k-nearest neighbor (KDF-WKNN) method for pattern classification. The proposed method defines the weighted KNN rule as a constrained optimization problem, and we then propose an efficient solution to compute the weights of the different nearest neighbors. Unlike traditional distance-weighted KNN, which assigns different weights to the nearest neighbors according to their distance to the unclassified sample, difference-weighted KNN weighs the nearest neighbors by using the correlation of the differences between the unclassified sample and its nearest neighbors. To take the effective nonlinear structure information into account, we further extend difference-weighted KNN to its kernel version, KDF-WKNN. Our experimental results indicate that KDF-WKNN is much better than the original KNN and the distance-weighted KNN methods, and is comparable to or better than several state-of-the-art methods in terms of classification accuracy.
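A sketch of the difference-weighted idea as read from the abstract, under explicit assumptions: the neighbor weights are taken to solve the constrained least-squares problem min_w ||x − Σᵢ wᵢxᵢ||² subject to Σᵢ wᵢ = 1 (the classic locally-linear reconstruction weights), classification is a weight-sum vote, and the regularization constant is arbitrary. The kernelized variant is not shown, and this may differ in detail from the authors' formulation.

```python
import numpy as np
from collections import Counter

def difference_weighted_knn(x, X_train, y_train, k=7, reg=1e-3):
    """Weighted kNN where the weights reconstruct the query from its neighbors."""
    nn = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    D = X_train[nn] - x                        # differences neighbor - query
    G = D @ D.T                                # Gram matrix of the differences
    G = G + reg * np.trace(G) / k * np.eye(k)  # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w = w / w.sum()                            # enforce the sum-to-one constraint
    scores = Counter()
    for wi, yi in zip(w, y_train[nn]):
        scores[yi] += wi                       # class score = sum of its neighbors' weights
    return scores.most_common(1)[0][0]
```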