
Showing papers on "k-nearest neighbors algorithm published in 2009"


Journal ArticleDOI
TL;DR: This paper shows how to learn a Mahalanobis distance metric for kNN classification from labeled examples and finds that metrics trained in this way lead to significant improvements in kNN classification; local metrics can further be learned and combined in a globally integrated manner.
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner.
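Since the Mahalanobis metric is equivalent to Euclidean distance after a global linear map L, the resulting classifier is easy to sketch. A minimal Python illustration, with L assumed to be given (in the paper it is obtained from a semidefinite program, so any concrete L here is purely illustrative):

```python
import numpy as np

def mahalanobis_knn_predict(X_train, y_train, X_test, L, k=3):
    """kNN under d(x, z) = ||L(x - z)||_2, i.e., plain Euclidean kNN
    after transforming all inputs by the global linear map L."""
    Xt, Xq = X_train @ L.T, X_test @ L.T         # map both sets into L-space
    preds = []
    for q in Xq:
        d = np.linalg.norm(Xt - q, axis=1)       # Euclidean distances in L-space
        nn = np.argsort(d)[:k]                   # indices of the k nearest neighbors
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # majority vote
    return np.array(preds)
```

Setting L to the identity recovers ordinary Euclidean kNN; the paper's contribution is learning an L that pulls each point's target neighbors together while pushing differently labeled points a margin away.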

4,157 citations


Journal ArticleDOI

1,155 citations


Proceedings ArticleDOI
28 Jun 2009
TL;DR: A new time series primitive, the time series shapelet, is introduced; algorithms based on it can be interpretable, more accurate, and significantly faster than state-of-the-art classifiers.
Abstract: Classification of time series has been attracting great interest over the past decade. Recent empirical evidence has strongly suggested that the simple nearest neighbor algorithm is very difficult to beat for most time series problems. While this may be considered good news, given the simplicity of implementing the nearest neighbor algorithm, there are some negative consequences of this. First, the nearest neighbor algorithm requires storing and searching the entire dataset, resulting in a time and space complexity that limits its applicability, especially on resource-limited sensors. Second, beyond mere classification accuracy, we often wish to gain some insight into the data. In this work we introduce a new time series primitive, time series shapelets, which addresses these limitations. Informally, shapelets are time series subsequences which are in some sense maximally representative of a class. As we shall show with extensive empirical evaluations in diverse domains, algorithms based on the time series shapelet primitives can be interpretable, more accurate and significantly faster than state-of-the-art classifiers.
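The core computation is the distance from a series to a candidate shapelet: the minimum Euclidean distance over all sliding windows of the shapelet's length. A hedged sketch of brute-force shapelet search over binary-labeled series, with a simple mean-gap separation score standing in for the paper's information-gain criterion (and none of its pruning speedups):

```python
import numpy as np

def subsequence_dist(series, shapelet):
    """Distance from a series to a shapelet: the minimum Euclidean
    distance over all sliding windows of the shapelet's length."""
    m = len(shapelet)
    return min(np.linalg.norm(series[i:i + m] - shapelet)
               for i in range(len(series) - m + 1))

def best_shapelet(series_list, labels, length):
    """Brute-force search over all subsequences of a given length.
    labels is a binary (0/1) array; the gap between the two classes'
    mean shapelet distances is an illustrative stand-in for the
    paper's information-gain quality measure."""
    labels = np.asarray(labels)
    best, best_score = None, -np.inf
    for s in series_list:
        for i in range(len(s) - length + 1):
            cand = s[i:i + length]
            d = np.array([subsequence_dist(t, cand) for t in series_list])
            gap = abs(d[labels == 0].mean() - d[labels == 1].mean())
            if gap > best_score:
                best, best_score = cand, gap
    return best
```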

930 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: This work proposes TagProp, a discriminatively trained nearest neighbor model that allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set, and introduces a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words.
Abstract: Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a weighted nearest-neighbor model to exploit labeled training images. Neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors, or global color histograms. We also introduce a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words. We investigate the performance of different variants of our model and compare to existing work. We present experimental results for three challenging data sets. On all three, TagProp makes a marked improvement as compared to the current state-of-the-art.
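The prediction step of a rank-weighted nearest-neighbor tag model is compact. A sketch with illustrative geometric rank weights; TagProp itself learns the weights (and the combined metric) by maximizing the training log-likelihood, and the word-specific sigmoidal modulation is omitted here:

```python
import numpy as np

def tagprop_predict(dist_to_train, train_tags, K=20, decay=0.9):
    """Rank-based weighted nearest-neighbor tag prediction.
    dist_to_train: (n_test, n_train) distances under some image metric.
    train_tags:    (n_train, n_tags) binary tag indicator matrix.
    The geometric rank weights are illustrative only."""
    ranks = np.argsort(dist_to_train, axis=1)[:, :K]  # K nearest per test image
    w = decay ** np.arange(K)
    w /= w.sum()                                      # normalized rank weights
    # predicted tag scores: weighted vote over the neighbors' tags
    return np.einsum('k,ikt->it', w, train_tags[ranks])  # (n_test, n_tags)
```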

739 citations


Journal ArticleDOI
TL;DR: This paper proposes a method called Mlnb which adapts the traditional naive Bayes classifiers to deal with multi-label instances and achieves comparable performance to other well-established multi-label learning algorithms.

433 citations


Journal ArticleDOI
TL;DR: It is shown that the speed of convergence of the k-NN method can be further improved by an adaptive choice of k, and the new universal estimator of divergence is proved to be asymptotically unbiased and mean-square consistent.
Abstract: A new universal estimator of divergence is presented for multidimensional continuous densities based on k-nearest-neighbor (k-NN) distances. Assuming independent and identically distributed (i.i.d.) samples, the new estimator is proved to be asymptotically unbiased and mean-square consistent. In experiments with high-dimensional data, the k-NN approach generally exhibits faster convergence than previous algorithms. It is also shown that the speed of convergence of the k-NN method can be further improved by an adaptive choice of k.
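A sketch of a k-NN divergence estimator of this flavor, in the commonly cited form D_hat(P||Q) = (d/n) * sum_i log(nu_k(i) / rho_k(i)) + log(m / (n - 1)), where rho_k is the k-th nearest-neighbor distance within the sample from P and nu_k is the k-th nearest-neighbor distance across to the sample from Q; the paper's exact construction and its adaptive choice of k are not reproduced:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_divergence(X, Y, k=1):
    """k-NN estimate of D(P || Q) from X ~ P (n x d) and Y ~ Q (m x d)."""
    n, d = X.shape
    m = len(Y)
    rho = cKDTree(X).query(X, k + 1)[0][:, -1]  # k-th NN in X, skipping the self-match
    nu = cKDTree(Y).query(X, k)[0]
    if k > 1:
        nu = nu[:, -1]                          # query returns a 1-D array when k == 1
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```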

372 citations


Journal ArticleDOI
TL;DR: There is a single approach to nearest neighbor searching, which both improves upon existing results and spans the spectrum of space-time tradeoffs, and new algorithms for constructing AVDs and tools for analyzing their total space requirements are provided.
Abstract: Nearest neighbor searching is the problem of preprocessing a set of n points in d-dimensional space so that, given any query point q, it is possible to report the closest point to q rapidly. In approximate nearest neighbor searching, a parameter e > 0 is given, and a multiplicative error of (1 + e) is allowed. We assume that the dimension d is a constant and treat n and e as asymptotic quantities. Numerous solutions have been proposed, ranging from low-space solutions having space O(n) and query time O(log n + 1/e^(d−1)) to high-space solutions having space roughly O((n log n)/e^d) and query time O(log(n/e)). We show that there is a single approach to this fundamental problem, which both improves upon existing results and spans the spectrum of space-time tradeoffs. Given a tradeoff parameter γ, where 2 ≤ γ ≤ 1/e, we show that there exists a data structure of space O(nγ^(d−1) log(1/e)) that can answer queries in time O(log(nγ) + 1/(eγ)^((d−1)/2)). When γ = 2, this yields a data structure of space O(n log(1/e)) that can answer queries in time O(log n + 1/e^((d−1)/2)). When γ = 1/e, it provides a data structure of space O((n/e^(d−1)) log(1/e)) that can answer queries in time O(log(n/e)). Our results are based on a data structure called a (t,e)-AVD, which is a hierarchical quadtree-based subdivision of space into cells. Each cell stores up to t representative points of the set, such that for any query point q in the cell at least one of these points is an approximate nearest neighbor of q. We provide new algorithms for constructing AVDs and tools for analyzing their total space requirements. We also establish lower bounds on the space complexity of AVDs, and show that, up to a factor of O(log(1/e)), our space bounds are asymptotically tight in the two extremes, γ = 2 and γ = 1/e.

266 citations


Journal Article
TL;DR: Two divide and conquer methods are proposed for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d), with an additional refinement step performed after each conquer step to improve the accuracy of the graph.
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1,2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs.
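For scale, the Θ(dn^2) brute-force construction that the divide and conquer methods improve on fits in a few lines; everything the paper adds (Lanczos-based spectral bisection, the overlap parameter α, refinement, and hashing) is about avoiding this full distance matrix:

```python
import numpy as np

def knn_graph_brute_force(X, k):
    """Exact kNN graph in Theta(d n^2) time: compute all pairwise
    squared Euclidean distances and keep each point's k nearest
    neighbors as its adjacency list."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distance matrix
    np.fill_diagonal(d2, np.inf)                  # exclude self-edges
    return np.argsort(d2, axis=1)[:, :k]          # (n, k) neighbor indices
```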

253 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This work proposes a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality and reduces its space and query cost dramatically, and outperforms adhoc-LSH even though the latter has no quality guarantee.
Abstract: Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sub-linearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements, except locality sensitive hashing (LSH). The existing LSH implementations are either rigorous or adhoc. Rigorous-LSH ensures good quality of query results, but requires expensive space and query cost. Although adhoc-LSH is more efficient, it abandons quality control, i.e., the neighbor it outputs can be arbitrarily bad. As a result, currently no method is able to ensure both quality and efficiency simultaneously in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that ensures the same result quality as rigorous-LSH, but reduces its space and query cost dramatically. The LSB-forest also outperforms adhoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself also serves as an effective index that consumes linear space, and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, and at the same time, returns neighbors with much better quality.
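The LSB-tree builds on the standard locality sensitive hash family for Euclidean space, h(x) = floor((a·x + b)/w) with Gaussian a and uniform b; several such hashes are composed into a single Z-order key that an ordinary B-tree can index. A sketch of the hash family alone (the Z-order encoding and the tree machinery are omitted, and the parameter values are illustrative):

```python
import numpy as np

def e2lsh_keys(X, n_hashes=8, w=4.0, seed=0):
    """Locality sensitive hashes for Euclidean space:
    h(x) = floor((a . x + b) / w), a ~ N(0, I), b ~ U[0, w).
    Nearby points agree on many coordinates with high probability."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_hashes, X.shape[1]))
    b = rng.uniform(0, w, size=n_hashes)
    return np.floor((X @ a.T + b) / w).astype(int)  # (n, n_hashes) integer keys
```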

236 citations


Journal ArticleDOI
TL;DR: In this paper, a two-stage feature selection and weighting technique (TFSWT) via Euclidean distance evaluation technique (EDET) is presented and adopted to select sensitive features and remove fault-unrelated features.

224 citations


Journal ArticleDOI
TL;DR: An algorithm for land mine detection using sensor data generated by a ground-penetrating radar (GPR) system that uses edge histogram descriptors for feature extraction and a possibilistic K-nearest neighbors (K-NNs) rule for confidence assignment is described, demonstrating the best performance among several high-performance algorithms.
Abstract: This paper describes an algorithm for land mine detection using sensor data generated by a ground-penetrating radar (GPR) system that uses edge histogram descriptors for feature extraction and a possibilistic K-nearest neighbors (K-NNs) rule for confidence assignment. The algorithm demonstrated the best performance among several high-performance algorithms in extensive testing on a large real-world dataset associated with the difficult problem of land mine detection. The superior performance of the algorithm is attributed to the use of the possibilistic K-NN algorithm, thereby providing important evidence supporting the use of possibilistic methods in real-world applications. The GPR produces a 3-D array of intensity values, representing a volume below the surface of the ground. First, a computationally inexpensive prescreening algorithm for anomaly detection is used to focus attention and identify candidate signatures that resemble mines. The identified regions of interest are processed further by a feature extraction algorithm to capture their salient features. We use translation-invariant features that are based on the local edge distribution of the 3-D GPR signatures. Specifically, each 3-D signature is divided into subsignatures, and the local edge distribution for each subsignature is represented by a histogram. Next, the training signatures are clustered to identify prototypes. The main idea is to identify a few prototypes that can capture the variations of the signatures within each class. These variations could be due to different mine types, different soil conditions, different weather conditions, etc. Fuzzy memberships are assigned to these representatives to capture their degree of sharing among the mine and false alarm classes. Finally, a possibilistic K-NN-based rule is used to assign a confidence value to distinguish true detections from false alarms. The proposed algorithm is implemented and integrated within a complete land mine prototype system. It is trained, field-tested, evaluated, and compared using a large-scale cross-validation experiment that uses a diverse dataset acquired from four outdoor test sites at different geographic locations. This collection covers over 41 807 m² of ground and includes 1593 mine encounters.
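A sketch of distance-weighted confidence assignment in the spirit of fuzzy/possibilistic K-NN, using Keller-style inverse-distance weights over prototype memberships; the paper's possibilistic rule differs in detail (notably, it does not force confidences to sum to one across classes):

```python
import numpy as np

def weighted_knn_confidence(d, u, K=5, m=2.0):
    """Distance-weighted class confidences from the K nearest labeled
    prototypes, in the spirit of fuzzy/possibilistic K-NN.
    d: (n,) distances from a test signature to the prototypes.
    u: (n, n_classes) fuzzy memberships of the prototypes.
    Returns one confidence per class (e.g., mine vs. false alarm)."""
    nn = np.argsort(d)[:K]                                   # K nearest prototypes
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))  # inverse-distance weights
    return (w[:, None] * u[nn]).sum(axis=0) / w.sum()
```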

Journal ArticleDOI
TL;DR: An accurate multi-class taxonomic classifier, TACOA, was developed for environmental genomic fragments; it can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp.
Abstract: Background: Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomics data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.

Journal ArticleDOI
TL;DR: The proposed feature selection method, KFFS, produced very promising results compared to F-score feature selection, removing irrelevant or redundant features from the high dimensional input feature space.
Abstract: In this paper, we have proposed a new feature selection method called kernel F-score feature selection (KFFS), used as a pre-processing step in the classification of medical datasets. KFFS consists of two phases. In the first phase, the input spaces (features) of medical datasets are transformed to kernel space by means of Linear (Lin) or Radial Basis Function (RBF) kernel functions, lifting the medical datasets to a high dimensional feature space. In the second phase, the F-score values of the features in this high dimensional space are calculated using the F-score formula, and the mean of these F-scores is computed. If the F-score of a feature is bigger than this mean value, that feature is selected; otherwise, it is removed from the feature space. Thanks to the KFFS method, irrelevant or redundant features are removed from the high dimensional input feature space. The purpose of using kernel functions is to transform a non-linearly separable medical dataset into a linearly separable feature space. In this study, we have used the heart disease dataset, the SPECT (Single Photon Emission Computed Tomography) images dataset, and the Escherichia coli Promoter Gene Sequence dataset taken from the UCI (University of California, Irvine) machine learning database to test the performance of the KFFS method. As classification algorithms, Least Square Support Vector Machine (LS-SVM) and Levenberg-Marquardt Artificial Neural Network have been used. As shown in the obtained results, the proposed feature selection method, KFFS, produced very promising results compared to F-score feature selection.
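The second phase of KFFS reduces to computing a per-feature F-score and keeping the features that score above the mean. A sketch for a binary problem (labels in {0, 1}), applied to a feature matrix assumed to be already kernel-transformed by the first phase:

```python
import numpy as np

def f_scores(X, y):
    """F-score of each feature: between-class separation of the class
    means over the pooled within-class variance (binary y in {0, 1})."""
    Xp, Xn = X[y == 1], X[y == 0]
    num = (Xp.mean(0) - X.mean(0)) ** 2 + (Xn.mean(0) - X.mean(0)) ** 2
    den = Xp.var(0, ddof=1) + Xn.var(0, ddof=1)
    return num / np.maximum(den, 1e-12)

def kffs_select(X_kernel, y):
    """Keep the features whose F-score exceeds the mean F-score."""
    f = f_scores(X_kernel, y)
    return np.where(f > f.mean())[0]   # indices of selected features
```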

Journal ArticleDOI
TL;DR: For a range of phenylene- and thiophene-based conjugated polymers of practical relevance for optoelectronic applications, exciton couplings in one-dimensional stacks deviate significantly from the nearest neighbor approximation, so long-range interactions with non-nearest neighbors have to be included.
Abstract: We demonstrate that for a range of phenylene- and thiophene-based conjugated polymers of practical relevance for optoelectronic applications, exciton couplings in one-dimensional stacks deviate significantly from the nearest neighbor approximation. Instead, long-range interactions with non-nearest neighbors have to be included, which become increasingly important with growing oligomer size. While the exciton coupling vanishes for infinitely long ideal polymer chains and provides a sensitive measure of the actual conjugation length, the electronic coupling mediating charge transport shows rapid convergence with molecular size. Similar results have been obtained for very different molecular backbones, thus highlighting the general character of these findings.

Journal ArticleDOI
01 Aug 2009
TL;DR: This paper studies a related problem called MaxBRNN: find an optimal region that maximizes the size of BRNNs and comes up with an efficient algorithm called MaxOverlap, which is many times faster than the best-known technique.
Abstract: Bichromatic reverse nearest neighbor (BRNN) has been extensively studied in the spatial database literature. In this paper, we study a related problem called MaxBRNN: find an optimal region that maximizes the size of BRNNs. Such a problem has many real life applications, including the problem of finding a new server point that attracts as many customers as possible by proximity. A straightforward approach is to determine the BRNNs for all possible points, which is not feasible since there are a large (or infinite) number of possible points. To the best of our knowledge, the fastest known method has exponential time complexity in the data size. Based on some interesting properties of the problem, we come up with an efficient algorithm called MaxOverlap. Extensive experiments are conducted to show that our algorithm is many times faster than the best-known technique.

Journal ArticleDOI
TL;DR: A metaheuristic algorithm is proposed for classifying the cells of pap-smear samples; its classification accuracy generally outperforms that of other previously applied intelligent approaches.

Journal ArticleDOI
TL;DR: Experiments show that the proposed approach effectively reduces the number of prototypes while maintaining the same level of classification accuracy as the traditional KNN, and is a simple and a fast condensing algorithm.
Abstract: The K-nearest neighbor (KNN) rule is one of the most widely used pattern classification algorithms. For large data sets, the computational demands for classifying patterns using KNN can be prohibitive. A way to alleviate this problem is through the condensing approach. This means we remove patterns that are more of a computational burden but do not contribute to better classification accuracy. In this brief, we propose a new condensing algorithm. The proposed idea is based on defining the so-called chain. This is a sequence of nearest neighbors from alternating classes. We make the point that patterns further down the chain are close to the classification boundary and based on that we set a cutoff for the patterns we keep in the training set. Experiments show that the proposed approach effectively reduces the number of prototypes while maintaining the same level of classification accuracy as the traditional KNN. Moreover, it is a simple and a fast condensing algorithm.
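A sketch of the chain construction at the heart of the method: starting from a pattern, repeatedly jump to the nearest neighbor in the opposite class, so the chain alternates classes and drifts toward the decision boundary. The paper's cutoff rule for deciding which chain members to keep is not reproduced here:

```python
import numpy as np

def nn_chain(X, y, start=0):
    """Chain of nearest neighbors from alternating classes: from the
    current pattern, jump to its nearest opposite-class neighbor, and
    stop when the chain revisits a pattern. Patterns further down the
    chain lie closer to the classification boundary."""
    chain, i = [start], start
    while True:
        other = np.where(y != y[i])[0]   # indices of the opposite class
        j = other[np.argmin(np.linalg.norm(X[other] - X[i], axis=1))]
        if j in chain:
            return chain
        chain.append(int(j))
        i = j
```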

Book ChapterDOI
01 Jan 2009
TL;DR: The k-nearest neighbor (k-NN) method uses the well-known principle of Cicero pares cum paribus facillime congregantur (birds of a feather flock together or literally equals with equals easily associate) to classify an unknown sample based on the known classification of its neighbors.
Abstract: The k-nearest neighbor (k-NN) method is one of the data mining techniques considered to be among the top 10 techniques for data mining [237]. The k-NN method uses the well-known principle of Cicero pares cum paribus facillime congregantur (birds of a feather flock together or literally equals with equals easily associate). It tries to classify an unknown sample based on the known classification of its neighbors. Let us suppose that a set of samples with known classification is available, the so-called training set. Intuitively, each sample should be classified similarly to its surrounding samples. Therefore, if the classification of a sample is unknown, then it could be predicted by considering the classification of its nearest neighbor samples. Given an unknown sample and a training set, all the distances between the unknown sample and all the samples in the training set can be computed. The distance with the smallest value corresponds to the sample in the training set closest to the unknown sample. Therefore, the unknown sample may be classified based on the classification of this nearest neighbor.
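The rule described here, in its simplest 1-NN form, takes only a few lines:

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, x):
    """1-NN rule: the unknown sample inherits the label of the closest
    sample in the training set."""
    d = np.linalg.norm(X_train - x, axis=1)  # distances to all training samples
    return y_train[np.argmin(d)]             # label of the nearest neighbor
```

The k-NN generalization replaces np.argmin with the k smallest distances and a majority vote over their labels.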

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A text-based image feature is introduced and it is demonstrated that it consistently improves performance on hard object classification problems, and is particularly effective when the training dataset is small.
Abstract: We introduce a text-based image feature and demonstrate that it consistently improves performance on hard object classification problems. The feature is built using an auxiliary dataset of images annotated with tags, downloaded from the Internet. We do not inspect or correct the tags and expect that they are noisy. We obtain the text feature of an unannotated image from the tags of its k-nearest neighbors in this auxiliary collection. A visual classifier presented with an object viewed under novel circumstances (say, a new viewing direction) must rely on its visual examples. Our text feature may not change, because the auxiliary dataset likely contains a similar picture. While the tags associated with images are noisy, they are more stable when appearance changes. We test the performance of this feature using PASCAL VOC 2006 and 2007 datasets. Our feature performs well, consistently improves the performance of visual object classifiers, and is particularly effective when the training dataset is small.
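A sketch of the feature construction itself: pool the (noisy) tags of an image's k nearest neighbors in the auxiliary collection into a normalized tag histogram. The visual distances and the tag vocabulary are assumed to be given:

```python
import numpy as np

def text_feature(dist_to_aux, aux_tags, k=10):
    """Text feature of each unannotated image: the normalized sum of
    the tag vectors of its k nearest neighbors in the auxiliary
    (Internet-tagged) collection.
    dist_to_aux: (n_images, n_aux) visual distances.
    aux_tags:    (n_aux, n_tags) binary tag matrix."""
    nn = np.argsort(dist_to_aux, axis=1)[:, :k]
    feat = aux_tags[nn].sum(axis=1).astype(float)   # tag counts over neighbors
    feat /= np.maximum(feat.sum(axis=1, keepdims=True), 1)
    return feat                                     # (n_images, n_tags)
```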

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This work proposes a novel framework involving the feature-tree to index large scale motion features using the Sphere/Rectangle-tree (SR-tree) and provides an effective way for practical incremental action recognition.
Abstract: Action recognition methods suffer from many drawbacks in practice, which include (1) the inability to cope with incremental recognition problems; (2) the requirement of an intensive training stage to obtain good performance; (3) the inability to recognize simultaneous multiple actions; and (4) difficulty in performing recognition frame by frame. In order to overcome all these drawbacks using a single method, we propose a novel framework involving the feature-tree to index large scale motion features using the Sphere/Rectangle-tree (SR-tree). The recognition consists of the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. The proposed method can provide the localization of the action. Since our method does not require feature quantization, the feature-tree can be efficiently grown by adding features from new training examples of actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large scale datasets due to the fact that the SR-tree is a disk-based data structure. We have tested our approach on two publicly available datasets, the KTH and the IXMAS multi-view datasets, and obtained promising results.

Journal ArticleDOI
TL;DR: This paper presents a fast minimum spanning tree-inspired clustering algorithm, which, by using an efficient implementation of the cut and the cycle property of the minimum spanning trees, can have much better performance than O(N^2).
Abstract: Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have been widely used in practice. However, in such clustering algorithms, the search for the nearest neighbor in the construction of minimum spanning trees is the main source of computation and the standard solutions take O(N^2) time. In this paper, we present a fast minimum spanning tree-inspired clustering algorithm which, by using an efficient implementation of the cut and the cycle property of the minimum spanning trees, can have much better performance than O(N^2).
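For contrast, the standard O(N^2) MST clustering pipeline that the paper accelerates can be written directly with SciPy: build the MST of the complete distance graph, cut the longest edges, and read off the connected components:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, n_clusters):
    """Classic MST clustering: cutting the n_clusters - 1 longest MST
    edges leaves n_clusters connected components."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    keep = np.argsort(mst.data)[::-1][n_clusters - 1:]   # drop the longest edges
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    return connected_components(pruned, directed=False)[1]  # cluster labels
```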

Posted Content
TL;DR: In this article, a non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data is proposed.
Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.
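A sketch in the spirit of the score function: take each sample's mean distance to its k nearest nominal neighbors as a statistic, and convert it to an empirical p-value against the nominal sample; a test point is flagged as anomalous when its score falls below α. The paper's actual score function and its optimality analysis are more refined:

```python
import numpy as np
from scipy.spatial import cKDTree

def anomaly_scores(nominal, test, k=5):
    """Empirical p-value score: the fraction of nominal points whose
    own k-NN statistic is at least as large as the test sample's."""
    tree = cKDTree(nominal)
    g_nom = tree.query(nominal, k + 1)[0][:, 1:].mean(axis=1)  # skip self-match
    dt = tree.query(test, k)[0]
    g_test = dt.mean(axis=1) if k > 1 else dt
    return (g_nom[None, :] >= g_test[:, None]).mean(axis=1)
```

Declaring `anomaly_scores(nominal, test) < alpha` as anomalous then targets a false alarm rate of roughly α on nominal data.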

Proceedings ArticleDOI
14 Jun 2009
TL;DR: This paper studies a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set, and shows that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-Occurrences).
Abstract: High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
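The k-occurrence count N_k(x) studied here is direct to compute, and comparing its skew for low- versus high-dimensional data makes the hubs visible:

```python
import numpy as np

def k_occurrences(X, k=5):
    """N_k(x): how many times each point appears among the k nearest
    neighbors of the other points."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distance matrix
    np.fill_diagonal(d2, np.inf)                  # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbors of each point
    return np.bincount(nn.ravel(), minlength=len(X))
```

Running this on standard normal data with d = 3 and then d = 100 (same n and k) shows the high-dimensional counts becoming markedly more skewed, with a few hub points appearing in a disproportionate number of neighbor lists.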

Proceedings Article
11 Jul 2009
TL;DR: ECTS (Early Classification on Time Series) is an effective 1-nearest neighbor classification method that makes early predictions and at the same time retains accuracy comparable to that of a 1NN classifier using the full-length time series.
Abstract: In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health-informatics. We introduce a novel concept of MPL (Minimum Prediction Length) and develop ECTS (Early Classification on Time Series), an effective 1-nearest neighbor classification method. ECTS makes early predictions and at the same time retains the accuracy comparable to that of a 1NN classifier using the full-length time series. Our empirical study using benchmark time series data sets shows that ECTS works well on the real data sets where 1NN classification is effective.

Journal ArticleDOI
TL;DR: In this paper, a computationally efficient mode space simulation method for atomistic simulation of a graphene nanoribbon field-effect transistor in the ballistic limits is developed, which is based on the atomistic Hamiltonian in a decoupled mode space.
Abstract: A computationally efficient mode space simulation method for atomistic simulation of a graphene nanoribbon field-effect transistor in the ballistic limit is developed. The proposed simulation scheme, which solves the nonequilibrium Green's function coupled with a three dimensional Poisson equation, is based on the atomistic Hamiltonian in a decoupled mode space. The mode space approach, which only treats a few modes (subbands), significantly reduces the simulation time. Additionally, the edge bond relaxation and the third nearest neighbor effects are also included in the quantum transport solver. Simulation examples show that the mode space approach can significantly decrease the simulation cost by about an order of magnitude, yet the results are still accurate. This article also demonstrates that the effects of the edge bond relaxation and the third nearest neighbor significantly influence the transistor's performance and need to be included in the modeling.

Journal ArticleDOI
TL;DR: All of the models developed in this work are fast and precise enough to be applicable for virtual screening of CYP1A2 inhibitors or noninhibitors or can be used as simple filters in the drug discovery process.
Abstract: The cytochrome P450 (P450) superfamily plays an important role in the metabolism of drug compounds, and it is therefore highly desirable to have models that can predict whether a compound interacts with a specific isoform of the P450s. In this work, we provide in silico models for classification of CYP1A2 inhibitors and noninhibitors. Training and test sets consisted of approximately 400 and 7000 compounds, respectively. Various machine learning techniques, such as binary quantitative structure activity relationship, support vector machine (SVM), random forest, k-nearest neighbor (kNN), and decision tree methods were used to develop in silico models, based on Volsurf and Molecular Operating Environment descriptors. The best models were obtained using the SVM, random forest, and kNN methods in combination with the BestFirst variable selection method, resulting in models with 73 to 76% accuracy on the test set prediction (Matthews correlation coefficients of 0.51 and 0.52). Finally, a decision tree model based on Lipinski's Rule-of-Five descriptors was also developed. This model predicts 67% of the compounds correctly and gives a simple and interesting insight into the issue of classification. All of the models developed in this work are fast and precise enough to be applicable for virtual screening of CYP1A2 inhibitors or noninhibitors or can be used as simple filters in the drug discovery process.


Journal ArticleDOI
TL;DR: This paper proposes a learning algorithm that attempts to maximize the leave-one-out (LV1) classification rate of the NN rule by adjusting the weights of the training instances, and shows that this scheme has comparable or better performance than some recent methods proposed in the literature for the task of learning the distance function and/or prototype reduction.


Journal ArticleDOI
TL;DR: A new pseudo nearest neighbor classification rule that utilizes the distance weighted local learning in each class to get a new nearest neighbor of the unlabeled pattern, the pseudo nearest neighbor (PNN), and then assigns the label associated with the PNN to the unlabeled pattern using the NNR.
Abstract: In this paper, we propose a new pseudo nearest neighbor classification rule (PNNR). Unlike the previous nearest neighbor rule (NNR), this new rule utilizes the distance weighted local learning in each class to get a new nearest neighbor of the unlabeled pattern, the pseudo nearest neighbor (PNN), and then assigns the label associated with the PNN to the unlabeled pattern using the NNR. The proposed PNNR is compared with the k-NNR, the distance weighted k-NNR, and the local mean-based nonparametric classification [Mitani, Y., & Hamamoto, Y. (2006). A local mean-based nonparametric classifier. Pattern Recognition Letters, 27, 1151-1159] in terms of classification accuracy on unknown patterns. Experimental results confirm the validity of this new classification rule even in practical situations.
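A sketch of the pseudo nearest neighbor rule as described: within each class, sort the distances to the query, weight the k smallest by 1, 1/2, ..., 1/k to form that class's pseudo nearest neighbor distance, and assign the class that minimizes it. The 1/i weights follow the common statement of the rule; the paper's exact weighting may differ:

```python
import numpy as np

def pnn_classify(X_train, y_train, x, k=3):
    """Pseudo nearest neighbor rule: each class's k smallest distances
    to the query are combined with weights 1/1, 1/2, ..., 1/k into a
    pseudo distance, and the class with the smallest pseudo distance
    wins."""
    w = 1.0 / np.arange(1, k + 1)
    best_cls, best_d = None, np.inf
    for c in np.unique(y_train):
        d = np.sort(np.linalg.norm(X_train[y_train == c] - x, axis=1))[:k]
        pseudo = (w[:len(d)] * d).sum()   # handles classes with fewer than k samples
        if pseudo < best_d:
            best_cls, best_d = c, pseudo
    return best_cls
```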