
Showing papers on "k-nearest neighbors algorithm published in 2020"


Posted Content
TL;DR: Approximate nearest neighbor Negative Contrastive Estimation (ANCE) is presented, a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances.
Abstract: Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up.

566 citations
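As an illustration of the negative-construction step described above, here is a minimal sketch, assuming Faiss as the ANN library and already-computed document and query embeddings; the asynchronous, periodic index refresh that ANCE performs during training is not reproduced, and the function name is hypothetical.

```python
import numpy as np
import faiss  # assumed ANN library; any index with add/search would do

def refresh_hard_negatives(doc_embs, query_embs, gold_doc_ids, num_negs=5):
    """Rebuild an index over the current document embeddings and, for every
    training query, return top-ranked documents that are not the gold document.
    These act as the 'realistic' hard negatives for the next training phase."""
    index = faiss.IndexFlatIP(doc_embs.shape[1])  # exact inner product as a stand-in for ANN
    index.add(np.ascontiguousarray(doc_embs, dtype=np.float32))
    _, ranked = index.search(np.ascontiguousarray(query_embs, dtype=np.float32), num_negs + 1)
    return [[d for d in cand if d != gold_doc_ids[q]][:num_negs]
            for q, cand in enumerate(ranked)]
```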


Proceedings Article
30 Apr 2020
TL;DR: This article proposes $k$NN-LMs, which extend a pre-trained neural language model by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model.
Abstract: We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this transformation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 -- a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.

371 citations
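The interpolation rule at the heart of the $k$NN-LM can be sketched in a few lines; the snippet below assumes the $k$ nearest keys (and the next tokens stored with them) have already been retrieved from the datastore, and the interpolation weight lam is an illustrative hyperparameter rather than the paper's exact setting.

```python
import numpy as np

def knn_lm_interpolate(p_lm, knn_dists, knn_next_tokens, vocab_size, lam=0.25):
    """Interpolate the base LM distribution with a kNN distribution.
    p_lm: (vocab_size,) base LM next-token probabilities.
    knn_dists: (k,) distances from the query context to its k retrieved keys.
    knn_next_tokens: (k,) token id stored with each retrieved key.
    lam: weight on the kNN distribution."""
    weights = np.exp(-knn_dists)          # closer neighbors receive more weight
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, knn_next_tokens):
        p_knn[tok] += w                   # neighbors voting for the same token accumulate mass
    return lam * p_knn + (1.0 - lam) * p_lm
```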


Book ChapterDOI
01 Jan 2020
TL;DR: ANN-Benchmarks as discussed by the authors is a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms and provides a standard interface for measuring the performance and quality achieved by algorithms on different standard data sets.
Abstract: This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating k-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of k-NN algorithms. In the short term, this overview allows users to choose the correct k-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to k-NN search yield comparable quality-performance trade-offs. The system is available at http://sss.projects.itu.dk/ann-benchmarks/.

204 citations


Journal ArticleDOI
TL;DR: A simple but fast DPeak variant, namely FastDPeak, is proposed, which runs in about O(n log n) expected time when the intrinsic dimensionality is low and replaces density with kNN-density, computed by a fast kNN algorithm such as a cover tree, yielding a huge improvement for density computations.
Abstract: The Density Peak (DPeak) clustering algorithm is not applicable to large-scale data, because its two key quantities, i.e., ρ and δ, are both obtained by a brute-force algorithm with complexity O(n^2). Thus, a simple but fast DPeak, namely FastDPeak, is proposed, which runs in about O(n log n) expected time when the intrinsic dimensionality is low. It replaces density with kNN-density, which is computed by a fast kNN algorithm such as a cover tree, yielding a huge improvement for density computations. Based on kNN-density, local density peaks and non-local density peaks are identified, and a fast algorithm, which uses two different strategies to compute δ for them, is also proposed with complexity O(n). Experimental results show that FastDPeak is effective and outperforms other variants of DPeak.

144 citations
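A rough sketch of the two DPeak quantities computed via neighbors, using scikit-learn's NearestNeighbors as a stand-in for the cover tree; the kNN-density definition here (inverse mean neighbor distance) and the brute-force δ loop are illustrative assumptions, since the faster two-strategy δ computation is exactly what FastDPeak contributes and is omitted below.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors  # stand-in for the cover tree used in the paper

def knn_density_and_delta(X, k=20):
    """Compute an illustrative kNN-density for each point and, for each point,
    the distance (delta) to its nearest higher-density point."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)                          # column 0 is each point's zero self-distance
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-12)  # closer neighbors => higher kNN-density
    delta = np.full(X.shape[0], np.inf)
    order = np.argsort(-density)                         # highest to lowest density
    for rank, i in enumerate(order[1:], start=1):
        # Brute-force O(n^2) delta step kept here for clarity only.
        higher = order[:rank]
        delta[i] = np.min(np.linalg.norm(X[higher] - X[i], axis=1))
    delta[order[0]] = delta[order[1:]].max()             # convention for the global density peak
    return density, delta
```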


Posted Content
TL;DR: This work introduces $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search.
Abstract: We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. $k$NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results -- without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, $k$NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.

141 citations


Journal ArticleDOI
TL;DR: This work proposes a novel Semantic Guided Hashing method coupled with binary matrix factorization to perform more effective nearest neighbor image search by simultaneously exploring the weakly-supervised rich community-contributed information and the underlying data structures.
Abstract: Hashing has been widely investigated for large-scale image retrieval due to its search effectiveness and computation efficiency. In this work, we propose a novel Semantic Guided Hashing method coupled with binary matrix factorization to perform more effective nearest neighbor image search by simultaneously exploring the weakly-supervised rich community-contributed information and the underlying data structures. To uncover the underlying semantic information from the weakly-supervised user-provided tags, the binary matrix factorization model is leveraged for learning the binary features of images while the problem of imperfect tags is well addressed. The uncovered semantic information serves to guide the discrete hash code learning. The underlying data structures are discovered by adaptively learning a discriminative data graph, which makes the learned hash codes preserve meaningful neighbors. To the best of our knowledge, the proposed method is the first work that incorporates the hash code learning, the semantic information mining and the data structure discovering into one unified framework. Besides, the proposed method is extended to a deep approach for the optimal compatibility of discriminative feature learning and hash code learning. Experiments are conducted on two widely-used social image datasets and the proposed method achieves encouraging performance compared with the state-of-the-art hashing methods.

129 citations


Journal ArticleDOI
TL;DR: A consistent diagnosis model is proposed for lung cancer detection from chest CT images using the LeNet, AlexNet and VGG-16 deep learning models.

96 citations


Journal ArticleDOI
29 Jul 2020
TL;DR: This work shows that across thousands of transition metal oxide spectra, the relative importance of features describing the curvature of the spectrum can be localized to individual energy ranges, and it can separate the importance of constant, linear, quadratic, and cubic trends, as well as the white line energy.
Abstract: X-ray absorption spectroscopy (XAS) produces a wealth of information about the local structure of materials, but interpretation of spectra often relies on easily accessible trends and prior assumptions about the structure. Recently, researchers have demonstrated that machine learning models can automate this process to predict the coordinating environments of absorbing atoms from their XAS spectra. However, machine learning models are often difficult to interpret, making it challenging to determine when they are valid and whether they are consistent with physical theories. In this work, we present three main advances to the data-driven analysis of XAS spectra: we demonstrate the efficacy of random forests in solving two new property determination tasks (predicting Bader charge and mean nearest neighbor distance), we address how choices in data representation affect model interpretability and accuracy, and we show that multiscale featurization can elucidate the regions and trends in spectra that encode various local properties. The multiscale featurization transforms the spectrum into a vector of polynomial-fit features, and is contrasted with the commonly-used “pointwise” featurization that directly uses the entire spectrum as input. We find that across thousands of transition metal oxide spectra, the relative importance of features describing the curvature of the spectrum can be localized to individual energy ranges, and we can separate the importance of constant, linear, quadratic, and cubic trends, as well as the white line energy. This work has the potential to assist rigorous theoretical interpretations, expedite experimental data collection, and automate analysis of XAS spectra, thus accelerating the discovery of new functional materials.

84 citations
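A toy version of the multiscale (polynomial-fit) featurization described above: split the spectrum into contiguous energy windows and use the fitted constant, linear, quadratic and cubic coefficients as features. The window count and polynomial degree below are illustrative choices, not the paper's.

```python
import numpy as np

def polynomial_window_features(energy, intensity, n_windows=10, degree=3):
    """Fit a low-order polynomial in each energy window of an XAS spectrum and
    concatenate the coefficients into a feature vector. Assumes each window
    contains more points than the polynomial degree."""
    edges = np.linspace(energy.min(), energy.max(), n_windows + 1)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (energy >= lo) & (energy <= hi)
        # Fit on a window-local energy axis so coefficients are comparable across windows.
        coeffs = np.polyfit(energy[mask] - lo, intensity[mask], deg=degree)
        feats.extend(coeffs)
    return np.array(feats)
```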


Journal ArticleDOI
TL;DR: Two efficient cost-sensitive KNN classification models are designed, referred to as the Direct-CS-KNN classifier and the Distance-CS-KNN classifier, which are further improved with existing strategies such as smoothing, minimum-cost k-value selection, feature selection and ensemble selection.

82 citations
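A hedged sketch of a cost-sensitive KNN decision rule in the spirit of Direct-CS-KNN, under the assumption that neighbor class frequencies serve as class-probability estimates and the prediction minimizes expected misclassification cost; the function and parameter names are hypothetical, and the smoothing and k-selection refinements mentioned above are omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cost_sensitive_knn_predict(X_train, y_train, x_query, cost_matrix, k=5):
    """Predict the class that minimizes expected misclassification cost.
    cost_matrix[i, j] is the cost of predicting class j when the true class is i."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    neighbor_labels = y_train[idx[0]]
    n_classes = cost_matrix.shape[0]
    # Class-probability estimates from neighbor frequencies.
    p = np.array([(neighbor_labels == c).mean() for c in range(n_classes)])
    expected_cost = p @ cost_matrix      # expected cost of predicting each class
    return int(np.argmin(expected_cost))
```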


Journal ArticleDOI
TL;DR: In this paper, the authors propose a principled approach in which one-class SVMs (OC-SVMs) are rewritten as distance/pooling neural networks, and deep Taylor decomposition (DTD) is applied, a methodology that leverages the model structure in order to quickly and reliably explain decisions in terms of input features.

76 citations


Posted Content
TL;DR: The simple nearest-neighbor-based approach is experimentally shown to outperform self-supervised methods in accuracy, few-shot generalization, training time and noise robustness, while making fewer assumptions about image distributions.
Abstract: Nearest neighbors is a successful and long-standing technique for anomaly detection. Significant progress has recently been achieved by self-supervised deep methods (e.g. RotNet). However, self-supervised features typically under-perform Imagenet pre-trained features. In this work, we investigate whether the recent progress can indeed outperform nearest-neighbor methods operating on an Imagenet pretrained feature space. The simple nearest-neighbor-based approach is experimentally shown to outperform self-supervised methods in accuracy, few-shot generalization, training time and noise robustness, while making fewer assumptions about image distributions.
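The nearest-neighbor anomaly detector discussed above reduces to a few lines once features have been extracted from an ImageNet-pretrained network; the sketch below scores a test image by its mean distance to the k closest normal training features, where k and the use of the mean are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(train_feats, test_feats, k=2):
    """Score each test sample by the mean distance from its (pretrained) feature
    vector to its k nearest neighbors among normal training features;
    larger scores indicate more anomalous samples."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    dists, _ = nn.kneighbors(test_feats)
    return dists.mean(axis=1)
```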

Journal ArticleDOI
TL;DR: A deep metric learning-based feature embedding model that can handle both same-scene and cross-scene HSI classification tasks, with the nearest neighbor (NN) algorithm selected as the classifier.
Abstract: Learning from a limited number of labeled samples (pixels) remains a key challenge in the hyperspectral image (HSI) classification. To address this issue, we propose a deep metric learning-based feature embedding model, which can meet the tasks both for same- and cross-scene HSI classifications. In the first task, when only a few labeled samples are available, we employ ideas from metric learning based on deep embedding features and make a similarity learning between pairs of samples. In this case, the proposed model can learn well to compare whether two samples belong to the same class. In another task, when an HSI image (target scene) that needs to be classified is not labeled at all, the embedding model can learn from another similar HSI image (source scene) with sufficient labeled samples and then transfer to the target model by using an unsupervised domain adaptation technique, which not only employs the adversarial approach to make the embedding features from the source and target samples indistinguishable but also encourages the target scene’s embeddings to form similar clusters with the source scene one. After the domain adaptation between the HSIs of the two scenes is finished, any traditional HSI classifier can be used. In a simple manner, the nearest neighbor (NN) algorithm is selected as the classifier for the classification tasks throughout this article. The experimental results from a series of popular HSIs demonstrate the advantages of the proposed model both in the same- and cross-scene classification tasks.

Journal ArticleDOI
TL;DR: Deep features are used for the environmental sound classification (ESC) problem, extracted from a newly developed Convolutional Neural Network (CNN) model that is trained end-to-end on spectrogram images.
Abstract: Cognitive prediction in complicated and active environments plays an important role in artificial learning. The classification accuracy of sound events is strongly related to feature extraction. In this paper, deep features are used for the environmental sound classification (ESC) problem. The deep features are extracted by using the fully connected layers of a newly developed Convolutional Neural Network (CNN) model, which is trained end-to-end on spectrogram images. The feature vector is constructed by concatenating the outputs of the fully connected layers of the proposed CNN model. To test the performance of the proposed method, the feature set is fed as input to a random subspace K Nearest Neighbor (KNN) ensemble classifier. The experimental studies, carried out on the DCASE-2017 ASC and UrbanSound8K datasets, show that the proposed CNN model achieves classification accuracies of 96.23% and 86.70%, respectively.
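A minimal sketch of the classification stage, assuming the concatenated fully-connected-layer activations are already available as a feature matrix; scikit-learn's BaggingClassifier with feature subsampling is used here as a stand-in for a random subspace KNN ensemble, and all parameter values are illustrative.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

def build_random_subspace_knn(deep_features, labels, n_estimators=30, subspace_frac=0.5):
    """Train a random-subspace ensemble of KNN classifiers: each member sees a
    random fraction of the (concatenated CNN) feature dimensions, and their
    votes are combined."""
    clf = BaggingClassifier(
        KNeighborsClassifier(n_neighbors=5),
        n_estimators=n_estimators,
        max_features=subspace_frac,   # each estimator trains on a random feature subset
        bootstrap=False,              # keep all samples; only the feature space is subsampled
    )
    return clf.fit(deep_features, labels)
```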

Journal ArticleDOI
TL;DR: A new approach for hashing-based ANS search which can directly binarize a subspace without transforming it into a vector, and simultaneously leverages the learned binary codes for subspaces to train matrix classifiers as hash functions.

Journal ArticleDOI
20 Feb 2020
TL;DR: An approach is proposed to improve the pruning phase of the LC-KNN method by taking into account factors that help choose a more appropriate cluster of data in which to look for the neighbors, thus increasing the classification accuracy.
Abstract: The K-nearest neighbors (KNN) machine learning algorithm is a well-known non-parametric classification method. However, like other traditional data mining methods, applying it to big data comes with computational challenges. Indeed, KNN determines the class of a new sample based on the class of its nearest neighbors; however, identifying the neighbors in a large amount of data imposes a large computational cost, so that it is no longer feasible on a single computing machine. One of the proposed techniques to make classification methods applicable to large datasets is pruning. LC-KNN is an improved KNN method which first clusters the data into smaller partitions using the K-means clustering method, and then, for each new sample, applies KNN on the partition whose center is nearest. However, because the clusters have different shapes and densities, selecting the appropriate cluster is a challenge. In this paper, an approach is proposed to improve the pruning phase of the LC-KNN method by taking these factors into account. The proposed approach helps to choose a more appropriate cluster of data in which to look for the neighbors, thus increasing the classification accuracy. The performance of the proposed approach is evaluated on different real datasets. The experimental results show the effectiveness of the proposed approach and its higher classification accuracy and lower time cost in comparison to other recent relevant methods.
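For context, a compact sketch of the baseline LC-KNN pipeline that the paper improves upon: partition the training data with K-means, then answer each query with a KNN classifier fit only on the partition whose center is nearest. The refined cluster-selection criterion proposed in the paper (accounting for cluster shape and density) is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

class LCKNN:
    """Baseline LC-KNN: cluster the training data, then classify each query
    with a KNN model fit only on its nearest cluster."""

    def __init__(self, n_clusters=10, k=5):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10)
        self.k = k
        self.cluster_knn = {}

    def fit(self, X, y):
        assign = self.kmeans.fit_predict(X)
        for c in range(self.kmeans.n_clusters):
            mask = assign == c
            n_neighbors = min(self.k, int(mask.sum()))   # guard against tiny clusters
            self.cluster_knn[c] = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        clusters = self.kmeans.predict(X)                # pick the partition with the nearest center
        return np.array([self.cluster_knn[c].predict(x.reshape(1, -1))[0]
                         for x, c in zip(X, clusters)])
```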

Posted Content
TL;DR: This paper proposes to boost the discriminative ability by transferring a natural language inference (NLI) model, and achieves more stable and accurate in-domain and OOS detection accuracy than RoBERTa-based classifiers and embedding-based nearest neighbor approaches.
Abstract: Intent detection is one of the core components of goal-oriented dialog systems, and detecting out-of-scope (OOS) intents is also a practically important skill. Few-shot learning is attracting much attention to mitigate data scarcity, but OOS detection becomes even more challenging. In this paper, we present a simple yet effective approach, discriminative nearest neighbor classification with deep self-attention. Unlike softmax classifiers, we leverage BERT-style pairwise encoding to train a binary classifier that estimates the best matched training example for a user input. We propose to boost the discriminative ability by transferring a natural language inference (NLI) model. Our extensive experiments on a large-scale multi-domain intent detection task show that our method achieves more stable and accurate in-domain and OOS detection accuracy than RoBERTa-based classifiers and embedding-based nearest neighbor approaches. More notably, the NLI transfer enables our 10-shot model to perform competitively with 50-shot or even full-shot classifiers, while we can keep the inference time constant by leveraging a faster embedding retrieval model.

Proceedings ArticleDOI
Weijie Zhao, Shulong Tan, Ping Li
20 Apr 2020
TL;DR: This paper presents a novel framework that decouples the graph-based search algorithm into three stages in order to parallelize the performance-crucial distance computation, and proposes novel ANN-specific optimization methods that eliminate dynamic GPU memory allocations and trade computation for lower GPU memory consumption.
Abstract: Approximate nearest neighbor (ANN) searching is a fundamental problem in computer science with numerous applications in (e.g.,) machine learning and data mining. Recent studies show that graph-based ANN methods often outperform other types of ANN algorithms. For typical graph-based methods, the searching algorithm is executed iteratively and the execution dependency prohibits GPU adaptations. In this paper, we present a novel framework that decouples the graph-based search algorithm into three stages in order to parallelize the performance-crucial distance computation. Furthermore, to obtain better parallelism on GPU, we propose novel ANN-specific optimization methods that eliminate dynamic GPU memory allocations and trade computation for lower GPU memory consumption. The proposed system is empirically compared against HNSW, the state-of-the-art ANN method on CPU, and Faiss, the popular GPU-accelerated ANN platform, on 6 datasets. The results confirm the effectiveness: the proposed system, SONG, has around a 50-180x speedup compared with single-thread HNSW, while it substantially outperforms Faiss.


Journal ArticleDOI
TL;DR: Extensive experiments on eighteen real-world datasets show that the DC-LAKNN algorithm achieves better classification performance compared to standard kNN algorithm and nine other state-of-the-art kNN-based algorithms.
Abstract: The k-nearest neighbor (kNN) rule is a classical non-parametric classification algorithm in pattern recognition, and has been widely used in many fields due to its simplicity, effectiveness and intuitiveness. However, the classification performance of the kNN algorithm suffers from the choice of a fixed and single value of k for all queries in the search stage and the use of a simple majority voting rule in the decision stage. In this paper, we propose a new kNN-based algorithm, called the locally adaptive k-nearest neighbor algorithm based on discrimination class (DC-LAKNN). In our method, the role of the second majority class in classification is considered for the first time. Firstly, the discrimination classes at different values of k are selected from the majority class and the second majority class in the k-neighborhood of the query. Then, the adaptive k value and the final classification result are obtained according to the quantity and distribution information of the neighbors in the discrimination classes at each value of k. Extensive experiments on eighteen real-world datasets from the UCI (University of California, Irvine) Machine Learning Repository and the KEEL (Knowledge Extraction based on Evolutionary Learning) Repository show that the DC-LAKNN algorithm achieves better classification performance than the standard kNN algorithm and nine other state-of-the-art kNN-based algorithms.

Journal ArticleDOI
TL;DR: This letter characterizes the statistics of the contact distance and the nearest neighbor (NN) distance for binomial point processes (BPP) spatially-distributed on spherical surfaces.
Abstract: This letter characterizes the statistics of the contact distance and the nearest neighbor (NN) distance for binomial point processes (BPP) spatially-distributed on spherical surfaces. We consider a setup of $n$ concentric spheres, where each sphere $S_{k}$ has a radius $r_{k}$ and $N_{k}$ points that are uniformly distributed on its surface. For that setup, we obtain the cumulative distribution function (CDF) of the distance to the nearest point from two types of observation points: (i) the observation point is not part of the point process and is located on a concentric sphere with radius $r_{e}$, which corresponds to the contact distance distribution, and (ii) the observation point belongs to the point process, which corresponds to the nearest-neighbor (NN) distance distribution.
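As a generic reference point (not taken from the letter itself), for $N_{k}$ points placed independently and uniformly on the surface of $S_{k}$, the contact-distance CDF has the binomial form below, where $A_{k}(d)$ denotes the area of the portion of $S_{k}$ lying within distance $d$ of the observation point, a spherical cap whose size depends on $r_{k}$ and the observation radius $r_{e}$:

$$F_{D}(d) = \mathbb{P}(D \le d) = 1 - \left(1 - \frac{A_{k}(d)}{4\pi r_{k}^{2}}\right)^{N_{k}}.$$

The letter derives the corresponding expressions explicitly for both types of observation points.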

Journal ArticleDOI
Shan-Shan Li
TL;DR: An improved DBSCAN based on neighbor similarity is proposed, which utilizes a cover tree to retrieve neighbors for each point in parallel, and the triangle inequality to filter out many unnecessary distance computations.
Abstract: DBSCAN is the most famous density-based clustering algorithm, one of the main clustering paradigms. However, there are many redundant distance computations in the process of DBSCAN clustering, due to the brute-force range query used to retrieve neighbors for each point, which yields high complexity (O($n^{2}$)) and low efficiency. Thus, it is not applicable to large-scale data. In this paper, an improved DBSCAN based on neighbor similarity is proposed, which utilizes a cover tree to retrieve neighbors for each point in parallel, and the triangle inequality to filter out many unnecessary distance computations. Experiments conducted on large-scale data sets demonstrate that the proposed algorithm greatly speeds up the original DBSCAN and outperforms the main improved variants of DBSCAN. Compared with $\rho$-approximate DBSCAN, which is the current fastest but approximate version of DBSCAN, the proposed algorithm has two advantages: it is faster, and its result is exact rather than approximate.
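The triangle-inequality filter mentioned above can be illustrated with a single pivot point: if the distances of both the query and a candidate to the same pivot are known, their difference lower-bounds the query-candidate distance, so candidates whose bound already exceeds eps can be discarded without a distance computation. The sketch below shows only this filter; the paper's parallel cover-tree retrieval and neighbor-similarity reuse are not reproduced.

```python
import numpy as np

def range_query_with_pruning(X, p, candidates, eps, pivot):
    """Return the candidates within eps of X[p]. By the triangle inequality,
    |d(p, pivot) - d(q, pivot)| <= d(p, q), so a candidate whose bound exceeds
    eps is provably not a neighbor and its distance to p is never computed."""
    d_p_pivot = np.linalg.norm(X[p] - X[pivot])
    neighbors = []
    for q in candidates:
        d_q_pivot = np.linalg.norm(X[q] - X[pivot])  # in practice precomputed once per point
        if abs(d_p_pivot - d_q_pivot) > eps:         # provably farther than eps from p
            continue
        if np.linalg.norm(X[p] - X[q]) <= eps:
            neighbors.append(q)
    return neighbors
```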

Journal ArticleDOI
TL;DR: This study discusses the Euclidean distance formula used in KNN, compared with the normalized Euclidean, Manhattan and normalized Manhattan distances, to achieve optimal results in finding the distance to the nearest neighbor.
Abstract: K-Nearest Neighbor (KNN) is a method for classifying objects based on the learning data closest to the object, by comparing previous and current data. In the learning process, KNN calculates the distance to the nearest neighbor by applying the Euclidean distance formula, while other works have optimized the distance formula by comparing it with similar alternatives in order to obtain optimal results. This study discusses the calculation of the Euclidean distance formula in KNN compared with the normalized Euclidean, Manhattan and normalized Manhattan distances, to achieve optimal results in finding the distance to the nearest neighbor.
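The four distance variants compared in the study can be sketched as follows; here "normalized" is read as rescaling each feature by its observed range before measuring distance, which is an assumption about the paper's exact definition.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def normalized(dist_fn, a, b, feature_ranges):
    """Apply a distance after rescaling each feature by its range
    (one common reading of a 'normalized' distance)."""
    scale = np.where(feature_ranges == 0, 1.0, feature_ranges)
    return dist_fn(a / scale, b / scale)

# Example: compare the four variants for one pair of samples.
a, b = np.array([180.0, 75.0]), np.array([165.0, 60.0])
ranges = np.array([60.0, 80.0])   # e.g., max - min of each feature over the data
print(euclidean(a, b), manhattan(a, b),
      normalized(euclidean, a, b, ranges), normalized(manhattan, a, b, ranges))
```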

Journal ArticleDOI
Weiwei Jiang
01 Apr 2020
TL;DR: This study gives a comprehensive comparison between nearest neighbor and deep learning models and indicates that deep learning models are not significantly better than 1-NN classifiers with edit distance with real penalty and dynamic time warping.
Abstract: Time series classification has been an important and challenging research task. In different domains, time series show different patterns, which makes it difficult to design a global optimal solution and requires a comprehensive evaluation of different classifiers across multiple datasets. With the rise of big data and cloud computing, deep learning models, especially deep neural networks, arise as a new paradigm for many problems, including image classification, object detection and natural language processing. In recent years, deep learning models have also been applied to time series classification and show superiority over traditional models. However, previous evaluations are usually limited to a small number of datasets and lack significance analysis. In this study, we give a comprehensive comparison between nearest neighbor and deep learning models. Specifically, we compare 1-NN classifiers with eight different distance measures and three state-of-the-art deep learning models on 128 time series datasets. Our results indicate that deep learning models are not significantly better than 1-NN classifiers with edit distance with real penalty and dynamic time warping.
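A bare-bones version of the strongest traditional baseline in this comparison, a 1-NN classifier with dynamic time warping; no warping window, lower bounding or edit distance with real penalty is included, so this is an illustrative sketch rather than the tuned classifiers evaluated in the study.

```python
import numpy as np

def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance between two 1-D series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def one_nn_predict(train_series, train_labels, query):
    """Label the query with the class of its single nearest training series under DTW."""
    dists = [dtw(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]
```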

Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work shows that the method of combining structured decoding with nearest neighbor learning achieves state-of-the-art performance on standard few-shot NER evaluation tasks, improving F1 scores by $6\%$ to $16\%$ absolute points over prior meta-learning based systems.
Abstract: We present a simple few-shot named entity recognition (NER) system based on nearest neighbor learning and structured inference. Our system uses a supervised NER model trained on the source domain, as a feature extractor. Across several test domains, we show that a nearest neighbor classifier in this feature-space is far more effective than the standard meta-learning approaches. We further propose a cheap but effective method to capture the label dependencies between entity tags without expensive CRF training. We show that our method of combining structured decoding with nearest neighbor learning achieves state-of-the-art performance on standard few-shot NER evaluation tasks, improving F1 scores by $6\%$ to $16\%$ absolute points over prior meta-learning based systems.

Journal ArticleDOI
TL;DR: It is formally proven that normalization affects the nearest neighbor structure and the density of the dataset, hence affecting which observations could be considered outliers; an instance space analysis of combinations of normalization and detection methods enables the visualization of the strengths and weaknesses of these combinations.
Abstract: This paper demonstrates that the performance of various outlier detection methods is sensitive to both the characteristics of the dataset, and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest neighbor structure, and density of the dataset; hence, affecting which observations could be considered outliers. Then, we perform an instance space analysis of combinations of normalization and detection methods. Such analysis enables the visualization of the strengths and weaknesses of these combinations. Moreover, we gain insights into which method combination might obtain the best performance for a given dataset.

Journal ArticleDOI
TL;DR: The proposed multilevel EEG classification method consists of pre-processing, feature extraction, feature concatenation, feature selection and classification phases, and achieved a 98.4% success rate in the 5-class case.

Journal ArticleDOI
TL;DR: In this paper, an algorithm for declustering earthquake catalogs based on the nearest-neighbor analysis of seismicity is introduced. The algorithm discriminates between background and clustered events by random thinning that removes events according to a space-varying threshold.
Abstract: We introduce an algorithm for declustering earthquake catalogs based on the nearest-neighbor analysis of seismicity. The algorithm discriminates between background and clustered events by random thinning that removes events according to a space-varying threshold. The threshold is estimated using randomized-reshuffled catalogs that are stationary, have independent space and time components, and preserve the space distribution of the original catalog. Analysis of a catalog produced by the Epidemic Type Aftershock Sequence model demonstrates that the algorithm correctly classifies over 80% of background and clustered events, correctly reconstructs the stationary and space-dependent background intensity, and shows high stability with respect to random realizations (over 75% of events have the same estimated type in over 90% of random realizations). The declustering algorithm is applied to the global Northern California Earthquake Data Center catalog with magnitudes m ≥ 4 during 2000–2015; a Southern California catalog with m ≥ 2.5, 3.5 during 1981–2017; an area around the 1992 Landers rupture zone with m ≥ 0.0 during 1981–2015; and the Parkfield segment of the San Andreas fault with m ≥ 1.0 during 1984–2014. The null hypotheses of stationarity and space-time independence are not rejected by several tests applied to the estimated background events of the global and Southern California catalogs with magnitude ranges Δm < 4. However, both hypotheses are rejected for catalogs with a larger range of magnitudes Δm > 4. The deviations from the nulls are mainly due to local temporal fluctuations of seismicity and activity switching among subregions; they can be traced back to the original catalogs and represent genuine features of background seismicity.

Journal ArticleDOI
TL;DR: It is found that combining CNNFeat with traditional features can further improve the accuracy by 4.35%, 3.62% and 4.7% for SVM, LDA and KNN, respectively, and it is demonstrated that CNNFeat can be potentially enhanced with more data for model training.
Abstract: Although a large number of surface electromyography (sEMG) features have been proposed to improve hand gesture recognition accuracy, it is still hard to achieve acceptable performance in inter-session and inter-subject tests. To promote the application of sEMG-based human machine interaction, a convolutional neural network based feature extraction approach (CNNFeat) is proposed to improve hand gesture recognition accuracy. A sEMG database is recorded from eight subjects while performing ten hand gestures. Three classic classifiers, including linear discriminant analysis (LDA), support vector machine (SVM) and K nearest neighbor (KNN), are employed to compare the CNNFeat with 25 traditional features. This work concentrates on the analysis of CNNFeat through accuracy, safety index and repeatability index. The experimental results show that CNNFeat outperforms all the tested traditional features in inter-subject test and is listed as the best three features in inter-session test. Besides, it is also found that combining CNNFeat with traditional features can further improve the accuracy by 4.35%, 3.62% and 4.7% for SVM, LDA and KNN, respectively. Additionally, this work also demonstrates that CNNFeat can be potentially enhanced with more data for model training.

Journal ArticleDOI
TL;DR: This paper presents a sample-rebalanced and outlier-rejected $k$-nearest neighbor regression model for short-term traffic flow forecasting that outperforms state-of-the-art parametric and non-parametric models.
Abstract: Short-term traffic flow forecasting is a fundamental and challenging task due to the stochastic dynamics of the traffic flow, which is often imbalanced and noisy. This paper presents a sample-rebalanced and outlier-rejected $k$-nearest neighbor regression model for short-term traffic flow forecasting. In this model, we adopt a new metric for the evolutionary traffic flow patterns, and reconstruct balanced training sets by relative transformation to tackle the imbalance issue. Then, we design a hybrid model that considers both local and global information to address the limited size of the training samples. We employ four real-world benchmark datasets often used in such tasks to evaluate our model. Experimental results show that our model outperforms state-of-the-art parametric and non-parametric models.
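Stripped of the paper's rebalancing and outlier-rejection steps, the underlying $k$-nearest neighbor regression forecaster looks roughly like the sketch below, with distance-weighted averaging as one common (assumed) choice of aggregation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_flow_forecast(history_patterns, next_values, current_pattern, k=5):
    """Forecast the next traffic-flow value as the distance-weighted average of the
    values that followed the k historical patterns most similar to the current one."""
    nn = NearestNeighbors(n_neighbors=k).fit(history_patterns)
    dists, idx = nn.kneighbors(current_pattern.reshape(1, -1))
    weights = 1.0 / (dists[0] + 1e-9)   # nearer patterns contribute more
    weights /= weights.sum()
    return float(np.dot(weights, next_values[idx[0]]))
```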

Journal ArticleDOI
TL;DR: This work proposes a standardized kNN (SkNN) based fault detection method, in which a standardized distance is developed to characterize the distance between a sample and its neighbors, taking into consideration the scale information both within each mode and between modes.
Abstract: For multimode processes, the scale information of each individual mode is never considered in the distance calculation between a sample and its neighbors in k nearest neighbor (kNN) methods. This work proposes a standardized kNN (SkNN) based fault detection method, where a standardized distance is developed to characterize the distance between a sample and its neighbors, taking into consideration the scale information both within each mode and between modes. In addition, compared with the kNN-based fault diagnosis method, the SkNN-based fault diagnosis method considers the importance of different neighbors by constructing weights and assigning them to the neighbors. Moreover, when there is more than one fault variable, a concurrent reconstruction strategy and a greedy algorithm are used in the SkNN-based fault diagnosis method to eliminate the influence of other fault variables on the currently reconstructed variable and to reduce the computational complexity. Finally, an industrial case study is employed to demonstrate the effectiveness and advantages of the proposed SkNN-based fault detection and diagnosis method.
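One plausible reading of the standardized distance, sketched below, is to scale each mode's variables by that mode's own standard deviations before the neighbor search and then take the best-matching mode's mean kNN distance as the detection statistic; this is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sknn_detection_statistic(train_by_mode, x, k=5):
    """Fault-detection statistic in the spirit of standardized kNN: standardize each
    mode's data with that mode's own scale, then use the smallest mean kNN distance
    across modes as the statistic (assumed aggregation; compare to a threshold)."""
    stats = []
    for X_mode in train_by_mode:                   # list of (n_m, d) arrays, one per mode
        mu, sigma = X_mode.mean(axis=0), X_mode.std(axis=0) + 1e-12
        Z = (X_mode - mu) / sigma                  # scale information within the mode
        nn = NearestNeighbors(n_neighbors=k).fit(Z)
        d, _ = nn.kneighbors(((x - mu) / sigma).reshape(1, -1))
        stats.append(d.mean())
    return min(stats)
```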