Journal ArticleDOI

Sequential Nonparametric Detection of Anomalous Data Streams

27 Apr 2021 - IEEE Signal Processing Letters (Institute of Electrical and Electronics Engineers (IEEE)) - Vol. 28, pp. 932-936
TL;DR: In this article, a nonparametric search problem to detect $L$ anomalous streams from a finite set of $S$ data streams is studied, and universal distribution-free sequential tests that are consistent are proposed.
Abstract: We study a nonparametric search problem to detect $L$ anomalous streams from a finite set of $S$ data streams. The $L$ anomalous streams are real-valued independent and identically distributed (i.i.d.) sequences drawn from the distribution $q$, while the remaining $S-L$ data streams are i.i.d. sequences drawn from the distribution $p$. The distributions $p$ and $q$ are assumed to be arbitrary and unknown, but distinct. We consider two cases: one where $L = 1$, and the other where $0 \leq L \leq A$. In both cases, we propose universal distribution-free sequential tests that are consistent. For the first case, we also: (1) show that the test is universally exponentially consistent and stops in finite time almost surely, and (2) bound the limiting growth rate of the expected stopping time as the probability of error decreases to zero. Simulations show that the performance of the proposed test is better than that of the fixed sample size test.
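To make the sequential setup concrete, here is a minimal Python sketch of a distance-based sequential test for the single-anomaly case L = 1. It is not the statistic proposed in the letter: the Kolmogorov-Smirnov distance to the pooled remaining streams, the gap-based stopping rule, and the shrinking threshold are illustrative assumptions.

```python
import numpy as np

def ks_distance(x, y):
    """Empirical Kolmogorov-Smirnov distance between two 1-D samples."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

def sequential_anomaly_test(streams, c=0.5, n_min=10, n_max=5000):
    """Observe all S streams one sample at a time; stop when one stream is
    clearly farther (in KS distance) from the pooled rest than any other.

    streams: list of callables, streams[i](n) returns n fresh i.i.d. samples.
    Returns (stopping_time, index declared anomalous).
    """
    S = len(streams)
    data = [np.empty(0) for _ in range(S)]
    scores = np.zeros(S)
    for n in range(1, n_max + 1):
        for i in range(S):
            data[i] = np.append(data[i], streams[i](1))
        if n < n_min:
            continue
        scores = np.array([
            ks_distance(data[i],
                        np.concatenate([data[j] for j in range(S) if j != i]))
            for i in range(S)
        ])
        top, runner_up = np.sort(scores)[-1], np.sort(scores)[-2]
        if top - runner_up > c / np.sqrt(n):   # gap rule with shrinking threshold
            return n, int(np.argmax(scores))
    return n_max, int(np.argmax(scores))

# Example: four N(0,1) streams and one anomalous N(1,1) stream.
rng = np.random.default_rng(0)
streams = [lambda n, r=rng: r.normal(0.0, 1.0, n) for _ in range(4)]
streams.append(lambda n, r=rng: r.normal(1.0, 1.0, n))
print(sequential_anomaly_test(streams))
```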
Citations
Proceedings ArticleDOI
24 May 2022
TL;DR: A universal sequential nonparametric clustering test for the case when K is known is proposed, thereby providing a new test for the case of anomaly detection in which the anomalous data streams can follow distinct probability distributions.
Abstract: We study a sequential nonparametric clustering problem to group a finite set of S data streams into K clusters. The data streams are real-valued i.i.d. data sequences generated from unknown continuous distributions. The distributions themselves are organized into clusters according to their proximity to each other based on a certain distance metric. We propose a universal sequential nonparametric clustering test for the case when K is known. We show that the proposed test stops in finite time almost surely and is universally exponentially consistent. We also bound the asymptotic growth rate of the expected stopping time as the probability of error goes to zero. Our results generalize earlier work on sequential nonparametric anomaly detection to the more general sequential nonparametric clustering problem, thereby providing a new test for the case of anomaly detection in which the anomalous data streams can follow distinct probability distributions. Simulations show that our proposed sequential clustering test outperforms the corresponding fixed sample size test.

1 citation
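As an illustration of the clustering variant above, the sketch below (using SciPy's hierarchical clustering) groups streams by their pairwise empirical KS distances and stops once a clustering margin, the smallest between-cluster distance minus the largest within-cluster distance, exceeds a threshold that shrinks with the sample size. The distance, the average-linkage choice, and the margin rule are assumptions made for illustration, not the test proposed in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ks_distance(x, y):
    """Empirical Kolmogorov-Smirnov distance between two 1-D samples."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(Fx - Fy))

def sequential_clustering(streams, K, c=0.5, n_min=20, n_max=5000):
    """Cluster S streams into K groups from sequentially observed samples.
    Returns (stopping_time, labels) with labels in {1, ..., K}."""
    S = len(streams)
    data = [np.empty(0) for _ in range(S)]
    labels = np.ones(S, dtype=int)
    for n in range(1, n_max + 1):
        for i in range(S):
            data[i] = np.append(data[i], streams[i](1))
        if n < n_min:
            continue
        D = np.zeros((S, S))
        for i in range(S):
            for j in range(i + 1, S):
                D[i, j] = D[j, i] = ks_distance(data[i], data[j])
        labels = fcluster(linkage(squareform(D), method="average"),
                          t=K, criterion="maxclust")
        same = labels[:, None] == labels[None, :]
        within = D[same & ~np.eye(S, dtype=bool)]
        between = D[~same]
        margin = ((between.min() if between.size else np.inf)
                  - (within.max() if within.size else 0.0))
        if margin > c / np.sqrt(n):   # stop once clusters are well separated
            return n, labels
    return n_max, labels
```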

Journal ArticleDOI
TL;DR: In this article, the authors study a sequential nonparametric clustering problem to group a finite set of S data streams into K clusters, where the data streams are real-valued i.i.d. data sequences generated from unknown continuous distributions.
References
Journal ArticleDOI
TL;DR: This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
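For concreteness, the quadratic-time unbiased estimator of the squared MMD with a Gaussian kernel can be written as below. The permutation-based calibration shown is a generic distribution-free alternative, not the large-deviation or asymptotic thresholds derived in the paper, and the bandwidth sigma is an arbitrary choice.

```python
import numpy as np

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased quadratic-time estimate of MMD^2 between 1-D samples x and y,
    using the Gaussian RBF kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))
    m, n = len(x), len(y)
    Kxx, Kyy, Kxy = k(x, x), k(y, y), k(x, y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # off-diagonal means
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

def mmd_permutation_test(x, y, n_perm=200, sigma=1.0, seed=None):
    """Distribution-free p-value for the two-sample problem, obtained by
    permuting the pooled sample and recomputing the MMD^2 statistic."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(x, y, sigma)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        exceed += mmd2_unbiased(perm[:len(x)], perm[len(x):], sigma) >= observed
    return observed, (exceed + 1) / (n_perm + 1)
```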

3,792 citations

Journal ArticleDOI
TL;DR: It is shown that the cumulative sum (CUSUM) test, which is well-known to be optimal for a non-Bayesian statistical change-point detection formulation, is optimal for the problem under study.
Abstract: The problem of sequentially finding an independent and identically distributed sequence that is drawn from a probability distribution Q1 by searching over multiple sequences, some of which are drawn from Q1 and the others of which are drawn from a different distribution Q0, is considered. In the problem considered, the number of sequences with distribution Q1 is assumed to be a random variable whose value is unknown. Within a Bayesian formulation, a sequential decision rule is derived that optimizes a trade-off between the probability of false alarm and the number of samples needed for the decision. In the case in which one can observe one sequence at a time, it is shown that the cumulative sum (CUSUM) test, which is well-known to be optimal for a non-Bayesian statistical change-point detection formulation, is optimal for the problem under study. Specifically, the CUSUM test is run on the first sequence. If a reset event occurs in the CUSUM test, then the sequence under examination is abandoned and the rule switches to the next sequence. If the CUSUM test stops, then the rule declares that the sequence under examination when the test stops is generated by Q1. The result is derived by assuming that there are infinitely many sequences so that a sequence that has been examined once is not retested. If there are finitely many sequences, the result is also valid under a memorylessness condition. Expressions for the performance of the optimal sequential decision rule are also developed. The general case in which multiple sequences can be examined simultaneously is considered. The optimal solution for this general scenario is derived.
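The search rule described above can be sketched as follows for the known-distribution, one-sequence-at-a-time case. The Gaussian log-likelihood ratio in the example and the cyclic revisiting of sequences are illustrative assumptions; the paper's optimality result assumes infinitely many sequences, or finitely many under a memorylessness condition.

```python
import numpy as np

def cusum_search(sequences, llr, threshold=5.0):
    """CUSUM-based search for a sequence drawn from Q1.

    Run the CUSUM statistic on the sequence under examination; on a reset
    event (statistic drops to or below zero) abandon it and switch to the
    next sequence; when the statistic crosses the threshold, declare the
    current sequence to be generated by Q1.

    sequences: list of callables, sequences[i]() returns the next sample.
    llr: function x -> log(q1(x) / q0(x)).
    Returns (declared index, total number of samples observed)."""
    total, i = 0, 0
    while True:
        w = 0.0                          # CUSUM statistic, restarted at 0
        while True:
            x = sequences[i % len(sequences)]()
            total += 1
            w += llr(x)
            if w >= threshold:           # stop and declare
                return i % len(sequences), total
            if w <= 0.0:                 # reset event: abandon this sequence
                i += 1
                break

# Example: Q0 = N(0,1), Q1 = N(1,1), so log(q1(x)/q0(x)) = x - 0.5.
rng = np.random.default_rng(1)
seqs = [lambda r=rng: r.normal(0.0, 1.0) for _ in range(4)]
seqs.append(lambda r=rng: r.normal(1.0, 1.0))
print(cusum_search(seqs, llr=lambda x: x - 0.5))
```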

105 citations

Journal ArticleDOI
TL;DR: Outlier hypothesis testing is studied in a universal setting, and it is shown that a universally exponentially consistent test cannot exist, even when the typical distribution is known and the null hypothesis is excluded.
Abstract: Outlier hypothesis testing is studied in a universal setting. Multiple sequences of observations are collected, a small subset of which are outliers. A sequence is considered an outlier if the observations in that sequence are distributed according to an outlier distribution, distinct from the typical distribution governing the observations in all the other sequences. Nothing is known about the outlier and typical distributions except that they are distinct and have full supports. The goal is to design a universal test to best discern the outlier sequence(s). For models with exactly one outlier sequence, the generalized likelihood test is shown to be universally exponentially consistent. A single-letter characterization of the error exponent achievable by the test is derived, and it is shown that the test achieves the optimal error exponent asymptotically as the number of sequences approaches infinity. When the null hypothesis with no outlier is included, a modification of the generalized likelihood test is shown to achieve the same error exponent under each non-null hypothesis, and also consistency under the null hypothesis. Then, models with more than one outlier are studied in the following settings. For the setting with a known number of distinctly distributed outliers, the achievable error exponent of the generalized likelihood test is characterized. The limiting error exponent achieved by such a test is characterized, and the test is shown to be asymptotically exponentially consistent. For the setting with an unknown number of identically distributed outliers, a modification of the generalized likelihood test is shown to achieve a positive error exponent under each non-null hypothesis, and also consistency under the null hypothesis. When the outlier sequences can be distinctly distributed (with their total number being unknown), it is shown that a universally exponentially consistent test cannot exist, even when the typical distribution is known and the null hypothesis is excluded.
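For a finite observation alphabet, a generalized-likelihood style test for the one-outlier case can be sketched from empirical distributions alone: score each candidate outlier index by how consistent the remaining sequences' empirical PMFs are with their pooled PMF, and declare the index with the smallest score. This captures the flavor of the test (terms common to all hypotheses are dropped); it is a sketch, not a verbatim transcription of the paper's statistic.

```python
import numpy as np

def empirical_pmf(seq, alphabet_size):
    """Empirical distribution (type) of a sequence over {0, ..., alphabet_size-1}."""
    return np.bincount(seq, minlength=alphabet_size) / len(seq)

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats, with a small floor to avoid log(0)."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def gl_outlier_test(sequences, alphabet_size):
    """For each candidate outlier index i, pool the types of the other
    sequences and score hypothesis i by the total KL divergence of those
    types from the pooled type; return the best index and all scores."""
    types = [empirical_pmf(s, alphabet_size) for s in sequences]
    M = len(sequences)
    scores = []
    for i in range(M):
        pooled = np.mean([types[j] for j in range(M) if j != i], axis=0)
        scores.append(sum(kl(types[j], pooled) for j in range(M) if j != i))
    return int(np.argmin(scores)), scores
```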

74 citations

Journal ArticleDOI
TL;DR: This article poses and analyzes outlying sequence detection in a hypothesis testing framework under different outlier recovery objectives and different degrees of knowledge about the underlying statistics of the outliers.
Abstract: Outliers refer to observations that do not conform to the expected patterns in high-dimensional data sets. When such outliers signify risks (e.g., in fraud detection) or opportunities (e.g., in spectrum sensing), harnessing the costs associated with the risks or missed opportunities necessitates mechanisms that can identify them effectively. Designing such mechanisms involves striking an appropriate balance between reliability and cost of sensing, as two opposing performance measures, where improving one tends to penalize the other. This article poses and analyzes outlying sequence detection in a hypothesis testing framework under different outlier recovery objectives and different degrees of knowledge about the underlying statistics of the outliers.

63 citations

Journal ArticleDOI
TL;DR: This paper provides a statistical framework for detecting rare events so that an optimal balance between detection reliability and agility, as two opposing performance measures, is established.
Abstract: Rare events can potentially occur in many applications. When manifested as opportunities to be exploited, risks to be ameliorated, or certain features to be extracted, such events become of paramount significance. Due to their sporadic nature, the information-bearing signals associated with rare events often lie in a large set of irrelevant signals and are not easily accessible. This paper provides a statistical framework for detecting such events so that an optimal balance between detection reliability and agility, as two opposing performance measures, is established. The core component of this framework is a sampling procedure that adaptively and quickly focuses the information-gathering resources on the segments of the dataset that bear the information pertinent to the rare events. Particular focus is placed on Gaussian signals with the aim of detecting signals with rare mean and variance values.

48 citations