Journal Article

Testing the equality of distributions of random vectors with categorical components

01 Aug 2001-Computational Statistics & Data Analysis (Elsevier Science Publishers B. V.)-Vol. 37, Iss: 2, pp 195-208
TL;DR: In this paper, a method for testing the equality of two or more distributions of random vectors with categorical components is developed.
About: This article was published in Computational Statistics & Data Analysis on 2001-08-01. It has received 22 citations to date. The article focuses on the topics: Multivariate random variable & Categorical variable.
Citations
Journal Article
TL;DR: It is concluded that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated and probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall.
Abstract: Next-generation sequencing technologies can be used to analyse genetically heterogeneous samples at unprecedented detail. The high coverage achievable with these methods enables the detection of many low-frequency variants. However, sequencing errors complicate the analysis of mixed populations and result in inflated estimates of genetic diversity. We developed a probabilistic Bayesian approach to minimize the effect of errors on the detection of minority variants. We applied it to pyrosequencing data obtained from a 1.5-kb-fragment of the HIV-1 gag/pol gene in two control and two clinical samples. The effect of PCR amplification was analysed. Error correction resulted in a two- and five-fold decrease of the pyrosequencing base substitution rate, from 0.05% to 0.03% and from 0.25% to 0.05% in the non-PCR and PCR-amplified samples, respectively. We were able to detect viral clones as rare as 0.1% with perfect sequence reconstruction. Probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall. Genetic diversity observed within and between two clinical samples resulted in various patterns of phenotypic drug resistance and suggests a close epidemiological link. We conclude that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated.

229 citations


Cites methods from "Testing the equality of distributio..."

  • ...A general non-parametric procedure for comparing vectors with categorical components was used to detect differences between the two haplotype distributions (36)....


Journal Article
TL;DR: The formal definition of the paradigm, the analysis of its impact on the literature, its main applications, works developed, pitfalls and guidelines, and ongoing research are presented.
Abstract: Multi-label learning is quite a recent supervised learning paradigm. Owing to its capabilities to improve performance in problems where a pattern may have more than one associated class, it has attracted the attention of researchers, producing an increasing number of publications. This study presents an up-to-date overview about multi-label learning with the aim of sorting and describing the main approaches developed till now. The formal definition of the paradigm, the analysis of its impact on the literature, its main applications, works developed, pitfalls and guidelines, and ongoing research are presented. WIREs Data Mining Knowl Discov 2014, 4:411-444. doi: 10.1002/widm.1139

188 citations


Cites methods from "Testing the equality of distributio..."

  • ...It was a two-stage method that separated the splitting-variable selection (using the statistic test of Nettleton and Banerjee [144]) and the splitting-point selection (that generates binary partitions of data) steps....


Journal Article
TL;DR: In this article, the problem of two-sample comparison with categorical data when the contingency table is sparsely populated is studied, and a general nonparametric approach that utilizes similarity information on the space of all categories in two sample tests is proposed.
Abstract: We study the problem of two-sample comparison with categorical data when the contingency table is sparsely populated. In modern applications, the number of categories is often comparable to the sample size, causing existing methods to have low power. When the number of categories is large, there is often underlying structure on the sample space that can be exploited. We propose a general non-parametric approach that utilizes similarity information on the space of all categories in two-sample tests. Our approach extends the graph-based tests of Friedman and Rafsky (1979) and Rosenbaum (2005), which are tests based on graphs connecting observations by similarity. Both tests require uniqueness of the underlying graph and cannot be directly applied to categorical data. We explored different ways to extend graph-based tests to the categorical setting and found two types of statistics that are both powerful and fast to compute. We showed that their permutation null distributions are asymptotically normal and that their p-value approximations under typical settings are quite accurate, facilitating the application of the new approach. The approach is illustrated through several examples.

25 citations

Journal Article
TL;DR: A new method for constructing multilabel classification trees is provided and it is compared with some existing methods in terms of bias and power in variable selection.

21 citations


Cites background or methods from "Testing the equality of distributio..."

  • ...Recently, Nettleton and Banerjee (2001) proposed a method for testing the equality of distributions of random vectors with categorical components, which is a specialization of the methods of Friedman and Rafsky (1979, 1983)....


  • ...According to the results of Nettleton and Banerjee (2001), the conditional expectation and variance of T are $E(T \mid e_x, e_y) = e_y \left[ 1 - \frac{2 e_x}{n_t(n_t - 1)} \right]$, (1) $\mathrm{Var}(T \mid e_x, C_x, e_y, C_y) = \frac{2 e_x e_y}{n_t(n_t - 1)} \left[ 1 - \frac{2 e_x e_y}{n_t(n_t - 1)} \right] + \frac{4}{n_t(n_t - 1)(n_t - 2)} \times [ C_x C_y + \{e_x(e_x - 1) - 2C_x\}\{e_y(e_y - 1) - 2C_y\} / (n_t -$ …...


  • ...Nettleton and Banerjee (2001) proposed the test statistic T for testing the equality of several distributions....


  • ...…= the number of elements in $N_t$; $C_y$ = the number of edge pairs consisting of elements in $N_t$ that share a common Y. Under $H_0: F_{t_L} = F_{t_R}$ with some regularity conditions, $S = \frac{T - E(T \mid e_x, e_y)}{\sqrt{\mathrm{Var}(T \mid e_x, C_x, e_y, C_y)}}$ (3) has an asymptotic $N(0, 1)$ distribution (see Nettleton and Banerjee, 2001)....
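
The quantities quoted in the excerpts above can be illustrated numerically. The following is a minimal Python sketch, not code from either paper: it evaluates the conditional mean in equation (1) as read from the excerpt and standardizes T as in equation (3). Because the variance expression is truncated in the excerpt, the variance is taken as an input rather than computed. The function names and the example numbers are hypothetical.

```python
import math


def conditional_mean(e_x: int, e_y: int, n_t: int) -> float:
    """E(T | e_x, e_y) = e_y * [1 - 2*e_x / (n_t*(n_t - 1))], as read from the excerpt."""
    if n_t < 2:
        raise ValueError("n_t must be at least 2")
    return e_y * (1.0 - 2.0 * e_x / (n_t * (n_t - 1)))


def standardized_statistic(t: float, mean: float, variance: float) -> float:
    """S = (T - E(T | .)) / sqrt(Var(T | .)).

    The variance is supplied by the caller (assumption: it was obtained from
    the full formula in Nettleton and Banerjee, 2001, which is truncated here).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (t - mean) / math.sqrt(variance)


# Hypothetical example values, chosen only to exercise the functions.
mean_t = conditional_mean(e_x=9, e_y=12, n_t=10)
s = standardized_statistic(t=12.0, mean=mean_t, variance=1.44)
print(f"E(T) = {mean_t:.3f}, S = {s:.3f}")
```

Under the null hypothesis stated in the excerpt, S would be compared against standard normal quantiles.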


Journal Article
TL;DR: This article proposes two novel nonparametric tests for comparing species assemblages based on the concept of data depth that can be considered as a natural generalization of the Kolmogorov-Smirnov and the Cramér-von Mises tests (KS and CM).
Abstract: Testing homogeneity of species assemblages has important applications in ecology. Due to the unique structure of abundance data often collected in ecological studies, most classical statistical tests cannot be applied directly. In this article, we propose two novel nonparametric tests for comparing species assemblages based on the concept of data depth. They can be considered as a natural generalization of the Kolmogorov-Smirnov and the Cramér-von Mises tests (KS and CM) in this species assemblage comparison context. Our simulation studies show that the proposed test is more powerful than other existing methods under various settings. A real example is used to demonstrate how the proposed method is applied to compare species assemblages using plant community data from a highly diverse tropical forest at Barro Colorado Island, Panama.

19 citations


Cites methods from "Testing the equality of distributio..."

  • ...The first one is the test proposed by Nettleton and Banerjee (2001) (NB hereafter), which applied the testing procedure of Friedman and Rafsky (1979) to compare distributions of random vectors with categorical components....


References
Journal Article
TL;DR: In this article, the author discusses two kinds of failure to make the best use of χ² tests, observed from time to time in reports of biological research, and presents a number of methods for strengthening or supplementing the most common uses of the ordinary χ² test.
Abstract: Since the χ² tests of goodness of fit and of association in contingency tables are presented in many courses on statistical methods for beginners in the subject, it is not surprising that χ² has become one of the most commonly-used techniques, even by scientists who profess only a smattering of knowledge of statistics. It is also not surprising that the technique is sometimes misused, e.g. by calculating χ² from data that are not frequencies or by errors in counting the number of degrees of freedom. A good catalogue of mistakes of this kind has been given by Lewis and Burke (1). In this paper I want to discuss two kinds of failure to make the best use of χ² tests which I have observed from time to time in reading reports of biological research. The first arises because χ² tests, as has often been pointed out, are not directed against any specific alternative to the null hypothesis. In the computation of χ², the deviations (f_i − m_i) between observed and expected frequencies are squared, divided by m_i in order to equalize the variances (approximately), and added. No attempt is made to detect any particular pattern of deviations (f_i − m_i) that may hold if the null hypothesis is false. One consequence is that the usual χ² tests are often insensitive, and do not indicate significant results when the null hypothesis is actually false. Some forethought about the kind of alternative hypothesis that is likely to hold may lead to alternative tests that are more powerful and appropriate. Further, when the ordinary χ² test does give a significant result, it does not direct attention to the way in which the null hypothesis disagrees with the data, although the pattern of deviations may be informative and suggestive for future research. The remedy here is to supplement the ordinary test by additional tests that help to reveal the significant type of deviation.
In this paper a number of methods for strengthening or supplementing the most common uses of the ordinary χ² test will be presented and illustrated by numerical examples. The principal devices are as follows:

3,351 citations

Journal Article
TL;DR: Large-sample significance points are tabulated for a distribution-free test of goodness of fit, introduced earlier by the authors, that is sensitive to discrepancies at the tails of the distribution rather than near the median.
Abstract: Some (large sample) significance points are tabulated for a distribution-free test of goodness of fit which was introduced earlier by the authors. The test, which uses the actual observations without grouping, is sensitive to discrepancies at the tails of the distribution rather than near the median. An illustration is given, using a numerical example used previously by Birnbaum in illustrating the Kolmogorov test.

2,013 citations

Journal Article
24 Jan 1987

1,717 citations

Book
01 Jan 1988

1,522 citations