Journal ArticleDOI

A consistent multivariate test of association based on ranks of distances

01 Jun 2013-Biometrika (Oxford University Press)-Vol. 100, Iss: 2, pp 503-510
TL;DR: The problem of detecting associations between random vectors of any dimension is considered, and a powerful test is proposed that is applicable in all dimensions and consistent against all dependent alternatives. The test has a simple form, is easy to implement, and has good power.
Abstract: SUMMARY We consider the problem of detecting associations between random vectors of any dimension. Few tests of independence exist that are consistent against all dependent alternatives. We propose a powerful test that is applicable in all dimensions and consistent against all alternatives. The test has a simple form, is easy to implement, and has good power.
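The test the abstract describes is rank/distance based: for every ordered pair of points (i, j), the remaining points are cross-classified by whether they fall inside the x-ball and the y-ball of radius d(x_i, x_j) and d(y_i, y_j) around point i, and the resulting 2×2 chi-squared statistics are summed. A minimal NumPy sketch of that idea (the function name `hhg_statistic`, the Euclidean distance choice, and the naive double loop are ours; the authors provide an optimized implementation in their R package):

```python
import numpy as np
from scipy.spatial.distance import cdist

def hhg_statistic(x, y):
    """Sum-of-2x2-chi-squared dependence statistic (illustrative sketch).

    x: (n, p) array, y: (n, q) array. For each ordered pair (i, j), the
    remaining n - 2 points are cross-classified by whether they lie within
    distance d(x_i, x_j) of x_i and within d(y_i, y_j) of y_i; each 2x2
    table contributes a Pearson chi-squared term.
    """
    n = x.shape[0]
    dx = cdist(x, x)
    dy = cdist(y, y)
    stat = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False
            a = dx[i, mask] <= dx[i, j]   # inside the x-ball around point i
            b = dy[i, mask] <= dy[i, j]   # inside the y-ball around point i
            a11 = np.sum(a & b)
            a12 = np.sum(a & ~b)
            a21 = np.sum(~a & b)
            a22 = np.sum(~a & ~b)
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:  # skip degenerate tables
                stat += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return stat
```

The null distribution is obtained by permuting the pairing of the x- and y-samples, exactly as with other distance-based tests; larger values of the statistic indicate dependence.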


Citations
Journal ArticleDOI
TL;DR: It is argued that equitability is properly formalized by a self-consistency condition closely related to Data Processing Inequality, and shown that estimating mutual information provides a natural and practical method for equitably quantifying associations in large datasets.
Abstract: How should one quantify the strength of association between two random variables without bias for relationships of a specific form? Despite its conceptual simplicity, this notion of statistical "equitability" has yet to receive a definitive mathematical formalization. Here we argue that equitability is properly formalized by a self-consistency condition closely related to Data Processing Inequality. Mutual information, a fundamental quantity in information theory, is shown to satisfy this equitability criterion. These findings are at odds with the recent work of Reshef et al. [Reshef DN, et al. (2011) Science 334(6062):1518-1524], which proposed an alternative definition of equitability and introduced a new statistic, the "maximal information coefficient" (MIC), said to satisfy equitability in contradistinction to mutual information. These conclusions, however, were supported only with limited simulation evidence, not with mathematical arguments. Upon revisiting these claims, we prove that the mathematical definition of equitability proposed by Reshef et al. cannot be satisfied by any (nontrivial) dependence measure. We also identify artifacts in the reported simulation evidence. When these artifacts are removed, estimates of mutual information are found to be more equitable than estimates of MIC. Mutual information is also observed to have consistently higher statistical power than MIC. We conclude that estimating mutual information provides a natural (and often practical) way to equitably quantify statistical associations in large datasets.
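Mutual information, the measure this abstract advocates, can be estimated in many ways; the simplest is the plug-in (histogram) estimate over a binned joint distribution. A minimal sketch of that estimator, where the function name `plugin_mutual_info` and the bin count are our own illustrative choices (plug-in estimates are biased upward for small samples, one of the practical issues the paper discusses):

```python
import numpy as np

def plugin_mutual_info(x, y, bins=16):
    """Plug-in (histogram) estimate of mutual information in nats (sketch).

    Bins both variables, forms the empirical joint distribution, and
    evaluates I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ].
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nz = p_xy > 0                           # 0 * log 0 contributes nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))
```

For independent samples the estimate sits near zero (up to a small positive bias of roughly (bins-1)²/(2n) nats), while any dependence, of whatever functional form, pushes it up; this form-agnostic behavior is what the equitability debate is about.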

524 citations

Proceedings Article
11 May 2015
TL;DR: This work introduces a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude.
Abstract: We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.
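The "popular class of nonparametric MI estimators based on k-nearest-neighbor graphs" that this abstract critiques is the Kraskov–Stögbauer–Grassberger (KSG) family. A compact sketch of the type-1 KSG estimator, assuming SciPy's `cKDTree` (the function name `ksg_mi` and the tolerance trick for strict distance counting are ours):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG (type 1) k-nearest-neighbor MI estimate in nats (sketch)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # eps_i: max-norm distance from point i to its k-th neighbor in the
    # joint space (the first query hit is the point itself).
    d, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = d[:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # n_x(i), n_y(i): neighbors strictly within eps_i in each marginal.
    nx = np.array([len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On weakly to moderately dependent data the estimator is accurate with modest samples; the paper's point is that as the true MI grows (strong dependence), the required sample size grows exponentially, because the estimator implicitly assumes local uniformity of the joint density.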

124 citations


Cites background from "A consistent multivariate test of a..."

  • ...While several problems (Simon and Tibshirani, 2014; Gorfine et al.) and alternatives (Heller et al., 2013; Székely et al., 2009) were pointed out, Kinney and Atwal (KA) were the first to point out that MIC’s apparent superiority to MI was actually due to flaws in estimation (Kinney and Atwal, ...)...


  • ...) and alternatives (Heller et al., 2013; Székely et al., 2009) were pointed out, Kinney and Atwal (KA) showed that MIC’s apparent superiority to MI was actually due to flaws in estimation (Kinney and Atwal, 2014)....


Journal ArticleDOI
TL;DR: This work seeks to summarize the main methods used to identify dependency between random variables, especially gene expression data, and also to evaluate the strengths and limitations of each method.
Abstract: One major task in molecular biology is to understand the dependency among genes to model gene regulatory networks. Pearson's correlation is the most common method used to measure dependence between gene expression signals, but it works well only when data are linearly associated. For other types of association, such as non-linear or non-functional relationships, methods based on the concepts of rank correlation and information theory-based measures are more adequate than the Pearson's correlation, but are less used in applications, most probably because of a lack of clear guidelines for their use. This work seeks to summarize the main methods (Pearson's, Spearman's and Kendall's correlations; distance correlation; Hoeffding's D: measure; Heller-Heller-Gorfine measure; mutual information and maximal information coefficient) used to identify dependency between random variables, especially gene expression data, and also to evaluate the strengths and limitations of each method. Systematic Monte Carlo simulation analyses ranging from sample size, local dependence and linear/non-linear and also non-functional relationships are shown. Moreover, comparisons in actual gene expression data are carried out. Finally, we provide a suggestive list of methods that can be used for each type of data set.
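The survey's central point, that Pearson's correlation captures only linear association, rank correlations capture monotone association, and neither detects non-monotone dependence, is easy to demonstrate. A small sketch using SciPy (the random seed and the two test relationships are our illustrative choices):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

y_mono = np.exp(x)   # nonlinear but strictly monotone in x
y_para = x ** 2      # non-monotone (non-functional in the survey's sense)

# Spearman is exactly 1 for a strictly monotone map, while Pearson
# understates the association because it measures only linearity.
print(pearsonr(x, y_mono)[0], spearmanr(x, y_mono)[0])

# Both linear and rank correlation sit near zero for the parabola,
# even though y is a deterministic function of x.
print(pearsonr(x, y_para)[0], spearmanr(x, y_para)[0])
```

Detecting the second kind of relationship is what motivates the distance-correlation, HHG, and information-theoretic measures the survey compares.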

112 citations


Cites background or methods from "A consistent multivariate test of a..."

  • ...Heller, Heller and Gorfine [11] propose a test of independence based on the distances among values of X and Y, i.e. d(xi, xj) and d(yi, yj) for i, j ∈ {1, ..., n}, respectively....


  • ...To estimate the P-value under H0, a permutation test [9,11] can be used to test if dCor = 0 (which occurs if and only if dCov = 0)....


  • ...Heller, Heller and Gorfine measure Heller, Heller and Gorfine [11] propose a test of independence based on the distances among values of X and Y, i....


  • ...measure [11], mutual information (MI) [12] and...


  • ...Methods that are applicable in multivariate scenarios are distance correlation and HHG [9,11]....

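The permutation test mentioned in the excerpts above works the same way for any dependence statistic, not just dCor: under the null, the pairing of x- and y-samples is exchangeable, so the null distribution is approximated by recomputing the statistic on randomly re-paired data. A generic sketch (the helper name `perm_pvalue` is ours):

```python
import numpy as np

def perm_pvalue(stat, x, y, n_perm=999, seed=0):
    """Permutation p-value for an arbitrary dependence statistic (sketch).

    stat: callable taking (x, y) and returning a scalar where larger
    values indicate stronger dependence.
    """
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    null = [stat(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    # The add-one correction keeps the p-value valid (never exactly zero).
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)
```

Plugging in the absolute Pearson correlation, distance correlation, or the HHG statistic gives the corresponding exact (conditional on the data) test.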

Posted Content
TL;DR: In this article, it is shown that nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs require a sample size that scales exponentially with the true MI, and a new estimator is proposed that is robust to local non-uniformity, works well with limited data, and captures relationship strengths over many orders of magnitude.
Abstract: We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.

107 citations

References
Journal Article
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

Journal ArticleDOI
TL;DR: The basic theory of the analysis of variance is examined by considering several different mathematical models, including fixed-effects models with independent observations of equal variance as well as other models.
Abstract: Originally published in 1959, this classic volume has had a major impact on generations of statisticians. Newly issued in the Wiley Classics Series, the book examines the basic theory of analysis of variance by considering several different mathematical models. Part I looks at the theory of fixed-effects models with independent observations of equal variance, while Part II begins to explore the analysis of variance in the case of other models.

5,728 citations

Journal ArticleDOI
TL;DR: Distance correlation is a new measure of dependence between random vectors that is based on certain Euclidean distances between sample elements rather than sample moments, yet has a compact representation analogous to the classical covariance and correlation.
Abstract: Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
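The "compact representation analogous to the classical covariance and correlation" that the abstract mentions makes the sample version short to write down: double-centre each pairwise distance matrix and correlate them. A minimal NumPy sketch (the function name `distance_correlation` is ours, and this is the plain V-statistic form rather than the bias-corrected variant):

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_correlation(x, y):
    """Sample distance correlation (sketch of the V-statistic form).

    Zero in the population if and only if the random vectors are
    independent, unlike classical correlation.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centred(d):
        # a_ij - row mean_i - column mean_j + grand mean
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

    A = centred(cdist(x, x))
    B = centred(cdist(y, y))
    dcov2 = (A * B).mean()            # squared sample distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))
```

Because the centred distance matrices encode the full geometry of each sample, non-monotone dependence such as y = x² produces a clearly positive value even when Pearson correlation is near zero; significance is then assessed by a permutation test, as noted in the excerpts above.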

2,042 citations


"A consistent multivariate test of a..." refers background or methods or result in this paper

  • ...Moreover, our aim is to investigate the performance of our test for nonmonotone relationships, and these classical tests, or related tests for higher dimensions found in Taskinen et al. (2005), are ineffective for testing non-monotone types of dependence (Szekely et al., 2007)....


  • ...In the following two examples from Szekely et al. (2007), none of the likelihood ratio type of tests considered performed well....


  • ...Szekely et al. (2007) considered multivariate examples and compared them to likelihood ratio type of tests....


  • ...We revisit some of the examples of Szekely et al. (2007), and add new examples....


  • ...A very elegant test with a simple formula is provided in Szekely et al. (2007), and has been further investigated in Szekely and Rizzo (2009) and in the discussions that followed it....


Journal ArticleDOI
TL;DR: 1. Density estimation for exploring data 2. Density estimation for inference 3. Nonparametric regression for exploring data 4. Inference with nonparametric regression 5. Checking parametric regression models 6. Comparing regression curves and surfaces
Abstract: 1. Density estimation for exploring data 2. Density estimation for inference 3. Nonparametric regression for exploring data 4. Inference with nonparametric regression 5. Checking parametric regression models 6. Comparing regression curves and surfaces 7. Time series data 8. An introduction to semiparametric and additive models References

1,424 citations


"A consistent multivariate test of a..." refers background in this paper

  • ...ft designs during the twentieth century. They consider two variables, wing span (m) and speed (km/h) for the 230 designs of the third (of three) periods. This example and the data (aircraft) are from Bowman and Azzalini (1997). They showed that the dCov test of independence of log(Speed) and log(Span) in period 3 is significant (p-value ≤ 0.00001), while the Pearson correlation test is not significant (p-value = 0.8001). Our...


Journal ArticleDOI

1,275 citations


"A consistent multivariate test of a..." refers methods in this paper

  • ...This can be done using multiple comparisons procedures, similar to post-hoc testing in the analysis of variance (Scheffe, 1959)....
