Journal ArticleDOI

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant

01 Nov 2006 - The American Statistician (Taylor & Francis) - Vol. 60, Iss. 4, pp. 328-331
TL;DR: The authors point out that even large changes in significance levels can correspond to small, non-significant changes in the underlying quantities, and that summarizing comparisons by declarations of significance encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference.
Abstract: It is common to summarize statistical comparisons by declarations of statistical significance or nonsignificance. Here we discuss one problem with such declarations, namely that changes in statistical significance are often not themselves statistically significant. By this, we are not merely making the commonplace observation that any particular threshold is arbitrary—for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities. The error we describe is conceptually different from other oft-cited problems—that statistical significance is not the same as practical importance, that dichotomization into significant and nonsignificant results encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference, and that any particular threshold for declaring significance is arbitrary.

Summary (2 min read)

1 Introduction

  • A common statistical error is to summarize comparisons by statistical significance and then draw a sharp distinction between significant and non-significant results.
  • The approach of summarizing by statistical significance has a number of pitfalls, most of which are covered in standard statistics courses, but one of which the authors believe is less well known.
  • The authors refer to the fact that changes in statistical significance are not themselves significant.
  • If the estimated effect of a drug is to decrease blood pressure by 0.10 with a standard error of 0.03, this would be statistically significant but probably not important in practice (or so the authors suppose, given their general knowledge that blood pressure values are typically around 100).

2 Theoretical example: comparing the results of two experiments

  • In a hypothetical comparison of two independent studies with effect estimates 25 ± 10 and 10 ± 10, the first study is statistically significant at the 1% level, and the second is not at all statistically significant, being only one standard error away from 0.
  • Both find a positive effect but with much different magnitudes.
  • In fact, the third study finds an effect size much closer to that of the second study, but now because of the sample size it attains significance.
  • Declarations of statistical significance are often associated with decision making.

3 Applied example: homosexuality and the number of older brothers and sisters

  • The paper, “Biological versus nonbiological older brothers and men’s sexual orientation,” (Bogaert, 2006), appeared recently in the Proceedings of the National Academy of Sciences and was picked up by several leading science news organizations (Bower, 2006, Motluk, 2006, Staedter, 2006).
  • Only the number of biological older brothers reared with the participant, and not any other sibling characteristic including the number of nonbiological brothers reared with the participant, was significantly related to sexual orientation.
  • The conclusions appear to be based on a comparison of significance (for the coefficient of the number of older brothers) with nonsignificance (for the other coefficients), even though the differences between the coefficients do not appear to be statistically significant.
  • (Again the authors cannot be certain but they strongly suspect so from the graph and the table.)
  • Given that the 95% confidence level is standard (and the authors are pretty sure the paper would not have been published had the results not been statistically significant at that level), it is appropriate that the rule should be applied consistently to hypotheses consistent with the data.

4 Applied example: health effects of low-frequency electromagnetic fields

  • The issue of comparisons between significance and non-significance is of even more concern in the increasingly common setting where there are a large number of comparisons.
  • The researchers used this sort of display to hypothesize that one process was occurring at 255, 285, and 315 Hz (where effects were highly significant), another at 135 and 225 Hz (where effects were only moderately significant), and so forth.
  • The estimates are all of relative calcium efflux, so that an effect of 0.1, for example, corresponds to a 10% increase compared to the control condition.
  • At the very least, it is more informative to show the estimated treatment effect and standard error at each frequency, as in Figure 2b.
  • The authors' simple hierarchical model is not intended to be definitive, merely a model that they believe improves upon the separate judgments of statistical significance for each experiment.


The difference between “significant” and “not significant” is
not itself statistically significant
Andrew Gelman
Hal Stern
August 22, 2006
Abstract
It is common to summarize statistical comparisons by declarations of statistical
significance or non-significance. Here we discuss one problem with such declarations,
namely that changes in statistical significance are often not themselves statistically
significant. By this, we are not merely making the commonplace observation that
any particular threshold is arbitrary—for example, only a small change is required to
move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical
significance. Rather, we are pointing out that even large changes in significance levels
can correspond to small, non-significant changes in the underlying variables.
The error we describe is conceptually different from other oft-cited problems—that
statistical significance is not the same as practical importance, that dichotomization into
significant and non-significant results encourages the dismissal of observed differences
in favor of the usually less interesting null hypothesis of no difference, and that any
particular threshold for declaring significance is arbitrary. We are troubled by all of
these concerns and do not intend to minimize their importance. Rather, our goal is to
bring attention to what we have found is an important but much less discussed point.
We illustrate with a theoretical example and two applied examples.
Keywords: multilevel modeling, multiple comparisons, replication, statistical significance
1 Introduction
A common statistical error is to summarize comparisons by statistical significance and then
draw a sharp distinction between significant and non-significant results. The approach of
summarizing by statistical significance has a number of pitfalls, most of which are covered
in standard statistics courses but one that we believe is less well known. We refer to the fact
that changes in statistical significance are not themselves significant. By this, we are not
We thank Howard Wainer, Peter Westfall, and an anonymous reviewer for helpful comments and the
National Science Foundation for financial support.
Department of Statistics and Department of Political Science, Columbia University, New York,
gelman@stat.columbia.edu, www.stat.columbia.edu/gelman
Department of Statistics, University of California, Irvine, sternh@uci.edu, www.ics.uci.edu/sternh

merely making the commonplace observation that any particular threshold is arbitrary—for
example, only a small change is required to move an estimate from a 5.1% significance level
to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even
large changes in significance levels can correspond to small, non-significant changes in the
underlying variables. We shall illustrate with three examples.
This article does not attempt to provide a comprehensive discussion of significance
testing. There are several such discussions; see, for example, Krantz (1999). Indeed many
of the pitfalls of relying on declarations of statistical significance appear to be well known.
For example, by now practically all introductory texts point out that statistical significance
does not equal practical importance. If the estimated effect of a drug is to decrease blood
pressure by 0.10 with a standard error of 0.03, this would be statistically significant but
probably not important in practice (or so we suppose, given our general knowledge that
blood pressure values are typically around 100). Conversely, an estimated effect of 10 with
a standard error of 10 would not be statistically significant, but it has the possibility of
being important in practice. As well, introductory courses regularly warn students about
the perils of strict adherence to a particular threshold (the point mentioned above regarding
5.1% and 4.9% significance levels). Similarly most statisticians and many practitioners are
familiar with the notion that automatic use of a binary significant/non-significant decision
rule encourages practitioners to ignore potentially important observed differences in favor
of the usually less interesting null hypothesis. Thus, from this point forward we focus only
on the less widely known but equally important error of comparing two or more results by
comparing their degree of statistical significance.
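In z-score terms, the two blood-pressure examples above work out as follows (a rough check using the usual normal approximation):

    0.10 / 0.03 ≈ 3.3 standard errors  →  two-sided p ≈ 0.001 (statistically significant, practically unimportant)
    10 / 10 = 1.0 standard errors      →  two-sided p ≈ 0.32  (not statistically significant, possibly important)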
2 Theoretical example: comparing the results of two experiments
Consider two independent studies with effect estimates and standard errors of 25 ± 10 and
10 ± 10. The first study is statistically significant at the 1% level, and the second is not
at all statistically significant, being only one standard error away from 0. Thus it would
be tempting to conclude that there is a large difference between the two studies. In fact,
however, the difference is not even close to being statistically significant: the estimated
difference is 15, with a standard error of √(10² + 10²) ≈ 14.
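To make the arithmetic concrete, here is a minimal sketch (assuming Python with scipy available) that computes the normal-approximation z-scores and two-sided p-values for each study and for their difference:

```python
import math

from scipy import stats

def summarize(label, est, se):
    # Two-sided p-value from a normal approximation.
    z = est / se
    p = 2 * stats.norm.sf(abs(z))
    print(f"{label}: estimate = {est:g}, se = {se:g}, z = {z:.2f}, p = {p:.3f}")

summarize("study 1", 25, 10)            # z = 2.50, p = 0.012
summarize("study 2", 10, 10)            # z = 1.00, p = 0.317

# The comparison that actually matters: the difference between the estimates.
# For independent studies the variances add.
summarize("difference", 25 - 10, math.sqrt(10**2 + 10**2))  # z = 1.06, p = 0.289
```

The two studies look very different when judged by their own p-values, yet the p-value for their difference is about 0.29, nowhere near statistical significance.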
Additional problems arise when comparing estimates with different levels of information.
Suppose in our example that there is a third independent study with much larger sample
size that yields an effect estimate of 2.5 with standard error of 1.0. This third study attains

the same significance level as the first study, yet the difference between the two is itself also
significant. Both find a positive effect but with much different magnitudes. Does the third
study replicate the first study? If we restrict attention only to judgments of significance we
might say yes, but if we think about the effect being estimated we would say no, as noted
by Utts (1991). In fact, the third study finds an effect size much closer to that of the second
study, but now because of the sample size it attains significance.
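Continuing the sketch above under the same assumptions, the third study shows the reverse problem:

```python
summarize("study 3", 2.5, 1.0)    # z = 2.50, p = 0.012: same significance level as study 1
summarize("study 1 vs. study 3", 25 - 2.5, math.sqrt(10**2 + 1**2))
# z = 2.24, p = 0.025: two studies at the same significance level differ significantly
```

So equal significance levels do not imply replication, and unequal significance levels do not imply a real difference.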
Declarations of statistical significance are often associated with decision making. For
example, if the two estimates in the first paragraph concerned efficacy of blood pressure
drugs, then one might conclude that the first drug works and the second does not, making
the choice between them obvious. But is this obvious conclusion reasonable? The two drugs
do not appear to be significantly different from each other. One way of interpreting lack of
statistical significance is that further information might change one’s decision recommen-
dations. Our key point is not that we object to looking at statistical significance but that
comparing statistical significance levels is a bad idea. In making a comparison between two
treatments, one should look at the statistical significance of the difference rather than the
difference between their significance levels.
3 Applied example: homosexuality and the number of older
brothers and sisters
The paper, “Biological versus nonbiological older brothers and men’s sexual orientation,”
(Bogaert, 2006), appeared recently in the Proceedings of the National Academy of Sciences
and was picked up by several leading science news organizations (Bower, 2006, Motluk,
2006, Staedter, 2006). As the article in Science News put it:
The number of biological older brothers correlated with the likelihood of a man
being homosexual, regardless of the amount of time spent with those siblings
during childh ood, Bogaert says. No other sibling characteristic, such as number
of older sisters, displayed a link to male sexual orientation.
We were curious about this—why older brothers and not older sisters? The article
referred back to Blanchard and Bogaert (1996), which had the graph and table shown in
Figure 1, along with the following summary:
Significant beta coefficients differ statistically from zero and, when positive, indicate
a greater probability of homosexuality. Only the number of biological older brothers
reared with the participant, and not any other sibling characteristic including the
number of nonbiological brothers reared with the participant, was significantly related
to sexual orientation.

Predictor                                   β      SE     Wald statistic   p       e^β
Initial equation
  Number of older brothers                  0.29   0.11   7.26             0.007   1.33
  Number of older sisters                   0.08   0.10   0.63             0.43    1.08
  Number of younger brothers               -0.14   0.10   2.14             0.14    0.87
  Number of younger sisters                -0.02   0.10   0.05             0.82    0.98
  Father's age at time of proband's birth   0.02   0.02   1.06             0.30    1.02
  Mother's age at time of proband's birth  -0.03   0.02   1.83             0.18    0.97
Final equation—number of older brothers     0.28   0.10   8.77             0.003   1.33

Figure 1: From Blanchard and Bogaert (1996): (a) mean numbers of older and younger
brothers and sisters for 302 homosexual men and 302 matched heterosexual men, (b) logistic
regression of sexual orientation on family variables from these data. The graph and table
illustrate that, in these data, homosexuality is more strongly associated with number of
older brothers than with number of older sisters. However, no evidence is presented that
would indicate that this difference is statistically significant.

The conclusions appear to be based on a comparison of significance (for the coefficient of
the number of older brothers) with nonsignificance (for the other coefficients), even though
the differences between the coefficients do not appear to be statistically significant. One
cannot quite be sure—it is a regression analysis and the different coefficient estimates are
not independent—but based on the picture we strongly doubt that the difference between
the coefficient of the number of older brothers and the coefficient of the number of older
sisters is significant.
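As a rough check of that suspicion, here is a sketch (an approximation only: it treats the two coefficient estimates as independent, which, as the text notes, is not exactly true for coefficients from the same regression):

```python
import math

from scipy import stats

# Coefficient estimates (beta, se) from the logistic regression in Figure 1.
b_brothers, se_brothers = 0.29, 0.11   # number of older brothers
b_sisters,  se_sisters  = 0.08, 0.10   # number of older sisters

diff = b_brothers - b_sisters
# Independence approximation: the estimates are actually correlated,
# so this standard error is only a rough guide.
se_diff = math.sqrt(se_brothers**2 + se_sisters**2)
z = diff / se_diff
p = 2 * stats.norm.sf(abs(z))
print(f"difference = {diff:.2f} +/- {se_diff:.2f}, z = {z:.2f}, p = {p:.2f}")
# difference = 0.21 +/- 0.15, z = 1.41, p = 0.16: not statistically significant
```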
Is it appropriate to criticize an analysis of this type? After all, the data are consistent
with the hypothesis that only the number of older brothers matters. But the data are also
consistent with the hypothesis that only the birth order (the total number of older siblings)
matters. (Again we cannot be certain but we strongly suspect so from the graph and the
table.) Given that the 95% confidence level is standard (and we are pretty sure the paper
would not have been published had the results not been statistically significant at that level),
it is appropriate that the rule should be applied consistently to hypotheses consistent with
the data. We are speaking here not as experts in biology but rather as statisticians: the
published article and its media reception suggest unquestioning acceptance of a result (only
the number of older brothers matters) which, if properly expressed as a comparison, would
be better described as “suggestive.” For example, the authors could have written that the
sexual preference of the men in the sample is statistically significantly related to birth order
and, in addition, more strongly related to number of older brothers than number of older
sisters, but with the latter difference not being statistically significant.
4 Applied example: health effects of low-frequency electromagnetic fields
The issue of comparisons between significance and non-significance is of even more concern
in the increasingly common setting where there are a large number of comparisons. We
illustrate with an example of a laboratory study with public health applications.
In the wake of concerns about the health effects of low-frequency electric and magnetic
fields, Blackman et al. (1988) performed a series of experiments to measure the effect of
electromagnetic fields at various frequencies on the functioning of chick brains. At each of

Citations
Journal ArticleDOI
TL;DR: The American Statistical Association (ASA) released a policy statement on p-values and statistical significance, developed after discussion with the ASA Board and motivated by broad concern about the reproducibility and replicability of scientific conclusions.
Abstract: Cobb’s concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it’s what we do; we do it because it’s what we teach.” This concern was brought to the attention of the ASA Board. The ASA Board was also stimulated by highly visible discussions over the last few years. For example, ScienceNews (Siegfried 2010) wrote: “It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” A November 2013 article in Phys.org Science News Wire (2013) cited “numerous deep flaws” in null hypothesis significance testing. A ScienceNews article (Siegfried 2014) on February 7, 2014, said “statistical techniques for testing hypotheses ... have more flaws than Facebook’s privacy policies.” A week later, statistician and “Simply Statistics” blogger Jeff Leek responded. “The problem is not that people use P-values poorly,” Leek wrote, “it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis” (Leek 2014). That same week, statistician and science writer Regina Nuzzo published an article in Nature entitled “Scientific Method: Statistical Errors” (Nuzzo 2014). That article is now one of the most highly viewed Nature articles, as reported by altmetric.com (http://www.altmetric.com/details/2115792#score). Of course, it was not simply a matter of responding to some articles in print. The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. Without getting into definitions and distinctions of these terms, we observe that much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices, such as the one taken by the editors of Basic and Applied Social Psychology, who decided to ban p-values (null hypothesis significance testing) (Trafimow and Marks 2015). Misunderstanding or misuse of statistical inference is only one cause of the “reproducibility crisis” (Peng 2015), but to our community, it is an important one. When the ASA Board decided to take up the challenge of developing a policy statement on p-values and statistical significance, it did so recognizing this was not a lightly taken step. The ASA has not previously taken positions on specific matters of statistical practice. The closest the association has come to this is a statement on the use of value-added models (VAM) for educational assessment (Morganstein and Wasserstein 2014) and a statement on risk-limiting post-election audits (American Statistical Association 2010). However, these were truly policy-related statements. The VAM statement addressed a key educational policy issue, acknowledging the complexity of the issues involved, citing limitations of VAMs as effective performance models, and urging that they be developed and interpreted with the involvement of statisticians. The statement on election auditing was also in response to a major but specific policy issue (close elections in 2008), and said that statistically based election audits should become a routine part of election processes. By contrast, the Board envisioned that the ASA statement on p-values and statistical significance would shed light on an aspect of our field that is too often misunderstood and misused in the broader research community, and, in the process, provide the community a service.
The intended audience would be researchers, practitioners, and science writers who are not primarily statisticians. Thus, this statement would be quite different from anything previously attempted. The Board tasked Wasserstein with assembling a group of experts representing a wide variety of points of view. On behalf of the Board, he reached out to more than two dozen such people, all of whom said they would be happy to be involved. Several expressed doubt about whether agreement could be reached, but those who did said, in effect, that if there was going to be a discussion, they wanted to be involved. Over the course of many months, group members discussed what format the statement should take, tried to more concretely visualize the audience for the statement, and began to find points of agreement. That turned out to be relatively easy to do, but it was just as easy to find points of intense disagreement. The time came for the group to sit down together to hash out these points, and so in October 2015, 20 members of the group met at the ASA Office in Alexandria, Virginia. The 2-day meeting was facilitated by Regina Nuzzo, and by the end of the meeting, a good set of points around which the statement could be built was developed. The next 3 months saw multiple drafts of the statement, reviewed by group members, by Board members (in a lengthy discussion at the November 2015 ASA Board meeting), and by members of the target audience. Finally, on January 29, 2016, the Executive Committee of the ASA approved the statement. The statement development process was lengthier and more controversial than anticipated. For example, there was considerable discussion about how best to address the issue of multiple potential comparisons (Gelman and Loken 2014). We debated at some length the issues behind the words “a p-value near 0.05 taken by itself offers only weak evidence against the null

4,361 citations

Journal ArticleDOI
TL;DR: This test can be used for models that integrate moderation and mediation in which the relationship between the indirect effect and the moderator is estimated as linear, including many of the models described by Edwards and Lambert (2007) and Preacher, Rucker, and Hayes (2007), as well as extensions of these models to processes involving multiple mediators operating in parallel or in serial.
Abstract: I describe a test of linear moderated mediation in path analysis based on an interval estimate of the parameter of a function linking the indirect effect to values of a moderator—a parameter that I call the index of moderated mediation. This test can be used for models that integrate moderation and mediation in which the relationship between the indirect effect and the moderator is estimated as linear, including many of the models described by Edwards and Lambert (2007) and Preacher, Rucker, and Hayes (2007), as well as extensions of these models to processes involving multiple mediators operating in parallel or in serial. Generalization of the method to latent variable models is straightforward. Three empirical examples describe the computation of the index and the test, and its implementation is illustrated using Mplus and the PROCESS macro for SPSS and SAS.

2,437 citations


Cites background from "The Difference Between “Significant..."

  • ...Difference in significance does not imply significantly different (Gelman & Stern, 2006)....

    [...]

Journal ArticleDOI
TL;DR: Some of you exploring this special issue of The American Statistician might be wondering if it’s a scolding from pedantic statisticians lecturing you about what not to do with p-values, without offering any real ideas of what to do about the very hard problem of separating signal from noise in data.
Abstract: Some of you exploring this special issue of The American Statistician might be wondering if it’s a scolding from pedantic statisticians lecturing you about what not to do with p-values, without offering any real ideas of what to do about the very hard problem of separating signal from noise in data and making decisions under uncertainty. Fear not. In this issue, thanks to 43 innovative and thought-provoking papers from forward-looking statisticians, help is on the way.

1,761 citations

Journal ArticleDOI
TL;DR: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant as discussed by the authors, and there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists.
Abstract: Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.

1,584 citations


References
Book
01 Jan 1995
TL;DR: A textbook on Bayesian data analysis, covering the fundamentals of Bayesian inference, hierarchical models, Bayesian computation including Markov chain simulation, and regression models.
Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basic Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations


Journal ArticleDOI
TL;DR: The range of possible theories of the birth order phenomenon is restricted to those that can explain not only why older brothers increase the probability of homosexuality in later-born males but also why older sisters neither enhance this effect nor counteract it.
Abstract: Objective: This study investigated whether homosexual men have a higher mean birth order than heterosexual men primarily because they have more older brothers or because they have more older siblings of both sexes. Method: For the main analyses, 302 homosexual men were individually matched on year of birth with an equal number of heterosexual men. Each completed a self-administered, anonymous questionnaire concerning family background and other biodemographic information. Results: Logistic regression analysis showed that homosexuality was positively correlated with the proband’s number of older brothers but not with older sisters, younger brothers, younger sisters, or parental age at the time of the proband’s birth. Each additional older brother increased the odds of homosexuality by 33%. Conclusions: These results restrict the range of possible theories of the birth order phenomenon to those that can explain not only why older brothers increase the probability of homosexuality in later-born males but also why older sisters neither enhance this effect nor counteract it. (Am J Psychiatry 1996; 153:27-31)

242 citations


"The Difference Between “Significant..." refers background in this paper

  • ...The article referred back to Blanchard and Bogaert (1996), which had the graph and table shown in Figure 1, along with the following summary: Significant beta coefficients differ statistically from zero and, when positive, indicate a…...

    [...]

  • ...We were curious about this—why older brothers and not older sisters? The article referred back to Blanchard and Bogaert (1996), which had the graph and table shown in Figure 1, along with the following summary:...

    [...]

  • ...From Blanchard and Bogaert (1996): (a) mean numbers of older and younger brothers and sisters for 302 homosexual men and 302 matched heterosexual men, (b) logistic regression of sexual orientation on family variables from these data....

    [...]

Frequently Asked Questions (4)
Q1. What have the authors contributed in “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant”?

Here the authors discuss one problem with such declarations, namely that changes in statistical significance are often not themselves statistically significant: even large changes in significance levels can correspond to small, non-significant changes in the underlying variables. The error the authors describe is conceptually different from other oft-cited problems—that statistical significance is not the same as practical importance, that dichotomization into significant and non-significant results encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference, and that any particular threshold for declaring significance is arbitrary. Rather, their goal is to bring attention to what the authors have found is an important but much less discussed point.

The multilevel analysis can be seen as a way to estimate the effects at each frequency j, without setting apparently “non-significant” results to zero. 

Another way to handle the large number of related experiments in a single data analysis is to fit a multilevel model of the sort used in meta-analysis. 

The researchers in the chick-brain experiment made the common mistake of using statistical significance as a criterion for separating the estimates of different effects, an approach that does not make sense.
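To make the multilevel idea concrete, here is a minimal sketch of the partial-pooling computation, with made-up per-frequency estimates (not the Blackman et al. data) and a simple method-of-moments estimate of the between-frequency variance; a full analysis would estimate these quantities jointly, as in standard meta-analysis:

```python
import numpy as np

# Hypothetical per-frequency effect estimates and standard errors
# (illustrative values only -- not the Blackman et al. (1988) data).
est = np.array([0.02, 0.11, -0.03, 0.15, 0.08, 0.01])
se = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.05])

# Method-of-moments estimate of the between-frequency variance tau^2:
# the excess of the observed variance of the estimates over their sampling variance.
overall = np.average(est, weights=1 / se**2)
tau2 = max(0.0, np.var(est, ddof=1) - np.mean(se**2))

# Partial pooling: each estimate is shrunk toward the overall mean,
# more strongly when its standard error is large relative to tau.
shrink = tau2 / (tau2 + se**2)
pooled = overall + shrink * (est - overall)

for j, (raw, post) in enumerate(zip(est, pooled)):
    print(f"frequency {j}: raw = {raw:+.3f}, partially pooled = {post:+.3f}")
```

Instead of sorting frequencies into significant and non-significant bins, every frequency gets an estimate, with the noisier ones pulled toward the overall mean.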