Journal Article

Common pitfalls in statistical analysis: Logistic regression.

TL;DR: This article discusses logistic regression analysis, a statistical technique for evaluating the relationship between various predictor variables and a binary outcome, and the limitations of this technique.
Abstract: Logistic regression analysis is a statistical technique to evaluate the relationship between various predictor variables (either categorical or continuous) and an outcome which is binary (dichotomous). In this article, we discuss logistic regression analysis and the limitations of this technique.
Citations
Journal Article
TL;DR: This article looks at statistical measures of agreement for different types of data and discusses the differences between these and those for assessing correlation.
Abstract: Agreement between measurements refers to the degree of concordance between two (or more) sets of measurements. Statistical methods to test agreement are used to assess inter-rater variability or to decide whether one technique for measuring a variable can substitute another. In this article, we look at statistical measures of agreement for different types of data and discuss the differences between these and those for assessing correlation.

239 citations


Cites methods from "Common pitfalls in statistical anal..."

  • ...Superficially, these data may appear to be amenable to analysis using methods used for 2 × 2 tables (if the variable is categorical) or correlation (if numeric), which we have discussed previously in this series.[1,2] However, a closer look would show that this is not true....

    [...]

Journal Article
01 Jan 2020 - Database
TL;DR: This study focused on analyzing and discussing various published artificial intelligence and machine learning solutions, approaches and perspectives, aiming to advance academic solutions in paving the way for a new data-centric era of discovery in healthcare.
Abstract: Precision medicine is one of the recent and powerful developments in medical care, which has the potential to improve the traditional symptom-driven practice of medicine, allowing earlier interventions using advanced diagnostics and tailoring better and economically personalized treatments. Identifying the best pathway to personalized and population medicine involves the ability to analyze comprehensive patient information together with broader aspects to monitor and distinguish between sick and relatively healthy people, which will lead to a better understanding of biological indicators that can signal shifts in health. While the complexities of disease at the individual level have made it difficult to utilize healthcare information in clinical decision-making, some of the existing constraints have been greatly minimized by technological advancements. To implement effective precision medicine with enhanced ability to positively impact patient outcomes and provide real-time decision support, it is important to harness the power of electronic health records by integrating disparate data sources and discovering patient-specific patterns of disease progression. Useful analytic tools, technologies, databases, and approaches are required to augment networking and interoperability of clinical, laboratory and public health systems, as well as addressing ethical and social issues related to the privacy and protection of healthcare data with effective balance. Developing multifunctional machine learning platforms for clinical data extraction, aggregation, management and analysis can support clinicians by efficiently stratifying subjects to understand specific scenarios and optimize decision-making. Implementation of artificial intelligence in healthcare is a compelling vision that has the potential in leading to the significant improvements for achieving the goals of providing real-time, better personalized and population medicine at lower costs. 
In this study, we focused on analyzing and discussing various published artificial intelligence and machine learning solutions, approaches and perspectives, aiming to advance academic solutions in paving the way for a new data-centric era of discovery in healthcare.

221 citations

Journal Article
03 Sep 2019
TL;DR: This review provides definitions and basic knowledge of machine learning categories, introduces the underlying concept of the bias-variance trade-off as an important foundation in supervised machine learning, and discusses approaches to the supervised machine learning study design.
Abstract: Increased interest in the opportunities provided by artificial intelligence and machine learning has spawned a new field of health-care research. The new tools under development are targeting many aspects of medical practice, including changes to the practice of pathology and laboratory medicine. Optimal design in these powerful tools requires cross-disciplinary literacy, including basic knowledge and understanding of critical concepts that have traditionally been unfamiliar to pathologists and laboratorians. This review provides definitions and basic knowledge of machine learning categories (supervised, unsupervised, and reinforcement learning), introduces the underlying concept of the bias-variance trade-off as an important foundation in supervised machine learning, and discusses approaches to the supervised machine learning study design along with an overview and description of common supervised machine learning algorithms (linear regression, logistic regression, Naive Bayes, k-nearest neighbor, support vector machine, random forest, convolutional neural networks).

172 citations


Cites methods from "Common pitfalls in statistical anal..."

  • ...Additionally, this approach assumes that the relationship between the independent variables (features) and the dependent variables (target) are uniform which may limit the model’s performance.(31,32) Naive Bayes Naive Bayes classifiers use a probabilistic approach that is based on the Bayes theorem....

    [...]

Journal Article
04 Jan 2021 - BMJ Open
TL;DR: The authors examined risk perceptions and behavioural responses of the UK adult population during the early phase of the COVID-19 epidemic and found that willingness to self-isolate was high across all respondents.
Abstract: Objective To examine risk perceptions and behavioural responses of the UK adult population during the early phase of the COVID-19 epidemic in the UK. Design A cross-sectional survey. Setting Conducted with a nationally representative sample of UK adults within 48 hours of the UK Government advising the public to stop non-essential contact with others and all unnecessary travel. Participants 2108 adults living in the UK aged 18 years and over. Response rate was 84.3% (2108/2500). Data collected between 17 March and 18 March 2020. Main outcome measures Descriptive statistics for all survey questions, including number of respondents and weighted percentages. Robust Poisson regression used to identify sociodemographic variation in: (1) adoption of social distancing measures, (2) ability to work from home, and (3) ability and (4) willingness to self-isolate. Results Overall, 1992 (94.2%) respondents reported at least one preventive measure: 85.8% washed their hands with soap more frequently; 56.5% avoided crowded areas and 54.5% avoided social events. Adoption of social distancing measures was higher in those aged over 70 years compared with younger adults aged 18–34 years (adjusted relative risk/aRR: 1.2; 95% CI: 1.1 to 1.5). Those with lowest household income were three times less likely to be able to work from home (aRR: 0.33; 95% CI: 0.24 to 0.45) and less likely to be able to self-isolate (aRR: 0.92; 95% CI: 0.88 to 0.96). Ability to self-isolate was also lower in black and minority ethnic groups (aRR: 0.89; 95% CI: 0.79 to 1.0). Willingness to self-isolate was high across all respondents. Conclusions Ability to adopt and comply with certain non-pharmaceutical interventions (NPIs) is lower in the most economically disadvantaged in society. Governments must implement appropriate social and economic policies to mitigate this. 
By incorporating these differences in NPIs among socioeconomic subpopulations into mathematical models of COVID-19 transmission dynamics, our modelling of epidemic outcomes and response to COVID-19 can be improved.

122 citations

Journal Article
TL;DR: Evidence is provided that older age, male gender, Asian, indigenous or unknown race, comorbidities (smoking, kidney disease, obesity, pulmonary disease, diabetes, and cardiovascular disease), as well as fever and shortness of breath increased the risk of hospitalization, whereas only older age and shortness of breath increased the risk of death in hospitalized patients.
Abstract: Brazil is, at the time of writing, the global epicenter of COVID-19, but information on risk factors for hospitalization and mortality in the country is still limited. Demographic and clinical data of COVID-19 patients until June 11th, 2020 were retrieved from the State Health Secretariat of Espirito Santo, Brazil. Potential risk factors for COVID-19 hospitalization and death were analyzed by univariate and multivariable logistic regression models. A total of 10,713 COVID-19 patients were included in this study; 81.0% were younger than 60 years, 55.2% were female, 89.2% were not hospitalized, 32.9% had at least one comorbidity, and 7.7% died. The most common symptoms on admission were cough (67.7%) and fever (62.6%); 7.1% of the patients were asymptomatic. Cardiovascular diseases (23.7%) and diabetes (10.3%) were the two most common chronic diseases. Multivariate logistic regression analysis identified an association of all explanatory variables, except for cough and diarrhea, with hospitalization. Older age (odds ratio [OR] = 3.95, P < 0.001) and shortness of breath (OR = 3.55, P < 0.001) were associated with increase of odds to COVID-19 death in hospitalized patients. Our study provided evidence that older age, male gender, Asian, indigenous or unknown race, comorbidities (smoking, kidney disease, obesity, pulmonary disease, diabetes, and cardiovascular disease), as well as fever and shortness of breath increased the risk of hospitalization. For death outcome in hospitalized patients, only older age and shortness of breath increased the risk.

116 citations


Cites background from "Common pitfalls in statistical anal..."

  • ...Another limitation is that complex interactions involving many variables may not be correctly understood through multivariate binary logistic regression.(49) For example, cardiovascular disease interacts in many ways with gender, symptoms, age, and other comorbidities, and these interac-...

    [...]

References
Journal Article
TL;DR: Findings indicate that a low number of events per variable (EPV) can lead to major problems: the regression coefficients were biased in both positive and negative directions, and paradoxical associations (significance in the wrong direction) became more frequent.

6,490 citations


"Common pitfalls in statistical anal..." refers methods in this paper

  • ...Restricting the number of variables entered into a multivariate logistic regression model? It has been suggested that the data should contain at least ten events for each variable entered into a logistic regression model.[3] Hence, if we wish to find predictors of mortality using a sample in which there have been sixty deaths, we can study no more than 6 (= 60/10) predictor variables....

    [...]
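
The ten-events-per-variable rule of thumb quoted above reduces to an integer division. The sketch below only illustrates that arithmetic; the function name `max_predictors` and its parameters are hypothetical, and the rule itself is a heuristic whose validity has been questioned.

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Upper bound on predictors under an events-per-variable rule of thumb.

    Hypothetical helper for illustration: n_events is the number of outcome
    events (e.g. deaths) in the sample, events_per_variable is the minimum
    number of events required per predictor (10 in the quoted rule).
    """
    if n_events < 0 or events_per_variable <= 0:
        raise ValueError("n_events must be >= 0 and events_per_variable > 0")
    # Integer division: a partial "block" of events does not buy a predictor.
    return n_events // events_per_variable


# The example from the quoted passage: sixty deaths, ten events per variable.
print(max_predictors(60))  # 6
```

Using a smaller `events_per_variable` value relaxes the bound; whether that is acceptable depends on the analysis and is not something the sketch can decide.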

Journal Article
TL;DR: A large simulation study of other influences on confidence interval coverage, type I error, relative bias, and other model performance measures found a range of circumstances in which coverage and bias were within acceptable levels despite less than 10 EPV.
Abstract: The rule of thumb that logistic and Cox models should be used with a minimum of 10 outcome events per predictor variable (EPV), based on two simulation studies, may be too conservative. The authors conducted a large simulation study of other influences on confidence interval coverage, type I error, relative bias, and other model performance measures. They found a range of circumstances in which coverage and bias were within acceptable levels despite less than 10 EPV, as well as other factors that were as influential as or more influential than EPV. They conclude that this rule can be relaxed, in particular for sensitivity analyses undertaken to demonstrate adequate control of confounding.

2,943 citations


"Common pitfalls in statistical anal..." refers background in this paper

  • ...However, the validity of this thumb rule has been questioned.[4]...

    [...]

Journal Article
TL;DR: The meaning of risk and odds and the difference between the two are explained.
Abstract: In biomedical research, we are often interested in quantifying the relationship between an exposure and an outcome. "Odds" and "Risk" are the most common terms which are used as measures of association between variables. In this article, which is the fourth in the series of common pitfalls in statistical analysis, we explain the meaning of risk and odds and the difference between the two.

115 citations


"Common pitfalls in statistical anal..." refers background in this paper

  • ...This means that a person receiving sclerotherapy is nearly twice as likely to die than a patient receiving ligation (please note that these are odds and not actual risks – for more on this, please refer to our article on odds and risk).[2]...

    [...]

  • ...As discussed in our previous article on odds and risk,[2] standard errors and hence confidence intervals can be Table 1: Relation of death (a dichotomous outcome) with (a) treatment given (variceal ligation versus sclerotherapy), (b) prior beta‐blocker therapy, and (c) both treatment given and prior beta‐blocker therapy...

    [...]

  • ...The odds differ from the risk, and while the odds may appear to be high, the absolute risk may be low.[2]...

    [...]
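
The distinction between odds and risk drawn in the passages above can be made concrete with a small calculation. This is a minimal sketch: the counts are invented for illustration and are not taken from the paper's Table 1, and the helper names `risk` and `odds` are hypothetical.

```python
def risk(events: int, total: int) -> float:
    """Risk: events divided by everyone in the group."""
    return events / total


def odds(events: int, total: int) -> float:
    """Odds: events divided by non-events in the group."""
    return events / (total - events)


# Hypothetical counts (NOT the paper's data): deaths / patients per treatment.
a_events, a_total = 30, 100   # illustrative treatment group A
b_events, b_total = 18, 100   # illustrative treatment group B

risk_ratio = risk(a_events, a_total) / risk(b_events, b_total)
odds_ratio = odds(a_events, a_total) / odds(b_events, b_total)

print(f"risk ratio: {risk_ratio:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")
```

With these invented counts the odds ratio is larger than the risk ratio, which is why an odds ratio should not be read as "times as likely" unless the outcome is rare.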

Journal Article
TL;DR: This article deals with linear regression analysis which predicts the value of one continuous variable from another and discusses the assumptions and pitfalls associated with this analysis.
Abstract: In a previous article in this series, we explained correlation analysis which describes the strength of relationship between two continuous variables. In this article, we deal with linear regression analysis which predicts the value of one continuous variable from another. We also discuss the assumptions and pitfalls associated with this analysis.

41 citations


"Common pitfalls in statistical anal..." refers background in this paper

  • ...In a previous article in this series,[1] we discussed linear regression analysis which estimates the relationship of an outcome (dependent) variable on a continuous scale with continuous predictor (independent) variables....

    [...]

Journal Article
01 Jan 2017 - BMJ Open
TL;DR: The prediction model showed adequate performance after validation in an independent cohort and can be used to classify women into high, moderate or low risk of developing GH, contributing to efforts to provide clinical decision-making support to improve maternal health and birth outcomes.
Abstract: Objective To develop and validate a prediction model for identifying women at increased risk of developing gestational hypertension (GH) in Ghana. Design A prospective study. We used frequencies for descriptive analysis, χ2 test for associations and logistic regression to derive the prediction model. Discrimination was estimated by the c-statistic. Calibration was assessed by calibration plot of actual versus predicted probability. Setting Primary care antenatal clinics in Ghana. Participants 2529 pregnant women in the development cohort and 647 pregnant women in the validation cohort. Inclusion criterion was women without chronic hypertension. Primary outcome Gestational hypertension. Results Predictors of GH were diastolic blood pressure, family history of hypertension in parents, history of GH in a previous pregnancy, parity, height and weight. The c-statistic of the original model was 0.70 (95% CI 0.67–0.74) and 0.68 (0.60 to 0.77) in the validation cohort. Calibration was good in both cohorts. The negative predictive value of women in the development cohort at high risk of GH was 92.0% compared to 94.0% in the validation cohort. Conclusions The prediction model showed adequate performance after validation in an independent cohort and can be used to classify women into high, moderate or low risk of developing GH. It contributes to efforts to provide clinical decision-making support to improve maternal health and birth outcomes.
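
The c-statistic used above to estimate discrimination can be read as the proportion of (event, non-event) pairs in which the model assigns the higher predicted probability to the event. The sketch below is one straightforward way to compute it, assuming binary labels coded 0/1; the function name `c_statistic` and the toy data are hypothetical and unrelated to the study's cohorts.

```python
def c_statistic(y_true, y_score):
    """Concordance (c) statistic for a binary outcome.

    Counts, over all (event, non-event) pairs, how often the event has the
    higher predicted score; tied scores count as half a concordant pair.
    """
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy data for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the sample size, which is fine for a sketch; a value of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination.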

20 citations


"Common pitfalls in statistical anal..." refers methods in this paper

  • ...developed and validated a prediction model for gestational hypertension (GH).[5] They first compared groups of women with and without GH, using the independent t-test for continuous variables and the Chi-square test for categorical variables (univariate analyses)....

    [...]

Trending Questions (3)
What are the limitations of logistic regression in speech depression?

The limitations of logistic regression in speech are not mentioned in the provided paper. The paper discusses the technique of logistic regression analysis and its limitations in general, but does not specifically address its limitations in speech analysis.

What are the advantages and disadvantages of logistic regression for binary classification?

Advantages: Can handle both categorical and continuous predictor variables. Disadvantages: Assumes linearity, requires large sample size, prone to overfitting.

What are the advantages and disadvantages of logistic regression?

Advantages: Can handle both categorical and continuous predictor variables. Disadvantages: Assumes linearity between predictors and log odds of the outcome.
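
The linearity assumption mentioned in the answers above refers to the usual form of the logistic regression model, sketched here for reference. The symbols $p$ (probability of the outcome), $x_1, \dots, x_k$ (predictors) and $\beta_0, \dots, \beta_k$ (coefficients) are notation chosen for illustration and do not appear on the page.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

Under this form, $e^{\beta_j}$ is the odds ratio associated with a one-unit increase in $x_j$ with the other predictors held fixed, so the model is linear in the log odds rather than in the probability itself.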