
Showing papers in "BMC Medical Research Methodology in 2022"


Journal ArticleDOI
TL;DR: A systematic review of the adherence of machine learning-based prediction model studies to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement is presented in this paper.
Abstract: While many studies have consistently found incomplete reporting of regression-based prediction model studies, evidence is lacking for machine learning-based prediction model studies. We aim to systematically review the adherence of Machine Learning (ML)-based prediction model studies to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement. We included articles reporting on development or external validation of a multivariable prediction model (either diagnostic or prognostic) developed using supervised ML for individualized predictions across all medical fields. We searched PubMed from 1 January 2018 to 31 December 2019. Data extraction was performed using the 22-item checklist for reporting of prediction model studies (www.TRIPOD-statement.org). We measured the overall adherence per article and per TRIPOD item. Our search identified 24,814 articles, of which 152 articles were included: 94 (61.8%) prognostic and 58 (38.2%) diagnostic prediction model studies. Overall, articles adhered to a median of 38.7% (IQR 31.0-46.4%) of TRIPOD items. No article fully adhered to complete reporting of the abstract, and very few reported the flow of participants (3.9%, 95% CI 1.8 to 8.3), an appropriate title (4.6%, 95% CI 2.2 to 9.2), blinding of predictors (4.6%, 95% CI 2.2 to 9.2), model specification (5.2%, 95% CI 2.4 to 10.8), and the model's predictive performance (5.9%, 95% CI 3.1 to 10.9). There was often complete reporting of the source of data (98.0%, 95% CI 94.4 to 99.3) and interpretation of the results (94.7%, 95% CI 90.0 to 97.3). Similar to prediction model studies developed using conventional regression-based techniques, the completeness of reporting is poor. Essential information for deciding whether to use the model (i.e. model specification and its performance) is rarely reported. However, some items and sub-items of TRIPOD might be less suitable for ML-based prediction model studies and thus TRIPOD requires extensions. Overall, there is an urgent need to improve the reporting quality and usability of research to avoid research waste. PROSPERO registration: CRD42019161764.

26 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present and discuss four parameters (namely level of confidence, precision, variability of the data, and anticipated loss) required for sample size calculation for prevalence studies.
Abstract: Background Although books and articles guiding the methods of sample size calculation for prevalence studies are available, we aim to guide, assist with, and report sample size calculation using the present calculators. Results We present and discuss four parameters (namely level of confidence, precision, variability of the data, and anticipated loss) required for sample size calculation for prevalence studies. The choice of correct parameters with proper understanding, and issues in reporting, are the main focus of discussion. We demonstrate the use of purposely-designed calculators that assist users in making properly informed decisions and preparing an appropriate report. Conclusion The two calculators can be used with free software (Spreadsheet and RStudio), which benefits researchers with limited resources. They will, hopefully, minimize errors in parameter selection, calculation, and reporting. The calculators are available at: https://sites.google.com/view/sr-ln/ssc .
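The four parameters above map directly onto the standard closed-form sample size formula for estimating a single proportion. As a minimal base-R sketch of that calculation (illustrative default values, not the authors' Spreadsheet/RStudio calculators, which are linked above):

```r
# Minimal sketch: sample size for a prevalence study,
# n = Z^2 * p * (1 - p) / d^2, inflated for anticipated loss.
sample_size_prevalence <- function(p, d, conf = 0.95, loss = 0.10) {
  z <- qnorm(1 - (1 - conf) / 2)   # level of confidence
  n <- z^2 * p * (1 - p) / d^2     # variability p, absolute precision d
  ceiling(n / (1 - loss))          # inflate for anticipated loss
}

# Example: expected prevalence 30%, precision +/-5%, 95% confidence, 10% loss
sample_size_prevalence(p = 0.30, d = 0.05)   # returns 359
```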

22 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide a brief overview of the types and sources of real-world data and the common models and approaches to utilize and analyze real-world data for evidence-based decision making.
Abstract: The increased adoption of the internet, social media, wearable devices, e-health services, and other technology-driven services in medicine and healthcare has led to the rapid generation of various types of digital data, providing a valuable data source beyond the confines of traditional clinical trials, epidemiological studies, and lab-based experiments. We provide a brief overview of the types and sources of real-world data and the common models and approaches to utilize and analyze real-world data. We discuss the challenges and opportunities of using real-world data for evidence-based decision making. This review does not aim to be comprehensive or cover all aspects of the intriguing topic of RWD (from both the research and practical perspectives) but serves as a primer and provides useful sources for readers who are interested in this topic. Real-world data hold great potential for generating real-world evidence for designing and conducting confirmatory trials and answering questions that may not be addressed otherwise. The voluminosity and complexity of real-world data also call for the development of more appropriate, sophisticated, and innovative data processing and analysis techniques while maintaining scientific rigor in research findings, and attention to data ethics to harness the power of real-world data.

19 citations


Journal ArticleDOI
TL;DR: The authors conducted a systematic review in MEDLINE and Embase between 01/01/2019 and 05/09/2019 for studies developing a prognostic prediction model using machine learning methods in oncology.
Abstract: Describe and evaluate the methodological conduct of prognostic prediction models developed using machine learning methods in oncology. We conducted a systematic review in MEDLINE and Embase between 01/01/2019 and 05/09/2019 for studies developing a prognostic prediction model using machine learning methods in oncology. We used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, Prediction model Risk Of Bias ASsessment Tool (PROBAST) and CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) to assess the methodological conduct of included publications. Results were summarised by modelling type: regression-based, non-regression-based and ensemble machine learning models. Sixty-two publications met the inclusion criteria, developing 152 models across all publications. Forty-two models were regression-based, 71 were non-regression-based and 39 were ensemble models. A median of 647 individuals (IQR: 203 to 4059) and 195 events (IQR: 38 to 1269) were used for model development, and 553 individuals (IQR: 69 to 3069) and 50 events (IQR: 17.5 to 326.5) for model validation. A higher number of events per predictor was used for developing regression-based models (median: 8, IQR: 7.1 to 23.5), compared to alternative machine learning (median: 3.4, IQR: 1.1 to 19.1) and ensemble models (median: 1.7, IQR: 1.1 to 6). Sample size was rarely justified (n = 5/62; 8%). Some or all continuous predictors were categorised before modelling in 24 studies (39%). 46% (n = 24/62) of models reporting predictor selection before modelling used univariable analyses, a common method across all modelling types. Ten out of 24 models for time-to-event outcomes accounted for censoring (42%). A split-sample approach was the most popular method for internal validation (n = 25/62, 40%). Calibration was reported in 11 studies. Fewer than half of the models were reported or made available. The methodological conduct of machine learning-based clinical prediction models is poor. Guidance is urgently needed, with increased awareness and education of minimum prediction modelling standards. Particular focus is needed on sample size estimation, development and validation analysis methods, and ensuring the model is available for independent validation, to improve the quality of machine learning-based clinical prediction models.
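For readers unfamiliar with the events-per-predictor (EPP) figures quoted above, the quantity is a simple ratio of outcome events to candidate predictor parameters; a hypothetical check in base R (counts invented for illustration, not taken from the review) might look like:

```r
# Events per predictor (EPP): outcome events divided by the number of
# candidate predictor parameters considered during model development.
# The counts below are hypothetical, for illustration only.
events     <- c(regression = 195, other_ml = 120, ensemble = 50)
predictors <- c(regression = 24,  other_ml = 35,  ensemble = 30)

epp <- events / predictors
round(epp, 1)
epp < 10   # flag models below the traditional 10-EPP rule of thumb
```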

17 citations


Journal ArticleDOI
TL;DR: In this article, an evidence-based, practical toolkit was developed to help researchers maximise recruitment of BAME groups in research, based on a detailed literature review, feedback from focus groups, and further workshops and communication with participants to review the draft and final versions.
Abstract: It is recognised that Black, Asian and Minority Ethnic (BAME) populations are generally underrepresented in research studies. The key objective of this work was to develop an evidence-based, practical toolkit to help researchers maximise recruitment of BAME groups in research. Development of the toolkit was an iterative process overseen by an expert steering group. Key steps included a detailed literature review, feedback from focus groups (including researchers and BAME community members) and further workshops and communication with participants to review the draft and final versions. Poor recruitment of BAME populations in research has complex causes. These include inadequate attention to recruitment strategies and planning, and poor engagement with communities and individuals due to issues such as the cultural competency of researchers, historically poor experiences of participating in research, and a lack of links with community networks. Other factors include language issues, relevant expertise in the research team, and a lack of the adequate resources that might be required to recruit BAME populations. A toolkit was developed with key sections providing guidance on planning research and ensuring adequate engagement of communities and individuals, together with sections suggesting how the research team can address training needs and adopt best practice. Researchers highlighted the issue of funding and how best to address BAME recruitment in grant applications, so a section on preparing a grant application was also included. The final toolkit document is practical, and includes examples of best practice and 'top tips' for researchers.

16 citations


Journal ArticleDOI
TL;DR: In this article, the authors explored the model performance of various deep learning algorithms in text classification tasks on medical notes with respect to different disease class imbalance scenarios, and concluded that the Transformer encoder is the best choice if computational resources are not an issue.
Abstract: Background Discharge medical notes written by physicians contain important information about the health condition of patients. Many deep learning algorithms have been successfully applied to extract important information from unstructured medical notes data that can entail subsequent actionable results in the medical domain. This study aims to explore the model performance of various deep learning algorithms in text classification tasks on medical notes with respect to different disease class imbalance scenarios. Methods In this study, we employed seven artificial intelligence models, a CNN (Convolutional Neural Network), a Transformer encoder, a pretrained BERT (Bidirectional Encoder Representations from Transformers), and four typical sequence neural network models, namely, RNN (Recurrent Neural Network), GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory), and Bi-LSTM (Bi-directional Long Short-Term Memory), to classify the presence or absence of 16 disease conditions from patients' discharge summary notes. We analyzed this question as a composition of 16 separate binary classification problems. The model performance of the seven models on each of the 16 datasets with various levels of imbalance between classes was compared in terms of AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic), AUC-PR (Area Under the Curve of Precision and Recall), F1 Score, and Balanced Accuracy, as well as the training time. The model performances were also compared in combination with different word embedding approaches (GloVe, BioWordVec, and no pre-trained word embeddings). Results The analyses of these 16 binary classification problems showed that the Transformer encoder model performs the best in nearly all scenarios. In addition, when the disease prevalence is close to or greater than 50%, the Convolutional Neural Network model achieved a comparable performance to the Transformer encoder, and its training time was 17.6% shorter than the second fastest model, 91.3% shorter than the Transformer encoder, and 94.7% shorter than the pre-trained BERT-Base model. The BioWordVec embeddings slightly improved the performance of the Bi-LSTM model in most disease prevalence scenarios, while the CNN model performed better without pre-trained word embeddings. In addition, the training time was significantly reduced with the GloVe embeddings for all models. Conclusions For classification tasks on medical notes, Transformer encoders are the best choice if computational resources are not an issue. Otherwise, when the classes are relatively balanced, CNNs are a leading candidate because of their competitive performance and computational efficiency.
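The evaluation metrics used in this comparison are language-agnostic; a hedged sketch in R computes AUC-ROC, AUC-PR, F1, and balanced accuracy for a simulated imbalanced binary classifier, assuming the pROC and PRROC packages (simulated scores, not the study's discharge-note models):

```r
library(pROC)    # AUC-ROC
library(PRROC)   # AUC-PR

set.seed(1)
# Hypothetical predicted probabilities for an imbalanced binary outcome
y    <- rbinom(1000, 1, 0.1)                 # ~10% disease prevalence
prob <- plogis(-2 + 3 * y + rnorm(1000))     # an imperfect classifier
pred <- as.integer(prob > 0.5)

auc_roc <- as.numeric(auc(roc(y, prob, quiet = TRUE)))
auc_pr  <- pr.curve(scores.class0 = prob[y == 1],   # positive-class scores
                    scores.class1 = prob[y == 0])$auc.integral

tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
fn <- sum(pred == 0 & y == 1); tn <- sum(pred == 0 & y == 0)
precision <- tp / (tp + fp); recall <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
bal_acc   <- mean(c(recall, tn / (tn + fp)))  # mean of sensitivity, specificity

c(AUC_ROC = auc_roc, AUC_PR = auc_pr, F1 = f1, BalancedAccuracy = bal_acc)
```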

16 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explored the quantitative characterization of the pandemic impact on public mental health by studying an online survey dataset of the United States and found that risk predictors for an individual to experience mental health issues include age, gender, race, marital status, health conditions, number of household members, employment status, level of confidence in future food affordability, availability of health insurance, mortgage status, and information on kids enrolling in school.
Abstract: The coronavirus disease 2019 (COVID-19) pandemic has posed a significant influence on public mental health. Current efforts focus on alleviating the impacts of the disease on public health and the economy, with the psychological effects due to COVID-19 relatively ignored. In this research, we are interested in exploring the quantitative characterization of the pandemic impact on public mental health by studying an online survey dataset of the United States. The analyses are based on a large-scale online mental health-related survey in the United States, conducted over 12 consecutive weeks from April 23, 2020 to July 21, 2020. We are interested in examining the risk factors that have a significant impact on mental health as well as their estimated effects over time. We employ the multiple imputation by chained equations (MICE) method to deal with missing values and use logistic regression with the least absolute shrinkage and selection operator (Lasso) method to identify risk factors for mental health. Our analysis shows that risk predictors for an individual to experience mental health issues include the pandemic situation of the State where the individual resides, age, gender, race, marital status, health conditions, the number of household members, employment status, the level of confidence in future food affordability, availability of health insurance, mortgage status, and information on kids enrolling in school. The effects of most of the predictors seem to change over time, though the degree varies for different risk factors. The effects of risk factors such as State and gender show noticeable change over time, whereas the factor age exhibits seemingly unchanged effects over time. The analysis results unveil evidence-based findings to identify the groups who are psychologically vulnerable to the COVID-19 pandemic. This study provides helpful evidence for assisting healthcare providers and policymakers in taking steps to mitigate the pandemic's effects on public mental health, especially in boosting public health care, improving public confidence in future food conditions, and creating more job opportunities. This article does not report the results of a health care intervention on human participants.
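As a rough sketch of the modelling pipeline described above, assuming the mice and glmnet packages and simulated stand-in data (the study's survey data are not reproduced here):

```r
library(mice)    # multiple imputation by chained equations
library(glmnet)  # Lasso-penalised logistic regression

set.seed(42)
# Simulated stand-in for the survey data (not the study's dataset)
n <- 500
d <- data.frame(
  age    = rnorm(n, 45, 15),
  income = rnorm(n, 50, 20),
  hhsize = rpois(n, 2) + 1
)
d$distress <- rbinom(n, 1, plogis(-0.03 * d$age + 0.2 * d$hhsize - 0.3))
d$income[sample(n, 50)] <- NA                 # inject missing values

imp <- mice(d, m = 5, printFlag = FALSE)      # step 1: MICE imputation

# Step 2: Lasso logistic regression on each completed dataset
fits <- lapply(1:5, function(i) {
  di <- complete(imp, i)
  x  <- as.matrix(di[, c("age", "income", "hhsize")])
  cv.glmnet(x, di$distress, family = "binomial", alpha = 1)
})
coef(fits[[1]], s = "lambda.min")   # selected risk factors, imputation 1
```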

16 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the web interest of users in scientific and infodemic monikers linked to the identification of COVID-19 and the related novel coronavirus 2019 variants.
Abstract: The scientific community has classified COVID-19 as the worst pandemic in human history. The damage caused by the new disease was direct (e.g., deaths) and indirect (e.g., closure of economic activities). Within the latter category, we find infodemic phenomena such as the adoption of generic and stigmatizing names used to identify COVID-19 and the related novel coronavirus 2019 variants. These monikers have fostered the spread of health disinformation and misinformation and fomented racism and segregation towards the Chinese population. In this regard, we present a comprehensive infodemiological picture of Italy from the epidemic outbreak in December 2019 until September 2021. In particular, we propose a new procedure to examine in detail the web interest of users in scientific and infodemic monikers linked to the identification of COVID-19. To do this, we exploited the online tool Google Trends. Our findings reveal the widespread use of multiple COVID-19-related names not considered in the previous literature, as well as a persistent trend in the adoption of stigmatizing and generic terms. Inappropriate names for cataloging novel coronavirus 2019 variants of concern have even been adopted by national health agencies. Furthermore, we also showed that early denominations influenced user behavior for a long time and were difficult to replace. For these reasons, we suggest that scientific names be assigned to new diseases in a more timely manner, and advise mass media and international health authorities against using terms linked to the geographical origin of the novel coronavirus 2019 variants.
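In R, Google Trends web-interest series of this kind are commonly retrieved with the gtrendsR package; a hedged sketch follows (the keywords are placeholders, not the study's full moniker list, and the call requires internet access):

```r
library(gtrendsR)  # R interface to Google Trends

# Web interest in Italy for a scientific vs. an informal moniker
# (terms are illustrative placeholders, not the study's keyword list)
res <- gtrends(keyword = c("COVID-19", "virus cinese"),
               geo  = "IT",
               time = "2019-12-01 2021-09-30")

head(res$interest_over_time)  # weekly relative search volume, 0-100 scale
plot(res)                     # compare trajectories of the two monikers
```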

15 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach for rating the certainty of systematic reviews (SRs) evidence published in urology and nephrology journals.
Abstract: To identify and describe the use of the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach for rating the certainty of evidence in systematic reviews (SRs) published in urology and nephrology journals. SRs that were published in the top ten "urology and nephrology" journals with the highest impact factor according to the 2020 Journal Citation Reports (covering 2016-2020) were systematically searched and evaluated using the GRADE approach. A total of 445 SRs were retrieved. Sixty SRs of randomized control trials (RCTs) and/or non-randomized studies (NRSs) were evaluated using the GRADE approach. Forty-nine SRs (11%) rated the outcome-specific certainty of evidence (n = 29 in 2019-2020). We identified 811 certainty of evidence outcome ratings (n = 544 RCT ratings) as follows: very low (33.0%); low (32.1%); moderate (24.5%); and high (10.4%). Very low and high certainty of evidence ratings accounted for 55.0% and 0.4% of ratings in SRs of NRSs, compared to 23.0% and 15.3% in SRs of RCTs. The certainty of evidence for RCTs and NRSs was downgraded most often for risk of bias and imprecision. We recommend increased emphasis on acceptance of the GRADE approach, as well as optimal use of the GRADE approach, in the synthesis of urinary tract evidence.

14 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider a platform trial with two treatment arms and a common control arm, and assess the robustness of recently proposed model-based approaches to adjust for time trends when utilizing non-concurrent controls.
Abstract: Background Platform trials can evaluate the efficacy of several experimental treatments compared to a control. The number of experimental treatments is not fixed, as arms may be added or removed as the trial progresses. Platform trials are more efficient than independent parallel group trials because they use shared control groups. However, for a treatment entering the trial at a later time point, the control group is divided into concurrent controls, consisting of patients randomised to control when that treatment arm is in the platform, and non-concurrent controls, patients randomised before. Using non-concurrent controls in addition to concurrent controls can improve the trial's efficiency by increasing power and reducing the required sample size, but can introduce bias due to time trends. Methods We focus on a platform trial with two treatment arms and a common control arm. Assuming that the second treatment arm is added at a later time, we assess the robustness of recently proposed model-based approaches to adjust for time trends when utilizing non-concurrent controls. In particular, we consider approaches where time trends are modeled either as linear in time or as a step function, with steps at time points where treatments enter or leave the platform trial. For trials with continuous or binary outcomes, we investigate the type 1 error rate and power of testing the efficacy of the newly added arm, as well as the bias and root mean squared error of treatment effect estimates under a range of scenarios. In addition to scenarios where time trends are equal across arms, we investigate settings with different time trends or time trends that are not additive in the scale of the model. Results A step function model, fitted on data from all treatment arms, gives increased power while controlling the type 1 error, as long as the time trends are equal for the different arms and additive on the model scale. This holds even if the shape of the time trend deviates from a step function when patients are allocated to arms by block randomisation. However, if time trends differ between arms or are not additive to treatment effects in the scale of the model, the type 1 error rate may be inflated. Conclusions The efficiency gained by using step function models to incorporate non-concurrent controls can outweigh potential risks of biases, especially in settings with small sample sizes. Such biases may arise if the model assumptions of equality and additivity of time trends are not satisfied. However, the specifics of the trial, scientific plausibility of different time trends, and robustness of results should be carefully considered.
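A toy simulation illustrates the step-function adjustment described above: patients from all arms are pooled, and calendar period enters the model as a factor, so the later-entering arm borrows strength from non-concurrent controls. This sketch assumes the benign scenario of equal, additive time trends; all numbers are invented:

```r
set.seed(7)
# Toy platform trial: control (C) and arm A recruit in both periods;
# arm B enters only in period 2, so period-1 controls are non-concurrent for B.
n      <- 400
period <- rep(1:2, each = n / 2)
arm    <- factor(c(sample(c("C", "A"), n / 2, replace = TRUE),
                   sample(c("C", "A", "B"), n / 2, replace = TRUE)),
                 levels = c("C", "A", "B"))    # control as reference level
trend  <- 0.5 * (period == 2)                  # equal, additive step time trend
y      <- rnorm(n, mean = trend + 0.3 * (arm == "A") + 0.4 * (arm == "B"))

# Step-function adjustment: period enters as a factor, fitted on all arms,
# so period-1 (non-concurrent) controls contribute to the estimate for B
fit <- lm(y ~ arm + factor(period))
summary(fit)$coefficients["armB", ]   # effect estimate for the new arm
```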

14 citations


Journal ArticleDOI
TL;DR: In this article, a review of the most recent two-armed clinical oncology trials with crossing survival curves is presented, and the p-values of the log-rank and Peto-Peto tests are compared with nine different tests developed for the detection of survival differences in the presence of non-proportional or crossing hazards.
Abstract: The exchange of knowledge between statisticians developing new methodology and the clinicians, reviewers or authors applying it is fundamental. This is specifically true for clinical trials with time-to-event endpoints. One of the most commonly arising questions is that of equal survival distributions in a two-armed trial. The log-rank test is still the gold standard for this question. However, in the case of non-proportional hazards, its power can become poor, and multiple extensions have been developed to overcome this issue. We aim to facilitate the choice of a test for the detection of survival differences in the case of crossing hazards. We restricted the review to the most recent two-armed clinical oncology trials with crossing survival curves. Each data set was reconstructed using a state-of-the-art reconstruction algorithm. To ensure reproduction quality, only publications with published numbers at risk at multiple time points, sufficient printing quality, and a non-informative censoring pattern were included. This article depicts the p-values of the log-rank and Peto-Peto tests as references and compares them with nine different tests developed for the detection of survival differences in the presence of non-proportional or crossing hazards. We reviewed 1400 recent phase III clinical oncology trials and selected fifteen studies that met our eligibility criteria for data reconstruction. After including a further three individual patient data sets, significant differences in survival were found for nine out of eighteen studies using the investigated tests. An important point that reviewers should pay attention to is that 28% of the studies with published survival curves did not report the number at risk. This makes reconstruction and plausibility checks almost impossible. The evaluation shows that inference methods constructed to detect differences in survival in the presence of non-proportional hazards are beneficial and help to provide guidance in choosing a sensible alternative to the standard log-rank test.
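The two reference tests are directly available in R's survival package through the rho argument of survdiff; a minimal sketch on the built-in lung data, standing in for a reconstructed two-armed trial:

```r
library(survival)

# Log-rank (rho = 0) and Peto & Peto (rho = 1) tests on a built-in dataset;
# the lung data stand in for a reconstructed two-armed oncology trial
survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)  # log-rank
survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)  # Peto & Peto
```

More specialised alternatives for non-proportional or crossing hazards, such as weighted log-rank combinations or restricted mean survival time comparisons, are implemented in dedicated packages (e.g., nph, survRM2).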

Journal ArticleDOI
TL;DR: In this article, a working group from the US Pain Management Collaboratory developed guidance for complete reporting of telehealth interventions, and an extension focused on unique considerations relevant to telehealth interventions was developed for the Template for the Intervention Description and Replication (TIDieR) checklist.
Abstract: Background Recent international health events have led to an increased proliferation of remotely delivered health interventions. Even with the pandemic seemingly coming under control, the experiences of the past year have fueled a growth in ideas and technology for increasing the scope of remote care delivery. Unfortunately, clinicians and health systems will have difficulty with the adoption and implementation of these interventions if ongoing and future clinical trials fail to report necessary details about execution, platforms, and infrastructure related to these interventions. The purpose was to develop guidance for the reporting of telehealth interventions. Methods A working group from the US Pain Management Collaboratory developed guidance for complete reporting of telehealth interventions. The process followed a 5-step course from conception to final checklist development, with input from many stakeholders, including all 11 primary investigators with trials in the Collaboratory. Results An extension focused on unique considerations relevant to telehealth interventions was developed for the Template for the Intervention Description and Replication (TIDieR) checklist. Conclusion The telehealth extension (TIDieR-Telehealth) encourages use of the TIDieR checklist as a valuable tool to improve the quality of research, providing a reporting guide for relevant interventions that will help maximize reproducibility and implementation.

Journal ArticleDOI
TL;DR: In this paper, the authors used the publicly available eICU database to construct a number of ML models before examining their internal behaviour with SHapley Additive exPlanations (SHAP) values.
Abstract: Machine learning (ML) holds the promise of becoming an essential tool for utilising the increasing amount of clinical data available for analysis and clinical decision support. However, the lack of trust in the models has limited the acceptance of this technology in healthcare. This mistrust is often credited to the shortage of model explainability and interpretability, where the relationship between the input and output of the models is unclear. Improving trust requires the development of more transparent ML methods. In this paper, we use the publicly available eICU database to construct a number of ML models before examining their internal behaviour with SHapley Additive exPlanations (SHAP) values. Our four models predicted hospital mortality in ICU patients using a selection of the same features used to calculate the APACHE IV score and were based on random forest, logistic regression, naive Bayes, and adaptive boosting algorithms. The results showed the models had similar discriminative abilities and mostly agreed on feature importance, while calibration and the impact of individual features differed considerably and in multiple cases did not correspond to common medical theory. We already know that ML models treat data differently depending on the underlying algorithm. Our comparative analysis visualises the implications of these differences and their importance in a healthcare setting. SHAP value analysis is a promising method for incorporating explainability in model development and usage and might yield better and more trustworthy ML models in the future.
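A hedged sketch of this kind of SHAP workflow in R, using randomForest with the fastshap package on simulated stand-in data (the eICU database requires credentialed access and is not reproduced here; feature names are invented):

```r
library(randomForest)
library(fastshap)

set.seed(3)
# Simulated stand-in for ICU features (not the eICU data)
n <- 400
d <- data.frame(age   = rnorm(n, 65, 12),
                hr    = rnorm(n, 90, 20),
                creat = rlnorm(n, 0, 0.4))
d$died <- factor(rbinom(n, 1, plogis(0.04 * d$age + 0.5 * log(d$creat) - 3.5)))

rf <- randomForest(died ~ ., data = d)

# Monte Carlo SHAP values: per-patient contribution of each feature
pfun <- function(object, newdata) predict(object, newdata, type = "prob")[, 2]
shap <- explain(rf, X = d[, c("age", "hr", "creat")],
                pred_wrapper = pfun, nsim = 20)

head(shap)
colMeans(abs(as.data.frame(shap)))  # global importance: mean |SHAP| per feature
```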

Journal ArticleDOI
TL;DR: Ripple Effects Mapping (REM), as discussed by the authors, is a qualitative method that can capture the wider impacts, and adaptive nature, of a systems approach, in which multi-sectoral stakeholders come together to develop a collective understanding of the system and then identify places where they can leverage change across it.
Abstract: Systems approaches are currently being advocated and implemented to address complex challenges in Public Health. These approaches work by bringing multi-sectoral stakeholders together to develop a collective understanding of the system, and then to identify places where they can leverage change across the system. Systems approaches are unpredictable, where cause-and-effect cannot always be disentangled, and unintended consequences - positive and negative - frequently arise. Evaluating such approaches is difficult and new methods are warranted. Ripple Effects Mapping (REM) is a qualitative method which can capture the wider impacts, and adaptive nature, of a systems approach. Using a case study example from the evaluation of a physical activity-orientated systems approach in Gloucestershire, we: a) introduce the adapted REM method; b) describe how REM was applied in the example; c) explain how REM outputs were analysed; d) provide examples of how REM outputs were used; and e) describe the strengths, limitations, and future uses of REM based on our reflections. Ripple Effects Mapping is a participatory method that requires the active input of programme stakeholders in data-gathering workshops. It produces visual outputs (i.e., maps) of the programme activities and impacts, which are mapped along a timeline to understand the temporal dimension of systems change efforts. The REM outputs from our example were created over several iterations, with data collected every 3-4 months, to build a picture of activities and impacts that have continued or ceased. Workshops took place both in person and online. An inductive content analysis was undertaken to describe and quantify the patterns within the REM outputs. Detailed guidance on the preparation, delivery, and analysis of REM is included in this paper. REM may help to advance our understanding and evaluation of complex systems approaches, especially within the field of Public Health. We therefore invite other researchers, practitioners and policymakers to use REM and continuously evolve the method to enhance its application and practical utility.

Journal ArticleDOI
TL;DR: In this paper, the authors argue that realist research paradigms provide a useful framework to express the effect of contextual factors within implementation strategy causal processes, and they provide a sound approach to help optimise delivery of the right care in the right setting and at the right time.
Abstract: Implementation science in healthcare aims to understand how to get evidence into practice. Once this is achieved in one setting, it becomes increasingly difficult to replicate elsewhere. The problem is often attributed to differences in context that influence how and whether implementation strategies work. We argue that realist research paradigms provide a useful framework to express the effect of contextual factors within implementation strategy causal processes. Realist studies are theory-driven evaluations that focus on understanding how and why interventions work under different circumstances. They consider the interaction between contextual circumstances, theoretical mechanisms of change and the outcomes they produce, to arrive at explanations of conditional causality (i.e., what tends to work, for whom, under what circumstances). This Commentary provides example applications using preliminary findings from a large realist implementation study of system-wide value-based healthcare initiatives in New South Wales, Australia. If applied judiciously, realist implementation studies may represent a sound approach to help optimise delivery of the right care in the right setting and at the right time.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated study retention and attrition and their associated sociodemographic and clinical factors among head and neck cancer (HNC) patients and informal caregivers included in the Netherlands Quality of Life and Biomedical Cohort Study (NET-QUBIC), and found that a better performance and comorbidity score (among patients) and higher age (among caregivers) were associated with study retention at 2-year follow-up.
Abstract: Longitudinal observational cohort studies in cancer patients are important to move research and clinical practice forward. Continued study participation (study retention) is important for maintaining the statistical power of research and facilitating the representativeness of study findings. This study aimed to investigate study retention and attrition (drop-out) and their associated sociodemographic and clinical factors among head and neck cancer (HNC) patients and informal caregivers included in the Netherlands Quality of Life and Biomedical Cohort Study (NET-QUBIC). NET-QUBIC is a longitudinal cohort study among 739 HNC patients and 262 informal caregivers with collection of patient-reported outcome measures (PROMs), fieldwork data (interview, objective tests and medical examination) and biobank materials. Study retention and attrition were described from baseline (before treatment) up to 2-year follow-up (after treatment). Sociodemographic and clinical characteristics associated with retention in NET-QUBIC components at baseline (PROMs, fieldwork and biobank samples) and retention in general (participation in at least one component) were investigated using Chi-square, Fisher exact or independent t-tests (p < 0.05). Study retention at 2-year follow-up was 80% among patients alive (66% among all patients) and 70% among caregivers of patients who were alive and participating (52% among all caregivers). Attrition was most often caused by mortality, and logistic, physical, or psychological reasons. Tumor stage I/II, better physical performance and a better (lower) comorbidity score were associated with participation in the PROMs component among patients. No factors associated with participation in the fieldwork component (patients), overall sample collection (patients and caregivers) or PROMs component (caregivers) were identified. A better performance and comorbidity score (among patients) and higher age (among caregivers) were associated with study retention at 2-year follow-up. Retention rates were high at two-year follow-up (i.e. 80% among HNC patients alive and 70% among informal caregivers with an active patient). Nevertheless, some selection was shown in terms of tumor stage, physical performance, comorbidity and age, which might limit the representativeness of NET-QUBIC data and samples. To facilitate representativeness of study findings, future cohort studies might benefit from oversampling specific subgroups, such as patients with poor clinical outcomes or higher comorbidity and younger caregivers.

Journal ArticleDOI
TL;DR: In this article, the authors compared weighted and unweighted association measures after adjustment for potential confounding, taking into account dataset properties such as the initial gap between the population and the selected sample, the sample size, and the variable types.
Abstract: Online surveys have triggered a heated debate regarding their scientific validity. Many authors have adopted weighting methods to enhance the quality of online survey findings, while others did not find an advantage for this method. This work aims to compare weighted and unweighted association measures after adjustment for potential confounding, taking into account dataset properties such as the initial gap between the population and the selected sample, the sample size, and the variable types. This study assessed seven datasets collected between 2019 and 2021 during the COVID-19 pandemic through online cross-sectional surveys using the snowball sampling technique. Weighting methods were applied to adjust the online sample to the sociodemographic features of the target population. Despite varying age and gender gaps between weighted and unweighted samples, strong similarities were found for dependent and independent variables. When applied to the same datasets, the regression analysis results showed a high relative difference between methods for some variables, while a low difference was found for others. In terms of absolute impact, the highest impact on the association measure was related to the sample size, followed by the age gap, the gender gap, and finally, the significance of the association between weighted age and the dependent variable. The results of this analysis of online surveys indicate that weighting methods should be used cautiously, as weighting did not affect the results in some databases, while it did in others. Further research is necessary to define situations in which weighting would be beneficial.
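The kind of post-hoc weighting examined here is commonly implemented by raking; a sketch with R's survey package, using invented sample data and population margins:

```r
library(survey)

set.seed(11)
# Hypothetical online sample, skewed young and female relative to the population
d <- data.frame(
  age_group = sample(c("18-39", "40+"), 400, TRUE, prob = c(0.7, 0.3)),
  gender    = sample(c("F", "M"),      400, TRUE, prob = c(0.65, 0.35)),
  outcome   = rbinom(400, 1, 0.3)
)

unweighted <- svydesign(ids = ~1, data = d)

# Population margins (illustrative census-style figures, not real data)
pop_age    <- data.frame(age_group = c("18-39", "40+"), Freq = c(160, 240))
pop_gender <- data.frame(gender    = c("F", "M"),       Freq = c(204, 196))

raked <- rake(unweighted,
              sample.margins     = list(~age_group, ~gender),
              population.margins = list(pop_age, pop_gender))

# Compare weighted vs unweighted estimates
svymean(~outcome, unweighted)
svymean(~outcome, raked)
```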

Journal ArticleDOI
TL;DR: In this article, the authors developed COVID-19 Estimated Risk (COVER) scores that quantify a patient's risk of hospital admission with pneumonia, hospitalization with pneumonia requiring intensive services or death, or fatality in the 30 days following a diagnosis, using historical data from patients with influenza or flu-like symptoms.
Abstract: We investigated whether we could use influenza data to develop prediction models for COVID-19, to increase the speed at which prediction models can reliably be developed and validated early in a pandemic. We developed COVID-19 Estimated Risk (COVER) scores that quantify a patient's risk of hospital admission with pneumonia (COVER-H), hospitalization with pneumonia requiring intensive services or death (COVER-I), or fatality (COVER-F) in the 30 days following COVID-19 diagnosis, using historical data from patients with influenza or flu-like symptoms, and tested this in COVID-19 patients. We analyzed a federated network of electronic medical records and administrative claims data from 14 data sources and 6 countries containing data collected on or before 4/27/2020. We used a 2-step process to develop 3 scores using historical data from patients with influenza or flu-like symptoms any time prior to 2020. The first step was to create a data-driven model using LASSO regularized logistic regression, the covariates of which were used to develop aggregate covariates for the second step, where the COVER scores were developed using a smaller set of features. These 3 COVER scores were then externally validated on patients with 1) influenza or flu-like symptoms and 2) confirmed or suspected COVID-19 diagnosis across 5 databases from South Korea, Spain, and the United States. Outcomes included i) hospitalization with pneumonia, ii) hospitalization with pneumonia requiring intensive services or death, and iii) death in the 30 days after index date. Overall, 44,507 COVID-19 patients were included for model validation. We identified 7 predictors (history of cancer, chronic obstructive pulmonary disease, diabetes, heart disease, hypertension, hyperlipidemia, kidney disease) which, combined with age and sex, discriminated which patients would experience any of our three outcomes. The models achieved good performance in influenza and COVID-19 cohorts. For COVID-19 the AUC ranges were: COVER-H: 0.69-0.81, COVER-I: 0.73-0.91, and COVER-F: 0.72-0.90. Calibration varied across the validations, with some of the COVID-19 validations being less well calibrated than the influenza validations. This research demonstrated the utility of using a proxy disease to develop a prediction model. The 3 COVER models with 9 predictors that were developed using influenza data perform well for COVID-19 patients for predicting hospitalization, intensive services, and fatality. The scores showed good discriminatory performance which transferred well to the COVID-19 population. There was some miscalibration in the COVID-19 validations, which is potentially due to the difference in symptom severity between the two diseases. A possible solution for this is to recalibrate the models in each location before use.

Journal ArticleDOI
TL;DR: In this paper , the authors provide an illustrative guide to summarising nonlinear growth trajectories for repeatedly measured continuous outcomes using (i) linear spline and (ii) natural cubic spline linear mixed-effects (LME) models, (iii) superposition by translation and rotation (SITAR) nonlinear mixed effects models, and (iv) latent trajectory models.
Abstract: Longitudinal data analysis can improve our understanding of the influences on health trajectories across the life-course. There are a variety of statistical models which can be used, and their fitting and interpretation can be complex, particularly where there is a nonlinear trajectory. Our aim was to provide an accessible guide, along with applied examples, to using four sophisticated modelling procedures for describing nonlinear growth trajectories. This expository paper provides an illustrative guide to summarising nonlinear growth trajectories for repeatedly measured continuous outcomes using (i) linear spline and (ii) natural cubic spline linear mixed-effects (LME) models, (iii) Super Imposition by Translation and Rotation (SITAR) nonlinear mixed-effects models, and (iv) latent trajectory models. The underlying model for each approach, their similarities and differences, and their advantages and disadvantages are described. Their application and the correct interpretation of their results are illustrated by analysing repeated bone mass measures to characterise bone growth patterns and their sex differences in three cohort studies from the UK, USA, and Canada comprising 8500 individuals and 37,000 measurements from ages 5-40 years. Recommendations for choosing a modelling approach are provided, along with a discussion and signposting on further modelling extensions for analysing trajectory exposures and outcomes, and multiple cohorts. Linear and natural cubic spline LME models and SITAR provided a similar summary of the mean bone growth trajectory and growth velocity, and the sex differences in growth patterns. Growth velocity (in grams/year) peaked during adolescence, and peaked earlier in females than males, e.g., mean age at peak bone mineral content accrual from multicohort SITAR models was 12.2 years in females and 13.9 years in males. Latent trajectory models (with trajectory shapes estimated using a natural cubic spline) identified up to four subgroups of individuals with distinct trajectories throughout adolescence. LME models with linear and natural cubic splines, SITAR, and latent trajectory models are useful for describing nonlinear growth trajectories, and these methods can be adapted for other complex traits. Choice of method depends on the research aims, complexity of the trajectory, and available data. Scripts and synthetic datasets are provided for readers to replicate trajectory modelling and visualisation using the R statistical computing software.
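A minimal sketch of approach (ii), a natural cubic spline LME model, using lme4 and splines on simulated bone-mass-like data (the authors provide their own scripts and synthetic datasets with the paper; everything below is invented for illustration):

```r
library(lme4)     # linear mixed-effects models
library(splines)  # ns(): natural cubic spline basis

set.seed(5)
# Simulated repeated bone-mass measures (stand-in for the cohort data)
n_id <- 200
d <- do.call(rbind, lapply(1:n_id, function(i) {
  age <- sort(runif(6, 5, 40))
  sex <- sample(c("F", "M"), 1)
  bmc <- 1000 + 900 * plogis((age - ifelse(sex == "F", 12, 14)) / 2) +
         rnorm(6, 0, 50) + rnorm(1, 0, 80)   # within- and between-person noise
  data.frame(id = i, age = age, sex = sex, bmc = bmc)
}))

# Natural cubic spline fixed effects (age by sex), random intercept per person
fit <- lmer(bmc ~ ns(age, df = 4) * sex + (1 | id), data = d)

# Population-level predicted mean trajectories by sex
newd <- expand.grid(age = 5:40, sex = c("F", "M"))
newd$pred <- predict(fit, newdata = newd, re.form = NA)
head(newd)
```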

Journal ArticleDOI
TL;DR: In this article, the authors evaluated the feasibility of adopting RobotReviewer within a national public health institute using a randomized, real-time, user-centered study and found that participants were equally likely to accept judgement by RobotReviewer as each other's judgement during the consensus process when measured dichotomously.
Abstract: Background Machine learning and automation are increasingly used to make the evidence synthesis process faster and more responsive to policymakers' needs. In systematic reviews of randomized controlled trials (RCTs), risk of bias assessment is a resource-intensive task that typically requires two trained reviewers. One function of RobotReviewer, an off-the-shelf machine learning system, is an automated risk of bias assessment. Methods We assessed the feasibility of adopting RobotReviewer within a national public health institute using a randomized, real-time, user-centered study. The study included 26 RCTs and six reviewers from two projects examining health and social interventions. We randomized these studies to one of two RobotReviewer platforms. We operationalized feasibility as accuracy, time use, and reviewer acceptability. We measured accuracy by the number of corrections made by human reviewers (either to automated assessments or another human reviewer's assessments). We explored acceptability through group discussions and individual email responses after presenting the quantitative results. Results Reviewers were equally likely to accept judgement by RobotReviewer as each other's judgement during the consensus process when measured dichotomously; risk ratio 1.02 (95% CI 0.92 to 1.13; p = 0.33). We were not able to compare time use. The acceptability of the program by researchers was mixed. Less experienced reviewers were generally more positive, and they saw more benefits and were able to use the tool more flexibly. Reviewers positioned human input and human-to-human interaction as superior to even a semi-automation of this process. Conclusion Despite being presented with evidence of RobotReviewer's equal performance to humans, participating reviewers were not interested in modifying standard procedures to include automation. If further studies confirm equal accuracy and reduced time compared to manual practices, we suggest that the benefits of RobotReviewer may support its future implementation as one of two assessors, despite reviewer ambivalence. Future research should study barriers to adopting automated tools and how highly educated and experienced researchers can adapt to a job market that is increasingly challenged by new technologies.

Journal ArticleDOI
TL;DR: In this article, the authors used the Clinical Practice Research Datalink (CPRD) and linked national mortality data in England from 2000 to 2019 to investigate immortal time bias for a specific life-long condition, intellectual disability.
Abstract: Immortal time bias is common in observational studies but is typically described for pharmacoepidemiology studies where there is a delay between cohort entry and treatment initiation. This study used the Clinical Practice Research Datalink (CPRD) and linked national mortality data in England from 2000 to 2019 to investigate immortal time bias for a specific life-long condition, intellectual disability. Life expectancy (Chiang's abridged life table approach) was compared for 33,867 exposed and 980,586 unexposed individuals aged 10+ years using five methods: (1) treating immortal time as observation time; (2) excluding time before date of first exposure diagnosis; (3) matching cohort entry to first exposure diagnosis; (4) excluding time before proxy date of inputting first exposure diagnosis (by the physician); and (5) treating exposure as a time-dependent measure. When not considered in the design or analysis (Method 1), immortal time bias led to disproportionately high life expectancy for the exposed population during the first calendar period (additional years expected to live: 2000-2004: 65.6 [95% CI: 63.6,67.6]) compared to the later calendar periods (2005-2009: 59.9 [58.8,60.9]; 2010-2014: 58.0 [57.1,58.9]; 2015-2019: 58.2 [56.8,59.7]). Date of entry of diagnosis (Method 4) was unreliable in this CPRD cohort. The final methods (Methods 2, 3 and 5) appeared to solve the main theoretical problem but residual bias may have remained. We conclude that immortal time bias is a significant issue for studies of life-long conditions that use electronic health record data and requires careful consideration of how clinical diagnoses are entered onto electronic health record systems.
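Method 5 (exposure as a time-dependent measure) is typically implemented with counting-process (start-stop) data; a toy sketch with R's survival package, using invented follow-up times:

```r
library(survival)

# Toy counting-process data: persons 1 and 4 are diagnosed (exposed) during
# follow-up, so their pre-diagnosis years count as unexposed person-time
# instead of being misclassified as exposed ("immortal") time.
d <- data.frame(
  id      = c(1, 1, 2, 3, 4, 4, 5),
  tstart  = c(0, 3, 0, 0, 0, 2, 0),
  tstop   = c(3, 9, 7, 5, 2, 8, 10),
  exposed = c(0, 1, 0, 0, 0, 1, 0),
  event   = c(0, 1, 1, 0, 0, 1, 0)
)

# Method 5: exposure as a time-dependent covariate in start-stop notation
fit <- coxph(Surv(tstart, tstop, event) ~ exposed, data = d)
summary(fit)

# A naive analysis (Method 1) would instead code persons 1 and 4 as exposed
# from time 0, crediting their event-free pre-diagnosis years to the exposed
# group and biasing the comparison in its favour.
```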

Journal ArticleDOI
TL;DR: In this paper, a new configurational comparative method, called Combinational Regularity Analysis (CORA), is proposed for multi-morbidity data analysis, which can simultaneously explain individual conditions as well as complex conjunctions of conditions.
Abstract: Modern configurational comparative methods (CCMs) of causal inference, such as Qualitative Comparative Analysis (QCA) and Coincidence Analysis (CNA), have started to make inroads into medical and health research over the last decade. At the same time, these methods remain unable to process data on multi-morbidity, a situation in which at least two chronic conditions are simultaneously present. Such data require the capability to analyze complex effects. Against a background of fast-growing numbers of patients with multi-morbid diagnoses, we present a new member of the family of CCMs with which multiple conditions and their complex conjunctions can be analyzed: Combinational Regularity Analysis (CORA). The technical heart of CORA consists of algorithms that were originally developed in electrical engineering for the analysis of multi-output switching circuits. We have adapted these algorithms for purposes of configurational data analysis. To demonstrate CORA, we provide several example applications, both with simulated and empirical data, by means of the eponymous software package CORA. Also included in CORA is the possibility to mine configurational data and to visualize results via logic diagrams. For simple single-condition analyses, CORA's solution is identical to that of QCA or CNA. However, analyses of multiple conditions with CORA differ in important respects from analyses with QCA or CNA. Most importantly, CORA is presently the only configurational method able to simultaneously explain individual conditions as well as complex conjunctions of conditions. Through CORA, problems of multi-morbidity in particular, and configurational analyses of complex effects in general, come into the analytical reach of CCMs. Future research aims to further broaden and enhance CORA's capabilities for refining such analyses.

Journal ArticleDOI
TL;DR: In this paper, the impact of different lengths of lookback window (LW), a retrospective time period to observe diagnoses in administrative data, on the prevalence and incidence of eight chronic diseases was described.
Abstract: We described the impact of different lengths of lookback window (LW), a retrospective time period to observe diagnoses in administrative data, on the prevalence and incidence of eight chronic diseases. Our study populations included people living with HIV (N = 5151) and 1:5 age-sex-matched HIV-negative individuals (N = 25,755) in British Columbia, Canada, with complete follow-up between 1996 and 2012. We measured period prevalence and incidence of diseases in 2012 using LWs ranging from 1 to 16 years. Cases were deemed prevalent if identified in 2012 or within a defined LW, and incident if newly identified in 2012 with no previous cases detected within a defined LW. Chronic disease cases were ascertained using published case-finding algorithms applied to population-based provincial administrative health datasets. Overall, using cases identified by the full 16-year LW as the reference, LWs ≥8 years and ≥4 years reduced the proportion of misclassified prevalent and incident cases of most diseases to <20%, respectively. The impact of LWs varied across diseases and populations. This study underscored the importance of carefully choosing LWs and demonstrated data-driven approaches that may inform these choices. To improve comparability of prevalence and incidence estimates across different settings, we recommend transparent reporting of the rationale and limitations of chosen LWs.
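The core manipulation, recounting prevalent and incident cases as the lookback window varies, is a few lines of data wrangling; a toy base-R sketch with simulated diagnosis records (not the British Columbia data):

```r
set.seed(9)
# Toy administrative records: one row per person-year with a diagnosis code
dx <- data.frame(id   = sample(1:1000, 2500, replace = TRUE),
                 year = sample(1996:2012, 2500, replace = TRUE))

# For each lookback window (LW): prevalent = any code in 2012 or the LW years
# before it; incident = a code in 2012 with none in the preceding LW years.
counts_by_lw <- sapply(1:16, function(lw) {
  prevalent <- unique(dx$id[dx$year >= 2012 - lw])
  in_2012   <- unique(dx$id[dx$year == 2012])
  prior     <- unique(dx$id[dx$year >= 2012 - lw & dx$year < 2012])
  c(prevalent = length(prevalent),
    incident  = length(setdiff(in_2012, prior)))
})
colnames(counts_by_lw) <- paste0("LW", 1:16)
counts_by_lw   # case counts shrink/grow as the window lengthens
```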

Journal ArticleDOI
TL;DR: In this article, a review aimed to identify all observational studies employing improved approaches to mitigate confounding in characterizing alcohol-long-term health relationships, and to qualitatively synthesize their findings.
Abstract: Research has long found 'J-shaped' relationships between alcohol consumption and certain health outcomes, indicating a protective effect of moderate consumption. However, methodological limitations in most studies hinder causal inference. This review aimed to identify all observational studies employing improved approaches to mitigate confounding in characterizing alcohol-long-term health relationships, and to qualitatively synthesize their findings. Eligible studies met the above description, were longitudinal (with pre-defined exceptions), discretized alcohol consumption, and were conducted with human populations. MEDLINE, PsycINFO, Embase and SCOPUS were searched in May 2020, yielding 16 published manuscripts reporting on cancer, diabetes, dementia, mental health, cardiovascular health, mortality, HIV seroconversion, and musculoskeletal health. Risk of bias of cohort studies was evaluated using the Newcastle-Ottawa Scale, and a recently developed tool was used for Mendelian Randomization studies. A variety of functional forms were found, including reverse J/J-shaped relationships for prostate cancer and related mortality, dementia risk, mental health, and certain lipids. However, most outcomes were only evaluated by a single study, and few studies provided information on the role of alcohol consumption pattern. More research employing enhanced causal inference methods is urgently required to accurately characterize alcohol-long-term health relationships. Those studies that have been conducted find a variety of linear and non-linear functional forms, with results tending to be discrepant even within specific health outcomes. PROSPERO registration number: CRD42020185861.

Journal ArticleDOI
TL;DR: In this article, discrete-time survival models are applied to a person-period data set to predict the hazard of experiencing the failure event in pre-specified time intervals; these models can be extended to accommodate new binary classification algorithms as they become available.
Abstract: Background Prediction models for time-to-event outcomes are commonly used in biomedical research to obtain subject-specific probabilities that aid in making important clinical care decisions. There are several regression and machine learning methods for building these models that have been designed or modified to account for the censoring that occurs in time-to-event data. Discrete-time survival models, which have often been overlooked in the literature, provide an alternative approach for predictive modeling in the presence of censoring with limited loss in predictive accuracy. These models can take advantage of the range of nonparametric machine learning classification algorithms and their available software to predict survival outcomes. Methods Discrete-time survival models are applied to a person-period data set to predict the hazard of experiencing the failure event in pre-specified time intervals. This framework allows for any binary classification method to be applied to predict these conditional survival probabilities. Using time-dependent performance metrics that account for censoring, we compare the predictions from parametric and machine learning classification approaches applied within the discrete time-to-event framework to those from continuous-time survival prediction models. We outline the process for training and validating discrete-time prediction models, and demonstrate its application using the open-source R statistical programming environment. Results Using publicly available data sets, we show that some discrete-time prediction models achieve better prediction performance than the continuous-time Cox proportional hazards model. Random survival forests, a machine learning algorithm adapted to survival data, also had improved performance compared to the Cox model, but was sometimes outperformed by the discrete-time approaches. In comparing the binary classification methods in the discrete time-to-event framework, the relative performance of the different methods varied depending on the data set. Conclusions We present a guide for developing survival prediction models using discrete-time methods and assessing their predictive performance with the aim of encouraging their use in medical research settings. These methods can be applied to data sets that have continuous time-to-event outcomes and multiple clinical predictors. They can also be extended to accommodate new binary classification algorithms as they become available. We provide R code for fitting discrete-time survival prediction models in a GitHub repository.
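A compact sketch of the workflow described above, using survSplit from the survival package for the person-period expansion and a logistic GLM as the binary classifier (the paper's own R code lives in its repository; this example instead uses the built-in lung data and an arbitrary 90-day interval width):

```r
library(survival)  # for survSplit and the example lung data

# Person-period expansion: each subject contributes one row per 90-day interval
lung2 <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
lung2$event <- as.integer(lung2$status == 2)

pp <- survSplit(Surv(time, event) ~ age + sex + ph.ecog, data = lung2,
                cut = seq(90, 900, by = 90), episode = "interval")

# Discrete-time hazard model: any binary classifier fits here; a logistic GLM
# is the classic choice, with interval as a categorical baseline hazard
fit <- glm(event ~ factor(interval) + age + sex + ph.ecog,
           family = binomial, data = pp)

# Predicted survival for one covariate profile: S(t) = prod over t of (1 - hazard)
newd <- data.frame(interval = 1:10, age = 60, sex = 1, ph.ecog = 1)
surv <- cumprod(1 - predict(fit, newdata = newd, type = "response"))
round(surv, 3)
```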

Journal ArticleDOI
TL;DR: In this paper, the authors presented the extension of the case time series design, originally proposed for individual-level analyses on short-term associations with time-varying exposures, for applications using data aggregated over small geographical areas.
Abstract: The increased availability of data on health outcomes and risk factors collected at fine geographical resolution is one of the main reasons for the rising popularity of epidemiological analyses conducted at small-area level. However, this rich data setting poses important methodological issues related to modelling complexities and computational demands, as well as the linkage and harmonisation of data collected at different geographical levels. This tutorial illustrates the extension of the case time series design, originally proposed for individual-level analyses on short-term associations with time-varying exposures, for applications using data aggregated over small geographical areas. The case time series design embeds the longitudinal structure of time series data within the self-matched framework of case-only methods, offering a flexible and highly adaptable analytical tool. The methodology is well suited for modelling complex temporal relationships, and it provides an efficient computational scheme for large datasets including longitudinal measurements collected at a fine geographical level. The application of the case time series for small-area analyses is demonstrated using a real-data case study to assess the mortality risks associated with high temperature in the summers of 2006 and 2013 in London, UK. The example makes use of information on individual deaths, temperature, and socio-economic characteristics collected at different geographical levels. The tutorial describes the various steps of the analysis, namely the definition of the case time series structure and the linkage of the data, as well as the estimation of the risk associations and the assessment of vulnerability differences. R code and data are made available to fully reproduce the results and the graphical descriptions. The extension of the case time series for small-area analysis offers a valuable analytical tool that combines modelling flexibility and computational efficiency. The increasing availability of data collected at fine geographical scales provides opportunities for its application to address a wide range of epidemiological questions.
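As a rough illustration of the modelling scheme described above (not the authors' published code), the sketch below pairs a conditional quasi-Poisson regression from the gnm package with a distributed lag term from dlnm; the data frame, column names, stratum definition, and lag settings are all hypothetical:

```r
# Illustrative sketch of a case time series model for small-area data:
# a conditional quasi-Poisson regression with self-matched strata, paired
# with a distributed lag non-linear term for temperature. All data,
# column names, and settings below are hypothetical.
library(gnm)
library(dlnm)

set.seed(1)
dates <- seq(as.Date("2006-06-01"), as.Date("2006-08-31"), by = "day")
d <- expand.grid(area = factor(1:5), date = dates)
d <- d[order(d$area, d$date), ]            # series must be ordered by group
d$tmean  <- 18 + 6 * sin(as.numeric(d$date) / 10) + rnorm(nrow(d))
d$deaths <- rpois(nrow(d), lambda = 2)     # toy outcome counts

# Self-matched strata: each area-month acts as its own control period
d$stratum <- factor(paste(d$area, format(d$date, "%Y-%m"), sep = ":"))

# Exposure-lag-response term for temperature (lags 0-3); 'group' keeps
# the lag structure within each area's series
cb <- crossbasis(d$tmean, lag = 3,
                 argvar = list(fun = "ns", df = 3),
                 arglag = list(fun = "strata"),
                 group = d$area)

# Conditional Poisson regression: 'eliminate' absorbs the stratum
# intercepts, yielding the case-only, self-matched structure
mod <- gnm(deaths ~ cb, eliminate = stratum,
           family = quasipoisson, data = d)

# Cumulative temperature-mortality association, centred at the median
pred <- crosspred(cb, mod, cen = median(d$tmean))
```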

Journal ArticleDOI
TL;DR: KMSubtraction, presented in this paper, is an R package that retrieves patients from unreported subgroups by matching participants on KM plots of the overall cohort to participants on KM plots of a known subgroup by follow-up time.
Abstract: Data from certain subgroups of clinical interest may not be presented in primary manuscripts or conference abstract presentations. In an effort to enable secondary data analyses, we propose a workflow to retrieve unreported subgroup survival data from published Kaplan-Meier (KM) plots. We developed KMSubtraction, an R package that retrieves patients from unreported subgroups by matching participants on KM plots of the overall cohort to participants on KM plots of a known subgroup by follow-up time. By excluding matched patients, the opposing unreported subgroup may be retrieved. Reproducibility and limits of error of the KMSubtraction workflow were assessed by comparing unmatched patients against the original survival data of subgroups from published datasets and simulations. Monte Carlo simulations were utilized to evaluate the limits of error of KMSubtraction. The validation exercise found no material systematic error and demonstrates the robustness of KMSubtraction in deriving unreported subgroup survival data. Limits of error were small and negligible on marginal Cox proportional hazard models comparing reconstructed and original survival data of unreported subgroups. Extensive Monte Carlo simulations demonstrate that datasets with a high reported subgroup proportion (r = 0.467, p < 0.001), small dataset size (r = -0.374, p < 0.001) and a high proportion of missing data in the unreported subgroup (r = 0.553, p < 0.001) are likely to yield high limits of error with KMSubtraction. KMSubtraction demonstrates robustness in deriving survival data from unreported subgroups. The limits of error of KMSubtraction derived from converged Monte Carlo simulations may guide the interpretation of reconstructed survival data of unreported subgroups.
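KMSubtraction's actual interface is not reproduced here, but the underlying matching idea can be sketched in a few lines of base R: given individual patient data reconstructed from the overall-cohort and known-subgroup KM plots (e.g., via the Guyot algorithm), matched patients are removed and the unmatched remainder approximates the unreported subgroup. The column names and the nearest-time matching rule below are illustrative simplifications:

```r
# Conceptual sketch of the matching idea (this is NOT KMSubtraction's API):
# remove the overall-cohort patients that best match the known subgroup;
# the unmatched remainder approximates the unreported subgroup.
subtract_subgroup <- function(overall, known) {
  # overall, known: data frames with columns 'time' and 'status' (1 = event)
  used <- rep(FALSE, nrow(overall))
  for (i in seq_len(nrow(known))) {
    candidates <- which(!used & overall$status == known$status[i])
    if (length(candidates) == 0) next
    # match on the closest follow-up time within the same event status
    j <- candidates[which.min(abs(overall$time[candidates] - known$time[i]))]
    used[j] <- TRUE
  }
  overall[!used, ]   # unmatched patients = inferred opposing subgroup
}
```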

Journal ArticleDOI
TL;DR: In this paper , a two-stage matching-adjusted indirect comparison (2SMAIC) method is proposed to balance covariates between treatment arms and across studies, which is the most widely used covariate adjusted indirect comparison method in health technology assessment.
Abstract: Anchored covariate-adjusted indirect comparisons inform reimbursement decisions where there are no head-to-head trials between the treatments of interest, there is a common comparator arm shared by the studies, and there are patient-level data limitations. Matching-adjusted indirect comparison (MAIC), based on propensity score weighting, is the most widely used covariate-adjusted indirect comparison method in health technology assessment. MAIC has poor precision and is inefficient when the effective sample size after weighting is small. A modular extension to MAIC, termed two-stage matching-adjusted indirect comparison (2SMAIC), is proposed. This uses two parametric models. One estimates the treatment assignment mechanism in the study with individual patient data (IPD); the other estimates the trial assignment mechanism. The first model produces inverse probability weights that are combined with the odds weights produced by the second model. The resulting weights seek to balance covariates between treatment arms and across studies. A simulation study provides proof-of-principle in an indirect comparison performed across two randomized trials. Nevertheless, 2SMAIC can be applied in situations where the IPD trial is observational, by including potential confounders in the treatment assignment model. The simulation study also explores the use of weight truncation in combination with MAIC for the first time. Despite enforcing randomization and knowing the true treatment assignment mechanism in the IPD trial, 2SMAIC yields improved precision and efficiency with respect to MAIC in all scenarios, while maintaining similarly low levels of bias. The two-stage approach is effective when sample sizes in the IPD trial are low, as it controls for chance imbalances in prognostic baseline covariates between study arms. It is not as effective when overlap between the trials' target populations is poor and the extremity of the weights is high. In these scenarios, truncation leads to substantial precision and efficiency gains but induces considerable bias. The combination of a two-stage approach with truncation produces the highest precision and efficiency improvements. Two-stage approaches to MAIC can increase precision and efficiency with respect to the standard approach by adjusting for empirical imbalances in prognostic covariates in the IPD trial. Further modules could be incorporated for additional variance reduction or to account for missingness and non-compliance in the IPD trial.
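A minimal sketch of how the two sets of weights might be combined is shown below, using simulated stand-in data; the covariate names, aggregate means, and model forms are hypothetical, not the authors' code. The first stage is the standard method-of-moments MAIC estimator, the second a propensity score model within the IPD trial:

```r
# Minimal sketch of 2SMAIC weight construction under simplifying assumptions.
set.seed(1)
n <- 200
ipd <- data.frame(x1  = rbinom(n, 1, 0.3),     # binary covariate
                  x2  = rnorm(n, 58, 8),       # continuous covariate
                  trt = rbinom(n, 1, 0.5))     # randomized treatment
agg_means <- c(x1 = 0.45, x2 = 61.0)           # published comparator means

# Stage 1 - trial assignment model (standard method-of-moments MAIC):
# centre the IPD covariates at the aggregate means and solve for the odds
# weights w = exp(X %*% a) that balance covariates across trials
X <- sweep(as.matrix(ipd[, c("x1", "x2")]), 2, agg_means)
a_hat <- optim(c(0, 0), function(a) sum(exp(X %*% a)),
               method = "BFGS")$par
w_trial <- as.vector(exp(X %*% a_hat))

# Stage 2 - treatment assignment model within the IPD trial: inverse
# probability of treatment weights correct chance imbalances between arms
ps <- fitted(glm(trt ~ x1 + x2, family = binomial, data = ipd))
w_treat <- ifelse(ipd$trt == 1, 1 / ps, 1 / (1 - ps))

# Combined 2SMAIC weights, to be used in a weighted outcome model
ipd$w <- w_trial * w_treat

# Check: trial-weighted covariate means should match the aggregate means
colSums(ipd[, c("x1", "x2")] * w_trial) / sum(w_trial)
```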

Journal ArticleDOI
TL;DR: In this paper , the authors used open-source software and a publicly available dataset to train and validate multiple ML models to classify breast masses into benign or malignant using mammography image features and patient age.
Abstract: Abstract Background There is growing enthusiasm for the application of machine learning (ML) and artificial intelligence (AI) techniques to clinical research and practice. However, instructions on how to develop robust high-quality ML and AI in medicine are scarce. In this paper, we provide a practical example of techniques that facilitate the development of high-quality ML systems including data pre-processing, hyperparameter tuning, and model comparison using open-source software and data. Methods We used open-source software and a publicly available dataset to train and validate multiple ML models to classify breast masses into benign or malignant using mammography image features and patient age. We compared algorithm predictions to the ground truth of histopathologic evaluation. We provide step-by-step instructions with accompanying code lines. Findings Performance of the five algorithms at classifying breast masses as benign or malignant based on mammography image features and patient age was statistically equivalent (P > 0.05). Area under the receiver operating characteristic curve (AUROC) for the logistic regression with elastic net penalty was 0.89 (95% CI 0.85 – 0.94), for the Extreme Gradient Boosting Tree 0.88 (95% CI 0.83 – 0.93), for the Multivariate Adaptive Regression Spline algorithm 0.88 (95% CI 0.83 – 0.93), for the Support Vector Machine 0.89 (95% CI 0.84 – 0.93), and for the neural network 0.89 (95% CI 0.84 – 0.93). Interpretation Our paper allows clinicians and medical researchers who are interested in using ML algorithms to understand and recreate the elements of a comprehensive ML analysis. Following our instructions may help to improve model generalizability and reproducibility in medical ML studies.
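One way to implement the cross-validated tuning step described above is with the caret package in R; the sketch below uses a simulated stand-in for the public dataset, with hypothetical variable names, and shows only the elastic net arm of the five-model comparison:

```r
# Sketch of cross-validated hyperparameter tuning with caret; the data
# frame 'masses' and its columns are simulated stand-ins.
library(caret)

set.seed(7)
n <- 300
masses <- data.frame(age     = rnorm(n, 55, 10),
                     density = runif(n),
                     margin  = runif(n))
masses$diagnosis <- factor(ifelse(
  rbinom(n, 1, plogis(-4 + 0.05 * masses$age + 2 * masses$density)) == 1,
  "malignant", "benign"))

# 10-fold cross-validation with AUROC as the model selection metric
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Pre-processing and hyperparameter tuning (alpha, lambda) happen inside
# resampling, which guards against information leakage
fit <- train(diagnosis ~ ., data = masses,
             method = "glmnet", metric = "ROC",
             preProcess = c("center", "scale"),
             tuneLength = 10, trControl = ctrl)

fit$bestTune              # selected elastic net hyperparameters
max(fit$results$ROC)      # cross-validated AUROC of the tuned model
```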

Journal ArticleDOI
TL;DR: In this article, the authors used simulated and real-life data to illustrate that GBTM is susceptible to generating spurious findings in some circumstances and recommended using several model adequacy criteria to assess classification adequacy.
Abstract: Abstract Background Group-based trajectory modelling (GBTM) is increasingly used to identify subgroups of individuals with similar patterns. In this paper, we use simulated and real-life data to illustrate that GBTM is susceptible to generating spurious findings in some circumstances. Methods Six plausible scenarios, two of which mimicked published analyses, were simulated. Models with 1 to 10 trajectory subgroups were estimated and the model that minimized the Bayesian information criterion (BIC) was selected. For each scenario, we assessed whether the method identified the correct number of trajectories, the correct shapes of the trajectories, and the correct mean number of participants in each trajectory subgroup. The performance of the average posterior probability, relative entropy and mismatch criteria for assessing classification adequacy was compared. Results Among the six scenarios, the correct number of trajectories was identified in two, the correct shapes in four and the mean number of participants of each trajectory subgroup in only one. Relative entropy and mismatch outperformed the average posterior probability in detecting spurious trajectories. Conclusion Researchers should be aware that GBTM can generate spurious findings, especially when the average posterior probability is used as the sole criterion to evaluate model fit. Several model adequacy criteria should be used to assess classification adequacy.
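For readers who want to apply these criteria, all three can be computed directly from the N × K matrix of posterior membership probabilities produced by a fitted GBTM. The sketch below uses standard formulations of the criteria, which may differ in detail from the authors' implementations:

```r
# Classification-adequacy criteria computed from an N x K matrix 'post'
# of posterior membership probabilities (standard formulations).
appa <- function(post) {
  modal <- max.col(post)                     # modal class assignment
  # average posterior probability of assignment within each class
  sapply(seq_len(ncol(post)),
         function(k) mean(post[modal == k, k]))
}

relative_entropy <- function(post) {
  p <- pmax(post, 1e-12)                     # guard against log(0)
  # E = 1 - sum(-p * log p) / (N * log K); 1 indicates perfect separation
  1 - sum(-p * log(p)) / (nrow(post) * log(ncol(post)))
}

mismatch <- function(post) {
  modal <- max.col(post)
  # estimated class proportions minus proportions under modal assignment
  colMeans(post) - tabulate(modal, ncol(post)) / nrow(post)
}

# Toy example: three individuals, two latent classes
post <- matrix(c(0.9, 0.1,
                 0.6, 0.4,
                 0.2, 0.8), ncol = 2, byrow = TRUE)
appa(post); relative_entropy(post); mismatch(post)
```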