
Showing papers in "BMC Medical Research Methodology in 2020"


Journal ArticleDOI
TL;DR: A tool was developed that enables researchers with and without thorough knowledge of measurement properties to assess the quality of a study on reliability and measurement error of outcome measurement instruments.
Abstract: Scores on an outcome measurement instrument depend on the type and settings of the instrument used, how instructions are given to patients, how professionals administer and score the instrument, etc. The impact of all these sources of variation on scores can be assessed in studies on reliability and measurement error, if properly designed and analyzed. The aim of this study was to develop standards to assess the quality of studies on reliability and measurement error of clinician-reported outcome measurement instruments, performance-based outcome measurement instruments, and laboratory values. We conducted a 3-round Delphi study involving 52 panelists. Consensus was reached on how a comprehensive research question can be deduced from the design of a reliability study to determine how the results of a study inform us about the quality of the outcome measurement instrument at issue. Consensus was reached on components of outcome measurement instruments, i.e. the potential sources of variation. Next, we reached consensus on standards on design requirements (n = 5), standards on preferred statistical methods for reliability (n = 3) and measurement error (n = 2), and their ratings on a four-point scale. There was one term for a component and one rating of one standard on which no consensus was reached, which therefore required a decision by the steering committee. We developed a tool that enables researchers with and without thorough knowledge of measurement properties to assess the quality of a study on reliability and measurement error of outcome measurement instruments.
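
For orientation (an illustration, not part of the study's standards): reliability of this kind is commonly quantified with an intraclass correlation coefficient (ICC), and measurement error with a standard error of measurement (SEM). A minimal sketch using simulated test-retest data and the pingouin library; the SEM formula shown (SD times the square root of 1 minus ICC) is one common choice among several:

```python
# Sketch: quantifying reliability (ICC) and measurement error (SEM) for a
# hypothetical test-retest design. Illustrative only; names are made up.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
true = rng.normal(50, 10, size=30)                       # patients' true scores
data = pd.DataFrame({
    "patient": np.repeat(np.arange(30), 2),              # long format
    "occasion": np.tile([1, 2], 30),
    "score": np.repeat(true, 2) + rng.normal(0, 3, 60),  # add measurement error
})

icc = pg.intraclass_corr(data=data, targets="patient",
                         raters="occasion", ratings="score")
icc21 = icc.loc[icc["Type"] == "ICC2", "ICC"].item()     # two-way random, agreement
sem = data["score"].std(ddof=1) * np.sqrt(1 - icc21)     # one common SEM formula
print(f"ICC(2,1) = {icc21:.2f}, SEM = {sem:.2f}")
```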

148 citations


Journal ArticleDOI
TL;DR: This paper evaluates three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. It discusses the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Abstract: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
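
One generic fidelity check, consistent with the kinds of metrics the paper discusses but not taken from it, is a distinguishability test: train a classifier to separate real from synthetic rows, where an AUC near 0.5 indicates the synthetic data are statistically hard to tell apart. A minimal sketch with simulated data:

```python
# Sketch: a "distinguishability" metric for synthetic data quality.
# AUC ~ 0.5 means a classifier cannot separate real from synthetic rows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=1000)
synth = rng.multivariate_normal([0, 0], [[1, .4], [.4, 1]], size=1000)  # imperfect copy

X = np.vstack([real, synth])
y = np.r_[np.ones(len(real)), np.zeros(len(synth))]    # 1 = real, 0 = synthetic

auc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, scoring="roc_auc").mean()
print(f"real-vs-synthetic AUC = {auc:.2f} (0.5 = indistinguishable)")
```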

141 citations


Journal ArticleDOI
TL;DR: The results indicate significant inconsistencies in how these reviews are conducted, pointing to the need for clearer reporting standards and consensus on methodological guidance for systematic reviews of prevalence data.
Abstract: There is a notable lack of methodological and reporting guidance for systematic reviews of prevalence data. This information void has the potential to result in reviews that are inconsistent and inadequate to inform healthcare policy and decision making. The aim of this meta-epidemiological study is to describe the methodology of recently published prevalence systematic reviews. We searched MEDLINE (via PubMed) from February 2017 to February 2018 for systematic reviews of prevalence studies. We included systematic reviews assessing the prevalence of any clinical condition using patients as the unit of measurement, and we summarized data related to the reporting and methodology of the reviews. A total of 235 systematic reviews of prevalence were analyzed. The median number of authors was 5 (interquartile range [IQR] 4–7), the median number of databases searched was 4 (3–6) and the median number of studies included in each review was 24 (IQR 15–41.5). Search strategies were presented for 68% of reviews. Forty-five percent of reviews received external funding, and 24% did not provide funding information. Twenty-three percent of included reviews had published or registered the systematic review protocol. Reporting guidelines were used in 72% of reviews. The quality of included studies was assessed in 80% of reviews. Nine reviews assessed the overall quality of evidence (4 using GRADE). Meta-analysis was conducted in 65% of reviews; 1% used Bayesian methods. Random-effects meta-analysis was used in 94% of reviews; among them, 75% did not report the variance estimator used. Among the reviews with meta-analysis, 70% did not report how data were transformed; 59% conducted subgroup analysis, 38% conducted meta-regression and 2% estimated a prediction interval; I2 was estimated in 95% of analyses. Publication bias was examined in 48%. The most common software used was Stata (55%). Our results indicate that there are significant inconsistencies regarding how these reviews are conducted. Many of these differences arose in the assessment of methodological quality and the formal synthesis of comparable data. This variability indicates the need for clearer reporting standards and consensus on methodological guidance for systematic reviews of prevalence data.
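
As background to the methods being surveyed: a random-effects meta-analysis of prevalence is often run on logit-transformed proportions with the DerSimonian-Laird estimator. A minimal sketch with hypothetical study counts (not data from this review):

```python
# Sketch: random-effects meta-analysis of prevalence on the logit scale
# (DerSimonian-Laird). Event counts x and sample sizes n are hypothetical.
import numpy as np

x = np.array([12, 30, 8, 45, 20])
n = np.array([100, 250, 80, 400, 150])

p = x / n
yi = np.log(p / (1 - p))                 # logit-transformed prevalence
vi = 1 / x + 1 / (n - x)                 # approximate variance of the logit

w = 1 / vi                               # fixed-effect weights
ybar = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - ybar) ** 2)         # Cochran's Q
df = len(yi) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)            # DL between-study variance
I2 = max(0.0, (Q - df) / Q) * 100        # I^2 heterogeneity

wr = 1 / (vi + tau2)                     # random-effects weights
mu = np.sum(wr * yi) / np.sum(wr)
se = np.sqrt(1 / np.sum(wr))
pooled = 1 / (1 + np.exp(-mu))           # back-transform to a proportion
lo = 1 / (1 + np.exp(-(mu - 1.96 * se)))
hi = 1 / (1 + np.exp(-(mu + 1.96 * se)))
print(f"pooled prevalence {pooled:.3f} (95% CI {lo:.3f}-{hi:.3f}), I2 = {I2:.0f}%")
```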

136 citations


Journal ArticleDOI
TL;DR: The social media advertisement campaign was an effective and efficient strategy to collect large-scale, nationwide data on COVID-19 within a short time period and can inform future research on the use of social media recruitment for the rapid collection of survey data related to rapidly evolving health crises, such as COVID-19.
Abstract: The COVID-19 pandemic has evolved into one of the most impactful health crises in modern history, compelling researchers to explore innovative ways to collect public health data efficiently and in a timely manner. Social media platforms have been explored as a research recruitment tool in other settings; however, their feasibility for collecting representative survey data during infectious disease epidemics remains unexplored. This study has two aims: 1) describe the methodology used to recruit a nationwide sample of adults residing in the United States (U.S.) to participate in a survey on COVID-19 knowledge, beliefs, and practices, and 2) outline the preliminary findings related to recruitment, challenges using social media as a recruitment platform, and strategies used to address these challenges. An original web-based survey informed by evidence from past literature and validated scales was developed. A Facebook advertisement campaign was used to disseminate the link to an online Qualtrics survey between March 20–30, 2020. Two supplementary advertisements, one male-only and one targeting racial minorities, were created on the sixth and tenth day of recruitment, respectively, to address the disproportionately female and White skew observed in the advertisement's reach and response trends. In total, 6602 participant responses were recorded, with representation from all 50 U.S. states, the District of Columbia, and Puerto Rico. The advertisements cumulatively reached 236,017 individuals and resulted in 9609 clicks (4.07% of those reached). The total cost of the advertisement was $906, resulting in costs of $0.09 per click and $0.18 per full response (completed surveys). Implementation of the male-only advertisement improved the cumulative percentage of male respondents from approximately 20 to 40%. The social media advertisement campaign was an effective and efficient strategy to collect large-scale, nationwide data on COVID-19 within a short time period. Although men remained underrepresented among respondents, interventions to increase male responses and enhance representativeness were successful. These findings can inform future research on the use of social media recruitment for the rapid collection of survey data related to rapidly evolving health crises, such as COVID-19.
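
The headline metrics follow directly from the reported figures; the implied number of completed surveys is an inference from the quoted cost per response, not a number stated in the abstract:

```python
# Quick arithmetic check of the advertisement metrics quoted above.
reach, clicks, cost = 236_017, 9_609, 906.0

print(f"click-through: {clicks / reach:.2%}")     # ~4.07% of users reached
print(f"cost per click: ${cost / clicks:.2f}")    # ~$0.09
# $0.18 per completed survey implies roughly cost / 0.18 completions:
print(f"implied completions: {cost / 0.18:.0f}")  # ~5,033 of the 6,602 responses
```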

132 citations


Journal ArticleDOI
TL;DR: This is the first tool to appraise research quality from the perspective of Indigenous peoples, with the potential for greater improvements in Aboriginal and Torres Strait Islander health and wellbeing.
Abstract: The lack of attention to Indigenous epistemologies and, more broadly, Indigenous values in primary research, is mirrored in the standardised critical appraisal tools used to guide evidence-based practice and systematic reviews and meta-syntheses. These critical appraisal tools offer no guidance on how validity or contextual relevance should be assessed for Indigenous populations and cultural contexts. Failure to tailor the research questions, design, analysis, dissemination and knowledge translation to capture understandings that are specific to Indigenous peoples results in research of limited acceptability and benefit, and potentially harms Indigenous peoples. A specific Aboriginal and Torres Strait Islander Quality Appraisal Tool is needed to address this gap. The Aboriginal and Torres Strait Islander Quality Appraisal Tool (QAT) was developed using modified Nominal Group and Delphi techniques, and the tool's validity, reliability, and feasibility were assessed over three stages of independent piloting. National and international research guidelines were used as points of reference. Piloting of the Aboriginal and Torres Strait Islander QAT with Aboriginal and Torres Strait Islander and non-Indigenous experts led to refinement of the tool. The Aboriginal and Torres Strait Islander QAT consists of 14 questions that assess the quality of health research from an Aboriginal and Torres Strait Islander perspective. The questions encompass setting appropriate research questions; community engagement and consultation; research leadership and governance; community protocols; intellectual and cultural property rights; the collection and management of research material; Indigenous research paradigms; a strength-based approach to research; the translation of findings into policy and practice; benefits to participants and communities involved; and capacity strengthening and two-way learning. Outcomes from the assessment of the tool's validity, reliability, and feasibility were overall positive. This is the first tool to appraise research quality from the perspective of Indigenous peoples. Through the uptake of the Aboriginal and Torres Strait Islander QAT we hope to improve the quality and transparency of research with Aboriginal and Torres Strait Islander peoples, with the potential for greater improvements in Aboriginal and Torres Strait Islander health and wellbeing.

106 citations


Journal ArticleDOI
TL;DR: Covidence and Rayyan are recommended to systematic reviewers looking for suitable, easy-to-use tools to support T&Ab screening within healthcare research, because they consistently demonstrated good alignment with user requirements.
Abstract: Systematic reviews are vital to the pursuit of evidence-based medicine within healthcare. Screening titles and abstracts (T&Ab) is a time-consuming stage of the review process that appropriate software tools can support. We identified candidate tools through routes including a search of the online "systematic review toolbox" and screening of references in existing literature. We included tools that were accessible and available for testing at the time of the study (December 2018), did not require specific computing infrastructure and provided basic screening functionality for systematic reviews. Key properties of each software tool were identified using a feature analysis adapted for this purpose. This analysis included a weighting developed by a group of medical researchers, therefore prioritising the most relevant features. The highest-scoring tools from the feature analysis were then included in a user survey, in which we further investigated the suitability of the tools for supporting T&Ab screening amongst systematic reviewers working in medical research. Fifteen tools met our inclusion criteria. They vary significantly in relation to cost, scope and intended user community. Six of the identified tools (Abstrackr, Colandr, Covidence, DRAGON, EPPI-Reviewer and Rayyan) scored higher than 75% in the feature analysis and were included in the user survey. Of these, Covidence and Rayyan were the most popular with the survey respondents. Their usability scored highly across a range of metrics, with all surveyed researchers (n = 6) stating that they would be likely (or very likely) to use these tools in the future. Based on this study, we would recommend Covidence and Rayyan to systematic reviewers looking for suitable and easy-to-use tools to support T&Ab screening within healthcare research. These two tools consistently demonstrated good alignment with user requirements. We acknowledge, however, the role of some of the other tools we considered in providing more specialist features that may be of great importance to many researchers.
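
A weighted feature analysis of the kind described can be sketched as follows; the features, weights and support scores are hypothetical, and the 75% cut-off mirrors the threshold used above for survey inclusion:

```python
# Sketch: weighted feature analysis of screening tools (hypothetical data).
features = {            # feature -> weight assigned by the review team
    "dual screening": 3, "conflict resolution": 3, "bulk import": 2,
    "keyword highlighting": 2, "free to use": 1,
}
tools = {               # tool -> {feature: 0/1 support}
    "ToolA": {"dual screening": 1, "conflict resolution": 1, "bulk import": 1,
              "keyword highlighting": 1, "free to use": 0},
    "ToolB": {"dual screening": 1, "conflict resolution": 0, "bulk import": 1,
              "keyword highlighting": 0, "free to use": 1},
}

max_score = sum(features.values())
for name, support in tools.items():
    score = sum(w * support[f] for f, w in features.items())
    pct = 100 * score / max_score
    flag = "include in user survey" if pct > 75 else "exclude"
    print(f"{name}: {pct:.0f}% -> {flag}")
```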

106 citations


Journal ArticleDOI
TL;DR: RF-SLAM is a novel statistical and machine learning method that improves risk prediction by incorporating time-varying information and accommodating a large number of predictors, their interactions, and missing values, and demonstrates superior performance relative to standard random forest methods for survival data.
Abstract: Clinical research and medical practice can be advanced through the prediction of an individual’s health state, trajectory, and responses to treatments. However, the majority of current clinical risk prediction models are based on regression approaches or machine learning algorithms that are static, rather than dynamic. To benefit from the increasing emergence of large, heterogeneous data sets, such as electronic health records (EHRs), novel tools to support improved clinical decision making through methods for individual-level risk prediction that can handle multiple variables, their interactions, and time-varying values are necessary. We introduce a novel dynamic approach to clinical risk prediction for survival, longitudinal, and multivariate (SLAM) outcomes, called random forest for SLAM data analysis (RF-SLAM). RF-SLAM is a continuous-time, random forest method for survival analysis that combines the strengths of existing statistical and machine learning methods to produce individualized Bayes estimates of piecewise-constant hazard rates. We also present a method-agnostic approach for time-varying evaluation of model performance. We derive and illustrate the method by predicting sudden cardiac arrest (SCA) in the Left Ventricular Structural (LV) Predictors of Sudden Cardiac Death (SCD) Registry. We demonstrate superior performance relative to standard random forest methods for survival data. We illustrate the importance of the number of preceding heart failure hospitalizations as a time-dependent predictor in SCA risk assessment. RF-SLAM is a novel statistical and machine learning method that improves risk prediction by incorporating time-varying information and accommodating a large number of predictors, their interactions, and missing values. RF-SLAM is designed to easily extend to simultaneous predictions of multiple, possibly competing, events and/or repeated measurements of discrete or continuous variables over time. Trial registration: LV Structural Predictors of SCD Registry (clinicaltrials.gov, NCT01076660), retrospectively registered 25 February 2010.
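
The core idea of forest-estimated piecewise-constant hazards can be illustrated generically: expand follow-up into person-period intervals (analogous to RF-SLAM's counting-process units) and let a classifier estimate the event probability per interval. A simplified sketch with simulated data, not the authors' implementation:

```python
# Sketch: discrete-time hazard estimation with a random forest, in the spirit
# of piecewise-constant hazards. Data and variable names are simulated.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
age = rng.normal(60, 8, n)
t_event = rng.exponential(np.exp(5 - 0.03 * (age - 60)))  # simulated event times
t_cens = rng.uniform(0, 120, n)                           # simulated censoring
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

width = 10.0  # interval width: one row per subject-interval ("person-period")
rows = []
for i in range(n):
    k = int(np.ceil(time[i] / width))
    for j in range(k):
        rows.append({"age": age[i], "interval": j,
                     "event": int(event[i] and j == k - 1)})
pp = pd.DataFrame(rows)

rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=20, random_state=0)
rf.fit(pp[["age", "interval"]], pp["event"])

# Predicted event probability per interval, divided by the interval width,
# approximates a piecewise-constant hazard:
new = pd.DataFrame([[70, 0]], columns=["age", "interval"])
print("hazard (age 70, first interval):", rf.predict_proba(new)[0, 1] / width)
```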

83 citations


Journal ArticleDOI
Riko Kelter1
TL;DR: A non-technical introduction to Bayesian hypothesis testing in JASP is provided by comparing traditional tests and statistical methods with their Bayesian counterparts, showing the strengths and limitations of JASP for frequentist NHST and Bayesian inference.
Abstract: Although null hypothesis significance testing (NHST) is the agreed gold standard in medical decision making and the most widespread inferential framework used in medical research, it has several drawbacks. Bayesian methods can complement or even replace frequentist NHST, but these methods have been underutilised mainly due to a lack of easy-to-use software. JASP is an open-source program for common operating systems, recently developed to make Bayesian inference more accessible to researchers; it includes the most common tests, an intuitive graphical user interface and publication-ready output plots. This article provides a non-technical introduction to Bayesian hypothesis testing in JASP by comparing traditional tests and statistical methods with their Bayesian counterparts. The comparison shows the strengths and limitations of JASP for frequentist NHST and Bayesian inference. Specifically, Bayesian hypothesis testing via Bayes factors can complement and even replace NHST in most situations in JASP. While p-values can only reject the null hypothesis, the Bayes factor can state evidence for both the null and the alternative hypothesis, making confirmation of hypotheses possible. Also, effect sizes can be precisely estimated in the Bayesian paradigm via JASP. Bayesian inference has not been widely used until now due to the dearth of accessible software. Medical decision making can be complemented by Bayesian hypothesis testing in JASP, providing richer information than single p-values and thus strengthening the credibility of an analysis. Through an easy point-and-click interface, researchers accustomed to other graphical statistical packages like SPSS can seamlessly transition to JASP and benefit from the listed advantages with only a few limitations.
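
For a flavour of the Bayes factors JASP reports, a JZS Bayes factor for a two-sample t-test can also be computed in Python with the pingouin library (assuming its default Cauchy prior scale of about 0.707); the data here are simulated:

```python
# Sketch: JZS Bayes factor for a two-sample t-test via pingouin.
import numpy as np
from scipy import stats
import pingouin as pg

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 40)   # e.g., control group (simulated)
b = rng.normal(0.5, 1.0, 40)   # e.g., treatment group (simulated)

t, p = stats.ttest_ind(a, b)
bf10 = pg.bayesfactor_ttest(t, nx=len(a), ny=len(b))
print(f"p = {p:.3f}, BF10 = {float(bf10):.2f}")
# BF10 > 1 favours H1, BF10 < 1 favours H0; unlike the p-value,
# it can quantify evidence *for* the null.
```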

83 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used multistate models to study COVID-19 patients' time-dependent progress and provide a statistical framework to estimate hazard rates and transition probabilities, which can be used to quantify average sojourn times of clinically important states such as intensive care and invasive ventilation.
Abstract: The clinical progress of patients hospitalized due to COVID-19 is often associated with severe pneumonia which may require intensive care, invasive ventilation, or extracorporeal membrane oxygenation (ECMO). The length of intensive care and the duration of these supportive therapies are clinically relevant outcomes. From the statistical perspective, these quantities are challenging to estimate due to episodes being time-dependent and potentially multiple, as well as being determined by the competing, terminal events of discharge alive and death. We used multistate models to study COVID-19 patients’ time-dependent progress and provide a statistical framework to estimate hazard rates and transition probabilities. These estimates can then be used to quantify average sojourn times of clinically important states such as intensive care and invasive ventilation. We have made two real data sets of COVID-19 patients (n = 24* and n = 53**) and the corresponding statistical code publicly available. The expected lengths of intensive care unit (ICU) stay at day 28 for the two cohorts were 15.05* and 19.62** days, while expected durations of mechanical ventilation were 7.97* and 9.85** days. Predicted mortality stood at 51%* and 15%**. Patients mechanically ventilated at the start of the example studies had a longer expected duration of ventilation (12.25*, 14.57** days) compared to non-ventilated patients (4.34*, 1.41** days) after 28 days. Furthermore, initially ventilated patients had a higher risk of death (54%* and 20%** vs. 48%* and 6%**) after 4 weeks. These results are further illustrated in stacked probability plots for the two groups from time zero, as well as for the entire cohort, which depict the predicted proportions of the patients in each state over follow-up. The multistate approach gives important insights into the progress of COVID-19 patients in terms of ventilation duration, length of ICU stay, and mortality. In addition to avoiding frequent pitfalls in survival analysis, the methodology enables active cases to be analyzed by allowing for censoring. The stacked probability plots provide extensive information in a concise manner that can be easily conveyed to decision makers regarding healthcare capacities. Furthermore, clear comparisons can be made among different baseline characteristics.
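
The notion of an expected sojourn time can be illustrated with a toy two-hazard model: with constant competing hazards of weaning and death, expected ventilation days up to a horizon follow directly by simulation. The hazard values below are hypothetical, not the paper's estimates:

```python
# Sketch: expected ventilation duration under constant competing hazards.
import numpy as np

rng = np.random.default_rng(0)
h_off = 0.08    # hypothetical hazard: ventilation -> ventilator-free (per day)
h_die = 0.02    # hypothetical hazard: ventilation -> death (per day)
horizon = 28.0  # days, as in the 28-day summaries above

t_off = rng.exponential(1 / h_off, 100_000)
t_die = rng.exponential(1 / h_die, 100_000)
duration = np.minimum(np.minimum(t_off, t_die), horizon)

print(f"expected ventilation days by day 28: {duration.mean():.2f}")
print(f"P(death while ventilated, by day 28): "
      f"{np.mean(t_die < np.minimum(t_off, horizon)):.2%}")
```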

67 citations


Journal ArticleDOI
TL;DR: BMC Medical Research Methodology contributes to this global endeavour by setting up a collection of articles called “Methodologies for COVID-19 research and data analysis”; as Guest Editors, the authors offer their views on methodological challenges where researchers can help.
Abstract: Editorial On March 11, 2020, the World Health Organization (WHO) declared that COVID-19 can be characterized as a pandemic [1]. The disease is caused by the novel coronavirus SARS-CoV-2, which rapidly overwhelmed the entire world. The virus was first described in China in December 2019, in early January it was already characterized, and already on January 30, 2020, the outbreak was declared a Public Health Emergency of International Concern, which later evolved into a pandemic [1]. Devastating and unpredictable spread of COVID-19 throughout the world has caused unprecedented global lockdowns and immense burden for healthcare systems. The WHO called for immediate research actions including “immediately assess available data to learn what standard of care approaches are the most effective” and “evaluate as fast as possible the effect of adjunctive and supportive therapies” [1]. This pandemic is now an enormous challenge for researchers, clinicians, health-care workers, epidemiologists and decision-makers. BMC Medical Research Methodology would like to contribute to this global endeavour by setting up a collection of articles called “Methodologies for COVID-19 research and data analysis”. As Guest Editors of the Collection, we would like to offer our views regarding methodological challenges where researchers can help.

65 citations


Journal ArticleDOI
TL;DR: The growth of the early COVID-19 medical literature is characterised using evidence maps and bibliometric analyses to elicit cross-sectional and longitudinal trends and systematically identify gaps.
Abstract: Since the beginning of the COVID-19 outbreak in December 2019, a substantial body of COVID-19 medical literature has been generated. As of June 2020, gaps and longitudinal trends in the COVID-19 medical literature remain unidentified, despite potential benefits for research prioritisation and policy setting in both the COVID-19 pandemic and future large-scale public health crises. In this paper, we searched PubMed and Embase for medical literature on COVID-19 between 1 January and 24 March 2020. We characterised the growth of the early COVID-19 medical literature using evidence maps and bibliometric analyses to elicit cross-sectional and longitudinal trends and systematically identify gaps. The early COVID-19 medical literature originated primarily from Asia and focused mainly on clinical features and diagnosis of the disease. Many areas of potential research remain underexplored, such as mental health, the use of novel technologies and artificial intelligence, pathophysiology of COVID-19 within different body systems, and indirect effects of COVID-19 on the care of non-COVID-19 patients. Few articles involved research collaboration at the international level (2.47%). The median submission-to-publication duration was 8 days (interquartile range: 4–16). Although in its early phase, COVID-19 research has generated a large volume of publications. However, there are still knowledge gaps yet to be filled and areas for improvement for the global research community. Our analysis of early COVID-19 research may be valuable in informing research prioritisation and policy planning both in the current COVID-19 pandemic and similar global health crises.

Journal ArticleDOI
TL;DR: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR.
Abstract: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions. To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM). Both missForest and CALIBERrfimpute have high predictive accuracy but missForest can produce severely biased regression coefficient estimates and downward biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction. RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
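
A missForest-style imputer can be approximated in scikit-learn by plugging a random forest into IterativeImputer; PMM itself is not available in scikit-learn. A minimal sketch with a skewed, simulated covariate:

```python
# Sketch: random-forest-based imputation similar in spirit to missForest
# (which the paper evaluates). Data are simulated; missingness is MCAR here
# for brevity, whereas the paper studies outcome-dependent MAR.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x1 = rng.exponential(1.0, n)            # highly skewed covariate
x2 = 0.5 * x1 + rng.normal(0, 1, n)
X = np.column_stack([x1, x2])
mask = rng.random(n) < 0.3              # 30% of x1 set missing
X_miss = X.copy()
X_miss[mask, 0] = np.nan

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_imp = imp.fit_transform(X_miss)
print("bias in imputed x1 mean:", X_imp[mask, 0].mean() - X[mask, 0].mean())
```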

Journal ArticleDOI
TL;DR: This paper uses a framework to clarify some concepts in prognostic research that remain poorly understood and implemented, to stimulate discussion about how prognostic studies can be strengthened and appropriately interpreted.
Abstract: Prognostic research has many important purposes, including (i) describing the natural history and clinical course of health conditions, (ii) investigating variables associated with health outcomes of interest, (iii) estimating an individual’s probability of developing different outcomes, (iv) investigating the clinical application of prediction models, and (v) investigating determinants of recovery that can inform the development of interventions to improve patient outcomes. But much prognostic research has been poorly conducted and interpreted, indicating that a number of conceptual areas are often misunderstood. Recent initiatives to improve this include the Prognosis Research Strategy (PROGRESS) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement. In this paper, we aim to show how different categories of prognostic research relate to each other, to differentiate exploratory and confirmatory studies, to discuss moderators and mediators, and to show how important it is to understand study designs and the differences between prediction and causation. We propose that there are four main objectives of prognostic studies: description, association, prediction and causation. By causation, we mean the effect of prediction and decision rules on outcomes as determined by intervention studies, and the investigation of whether a prognostic factor is a determinant of outcome (on the causal pathway). These either fall under the umbrella of exploratory (description, association, and prediction model development) or confirmatory (prediction model external validation and investigation of causation) studies. Including considerations of causation within a prognostic framework provides a more comprehensive roadmap of how different types of studies conceptually relate to each other, and better clarity about appropriate model performance measures and the inferences that can be drawn from different types of prognostic studies. We also propose definitions of ‘candidate prognostic factors’, ‘prognostic factors’, ‘prognostic determinants (causal)’ and ‘prognostic markers (non-causal)’. Furthermore, we address common conceptual misunderstandings related to study design, analysis, and interpretation of multivariable models from the perspectives of association, prediction and causation. This paper uses a framework to clarify some concepts in prognostic research that remain poorly understood and implemented, to stimulate discussion about how prognostic studies can be strengthened and appropriately interpreted.

Journal ArticleDOI
Riko Kelter1
TL;DR: An extensive simulation study is conducted to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student's (and Welch’s) t-test.
Abstract: The replication crisis hit the medical sciences about a decade ago, but today still most of the flaws inherent in null hypothesis significance testing (NHST) have not been solved. While the drawbacks of p-values have been detailed in endless venues, for clinical research, only a few attractive alternatives have been proposed to replace p-values and NHST. Bayesian methods are one of them, and they are gaining increasing attention in medical research, as some of their advantages include the description of model parameters in terms of probability, as well as the incorporation of prior information in contrast to the frequentist framework. While Bayesian methods are not the only remedy to the situation, there is an increasing agreement that they are an essential way to avoid common misconceptions and false interpretation of study results. The requirements necessary for applying Bayesian statistics have transitioned from detailed programming knowledge into simple point-and-click programs like JASP. Still, the multitude of Bayesian significance and effect measures which contrast the gold standard of significance in medical research, the p-value, causes a lack of agreement on which measure to report. Therefore, in this paper, we conduct an extensive simulation study to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution. In it, we analyse the behaviour of these measures for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student’s (and Welch’s) t-test. The results show that some measures cannot state evidence for both the null and the alternative. While the different indices behave similarly regarding increasing sample size and noise, the prior modelling influences the obtained results and extreme priors allow for cherry-picking similar to p-hacking in the frequentist paradigm. The indices behave quite differently regarding their ability to control the type I error rates and regarding their ability to detect an existing effect. Based on the results, two of the commonly used indices can be recommended for more widespread use in clinical and biomedical research, as they improve the type I error control compared to the classic two-sample t-test and enjoy multiple other desirable properties.
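
Two posterior indices of the kind compared in the paper, the probability of direction and the ROPE proportion, can be sketched under a simplified normal approximation to the posterior of a mean difference (simulated data, flat prior, known-variance approximation); the ROPE bounds are an arbitrary choice for illustration:

```python
# Sketch: probability of direction (pd) and ROPE proportion from an
# approximate normal posterior for a two-sample mean difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.4, 1.0, 50)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
posterior = stats.norm(loc=diff, scale=se)  # approximate posterior of the difference

pd_index = posterior.sf(0) if diff > 0 else posterior.cdf(0)
rope = posterior.cdf(0.1) - posterior.cdf(-0.1)  # ROPE = [-0.1, 0.1], say
print(f"probability of direction: {pd_index:.3f}")
print(f"posterior mass in ROPE: {rope:.3f}")
```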

Journal ArticleDOI
TL;DR: The objective of this e(lectronic)-Delphi study was to determine minimum standards for emergency departments in the Netherlands; the authors advise those considering a Delphi study to follow the CREDES guideline, consider a two-part design, invest in the personal commitment of the panellists, set clear decision rules, use a consistent layout and send out reminders early.
Abstract: A proper application of the Delphi technique is essential for obtaining valid research results. Medical researchers regularly use Delphi studies, but reports often lack detailed information on methodology and controlled feedback; in the medical literature, papers focusing on Delphi methodology issues are rare. Since the introduction of electronic surveys, details on response times remain scarce. We aim to bridge a number of gaps by providing a real-world example covering methodological choices and response times in detail. The objective of our e(lectronic)-Delphi study was to determine minimum standards for emergency departments (EDs) in the Netherlands. We opted for a two-part design with explicit decision rules. Part 1 focused on gathering and defining items; Part 2 addressed the main research question using an online survey tool. A two-person consensus rule was applied throughout: even after consensus on specific items was reached, panellists could reopen the discussion as long as at least two panellists argued similarly. Per round, the number of reminders sent and individual response times were noted. We also recorded the methodological considerations and evaluations made by the research team prior to as well as during the study. The study was performed in eight rounds and an additional confirmation round. Response rates were 100% in all rounds, resulting in 100% consensus in Part 1 and 96% consensus in Part 2. Our decision rules proved to be stable and easily applicable. Items with negative advice required more rounds before consensus was reached. Response delays were mostly due to late starts, but once panellists started, they nearly always finished the questionnaire on the same day. Reminders often yielded rapid responses. Intra-individual differences in response time were large, but quick responders remained quick. We advise those considering a Delphi study to follow the CREDES guideline, consider a two-part design, invest in the personal commitment of the panellists, set clear decision rules, use a consistent layout and send out reminders early. Adopting this overall approach may assist researchers in future Delphi studies and may help to improve the quality of Delphi designs in terms of improved rigor and higher response rates.

Journal ArticleDOI
TL;DR: An overview of key aspects of methodological studies (what they are, and when, how and why they are done) is provided, with multiple examples to help guide researchers interested in conducting methodological studies.
Abstract: Methodological studies – studies that evaluate the design, analysis or reporting of other research-related reports – play an important role in health research. They help to highlight issues in the conduct of research with the aim of improving health research methodology, and ultimately reducing research waste. We provide an overview of some of the key aspects of methodological studies such as what they are, and when, how and why they are done. We adopt a “frequently asked questions” format to facilitate reading this paper and provide multiple examples to help guide researchers interested in conducting methodological studies. Some of the topics addressed include: is it necessary to publish a study protocol? How to select relevant research reports and databases for a methodological study? What approaches to data extraction and statistical analysis should be considered when conducting a methodological study? What are potential threats to validity and is there a way to appraise the quality of methodological studies? Appropriate reflection and application of basic principles of epidemiology and biostatistics are required in the design and analysis of methodological studies. This paper provides an introduction for further discussion about the conduct of methodological studies.

Journal ArticleDOI
TL;DR: Tools should be available to mandate prospective protocol registration of systematic reviews, and awareness of the benefits of protocol registration should be raised among researchers, according to this survey of SR/MA authors' awareness, obstacles, and opinions.
Abstract: Although protocol registration of systematic reviews/meta-analyses (SR/MAs) is still not mandatory, it is highly recommended that authors publish their SR/MA protocols prior to submitting their manuscripts for publication, as recommended by the Cochrane guidelines for conducting SR/MAs. Our aim was to assess the awareness, obstacles, and opinions of SR/MA authors about the protocol registration process. This cross-sectional survey study included authors who published SR/MAs during the period from 2010 to 2016, who were contacted for participation in our survey. They were identified through a literature search of SR/MAs in the Scopus database. An online questionnaire was sent to each participant via e-mail after receiving their approval to join the study. We sent 6650 emails and received 275 responses. A total of 270 author responses were complete and included in the final analysis. Our results show that PROSPERO was the most common database used for protocol registration (71.3%). The registration-to-acceptance time interval in PROSPERO was less than 1 month in 99.1% of cases. Almost half of the authors (44.2%) did not register their protocols prior to publishing their SR/MAs; the most commonly reported reason (44.9%) was, in their opinion, that authors lack knowledge of the importance of protocol registration and of the expectation that protocols be registered. A significant percentage of respondents (37.4%) believed that people would steal their ideas from protocol databases, while only 5.3% reported that their SR/MA had been stolen. However, the majority (72.9%) of participants agreed that protocol registries play a role in preventing unnecessary duplication of reviews. Finally, 37.4% of participants agreed that SR/MA protocol registration should be mandatory. About half of the participants believe that the main reason for not registering protocols is a lack of knowledge concerning the obligation and importance of registering SR/MA protocols in advance. Therefore, tools should be available to mandate prospective protocol registration of SRs, and awareness of the benefits of protocol registration should be raised among researchers.

Journal ArticleDOI
TL;DR: Early articles on COVID-19 were predominantly retrospective case reports and modeling studies; Chinese scholars had a head start in reporting on the new disease, but publishing articles in Chinese may limit their global reach.
Abstract: The research community reacted rapidly to the emergence of COVID-19. We aimed to assess characteristics of journal articles, preprint articles, and registered trial protocols about COVID-19 and its causal agent SARS-CoV-2. We analyzed characteristics of journal articles with original data indexed by March 19, 2020, in the World Health Organization (WHO) COVID-19 collection, and of articles published on the preprint servers medRxiv and bioRxiv by April 3, 2020. Additionally, we assessed characteristics of clinical trials indexed in the WHO International Clinical Trials Registry Platform (WHO ICTRP) by April 7, 2020. Among the first 2118 articles on COVID-19 published in scholarly journals, 533 (25%) contained original data. The majority were published by authors from China (75%) and funded by Chinese sponsors (75%); a quarter were published in the Chinese language. Among 312 articles that self-reported study design, the most frequent were retrospective studies (N = 88; 28%) and case reports (N = 86; 28%), analyzing patients’ characteristics (38%). The median Journal Impact Factor of journals where articles were published was 5.099. Among 1088 analyzed preprint articles, the majority came from authors affiliated in China (51%) and were funded by sources in China (46%). Less than half reported study design; the majority were modeling studies (62%) and analyzed transmission/risk/prevalence (43%). Of the 927 analyzed registered trials, the majority were interventional (58%). Half were already recruiting participants. The location for the conduct of the trial in the majority was China (N = 522; 63%). The median number of planned participants was 140 (range: 1 to 15,000,000). Registered intervention trials used highly heterogeneous primary outcomes and tested highly heterogeneous interventions; the most frequently studied interventions were hydroxychloroquine (N = 39; 7.2%) and chloroquine (N = 16; 3%). Early articles on COVID-19 were predominantly retrospective case reports and modeling studies. The diversity of outcomes used in intervention trial protocols indicates the urgent need for defining a core outcome set for COVID-19 research. Chinese scholars had a head start in reporting about the new disease, but publishing articles in Chinese may limit their global reach. Mapping publications with original data can help find gaps that will help us respond better to the new public health emergency.

Journal ArticleDOI
TL;DR: This study investigated a documented translation method that includes the careful specification of descriptions of item intents, and demonstrated how documented data from the TIP contributes evidence to a validity argument for construct equivalence between translated and source language PROMs.
Abstract: Cross-cultural research with patient-reported outcome measures (PROMs) assumes that the PROM in the target language will measure the same construct in the same way as the PROM in the source language. Yet translation methods are rarely used to qualitatively maximise construct equivalence or to describe the intents of each item to support common understanding within translation teams. This study aimed to systematically investigate the utility of the Translation Integrity Procedure (TIP), in particular the use of item intent descriptions, to maximise construct equivalence during the translation process, and to demonstrate how documented data from the TIP contribute evidence to a validity argument for construct equivalence between translated and source-language PROMs. Analysis of secondary data was conducted on routinely collected data in TIP Management Grids of translations (n = 9) of the Health Literacy Questionnaire (HLQ) that took place between August 2014 and August 2015: Arabic, Czech, French (Canada), French (France), Hindi, Indonesian, Slovak, Somali and Spanish (Argentina). Two researchers first independently and deductively coded the data to nine common types of translation errors; a second round of coding added a 10th code identified during round one. Coded data were compared for discrepancies and, when needed, checked with a third researcher for final code allocation. Across the nine translations, 259 changes were made to provisional forward translations and were coded into 10 types of errors. The most frequently coded errors were Complex word or phrase (n = 99), Semantic (n = 54) and Grammar (n = 27); the least frequently coded were Cultural errors (n = 7) and Printed errors (n = 5). To advance PROM validation practice, this study investigated a documented translation method that includes the careful specification of descriptions of item intents. Assumptions that translated PROMs have construct equivalence between linguistic contexts can be incorrect due to errors in translation. Of particular concern was the use of high-level complex words by translators, which, if undetected, could cause flawed interpretation of data from people with low literacy. Item intent descriptions can support translations to maximise construct equivalence, and documented translation data can contribute evidence to justify score interpretation and use of translated PROMs in new linguistic contexts.

Journal ArticleDOI
TL;DR: A cohort study of children that selects a sample of schools, then selects students within schools, and conducts multiple measurements over time in the same students, would be a 3-level dataset: with school as the highest level, student as a lower level, and time-point as the lowest level.
Abstract: Background: Researchers have been utilizing linear mixed models (LMMs) for different hierarchical study designs and under different names, which emphasizes the need for a standard in reporting such models [1, 2]. Mixed-effects models, multilevel data, contextual analysis, hierarchical studies, longitudinal studies, panel data and repeated-measures designs are some of the different names used when referring to study designs and/or analytical tools for correlated data. In addition, there is usually no distinction made between having a data structure that is multilevel and having a research question that requires a multilevel analysis. There are multiple excellent tutorials on multilevel analyses [3–5]. However, there is inconsistency in how the results of LMMs are reported in the literature [6]. Casals et al. conducted a systematic review of how various LMMs were reported in the medical literature, and found that important aspects were not reported in most cases [6]. As an example, a cohort study of children that selects a sample of schools, then selects students within schools, and conducts multiple measurements over time in the same students, would yield a 3-level dataset: with school as the highest level (Level 3), student as a lower level (Level 2), and time-point as the lowest level (Level 1). Repeated measurements of a variable over time within a student are likely to be similar, i.e. positively correlated. Also, values of a variable measured on students of a particular school may be more similar to each other than to the values of students from other schools.
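
The school/student/time example can be written down directly; a minimal sketch in statsmodels, with the student level entering as a variance component nested within school (data simulated, variable names hypothetical):

```python
# Sketch: the 3-level model above (time within student within school)
# fitted with statsmodels MixedLM.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
schools, students, times = 10, 10, 4
df = pd.DataFrame(
    [(s, f"{s}-{i}", t) for s in range(schools)
     for i in range(students) for t in range(times)],
    columns=["school", "student", "time"],
)
school_eff = rng.normal(0, 1, schools)                      # Level-3 effects
student_eff = {sid: rng.normal(0, 1) for sid in df["student"].unique()}
df["score"] = (50 + 2 * df["time"]
               + df["school"].map(lambda s: school_eff[s])  # Level 3
               + df["student"].map(student_eff)             # Level 2
               + rng.normal(0, 2, len(df)))                 # Level 1 residual

# groups = school (Level 3); students enter as a variance component nested
# within school (Level 2); repeated time points form the residual level.
model = smf.mixedlm("score ~ time", df, groups=df["school"],
                    re_formula="1", vc_formula={"student": "0 + C(student)"})
print(model.fit().summary())
```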

Journal ArticleDOI
TL;DR: It is concluded that a web-survey might be a feasible and good alternative in surveys targeting people in the retirement age range; however, without offering a paper questionnaire, a small but important group will likely be missed, with potentially biased estimates as the result.
Abstract: Web-surveys are increasingly used in population studies. Yet, web-surveys targeting older individuals are still uncommon for various reasons. However, with younger cohorts approaching older age, the potential for web-surveys among older people may improve. In this study, we investigated response patterns in a web-survey targeting older adults and the potential importance of offering a paper questionnaire as an alternative to the web questionnaire. We analyzed data from three waves of a retirement study, in which a web-push methodology was used and a paper questionnaire was offered as an alternative to the web questionnaire in the last reminder. We mapped the response patterns, compared web- and paper respondents and compared different key outcomes resulting from the sample with and without the paper respondents, both at baseline and after two follow-ups. Paper respondents, that is, those who did not answer until they got a paper questionnaire with the last reminder, were more likely to be female, retired, single, and to report a lower level of education, higher levels of depression and lower self-reported health, compared to web respondents. The association between retirement status and depression was only present among web respondents. The differences between web and paper respondents were stronger in the longitudinal sample (after two follow-ups) than at baseline. We conclude that a web-survey might be a feasible and good alternative in surveys targeting people in the retirement age range. However, without offering a paper questionnaire, a small but important group will likely be missed, with potentially biased estimates as the result.

Journal ArticleDOI
TL;DR: Evidence is provided that consensus depends on the rating scale and the consensus threshold within one population; the three-point scale proved to be the most reasonable choice, as its translation into the clinical context is the most straightforward among the scales.
Abstract: Consensus-orientated Delphi studies are increasingly used in various areas of medical research, with a variety of different rating scales and criteria for reaching consensus. We explored the influence of three different rating scales and different consensus criteria on the results for reaching consensus, and assessed the test-retest reliability of these scales, within a study aimed at identifying global treatment goals for total knee arthroplasty (TKA). We conducted a two-stage study consisting of two surveys and consecutively included patients scheduled for TKA from five German hospitals. Patients were asked to rate 19 potential treatment goals on different rating scales (three-point, five-point, nine-point). Surveys were conducted within a 2-week period prior to TKA; the order of questions (scales and treatment goals) was randomized. Eighty patients (mean age 68 ± 10 years; 70% females) completed both surveys. The different rating scales led to different consensus results despite moderate to high correlations between the scales (r = 0.65 to 0.74). Final consensus was highly influenced by the choice of rating scale, with 14 (three-point), 6 (five-point) and 15 (nine-point) of 19 treatment goals reaching the pre-defined 75% consensus threshold. The number of goals reaching consensus also varied strongly between rating scales for other consensus thresholds. Overall, concordance differed between the three-point (percent agreement [p] = 88.5%, weighted kappa [k] = 0.63), five-point (p = 75.3%, k = 0.47) and nine-point scales (p = 67.8%, k = 0.78). This study provides evidence that consensus depends on the rating scale and consensus threshold within one population. The test-retest reliability of the three rating scales investigated differs substantially between individual treatment goals. This variation in reliability can become a potential source of bias in consensus studies. In our setting, aimed at capturing patients’ treatment goals for TKA, the three-point scale proved to be the most reasonable choice, as its translation into the clinical context is the most straightforward among the scales. Researchers conducting Delphi studies should be aware that final consensus is substantially influenced by the choice of rating scale and consensus criteria.
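
The concordance statistics reported here, percent agreement and weighted kappa, can be illustrated on hypothetical re-rating data; weighted kappa is available in scikit-learn:

```python
# Sketch: test-retest concordance on an ordinal 3-point scale (made-up data).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
first = rng.integers(1, 4, 80)                  # first-survey ratings, 1..3
noise = rng.choice([0, 0, 0, 1, -1], 80)        # mostly stable re-ratings
second = np.clip(first + noise, 1, 3)           # second-survey ratings

agreement = np.mean(first == second)
kappa = cohen_kappa_score(first, second, weights="linear")
print(f"percent agreement: {agreement:.1%}, weighted kappa: {kappa:.2f}")
```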

Journal ArticleDOI
TL;DR: A blockchain-based framework for CT data management is proposed, using Ethereum smart contracts with IPFS as the file storage system to automate processes and information exchange among CT stakeholders; although tested on Ethereum, it can be deployed in private blockchain networks using their native smart contract technologies.
Abstract: Clinical Trials (CTs) help in testing and validating the safety and efficacy of newly discovered drugs on specific patient population cohorts. However, these trials usually experience many challenges, such as extensive time frames, high financial cost, regulatory and administrative barriers, and insufficient workforce. In addition, CTs face several data management challenges pertaining to protocol compliance, patient enrollment, transparency, traceability, data integrity, and selective reporting. Blockchain can potentially address such challenges because of its intrinsic features and properties. Although existing literature broadly discusses the applicability of blockchain-based solutions for CTs, only a few studies present a working proof-of-concept. We propose a blockchain-based framework for CT data management, using Ethereum smart contracts, which employs IPFS as the file storage system to automate processes and information exchange among CT stakeholders. CT documents stored in IPFS are difficult to tamper with, as they are given unique cryptographic hashes. We present algorithms that capture various stages of CT data management. We develop the Ethereum smart contract using the Remix IDE and validate it under different scenarios. The proposed framework is advantageous to all stakeholders, ensuring transparency, data integrity, and protocol compliance. Although the proposed solution is tested on the Ethereum blockchain platform, it can be deployed in private blockchain networks using their native smart contract technologies. We make our smart contract code publicly available on GitHub. We conclude that the proposed framework can be highly effective in ensuring that the trial abides by the protocol and that functions are executed only by stakeholders who have been given permission. It also assures data integrity and promotes transparency and traceability of information among stakeholders.
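
The tamper-evidence property rests on content addressing: any change to a stored document changes its hash. Real IPFS identifiers use multihash-encoded CIDs; the sketch below uses plain SHA-256 only to illustrate the principle, with a made-up document:

```python
# Sketch: content hashing makes stored trial documents tamper-evident.
import hashlib

document = b"Protocol v1.0: inclusion criteria ..."   # hypothetical CT document
fingerprint = hashlib.sha256(document).hexdigest()    # e.g., anchored on-chain

# Later, anyone can re-hash the retrieved file and compare:
tampered = document + b" (silently edited)"
assert hashlib.sha256(document).hexdigest() == fingerprint
assert hashlib.sha256(tampered).hexdigest() != fingerprint
print("any edit changes the fingerprint:", fingerprint[:16], "...")
```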

Journal ArticleDOI
TL;DR: Using a systematic approach, recruiting a broad network of collaborators and implementing automated methods, Epistemonikos is developed, a one-stop shop for systematic reviews relevant for health decision making.
Abstract: Systematic reviews allow health decisions to be informed by the best available research evidence. However, their number is proliferating quickly, and many skills are required to identify all the relevant reviews for a specific question. We screen 10 bibliographic databases on a daily or weekly basis to identify systematic reviews relevant for health decision-making. Using a machine-based approach developed for this project, we select reviews, which are then validated by a network of more than 1000 collaborators. After screening over 1,400,000 records we have identified more than 300,000 systematic reviews, which are now stored in a single place and accessible through an easy-to-use search engine. This makes Epistemonikos the largest database of its kind. Using a systematic approach, recruiting a broad network of collaborators and implementing automated methods, we developed a one-stop shop for systematic reviews relevant for health decision making.

Journal ArticleDOI
TL;DR: A modified or stop-screening approach once a true recall @ 95% is achieved appears to be a valid method for rapid reviews, and perhaps systematic reviews, but needs further evaluation in prospective reviews using the estimated recall.
Abstract: Systematic reviews often require substantial resources, partially due to the large number of records identified during searching. Although artificial intelligence may not be ready to fully replace human reviewers, it may accelerate and reduce the screening burden. Using DistillerSR (May 2020 release), we evaluated the performance of the prioritization simulation tool to determine the reduction in screening burden and time savings. Response sets from 10 completed systematic reviews were used to evaluate, at a true recall @ 95%: (i) the reduction of screening burden; (ii) the accuracy of the prioritization algorithm; and (iii) the hours saved when a modified screening approach was implemented. To account for variation in the simulations, and to introduce randomness (through shuffling the references), 10 simulations were run for each review. Means, standard deviations, medians and interquartile ranges (IQR) are presented. Among the 10 systematic reviews, using true recall @ 95% there was a median reduction in screening burden of 47.1% (IQR: 37.5 to 58.0%). A median of 41.2% (IQR: 33.4 to 46.9%) of the excluded records needed to be screened to achieve true recall @ 95%. The median title/abstract screening hours saved using a modified screening approach at a true recall @ 95% was 29.8 h (IQR: 28.1 to 74.7 h). This increased to a median of 36 h (IQR: 32.2 to 79.7 h) when considering the time saved by not retrieving and screening the full texts of the remaining 5% of records not yet identified as included at title/abstract. Among the 100 simulations (10 per review), none of these 5% of records was a final included study in the systematic review. The reduction in screening burden to achieve true recall @ 95% compared to @ 100% resulted in a median reduced screening burden of 40.6% (IQR: 38.3 to 54.2%). The prioritization tool in DistillerSR can reduce screening burden. A modified or stop-screening approach once a true recall @ 95% is achieved appears to be a valid method for rapid reviews, and perhaps systematic reviews, but this needs to be further evaluated in prospective reviews using the estimated recall.
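
The screening-burden calculation at a given recall level is straightforward to reproduce on a ranked reference list; the ranking below is synthetic, standing in for a prioritization tool's output:

```python
# Sketch: screening burden at "true recall @ 95%" from a ranked list.
import numpy as np

rng = np.random.default_rng(0)
n, n_incl = 2000, 100
labels = np.zeros(n, dtype=int)
labels[:n_incl] = 1
# A prioritisation tool tends to rank includes early; emulate with noisy scores.
scores = labels * 2.0 + rng.normal(0, 1, n)
ranked = labels[np.argsort(-scores)]             # labels in screening order

target = int(np.ceil(0.95 * n_incl))             # includes needed for 95% recall
cum_incl = np.cumsum(ranked)
k = int(np.searchsorted(cum_incl, target) + 1)   # records screened to reach it

print(f"screened {k}/{n} records ({k/n:.1%}) to find {target}/{n_incl} includes")
print(f"screening burden reduction: {1 - k/n:.1%}")
```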

Journal ArticleDOI
TL;DR: A Shannon transform of the P-value p, also known as the binary surprisal or S-value s = −log2(p), is used to provide a measure of the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing.
Abstract: Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and P-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review some simple methods to aid researchers in interpreting statistical outputs. These methods emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. We use the Shannon transform of the P-value p, also known as the binary surprisal or S-value s = −log2(p), to provide a measure of the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret P-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. In line with other recent recommendations, we advise that teaching materials and research reports discuss P-values as measures of compatibility rather than significance, compute P-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.
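
A quick way to build intuition for the transform is to compute S-values for a few familiar P-values; the coin-tossing calibration follows the abstract, while the particular P-values below are arbitrary:

```python
from math import log2

# S-value (binary surprisal) of a P-value: s = -log2(p). An S-value of
# s bits is as surprising as getting s heads in a row from a fair coin.
for p in (0.50, 0.25, 0.05, 0.005):
    s = -log2(p)
    print(f"p = {p:>5}: s = {s:.1f} bits")

# p = 0.05 gives s of roughly 4.3 bits: barely more surprising than
# four heads in a row.
```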

Journal ArticleDOI
TL;DR: The results of the simulation studies demonstrated that RQRs have low type I error and greater statistical power than other residuals for detecting many forms of model misspecification in count regression models (non-linearity in a covariate effect, over-dispersion, and zero inflation).
Abstract: Examining residuals is a crucial step in statistical analysis to identify discrepancies between models and data and to assess overall goodness-of-fit. In diagnosing normal linear regression models, both Pearson and deviance residuals are often used; these are approximately standard normally distributed when the model fits the data adequately. However, when the response variable is discrete, these residuals are distributed far from normality and form nearly parallel curves corresponding to the distinct discrete response values, posing great challenges for visual inspection. Randomized quantile residuals (RQRs) were proposed by Dunn and Smyth (1996) to circumvent these problems with traditional residuals. However, the approach has not gained popularity, partly due to the lack of simulation studies investigating its performance for count regression, including zero-inflated models. We therefore assessed the normality of RQRs and compared their performance with that of traditional residuals for diagnosing count regression models through a series of simulation studies. A real data analysis in a health care utilization study, modeling the number of repeated emergency department visits, is also presented. Our simulation studies demonstrated that RQRs have low type I error and greater statistical power than other residuals for detecting many forms of model misspecification in count regression models (non-linearity in a covariate effect, over-dispersion, and zero inflation). Our real data analysis also showed that RQRs are effective in detecting misspecified distributional assumptions for count regression models. These results provide further evidence of the advantages of RQRs for diagnosing count regression models.
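
The RQR construction itself is short: for a discrete response, draw u uniformly between the fitted CDF just below and at the observed count, then map u through the standard normal quantile function. A minimal sketch on simulated data (the negative-binomial data and deliberate Poisson misfit are hypothetical, chosen to mimic the over-dispersion scenario in the simulations):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: over-dispersed (negative binomial) counts,
# deliberately misfitted with a Poisson GLM.
n = 500
x = rng.uniform(0, 2, n)
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(2, 2 / (2 + mu_true))  # mean = mu_true

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
mu_hat = fit.mu

# Randomized quantile residuals (Dunn & Smyth, 1996): draw u uniformly
# between F(y - 1) and F(y) under the fitted model, then map through
# the standard normal quantile function.
u = rng.uniform(stats.poisson.cdf(y - 1, mu_hat),
                stats.poisson.cdf(y, mu_hat))
rqr = stats.norm.ppf(u)

# If the model were correct, rqr ~ N(0, 1); here a normality test
# (or a QQ-plot) should flag the ignored over-dispersion.
print(stats.shapiro(rqr))
```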

Journal ArticleDOI
TL;DR: Social network analysis of SARS-CoV-2 contact tracing data can be used to measure individual patient-level variation in disease transmission.
Abstract: Contact tracing data from the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic are used to estimate basic epidemiological parameters. Contact tracing data could also potentially be used to assess the heterogeneity of transmission at the individual patient level. Characterizing individuals by their level of infectiousness could better inform contact tracing interventions in the field. Standard social network analysis methods used for exploring infectious disease transmission dynamics were employed to analyze contact tracing data on 1959 diagnosed SARS-CoV-2 patients from a large state of India. A relational network data set was created with diagnosed patients as “nodes” and their epidemiological contacts as “edges”. A directed network perspective was used, in which the direction of infection ran from a “source patient” towards a “target patient”. The network measures of “degree centrality” and “betweenness centrality” were calculated to identify influential patients in the transmission of infection. Component analysis was conducted to identify patients connected as sub-groups. Descriptive statistics were used to summarise network measures, and percentile ranks were used to categorize influencers. Out-degree centrality measures identified that, of the 1959 patients, 11.27% (221) acted as a source of infection for 40.19% (787) of the other patients. Among these source patients, 0.65% (12) had a high out-degree centrality (≥10) and collectively infected 37.61% (296 of 787) of the secondary patients. Betweenness centrality measures highlighted that 7.50% (93) of patients had non-zero betweenness (range 0.5 to 135) and thus bridged transmission between other patients. Network component analysis identified nineteen connected components comprising influential patients, which together accounted for 26.95% of all 1959 patients and 68.74% of the epidemiological contacts in the network. Social network analysis of SARS-CoV-2 contact tracing data can be used to measure individual patient-level variation in disease transmission. The network metrics identified individual patients and patient components that contributed disproportionately to transmission. The network measures and graphical tools could complement existing contact tracing indicators and could help improve contact tracing activities.
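
The measures named here are standard; a minimal networkx sketch on a hypothetical eight-patient edge list (not the study data) computes the same quantities:

```python
import networkx as nx

# Hypothetical contact tracing edges: each pair points from the
# source (infector) to the target (infectee), as in the study design.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
         ("P2", "P5"), ("P4", "P6"), ("P7", "P8")]
G = nx.DiGraph(edges)

# Out-degree: number of secondary cases seeded by each patient.
out_degree = dict(G.out_degree())

# Betweenness centrality: patients bridging separate transmission chains.
betweenness = nx.betweenness_centrality(G)

# Weakly connected components: transmission clusters (sub-groups).
clusters = [sorted(c) for c in nx.weakly_connected_components(G)]

print("out-degree:", out_degree)
print("betweenness:", betweenness)
print("clusters:", clusters)
```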

Journal ArticleDOI
TL;DR: CoCites is an efficient and accurate method for finding relevant related articles; it uses the expert knowledge of authors to rank related articles, does not depend on keyword selection and requires no special expertise to build search queries.
Abstract: We recently developed CoCites, a citation-based search method that is designed to be more efficient than traditional keyword-based methods. The method begins with identification of one or more highly relevant publications (query articles) and consists of two searches: the co-citation search, which ranks publications on their co-citation frequency with the query articles, and the citation search, which ranks publications on the frequency of all citations that cite or are cited by the query articles. We aimed to reproduce the literature searches of published systematic reviews and meta-analyses and assess whether CoCites retrieves all eligible articles while screening fewer titles. A total of 250 reviews were included. CoCites retrieved a median of 75% of the articles that were included in the original reviews. The percentage of retrieved articles was higher (88%) when the query articles were cited more frequently and when they had more overlap in their citations. Applying CoCites to only the highest-cited article yielded similar results. The co-citation and citation searches combined were more efficient when the review authors had screened more than 500 titles, but not when they had screened fewer. CoCites is an efficient and accurate method for finding relevant related articles. The method uses the expert knowledge of authors to rank related articles, does not depend on keyword selection and requires no special expertise to build search queries. The method is transparent and reproducible.
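
The co-citation search reduces to counting how many reference lists each article shares with the query article. A toy Python sketch on hypothetical citation data (not the CoCites corpus):

```python
from collections import Counter

# Hypothetical citation corpus: each citing paper mapped to the set of
# references it cites.
reference_lists = {
    "paper_A": {"query", "r1", "r2"},
    "paper_B": {"query", "r1"},
    "paper_C": {"query", "r2", "r3"},
    "paper_D": {"r3", "r4"},
}

# Co-citation search: rank every article by how often it appears in the
# same reference list as the query article.
cocited = Counter()
for refs in reference_lists.values():
    if "query" in refs:
        cocited.update(refs - {"query"})

for article, freq in cocited.most_common():
    print(article, freq)  # r1: 2, r2: 2, r3: 1
```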

Journal ArticleDOI
TL;DR: Through a literature review using a broad search strategy, complemented by suggestions from a technical expert panel, this work identified regression modeling approaches that assess heterogeneity of treatment effect within a randomized clinical trial.
Abstract: Recent evidence suggests that there is often substantial variation in benefits and harms across a trial population. We aimed to identify regression modeling approaches that assess heterogeneity of treatment effect within a randomized clinical trial (RCT). We performed a literature review using a broad search strategy, complemented by suggestions from a technical expert panel. The approaches fall into 3 categories: 1) risk-based methods (11 papers) use only prognostic factors to define patient subgroups, relying on the mathematical dependency of the absolute risk difference on baseline risk; 2) treatment effect modeling methods (9 papers) use both prognostic factors and treatment effect modifiers to explore characteristics that interact with the effects of therapy on a relative scale, coupling data-driven subgroup identification with approaches to prevent overfitting, such as penalization or the use of separate data sets for subgroup identification and effect estimation; and 3) optimal treatment regime methods (12 papers) focus primarily on treatment effect modifiers to classify the trial population into those who benefit from treatment and those who do not. We also identified papers describing model evaluation methods (4 papers). Three classes of approaches were thus identified to assess heterogeneity of treatment effect. Methodological research, including both simulations and empirical evaluations, is required to compare the available methods in different settings and to derive well-informed guidance for their application in RCT analysis.
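
The baseline-risk dependency underlying the risk-based methods can be seen with a one-line calculation: under a constant relative treatment effect, absolute benefit grows with baseline risk. A toy Python illustration with a hypothetical relative risk of 0.8:

```python
import numpy as np

# With a constant relative risk (hypothetical RR = 0.8), the absolute
# risk difference is proportional to baseline risk, which is the
# dependency that risk-based HTE methods exploit.
rr = 0.8
for p0 in np.linspace(0.05, 0.40, 5):
    ard = p0 - rr * p0  # absolute risk reduction = p0 * (1 - RR)
    print(f"baseline risk {p0:.2f} -> absolute risk reduction {ard:.3f}")
```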