
Showing papers in "Biometrical Journal" in 2016


Journal ArticleDOI
TL;DR: Besides biological plausibility, a quantitative assessment of the strength of evidence for surrogacy requires the demonstration of the prognostic value of the surrogate for the clinical outcome, and evidence that treatment effects on the surrogate reliably predict treatment effects on the clinical outcome.
Abstract: A surrogate endpoint is intended to replace a clinical endpoint for the evaluation of new treatments when it can be measured more cheaply, more conveniently, more frequently, or earlier than that clinical endpoint. A surrogate endpoint is expected to predict clinical benefit, harm, or lack of these. Besides the biological plausibility of a surrogate, a quantitative assessment of the strength of evidence for surrogacy requires the demonstration of the prognostic value of the surrogate for the clinical outcome, and evidence that treatment effects on the surrogate reliably predict treatment effects on the clinical outcome. We focus on these two conditions, and outline the statistical approaches that have been proposed to assess the extent to which these conditions are fulfilled. When data are available from a single trial, one can assess the "individual level association" between the surrogate and the true endpoint. When data are available from several trials, one can additionally assess the "trial level association" between the treatment effect on the surrogate and the treatment effect on the true endpoint. In the latter case, the "surrogate threshold effect" can be estimated as the minimum effect on the surrogate endpoint that predicts a statistically significant effect on the clinical endpoint. All these concepts are discussed in the context of randomized clinical trials in oncology, and illustrated with two meta-analyses in gastric cancer.

86 citations
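
A minimal R sketch of the trial-level association and the surrogate threshold effect described above, using invented trial-level log hazard ratios; the unweighted regression and prediction interval below are illustrative simplifications, not the meta-analytic estimators of the paper.

set.seed(1)
k <- 12                                                # number of hypothetical trials
logHR_S <- rnorm(k, -0.3, 0.2)                         # treatment effects on the surrogate
logHR_T <- 0.1 + 0.9 * logHR_S + rnorm(k, 0, 0.1)      # treatment effects on the true endpoint

fit <- lm(logHR_T ~ logHR_S)                           # trial-level association (unweighted toy fit)
summary(fit)$r.squared                                 # trial-level R-squared

grid <- data.frame(logHR_S = seq(-1, 0, by = 0.005))
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
ste  <- max(grid$logHR_S[pred[, "upr"] < 0])           # smallest surrogate effect whose 95%
exp(ste)                                               # prediction interval excludes the null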


Journal ArticleDOI
TL;DR: In this paper, a mixture of multivariate contaminated normal distributions is developed for model-based clustering, where each cluster has a parameter controlling the proportion of mild outliers and one specifying the degree of contamination.
Abstract: A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, which adds flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all the members of the resulting family are provided. An expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation issues are discussed. Using a large-scale simulation study, the behavior of the proposed approach is investigated and comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.

84 citations
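
As a toy, univariate illustration of the per-cluster contaminated normal model described above (parameter names and values invented), the density and the posterior probability that a point is a mild outlier can be written directly in base R.

dcontnorm <- function(x, mu, sigma, alpha, eta) {      # alpha: outlier proportion, eta > 1: inflation
  (1 - alpha) * dnorm(x, mu, sigma) + alpha * dnorm(x, mu, sqrt(eta) * sigma)
}
p_bad <- function(x, mu, sigma, alpha, eta) {          # posterior probability of being a mild outlier
  bad <- alpha * dnorm(x, mu, sqrt(eta) * sigma)
  bad / ((1 - alpha) * dnorm(x, mu, sigma) + bad)
}
p_bad(c(0.2, 4), mu = 0, sigma = 1, alpha = 0.05, eta = 9)   # the distant point is flagged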


Journal ArticleDOI
TL;DR: A new regression model is introduced by considering the distribution proposed in this article, which is useful for situations where the response is restricted to the standard unit interval and the regression structure involves regressors and unknown parameters.
Abstract: Starting from the Johnson SB distribution pioneered by Johnson, we propose a broad class of distributions with bounded support on the basis of the symmetric family of distributions. The new class of distributions provides a rich source of alternative distributions for analyzing univariate bounded data. A comprehensive account of the mathematical properties of the new family is provided. We briefly discuss estimation of the model parameters of the new class of distributions based on two estimation methods. Additionally, a new regression model is introduced by considering the distribution proposed in this article, which is useful for situations where the response is restricted to the standard unit interval and the regression structure involves regressors and unknown parameters. The regression model allows modeling of both location and dispersion effects. We define two residuals for the proposed regression model to assess departures from model assumptions as well as to detect outlying observations, and discuss some influence methods such as local influence and generalized leverage. Finally, an application to real data is presented to show the usefulness of the new regression model.

46 citations
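
A short R sketch of the Johnson SB building block on the standard unit interval (location 0, scale 1), using the usual SB transform Z = gamma + delta * log(X / (1 - X)); this is the starting point mentioned above, not the new family proposed in the paper.

djohnsonSB <- function(x, gamma, delta) {
  z <- gamma + delta * log(x / (1 - x))
  delta / (x * (1 - x)) * dnorm(z)                     # change-of-variables density
}
rjohnsonSB <- function(n, gamma, delta) {
  z <- rnorm(n)
  1 / (1 + exp(-(z - gamma) / delta))                  # inverse of the SB transform
}
x <- rjohnsonSB(1000, gamma = 0.5, delta = 1.2)
hist(x, freq = FALSE, main = "Johnson SB sample")
curve(djohnsonSB(x, 0.5, 1.2), add = TRUE)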


Journal ArticleDOI
TL;DR: Streamlined mean field variational Bayes algorithms for efficient fitting and inference in large models for longitudinal and multilevel data analysis are obtained, allowing the fastest ever approximate Bayesian analyses of arbitrarily large longitudinal and multilevel datasets.
Abstract: Streamlined mean field variational Bayes algorithms for efficient fitting and inference in large models for longitudinal and multilevel data analysis are obtained. The number of operations is linear in the number of groups at each level, which represents a two orders of magnitude improvement over the naive approach. Storage requirements are also lessened considerably. We treat models for the Gaussian and binary response situations. Our algorithms allow the fastest ever approximate Bayesian analyses of arbitrarily large longitudinal and multilevel datasets, with little degradation in accuracy compared with Markov chain Monte Carlo. The modularity of mean field variational Bayes allows relatively simple extension to more complicated scenarios.

40 citations


Journal ArticleDOI
TL;DR: Different methodologies for analyzing cytogenetic chromosomal aberrations datasets are compared, with special focus on zero-inflated Poisson and zero-inflated negative binomial models.
Abstract: Within the field of cytogenetic biodosimetry, Poisson regression is the classical approach for modeling the number of chromosome aberrations as a function of radiation dose. However, it is common to find data that exhibit overdispersion. In practice, the assumption of equidispersion may be violated due to unobserved heterogeneity in the cell population, which will render the variance of observed aberration counts larger than their mean, and/or the frequency of zero counts greater than expected for the Poisson distribution. This phenomenon is observable for both full- and partial-body exposure, but more pronounced for the latter. In this work, different methodologies for analyzing cytogenetic chromosomal aberrations datasets are compared, with special focus on zero-inflated Poisson and zero-inflated negative binomial models. A score test for testing for zero inflation in Poisson regression models under the identity link is also developed.

40 citations
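
A minimal base-R sketch of a zero-inflated Poisson fit by direct maximum likelihood, with a log-linear dose effect on the Poisson mean and a constant zero-inflation probability; data and parameter values are invented, and the score test developed in the paper is not reproduced.

set.seed(2)
dose <- rep(c(0, 1, 2, 4), each = 100)
y    <- ifelse(runif(400) < 0.15, 0, rpois(400, exp(-0.5 + 0.6 * dose)))   # 15% structural zeros

nll_zip <- function(par) {
  lambda <- exp(par[1] + par[2] * dose)                # Poisson mean, log link
  pi0    <- plogis(par[3])                             # zero-inflation probability, logit scale
  -sum(ifelse(y == 0,
              log(pi0 + (1 - pi0) * exp(-lambda)),
              log(1 - pi0) + dpois(y, lambda, log = TRUE)))
}
fit <- optim(c(0, 0, 0), nll_zip, hessian = TRUE)
fit$par                                                # intercept, dose slope, logit zero-inflation
sqrt(diag(solve(fit$hessian)))                         # approximate standard errors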


Journal ArticleDOI
TL;DR: A case study of natalizumab for the treatment of relapsing remitting multiple sclerosis found structured benefit‐risk analysis to be a useful tool for structuring, quantifying, and communicating the relative benefit and safety profiles of drugs in a transparent, rational and consistent way.
Abstract: While benefit-risk assessment is a key component of the drug development and maintenance process, it is often described in a narrative. In contrast, structured benefit-risk assessment builds on established ideas from decision analysis and comprises a qualitative framework and quantitative methodology. We compare two such frameworks, applying multi-criteria decision analysis (MCDA) within the PrOACT-URL framework, and weighted net clinical benefit (wNCB) within the BRAT framework. These are applied to a case study of natalizumab for the treatment of relapsing remitting multiple sclerosis. We focus on the practical considerations of applying these methods and give recommendations for visual presentation of results. In the case study, we found structured benefit-risk analysis to be a useful tool for structuring, quantifying, and communicating the relative benefit and safety profiles of drugs in a transparent, rational and consistent way. The two frameworks were similar. MCDA is a generic and flexible methodology that can be used to perform a structured benefit-risk assessment in any common context. wNCB is a special case of MCDA and is shown to be equivalent to an extension of the number needed to treat (NNT) principle. It is simpler to apply and understand than MCDA and can be applied when all outcomes are measured on a binary scale.

38 citations
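
A toy R sketch of an MCDA-style weighted sum with linear partial value functions; the criteria, weights, and outcome probabilities are all invented placeholders, not the natalizumab case-study values.

outcomes <- data.frame(
  criterion = c("benefit 1", "benefit 2", "risk 1", "risk 2"),
  weight    = c(0.4, 0.3, 0.2, 0.1),                   # elicited preference weights (sum to 1)
  worst     = c(0.0, 0.0, 0.010, 0.050),               # least preferred probability per criterion
  best      = c(0.8, 0.6, 0.000, 0.000),               # most preferred probability per criterion
  drug      = c(0.67, 0.45, 0.001, 0.020),             # outcome probabilities under treatment
  placebo   = c(0.41, 0.30, 0.000, 0.010))
value <- function(p, worst, best) (p - worst) / (best - worst)   # linear partial value on [0, 1]
with(outcomes, c(drug    = sum(weight * value(drug, worst, best)),
                 placebo = sum(weight * value(placebo, worst, best))))

The treatment with the larger weighted score has the more favorable benefit-risk balance under these invented weights.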


Journal ArticleDOI
TL;DR: Methods that can be used to detect parameter redundancy in discrete state‐space models are developed and it is demonstrated that combining multiple data sets, through the use of an integrated population model, may result in a model in which all parameters are estimable, even though models fitted to the separate data sets may be parameter redundant.
Abstract: Discrete state-space models are used in ecology to describe the dynamics of wild animal populations, with parameters, such as the probability of survival, being of ecological interest. For a particular parametrization of a model it is not always clear which parameters can be estimated. This inability to estimate all parameters is known as parameter redundancy or a model is described as nonidentifiable. In this paper we develop methods that can be used to detect parameter redundancy in discrete state-space models. An exhaustive summary is a combination of parameters that fully specify a model. To use general methods for detecting parameter redundancy a suitable exhaustive summary is required. This paper proposes two methods for the derivation of an exhaustive summary for discrete state-space models using discrete analogues of methods for continuous state-space models. We also demonstrate that combining multiple data sets, through the use of an integrated population model, may result in a model in which all parameters are estimable, even though models fitted to the separate data sets may be parameter redundant.

36 citations
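
The rank-of-the-derivative-matrix idea underlying such checks can be sketched numerically in base R. The exhaustive summary below is a deliberately redundant toy in which survival phi and detection p enter only through their product, not one of the state-space summaries derived in the paper.

exhaustive_summary <- function(theta) {
  phi <- theta[1]; p <- theta[2]
  c(phi * p, (phi * p)^2, (phi * p)^3)                 # every term depends on phi, p only via phi * p
}
jacobian <- function(f, theta, eps = 1e-6) {           # simple forward-difference Jacobian
  sapply(seq_along(theta), function(j) {
    shifted <- theta; shifted[j] <- shifted[j] + eps
    (f(shifted) - f(theta)) / eps
  })
}
D <- jacobian(exhaustive_summary, c(phi = 0.8, p = 0.3))
qr(D, tol = 1e-4)$rank                                 # rank 1 < 2 parameters: parameter redundant
                                                       # (tolerance absorbs finite-difference error)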


Journal ArticleDOI
TL;DR: It is proved that the studentized permutation distribution of the Brunner-Munzel rank statistic is asymptotically standard normal, even under the alternative, incidentally providing the hitherto missing theoretical foundation for the Neubert and Brunner studentized permutation test.
Abstract: We investigate rank-based studentized permutation methods for the nonparametric Behrens-Fisher problem, that is, inference methods for the area under the ROC curve. We prove that the studentized permutation distribution of the Brunner-Munzel rank statistic is asymptotically standard normal, even under the alternative, thus incidentally providing the hitherto missing theoretical foundation for the Neubert and Brunner studentized permutation test. In particular, we not only show its consistency, but also show that confidence intervals for the underlying treatment effects can be computed by inverting this permutation test. In addition, we derive permutation-based range-preserving confidence intervals. Extensive simulation studies show that the permutation-based confidence intervals appear to maintain the preassigned coverage probability quite accurately (even for rather small sample sizes). For a convenient application of the proposed methods, a freely available software package for the statistical software R has been developed. A real data example illustrates the application.

30 citations
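
A base-R sketch of a studentized permutation test built around the Brunner-Munzel rank statistic, in the spirit of the procedure above; the statistic is coded from the usual rank and placement formulas and should be checked against the authors' R package before serious use. Data are simulated.

bm_statistic <- function(x, y) {
  n1 <- length(x); n2 <- length(y); N <- n1 + n2
  r  <- rank(c(x, y))                                  # midranks in the pooled sample
  r1 <- r[1:n1]; r2 <- r[(n1 + 1):N]
  m1 <- mean(r1); m2 <- mean(r2)
  v1 <- sum((r1 - rank(x) - m1 + (n1 + 1) / 2)^2) / (n1 - 1)   # placement variances
  v2 <- sum((r2 - rank(y) - m2 + (n2 + 1) / 2)^2) / (n2 - 1)
  n1 * n2 * (m2 - m1) / N / sqrt(n1 * v1 + n2 * v2)    # studentized rank statistic
}
set.seed(3)
x <- rnorm(15); y <- rnorm(12, mean = 0.8)
obs  <- bm_statistic(x, y)
perm <- replicate(5000, {
  z <- sample(c(x, y))                                 # permute group labels
  bm_statistic(z[1:length(x)], z[-(1:length(x))])
})
mean(abs(perm) >= abs(obs))                            # two-sided permutation p-value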



Journal ArticleDOI
TL;DR: The present paper assumes that the approach provides a robust, transparent, and thus predictable foundation to determine minor, considerable, and major treatment effects on binary outcomes in the early benefit assessment of new drugs in Germany.
Abstract: At the beginning of 2011, the early benefit assessment of new drugs was introduced in Germany with the Act on the Reform of the Market for Medicinal Products (AMNOG). The Federal Joint Committee (G-BA) generally commissions the Institute for Quality and Efficiency in Health Care (IQWiG) with this type of assessment, which examines whether a new drug shows an added benefit (a positive patient-relevant treatment effect) over the current standard therapy. IQWiG is required to assess the extent of added benefit on the basis of a dossier submitted by the pharmaceutical company responsible. In this context, IQWiG was faced with the task of developing a transparent and plausible approach for operationalizing how to determine the extent of added benefit. In the case of an added benefit, the law specifies three main extent categories (minor, considerable, major). To restrict value judgements to a minimum in the first stage of the assessment process, an explicit and abstract operationalization was needed. The present paper is limited to the situation of binary data (analysis of 2 × 2 tables), using the relative risk as an effect measure. For the treatment effect to be classified as a minor, considerable, or major added benefit, the methodological approach stipulates that the (two-sided) 95% confidence interval of the effect must exceed a specified distance to the zero effect. In summary, we assume that our approach provides a robust, transparent, and thus predictable foundation to determine minor, considerable, and major treatment effects on binary outcomes in the early benefit assessment of new drugs in Germany. After a decision on the added benefit of a new drug by G-BA, the classification of added benefit is used to inform pricing negotiations between the umbrella organization of statutory health insurance and the pharmaceutical companies.

27 citations
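
The flavor of such a rule can be sketched in R: compute the two-sided 95% confidence interval for the relative risk from a 2 x 2 table and check which thresholds it clears. The counts and the threshold values below are placeholders, not those specified in the IQWiG operationalization.

rr_ci <- function(e1, n1, e0, n0, level = 0.95) {
  rr <- (e1 / n1) / (e0 / n0)
  se <- sqrt(1 / e1 - 1 / n1 + 1 / e0 - 1 / n0)        # standard error of log(RR)
  z  <- qnorm(1 - (1 - level) / 2)
  c(lower = rr * exp(-z * se), estimate = rr, upper = rr * exp(z * se))
}
ci <- rr_ci(30, 300, 60, 300)                          # invented 2 x 2 counts
thresholds <- c(minor = 0.95, considerable = 0.90, major = 0.85)   # placeholder values
names(which(ci["upper"] < thresholds))                 # extent categories whose threshold is cleared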


Journal ArticleDOI
TL;DR: A variegated comparison of the estimated OR and the estimated POR with the true OR is performed in a single study with two parallel groups without confounders, in data situations where the POR is currently recommended.
Abstract: For the calculation of relative measures such as risk ratio (RR) and odds ratio (OR) in a single study, additional approaches are required for the case of zero events. In the case of zero events in one treatment arm, the Peto odds ratio (POR) can be calculated without continuity correction, and is currently the relative effect estimation method of choice for binary data with rare events. The aim of this simulation study is a variegated comparison of the estimated OR and estimated POR with the true OR in a single study with two parallel groups without confounders in data situations where the POR is currently recommended. This comparison was performed by means of several performance measures, that is the coverage, confidence interval (CI) width, mean squared error (MSE), and mean percentage error (MPE). We demonstrated that the estimator for the POR does not outperform the estimator for the OR for all the performance measures investigated. In the case of rare events, small treatment effects and similar group sizes, we demonstrated that the estimator for the POR performed better than the estimator for the OR only regarding the coverage and MPE, but not the CI width and MSE. For larger effects and unbalanced group size ratios, the coverage and MPE of the estimator for the POR were inappropriate. As in practice the true effect is unknown, the POR method should be applied only with the utmost caution.
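
For a single 2 x 2 table with zero events in one arm, the two estimators compared here can be written out directly in base R (counts invented):

e1 <- 0; n1 <- 100                                     # events and group size, treatment arm
e0 <- 5; n0 <- 100                                     # events and group size, control arm

or_cc <- ((e1 + 0.5) * (n0 - e0 + 0.5)) /
         ((e0 + 0.5) * (n1 - e1 + 0.5))                # OR with 0.5 continuity correction

O <- e1                                                # Peto one-step odds ratio: exp((O - E) / V)
E <- n1 * (e1 + e0) / (n1 + n0)
V <- n1 * n0 * (e1 + e0) * (n1 + n0 - e1 - e0) / ((n1 + n0)^2 * (n1 + n0 - 1))
por    <- exp((O - E) / V)
por_ci <- exp((O - E) / V + c(-1, 1) * qnorm(0.975) / sqrt(V))
c(OR_cc = or_cc, POR = por, POR_low = por_ci[1], POR_upp = por_ci[2])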

Journal ArticleDOI
TL;DR: Common reproducibility issues are identified, and guidelines for structuring code submissions to the Biometrical Journal have been established to help authors and researchers implement RR.
Abstract: Reproducible research (RR) constitutes the idea that a publication should be accompanied by all relevant material to reproduce the results and findings of a scientific work. Hence, results can be verified and researchers are able to build upon these. Efforts of the Biometrical Journal over the last five years have increased the proportion of reproducible manuscripts by a factor of 4, to almost 50%. Yet, more than half of the code submissions could not be executed in the initial review due to missing code, missing data or errors in the code. Careful checks of the submitted code as part of the reviewing process are essential to eliminate these issues and to foster RR. In this article, we reviewed n=56 recent submissions of code and data to identify common reproducibility issues. Based on these findings, guidelines for structuring code submissions to the Biometrical Journal have been established to help authors. These guidelines should help researchers to implement RR in general. Together with the code reviews, this supports the mission of the Biometrical Journal in publishing the highest quality, novel, and relevant papers on statistical methods and their applications in life sciences. Source code and data to reproduce the presented data analyses are available as Supplementary Material on the journal's web page.

Journal ArticleDOI
TL;DR: Kernel density estimation is utilized to numerically solve for an estimate of the optimal cut-off point associated with the Youden index, and it is proved that the estimators based on ranked set sampling are relatively more efficient than those based on simple random sampling and that both estimators are asymptotically unbiased.
Abstract: A diagnostic cut-off point of a biomarker measurement is needed for classifying a random subject to be either diseased or healthy. However, the cut-off point is usually unknown and needs to be estimated by some optimization criteria. One important criterion is the Youden index, which has been widely adopted in practice. The Youden index, which is defined as the maximum of (sensitivity + specificity − 1), directly measures the largest total diagnostic accuracy a biomarker can achieve. Therefore, it is desirable to estimate the optimal cut-off point associated with the Youden index. Sometimes, taking the actual measurements of a biomarker is very difficult and expensive, while ranking them without the actual measurement can be relatively easy. In such cases, ranked set sampling can give more precise estimation than simple random sampling, as ranked set samples are more likely to span the full range of the population. In this study, kernel density estimation is utilized to numerically solve for an estimate of the optimal cut-off point. The asymptotic distributions of the kernel estimators based on two sampling schemes are derived analytically and we prove that the estimators based on ranked set sampling are relatively more efficient than those based on simple random sampling and that both estimators are asymptotically unbiased. Furthermore, the asymptotic confidence intervals are derived. Intensive simulations are carried out to compare the proposed method using ranked set sampling with simple random sampling, with the proposed method outperforming simple random sampling in all cases. A real data set is analyzed for illustrating the proposed method.
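
A simple-random-sampling version of the kernel idea can be sketched in base R; the ranked-set-sampling estimator and its asymptotic results are not reproduced, and the data are simulated.

set.seed(4)
healthy  <- rnorm(200, 0, 1)                           # biomarker in healthy subjects
diseased <- rnorm(150, 1.5, 1.2)                       # biomarker in diseased subjects

kernel_cdf <- function(v, x, h = bw.nrd0(v)) {
  sapply(x, function(xi) mean(pnorm((xi - v) / h)))    # Gaussian-kernel estimate of the CDF
}
grid   <- seq(min(healthy, diseased), max(healthy, diseased), length.out = 500)
youden <- kernel_cdf(healthy, grid) - kernel_cdf(diseased, grid)   # Sp(c) + Se(c) - 1
c(cutoff = grid[which.max(youden)], J = max(youden))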

Journal ArticleDOI
TL;DR: A Bayesian statistical method is presented that directly models the outcomes observed in randomized placebo-controlled trials and uses this to infer indirect comparisons between competing active treatments, suitable for use within the MCDA setting.
Abstract: Quantitative decision models such as multiple criteria decision analysis (MCDA) can be used in benefit-risk assessment to formalize trade-offs between benefits and risks, providing transparency to the assessment process. There is however no well-established method for propagating uncertainty of treatment effects data through such models to provide a sense of the variability of the benefit-risk balance. Here, we present a Bayesian statistical method that directly models the outcomes observed in randomized placebo-controlled trials and uses this to infer indirect comparisons between competing active treatments. The resulting treatment effects estimates are suitable for use within the MCDA setting, and it is possible to derive the distribution of the overall benefit-risk balance through Markov Chain Monte Carlo simulation. The method is illustrated using a case study of natalizumab for relapsing-remitting multiple sclerosis.
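
A stripped-down sketch of the uncertainty-propagation step: approximate posterior draws for each drug-versus-placebo log-odds ratio are combined into an indirect drug-versus-drug comparison whose distribution could then feed an MCDA model. All numbers are invented and no MCMC sampler is run here.

set.seed(5)
draws    <- 10000
d_A_plac <- rnorm(draws, mean = -0.9, sd = 0.15)       # log-OR, drug A vs placebo (toy posterior)
d_B_plac <- rnorm(draws, mean = -0.5, sd = 0.20)       # log-OR, drug B vs placebo (toy posterior)
d_A_B    <- d_A_plac - d_B_plac                        # indirect comparison, A vs B
quantile(exp(d_A_B), c(0.025, 0.5, 0.975))             # OR of A vs B with a 95% credible interval
mean(d_A_B < 0)                                        # posterior probability that A is better on this outcome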

Journal ArticleDOI
TL;DR: A resampling approach is described that replicates the subgroup finding process many times and is used to adjust the effect estimates for selection bias and to provide variance estimators that account for selection uncertainty.
Abstract: The interest in individualized medicines and upcoming or renewed regulatory requests to assess treatment effects in subgroups of confirmatory trials require statistical methods that account for selection uncertainty and selection bias after having performed the search for meaningful subgroups. The challenge is to judge the strength of the apparent findings after mining the same data to discover them. In this paper, we describe a resampling approach that allows the subgroup finding process to be replicated many times. The replicates are used to adjust the effect estimates for selection bias and to provide variance estimators that account for selection uncertainty. A simulation study provides some evidence of the performance of the method and an example from oncology illustrates its use.
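
A toy base-R sketch of the resampling idea: the "pick the most promising subgroup" step is replayed on bootstrap samples and the replicates feed a standard bootstrap bias correction. The selection rule and data are invented and far simpler than the procedure in the paper.

set.seed(6)
n   <- 400
dat <- data.frame(trt = rbinom(n, 1, 0.5),
                  sex = rbinom(n, 1, 0.5),
                  old = rbinom(n, 1, 0.5),
                  y   = rnorm(n))                      # no true treatment effect anywhere

best_subgroup_effect <- function(d) {
  effects <- c(with(subset(d, sex == 1), mean(y[trt == 1]) - mean(y[trt == 0])),
               with(subset(d, sex == 0), mean(y[trt == 1]) - mean(y[trt == 0])),
               with(subset(d, old == 1), mean(y[trt == 1]) - mean(y[trt == 0])),
               with(subset(d, old == 0), mean(y[trt == 1]) - mean(y[trt == 0])))
  max(effects)                                         # effect in the selected ("best") subgroup
}
observed <- best_subgroup_effect(dat)
boot     <- replicate(500, best_subgroup_effect(dat[sample(n, replace = TRUE), ]))
bias     <- mean(boot) - observed                      # bootstrap estimate of the selection bias
c(naive = observed, bias_adjusted = observed - bias)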

Journal ArticleDOI
TL;DR: In this paper, a Bayesian hierarchical Tweedie regression model was developed to deal with continuous ecological data with a spike at zero, which can directly accommodate the excess number of zeros common to this type of data, whilst accounting for both spatial and temporal correlation.
Abstract: The development of methods for dealing with continuous data with a spike at zero has lagged behind those for overdispersed or zero-inflated count data. We consider longitudinal ecological data corresponding to an annual average of 26 weekly maximum counts of birds, and are hence effectively continuous, bounded below by zero but also with a discrete mass at zero. We develop a Bayesian hierarchical Tweedie regression model that can directly accommodate the excess number of zeros common to this type of data, whilst accounting for both spatial and temporal correlation. Implementation of the model is conducted in a Markov chain Monte Carlo (MCMC) framework, using reversible jump MCMC to explore uncertainty across both parameter and model spaces. This regression modelling framework is very flexible and removes the need to make strong assumptions about mean-variance relationships a priori. It can also directly account for the spike at zero, whilst being easily applicable to other types of data and other model formulations. Whilst a correlative study such as this cannot prove causation, our results suggest that an increase in an avian predator may have led to an overall decrease in the number of one of its prey species visiting garden feeding stations in the United Kingdom. This may reflect a change in behaviour of house sparrows to avoid feeding stations frequented by sparrowhawks, or a reduction in house sparrow population size as a result of sparrowhawk increase.
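
The kind of response considered here, continuous and nonnegative with an exact spike at zero, can be illustrated by simulating a compound Poisson-gamma variable (the Tweedie case with power parameter between 1 and 2). This is only a data-generating sketch, not the hierarchical reversible jump MCMC model of the paper.

r_cpgamma <- function(n, lambda, shape, rate) {
  counts <- rpois(n, lambda)                           # number of gamma "events" per observation
  sapply(counts, function(k) if (k == 0) 0 else sum(rgamma(k, shape, rate)))
}
set.seed(7)
y <- r_cpgamma(1000, lambda = 0.8, shape = 2, rate = 1)
mean(y == 0)                                           # proportion of exact zeros, about exp(-0.8)
hist(y[y > 0], main = "Positive part")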

Journal ArticleDOI
TL;DR: It is explained why p-values and model selection criteria evaluated on bootstrap data sets do not represent what would be obtained on the original data or new data drawn from the overall population, and the behavior of subsampling is investigated.
Abstract: The bootstrap method has become a widely used tool applied in diverse areas where results based on asymptotic theory are scarce. It can be applied, for example, for assessing the variance of a statistic, a quantile of interest or for significance testing by resampling from the null hypothesis. Recently, some approaches have been proposed in the biometrical field where hypothesis testing or model selection is performed on a bootstrap sample as if it were the original sample. P-values computed from bootstrap samples have been used, for example, in the statistics and bioinformatics literature for ranking genes with respect to their differential expression, for estimating the variability of p-values and for model stability investigations. Procedures which make use of bootstrapped information criteria are often applied in model stability investigations and model averaging approaches as well as when estimating the error of model selection procedures which involve tuning parameters. From the literature, however, there is evidence that p-values and model selection criteria evaluated on bootstrap data sets do not represent what would be obtained on the original data or new data drawn from the overall population. We explain the reasons for this and, through the use of a real data set and simulations, we assess the practical impact on procedures relevant to biometrical applications in cases where it has not yet been studied. Moreover, we investigate the behavior of subsampling (i.e., drawing from a data set without replacement) as a potential alternative solution to the bootstrap for these procedures.
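
A small simulation conveys the phenomenon: on data generated under the null hypothesis, two-sample t-test p-values recomputed on bootstrap samples fall below 0.05 far more often than 5% of the time, while subsamples of size roughly 0.632 n behave much more like the original data. This is a toy illustration, not the study design used in the paper.

set.seed(8)
x <- rnorm(100); g <- rep(0:1, each = 50)              # no true group difference
p_boot <- replicate(2000, {
  i <- sample(100, replace = TRUE)                     # bootstrap sample (with replacement)
  t.test(x[i] ~ g[i])$p.value
})
p_sub <- replicate(2000, {
  i <- sample(100, size = 63, replace = FALSE)         # subsample of size about 0.632 * n
  t.test(x[i] ~ g[i])$p.value
})
c(original        = t.test(x ~ g)$p.value,
  boot_below_0.05 = mean(p_boot < 0.05),               # typically well above the nominal 5%
  sub_below_0.05  = mean(p_sub < 0.05))                # much closer to the nominal 5%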

Journal ArticleDOI
TL;DR: The intraclass kappa statistic is used for assessing nominal scale agreement with a design where multiple clinicians examine the same group of patients under two different conditions, and an explicit variance formula is derived for the difference of correlated kappa statistics.
Abstract: In clinical studies, it is often of interest to see the diagnostic agreement among clinicians on certain symptoms. Previous work has focused on the agreement between two clinicians under two different conditions or the agreement among multiple clinicians under one condition. Few have discussed the agreement study with a design where multiple clinicians examine the same group of patients under two different conditions. In this paper, we use the intraclass kappa statistic for assessing nominal scale agreement with such a design. We derive an explicit variance formula for the difference of correlated kappa statistics and conduct hypothesis testing for the equality of kappa statistics. Simulation studies show that the method performs well with realistic sample sizes and may be superior to a method that did not take into account the measurement dependence structure. The practical utility of the method is illustrated on data from an eosinophilic esophagitis (EoE) study.

Journal ArticleDOI
TL;DR: This work provides maximum-likelihood estimation of GSEM parameters using an approximate Monte Carlo EM algorithm, coupled with a mediation formula approach to estimate natural direct and indirect effects.
Abstract: Health researchers are often interested in assessing the direct effect of a treatment or exposure on an outcome variable, as well as its indirect (or mediation) effect through an intermediate variable (or mediator). For an outcome following a nonlinear model, the mediation formula may be used to estimate causally interpretable mediation effects. This method, like others, assumes that the mediator is observed. However, as is common in structural equations modeling, we may wish to consider a latent (unobserved) mediator. We follow a potential outcomes framework and assume a generalized structural equations model (GSEM). We provide maximum-likelihood estimation of GSEM parameters using an approximate Monte Carlo EM algorithm, coupled with a mediation formula approach to estimate natural direct and indirect effects. The method relies on an untestable sequential ignorability assumption; we assess robustness to this assumption by adapting a recently proposed method for sensitivity analysis. Simulation studies show good properties of the proposed estimators in plausible scenarios. Our method is applied to a study of the effect of mother education on occurrence of adolescent dental caries, in which we examine possible mediation through latent oral health behavior.
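
The mediation formula itself is easy to sketch in R for an observed binary mediator and a continuous outcome; the latent-mediator Monte Carlo EM step of the paper is not reproduced, and the data and coefficients are simulated.

set.seed(9)
n <- 2000
a <- rbinom(n, 1, 0.5)                                 # exposure
m <- rbinom(n, 1, plogis(-0.5 + 1.0 * a))              # binary mediator
y <- 1 + 0.5 * a + 0.8 * m + rnorm(n)                  # continuous outcome

fit_m <- glm(m ~ a, family = binomial)
fit_y <- lm(y ~ a + m)
pm <- function(aa) predict(fit_m, data.frame(a = aa), type = "response")   # P(M = 1 | a)
ey <- function(aa, mm) predict(fit_y, data.frame(a = aa, m = mm))          # E[Y | a, m]

nde <- sum(sapply(0:1, function(mm) (ey(1, mm) - ey(0, mm)) * dbinom(mm, 1, pm(0))))
nie <- sum(sapply(0:1, function(mm) ey(1, mm) * (dbinom(mm, 1, pm(1)) - dbinom(mm, 1, pm(0)))))
c(NDE = nde, NIE = nie, total = nde + nie)             # natural direct, indirect, and total effects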

Journal ArticleDOI
TL;DR: The Tsiatis GOF statistic originally developed for logistic GLMCCs, TG, is generalized so that it can be applied under any link function, and it is shown that the algebraically related Hosmer–Lemeshow and Pigeon–Heyse statistics can be applied directly.
Abstract: Generalized linear models (GLM) with a canonical logit link function are the primary modeling technique used to relate a binary outcome to predictor variables. However, noncanonical links can offer more flexibility, producing convenient analytical quantities (e.g., probit GLMs in toxicology) and desired measures of effect (e.g., relative risk from log GLMs). Many summary goodness-of-fit (GOF) statistics exist for logistic GLM. Their properties make the development of GOF statistics relatively straightforward, but it can be more difficult under noncanonical links. Although GOF tests for logistic GLM with continuous covariates (GLMCC) have been applied to GLMCCs with log links, we know of no GOF tests in the literature specifically developed for GLMCCs that can be applied regardless of link function chosen. We generalize the Tsiatis GOF statistic (TG), originally developed for logistic GLMCCs, so that it can be applied under any link function. Further, we show that the algebraically related Hosmer-Lemeshow (HL) and Pigeon-Heyse (J²) statistics can be applied directly. In a simulation study, TG, HL, and J² were used to evaluate the fit of probit, log-log, complementary log-log, and log models, all calculated with a common grouping method. The TG statistic consistently maintained Type I error rates, while those of HL and J² were often lower than expected if terms with little influence were included. Generally, the statistics had similar power to detect an incorrect model. An exception occurred when a log GLMCC was incorrectly fit to data generated from a logistic GLMCC. In this case, TG had more power than HL or J².
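
As a small illustration of grouped goodness-of-fit testing under a noncanonical link, the Hosmer-Lemeshow statistic can be computed by hand for a probit GLM; the generalized Tsiatis statistic proposed in the paper is not reproduced here, and the data are simulated.

set.seed(10)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.8 * x))               # data truly follow a probit model

fit <- glm(y ~ x, family = binomial(link = "probit"))
p   <- fitted(fit)
grp <- cut(p, breaks = quantile(p, seq(0, 1, 0.1)), include.lowest = TRUE)   # deciles of risk

obs  <- tapply(y, grp, sum)                            # observed events per group
expd <- tapply(p, grp, sum)                            # expected events per group
n_g  <- tapply(p, grp, length)
HL   <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
pchisq(HL, df = length(obs) - 2, lower.tail = FALSE)   # approximate p-value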

Journal ArticleDOI
TL;DR: A mixed-effects model is described for nonnegative continuous cross-sectional data in a two-part modelling framework, where a potentially endogenous binary variable is included in the model specification and association between the outcomes is modeled through a (discrete) latent structure.
Abstract: We describe a mixed-effects model for nonnegative continuous cross-sectional data in a two-part modelling framework. A potentially endogenous binary variable is included in the model specification and association between the outcomes is modeled through a (discrete) latent structure. We show how model parameters can be estimated in a finite mixture context, allowing for skewness, multivariate association between random effects and endogeneity. The model behavior is investigated through a large-scale simulation experiment. The proposed model is computationally parsimonious and seems to produce acceptable results even if the underlying random effects structure follows a continuous parametric (e.g. Gaussian) distribution. The proposed approach is motivated by the analysis of a sample taken from the Medical Expenditure Panel Survey. The analyzed outcome, that is ambulatory health expenditure, is a mixture of zeros and continuous values. The effects of socio-demographic characteristics on health expenditure are investigated and, as a by-product of the estimation procedure, two subpopulations (i.e. high and low users) are identified.

Journal ArticleDOI
TL;DR: The results suggest that this methodology can lead to a dosing strategy that performs well both within and across populations with different pharmacokinetic characteristics, and may assist in the design of randomized trials by narrowing the list of potential dosing strategies to those which are most promising.
Abstract: There have been considerable advances in the methodology for estimating dynamic treatment regimens, and for the design of sequential trials that can be used to collect unconfounded data to inform such regimens. However, relatively little attention has been paid to how such methodology could be used to advance understanding of optimal treatment strategies in a continuous dose setting, even though it is often the case that considerable patient heterogeneity in drug response along with a narrow therapeutic window may necessitate the tailoring of dosing over time. Such is the case with warfarin, a common oral anticoagulant. We propose novel, realistic simulation models based on pharmacokinetic-pharmacodynamic properties of the drug that can be used to evaluate potentially optimal dosing strategies. Our results suggest that this methodology can lead to a dosing strategy that performs well both within and across populations with different pharmacokinetic characteristics, and may assist in the design of randomized trials by narrowing the list of potential dosing strategies to those which are most promising.

Journal ArticleDOI
TL;DR: If automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables even if none of them have any effect.
Abstract: Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach. The idea is to apply the variable selection procedure on a large number of bootstrap samples successively and to examine the obtained models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we aim to investigate a particular important problem affecting this method in the case of categorical predictor variables with different numbers of categories and to give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables even if none of them have any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
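
A toy version of the experiment can be run in base R, with AIC-based step() standing in for the likelihood-ratio-based selection studied in the paper: on null data, compare how often a 10-category factor and a metric covariate are retained when backward elimination is run on bootstrap samples versus subsamples.

set.seed(12)
n   <- 200
dat <- data.frame(y   = rnorm(n),
                  f10 = factor(sample(1:10, n, replace = TRUE)),   # 10 categories, no effect
                  x   = rnorm(n))                                  # metric, no effect

selected <- function(d) {
  kept <- attr(terms(step(lm(y ~ f10 + x, data = d), trace = 0)), "term.labels")
  c(f10 = "f10" %in% kept, x = "x" %in% kept)
}
boot_freq <- rowMeans(replicate(200, selected(dat[sample(n, replace = TRUE), ])))
sub_freq  <- rowMeans(replicate(200, selected(dat[sample(n, size = round(0.632 * n)), ])))
rbind(bootstrap = boot_freq, subsample = sub_freq)     # inclusion frequencies of the null variables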

Journal ArticleDOI
TL;DR: A conditional autoregressive model is proposed for longitudinal binary data with an informative dropout (ID) model, such that the probabilities of positive outcomes as well as the dropout indicator on each occasion are logit linear in some covariates and outcomes.
Abstract: Dropouts are common in longitudinal studies. If the dropout probability depends on the missing observations at or after dropout, this type of dropout is called informative (or nonignorable) dropout (ID). Failure to accommodate such a dropout mechanism in the model will bias the parameter estimates. We propose a conditional autoregressive model for longitudinal binary data with an ID model such that the probabilities of positive outcomes as well as the dropout indicator on each occasion are logit linear in some covariates and outcomes. This model, adopting a marginal model for outcomes and a conditional model for dropouts, is called a selection model. To allow for the heterogeneity and clustering effects, the outcome model is extended to incorporate mixture and random effects. Lastly, the model is further extended so that the outcome and dropout are modeled jointly, with their dependency formulated through an odds ratio function. Parameters are estimated by a Bayesian approach implemented using the user-friendly Bayesian software WinBUGS. A methadone clinic dataset is analyzed to illustrate the proposed models. Results show that the treatment time effect is still significant but weaker after allowing for an ID process in the data. Finally, the effect of dropout on parameter estimates is evaluated through simulation studies.

Journal ArticleDOI
TL;DR: This work discusses SAE techniques for semicontinuous variables under a two-part random effects model that allows for the presence of excess zeros as well as the skewed nature of the nonzero values of the response variable, and proposes a parametric bootstrap method to estimate the MSE of the proposed small area estimator.
Abstract: Survey data often contain measurements for variables that are semicontinuous in nature, i.e. they either take a single fixed value (we assume this is zero) or they have a continuous, often skewed, distribution on the positive real line. Standard methods for small area estimation (SAE) based on the use of linear mixed models can be inefficient for such variables. We discuss SAE techniques for semicontinuous variables under a two part random effects model that allows for the presence of excess zeros as well as the skewed nature of the nonzero values of the response variable. In particular, we first model the excess zeros via a generalized linear mixed model fitted to the probability of a nonzero, i.e. strictly positive, value being observed, and then model the response, given that it is strictly positive, using a linear mixed model fitted on the logarithmic scale. Empirical results suggest that the proposed method leads to efficient small area estimates for semicontinuous data of this type. We also propose a parametric bootstrap method to estimate the MSE of the proposed small area estimator. These bootstrap estimates of the MSE are compared to the true MSE in a simulation study.
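
A much-simplified sketch of the two-part fit, assuming the lme4 package is available: a logistic GLMM for the probability of a nonzero value and a linear mixed model for the logarithm of the positive part, combined into naive area-level predictions. The data are simulated and no MSE estimation is attempted, unlike in the paper.

library(lme4)
set.seed(11)
n_area <- 30; n_per <- 40
area <- factor(rep(1:n_area, each = n_per))
x    <- rnorm(n_area * n_per)
u    <- rnorm(n_area, 0, 0.4)[area]                    # shared area effect (toy)
pos  <- rbinom(n_area * n_per, 1, plogis(-0.3 + 0.5 * x + u))
y    <- ifelse(pos == 1, exp(0.8 + 0.6 * x + u + rnorm(n_area * n_per, 0, 0.5)), 0)
dat  <- data.frame(y, x, area, pos)

fit_zero <- glmer(pos ~ x + (1 | area), data = dat, family = binomial)   # P(y > 0)
fit_pos  <- lmer(log(y) ~ x + (1 | area), data = subset(dat, y > 0))     # log(y) given y > 0

newd  <- data.frame(x = as.numeric(tapply(dat$x, dat$area, mean)),
                    area = factor(levels(dat$area), levels = levels(dat$area)))
p_pos <- predict(fit_zero, newd, type = "response")
mu    <- predict(fit_pos, newd)
head(p_pos * exp(mu + sigma(fit_pos)^2 / 2))           # naive small area means (lognormal back-transform)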

Journal ArticleDOI
TL;DR: The model adopts a semiparametric Bayesian approach by imposing a GP prior on the nonlinear structure of the continuous covariate, and a conditionally autoregressive distribution is placed on the region-specific frailties to handle spatial correlation.
Abstract: Our present work proposes a new survival model in a Bayesian context to analyze right-censored survival data for populations with a surviving fraction, assuming that the log failure time follows a generalized extreme value distribution. Many applications require a more flexible modeling of covariate information than a simple linear or parametric form for all covariate effects. It is also necessary to include the spatial variation in the model, since it is sometimes unexplained by the covariates considered in the analysis. Therefore, the nonlinear covariate effects and the spatial effects are incorporated into the systematic component of our model. Gaussian processes (GPs) provide a natural framework for modeling potentially nonlinear relationship and have recently become extremely powerful in nonlinear regression. Our proposed model adopts a semiparametric Bayesian approach by imposing a GP prior on the nonlinear structure of continuous covariate. With the consideration of data availability and computational complexity, the conditionally autoregressive distribution is placed on the region-specific frailties to handle spatial correlation. The flexibility and gains of our proposed model are illustrated through analyses of simulated data examples as well as a dataset involving a colon cancer clinical trial from the state of Iowa.

Journal ArticleDOI
TL;DR: Simulation results show that the proposed criteria frequently select the correct model among candidate mean models and have good performance in selecting the working correlation structure in longitudinal data with dropout missingness for binary and normal outcomes.
Abstract: We propose criteria for variable selection in the mean model and for the selection of a working correlation structure in longitudinal data with dropout missingness using weighted generalized estimating equations. The proposed criteria are based on a weighted quasi-likelihood function and a penalty term. Our simulation results show that the proposed criteria frequently select the correct model in candidate mean models. The proposed criteria also have good performance in selecting the working correlation structure for binary and normal outcomes. We illustrate our approaches using two empirical examples. In the first example, we use data from a randomized double-blind study to test the cancer-preventing effects of beta carotene. In the second example, we use longitudinal CD4 count data from a randomized double-blind study.

Journal ArticleDOI
TL;DR: Tests are proposed for main and simple treatment effects, time effects, as well as treatment by time interactions in possibly high-dimensional multigroup repeated measures designs, and are illustrated using electroencephalography data from a neurological study involving patients with Alzheimer's disease and other cognitive impairments.
Abstract: We propose tests for main and simple treatment effects, time effects, as well as treatment by time interactions in possibly high-dimensional multigroup repeated measures designs. The proposed inference procedures extend the work by Brunner et al. (2012) from two to several treatment groups and remain valid for unbalanced data and under unequal covariance matrices. In addition to showing consistency when sample size and dimension tend to infinity at the same rate, we provide finite sample approximations and evaluate their performance in a simulation study, demonstrating better maintenance of the nominal α-level than the popular Box-Greenhouse-Geisser and Huynh-Feldt methods, and a gain in power for informatively increasing dimension. Application is illustrated using electroencephalography (EEG) data from a neurological study involving patients with Alzheimer's disease and other cognitive impairments.

Journal ArticleDOI
TL;DR: It is found that following an AKI event the average long‐term rate of decline in kidney function is almost doubled, regardless of the severity of the event.
Abstract: We use data from an ongoing cohort study of chronic kidney patients at Salford Royal NHS Foundation Trust, Greater Manchester, United Kingdom, to investigate the influence of acute kidney injury (AKI) on the subsequent rate of change of kidney function amongst patients already diagnosed with chronic kidney disease (CKD). We use a linear mixed effects modelling framework to enable estimation of both acute and chronic effects of AKI events on kidney function. We model the fixed effects by a piece-wise linear function with three change-points to capture the acute changes in kidney function that characterise an AKI event, and the random effects by the sum of three components: a random intercept, a stationary stochastic process with Matern correlation structure, and measurement error. We consider both multivariate Normal and multivariate t versions of the random effects. For either specification, we estimate model parameters by maximum likelihood and evaluate the plug-in predictive distributions of the random effects given the data. We find that following an AKI event the average long-term rate of decline in kidney function is almost doubled, regardless of the severity of the event. We also identify and present examples of individual patients whose kidney function trajectories diverge substantially from the population-average.
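
The fixed-effect structure can be sketched with a truncated-line ("broken stick") basis. The toy model below uses two change points and only a random intercept, assuming the lme4 package is available, rather than the three change points and the Matern-correlated process of the paper; the eGFR-like data are simulated.

library(lme4)
set.seed(13)
bend <- function(t, knot) pmax(t - knot, 0)            # truncated line: zero before the change point

t  <- rep(0:20, times = 50)                            # follow-up time for 50 patients
id <- rep(1:50, each = 21)
mu <- 60 - 0.2 * t - 7.8 * bend(t, 5) + 7.6 * bend(t, 7)   # acute drop at t = 5-7, then steeper decline
egfr <- mu + rnorm(50, 0, 8)[id] + rnorm(length(t), 0, 3)  # random intercept plus measurement error
dat  <- data.frame(id, t, egfr)

fit <- lmer(egfr ~ t + bend(t, 5) + bend(t, 7) + (1 | id), data = dat)
fixef(fit)                                             # pre-event slope and slope changes at the knots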

Journal ArticleDOI
TL;DR: In this paper, a logistic regression is proposed whose sparsity is viewed as a model selection challenge; since the model space is huge, a Metropolis-Hastings algorithm carries out the model selection by maximizing the BIC criterion.
Abstract: Spontaneous adverse event reports have a high potential for detecting adverse drug reactions. However, due to their dimension, the analysis of such databases requires statistical methods. In this context, disproportionality measures can be used. Their main idea is to project the data onto contingency tables in order to measure the strength of associations between drugs and adverse events. However, due to the data projection, these methods are sensitive to the problem of coprescriptions and masking effects. Recently, logistic regressions have been used with a Lasso type penalty to perform the detection of associations between drugs and adverse events. On different examples, this approach limits the drawbacks of the disproportionality methods, but the choice of the penalty value is open to criticism while it strongly influences the results. In this paper, we propose to use a logistic regression whose sparsity is viewed as a model selection challenge. Since the model space is huge, a Metropolis-Hastings algorithm carries out the model selection by maximizing the BIC criterion. Thus, we avoid the calibration of penalty or threshold. During our application on the French pharmacovigilance database, the proposed method is compared to well-established approaches on a reference dataset, and obtains better rates of positive and negative controls. However, many signals (i.e., specific drug-event associations) are not detected by the proposed method. So, we conclude that this method should be used in parallel to existing measures in pharmacovigilance. Code implementing the proposed method is available at the following url: https://github.com/masedki/MHTrajectoryR.
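
A much-simplified sketch of the idea in base R: a Metropolis-Hastings walk over variable-inclusion indicators for a logistic regression, scored by BIC. The data are simulated and the MHTrajectoryR implementation linked above is not used.

set.seed(14)
n <- 500; p <- 8
X <- matrix(rbinom(n * p, 1, 0.2), n, p, dimnames = list(NULL, paste0("drug", 1:p)))
y <- rbinom(n, 1, plogis(-2 + 1.5 * X[, 1] + 1.0 * X[, 3]))      # adverse event driven by drugs 1 and 3

bic_of <- function(keep) {                             # BIC of the logistic model with selected drugs
  if (!any(keep)) return(BIC(glm(y ~ 1, family = binomial)))
  BIC(glm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]), family = binomial))
}
keep <- rep(FALSE, p); cur_bic <- bic_of(keep)
best <- keep; best_bic <- cur_bic
for (iter in 1:500) {
  prop <- keep; j <- sample(p, 1); prop[j] <- !prop[j]           # flip one drug in or out
  prop_bic <- bic_of(prop)
  if (log(runif(1)) < (cur_bic - prop_bic) / 2) {                # MH step targeting exp(-BIC / 2)
    keep <- prop; cur_bic <- prop_bic
  }
  if (cur_bic < best_bic) { best <- keep; best_bic <- cur_bic }
}
colnames(X)[best]                                      # drugs flagged as signals in the best model found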