Showing papers in "arXiv: Applications in 2008"

Open Access

Journal Article•DOI•

Random survival forests

Hemant Ishwaran¹, Udaya B. Kogalur¹, Eugene H. Blackstone, Michael S. Lauer•Institutions (1)

11 Nov 2008-arXiv: Applications

TL;DR: Random survival forests (RSF) is a random forests method for the analysis of right-censored survival data; a conservation-of-events principle is used to define ensemble mortality, an interpretable measure that can serve as a predicted outcome.

Abstract: We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.

1,562 citations
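
The conservation-of-events principle mentioned in the abstract can be illustrated on a single group of individuals. The sketch below assumes the principle is read as follows: when the cumulative hazard is estimated with the Nelson-Aalen estimator, summing the estimate over every individual's observed time (deaths and censored cases alike) gives the total number of deaths. The toy data and the function name `nelson_aalen` are hypothetical, and this is not the randomSurvivalForest implementation.

```python
import numpy as np

def nelson_aalen(times, events):
    """Return H(t), the Nelson-Aalen cumulative hazard estimate for one group."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    death_times = np.unique(times[events == 1])
    # d: deaths at each distinct death time; y: individuals still at risk then.
    d = np.array([np.sum((times == t) & (events == 1)) for t in death_times])
    y = np.array([np.sum(times >= t) for t in death_times])
    increments = d / y

    def H(t):
        # Sum the hazard increments at all death times up to and including t.
        return float(np.sum(increments[death_times <= t]))

    return H

# Hypothetical toy data: follow-up times and indicators (1 = death, 0 = censored).
times = [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 11.0]
events = [1, 0, 1, 1, 0, 1, 0]

H = nelson_aalen(times, events)
total_hazard = sum(H(t) for t in times)
# The two printed numbers should agree up to floating-point rounding.
print(total_hazard, sum(events))
```

The identity holds because each death time contributes d/y to exactly the y individuals still at risk, so its total contribution is d.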

Journal Article•DOI•

For objective causal inference, design trumps analysis

Donald B. Rubin

11 Nov 2008-arXiv: Applications

TL;DR: In this paper, the authors argue that observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data, and they use the framework of potential outcomes to define causal effects.

Abstract: For obtaining causal inferences that are objective, and therefore have the best chance of revealing scientific truths, carefully designed and executed randomized experiments are generally considered to be the gold standard. Observational studies, in contrast, are generally fraught with problems that compromise any claim for objectivity of the resulting causal inferences. The thesis here is that observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data. Often a candidate data set will have to be rejected as inadequate because of lack of data on key covariates, or because of lack of overlap in the distributions of key covariates between treatment and control groups, often revealed by careful propensity score analyses. Sometimes the template for the approximating randomized experiment will have to be altered, and the use of principal stratification can be helpful in doing this. These issues are discussed and illustrated using the framework of potential outcomes to define causal effects, which greatly clarifies critical issues.

640 citations

Journal Article•DOI•

Predictive learning via rule ensembles

Jerome H. Friedman, Bogdan E. Popescu

11 Nov 2008-arXiv: Applications

TL;DR: In this article, a linear combination of simple rules derived from the data is used for general regression and classification models, where each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables.

Abstract: General regression and classification models are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables. These rule ensembles are shown to produce predictive accuracy comparable to the best methods. However, their principal advantage lies in interpretation. Because of its simple form, each rule is easy to understand, as is its influence on individual predictions, selected subsets of predictions, or globally over the entire space of joint input variable values. Similarly, the degree of relevance of the respective input variables can be assessed globally, locally in different regions of the input space, or at individual prediction points. Techniques are presented for automatically identifying those variables that are involved in interactions with other variables, the strength and degree of those interactions, as well as the identities of the other variables with which they interact. Graphical representations are used to visualize both main and interaction effects.

515 citations
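
The model form described in the abstract, a linear combination of rules where each rule is a conjunction of simple statements about input variables, can be sketched as follows. The three rules, the synthetic data, and the helper names are hypothetical, and ordinary least squares stands in for whatever fitting procedure the paper uses; the sketch only shows how rules become 0/1 features in a linear model.

```python
import numpy as np

# Hypothetical rules: each is a conjunction of simple statements about inputs.
RULES = [
    lambda x: (x[:, 0] > 0.5) & (x[:, 1] <= 0.3),
    lambda x: x[:, 2] > 0.7,
    lambda x: (x[:, 0] <= 0.5) & (x[:, 2] <= 0.7),
]

def rule_features(x):
    """Evaluate every rule on every row; returns an (n, n_rules) 0/1 matrix."""
    return np.column_stack([rule(x).astype(float) for rule in RULES])

def fit_rule_ensemble(x, y):
    """Least-squares fit of y ~ a0 + sum_k a_k * r_k(x) (assumed fitting method)."""
    design = np.column_stack([np.ones(len(x)), rule_features(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, x):
    """Intercept plus the weighted sum of the rules that fire for each row."""
    return coef[0] + rule_features(x) @ coef[1:]

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 3))
# Synthetic response built from the first two rules only.
y = 1.0 + 2.0 * RULES[0](x) - 1.0 * RULES[1](x)

coef = fit_rule_ensemble(x, y)
print(np.round(coef, 6))  # intercept followed by one coefficient per rule
```

Each fitted coefficient can be read directly as the change in the prediction when the corresponding rule fires, which is the interpretability advantage the abstract emphasizes.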

Posted Content•

Regularized Multivariate Regression for Identifying Master Predictors with Application to Integrative Genomics Study of Breast Cancer

Jie Peng¹, Ji Zhu², Anna Bergamaschi³, Wonshik Han⁴, Dong Young Noh⁴, Jonathan R. Pollack⁵, Pei Wang⁶•Institutions (6)

University of California, Davis¹, University of Michigan², Rikshospitalet–Radiumhospitalet³, New Generation University College⁴, Stanford University⁵, Fred Hutchinson Cancer Research Center⁶

18 Dec 2008-arXiv: Applications

TL;DR: The proposed method, remMap (REgularized Multivariate regression for identifying MAster Predictors), fits multivariate response regression models in the high-dimension-low-sample-size setting; it is applied to a breast cancer study in which genome-wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples.

Abstract: In this paper, we propose a new method, remMap (REgularized Multivariate regression for identifying MAster Predictors), for fitting multivariate response regression models under the high-dimension-low-sample-size setting. remMap is motivated by investigating the regulatory relationships among different biological molecules based on multiple types of high dimensional genomic data. In particular, we are interested in studying the influence of DNA copy number alterations on RNA transcript levels. For this purpose, we model the dependence of the RNA expression levels on DNA copy numbers through multivariate linear regressions and utilize proper regularizations to deal with the high dimensionality as well as to incorporate desired network structures. Criteria for selecting the tuning parameters are also discussed. The performance of the proposed method is illustrated through extensive simulation studies. Finally, remMap is applied to a breast cancer study, in which genome-wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples. We identify a trans-hub region in cytoband 17q12-q21, whose amplification influences the RNA expression levels of more than 30 unlinked genes. These findings may lead to a better understanding of breast cancer pathology.

251 citations

Posted Content•

Adaptive design and analysis of supercomputer experiments

Robert B. Gramacy¹, Herbert K. H. Lee²•Institutions (2)

University of Cambridge¹, University of California, Santa Cruz²

28 May 2008-arXiv: Applications

TL;DR: In this article, an adaptive sequential design framework is developed to cope with an asynchronous, random, agent-based supercomputing environment, using a hybrid approach that melds optimal strategies from the statistics literature with flexible strategies from the active learning literature.

Abstract: Computer experiments are often performed to allow modeling of a response surface of a physical experiment that can be too costly or difficult to run except using a simulator. Running the experiment over a dense grid can be prohibitively expensive, yet running over a sparse design chosen in advance can result in obtaining insufficient information in parts of the space, particularly when the surface calls for a nonstationary model. We propose an approach that automatically explores the space while simultaneously fitting the response surface, using predictive uncertainty to guide subsequent experimental runs. The newly developed Bayesian treed Gaussian process is used as the surrogate model, and a fully Bayesian approach allows explicit measures of uncertainty. We develop an adaptive sequential design framework to cope with an asynchronous, random, agent-based supercomputing environment, by using a hybrid approach that melds optimal strategies from the statistics literature with flexible strategies from the active learning literature. The merits of this approach are borne out in several examples, including the motivating computational fluid dynamics simulation of a rocket booster.

189 citations

Posted Content•

An efficient methodology for modeling complex computer codes with Gaussian processes

Amandine Marrel, Bertrand Iooss, François Van Dorpe, Elena Volkova¹•Institutions (1)

Kurchatov Institute¹

08 Feb 2008-arXiv: Applications

TL;DR: In this article, a specific estimation procedure is developed to adjust a Gaussian process model in complex cases (nonlinear relations, highly dispersed or discontinuous output, high-dimensional input, inadequate sampling designs, etc.).

Abstract: Complex computer codes are often too time-consuming to be used directly for uncertainty propagation studies, global sensitivity analysis, or optimization problems. A well-known and widely used way to circumvent this difficulty is to replace the complex computer code with a reduced model, called a metamodel or response surface, that represents the computer code and requires acceptable calculation time. One particular class of metamodels is studied: the Gaussian process model, which is characterized by its mean and covariance functions. A specific estimation procedure is developed to adjust a Gaussian process model in complex cases (nonlinear relations, highly dispersed or discontinuous output, high-dimensional input, inadequate sampling designs, etc.). The efficiency of this algorithm is compared to the efficiency of other existing algorithms on an analytical test case. The proposed methodology is also illustrated for the case of a complex hydrogeological computer code, simulating radionuclide transport in groundwater.

180 citations
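
The idea of a Gaussian process metamodel, a cheap surrogate defined by a mean and a covariance function and conditioned on a few runs of an expensive code, can be sketched as below. This is a minimal illustration under stated assumptions: a constant zero mean, a squared-exponential covariance with hand-fixed hyperparameters, and a stand-in function `expensive_code` in place of a real simulator. The paper's estimation procedure for the hyperparameters is not reproduced here.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance between the rows of a and the rows of b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_fit_predict(x_train, y_train, x_new, mean=0.0, nugget=1e-8):
    """Posterior mean and variance of a GP with fixed (assumed) hyperparameters."""
    k = sq_exp_kernel(x_train, x_train) + nugget * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_new, x_train)
    chol = np.linalg.cholesky(k)
    # alpha = K^{-1} (y - mean), computed through the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - mean))
    pred_mean = mean + k_star @ alpha
    v = np.linalg.solve(chol, k_star.T)
    pred_var = sq_exp_kernel(x_new, x_new).diagonal() - np.sum(v**2, axis=0)
    return pred_mean, np.maximum(pred_var, 0.0)

def expensive_code(x):
    """Hypothetical stand-in for a costly simulator with one scalar input."""
    return np.sin(2.0 * np.pi * x[:, 0])

x_train = np.linspace(0.0, 1.0, 8).reshape(-1, 1)   # eight simulator runs
y_train = expensive_code(x_train)
x_new = np.array([[0.25], [0.6]])                    # points not yet simulated

pred_mean, pred_var = gp_fit_predict(x_train, y_train, x_new)
print(pred_mean, pred_var)
```

The predictive variance is what makes the metamodel useful for design: it is close to zero at the training inputs and grows away from them, indicating where further simulator runs would be most informative.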

Journal Article•DOI•

Inference using shape-restricted regression splines

Mary C. Meyer

11 Nov 2008-arXiv: Applications

TL;DR: In this paper, an algorithm for the cubic monotone case is proposed, and the method is extended to convex constraints and variants such as increasing-concave; the restricted versions have smaller squared error loss than the unrestricted splines.

Abstract: Regression splines are smooth, flexible, and parsimonious nonparametric function estimators. They are known to be sensitive to knot number and placement, but if assumptions such as monotonicity or convexity may be imposed on the regression function, the shape-restricted regression splines are robust to knot choices. Monotone regression splines were introduced by Ramsay [Statist. Sci. 3 (1988) 425-461], but were limited to quadratic and lower order. In this paper an algorithm for the cubic monotone case is proposed, and the method is extended to convex constraints and variants such as increasing-concave. The restricted versions have smaller squared error loss than the unrestricted splines, although they have the same convergence rates. The relatively small degrees of freedom of the model and the insensitivity of the fits to the knot choices allow for practical inference methods; the computational efficiency allows for back-fitting of additive models. Tests of constant versus increasing and linear versus convex regression function, when implemented with shape-restricted regression splines, have higher power than the standard version using ordinary shape-restricted regression.

151 citations

Posted Content•

Markov switching negative binomial models: an application to vehicle accident frequencies

Nataliya V. Malyshkina¹, Fred L. Mannering¹, Andrew P. Tarko¹•Institutions (1)

Purdue University¹

11 Nov 2008-arXiv: Applications

TL;DR: The estimated Markov switching models give a superior statistical fit relative to the standard (single-state) negative binomial model; the more frequent state is found to be safer and correlated with better weather conditions.

Abstract: In this paper, two-state Markov switching models are proposed to study accident frequencies. These models assume that there are two unobserved states of roadway safety, and that roadway entities (roadway segments) can switch between these states over time. The states are distinct, in the sense that in the different states accident frequencies are generated by separate counting processes (by separate Poisson or negative binomial processes). To demonstrate the applicability of the approach presented herein, two-state Markov switching negative binomial models are estimated using five-year accident frequencies on Indiana interstate highway segments. Bayesian inference methods and Markov Chain Monte Carlo (MCMC) simulations are used for model estimation. The estimated Markov switching models result in a superior statistical fit relative to the standard (single-state) negative binomial model. It is found that the more frequent state is safer and it is correlated with better weather conditions. The less frequent state is found to be less safe and to be correlated with adverse weather conditions.

149 citations

Posted Content•

Markov switching multinomial logit model: an application to accident injury severities

Nataliya V. Malyshkina¹, Fred L. Mannering¹•Institutions (1)

Purdue University¹

21 Nov 2008-arXiv: Applications

TL;DR: Two-state Markov switching multinomial logit models are proposed for statistical modeling of accident-injury severities; the more frequent state of roadway safety is found to be correlated with better weather conditions and the less frequent state with adverse weather conditions.

Abstract: In this study, two-state Markov switching multinomial logit models are proposed for statistical modeling of accident injury severities. These models assume Markov switching in time between two unobserved states of roadway safety. The states are distinct, in the sense that in different states accident severity outcomes are generated by separate multinomial logit processes. To demonstrate the applicability of the approach presented herein, two-state Markov switching multinomial logit models are estimated for severity outcomes of accidents occurring on Indiana roads over a four-year time interval. Bayesian inference methods and Markov Chain Monte Carlo (MCMC) simulations are used for model estimation. The estimated Markov switching models result in a superior statistical fit relative to the standard (single-state) multinomial logit models. It is found that the more frequent state of roadway safety is correlated with better weather conditions. The less frequent state is found to be correlated with adverse weather conditions.

129 citations

Journal Article•DOI•

Simultaneous inference: When should hypothesis testing problems be combined?

Bradley Efron

27 Mar 2008-arXiv: Applications

TL;DR: A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses; it allows efficient testing within small subclasses and has applications to "enrichment," the detection of multi-case effects.

Abstract: Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance, a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to "enrichment," the detection of multi-case effects.

105 citations

Journal Article•DOI•

Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions

Jie Peng, Hans-Georg Müller

05 May 2008-arXiv: Applications

TL;DR: This paper proposes a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available, and applies distance-based clustering methods to eBay online auction data.

Abstract: We propose a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available. Such data occur commonly in longitudinal studies and online trading data. A distance measure then makes it possible to apply distance-based analysis such as classification, clustering and multidimensional scaling for irregularly sampled longitudinal data. Once a suitable distance measure for sparsely sampled longitudinal trajectories has been found, we apply distance-based clustering methods to eBay online auction data. We identify six distinct clusters of bidding patterns. Each of these bidding patterns is found to be associated with a specific chance to obtain the auctioned item at a reasonable price.

Posted Content•

Global sensitivity analysis of computer models with functional inputs

Bertrand Iooss, Mathieu Ribatet¹•Institutions (1)

Institut national de la recherche scientifique¹

07 Feb 2008-arXiv: Applications

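TL;DR: In this paper, the mean and dispersion of the code outputs are modeled simultaneously using two interlinked Generalized Linear Models (GLM) or Generalized Additive Models (GAM); the mean model yields the sensitivity indices of each scalar input variable, and the dispersion model yields the total sensitivity index of the functional input variables.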

Abstract: Global sensitivity analysis is used to quantify the influence of uncertain input parameters on the response variability of a numerical model. The common quantitative methods are applicable to computer codes with scalar input variables. This paper aims to illustrate different variance-based sensitivity analysis techniques, based on the so-called Sobol indices, when some input variables are functional, such as stochastic processes or random spatial fields. In this work, we focus on computer codes with large CPU time, which need a preliminary meta-modeling step before the sensitivity analysis is performed. We propose the use of the joint modeling approach, i.e., modeling simultaneously the mean and the dispersion of the code outputs using two interlinked Generalized Linear Models (GLM) or Generalized Additive Models (GAM). The "mean" model allows estimation of the sensitivity indices of each scalar input variable, while the "dispersion" model allows derivation of the total sensitivity index of the functional input variables. The proposed approach is compared to some classical sensitivity analysis (SA) methodologies on an analytical function. Lastly, the proposed methodology is applied to a concrete industrial computer code that simulates nuclear fuel irradiation.
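
For readers unfamiliar with Sobol indices, the sketch below estimates first-order indices for scalar inputs with a standard "pick-freeze" Monte Carlo scheme. It does not implement the joint GLM/GAM approach of the paper and does not handle functional inputs. The toy model, the independent standard normal inputs, and the function names are all assumptions made for illustration.

```python
import numpy as np

def first_order_sobol(model, n_inputs, n_samples=100_000, seed=0):
    """Pick-freeze estimate of first-order Sobol indices.

    Assumes independent standard normal inputs (an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, n_inputs))
    b = rng.standard_normal((n_samples, n_inputs))
    f_a, f_b = model(a), model(b)
    total_var = np.var(np.concatenate([f_a, f_b]))
    indices = np.empty(n_inputs)
    for i in range(n_inputs):
        # ab_i shares only input i with sample b; all other inputs come from a.
        ab_i = a.copy()
        ab_i[:, i] = b[:, i]
        # E[f(B) * (f(AB_i) - f(A))] estimates Var(E[Y | X_i]).
        indices[i] = np.mean(f_b * (model(ab_i) - f_a)) / total_var
    return indices

def toy_model(x):
    """Hypothetical additive model: Var(Y) = 1 + 4, so S1 = 0.2 and S2 = 0.8."""
    return x[:, 0] + 2.0 * x[:, 1]

s = first_order_sobol(toy_model, n_inputs=2)
print(np.round(s, 3))  # Monte Carlo estimates, close to the analytic values
```

Each index requires a fresh batch of model evaluations, which is why codes with large CPU time are first replaced by a metamodel, as the abstract notes.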

Journal Article•DOI•

Forecasting time series of inhomogeneous Poisson processes with application to call center workforce management

Haipeng Shen, Jianhua Z. Huang¹•Institutions (1)

University of North Carolina at Chapel Hill¹

25 Jul 2008-arXiv: Applications

TL;DR: The empirical results demonstrate how forecasting and dynamic updating of call arrival rates can affect the accuracy of call center staffing.

Abstract: We consider forecasting the latent rate profiles of a time series of inhomogeneous Poisson processes. The work is motivated by operations management of queueing systems, in particular, telephone call centers, where accurate forecasting of call arrival rates is a crucial primitive for efficient staffing of such centers. Our forecasting approach utilizes dimension reduction through a factor analysis of Poisson variables, followed by time series modeling of factor score series. Time series forecasts of factor scores are combined with factor loadings to yield forecasts of future Poisson rate profiles. Penalized Poisson regressions on factor loadings guided by time series forecasts of factor scores are used to generate dynamic within-process rate updating. Methods are also developed to obtain distributional forecasts. Our methods are illustrated using simulation and real data. The empirical results demonstrate how forecasting and dynamic updating of call arrival rates can affect the accuracy of call center staffing.

Journal Article•DOI•

Coordinate descent algorithms for lasso penalized regression

Tong Tong Wu, Kenneth Lange

27 Mar 2008-arXiv: Applications

TL;DR: This paper tests two exceptionally fast algorithms for estimating regression coefficients with a lasso penalty and proves that a greedy form of the $\ell_2$ algorithm converges to the minimum value of the objective function.

Abstract: Imposition of a lasso penalty shrinks parameter estimates toward zero and performs continuous model selection. Lasso penalized regression is capable of handling linear regression problems where the number of predictors far exceeds the number of cases. This paper tests two exceptionally fast algorithms for estimating regression coefficients with a lasso penalty. The previously known $\ell_2$ algorithm is based on cyclic coordinate descent. Our new $\ell_1$ algorithm is based on greedy coordinate descent and Edgeworth's algorithm for ordinary $\ell_1$ regression. Each algorithm relies on a tuning constant that can be chosen by cross-validation. In some regression problems it is natural to group parameters and penalize parameters group by group rather than separately. If the group penalty is proportional to the Euclidean norm of the parameters of the group, then it is possible to majorize the norm and reduce parameter estimation to $\ell_2$ regression with a lasso penalty. Thus, the existing algorithm can be extended to novel settings. Each of the algorithms discussed is tested via either simulated or real data or both. The Appendix proves that a greedy form of the $\ell_2$ algorithm converges to the minimum value of the objective function.

Journal Article•DOI•

Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays

Robert B. Scharpf¹, Giovanni Parmigiani¹, Jonathan Pevsner², Ingo Ruczinski¹•Institutions (2)

Johns Hopkins University¹, Kennedy Krieger Institute²

29 Jul 2008-arXiv: Applications

TL;DR: In this article, hidden Markov models (HMM) are used to detect chromosomal alterations in high-throughput SNP array data by modeling the spatial dependence between neighboring SNPs; confidence scores control smoothing in a probabilistic framework.

Abstract: Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g., aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and are therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g., the occurrence of regions of homozygosity). Hidden Markov models (HMM) are particularly useful for detecting such alterations, modeling the spatial dependence between neighboring SNPs. Here, we improve previous approaches that utilize HMM frameworks for inference in high throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and experimental data, we, in particular, demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package VanillaICE.

Journal Article•DOI•

Unsupervised empirical Bayesian multiple testing with external covariates

Egil Ferkingstad, Arnoldo Frigessi, Håvard Rue, Gudmar Thorleifsson, Augustine Kong

29 Jul 2008-arXiv: Applications

TL;DR: In this paper, covariate-based prior information is used to produce a list of significant hypotheses which differs in length and order from the list obtained by methods not taking covariate information into account, and the posterior probabilities of each null hypothesis are estimated using a fast approximate algorithm.

Abstract: In an empirical Bayesian setting, we provide a new multiple testing method, useful when an additional covariate is available, that influences the probability of each null hypothesis being true. We measure the posterior significance of each test conditionally on the covariate and the data, leading to greater power. Using covariate-based prior information in an unsupervised fashion, we produce a list of significant hypotheses which differs in length and order from the list obtained by methods not taking covariate-information into account. Covariate-modulated posterior probabilities of each null hypothesis are estimated using a fast approximate algorithm. The new method is applied to expression quantitative trait loci (eQTL) data.

Journal Article•DOI•

Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Yulan Liang, Arpad Kelemen

28 Mar 2008-arXiv: Applications

TL;DR: A review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases, covering general feature reduction approaches and SNP-specific approaches, including unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection.

Abstract: Recent advances in information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with a high dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, requiring researchers to convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci.

Journal Article•DOI•

Inference for the limiting cluster size distribution of extreme values

Christian Y. Robert

07 Oct 2008-arXiv: Applications

TL;DR: In this paper, the authors introduced estimators of the limiting cluster size probabilities, which are constructed through a recursive algorithm, and studied the asymptotic properties of the estimators and investigated their finite sample behavior on simulated data.

Abstract: Any limiting point process for the time normalized exceedances of high levels by a stationary sequence is necessarily compound Poisson under appropriate long range dependence conditions. Typically exceedances appear in clusters. The underlying Poisson points represent the cluster positions and the multiplicities correspond to the cluster sizes. In the present paper we introduce estimators of the limiting cluster size probabilities, which are constructed through a recursive algorithm. We derive estimators of the extremal index which plays a key role in determining the intensity of cluster positions. We study the asymptotic properties of the estimators and investigate their finite sample behavior on simulated data.

Journal Article•DOI•

A Semi-parametric Technique for the Quantitative Analysis of Dynamic Contrast-enhanced MR Images Based on Bayesian P-splines

Volker Schmid, Brandon Whitcher, Anwar R. Padhani, Guang-Zhong Yang

26 Jan 2008-arXiv: Applications

TL;DR: A semi-parametric penalized spline smoothing approach is proposed in which the AIF is convolved with a set of B-splines to produce a design matrix, using locally adaptive smoothing parameters based on Bayesian penalized spline models (P-splines).

Abstract: Dynamic Contrast-enhanced Magnetic Resonance Imaging (DCE-MRI) is an important tool for detecting subtle kinetic changes in cancerous tissue. Quantitative analysis of DCE-MRI typically involves the convolution of an arterial input function (AIF) with a nonlinear pharmacokinetic model of the contrast agent concentration. Parameters of the kinetic model are biologically meaningful, but the optimization of the nonlinear model has significant computational issues. In practice, convergence of the optimization algorithm is not guaranteed and the accuracy of the model fitting may be compromised. To overcome these problems, this paper proposes a semi-parametric penalized spline smoothing approach, with which the AIF is convolved with a set of B-splines to produce a design matrix using locally adaptive smoothing parameters based on Bayesian penalized spline models (P-splines). It has been shown that kinetic parameter estimation can be obtained from the resulting deconvolved response function, which also includes the onset of contrast enhancement. Detailed validation of the method, both with simulated and in vivo data, is provided.

Posted Content•

Causal Models for Estimating the Effects of Weight Gain on Mortality

James M. Robins¹•Institutions (1)

Harvard University¹

04 Feb 2008-arXiv: Applications

TL;DR: In this article, the authors estimate the counterfactual mortality of a cohort of 18 year old non-smoking American men on a stringent mandatory diet that guaranteed that no one would ever weigh more than their baseline weight established at age 18.

Abstract: Suppose, contrary to fact, in 1950, we had put the cohort of 18 year old non-smoking American men on a stringent mandatory diet that guaranteed that no one would ever weigh more than their baseline weight established at age 18. How would the counterfactual mortality of these 18 year olds have compared to their actual observed mortality through 2007? We describe in detail how this counterfactual contrast could be estimated from longitudinal epidemiologic data similar to that stored in the electronic medical records of a large HMO by applying g-estimation to a novel structural nested model. Our analytic approach differs from any alternative approach in that, in the absence of model misspecification, it can successfully adjust for (i) measured time-varying confounders such as exercise, hypertension and diabetes that are simultaneously intermediate variables on the causal pathway from weight gain to death and determinants of future weight gain, (ii) unmeasured confounding by undiagnosed preclinical disease (i.e., reverse causation) that can cause both poor weight gain and premature mortality [provided an upper bound can be specified for the maximum length of time a subject may suffer from a subclinical illness severe enough to affect his weight without the illness becoming clinically manifest], and (iii) the presence of particular identifiable subgroups, such as those suffering from serious renal, liver, pulmonary, and/or cardiac disease, in whom confounding by unmeasured prognostic factors is so severe as to render useless any attempt at direct analytic adjustment.

Journal Article•DOI•

A Sharper discrepancy measure for post-election audits

Philip B. Stark

11 Nov 2008-arXiv: Applications

TL;DR: For the 2006 U.S. Senate race in Minnesota, a test using MRO gives a $P$-value of 4.05% for the hypothesis that a full hand tally would find a different winner, less than half the value found by Stark [Ann. Appl. Statist. 2 (2008) 550--581].

Abstract: Post-election audits use the discrepancy between machine counts and a hand tally of votes in a random sample of precincts to infer whether error affected the electoral outcome. The maximum relative overstatement of pairwise margins (MRO) quantifies that discrepancy. The electoral outcome a full hand tally shows must agree with the apparent outcome if the MRO is less than 1. This condition is sharper than previous ones when there are more than two candidates or when voters may vote for more than one candidate. For the 2006 U.S. Senate race in Minnesota, a test using MRO gives a $P$-value of 4.05% for the hypothesis that a full hand tally would find a different winner, less than half the value Stark [Ann. Appl. Statist. 2 (2008) 550--581] finds.
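The abstract states the MRO condition without giving a formula, so the following is only a minimal sketch under stated assumptions. It assumes that the overstatement in a precinct is, for each (winner, loser) pair, the reported margin minus the hand-count margin, divided by the contest-wide reported margin for that pair, and that the per-precinct maxima are summed over precincts. The function names and the vote counts are hypothetical.

```python
from itertools import product


def contest_margins(reported_by_precinct, winners, losers):
    """Contest-wide reported margin for every (winner, loser) pair."""
    totals = {}
    for reported in reported_by_precinct:
        for candidate, votes in reported.items():
            totals[candidate] = totals.get(candidate, 0) + votes
    return {(w, l): totals[w] - totals[l] for w, l in product(winners, losers)}


def precinct_mro(reported, audited, winners, losers, margins):
    """Assumed definition: largest pairwise margin overstatement in one
    precinct, relative to the contest-wide margin of that pair."""
    return max(
        ((reported[w] - reported[l]) - (audited[w] - audited[l])) / margins[(w, l)]
        for w, l in product(winners, losers)
    )


# Hypothetical three-candidate contest with one apparent winner, "A".
winners, losers = ["A"], ["B", "C"]
reported = [{"A": 60, "B": 30, "C": 10}, {"A": 50, "B": 45, "C": 5}]
audited = [{"A": 58, "B": 32, "C": 10}, {"A": 50, "B": 45, "C": 5}]

margins = contest_margins(reported, winners, losers)
total_mro = sum(
    precinct_mro(r, a, winners, losers, margins) for r, a in zip(reported, audited)
)
outcome_confirmed = total_mro < 1  # the sufficient condition quoted in the abstract
```

In a real audit only a random sample of precincts is hand counted, so the total is not observed directly; the sketch only illustrates the quantity whose sample values feed the test.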

Journal Article•DOI•

Assessing surrogate endpoints in vaccine trials with case-cohort sampling and the Cox model

Li Qin¹, Peter B. Gilbert², Dean Follmann², Dongfeng Li²•Institutions (2)

Fred Hutchinson Cancer Research Center¹, Peking University²

27 Mar 2008-arXiv: Applications

TL;DR: In this article, the value of an immune response as a surrogate of protection in a randomized placebo-controlled trial with case-cohort sampling of immune responses and a time to event endpoint is evaluated.

Abstract: Assessing immune responses to study vaccines as surrogates of protection plays a central role in vaccine clinical trials. Motivated by three ongoing or pending HIV vaccine efficacy trials, we consider such surrogate endpoint assessment in a randomized placebo-controlled trial with case-cohort sampling of immune responses and a time to event endpoint. Based on the principal surrogate definition under the principal stratification framework proposed by Frangakis and Rubin [Biometrics 58 (2002) 21--29] and adapted by Gilbert and Hudgens (2006), we introduce estimands that measure the value of an immune response as a surrogate of protection in the context of the Cox proportional hazards model. The estimands are not identified because the immune response to vaccine is not measured in placebo recipients. We formulate the problem as a Cox model with missing covariates, and employ novel trial designs for predicting the missing immune responses and thereby identifying the estimands. The first design utilizes information from baseline predictors of the immune response, and bridges their relationship in the vaccine recipients to the placebo recipients. The second design provides a validation set for the unmeasured immune responses of uninfected placebo recipients by immunizing them with the study vaccine after trial closeout. A maximum estimated likelihood approach is proposed for estimation of the parameters. Simulated data examples are given to evaluate the proposed designs and study their properties.

Journal Article•DOI•

Statistical Challenges in the Analysis of Cosmic Microwave Background Radiation

Paolo Cabella, Domenico Marinucci

11 Jul 2008-arXiv: Applications

TL;DR: In this article, the authors review a number of open problems in CMB data analysis and present applications to observations from the WMAP mission.

Abstract: An enormous number of observations of Cosmic Microwave Background radiation has been collected in the last decade, and much more data are expected in the near future from planned or operating satellite missions. These datasets are a goldmine of information for Cosmology and Theoretical Physics; their efficient exploitation poses several intriguing challenges from the statistical point of view. In this paper we review a number of open problems in CMB data analysis and we present applications to observations from the WMAP mission.

Journal Article•DOI•

Optimal factorial designs for cDNA microarray experiments

Tathagata Banerjee, Rahul Mukerjee

27 Mar 2008-arXiv: Applications

TL;DR: In this paper, the authors consider cDNA microarray experiments when the cell populations have a factorial structure, and investigate the problem of their optimal designing under a baseline parametrization where the objects of interest differ from those under the more common orthogonal parametrization.

Abstract: We consider cDNA microarray experiments when the cell populations have a factorial structure, and investigate the problem of their optimal designing under a baseline parametrization where the objects of interest differ from those under the more common orthogonal parametrization. First, analytical results are given for the $2\times 2$ factorial. Since practical applications often involve a more complex factorial structure, we next explore general factorials and obtain a collection of optimal designs in the saturated, that is, most economic, case. This, in turn, is seen to yield an approach for finding optimal or efficient designs in the practically more important nearly saturated cases. Thereafter, the findings are extended to the more intricate situation where the underlying model incorporates dye-coloring effects, and the role of dye-swapping is critically examined.

Posted Content•

Hierarchical Additive Modeling of Nonlinear Association with Spatial Correlations-An Application to Relate Alcohol Outlet Density and Neighborhood Assault Rates

Qingzhao Yu, Bin Li, Richard Scribner, Deborah A. Cohen

04 Feb 2008-arXiv: Applications

TL;DR: This paper proposes a hierarchical additive model, where the nonlinear correlations and the complex interaction effects are modeled using the multiple additive regression trees and the residual spatial association in the assault rates that cannot be explained in the model are smoothed using a conditional autoregressive (CAR) method.

Abstract: Previous studies have suggested a link between alcohol outlets and assaultive violence. In this paper, we explore the effects of alcohol availability on assault crimes at the census tract level over time. The statistical analysis is challenged by several features of the data: (1) the effects of possible covariates (for example, the alcohol outlet density of each census tract) on the assaultive crime rates may be complex; (2) the covariates may be highly correlated with each other; (3) there are many missing inputs in the data; and (4) spatial correlations exist in the outcome assaultive crime rates. We propose a hierarchical additive model, where the nonlinear correlations and the complex interaction effects are modeled using the multiple additive regression trees (MART) and the spatial variances in the assaultive rates that cannot be explained by the specified covariates are smoothed through the Conditional Autoregressive (CAR) model. We develop a two-stage algorithm that connects the non-parametric trees with CAR to look for important covariates associated with the assaultive crime rates, while taking account of the spatial correlations among adjacent census tracts. The proposed methods are applied to the Los Angeles assaultive data (1990-1999) and compared with traditional methods.

Posted Content•

Analysis of the Effect of Speed Limit Increases on Accident-Injury Severities

Nataliya V. Malyshkina, Fred L. Mannering

08 Jun 2008-arXiv: Applications

TL;DR: In this paper, the influence of the posted speed limit on the severity of vehicle accidents is studied using Indiana accident data from 2004 and 2006 (the year after speed limits were raised on rural interstates and some multi-lane non-interstate routes).

Abstract: The influence of speed limits on roadway safety has been a subject of continuous debate in the State of Indiana and nationwide. In Indiana, highway-related accidents result in about 900 fatalities and forty thousand injuries annually and place an incredible social and economic burden on the state. Still, speed limits posted on highways and other roads are routinely exceeded as individual drivers try to balance safety, mobility (speed), and the risks and penalties associated with law enforcement efforts. The speed-limit/safety issue has been a matter of considerable concern in Indiana since the state raised its speed limits on rural interstates and selected multilane highways on July 1, 2005. In this paper, the influence of the posted speed limit on the severity of vehicle accidents is studied using Indiana accident data from 2004 (the year before speed limits were raised) and 2006 (the year after speed limits were raised on rural interstates and some multi-lane non-interstate routes). Statistical models of the injury severity of different types of accidents on various roadway classes were estimated. The results of the model estimations showed that, for the speed limit ranges currently used, speed limits did not have a statistically significant effect on the severity of accidents on interstate highways. However, for some non-interstate highways, higher speed limits were found to be associated with higher accident severities, suggesting that future speed limit changes, on non-interstate highways in particular, need to be carefully assessed on a case-by-case basis.

Journal Article•DOI•

Estimating limits from Poisson counting data using Dempster--Shafer analysis

Paul T. Edlefsen, Chuanhai Liu, Arthur P. Dempster

09 Dec 2008-arXiv: Applications

TL;DR: The Poisson Dempster--Shafer model (DSM) is used to derive a posterior DSM for the ``Banff upper limits challenge'' three-Poisson model and it is argued that the reduced dependence on priors afforded by the Dempster--Shafer framework is both practically and theoretically desirable.

Abstract: We present a Dempster--Shafer (DS) approach to estimating limits from Poisson counting data with nuisance parameters. Dempster--Shafer is a statistical framework that generalizes Bayesian statistics. DS calculus augments traditional probability by allowing mass to be distributed over power sets of the event space. This eliminates the Bayesian dependence on prior distributions while allowing the incorporation of prior information when it is available. We use the Poisson Dempster--Shafer model (DSM) to derive a posterior DSM for the ``Banff upper limits challenge'' three-Poisson model. The results compare favorably with other approaches, demonstrating the utility of the approach. We argue that the reduced dependence on priors afforded by the Dempster--Shafer framework is both practically and theoretically desirable.
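The statement that DS calculus distributes mass over power sets of the event space can be illustrated with a small sketch of the standard belief and plausibility functions. This is generic Dempster--Shafer bookkeeping, not the Poisson DSM of the paper; the two-element event space and the mass values are hypothetical.

```python
def belief(mass, event):
    """Total mass committed to subsets of the event (evidence that implies it)."""
    return sum(m for focal, m in mass.items() if focal <= event)


def plausibility(mass, event):
    """Total mass on sets that intersect the event (evidence that does not rule it out)."""
    return sum(m for focal, m in mass.items() if focal & event)


# Hypothetical mass assignment over subsets of {"signal", "background"}.
# Mass on the full set expresses "don't know", which a single probability
# distribution over the two outcomes cannot represent.
mass = {
    frozenset({"signal"}): 0.5,
    frozenset({"background"}): 0.2,
    frozenset({"signal", "background"}): 0.3,
}

event = frozenset({"signal"})
lower = belief(mass, event)
upper = plausibility(mass, event)
```

Here the belief in "signal" is the mass on that singleton alone, while the plausibility also counts the mass left on the full set; the gap between the two reflects the uncommitted mass.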

Posted Content•

Zero-state Markov switching count-data models: an empirical assessment

Nataliya V. Malyshkina¹, Fred L. Mannering¹•Institutions (1)

Purdue University¹

21 Nov 2008-arXiv: Applications

TL;DR: In this article, a two-state Markov switching count-data model is proposed as an alternative to zero-inflated models to account for the preponderance of zeros sometimes observed in transportation count data, such as the number of accidents occurring on a roadway segment over some period of time.

Abstract: In this study, a two-state Markov switching count-data model is proposed as an alternative to zero-inflated models to account for the preponderance of zeros sometimes observed in transportation count data, such as the number of accidents occurring on a roadway segment over some period of time. For this accident-frequency case, zero-inflated models assume the existence of two states: one of the states is a zero-accident count state, in which accident probabilities are so low that they cannot be statistically distinguished from zero, and the other state is a normal count state, in which counts can be non-negative integers that are generated by some counting process, for example, a Poisson or negative binomial. In contrast to zero-inflated models, Markov switching models allow specific roadway segments to switch between the two states over time. An important advantage of this Markov switching approach is that it allows for the direct statistical estimation of the specific roadway-segment state (i.e., zero or count state) whereas traditional zero-inflated models do not. To demonstrate the applicability of this approach, a two-state Markov switching negative binomial model (estimated with Bayesian inference) and standard zero-inflated negative binomial models are estimated using five-year accident frequencies on Indiana interstate highway segments. It is shown that the Markov switching model is a viable alternative and results in a superior statistical fit relative to the zero-inflated models.
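The data-generating process described above can be sketched as a simulation for a single roadway segment. This is only an illustration under stated assumptions: it uses a Poisson count state for brevity where the study estimates a negative binomial, it does not perform the Bayesian estimation, and the transition probabilities and rate are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_segment(rng, periods, p_stay_zero, p_stay_count, lam):
    """Simulate one segment that switches between a zero state and a count state."""
    state = "zero"
    states, counts = [], []
    for _ in range(periods):
        states.append(state)
        # Zero state: no accidents. Count state: draw from the counting process.
        counts.append(0 if state == "zero" else sample_poisson(rng, lam))
        stay = p_stay_zero if state == "zero" else p_stay_count
        if rng.random() >= stay:
            state = "count" if state == "zero" else "zero"
    return states, counts


rng = random.Random(0)
states, counts = simulate_segment(rng, periods=50, p_stay_zero=0.9, p_stay_count=0.8, lam=2.0)
```

A zero-inflated model would instead fix each segment's state assignment mechanism over the whole observation window; the switching in `simulate_segment` is what lets the state vary by period.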

Book Chapter•DOI•

Characteristics of hand and machine-assigned scores to college students’ answers to open-ended tasks

Stephen P. Klein

01 Jan 2008-arXiv: Applications

TL;DR: It is demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.

Abstract: Assessment of learning in higher education is a critical concern to policy makers, educators, parents, and students. And, doing so appropriately is likely to require including constructed response tests in the assessment system. We examined whether scoring costs and other concerns with using open-end measures on a large scale (e.g., turnaround time and inter-reader consistency) could be addressed by machine grading the answers. Analyses with 1359 students from 14 colleges found that two human readers agreed highly with each other in the scores they assigned to the answers to three types of open-ended questions. These reader assigned scores also agreed highly with those assigned by a computer. The correlations of the machine-assigned scores with SAT scores, college grades, and other measures were comparable to the correlations of these variables with the hand-assigned scores. Machine scoring did not widen differences in mean scores between racial/ethnic or gender groups. Our findings demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.

Book Chapter•DOI•

Three months journeying of a Hawaiian monk seal

David R. Brillinger, Brent S. Stewart, Charles L. Littnan

20 May 2008-arXiv: Applications

TL;DR: In this paper, a stochastic differential equation (SDE) was used to estimate the length and locations of a monk seal's foraging trips in a relatively shallow offshore submerged bank.

Abstract: Hawaiian monk seals (Monachus schauinslandi) are endemic to the Hawaiian Islands and are the most endangered species of marine mammal that lives entirely within the jurisdiction of the United States. The species numbers around 1300 and has been declining owing, among other things, to poor juvenile survival which is evidently related to poor foraging success. Consequently, data have been collected recently on the foraging habitats, movements, and behaviors of monk seals throughout the Northwestern and main Hawaiian Islands. Our work here is directed to exploring a data set located in a relatively shallow offshore submerged bank (Penguin Bank) in our search of a model for a seal's journey. The work ends by fitting a stochastic differential equation (SDE) that mimics some aspects of the behavior of seals by working with location data collected for one seal. The SDE is found by developing a time varying potential function with two points of attraction. The times of location are irregularly spaced and not close together geographically, leading to some difficulties of interpretation. Synthetic plots generated using the model are employed to assess its reasonableness spatially and temporally. One aspect is that the animal stays mainly southwest of Molokai. The work led to the estimation of the lengths and locations of the seal's foraging trips.
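An SDE driven by a time-varying potential with two points of attraction can be simulated with the Euler-Maruyama scheme. The sketch below is not the fitted model: it assumes a quadratic potential H(r, t) = 0.5 * pull * |r - a(t)|^2, an attraction point a(t) that alternates on a fixed schedule, and hypothetical coordinates and parameter values.

```python
import math
import random


def attractor(t, home, forage, period):
    """Hypothetical schedule: the active point of attraction alternates every half period."""
    return home if (t % period) < period / 2 else forage


def simulate_track(rng, start, home, forage, period, pull, sigma, dt, steps):
    """Euler-Maruyama steps for dr = -grad H(r, t) dt + sigma dB."""
    x, y = start
    track = [(x, y)]
    for i in range(steps):
        ax, ay = attractor(i * dt, home, forage, period)
        # For the assumed quadratic potential, -grad H = -pull * (r - a(t)).
        x += -pull * (x - ax) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        y += -pull * (y - ay) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        track.append((x, y))
    return track


rng = random.Random(1)
track = simulate_track(
    rng, start=(5.0, 5.0), home=(0.0, 0.0), forage=(10.0, 0.0),
    period=20.0, pull=1.0, sigma=0.5, dt=0.1, steps=400,
)
```

Plotting several such synthetic tracks against the observed locations is one way to judge, as the abstract describes, whether a model of this kind is spatially and temporally reasonable.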
