Showing papers in "arXiv: Applications in 2008"


Journal ArticleDOI
TL;DR: Random survival forests (RSF) is a random forests method for the analysis of right-censored survival data; the paper introduces new survival splitting rules, a missing data algorithm, and a conservation-of-events principle used to define ensemble mortality.
Abstract: We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.

1,562 citations
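
The conservation-of-events idea mentioned in the abstract above can be illustrated on a single sample. The sketch below assumes the Nelson-Aalen estimator as the cumulative hazard (the abstract does not name the estimator) and uses hypothetical toy data; it is not the randomSurvivalForest implementation. Summing the estimated cumulative hazard over every individual's observed time recovers the number of observed deaths.

```python
from collections import Counter


def nelson_aalen(times, events):
    """Return the Nelson-Aalen cumulative hazard H(t) as a callable.

    times  -- observed follow-up time for each individual
    events -- 1 if the individual died at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    # Hazard increment d_l / Y_l at each distinct death time t_l, where
    # Y_l counts individuals still at risk (observed time >= t_l).
    increments = [
        (t, deaths[t] / sum(1 for u in times if u >= t))
        for t in sorted(deaths)
    ]

    def cumulative_hazard(t):
        return sum(inc for s, inc in increments if s <= t)

    return cumulative_hazard


# Hypothetical toy data: five individuals, three deaths, two censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
H = nelson_aalen(times, events)

# Conservation of events: the cumulative hazard summed over all observed
# times equals the total number of deaths.
total_hazard = sum(H(t) for t in times)
print(total_hazard, sum(events))
```

In this reading, H evaluated at an individual's time acts as that individual's expected number of events, which is what makes a summed quantity such as ensemble mortality interpretable.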


Journal ArticleDOI
TL;DR: In this paper, the authors argue that observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data, and they use the framework of potential outcomes to define causal effects.
Abstract: For obtaining causal inferences that are objective, and therefore have the best chance of revealing scientific truths, carefully designed and executed randomized experiments are generally considered to be the gold standard. Observational studies, in contrast, are generally fraught with problems that compromise any claim for objectivity of the resulting causal inferences. The thesis here is that observational studies have to be carefully designed to approximate randomized experiments, in particular, without examining any final outcome data. Often a candidate data set will have to be rejected as inadequate because of lack of data on key covariates, or because of lack of overlap in the distributions of key covariates between treatment and control groups, often revealed by careful propensity score analyses. Sometimes the template for the approximating randomized experiment will have to be altered, and the use of principal stratification can be helpful in doing this. These issues are discussed and illustrated using the framework of potential outcomes to define causal effects, which greatly clarifies critical issues.

640 citations


Journal ArticleDOI
TL;DR: In this article, a linear combination of simple rules derived from the data is used for general regression and classification models, where each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables.
Abstract: General regression and classification models are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables. These rule ensembles are shown to produce predictive accuracy comparable to the best methods. However, their principal advantage lies in interpretation. Because of its simple form, each rule is easy to understand, as is its influence on individual predictions, selected subsets of predictions, or globally over the entire space of joint input variable values. Similarly, the degree of relevance of the respective input variables can be assessed globally, locally in different regions of the input space, or at individual prediction points. Techniques are presented for automatically identifying those variables that are involved in interactions with other variables, the strength and degree of those interactions, as well as the identities of the other variables with which they interact. Graphical representations are used to visualize both main and interaction effects.

515 citations
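
The structure described in the abstract above, a linear combination of rules where each rule is a conjunction of simple statements about individual inputs, can be sketched as follows. The variable names, thresholds, and coefficients are hypothetical and chosen only for illustration; how the rules and weights are derived from data is not shown.

```python
def make_rule(conditions):
    """Build a rule: a conjunction of simple statements on single inputs.

    conditions -- list of (variable_name, lower, upper) triples; the rule
    fires (returns 1) only if lower <= x[variable_name] < upper for all.
    """
    def rule(x):
        return int(all(lo <= x[name] < hi for name, lo, hi in conditions))
    return rule


def predict(x, intercept, weighted_rules):
    """Linear combination of rules: intercept + sum of a_k * r_k(x)."""
    return intercept + sum(a * r(x) for a, r in weighted_rules)


INF = float("inf")
# Hypothetical rules and coefficients, not taken from the paper.
r1 = make_rule([("age", 50, INF), ("bmi", 30, INF)])
r2 = make_rule([("smoker", 1, INF)])
model = [(1.5, r1), (0.8, r2)]

x = {"age": 62, "bmi": 33, "smoker": 0}
score = predict(x, 0.2, model)
# Per-rule contributions show which rules drive this one prediction.
contributions = [a * r(x) for a, r in model]
print(score, contributions)
```

Because each rule is either on or off for a given input, its contribution to a prediction is simply its coefficient or zero, which is the source of the interpretability the abstract emphasizes.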


Posted Content
TL;DR: The proposed method remMap - REgularized Multivariate regression for identifying MAster Predictors - for fitting multivariate response regression models under the high-dimension-low-sample-size setting is applied to a breast cancer study, in which genome wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples.
Abstract: In this paper, we propose a new method remMap -- REgularized Multivariate regression for identifying MAster Predictors -- for fitting multivariate response regression models under the high-dimension-low-sample-size setting. remMap is motivated by investigating the regulatory relationships among different biological molecules based on multiple types of high dimensional genomic data. Particularly, we are interested in studying the influence of DNA copy number alterations on RNA transcript levels. For this purpose, we model the dependence of the RNA expression levels on DNA copy numbers through multivariate linear regressions and utilize proper regularizations to deal with the high dimensionality as well as to incorporate desired network structures. Criteria for selecting the tuning parameters are also discussed. The performance of the proposed method is illustrated through extensive simulation studies. Finally, remMap is applied to a breast cancer study, in which genome wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples. We identify a trans-hub region in cytoband 17q12-q21, whose amplification influences the RNA expression levels of more than 30 unlinked genes. These findings may lead to a better understanding of breast cancer pathology.

251 citations


Posted Content
TL;DR: In this article, an adaptive sequential design framework was developed to cope with an asynchronous, random, agent-based supercomputing environment, by using a hybrid approach that melds optimal strategies from the statistics literature with flexible strategies from the active learning literature.
Abstract: Computer experiments are often performed to allow modeling of a response surface of a physical experiment that can be too costly or difficult to run except using a simulator. Running the experiment over a dense grid can be prohibitively expensive, yet running over a sparse design chosen in advance can result in obtaining insufficient information in parts of the space, particularly when the surface calls for a nonstationary model. We propose an approach that automatically explores the space while simultaneously fitting the response surface, using predictive uncertainty to guide subsequent experimental runs. The newly developed Bayesian treed Gaussian process is used as the surrogate model, and a fully Bayesian approach allows explicit measures of uncertainty. We develop an adaptive sequential design framework to cope with an asynchronous, random, agent-based supercomputing environment, by using a hybrid approach that melds optimal strategies from the statistics literature with flexible strategies from the active learning literature. The merits of this approach are borne out in several examples, including the motivating computational fluid dynamics simulation of a rocket booster.

189 citations


Posted Content
TL;DR: In this article, a specific estimation procedure is developed to adjust a Gaussian process model in complex cases (non linear relations, highly dispersed or discontinuous output, high dimensional input, inadequate sampling designs,...).
Abstract: Complex computer codes are often too time expensive to be directly used to perform uncertainty propagation studies, global sensitivity analysis or to solve optimization problems. A well known and widely used method to circumvent this inconvenience consists in replacing the complex computer code by a reduced model, called a metamodel, or a response surface that represents the computer code and requires acceptable calculation time. One particular class of metamodels is studied: the Gaussian process model that is characterized by its mean and covariance functions. A specific estimation procedure is developed to adjust a Gaussian process model in complex cases (non linear relations, highly dispersed or discontinuous output, high dimensional input, inadequate sampling designs, ...). The efficiency of this algorithm is compared to the efficiency of other existing algorithms on an analytical test case. The proposed methodology is also illustrated for the case of a complex hydrogeological computer code, simulating radionuclide transport in groundwater.

180 citations


Journal ArticleDOI
TL;DR: In this paper, an algorithm for the cubic monotone case is proposed, and the method is extended to convex constraints and variants such as increasing-concave; the restricted versions have smaller squared error loss than the unrestricted splines.
Abstract: Regression splines are smooth, flexible, and parsimonious nonparametric function estimators. They are known to be sensitive to knot number and placement, but if assumptions such as monotonicity or convexity may be imposed on the regression function, the shape-restricted regression splines are robust to knot choices. Monotone regression splines were introduced by Ramsay [Statist. Sci. 3 (1988) 425--461], but were limited to quadratic and lower order. In this paper an algorithm for the cubic monotone case is proposed, and the method is extended to convex constraints and variants such as increasing-concave. The restricted versions have smaller squared error loss than the unrestricted splines, although they have the same convergence rates. The relatively small degrees of freedom of the model and the insensitivity of the fits to the knot choices allow for practical inference methods; the computational efficiency allows for back-fitting of additive models. Tests of constant versus increasing and linear versus convex regression function, when implemented with shape-restricted regression splines, have higher power than the standard version using ordinary shape-restricted regression.

151 citations


Posted Content
TL;DR: The estimated two-state Markov switching models give a superior statistical fit relative to the standard (single-state) negative binomial model; the more frequent state is found to be safer and to be correlated with better weather conditions.
Abstract: In this paper, two-state Markov switching models are proposed to study accident frequencies. These models assume that there are two unobserved states of roadway safety, and that roadway entities (roadway segments) can switch between these states over time. The states are distinct, in the sense that in the different states accident frequencies are generated by separate counting processes (by separate Poisson or negative binomial processes). To demonstrate the applicability of the approach presented herein, two-state Markov switching negative binomial models are estimated using five-year accident frequencies on Indiana interstate highway segments. Bayesian inference methods and Markov Chain Monte Carlo (MCMC) simulations are used for model estimation. The estimated Markov switching models result in a superior statistical fit relative to the standard (single-state) negative binomial model. It is found that the more frequent state is safer and it is correlated with better weather conditions. The less frequent state is found to be less safe and to be correlated with adverse weather conditions.

149 citations
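
The notion of a "more frequent" state in the abstract above can be made concrete with the long-run behavior of a two-state Markov chain. The sketch below is a minimal illustration under assumed, hypothetical switching probabilities; it does not reproduce the paper's Bayesian MCMC estimation, only the standard stationary distribution of a two-state chain.

```python
def stationary_probabilities(p01, p10):
    """Long-run share of time a two-state Markov chain spends in each state.

    p01 -- probability of switching from state 0 to state 1 in one period
    p10 -- probability of switching from state 1 to state 0 in one period
    """
    if not (0 < p01 <= 1 and 0 < p10 <= 1):
        raise ValueError("switching probabilities must lie in (0, 1]")
    total = p01 + p10
    return p10 / total, p01 / total


# Hypothetical transition probabilities, not estimates from the paper.
pi0, pi1 = stationary_probabilities(p01=0.1, p10=0.4)
more_frequent_state = 0 if pi0 >= pi1 else 1
print(pi0, pi1, more_frequent_state)
```

With these illustrative values the chain spends about four fifths of its time in state 0, so state 0 would play the role of the more frequent state.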


Posted Content
TL;DR: Two-state Markov switching multinomial logit models are proposed for statistical modeling of accident-injury severities, and it is found that the more frequent state of roadway safety is correlated with better weather conditions and that the less frequent state is correlated with adverse weather conditions.
Abstract: In this study, two-state Markov switching multinomial logit models are proposed for statistical modeling of accident injury severities. These models assume Markov switching in time between two unobserved states of roadway safety. The states are distinct, in the sense that in different states accident severity outcomes are generated by separate multinomial logit processes. To demonstrate the applicability of the approach presented herein, two-state Markov switching multinomial logit models are estimated for severity outcomes of accidents occurring on Indiana roads over a four-year time interval. Bayesian inference methods and Markov Chain Monte Carlo (MCMC) simulations are used for model estimation. The estimated Markov switching models result in a superior statistical fit relative to the standard (single-state) multinomial logit models. It is found that the more frequent state of roadway safety is correlated with better weather conditions. The less frequent state is found to be correlated with adverse weather conditions.

129 citations


Journal ArticleDOI
TL;DR: A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses; it allows efficient testing within small subclasses and has applications to "enrichment," the detection of multi-case effects.
Abstract: Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance, a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to "enrichment," the detection of multi-case effects.

105 citations


Journal ArticleDOI
TL;DR: This paper proposes a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available, and applies distance-based clustering methods to eBay online auction data.
Abstract: We propose a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available. Such data occur commonly in longitudinal studies and online trading data. A distance measure then makes it possible to apply distance-based analysis such as classification, clustering and multidimensional scaling for irregularly sampled longitudinal data. Once a suitable distance measure for sparsely sampled longitudinal trajectories has been found, we apply distance-based clustering methods to eBay online auction data. We identify six distinct clusters of bidding patterns. Each of these bidding patterns is found to be associated with a specific chance to obtain the auctioned item at a reasonable price.

Posted Content
TL;DR: In this paper, the mean and the dispersion of the code outputs are modeled jointly using two interlinked Generalized Linear Models (GLM) or Generalized Additive Models (GAM); the mean model yields the sensitivity indices of each scalar input variable, and the dispersion model yields the total sensitivity index of the functional input variables.
Abstract: Global sensitivity analysis is used to quantify the influence of uncertain input parameters on the response variability of a numerical model. The common quantitative methods are applicable to computer codes with scalar input variables. This paper aims to illustrate different variance-based sensitivity analysis techniques, based on the so-called Sobol indices, when some input variables are functional, such as stochastic processes or random spatial fields. In this work, we focus on large CPU time computer codes which need a preliminary meta-modeling step before performing the sensitivity analysis. We propose the use of the joint modeling approach, i.e., modeling simultaneously the mean and the dispersion of the code outputs using two interlinked Generalized Linear Models (GLM) or Generalized Additive Models (GAM). The "mean" model allows estimation of the sensitivity indices of each scalar input variable, while the "dispersion" model allows derivation of the total sensitivity index of the functional input variables. The proposed approach is compared to some classical SA methodologies on an analytical function. Lastly, the proposed methodology is applied to a concrete industrial computer code that simulates the nuclear fuel irradiation.
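
The roles of the mean and dispersion models in the abstract above can be written out with the law of total variance. The notation here is assumed for illustration and is not taken from the paper: $X = (X_1, \dots, X_p)$ are the scalar inputs, $\varepsilon$ stands for the functional input, and $Y = f(X, \varepsilon)$ is the code output.

```latex
% Assumed notation: Y_m is the "mean" component, Y_d the "dispersion" component.
\begin{align*}
Y_m(X) &= \mathbb{E}[Y \mid X], \qquad Y_d(X) = \operatorname{Var}(Y \mid X), \\
\operatorname{Var}(Y) &= \operatorname{Var}\bigl(Y_m(X)\bigr) + \mathbb{E}\bigl[Y_d(X)\bigr], \\
S_i &= \frac{\operatorname{Var}\bigl(\mathbb{E}[Y \mid X_i]\bigr)}{\operatorname{Var}(Y)}
     = \frac{\operatorname{Var}\bigl(\mathbb{E}[Y_m(X) \mid X_i]\bigr)}{\operatorname{Var}(Y)}, \\
S_{T_\varepsilon} &= 1 - \frac{\operatorname{Var}\bigl(\mathbb{E}[Y \mid X]\bigr)}{\operatorname{Var}(Y)}
     = \frac{\mathbb{E}\bigl[Y_d(X)\bigr]}{\operatorname{Var}(Y)}.
\end{align*}
```

Under this reading, the first-order Sobol index $S_i$ of a scalar input depends only on the mean component, while the total index of the functional input is the share of output variance left unexplained by the scalar inputs, which is what the dispersion component captures.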

Journal ArticleDOI
TL;DR: The empirical results demonstrate how forecasting and dynamic updating of call arrival rates can affect the accuracy of call center staffing.
Abstract: We consider forecasting the latent rate profiles of a time series of inhomogeneous Poisson processes. The work is motivated by operations management of queueing systems, in particular, telephone call centers, where accurate forecasting of call arrival rates is a crucial primitive for efficient staffing of such centers. Our forecasting approach utilizes dimension reduction through a factor analysis of Poisson variables, followed by time series modeling of factor score series. Time series forecasts of factor scores are combined with factor loadings to yield forecasts of future Poisson rate profiles. Penalized Poisson regressions on factor loadings guided by time series forecasts of factor scores are used to generate dynamic within-process rate updating. Methods are also developed to obtain distributional forecasts. Our methods are illustrated using simulation and real data. The empirical results demonstrate how forecasting and dynamic updating of call arrival rates can affect the accuracy of call center staffing.

Journal ArticleDOI
TL;DR: This paper tests two exceptionally fast algorithms for estimating regression coefficients with a lasso penalty and proves that a greedy form of the $\ell_2$ algorithm converges to the minimum value of the objective function.
Abstract: Imposition of a lasso penalty shrinks parameter estimates toward zero and performs continuous model selection. Lasso penalized regression is capable of handling linear regression problems where the number of predictors far exceeds the number of cases. This paper tests two exceptionally fast algorithms for estimating regression coefficients with a lasso penalty. The previously known $\ell_2$ algorithm is based on cyclic coordinate descent. Our new $\ell_1$ algorithm is based on greedy coordinate descent and Edgeworth's algorithm for ordinary $\ell_1$ regression. Each algorithm relies on a tuning constant that can be chosen by cross-validation. In some regression problems it is natural to group parameters and penalize parameters group by group rather than separately. If the group penalty is proportional to the Euclidean norm of the parameters of the group, then it is possible to majorize the norm and reduce parameter estimation to $\ell_2$ regression with a lasso penalty. Thus, the existing algorithm can be extended to novel settings. Each of the algorithms discussed is tested via either simulated or real data or both. The Appendix proves that a greedy form of the $\ell_2$ algorithm converges to the minimum value of the objective function.
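
Cyclic coordinate descent for lasso-penalized least squares, the idea behind the $\ell_2$ algorithm named in the abstract above, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the objective $\frac{1}{2}\|y - Xb\|^2 + \lambda\|b\|_1$ with no intercept, a fixed number of sweeps instead of a convergence test, and a hypothetical toy design.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cyclic_cd(X, y, lam, sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b, with b = 0 initially
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue  # an all-zero column cannot reduce the loss
            # Inner product of column j with the partial residual that
            # excludes predictor j's current contribution.
            rho = sum(c * r for c, r in zip(col, residual)) + norm_sq * b[j]
            new_bj = soft_threshold(rho, lam) / norm_sq
            # Keep the residual in sync with the updated coefficient.
            delta = new_bj - b[j]
            residual = [r - c * delta for r, c in zip(residual, col)]
            b[j] = new_bj
    return b


# Hypothetical orthogonal design: each coordinate update is exact here.
coef = lasso_cyclic_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0)
print(coef)
```

The second coefficient is set exactly to zero because its inner product with the residual falls below the penalty, which is the continuous model selection behavior the abstract describes. In practice the tuning constant `lam` would be chosen by cross-validation, as the abstract notes.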

Journal ArticleDOI
TL;DR: In this article, hidden Markov models (HMMs) are used to detect copy number and genotype alterations in SNP array data by modeling the spatial dependence between neighboring SNPs, and confidence scores are shown to control smoothing in a probabilistic framework.
Abstract: Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g., aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and are therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g., the occurrence of regions of homozygosity). Hidden Markov models (HMM) are particularly useful for detecting such alterations, modeling the spatial dependence between neighboring SNPs. Here, we improve previous approaches that utilize HMM frameworks for inference in high throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and experimental data, we, in particular, demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package VanillaICE.

Journal ArticleDOI
TL;DR: In this paper, covariate-based prior information is used to produce a list of significant hypotheses which differs in length and order from the list obtained by methods not taking covariate information into account, and the posterior probabilities of each null hypothesis are estimated using a fast approximate algorithm.
Abstract: In an empirical Bayesian setting, we provide a new multiple testing method, useful when an additional covariate is available, that influences the probability of each null hypothesis being true. We measure the posterior significance of each test conditionally on the covariate and the data, leading to greater power. Using covariate-based prior information in an unsupervised fashion, we produce a list of significant hypotheses which differs in length and order from the list obtained by methods not taking covariate-information into account. Covariate-modulated posterior probabilities of each null hypothesis are estimated using a fast approximate algorithm. The new method is applied to expression quantitative trait loci (eQTL) data.

Journal ArticleDOI
TL;DR: A review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases, covering general feature reduction approaches and more specific approaches for SNP data, including unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection.
Abstract: Recent advances of information technology in biomedical sciences and other applied areas have created numerous large diverse data sets with a high dimensional feature space, which provide us with a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, great challenges are also created, driven by the continuous arrival of new data that require researchers to convert these raw data into scientific knowledge in order to benefit from it. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods with an emphasis on how to identify interacting loci.

Journal ArticleDOI
TL;DR: In this paper, the authors introduced estimators of the limiting cluster size probabilities, which are constructed through a recursive algorithm, and studied the asymptotic properties of the estimators and investigated their finite sample behavior on simulated data.
Abstract: Any limiting point process for the time normalized exceedances of high levels by a stationary sequence is necessarily compound Poisson under appropriate long range dependence conditions. Typically exceedances appear in clusters. The underlying Poisson points represent the cluster positions and the multiplicities correspond to the cluster sizes. In the present paper we introduce estimators of the limiting cluster size probabilities, which are constructed through a recursive algorithm. We derive estimators of the extremal index which plays a key role in determining the intensity of cluster positions. We study the asymptotic properties of the estimators and investigate their finite sample behavior on simulated data.

Journal ArticleDOI
TL;DR: This paper proposes a semi-parametric penalized spline smoothing approach, in which the AIF is convolved with a set of B-splines to produce a design matrix, using locally adaptive smoothing parameters based on Bayesian penalized spline models (P-splines).
Abstract: Dynamic Contrast-enhanced Magnetic Resonance Imaging (DCE-MRI) is an important tool for detecting subtle kinetic changes in cancerous tissue. Quantitative analysis of DCE-MRI typically involves the convolution of an arterial input function (AIF) with a nonlinear pharmacokinetic model of the contrast agent concentration. Parameters of the kinetic model are biologically meaningful, but the optimization of the non-linear model has significant computational issues. In practice, convergence of the optimization algorithm is not guaranteed and the accuracy of the model fitting may be compromised. To overcome these problems, this paper proposes a semi-parametric penalized spline smoothing approach, with which the AIF is convolved with a set of B-splines to produce a design matrix using locally adaptive smoothing parameters based on Bayesian penalized spline models (P-splines). It has been shown that kinetic parameter estimation can be obtained from the resulting deconvolved response function, which also includes the onset of contrast enhancement. Detailed validation of the method, both with simulated and in vivo data, is provided.

Posted Content
TL;DR: In this article, the authors estimate the counterfactual mortality of a cohort of 18 year old non-smoking American men on a stringent mandatory diet that guaranteed that no one would ever weigh more than their baseline weight established at age 18.
Abstract: Suppose, contrary to fact, in 1950, we had put the cohort of 18 year old non-smoking American men on a stringent mandatory diet that guaranteed that no one would ever weigh more than their baseline weight established at age 18. How would the counterfactual mortality of these 18 year olds have compared to their actual observed mortality through 2007? We describe in detail how this counterfactual contrast could be estimated from longitudinal epidemiologic data similar to that stored in the electronic medical records of a large HMO by applying g-estimation to a novel structural nested model. Our analytic approach differs from any alternative approach in that, in the absence of model misspecification, it can successfully adjust for (i) measured time-varying confounders such as exercise, hypertension and diabetes that are simultaneously intermediate variables on the causal pathway from weight gain to death and determinants of future weight gain, (ii) unmeasured confounding by undiagnosed preclinical disease (i.e., reverse causation) that can cause both poor weight gain and premature mortality [provided an upper bound can be specified for the maximum length of time a subject may suffer from a subclinical illness severe enough to affect his weight without the illness becoming clinically manifest], and (iii) the presence of particular identifiable subgroups, such as those suffering from serious renal, liver, pulmonary, and/or cardiac disease, in whom confounding by unmeasured prognostic factors is so severe as to render useless any attempt at direct analytic adjustment.

Journal ArticleDOI
TL;DR: For the 2006 U.S. Senate race in Minnesota, a test using the maximum relative overstatement of pairwise margins (MRO) gives a $P$-value of 4.05% for the hypothesis that a full hand tally would find a different winner, less than half the value Stark (2008) finds.
Abstract: Post-election audits use the discrepancy between machine counts and a hand tally of votes in a random sample of precincts to infer whether error affected the electoral outcome. The maximum relative overstatement of pairwise margins (MRO) quantifies that discrepancy. The electoral outcome a full hand tally shows must agree with the apparent outcome if the MRO is less than 1. This condition is sharper than previous ones when there are more than two candidates or when voters may vote for more than one candidate. For the 2006 U.S. Senate race in Minnesota, a test using MRO gives a $P$-value of 4.05% for the hypothesis that a full hand tally would find a different winner, less than half the value Stark [Ann. Appl. Statist. 2 (2008) 550--581] finds.
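
One reading of the MRO described in the abstract above is sketched below. The formula is an assumption, since the abstract does not state it: for each precinct, take the largest overstatement of any apparent winner-loser margin, expressed as a fraction of that pair's reported contest-wide margin, and sum over precincts. The vote counts are hypothetical, and the sketch assumes a full hand tally is available and that there are no ties, whereas a real audit infers from a random sample of precincts.

```python
def mro(reported, actual, winners):
    """Sum over precincts of the largest relative overstatement of any
    apparent winner-loser margin.

    reported, actual -- lists of per-precinct dicts {candidate: votes}
    winners          -- set of apparent winners (from the reported totals)
    """
    candidates = list(reported[0])
    losers = [c for c in candidates if c not in winners]
    # Reported contest-wide totals define the apparent pairwise margins.
    totals = {c: sum(p[c] for p in reported) for c in candidates}
    total = 0.0
    for rep, act in zip(reported, actual):
        total += max(
            ((rep[w] - rep[l]) - (act[w] - act[l])) / (totals[w] - totals[l])
            for w in winners
            for l in losers
        )
    return total


# Hypothetical three-candidate contest in two precincts; A is the winner.
reported = [{"A": 100, "B": 80, "C": 20}, {"A": 90, "B": 70, "C": 40}]
actual = [{"A": 98, "B": 82, "C": 20}, {"A": 90, "B": 70, "C": 40}]
value = mro(reported, actual, winners={"A"})
print(value)
```

Here the machine count overstated the A-over-B margin by 4 votes out of a reported margin of 40, so the value is well below 1 and a full hand tally would confirm the apparent winner.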

Journal ArticleDOI
TL;DR: In this article, the value of an immune response as a surrogate of protection in a randomized placebo-controlled trial with case-cohort sampling of immune responses and a time to event endpoint is evaluated.
Abstract: Assessing immune responses to study vaccines as surrogates of protection plays a central role in vaccine clinical trials. Motivated by three ongoing or pending HIV vaccine efficacy trials, we consider such surrogate endpoint assessment in a randomized placebo-controlled trial with case-cohort sampling of immune responses and a time to event endpoint. Based on the principal surrogate definition under the principal stratification framework proposed by Frangakis and Rubin [Biometrics 58 (2002) 21--29] and adapted by Gilbert and Hudgens (2006), we introduce estimands that measure the value of an immune response as a surrogate of protection in the context of the Cox proportional hazards model. The estimands are not identified because the immune response to vaccine is not measured in placebo recipients. We formulate the problem as a Cox model with missing covariates, and employ novel trial designs for predicting the missing immune responses and thereby identifying the estimands. The first design utilizes information from baseline predictors of the immune response, and bridges their relationship in the vaccine recipients to the placebo recipients. The second design provides a validation set for the unmeasured immune responses of uninfected placebo recipients by immunizing them with the study vaccine after trial closeout. A maximum estimated likelihood approach is proposed for estimation of the parameters. Simulated data examples are given to evaluate the proposed designs and study their properties.

Journal ArticleDOI
TL;DR: In this article, the authors review a number of open problems in CMB data analysis and present applications to observations from the WMAP mission, and present a solution to one of them.
Abstract: An enormous amount of observations on Cosmic Microwave Background radiation has been collected in the last decade, and much more data are expected in the near future from planned or operating satellite missions. These datasets are a goldmine of information for Cosmology and Theoretical Physics; their efficient exploitation posits several intriguing challenges from the statistical point of view. In this paper we review a number of open problems in CMB data analysis and we present applications to observations from the WMAP mission.

Journal ArticleDOI
TL;DR: In this paper, the authors consider cDNA microarray experiments when the cell populations have a factorial structure, and investigate the problem of their optimal design under a baseline parametrization where the objects of interest differ from those under the more common orthogonal parametrization.
Abstract: We consider cDNA microarray experiments when the cell populations have a factorial structure, and investigate the problem of their optimal designing under a baseline parametrization where the objects of interest differ from those under the more common orthogonal parametrization. First, analytical results are given for the $2\times 2$ factorial. Since practical applications often involve a more complex factorial structure, we next explore general factorials and obtain a collection of optimal designs in the saturated, that is, most economic, case. This, in turn, is seen to yield an approach for finding optimal or efficient designs in the practically more important nearly saturated cases. Thereafter, the findings are extended to the more intricate situation where the underlying model incorporates dye-coloring effects, and the role of dye-swapping is critically examined.

Posted Content
TL;DR: This paper proposes a hierarchical additive model, where the nonlinear correlations and the complex interaction effects are modeled using the multiple additive regression trees and the residual spatial association in the assault rates that cannot be explained in the model are smoothed using a conditional autoregressive (CAR) method.
Abstract: Previous studies have suggested a link between alcohol outlets and assaultive violence. In this paper, we explore the effects of alcohol availability on assault crimes at the census tract level over time. The statistical analysis is challenged by several features of the data: (1) the effects of possible covariates (for example, the alcohol outlet density of each census tract) on the assaultive crime rates may be complex; (2) the covariates may be highly correlated with each other; (3) there are many missing inputs in the data; and (4) spatial correlations exist in the outcome assaultive crime rates. We propose a hierarchical additive model, where the nonlinear correlations and the complex interaction effects are modeled using the multiple additive regression trees (MART) and the spatial variances in the assaultive rates that cannot be explained by the specified covariates are smoothed through the Conditional Autoregressive (CAR) model. We develop a two-stage algorithm that connects the non-parametric trees with CAR to look for important covariates associated with the assaultive crime rates, while taking account of the spatial correlations among adjacent census tracts. The proposed methods are applied to the Los Angeles assaultive data (1990-1999) and compared with a traditional method.

Posted Content
TL;DR: In this paper, the influence of the posted speed limit on the severity of vehicle accidents is studied using Indiana accident data from 2004 (the year before speed limits were raised) and 2006 (the year after speed limits were raised on rural interstates and some multi-lane non-interstate routes).
Abstract: The influence of speed limits on roadway safety has been a subject of continuous debate in the State of Indiana and nationwide. In Indiana, highway-related accidents result in about 900 fatalities and forty thousand injuries annually and place an incredible social and economic burden on the state. Still, speed limits posted on highways and other roads are routinely exceeded as individual drivers try to balance safety, mobility (speed), and the risks and penalties associated with law enforcement efforts. The speed-limit/safety issue has been a matter of considerable concern in Indiana since the state raised its speed limits on rural interstates and selected multilane highways on July 1, 2005. In this paper, the influence of the posted speed limit on the severity of vehicle accidents is studied using Indiana accident data from 2004 (the year before speed limits were raised) and 2006 (the year after speed limits were raised on rural interstates and some multi-lane non-interstate routes). Statistical models of the injury severity of different types of accidents on various roadway classes were estimated. The results of the model estimations showed that, for the speed limit ranges currently used, speed limits did not have a statistically significant effect on the severity of accidents on interstate highways. However, for some non-interstate highways, higher speed limits were found to be associated with higher accident severities, suggesting that future speed limit changes, on non-interstate highways in particular, need to be carefully assessed on a case-by-case basis.

Journal ArticleDOI
TL;DR: The Poisson Dempster--Shafer model (DSM) is used to derive a posterior DSM for the ``Banff upper limits challenge'' three-Poisson model and it is argued that the reduced dependence on priors afforded by the Dempster--Shafer framework is both practically and theoretically desirable.
Abstract: We present a Dempster--Shafer (DS) approach to estimating limits from Poisson counting data with nuisance parameters. Dempster--Shafer is a statistical framework that generalizes Bayesian statistics. DS calculus augments traditional probability by allowing mass to be distributed over power sets of the event space. This eliminates the Bayesian dependence on prior distributions while allowing the incorporation of prior information when it is available. We use the Poisson Dempster--Shafer model (DSM) to derive a posterior DSM for the ``Banff upper limits challenge'' three-Poisson model. The results compare favorably with other approaches, demonstrating the utility of the approach. We argue that the reduced dependence on priors afforded by the Dempster--Shafer framework is both practically and theoretically desirable.
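The idea of distributing mass over subsets of the event space can be illustrated with a small sketch. This is a toy example only, not the Poisson DSM from the paper: the frame of discernment (`low`, `mid`, `high`) and the mass values are hypothetical, chosen purely to show how belief and plausibility bound the support for an event.

```python
def belief(mass, event):
    """Total mass of focal sets that are fully contained in the event."""
    return sum(m for focal, m in mass.items() if focal <= event)


def plausibility(mass, event):
    """Total mass of focal sets that overlap the event at all."""
    return sum(m for focal, m in mass.items() if focal & event)


# Hypothetical mass assignment over subsets of {"low", "mid", "high"}.
# Mass on a multi-element set expresses ignorance about which element holds.
mass = {
    frozenset({"low"}): 0.5,
    frozenset({"low", "mid"}): 0.3,
    frozenset({"low", "mid", "high"}): 0.2,
}

event = frozenset({"low", "mid"})
lower, upper = belief(mass, event), plausibility(mass, event)
print(f"belief={lower:.2f}, plausibility={upper:.2f}")
```

Belief never exceeds plausibility, and the gap between the two reflects the mass that has not been committed to or against the event.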

Posted Content
TL;DR: In this article, a two-state Markov switching count-data model is proposed as an alternative to zero-inflated models to account for the preponderance of zeros sometimes observed in transportation count data, such as the number of accidents occurring on a roadway segment over some period of time.
Abstract: In this study, a two-state Markov switching count-data model is proposed as an alternative to zero-inflated models to account for the preponderance of zeros sometimes observed in transportation count data, such as the number of accidents occurring on a roadway segment over some period of time. For this accident-frequency case, zero-inflated models assume the existence of two states: one of the states is a zero-accident count state, in which accident probabilities are so low that they cannot be statistically distinguished from zero, and the other state is a normal count state, in which counts can be non-negative integers that are generated by some counting process, for example, a Poisson or negative binomial. In contrast to zero-inflated models, Markov switching models allow specific roadway segments to switch between the two states over time. An important advantage of this Markov switching approach is that it allows for the direct statistical estimation of the specific roadway-segment state (i.e., zero or count state) whereas traditional zero-inflated models do not. To demonstrate the applicability of this approach, a two-state Markov switching negative binomial model (estimated with Bayesian inference) and standard zero-inflated negative binomial models are estimated using five-year accident frequencies on Indiana interstate highway segments. It is shown that the Markov switching model is a viable alternative and results in a superior statistical fit relative to the zero-inflated models.
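The contrast between the two states can be sketched with a short simulation. This is a minimal illustration under stated assumptions, not the estimation procedure of the study: the normal count state uses a Poisson draw instead of a negative binomial for simplicity, and the function names, transition probabilities, and mean are all hypothetical.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson count with Knuth's multiplication method (small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_segment(rng, n_periods, p_stay_zero, p_stay_normal, mean):
    """Simulate one roadway segment that switches between two states over time.

    State 0 is the zero-accident state (the count is always 0).
    State 1 is the normal count state (Poisson here, as a simplification).
    """
    state = rng.choice([0, 1])
    states, counts = [], []
    for _ in range(n_periods):
        states.append(state)
        counts.append(0 if state == 0 else poisson_draw(rng, mean))
        # Stay in the current state with its own persistence probability.
        stay = p_stay_zero if state == 0 else p_stay_normal
        if rng.random() >= stay:
            state = 1 - state
    return states, counts


rng = random.Random(0)
states, counts = simulate_segment(rng, 200, p_stay_zero=0.9, p_stay_normal=0.9, mean=2.0)
print(sum(1 for c in counts if c == 0), "zero counts out of", len(counts))
```

Zeros arise both from the zero-accident state and from ordinary zero draws in the normal state, which is why the state of a segment in a given period has to be inferred statistically.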

Book ChapterDOI
TL;DR: It is demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.
Abstract: Assessment of learning in higher education is a critical concern to policy makers, educators, parents, and students. Doing so appropriately is likely to require including constructed response tests in the assessment system. We examined whether scoring costs and other concerns with using open-ended measures on a large scale (e.g., turnaround time and inter-reader consistency) could be addressed by machine grading the answers. Analyses with 1359 students from 14 colleges found that two human readers agreed highly with each other in the scores they assigned to the answers to three types of open-ended questions. These reader-assigned scores also agreed highly with those assigned by a computer. The correlations of the machine-assigned scores with SAT scores, college grades, and other measures were comparable to the correlations of these variables with the hand-assigned scores. Machine scoring did not widen differences in mean scores between racial/ethnic or gender groups. Our findings demonstrated that machine scoring can facilitate the use of open-ended questions in large-scale testing programs by providing a fast, accurate, and economical way to grade responses.

Book ChapterDOI
TL;DR: In this paper, a stochastic differential equation (SDE) was used to estimate the length and locations of a monk seal's foraging trips in a relatively shallow offshore submerged bank.
Abstract: Hawaiian monk seals (Monachus schauinslandi) are endemic to the Hawaiian Islands and are the most endangered species of marine mammal that lives entirely within the jurisdiction of the United States. The species numbers around 1300 and has been declining owing, among other things, to poor juvenile survival which is evidently related to poor foraging success. Consequently, data have been collected recently on the foraging habitats, movements, and behaviors of monk seals throughout the Northwestern and main Hawaiian Islands. Our work here is directed to exploring a data set located in a relatively shallow offshore submerged bank (Penguin Bank) in our search of a model for a seal's journey. The work ends by fitting a stochastic differential equation (SDE) that mimics some aspects of the behavior of seals by working with location data collected for one seal. The SDE is found by developing a time varying potential function with two points of attraction. The times of location are irregularly spaced and not close together geographically, leading to some difficulties of interpretation. Synthetic plots generated using the model are employed to assess its reasonableness spatially and temporally. One aspect is that the animal stays mainly southwest of Molokai. The work led to the estimation of the lengths and locations of the seal's foraging trips.
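A potential-driven SDE of this general kind can be simulated with an Euler-Maruyama scheme. The sketch below is an assumption-laden illustration, not the fitted model: it uses a quadratic potential whose single active attractor alternates between two hypothetical points at fixed intervals, and the names `simulate_track`, `pull`, `sigma`, and `switch_every` are invented for the example.

```python
import math
import random


def simulate_track(rng, n_steps, dt, attractors, switch_every, pull, sigma, start):
    """Euler-Maruyama simulation of dr = -grad H(r, t) dt + sigma dB.

    Assumed potential: H(r, t) = 0.5 * pull * |r - a(t)|^2, where the active
    attractor a(t) alternates between the two given points every
    `switch_every` steps. This is a simplification chosen for illustration.
    """
    x, y = start
    track = [(x, y)]
    noise_scale = sigma * math.sqrt(dt)
    for step in range(n_steps):
        ax, ay = attractors[(step // switch_every) % 2]
        # Drift pulls the position toward the currently active attractor.
        x += -pull * (x - ax) * dt + noise_scale * rng.gauss(0.0, 1.0)
        y += -pull * (y - ay) * dt + noise_scale * rng.gauss(0.0, 1.0)
        track.append((x, y))
    return track


# Hypothetical settings: two attraction points in arbitrary planar units.
rng = random.Random(42)
track = simulate_track(
    rng, n_steps=400, dt=0.05, attractors=[(0.0, 0.0), (5.0, -3.0)],
    switch_every=100, pull=1.0, sigma=0.5, start=(0.0, 0.0),
)
print("final position:", track[-1])
```

Plotting such synthetic tracks against the observed locations is one way to judge, informally, whether a candidate potential produces plausible movement.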