
Showing papers in "arXiv: Applications in 2018"


Posted Content
TL;DR: This article shows that a linear model for the observed associations approximately holds in a wide variety of settings when all the genetic variants satisfy the exclusion restriction assumption, or in genetic terms, when there is no pleiotropy.
Abstract: Mendelian randomization (MR) is a method of exploiting genetic variation to unbiasedly estimate a causal effect in the presence of unmeasured confounding. MR is being widely used in epidemiology and other related areas of population science. In this paper, we study statistical inference in the increasingly popular two-sample summary-data MR design. We show a linear model for the observed associations approximately holds in a wide variety of settings when all the genetic variants satisfy the exclusion restriction assumption, or in genetic terms, when there is no pleiotropy. In this scenario, we derive a maximum profile likelihood estimator with provable consistency and asymptotic normality. However, through analyzing real datasets, we find strong evidence of both systematic and idiosyncratic pleiotropy in MR, echoing the omnigenic model of complex traits that was recently proposed in genetics. We model the systematic pleiotropy by a random effects model, where no genetic variant satisfies the exclusion restriction condition exactly. In this case we propose a consistent and asymptotically normal estimator by adjusting the profile score. We then tackle the idiosyncratic pleiotropy by robustifying the adjusted profile score. We demonstrate the robustness and efficiency of the proposed methods using several simulated and real datasets.

290 citations


Journal ArticleDOI
TL;DR: This paper explicates the various choices and assumptions made (often implicitly) to justify the use of prediction-based decisions and presents a notationally consistent catalogue of fairness definitions from the ML literature to offer a concise reference for thinking through the choices, assumptions, and fairness considerations of prediction-based decision systems.
Abstract: A recent flurry of research activity has attempted to quantitatively define "fairness" for decisions based on statistical and machine learning (ML) predictions. The rapid growth of this new field has led to wildly inconsistent terminology and notation, presenting a serious challenge for cataloguing and comparing definitions. This paper attempts to bring much-needed order. First, we explicate the various choices and assumptions made (often implicitly) to justify the use of prediction-based decisions. Next, we show how such choices and assumptions can raise concerns about fairness and we present a notationally consistent catalogue of fairness definitions from the ML literature. In doing so, we offer a concise reference for thinking through the choices, assumptions, and fairness considerations of prediction-based decision systems.

146 citations


Journal ArticleDOI
TL;DR: This commentary discusses the target authors' conceptualization of replication, in particular the false dichotomy of direct versus conceptual replication intrinsic to it, and suggests a broader one that better generalizes to other domains of psychological research.
Abstract: We discuss the authors' conceptualization of replication, in particular the false dichotomy of direct versus conceptual replication intrinsic to it, and suggest a broader one that better generalizes to other domains of psychological research. We also discuss their approach to the evaluation of replication results and suggest moving beyond their dichotomous statistical paradigms and employing hierarchical / meta-analytic statistical models.

136 citations


Posted Content
TL;DR: This work models the relationship between the sequence of characters in a name and race and ethnicity using Long Short Term Memory Networks, and applies this method to campaign finance data to estimate the share of donations made by people of various racial groups.
Abstract: To answer questions about racial inequality, we often need a way to infer race and ethnicity from a name. Until now, the bulk of the focus has been on optimally exploiting the last names list provided by the Census Bureau. But there is more information in the first names, especially for African Americans. To estimate the relationship between full names and race, we exploit the Florida voter registration data and the Wikipedia data. In particular, we model the relationship between the sequence of characters in a name and race and ethnicity using Long Short Term Memory Networks. Our out of sample (OOS) precision and recall for the full name model estimated on the Florida Voter Registration data are .83 and .84 respectively. This compares to OOS precision and recall of .79 and .81 for the last name only model. Commensurate numbers for Wikipedia data are .73 and .73 for the full name model and .66 and .67 for the last name model. To illustrate the use of this method, we apply our method to the campaign finance data to estimate the share of donations made by people of various racial groups.

120 citations


Posted Content
TL;DR: In this paper, a statistical model was proposed to predict the superconducting critical temperature based on the features extracted from the superconductor's chemical formula, and the model gave reasonable out-of-sample predictions: $\pm 9.5$ K based on root-mean-squared error.
Abstract: We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor's chemical formula. The statistical model gives reasonable out-of-sample predictions: $\pm 9.5$ K based on root-mean-squared error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model's predictive accuracy. It is crucial to note that our model does not predict whether a material is a superconductor or not; it only gives predictions for superconductors.

70 citations


Journal ArticleDOI
TL;DR: In this article, it is suggested that p values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p value.
Abstract: It is widely acknowledged that the biomedical literature suffers from a surfeit of false positive results. Part of the reason for this is the persistence of the myth that observation of a p value less than 0.05 is sufficient justification to claim that you've made a discovery. It is hopeless to expect users to change their reliance on p values unless they are offered an alternative way of judging the reliability of their conclusions. If the alternative method is to have a chance of being adopted widely, it will have to be easy to understand and to calculate. One such proposal is based on calculation of false positive risk. It is suggested that p values and confidence intervals should continue to be given, but that they should be supplemented by a single additional number that conveys the strength of the evidence better than the p value. This number could be the minimum false positive risk (that calculated on the assumption of a prior probability of 0.5, the largest value that can be assumed in the absence of hard prior data). Alternatively one could specify the prior probability that it would be necessary to believe in order to achieve a false positive risk of, say, 0.05.

65 citations
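
As a rough illustration of the false positive risk described in the abstract above, the sketch below applies Bayes' rule in its simple "p-less-than" form: among all tests declared significant at level alpha, what fraction are false positives? The function names, the assumed power of 0.8, and the choice of this simplified form are assumptions made for illustration; the paper's own calculation may differ.

```python
def false_positive_risk(alpha: float, power: float, prior: float) -> float:
    """Fraction of 'significant' results that are false positives.

    alpha: significance threshold (chance of a significant result when the null is true)
    power: chance of a significant result when the effect is real (assumed value)
    prior: prior probability that a real effect exists
    """
    false_pos = alpha * (1.0 - prior)  # null is true and the test is significant
    true_pos = power * prior           # effect is real and the test is significant
    return false_pos / (false_pos + true_pos)


def prior_needed(alpha: float, power: float, target_fpr: float) -> float:
    """Prior probability of a real effect required to reach a target false positive risk."""
    # Obtained by solving target = a(1 - p) / (a(1 - p) + w p) for p.
    weight_null = alpha * (1.0 - target_fpr)
    return weight_null / (weight_null + power * target_fpr)


if __name__ == "__main__":
    # Prior of 0.5, the largest value the abstract considers defensible without hard prior data.
    print(false_positive_risk(alpha=0.05, power=0.8, prior=0.5))
    # Prior one would have to believe to reach a false positive risk of 0.05.
    print(prior_needed(alpha=0.05, power=0.8, target_fpr=0.05))
```

Under these assumed inputs, a prior of 0.5 still leaves a false positive risk slightly above 0.05, which is the kind of gap between the p value and the strength of evidence that the abstract points to.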


Journal ArticleDOI
TL;DR: In this article, a machine learning application was developed to identify opportunities in the real estate market in real time, i.e., houses that are listed with a price substantially below the market price.
Abstract: The real estate market is exposed to many fluctuations in prices because of existing correlations with many variables, some of which cannot be controlled or might even be unknown. Housing prices can increase rapidly (or in some cases, also drop very fast), yet the numerous listings available online where houses are sold or rented are not likely to be updated that often. In some cases, individuals interested in selling a house (or apartment) might include it in some online listing, and forget about updating the price. In other cases, some individuals might be interested in deliberately setting a price below the market price in order to sell the home faster, for various reasons. In this paper, we aim at developing a machine learning application that identifies opportunities in the real estate market in real time, i.e., houses that are listed with a price substantially below the market price. This program can be useful for investors interested in the housing market. We have focused on a use case considering real estate assets located in the Salamanca district in Madrid (Spain) and listed in the most relevant Spanish online site for home sales and rentals. The application is formally implemented as a regression problem that tries to estimate the market price of a house given features retrieved from public online listings. For building this application, we have performed a feature engineering stage in order to discover relevant features that allow for attaining a high predictive performance. Several machine learning algorithms have been tested, including regression trees, k-nearest neighbors, support vector machines and neural networks, identifying advantages and handicaps of each of them.

59 citations


Proceedings ArticleDOI
TL;DR: In this paper, the authors introduce a new language for describing individual player actions on the pitch and a framework for valuing any type of player action based on its impact on the game outcome while accounting for the context in which the action happened.
Abstract: Assessing the impact of the individual actions performed by soccer players during games is a crucial aspect of the player recruitment process. Unfortunately, most traditional metrics fall short in addressing this task as they either focus on rare actions like shots and goals alone or fail to account for the context in which the actions occurred. This paper introduces (1) a new language for describing individual player actions on the pitch and (2) a framework for valuing any type of player action based on its impact on the game outcome while accounting for the context in which the action happened. By aggregating soccer players' action values, their total offensive and defensive contributions to their team can be quantified. We show how our approach considers relevant contextual information that traditional player evaluation metrics ignore and present a number of use cases related to scouting and playing style characterization in the 2016/2017 and 2017/2018 seasons in Europe's top competitions.

58 citations


Journal ArticleDOI
TL;DR: The use of weakly informative priors (WIP) for the treatment effect parameter of a Bayesian meta-analysis model, which may also be seen as a form of penalization, is suggested and illustrated by a systematic review in immunosuppression of rare safety events following paediatric transplantation.
Abstract: Meta-analyses of clinical trials targeting rare events face particular challenges when the data lack adequate numbers of events for all treatment arms. Especially when the number of studies is low, standard meta-analysis methods can lead to serious distortions because of such data sparsity. To overcome this, we suggest the use of weakly informative priors (WIP) for the treatment effect parameter of a Bayesian meta-analysis model, which may also be seen as a form of penalization. As a data model, we use a binomial-normal hierarchical model (BNHM) which does not require continuity corrections in case of zero counts in one or both arms. We suggest a normal prior for the log odds ratio with mean 0 and standard deviation 2.82, which is motivated (1) as a symmetric prior centred around unity and constraining the odds ratio to within a range from 1/250 to 250 with 95% probability, and (2) as consistent with empirically observed effect estimates from a set of $37\,773$ meta-analyses from the Cochrane Database of Systematic Reviews. In a simulation study with rare events and few studies, our BNHM with a WIP outperformed a Bayesian method without a WIP and a maximum likelihood estimator in terms of smaller bias and shorter interval estimates with similar coverage. Furthermore, the methods are illustrated by a systematic review in immunosuppression of rare safety events following paediatric transplantation. A publicly available R package, MetaStan, is developed to automate the Stan implementation of meta-analysis models using WIPs.

56 citations
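
The link between the prior standard deviation of 2.82 and the odds ratio range of 1/250 to 250 stated in the abstract above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the only inputs taken from the abstract are the prior mean of 0, the standard deviation of 2.82, and the 95% level.

```python
import math
from statistics import NormalDist

PRIOR_SD = 2.82  # standard deviation of the normal prior on the log odds ratio (from the abstract)


def odds_ratio_interval(sd: float, level: float = 0.95) -> tuple[float, float]:
    """Central prior interval for the odds ratio implied by a N(0, sd^2) prior on log(OR)."""
    tail = (1.0 - level) / 2.0
    # Upper quantile on the log odds ratio scale; the lower one is its negative by symmetry.
    upper_log = NormalDist(mu=0.0, sigma=sd).inv_cdf(1.0 - tail)
    return math.exp(-upper_log), math.exp(upper_log)


if __name__ == "__main__":
    low, high = odds_ratio_interval(PRIOR_SD)
    print(f"95% prior interval for the odds ratio: {low:.5f} to {high:.1f}")
```

The upper limit comes out close to 250 and the lower limit is its reciprocal, in line with the range quoted in the abstract.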


Journal ArticleDOI
TL;DR: Different measures of volatility, representing newly available surrogate measures of safety, are explored by combining data from the Michigan Safety Pilot Deployment of connected vehicles with crash and inventory data at several intersections, showing that increases in three measures of driving volatility are positively associated with higher intersection crash frequency.
Abstract: With the emergence of high-frequency connected and automated vehicle data, analysts have become able to extract useful information from them. To this end, the concept of "driving volatility" is defined and explored as deviation from the norm. Several measures of dispersion and variation can be computed in different ways using vehicles' instantaneous speed, acceleration, and jerk observed at intersections. This study explores different measures of volatility, representing newly available surrogate measures of safety, by combining data from the Michigan Safety Pilot Deployment of connected vehicles with crash and inventory data at several intersections. The intersection data were error-checked and verified for accuracy. Then, for each intersection, 37 different measures of volatility were calculated. These volatilities were then used to explain crash frequencies at intersections by estimating fixed and random parameter Poisson regression models. Results show that increases in three measures of driving volatility are positively associated with higher intersection crash frequency, controlling for exposure variables and geometric features. More intersection crashes were associated with higher percentages of vehicle data points (speed and acceleration) lying beyond threshold bands. These bands were created using mean plus two standard deviations. Furthermore, a higher magnitude of time-varying stochastic volatility of vehicle speeds when they pass through the intersection is associated with higher crash frequencies. These measures can be used to locate intersections with high driving volatilities, i.e., hot-spots where crashes are waiting to happen. Therefore, a deeper analysis of these intersections can be undertaken and proactive safety countermeasures considered at high volatility locations to enhance safety.

55 citations
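
One of the volatility measures mentioned in the abstract above is the percentage of data points lying beyond threshold bands built from the mean and two standard deviations. A minimal sketch of that idea follows. It assumes a symmetric band (mean minus and plus two standard deviations) and the population standard deviation; both choices, the function name, and the sample speeds are assumptions, since the abstract does not spell them out.

```python
from statistics import mean, pstdev


def percent_beyond_band(values: list[float], n_sd: float = 2.0) -> float:
    """Percentage of observations outside mean +/- n_sd standard deviations."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    lower = center - n_sd * spread
    upper = center + n_sd * spread
    outside = sum(1 for v in values if v < lower or v > upper)
    return 100.0 * outside / len(values)


if __name__ == "__main__":
    # Hypothetical instantaneous speeds at one intersection, with one abrupt reading.
    speeds = [10.0] * 9 + [30.0]
    print(percent_beyond_band(speeds))
```

In a full analysis this percentage would be computed per intersection and then used as an explanatory variable in the crash frequency models.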


Posted Content
TL;DR: A case study on short-term load forecasting for France, with emphasis on special days, such as public holidays, using nine years of half-hourly French load data and employing the rule-based triple seasonal adaptations of Holt-Winters-Taylor exponential smoothing and artificial neural networks.
Abstract: This paper presents a case study on short-term load forecasting for France, with emphasis on special days, such as public holidays. We investigate the generalisability to French data of a recently proposed approach, which generates forecasts for normal and special days in a coherent and unified framework, by incorporating subjective judgment in univariate statistical models using a rule-based methodology. The intraday, intraweek, and intrayear seasonality in load are accommodated using a rule-based triple seasonal adaptation of a seasonal autoregressive moving average (SARMA) model. We find that, for application to French load, the method requires an important adaption. We also adapt a recently proposed SARMA model that accommodates special day effects on an hourly basis using indicator variables. Using a rule formulated specifically for the French load, we compare the SARMA models with a range of different benchmark methods based on an evaluation of their point and density forecast accuracy. As sophisticated benchmarks, we employ the rule-based triple seasonal adaptations of Holt-Winters-Taylor (HWT) exponential smoothing and artificial neural networks (ANNs). We use nine years of half-hourly French load data, and consider lead times ranging from one half-hour up to a day ahead. The rule-based SARMA approach generated the most accurate forecasts.

Posted Content
TL;DR: In this article, the authors present a state of the art discussion of present efforts of developing particle filters for highly nonlinear geoscience state-estimation problems with an emphasis on atmospheric and oceanic applications, including many new ideas, derivations, and unifications.
Abstract: Particle filters contain the promise of fully nonlinear data assimilation. They have been applied in numerous science areas, but their application to the geosciences has been limited due to their inefficiency in high-dimensional systems in standard settings. However, huge progress has been made, and this limitation is disappearing fast due to recent developments in proposal densities, the use of ideas from (optimal) transportation, the use of localisation and intelligent adaptive resampling strategies. Furthermore, powerful hybrids between particle filters and ensemble Kalman filters and variational methods have been developed. We present a state of the art discussion of present efforts of developing particle filters for highly nonlinear geoscience state-estimation problems with an emphasis on atmospheric and oceanic applications, including many new ideas, derivations, and unifications, highlighting hidden connections, and generating a valuable tool and guide for the community. Initial experiments show that particle filters can be competitive with present-day methods for numerical weather prediction suggesting that they will become mainstream soon.

Posted Content
TL;DR: In this article, the authors partially reverse engineer the COMPAS algorithm and show that it does not seem to depend linearly on the defendant's age, despite statements to the contrary by the algorithm's creator.
Abstract: In our current society, secret algorithms make important decisions about individuals. There has been substantial discussion about whether these algorithms are unfair to groups of individuals. While noble, this pursuit is complex and ultimately stagnating because there is no clear definition of fairness and competing definitions are largely incompatible. We argue that the focus on the question of fairness is misplaced, as these algorithms fail to meet a more important and yet readily obtainable goal: transparency. As a result, creators of secret algorithms can provide incomplete or misleading descriptions about how their models work, and various other kinds of errors can easily go unnoticed. By partially reverse engineering the COMPAS algorithm -- a recidivism-risk scoring algorithm used throughout the criminal justice system -- we show that it does not seem to depend linearly on the defendant's age, despite statements to the contrary by the algorithm's creator. Furthermore, by subtracting from COMPAS its (hypothesized) nonlinear age component, we show that COMPAS does not necessarily depend on race, contradicting ProPublica's analysis, which assumed linearity in age. In other words, faulty assumptions about a proprietary algorithm lead to faulty conclusions that go unchecked without careful reverse engineering. Were the algorithm transparent in the first place, this would likely not have occurred. The most important result in this work is that we find that there are many defendants with low risk score but long criminal histories, suggesting that data inconsistencies occur frequently in criminal justice databases. We argue that transparency satisfies a different notion of procedural fairness by providing both the defendants and the public with the opportunity to scrutinize the methodology and calculations behind risk scores for recidivism.

Posted Content
TL;DR: In this paper, the authors developed and evaluated national empirical models for China incorporating land-use regression (LUR), satellite measurements, and universal kriging (UK), and tested the resulting models in several ways, including comparing forward stepwise regression with partial least squares (PLS) regression and using 10-fold cross-validation (CV), leave-one-province-out (LOPO) CV, and leave-one-city-out (LOCO) CV.
Abstract: Outdoor air pollution is a major killer worldwide and the fourth largest contributor to the burden of disease in China. China is the most populous country in the world and also has the largest number of air pollution deaths per year, yet the spatial resolution of existing national air pollution estimates for China is generally relatively low. We address this knowledge gap by developing and evaluating national empirical models for China incorporating land-use regression (LUR), satellite measurements, and universal kriging (UK). We test the resulting models in several ways, including (1) comparing models developed using forward stepwise regression vs. partial least squares (PLS) regression, (2) comparing models developed with and without satellite measurements, and with and without UK, and (3) 10-fold cross-validation (CV), leave-one-province-out (LOPO) CV, and leave-one-city-out (LOCO) CV. Satellite data and kriging are complementary in making predictions more accurate: kriging improved the models in well-sampled areas; satellite data substantially improved performance at locations far away from monitors. Stepwise forward selection performs similarly to PLS in 10-fold CV, but better than PLS in LOPO-CV. Our best models employ forward selection and UK, with 10-fold CV R² of 0.89 (for both 2014 and 2015) for PM2.5 and of 0.73 (year-2014) and 0.78 (year-2015) for NO2. Population-weighted concentrations during 2014-2015 decreased for PM2.5 (58.7 μg/m³ to 52.3 μg/m³) and NO2 (29.6 μg/m³ to 26.8 μg/m³). We produced the first high resolution national LUR models for annual-average concentrations in China. Models were applied on a 1 km grid to support future research. In 2015, more than 80% of the Chinese population lived in areas that exceed the Chinese national PM2.5 standard, 35 μg/m³. Results here will be publicly available and may be useful for environmental health research.
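
The leave-one-province-out (LOPO) validation named in the abstract above holds out every monitor in one province at a time, which tests predictions far from the training monitors more severely than random 10-fold splits do. The sketch below shows only the splitting logic; the function name and the province labels are hypothetical, and model fitting is left out.

```python
from typing import Iterator


def leave_one_group_out(groups: list[str]) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for each distinct group label."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Hypothetical province label for each monitoring site.
    provinces = ["A", "A", "B", "C"]
    for province, train_idx, test_idx in leave_one_group_out(provinces):
        # A model would be fit on train_idx and scored on test_idx here.
        print(province, train_idx, test_idx)
```

Replacing the province labels with city labels gives the leave-one-city-out (LOCO) variant.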

Posted Content
TL;DR: It is concluded that, while not optimal, the proposed algorithm offers additional practical advantages such as faster computation times and increased robustness to non-stationarities in building dynamics.
Abstract: Classical methods to control heating systems are often marred by suboptimal performance, inability to adapt to dynamic conditions and unreasonable assumptions, e.g. existence of building models. This paper presents a novel deep reinforcement learning algorithm which can control space heating in buildings in a computationally efficient manner, and benchmarks it against other known techniques. The proposed algorithm outperforms rule-based control by 5-10% in a simulation environment for a number of price signals. We conclude that, while not optimal, the proposed algorithm offers additional practical advantages such as faster computation times and increased robustness to non-stationarities in building dynamics.

Journal ArticleDOI
TL;DR: This paper designs and implements PlayeRank, a data-driven framework that offers a principled multi-dimensional and role-aware evaluation of the performance of soccer players, and shows its flexibility and efficiency, which make it well suited for use in the design of a scalable platform for soccer analytics.
Abstract: The problem of evaluating the performance of soccer players is attracting the interest of many companies and the scientific community, thanks to the availability of massive data capturing all the events generated during a match (e.g., tackles, passes, shots, etc.). Unfortunately, there is no consolidated and widely accepted metric for measuring performance quality in all of its facets. In this paper, we design and implement PlayeRank, a data-driven framework that offers a principled multi-dimensional and role-aware evaluation of the performance of soccer players. We build our framework by deploying a massive dataset of soccer-logs consisting of millions of match events pertaining to four seasons of 18 prominent soccer competitions. By comparing PlayeRank to known algorithms for performance evaluation in soccer, and by exploiting a dataset of players' evaluations made by professional soccer scouts, we show that PlayeRank significantly outperforms the competitors. We also explore the ratings produced by PlayeRank and discover interesting patterns about the nature of excellent performances and what distinguishes the top players from the others. Finally, we explore some applications of PlayeRank (i.e., searching players and player versatility), showing its flexibility and efficiency, which make it well suited for use in the design of a scalable platform for soccer analytics.

Posted Content
TL;DR: This work studies inference problems during the emerging phase of an outbreak, and points out potential sources of bias, with emphasis on: contact tracing backwards in time, replacing generation times by serial intervals, multiple potential infectors and censoring effects amplified by exponential growth.
Abstract: When analysing new emerging infectious disease outbreaks one typically has observational data over a limited period of time and several parameters to estimate, such as growth rate, R0, serial or generation interval distribution, latent and incubation times or case fatality rates. Also parameters describing the temporal relations between appearance of symptoms, notification, death and recovery/discharge will be of interest. These parameters form the basis for predicting the future outbreak, planning preventive measures and monitoring the progress of the disease. We study the problem of making inference during the emerging phase of an outbreak and point out potential sources of bias related to contact tracing, replacing generation times by serial intervals, multiple potential infectors or truncation effects amplified by exponential growth. These biases directly affect the estimation of e.g. the generation time distribution and the case fatality rate, but can then propagate to other estimates, e.g. of R0 and growth rate. Many of the traditionally used estimation methods in disease epidemiology may suffer from these biases when applied to the emerging disease outbreak situation. We show how to avoid these biases based on proper statistical modelling. We illustrate the theory by numerical examples and simulations based on the recent 2014-15 Ebola outbreak to quantify possible estimation biases, which may be up to 20% underestimation of R0, if the epidemic growth rate is fitted to observed data or, conversely, up to 62% overestimation of the growth rate if the correct R0 is used in conjunction with the Euler-Lotka equation.
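
The Euler-Lotka equation mentioned in the abstract above ties the exponential growth rate r to R0 through the generation time distribution g: 1/R0 is the integral of exp(-r t) g(t) over t. The sketch below evaluates that relation in closed form under the assumption of a gamma-distributed generation time, for which R0 = (1 + r * scale) ** shape. The gamma assumption, the function names, and the numbers in the usage example are illustrative and not taken from the paper.

```python
def r0_from_growth_rate(r: float, mean_gen: float, shape: float) -> float:
    """R0 implied by growth rate r for a gamma generation time (Euler-Lotka relation)."""
    scale = mean_gen / shape  # gamma scale so that the mean is shape * scale
    return (1.0 + r * scale) ** shape


def growth_rate_from_r0(r0: float, mean_gen: float, shape: float) -> float:
    """Inverse relation: growth rate r implied by R0 for the same gamma generation time."""
    scale = mean_gen / shape
    return (r0 ** (1.0 / shape) - 1.0) / scale


if __name__ == "__main__":
    # Hypothetical inputs: growth rate per day, mean generation time in days, gamma shape.
    r, mean_gen = 0.05, 15.0
    for shape in (1.0, 2.0, 4.0):
        print(shape, r0_from_growth_rate(r, mean_gen, shape))
```

Because the implied R0 changes with the shape and mean of the generation time distribution, a biased estimate of that distribution (for example, one obtained by substituting serial intervals) carries over into R0 or the growth rate, which is the propagation the abstract describes.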

Journal ArticleDOI
Abstract: The ability to see around corners, i.e., recover details of a hidden scene from its reflections in the surrounding environment, is of considerable interest in a wide range of applications. However, the diffuse nature of light reflected from typical surfaces leads to mixing of spatial information in the collected light, precluding useful scene reconstruction. Here, we employ a computational imaging technique that opportunistically exploits the presence of occluding objects, which obstruct probe-light propagation in the hidden scene, to undo the mixing and greatly improve scene recovery. Importantly, our technique obviates the need for the ultrafast time-of-flight measurements employed by most previous approaches to hidden-scene imaging. Moreover, it does so in a photon-efficient manner based on an accurate forward model and a computational algorithm that, together, respect the physics of three-bounce light propagation and single-photon detection. Using our methodology, we demonstrate reconstruction of hidden-surface reflectivity patterns in a meter-scale environment from non-time-resolved measurements. Ultimately, our technique represents an instance of a rich and promising new imaging modality with important potential implications for imaging science.

Journal ArticleDOI
TL;DR: In this paper, robust, Recurrent Neural Network (RNN) based, multi-step ahead forecasting models are developed for time-series in which simple RNN, the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) units are used to develop the model and evaluate its performance.
Abstract: The prediction of high-resolution hourly traffic volumes of a given roadway is essential for transportation planning. Traditionally, Automatic Traffic Recorders (ATR) are used to collect this hourly volume data. These large datasets are time series data characterized by long-term temporal dependencies and missing values. Regarding the temporal dependencies, all roadways are characterized by seasonal variations that can be weekly, monthly or yearly, depending on the cause of the variation. Regarding the missing data in a time-series sequence, traditional time series forecasting models perform poorly under the influence of seasonal variations. To address this limitation, robust, Recurrent Neural Network (RNN) based, multi-step ahead forecasting models are developed for time-series in this study. The simple RNN, the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) units are used to develop the model and evaluate its performance. Two approaches are used to address the missing value issue: masking and imputation, in conjunction with the RNN models. Six different imputation algorithms are then used to identify the best model. The analysis indicates that the LSTM model performs better than simple RNN and GRU models, and imputation performs better than masking to predict future traffic volume. Based on analysis using 92 ATRs, the LSTM-Median model is deemed the best model in all scenarios for hourly traffic volume and AADT prediction, with an average RMSE of 274 and MAPE of 18.91% for hourly traffic volume prediction and average RMSE of 824 and MAPE of 2.10% for AADT prediction.
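
The abstract above combines imputation of missing values with RMSE and MAPE as accuracy measures. The sketch below shows one simple reading of those pieces: filling gaps with the median of the observed values and scoring forecasts with the two metrics. Treating the median as a single global value is an assumption (the paper's median imputation may be computed differently), and the function names and sample counts are hypothetical.

```python
import math
from statistics import median
from typing import Optional


def impute_median(series: list[Optional[float]]) -> list[float]:
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in series if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in series]


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root-mean-squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    hourly_volume = [100.0, None, 300.0, 200.0]  # hypothetical counts with one gap
    print(impute_median(hourly_volume))
    print(rmse([100.0, 200.0], [110.0, 190.0]), mape([100.0, 200.0], [110.0, 190.0]))
```

Masking, the alternative named in the abstract, would instead leave the gaps in place and tell the network to skip those time steps.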

Journal Article
TL;DR: Using a simulation built on top of data from Airbnb, this paper considers the use of methods from the network interference literature for online marketplace experimentation; the results suggest that experiment design and analysis techniques from that literature are promising tools for reducing bias due to test-control interference in marketplace experiments.
Abstract: In an A/B test, the typical objective is to measure the total average treatment effect (TATE), which measures the difference between the average outcome if all users were treated and the average outcome if all users were untreated. However, a simple difference-in-means estimator will give a biased estimate of the TATE when outcomes of control units depend on the outcomes of treatment units, an issue we refer to as test-control interference. Using a simulation built on top of data from Airbnb, this paper considers the use of methods from the network interference literature for online marketplace experimentation. We model the marketplace as a network in which an edge exists between two sellers if their goods substitute for one another. We then simulate seller outcomes, specifically considering a "status quo" context and "treatment" context that forces all sellers to lower their prices. We use the same simulation framework to approximate TATE distributions produced by using blocked graph cluster randomization, exposure modeling, and the Hajek estimator for the difference in means. We find that while blocked graph cluster randomization reduces the bias of the naive difference-in-means estimator by as much as 62%, it also significantly increases the variance of the estimator. On the other hand, the use of more sophisticated estimators produces mixed results. While some provide (small) additional reductions in bias and small reductions in variance, others lead to increased bias and variance. Overall, our results suggest that experiment design and analysis techniques from the network experimentation literature are promising tools for reducing bias due to test-control interference in marketplace experiments.
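
The abstract above contrasts the naive difference-in-means estimator with the Hajek estimator. A minimal sketch of both follows, using the standard Hajek form in which each unit's outcome is weighted by the inverse of its probability of being in the relevant exposure condition and the weights are normalized to sum to one. The function names and toy numbers are hypothetical; in practice the exposure probabilities come from the randomization design and the exposure model.

```python
def difference_in_means(outcomes: list[float], treated: list[bool]) -> float:
    """Naive estimator: mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, z in zip(outcomes, treated) if z]
    c = [y for y, z in zip(outcomes, treated) if not z]
    return sum(t) / len(t) - sum(c) / len(c)


def hajek_mean(outcomes: list[float], exposed: list[bool], probs: list[float]) -> float:
    """Inverse-probability-weighted mean over exposed units, with normalized weights."""
    num = sum(y / p for y, e, p in zip(outcomes, exposed, probs) if e)
    den = sum(1.0 / p for e, p in zip(exposed, probs) if e)
    return num / den


def hajek_estimate(outcomes, exposed_treat, p_treat, exposed_ctrl, p_ctrl) -> float:
    """Hajek estimate of the effect: weighted treated-exposure mean minus control-exposure mean."""
    return hajek_mean(outcomes, exposed_treat, p_treat) - hajek_mean(outcomes, exposed_ctrl, p_ctrl)


if __name__ == "__main__":
    y = [4.0, 6.0, 1.0, 3.0]                 # hypothetical seller outcomes
    z = [True, True, False, False]           # treatment assignment
    not_z = [not v for v in z]
    half = [0.5] * 4                         # equal exposure probabilities
    print(difference_in_means(y, z))
    print(hajek_estimate(y, z, half, not_z, half))
```

With equal exposure probabilities the two estimators coincide; they differ once units have unequal chances of full treatment or full control exposure, as happens under cluster randomization on a network.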

Posted Content
TL;DR: In this article, a semisupervised nonparametric clustering approach is proposed to identify anomalous particles in a model-independent search in particle physics, where information available on the background can be incorporated in the search, in order to identify potential anomalies.
Abstract: Model-independent searches in particle physics aim at completing our knowledge of the universe by looking for possible new particles not predicted by the current theories. Such particles, referred to as signal, are expected to behave as a deviation from the background, representing the known physics. Information available on the background can be incorporated in the search, in order to identify potential anomalies. From a statistical perspective, the problem is recast as a peculiar classification problem where only partial information is accessible. Therefore a semisupervised approach shall be adopted, either by strengthening or by relaxing assumptions underlying clustering or classification methods respectively. In this work, following the first route, we semisupervise nonparametric clustering in order to identify a possible signal. The main contribution consists in tuning a nonparametric estimate of the density underlying the experimental data with the aid of the available information on the physical theory. As a side contribution, a variable selection procedure is presented. The whole procedure is tested on a dataset mimicking proton-proton collisions performed within a particle accelerator. While finding motivation in the field of particle physics, the approach is applicable to various science domains, where similar problems of anomaly detection arise.

Posted Content
TL;DR: A support vector machine-based pattern recognition system that models patterns in the cries of known asphyxiating infants (and normal infants) and then uses the developed model for classification of `new' infants as having asphyxia or not is designed.
Abstract: Perinatal asphyxia is one of the top three causes of infant mortality in developing countries, resulting in the death of about 1.2 million newborns every year. At its early stages, the presence of asphyxia cannot be conclusively determined visually or via physical examination, but by medical diagnosis. In resource-poor settings, where skilled attendance at birth is a luxury, most cases only get detected when the damaging consequences begin to manifest or, worse still, after the death of the affected infant. In this project, we explored the approach of machine learning in developing a low-cost diagnostic solution. We designed a support vector machine-based pattern recognition system that models patterns in the cries of known asphyxiating infants (and normal infants) and then uses the developed model for classification of `new' infants as having asphyxia or not. Our prototype has been tested in a laboratory setting to give a prediction accuracy of up to 88.85%. If higher accuracies can be obtained, this research may be a key contributor to the 4th Millennium Development Goal (MDG) of reducing mortality in under-five children.

Posted Content
TL;DR: In this article, the authors examined the relationship between growth and locations of craft breweries and the incidence of neighborhood change across the United States and found that the strongest predictor of whether a craft brewery opened in 2013 or later in a neighborhood was the presence of a prior brewery.
Abstract: Cities have recognized the local impact of small craft breweries, in many ways altering municipal codes to make it easier to establish breweries and making them the anchor points of economic development and revitalization. Nevertheless, we do not know the extent to which these strategies impacted changes at the neighborhood level across the nation. In this chapter, we examine the relationship between growth and locations of craft breweries and the incidence of neighborhood change across the United States. In the first part of the chapter, we rely on a unique dataset of geocoded brewery locations that tracks openings and closings from 2004 to the present. Using measures of neighborhood change often found in literature on gentrification-related topics, we develop statistical models relying on census tract demographic and employment data to determine the extent to which brewery locations are associated with social and demographic shifts since 2000. The strongest predictor of whether a craft brewery opened in 2013 or later in a neighborhood was the presence of a prior brewery. We do not find evidence entirely consistent with the common narrative of a link between gentrification and craft brewing, but we see a link between an influx of lower-to-middle income urban creatives and the introduction of craft breweries. We advocate for urban planners to recognize the importance of craft breweries in neighborhood revitalization while also protecting residents from potential displacement.

Posted Content
TL;DR: This work clusters series within each type of frequency with respect to the existence of trend and seasonality, and utilizes several commonly used statistical models, which are weighted according to their performance on historical data.
Abstract: We present a detailed description of our submission for the M4 forecasting competition, in which it ranked 3rd overall. Our solution utilizes several commonly used statistical models, which are weighted according to their performance on historical data. We cluster series within each type of frequency with respect to the existence of trend and seasonality. Every class of series is assigned a different set of models to combine. Combination weights are chosen separately for each series. We conduct experiments with a holdout set to manually pick pools of models that perform best for a given series type, as well as to choose the combination approaches.
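
The idea of weighting models by their historical performance can be sketched as follows. The abstract does not state the weighting rule, so the inverse-error rule below is an assumption chosen for illustration, and the function names, error values, and forecasts are hypothetical.

```python
def inverse_error_weights(historical_errors):
    """Turn per-model historical errors into weights that sum to one.

    Assumed rule (not taken from the paper): weight is proportional to 1 / error,
    so models with smaller past error contribute more.
    """
    inverse = [1.0 / e for e in historical_errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def combine_forecasts(forecasts, weights):
    """Weighted average of the individual model forecasts."""
    return sum(f * w for f, w in zip(forecasts, weights))


# Hypothetical holdout errors and next-step forecasts for three models.
errors = [1.0, 2.0, 4.0]
forecasts = [100.0, 110.0, 130.0]

weights = inverse_error_weights(errors)
combined = combine_forecasts(forecasts, weights)
```

Computing the weights separately for each series, as the abstract describes, would amount to calling `inverse_error_weights` with that series' own holdout errors.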

Book ChapterDOI
TL;DR: This chapter will focus on the challenges and the open problems and will not weigh in on the dilemma of researchers hoping to learn about contagion from observational social network data, except to mention here that the most responsible way to use any statistical method is with a healthy dose of skepticism.
Abstract: A growing body of literature attempts to learn about contagion using observational (i.e., non-experimental) data collected from a single social network. While the conclusions of these studies may be correct, the methods rely on assumptions that are likely—and sometimes guaranteed to be—false, and therefore the evidence for the conclusions is often weaker than it seems. Developing methods that do not need to rely on implausible assumptions is an incredibly challenging and important open problem in statistics. Appropriate methods don’t (yet!) exist, so researchers hoping to learn about contagion from observational social network data are sometimes faced with a dilemma: they can abandon their research program, or they can use inappropriate methods. This chapter will focus on the challenges and the open problems and will not weigh in on that dilemma, except to mention here that the most responsible way to use any statistical method, especially when it is well-known that the assumptions on which it rests do not hold, is with a healthy dose of skepticism, with honest acknowledgment and deep understanding of the limitations, and with copious caveats about how to interpret the results.

Journal ArticleDOI
TL;DR: The InSPiRe project as discussed by the authors developed decision-making methods for small population clinical trials using a Bayesian decision-theoretic framework to compare costs with potential benefits, developed approaches for targeted treatment trials, enabling simultaneous identification of subgroups and confirmation of treatment effect for these patients, worked on early phase clinical trial design and on extrapolation from adult to pediatric studies, developing methods to enable use of pharmacokinetics and pharmacodynamics data, and also developed improved robust meta-analysis methods for a small number of trials to support the planning, analysis and interpretation of a trial as
Abstract: Where there are a limited number of patients, such as in a rare disease, clinical trials in these small populations present several challenges, including statistical issues. This led to an EU FP7 call for proposals in 2013. One of the three projects funded was the Innovative Methodology for Small Populations Research (InSPiRe) project. This paper summarizes the main results of the project, which was completed in 2017. The InSPiRe project has led to development of novel statistical methodology for clinical trials in small populations in four areas. We have explored new decision-making methods for small population clinical trials using a Bayesian decision-theoretic framework to compare costs with potential benefits, developed approaches for targeted treatment trials, enabling simultaneous identification of subgroups and confirmation of treatment effect for these patients, worked on early phase clinical trial design and on extrapolation from adult to pediatric studies, developing methods to enable use of pharmacokinetics and pharmacodynamics data, and also developed improved robust meta-analysis methods for a small number of trials to support the planning, analysis and interpretation of a trial as well as enabling extrapolation between patient groups. In addition to scientific publications, we have contributed to regulatory guidance and produced free software in order to facilitate implementation of the novel methods.

Journal ArticleDOI
TL;DR: In this article, the relative root mean squared errors (RMSE) of nonparametric methods for spectral estimation are compared for microwave scattering data of plasma fluctuations, and two new adaptive multi-taper weightings are presented.
Abstract: The relative root mean squared errors (RMSE) of nonparametric methods for spectral estimation are compared for microwave scattering data of plasma fluctuations. These methods reduce the variance of the periodogram estimate by averaging the spectrum over a frequency bandwidth. As the bandwidth increases, the variance decreases, but the bias error increases. The plasma spectra vary by over four orders of magnitude, and therefore, using a spectral window is necessary. We compare the smoothed tapered periodogram with the adaptive multiple taper methods and hybrid methods. We find that a hybrid method, which uses four orthogonal tapers and then applies a kernel smoother, performs best. For 300 point data segments, even an optimized smoothed tapered periodogram has a 24% larger relative RMSE than the hybrid method. We present two new adaptive multi-taper weightings which outperform Thomson's original adaptive weighting.
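
As a rough illustration of the simplest method compared above, the smoothed tapered periodogram, the sketch below applies a single Hann taper, computes the periodogram with a direct DFT, and smooths it with a boxcar kernel. The taper choice, the boxcar kernel, the function names, and the test signal are assumptions made for illustration; the multi-taper and hybrid methods studied in the paper are not implemented here.

```python
import cmath
import math


def hann_taper(n):
    """Periodic Hann taper, scaled so its squared values sum to one."""
    raw = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    norm = math.sqrt(sum(h * h for h in raw))
    return [h / norm for h in raw]


def tapered_periodogram(x):
    """Squared magnitude of the DFT of the tapered series, frequencies 0..n/2."""
    n = len(x)
    tapered = [h * v for h, v in zip(hann_taper(n), x)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(tapered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def smooth(spectrum, half_width):
    """Boxcar kernel smoother; the window is truncated at the edges."""
    out = []
    for k in range(len(spectrum)):
        lo = max(0, k - half_width)
        hi = min(len(spectrum), k + half_width + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out


# Hypothetical test signal: a pure cosine at frequency bin 8 of 64 samples.
signal = [math.cos(2 * math.pi * 8 * t / 64) for t in range(64)]
raw = tapered_periodogram(signal)
smoothed = smooth(raw, half_width=1)
```

Widening `half_width` averages over a larger bandwidth, which lowers the variance of the estimate but flattens sharp peaks; this is the bias-variance trade-off the abstract describes.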

Posted Content
TL;DR: The proposed approach performs very well in a range of different scenarios, and outperforms standard Factor analysis in all the scenarios identifying replicable signal in unsupervised genomic applications.
Abstract: This paper presents a new modeling strategy for joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex-post, by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient and fast Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios, and outperforms standard Factor analysis in all the scenarios identifying replicable signal in unsupervised genomic applications. The results of our analysis of breast cancer gene expression across seven studies identified replicable gene patterns, clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.

Posted Content
TL;DR: This work presents and evaluates the machine learning approach to the Rodeo and releases the SubseasonalRodeo dataset, collected to train and evaluate the system, an ensemble of two nonlinear regression models.
Abstract: Water managers in the western United States (U.S.) rely on long-term forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these long-term forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance. Here we present and evaluate our machine learning approach to the Rodeo and release our SubseasonalRodeo dataset, collected to train and evaluate our forecasting system. Our system is an ensemble of two regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon. Moreover, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40-50% for temperature and 129-169% for precipitation. We hope that both our dataset and our methods will help to advance the state of the art in subseasonal forecasting.
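
For readers unfamiliar with weighted local linear regression, the following is a generic one-dimensional sketch: a straight line is fitted by weighted least squares, with weights that decay with distance from the query point. It is not the authors' model (which uses multitask nearest neighbor features); the Gaussian kernel, the `bandwidth` parameter, the function name, and the toy data are all assumptions.

```python
import math


def local_linear_predict(xs, ys, x0, bandwidth):
    """Predict y at x0 from a line fitted with Gaussian kernel weights around x0."""
    w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    # Closed-form weighted least squares for slope and intercept.
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    intercept = (sy - slope * sx) / sw
    return intercept + slope * x0


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
prediction = local_linear_predict(xs, ys, x0=2.5, bandwidth=1.0)
```

On exactly linear data the local fit recovers the line regardless of the weights; on real data, a smaller `bandwidth` makes the fit follow nearby observations more closely.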

Posted Content
TL;DR: A set of toolboxes, including an advanced structural connectome extraction pipeline and a novel tensor network principal components analysis (TN-PCA) method, are developed and integrated to study relationships between structural connectomes and various human traits such as alcohol and drug use, cognition and motion abilities.
Abstract: Advanced brain imaging techniques make it possible to measure individuals' structural connectomes in large cohort studies non-invasively. The structural connectome is initially shaped by genetics and subsequently refined by the environment. It is extremely interesting to study relationships between structural connectomes and environmental factors or human traits, such as substance use and cognition. Due to limitations in structural connectome recovery, previous studies largely focus on functional connectomes. Questions remain about how well structural connectomes can explain variance in different human traits. Using a state-of-the-art structural connectome processing pipeline and a novel dimensionality reduction technique applied to data from the Human Connectome Project (HCP), we show strong relationships between structural connectomes and various human traits. Our dimensionality reduction approach uses a tensor characterization of the connectome and relies on a generalization of principal components analysis. We analyze over 1100 scans for 1076 subjects from the HCP and the Sherbrooke test-retest data set, as well as 175 human traits that measure domains including cognition, substance use, motor, sensory and emotion. We find that structural connectomes are associated with many traits. Specifically, fluid intelligence, language comprehension, and motor skills are associated with increased cortical-cortical brain structural connectivity, while the use of alcohol, tobacco, and marijuana is associated with decreased cortical-cortical connectivity.