
Showing papers by "Helsinki Institute for Information Technology" published in 2018


Proceedings Article
01 Jan 2018
TL;DR: In this article, the authors apply basic statistical reasoning to signal reconstruction by machine learning, learning to map corrupted observations to clean signals without explicit image priors or likelihood models of the corruption, and show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans.
Abstract: We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans -- all corrupted by different processes -- based on noisy data only.

610 citations
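The training setup described in the abstract above (regressing noisy inputs onto independently corrupted targets, with no clean images in the loss) can be illustrated with a minimal sketch. The tiny convolutional network, Gaussian noise model, and random data below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the "noisy targets" training idea: both input and target are
# corrupted copies of the same underlying image; the clean data never enters the loss.
import torch
import torch.nn as nn

# Toy convolutional denoiser (assumption: a small stand-in for the paper's network).
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 1, 32, 32)  # stand-in "ground truth", never shown to the model
for step in range(100):
    # Two independent corruptions of the same underlying images.
    noisy_input = clean + 0.1 * torch.randn_like(clean)
    noisy_target = clean + 0.1 * torch.randn_like(clean)
    loss = nn.functional.mse_loss(model(noisy_input), noisy_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```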


Posted Content
TL;DR: It is shown that under certain common circumstances, it is possible to learn to restore signals without ever observing clean ones, at performance close or equal to training using clean exemplars.
Abstract: We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans -- all corrupted by different processes -- based on noisy data only.

399 citations


Journal ArticleDOI
TL;DR: This entry comprises the invited and contributed discussions accompanying the main article published in Bayesian Analysis 13:3 (2018), pages 917-1007.
Abstract: The main article, together with the invited and contributed discussions, is available at https://dx.doi.org/10.1214/17-BA1091 (Bayesian Analysis 13:3 (2018), pages 917-1007).

264 citations


Journal ArticleDOI
TL;DR: Already available approaches to construct and use pan-genomes are examined, the potential benefits of future technologies and methodologies are discussed, and open challenges from the vantage point of the above-mentioned biological disciplines are reviewed.
Abstract: Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

220 citations


Journal ArticleDOI
TL;DR: A method within the Rosetta macromolecular modeling suite (flex ddG) that samples conformational diversity using "backrub" to generate an ensemble of models and then applies torsion minimization, side chain repacking, and averaging across this ensemble to estimate interface ΔΔG values is developed.
Abstract: Computationally modeling changes in binding free energies upon mutation (interface ΔΔG) allows large-scale prediction and perturbation of protein-protein interactions. Additionally, methods that consider and sample relevant conformational plasticity should be able to achieve higher prediction accuracy over methods that do not. To test this hypothesis, we developed a method within the Rosetta macromolecular modeling suite (flex ddG) that samples conformational diversity using "backrub" to generate an ensemble of models and then applies torsion minimization, side chain repacking, and averaging across this ensemble to estimate interface ΔΔG values. We tested our method on a curated benchmark set of 1240 mutants, and found the method outperformed existing methods that sampled conformational space to a lesser degree. We observed considerable improvements with flex ddG over existing methods on the subset of small side chain to large side chain mutations, as well as for multiple simultaneous non-alanine mutations, stabilizing mutations, and mutations in antibody-antigen interfaces. Finally, we applied a generalized additive model (GAM) approach to the Rosetta energy function; the resulting nonlinear reweighting model improved the agreement with experimentally determined interface ΔΔG values but also highlighted the necessity of future energy function improvements.

165 citations


Journal ArticleDOI
TL;DR: A population-genomic analysis of more than 800 isolates of Staphylococcus aureus reveals details of the pathogen’s evolutionary trajectory, including how this has been influenced by animal domestication and antibiotic use.
Abstract: The capacity for some pathogens to jump into different host-species populations is a major threat to public health and food security. Staphylococcus aureus is a multi-host bacterial pathogen responsible for important human and livestock diseases. Here, using a population-genomic approach, we identify humans as a major hub for ancient and recent S. aureus host-switching events linked to the emergence of endemic livestock strains, and cows as the main animal reservoir for the emergence of human epidemic clones. Such host-species transitions are associated with horizontal acquisition of genetic elements from host-specific gene pools conferring traits required for survival in the new host-niche. Importantly, genes associated with antimicrobial resistance are unevenly distributed among human and animal hosts, reflecting distinct antibiotic usage practices in medicine and agriculture. In addition to gene acquisition, genetic diversification has occurred in pathways associated with nutrient acquisition, implying metabolic remodelling after a host switch in response to distinct nutrient availability. For example, S. aureus from dairy cattle exhibit enhanced utilization of lactose, a major source of carbohydrate in bovine milk. Overall, our findings highlight the influence of human activities on the multi-host ecology of a major bacterial pathogen, underpinned by horizontal gene transfer and core genome diversification.

129 citations


Journal ArticleDOI
TL;DR: This work finds that classification accuracy can be used to assess the discrepancy between simulated and observed data and the complete arsenal of classification methods becomes thereby available for inference of intractable generative models.
Abstract: Increasingly complex generative models are being used across disciplines as they allow for realistic characterization of data, but a common difficulty with them is the prohibitively large computational cost to evaluate the likelihood function and thus to perform likelihood-based statistical inference. A likelihood-free inference framework has emerged where the parameters are identified by finding values that yield simulated data resembling the observed data. While widely applicable, a major difficulty in this framework is how to measure the discrepancy between the simulated and observed data. Transforming the original problem into a problem of classifying the data into simulated versus observed, we find that classification accuracy can be used to assess the discrepancy. The complete arsenal of classification methods becomes thereby available for inference of intractable generative models. We validate our approach using theory and simulations for both point estimation and Bayesian inference, and demonstrate its use on real data by inferring an individual-based epidemiological model for bacterial infections in child care centers.

118 citations
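The core idea in the abstract above (using the accuracy of a classifier that separates simulated from observed data as the discrepancy in likelihood-free inference) can be sketched in a few lines. The Gaussian toy simulator, the logistic-regression classifier, and the grid search below are illustrative assumptions, not the paper's full framework.

```python
# Sketch: classification accuracy as a data discrepancy for likelihood-free inference.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def simulator(theta, n=200):
    # Hypothetical generative model: i.i.d. Gaussian draws with unknown mean theta.
    return rng.normal(theta, 1.0, size=(n, 1))

observed = simulator(1.5)  # pretend these are the real data

def discrepancy(theta):
    simulated = simulator(theta)
    X = np.vstack([observed, simulated])
    y = np.r_[np.zeros(len(observed)), np.ones(len(simulated))]
    # Cross-validated accuracy of telling simulated from observed data:
    # ~0.5 means the classifier cannot separate them, i.e. theta fits well.
    return cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Crude point estimate: the parameter whose simulations are hardest to classify.
grid = np.linspace(0.0, 3.0, 31)
theta_hat = grid[np.argmin([discrepancy(t) for t in grid])]
print(theta_hat)
```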


Journal ArticleDOI
TL;DR: It is demonstrated that antibiotic resistance in E. coli can be accurately predicted from whole genome sequences without a priori knowledge of mechanisms, and that both genomic and epidemiological data can be informative.
Abstract: The emergence of microbial antibiotic resistance is a global health threat. In clinical settings, the key to controlling spread of resistant strains is accurate and rapid detection. As traditional culture-based methods are time consuming, genetic approaches have recently been developed for this task. The detection of antibiotic resistance is typically made by measuring a few known determinants previously identified from genome sequencing, and thus requires the prior knowledge of its biological mechanisms. To overcome this limitation, we employed machine learning models to predict resistance to 11 compounds across four classes of antibiotics from existing and novel whole genome sequences of 1936 E. coli strains. We considered a range of methods, and examined population structure, isolation year, gene content, and polymorphism information as predictors. Gradient boosted decision trees consistently outperformed alternative models with an average accuracy of 0.91 on held-out data (range 0.81-0.97). While the best models most frequently employed gene content, an average accuracy score of 0.79 could be obtained using population structure information alone. Single nucleotide variation data were less useful, and significantly improved prediction only for two antibiotics, including ciprofloxacin. These results demonstrate that antibiotic resistance in E. coli can be accurately predicted from whole genome sequences without a priori knowledge of mechanisms, and that both genomic and epidemiological data can be informative. This paves the way for integrating machine learning approaches into diagnostic tools in the clinic.

113 citations
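The best-performing setup in the abstract above, gradient boosted decision trees on gene-content features, can be sketched minimally. The synthetic presence/absence matrix and toy resistance label below are placeholders, not the study's dataset.

```python
# Sketch: gradient boosted trees mapping gene-content features to a resistance phenotype.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n_strains, n_genes = 500, 200
X = rng.integers(0, 2, size=(n_strains, n_genes))   # presence/absence of accessory genes
y = (X[:, 0] | X[:, 3]).astype(int)                 # toy "resistance" driven by two genes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```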


Journal ArticleDOI
TL;DR: It is concluded that adenomas evolve across an undulating fitness landscape, whereas carcinomas occupy a sharper fitness peak, probably owing to stabilizing selection.
Abstract: The evolutionary events that cause colorectal adenomas (benign) to progress to carcinomas (malignant) remain largely undetermined. Using multi-region genome and exome sequencing of 24 benign and malignant colorectal tumours, we investigate the evolutionary fitness landscape occupied by these neoplasms. Unlike carcinomas, advanced adenomas frequently harbour sub-clonal driver mutations—considered to be functionally important in the carcinogenic process—that have not swept to fixation, and have relatively high genetic heterogeneity. Carcinomas are distinguished from adenomas by widespread aneusomies that are usually clonal and often accrue in a ‘punctuated’ fashion. We conclude that adenomas evolve across an undulating fitness landscape, whereas carcinomas occupy a sharper fitness peak, probably owing to stabilizing selection.

99 citations


Journal ArticleDOI
TL;DR: In this article, a comprehensive view of recent population history (≤100 generations), the timespan during which most rare-disease-causing alleles arose, was assembled by comparing pairwise haplotype sharing from 43,254 Finns to that of 16,060 Swedes, Estonians, Russians, and Hungarians from geographically and linguistically adjacent countries with different population histories.
Abstract: Finland provides unique opportunities to investigate population and medical genomics because of its adoption of unified national electronic health records, detailed historical and birth records, and serial population bottlenecks. We assembled a comprehensive view of recent population history (≤100 generations), the timespan during which most rare-disease-causing alleles arose, by comparing pairwise haplotype sharing from 43,254 Finns to that of 16,060 Swedes, Estonians, Russians, and Hungarians from geographically and linguistically adjacent countries with different population histories. We find much more extensive sharing in Finns, with at least one ≥ 5 cM tract on average between pairs of unrelated individuals. By coupling haplotype sharing with fine-scale birth records from more than 25,000 individuals, we find that although haplotype sharing broadly decays with geographical distance, there are pockets of excess haplotype sharing; individuals from northeast Finland typically share several-fold more of their genome in identity-by-descent segments than individuals from southwest regions. We estimate recent effective population-size changes through time across regions of Finland, and we find that there was more continuous gene flow as Finns migrated from southwest to northeast between the early- and late-settlement regions than was dichotomously described previously. Lastly, we show that haplotype sharing is locally enriched by an order of magnitude among pairs of individuals sharing rare alleles and especially among pairs sharing rare disease-causing variants. Our work provides a general framework for using haplotype sharing to reconstruct an integrative view of recent population history and gain insight into the evolutionary origins of rare variants contributing to disease.

65 citations


Journal ArticleDOI
01 Mar 2018-Genetics
TL;DR: Epistasis may play an important role in both the short- and long-term adaptive evolution of bacteria, and, unlike in eukaryotes, is not limited to strong effect sizes, closely linked loci, or other conditions that limit the impact of recombination.
Abstract: The impact of epistasis on the evolution of multi-locus traits depends on recombination. While sexually reproducing eukaryotes recombine so frequently that epistasis between polymorphisms is not considered to play a large role in short-term adaptation, many bacteria also recombine, some to the degree that their populations are described as "panmictic" or "freely recombining." However, whether this recombination is sufficient to limit the ability of selection to act on epistatic contributions to fitness is unknown. We quantify homologous recombination in five bacterial pathogens and use these parameter estimates in a multilocus model of bacterial evolution with additive and epistatic effects. We find that even for highly recombining species (e.g., Streptococcus pneumoniae or Helicobacter pylori), selection on weak interactions between distant mutations is nearly as efficient as for an asexual species, likely because homologous recombination typically transfers only short segments. However, for strong epistasis, bacterial recombination accelerates selection, with the dynamics dependent on the amount of recombination and the number of loci. Epistasis may thus play an important role in both the short- and long-term adaptive evolution of bacteria, and, unlike in eukaryotes, is not limited to strong effect sizes, closely linked loci, or other conditions that limit the impact of recombination.

Journal ArticleDOI
01 Jul 2018
TL;DR: pairwiseMKL is introduced, the first method for time- and memory-efficient learning with multiple pairwise kernels; it provides accurate predictions using sparse solutions in terms of selected kernels, and therefore also automatically identifies data sources relevant for the prediction problem.
Abstract: Motivation Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for a small number of input pairs. Results We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it also automatically identifies data sources relevant for the prediction problem. Availability and implementation Code is available at https://github.com/aalto-ics-kepaco. Supplementary information Supplementary data are available at Bioinformatics online.
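To see why pairwise kernel spaces become so large, it helps to look at what a single pairwise (Kronecker) kernel is. The sketch below builds one explicitly for toy data and fits kernel ridge regression on it; this is precisely the naive construction that pairwiseMKL avoids, and the drug and cell-line features here are random placeholders.

```python
# Illustration (not pairwiseMKL itself) of a pairwise Kronecker kernel: the kernel
# between (drug i, cell j) and (drug k, cell l) factorises as K_drug[i,k] * K_cell[j,l].
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
drugs = rng.normal(size=(20, 16))   # hypothetical drug descriptors
cells = rng.normal(size=(15, 32))   # hypothetical cell-line features

K_drug = rbf_kernel(drugs)
K_cell = rbf_kernel(cells)
K_pair = np.kron(K_drug, K_cell)    # (20*15) x (20*15) pairwise kernel, grows very fast

y = rng.normal(size=K_pair.shape[0])                          # stand-in bioactivities per pair
alpha = np.linalg.solve(K_pair + 0.1 * np.eye(len(y)), y)     # kernel ridge regression
predictions = K_pair @ alpha
```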

Journal ArticleDOI
TL;DR: The first pan‐cancer, multi‐omics comparative analysis of the relative performance of two proteomic technologies, targeted reverse phase protein array (RPPA) and global mass spectrometry (MS), in terms of their accuracy for predicting the sensitivity of cancer cells to both cytotoxic chemotherapeutics and molecularly targeted anticancer compounds is carried out.
Abstract: Motivation Proteomics profiling is increasingly being used for molecular stratification of cancer patients and cell-line panels. However, systematic assessment of the predictive power of large-scale proteomic technologies across various drug classes and cancer types is currently lacking. To that end, we carried out the first pan-cancer, multi-omics comparative analysis of the relative performance of two proteomic technologies, targeted reverse phase protein array (RPPA) and global mass spectrometry (MS), in terms of their accuracy for predicting the sensitivity of cancer cells to both cytotoxic chemotherapeutics and molecularly targeted anticancer compounds. Results Our results in two cell-line panels demonstrate how MS profiling improves drug response predictions beyond that of the RPPA or the other omics profiles when used alone. However, frequent missing MS data values complicate its use in predictive modeling and required additional filtering, such as focusing on completely measured or known oncoproteins, to obtain maximal predictive performance. Rather strikingly, the two proteomics profiles provided complementary predictive signal both for the cytotoxic and targeted compounds. Further, information about the cellular-abundance of primary target proteins was found critical for predicting the response of targeted compounds, although the non-target features also contributed significantly to the predictive power. The clinical relevance of the selected protein markers was confirmed in cancer patient data. These results provide novel insights into the relative performance and optimal use of the widely applied proteomic technologies, MS and RPPA, which should prove useful in translational applications, such as defining the best combination of omics technologies and marker panels for understanding and predicting drug sensitivities in cancer patients. Availability and implementation Processed datasets, R as well as Matlab implementations of the methods are available at https://github.com/mehr-een/bemkl-rbps. Contact mehreen.ali@helsinki.fi or tero.aittokallio@fimm.fi. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this paper, a two-stage approach is proposed to construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions.
Abstract: This paper discusses predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We argue that in many cases one can benefit from a decision theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions. The model built in the first step is referred to as the reference model and the operation during the latter step as predictive projection. The key characteristic of this approach is that it finds an excellent tradeoff between sparsity and predictive accuracy, and the gain comes from utilizing all available information including prior and that coming from the left out features. We review several methods that follow this principle and provide novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation that allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach could be beneficial. The benefits are illustrated via several simulated and real world examples.
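The two-stage idea in the abstract above can be sketched for the Gaussian case: fit a rich reference model, then greedily search for the smallest feature subset whose predictions reproduce the reference predictions. The ridge reference model, squared-error projection criterion, and greedy forward search below are simplifying assumptions, not the paper's exact projection technique.

```python
# Sketch: reference model first, then projection of its predictions onto feature subsets.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=n)   # only two truly relevant features

reference = Ridge(alpha=1.0).fit(X, y)
mu_ref = reference.predict(X)                          # reference predictions to be matched

selected, remaining = [], list(range(p))
while len(selected) < 5:
    errs = []
    for j in remaining:
        cols = selected + [j]
        proj = LinearRegression().fit(X[:, cols], mu_ref).predict(X[:, cols])
        errs.append(np.mean((proj - mu_ref) ** 2))     # how well the subset mimics mu_ref
    best = remaining[int(np.argmin(errs))]
    selected.append(best)
    remaining.remove(best)
print("selected features:", selected)
```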

Journal ArticleDOI
01 Sep 2018
TL;DR: This work presents a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column, and shows that retention order is much better conserved between instruments than retention time.
Abstract: Motivation Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. Results We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: We show that retention order is much better conserved between instruments than retention time. To this end, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn retention behaviour of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run. Availability and implementation Implementation of the method is available at https://version.aalto.fi/gitlab/bache1/retention_order_prediction.git.
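Learning a retention *order* rather than a retention time is naturally cast as a pairwise ranking problem. The sketch below trains a linear scoring function on pairwise feature differences (a RankSVM-style surrogate); the random molecular descriptors and latent retention times are placeholders, not the authors' features or model.

```python
# Sketch: pairwise ranking so that sign(w.(x_i - x_j)) predicts which molecule elutes later.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_mols, n_feat = 100, 50
X = rng.normal(size=(n_mols, n_feat))                              # hypothetical descriptors
t = X @ rng.normal(size=n_feat) + 0.1 * rng.normal(size=n_mols)    # latent retention times

pairs, labels = [], []
for _ in range(2000):
    i, j = rng.choice(n_mols, size=2, replace=False)
    pairs.append(X[i] - X[j])
    labels.append(1 if t[i] > t[j] else -1)    # +1 if molecule i elutes later than j

ranker = LinearSVC().fit(np.array(pairs), labels)
# The learned weights induce a retention order via the score X @ w.
order = np.argsort(X @ ranker.coef_.ravel())
```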

Journal ArticleDOI
09 May 2018
TL;DR: This work proposes a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference – and provides a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows.
Abstract: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.

Journal ArticleDOI
TL;DR: Modelling the discrepancy between the simulated and observed data with a Gaussian process (GP) can reduce the number of model evaluations required by approximate Bayesian computation, and the choice of GP formulation is shown to significantly affect the accuracy of the estimated posterior.
Abstract: Approximate Bayesian computation (ABC) can be used for model fitting when the likelihood function is intractable but simulating from the model is feasible. However, even a single evaluation of a complex model may take several hours, limiting the number of model evaluations available. Modelling the discrepancy between the simulated and observed data using a Gaussian process (GP) can be used to reduce the number of model evaluations required by ABC, but the sensitivity of this approach to a specific GP formulation has not yet been thoroughly investigated. We begin with a comprehensive empirical evaluation of using GPs in ABC, including various transformations of the discrepancies and two novel GP formulations. Our results indicate the choice of GP may significantly affect the accuracy of the estimated posterior distribution. Selection of an appropriate GP model is thus important. We formulate expected utility to measure the accuracy of classifying discrepancies below or above the ABC threshold, and show that it can be used to automate the GP model selection step. Finally, based on the understanding gained with toy examples, we fit a population genetic model for bacteria, providing insight into horizontal gene transfer events within the population and from external origins.
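The surrogate idea in the abstract above can be sketched as follows: fit a GP to a small set of (parameter, discrepancy) pairs, then use the GP to estimate, for any parameter value, the probability that the discrepancy falls below the ABC threshold. The toy simulator, kernel choice, threshold, and implicit uniform prior are illustrative assumptions.

```python
# Sketch: a GP surrogate over the discrepancy for ABC with a small simulation budget.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
observed_mean = 1.5

def discrepancy(theta):
    simulated = rng.normal(theta, 1.0, size=100)   # stand-in for an expensive simulator
    return abs(simulated.mean() - observed_mean)

thetas = rng.uniform(-2, 5, size=30).reshape(-1, 1)            # small budget of simulations
ds = np.array([discrepancy(t[0]) for t in thetas])

gp = GaussianProcessRegressor(RBF() + WhiteKernel(), normalize_y=True).fit(thetas, ds)

grid = np.linspace(-2, 5, 200).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
eps = 0.2
# GP-based probability that the discrepancy is below the ABC threshold, used as an
# unnormalised approximate posterior over theta (uniform prior assumed).
accept_prob = norm.cdf((eps - mean) / std)
```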

Journal ArticleDOI
TL;DR: This work demonstrates an integrated use of the rich bioactivity data from DTC and related drug databases using Drug Target Profiler (DTP), an open-source software and web tool for interactive exploration of drug-target interaction networks.
Abstract: Knowledge of the full target space of drugs (or drug-like compounds) provides important insights into the potential therapeutic use of the agents to modulate or avoid their various on- and off-targets in drug discovery and precision medicine. However, there is a lack of consolidated databases and associated data exploration tools that allow for systematic profiling of drug target-binding potencies of both approved and investigational agents using a network-centric approach. We recently initiated a community-driven platform, Drug Target Commons (DTC), which is an open-data crowdsourcing platform designed to improve the management, reproducibility and extended use of compound-target bioactivity data for drug discovery and repurposing, as well as target identification applications. In this work, we demonstrate an integrated use of the rich bioactivity data from DTC and related drug databases using Drug Target Profiler (DTP), an open-source software and web tool for interactive exploration of drug-target interaction networks. DTP was designed for network-centric modeling of mode-of-action of multi-targeting anticancer compounds, especially for precision oncology applications. DTP enables users to construct an interaction network based on integrated bioactivity data across selected chemical compounds and their protein targets, further customizable using various visualization and filtering options, as well as cross-links to several drug and protein databases to provide comprehensive information of the network nodes and interactions. We demonstrate here the operation of the DTP tool and its unique features by several use cases related to both drug discovery and drug repurposing applications, using examples of anticancer drugs with shared target profiles. DTP is freely accessible at http://drugtargetprofiler.fimm.fi/.

Proceedings Article
31 Mar 2018
TL;DR: An information-theoretic criterion for Bayesian network structure learning called quotient normalized maximum likelihood (qNML) is introduced; it satisfies the property of score equivalence, is decomposable, and is completely free of adjustable hyperparameters.
Abstract: We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.

Book ChapterDOI
TL;DR: Recurrent neural networks are applied to classifying process instances in a supervised fashion using labeled process instances extracted from event log traces, the first reported use of GRU for this task; GRU outperforms LSTM remarkably in training time while giving almost identical accuracies to LSTM models.
Abstract: Process Mining consists of techniques where logs created by operative systems are transformed into process models. In process mining tools it is often desired to be able to classify ongoing process instances, e.g., to predict how long the process will still require to complete, or to classify process instances to different classes based only on the activities that have occurred in the process instance thus far. Recurrent neural networks and their subclasses, such as Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), have been demonstrated to be able to learn relevant temporal features for subsequent classification tasks. In this paper we apply recurrent neural networks to classifying process instances. The proposed model is trained in a supervised fashion using labeled process instances extracted from event log traces. This is the first time we know of GRU having been used in classifying business process instances. Our main experimental results show that GRU outperforms LSTM remarkably in training time while giving almost identical accuracies to LSTM models. An additional contribution of our paper is improving the classification model training time by filtering infrequent activities, a technique commonly used, e.g., in Natural Language Processing (NLP).
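A GRU classifier over activity sequences, as described in the abstract above, can be sketched compactly. The vocabulary size, embedding and hidden dimensions, padding scheme, and toy batch below are illustrative assumptions, not the authors' configuration.

```python
# Sketch: classifying (prefixes of) process instances with a GRU over activity IDs.
import torch
import torch.nn as nn

class ProcessClassifier(nn.Module):
    def __init__(self, n_activities, n_classes, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_activities, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, activity_ids):              # (batch, max_trace_len), 0 = padding
        _, h = self.gru(self.embed(activity_ids))
        return self.out(h[-1])                    # class logits from the final hidden state

model = ProcessClassifier(n_activities=50, n_classes=2)
traces = torch.randint(1, 50, (8, 20))            # a toy batch of activity sequences
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(traces), labels)
loss.backward()
```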

Journal ArticleDOI
TL;DR: In this paper, a Gaussian process (GP) based method, mGPfusion, is proposed for predicting a protein's stability changes upon single and multiple mutations; because the accuracy of predictive models is ultimately constrained by the limited availability of experimental data, the method complements experimental measurements with large amounts of molecular simulation data.
Abstract: Motivation Proteins are commonly used by the biochemical industry for numerous processes. Refining these proteins' properties via mutations also causes stability effects. An accurate computational method to predict how mutations affect protein stability is necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data. Results We have developed mGPfusion, a novel Gaussian process (GP) method for predicting a protein's stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only regarding the protein of interest and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves the model learning and prediction accuracy. Availability and implementation Software implementation and datasets are available at github.com/emmijokinen/mgpfusion. Supplementary information Supplementary data are available at Bioinformatics online.
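The data-fusion idea, trusting a few experimental measurements more than many simulated ones, can be sketched with a GP that assigns a larger noise level to the simulated points. This is not the mGPfusion model (no contact maps or graph kernels here); the features, noise levels, and data are assumptions.

```python
# Sketch: combining scarce experimental and abundant simulated data in one GP by
# giving each data source its own observation noise (sklearn's per-sample alpha).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_exp = rng.normal(size=(10, 5))                 # few experimental mutants (feature vectors)
y_exp = X_exp @ np.ones(5) + 0.05 * rng.normal(size=10)
X_sim = rng.normal(size=(200, 5))                # many simulated mutants
y_sim = X_sim @ np.ones(5) + 0.5 * rng.normal(size=200)   # noisier source

X = np.vstack([X_exp, X_sim])
y = np.concatenate([y_exp, y_sim])
alpha = np.concatenate([np.full(10, 0.05**2), np.full(200, 0.5**2)])  # per-source noise

gp = GaussianProcessRegressor(kernel=RBF(), alpha=alpha).fit(X, y)
pred_mean, pred_std = gp.predict(rng.normal(size=(3, 5)), return_std=True)
```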

Journal ArticleDOI
TL;DR: An interactive and user-friendly multi-platform-compatible software, BasePlayer, is introduced, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings.
Abstract: Next-generation sequencing (NGS) is routinely applied in life sciences and clinical practice, but interpretation of the massive quantities of genomic data produced has become a critical challenge. The genome-wide mutation analyses enabled by NGS have had a revolutionary impact in revealing the predisposing and driving DNA alterations behind a multitude of disorders. The workflow to identify causative mutations from NGS data, for example in cancer and rare diseases, commonly involves phases such as quality filtering, case–control comparison, genome annotation, and visual validation, which require multiple processing steps and usage of various tools and scripts. To this end, we have introduced an interactive and user-friendly multi-platform-compatible software, BasePlayer, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings. A genome-wide scan of regulatory regions for mutation clusters can be carried out with a desktop computer in ~10 min with a dataset of 3 million somatic variants in 200 whole-genome-sequenced (WGS) cancers. Here, the authors describe how to use BasePlayer, an interactive and user-friendly software that facilitates the identification of causative mutations from next-generation sequencing data.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper forms the novel problem of explainable time series tweaking, where, given a time series and an opaque classifier that provides a particular classification decision for the time series, the aim is to find the minimum number of changes to be performed to the given time series so that the classifier changes its decision to another class.
Abstract: Time series classification has received great attention over the past decade with a wide range of methods focusing on predictive performance by exploiting various types of temporal features. Nonetheless, little emphasis has been placed on interpretability and explainability. In this paper, we formulate the novel problem of explainable time series tweaking, where, given a time series and an opaque classifier that provides a particular classification decision for the time series, we want to find the minimum number of changes to be performed to the given time series so that the classifier changes its decision to another class. We show that the problem is NP-hard, and focus on two instantiations of the problem, which we refer to as reversible and irreversible time series tweaking. The classifier under investigation is the random shapelet forest classifier. Moreover, we propose two algorithmic solutions for the two problems along with simple optimizations, as well as a baseline solution using the nearest neighbor classifier. An extensive experimental evaluation on a variety of real datasets demonstrates the usefulness and effectiveness of our problem formulation and solutions.
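In the spirit of the nearest-neighbour baseline mentioned in the abstract (not the random-shapelet-forest algorithms themselves), a tweaking procedure can be sketched as moving the input series toward its nearest training example of the desired class until the classifier's decision flips, keeping the change small. The data, classifier, and interpolation scheme are toy assumptions.

```python
# Sketch: nearest-neighbour-guided tweaking that flips a time series classifier's decision.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, length = 200, 50
X = rng.normal(size=(n, length))
y = (X[:, :10].mean(axis=1) > 0).astype(int)      # toy class depends on the first segment
clf = RandomForestClassifier(random_state=0).fit(X, y)

def tweak(series, target_class):
    candidates = X[y == target_class]
    nearest = candidates[np.argmin(np.linalg.norm(candidates - series, axis=1))]
    for lam in np.linspace(0.0, 1.0, 101):        # smallest interpolation that flips the label
        tweaked = (1 - lam) * series + lam * nearest
        if clf.predict(tweaked.reshape(1, -1))[0] == target_class:
            return tweaked, lam
    return nearest, 1.0

x0 = X[0]
x_tweaked, effort = tweak(x0, target_class=1 - clf.predict(x0.reshape(1, -1))[0])
```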

Journal ArticleDOI
24 Oct 2018-BMJ Open
TL;DR: Women aged below 30 and from the most deprived areas were at highest risk of depression and most likely to receive antidepressant treatment and more than one in eight women received antidepressant treatment in this period.
Abstract: Objectives To investigate how depression is recognised in the year after child birth and treatment given in clinical practice. Design Cohort study based on UK primary care electronic health records. Setting Primary care. Participants Women who have given live birth between 2000 and 2013. Outcomes Prevalence of postnatal depression, depression diagnoses, depressive symptoms, antidepressant and non-pharmacological treatment within a year after birth. Results Of 206 517 women, 23 623 (11%) had a record of depressive diagnosis or symptoms in the year after delivery and more than one in eight women received antidepressant treatment. Recording and treatment peaked 6–8 weeks after delivery. Initiation of selective serotonin reuptake inhibitors (SSRI) treatment has become earlier in the more recent years. Thus, the initiation rate of SSRI treatment per 100 pregnancies (95% CI) at 8 weeks were 2.6 (2.5 to 2.8) in 2000–2004, increasing to 3.0 (2.9 to 3.1) in 2005–2009 and 3.8 (3.6 to 3.9) in 2010–2013. The overall rate of initiation of SSRI within the year after delivery, however, has not changed noticeably. A third of the women had at least one record suggestive of depression at any time prior to delivery and of these one in four received SSRI treatment in the year after delivery. Younger women were most likely to have records of depression and depressive symptoms. (Relative risk for postnatal depression: age 15–19: 1.92 (1.76 to 2.10), age 20–24: 1.49 (1.39 to 1.59) versus age 30–34). The risk of depression, postnatal depression and depressive symptoms increased with increasing social deprivation. Conclusions More than 1 in 10 women had electronic health records indicating depression diagnoses or depressive symptoms within a year after delivery and more than one in eight women received antidepressant treatment in this period. Women aged below 30 and from the most deprived areas were at highest risk of depression and most likely to receive antidepressant treatment.

Journal ArticleDOI
TL;DR: It is shown that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method.
Abstract: Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy has emerged as a potentially promising solution: privacy is considered sufficient if presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities. We show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method. Our method matches the predictive accuracy of the state-of-the-art non-private lasso regression using only 4x more samples under relatively strong differential privacy guarantees. Good performance with limited data is achieved by limiting the sharing of private information by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. The proposed differentially private regression method combines theoretical appeal and asymptotic efficiency with good prediction accuracy even with moderate-sized data. As already the simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields. This article was reviewed by Zoltan Gaspari and David Kreil.
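One standard way to make linear regression differentially private, perturbing clipped sufficient statistics with Gaussian noise, illustrates the abstract's point that tighter data bounds mean less noise for the same privacy level; it is not necessarily the paper's exact mechanism. The norm bounds, budget split, and add/remove-one neighbourhood assumption are spelled out in the comments.

```python
# Sketch: differentially private linear regression via noisy sufficient statistics.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

B, C = 1.0, 1.0                                   # assumed per-sample norm bounds
X = X / np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True) / B)   # clip row norms
y = np.clip(y, -C, C)

eps, delta = 1.0, 1e-5
def gauss_sigma(sensitivity, eps, delta):
    # Standard Gaussian-mechanism calibration; (eps/2, delta/2) spent on each statistic.
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

# Under add/remove-one neighbouring datasets, one clipped sample changes X^T X by at
# most B^2 (Frobenius norm) and X^T y by at most B*C (L2 norm).
XtX = X.T @ X + rng.normal(scale=gauss_sigma(B * B, eps / 2, delta / 2), size=(d, d))
XtX = (XtX + XtX.T) / 2                           # symmetrise (post-processing, privacy-free)
Xty = X.T @ y + rng.normal(scale=gauss_sigma(B * C, eps / 2, delta / 2), size=d)

lam = 1.0                                         # ridge term keeps the noisy solve stable
w_private = np.linalg.solve(XtX + lam * np.eye(d), Xty)
```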

Journal ArticleDOI
TL;DR: The evolution of game genres from 1979 till 2010 is analyzed, indicating that until 1990, there have been many genres competing for dominance, but thereafter sport-racing, strategy, and action have become the most prevalent genres.
Abstract: Establishing genres is the first step toward analyzing games and how the genre landscape evolves over the years. We use data-driven modeling that distils genres from textual descriptions of a large collection of games. We analyze the evolution of game genres from 1979 till 2010. Our results indicate that until 1990, there have been many genres competing for dominance, but thereafter sport-racing, strategy, and action have become the most prevalent genres. Moreover, we find that games vary to a great extent as to whether they belong mostly to one genre or to a combination of several genres. We also compare the results of our data-driven model with two product databases, Metacritic and Mobygames, and observe that the classifications of games to different genres are substantially different, even between product databases. We conclude with discussion on potential future applications and how they may further our understanding of video game genres.
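One plausible way to distil genres from textual game descriptions, as the abstract describes, is a topic model; the abstract does not name the exact model, so the LDA choice and the toy descriptions below are assumptions.

```python
# Sketch: latent "genres" as topics learned from game descriptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

descriptions = [
    "race cars on fast tracks against rival drivers",
    "build an empire, manage resources and command armies",
    "shoot enemies and dodge bullets in frantic action",
    "kick the ball and score goals in a football league",
]
counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
genre_mixture = lda.transform(counts)   # each game as a mixture over latent genres
```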

01 Jan 2018
TL;DR: According to the rule of different tracks at the SAT Competition 2017, multple versions of abcdSAT are developed, which are submitted to agile, main, no-limit, incremental library and parallel track.

Proceedings ArticleDOI
17 Nov 2018
TL;DR: This paper proposes a novel approach to maximizing the diversity of exposure in a social network, introducing an extension to the notion of random reverse-reachable sets, and demonstrates the efficiency and scalability of the resulting algorithm on several real-world datasets.
Abstract: Social-media platforms have created new ways for citizens to stay informed and participate in public debates. However, to enable a healthy environment for information sharing, social deliberation, and opinion formation, citizens need to be exposed to sufficiently diverse viewpoints that challenge their assumptions, instead of being trapped inside filter bubbles. In this paper, we take a step in this direction and propose a novel approach to maximize the diversity of exposure in a social network. We formulate the problem in the context of information propagation, as a task of recommending a small number of news articles to selected users. We propose a realistic setting where we take into account content and user leanings, and the probability of further sharing an article. This setting allows us to capture the balance between maximizing the spread of information and ensuring the exposure of users to diverse viewpoints. The resulting problem can be cast as maximizing a monotone and submodular function subject to a matroid constraint on the allocation of articles to users. It is a challenging generalization of the influence maximization problem. Yet, we are able to devise scalable approximation algorithms by introducing a novel extension to the notion of random reverse-reachable sets. We experimentally demonstrate the efficiency and scalability of our algorithm on several real-world datasets.
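The random reverse-reachable (RR) set machinery mentioned in the abstract can be sketched for the plain influence-maximization case: sample RR sets under the independent-cascade model and greedily pick seeds that cover the most sets. The diversity-of-exposure objective and matroid constraint of the paper are not reproduced here, and the random graph and probabilities are placeholders.

```python
# Sketch: RR-set sampling plus greedy maximum coverage for influence maximization.
import random
from collections import defaultdict

random.seed(0)
n_nodes, p_edge, p_influence = 200, 0.03, 0.1
graph_in = defaultdict(list)                       # incoming edges: node -> predecessors
for u in range(n_nodes):
    for v in range(n_nodes):
        if u != v and random.random() < p_edge:
            graph_in[v].append(u)

def random_rr_set():
    # Reverse BFS from a random root, keeping each incoming edge with probability p_influence.
    root = random.randrange(n_nodes)
    rr, frontier = {root}, [root]
    while frontier:
        node = frontier.pop()
        for pred in graph_in[node]:
            if pred not in rr and random.random() < p_influence:
                rr.add(pred)
                frontier.append(pred)
    return rr

rr_sets = [random_rr_set() for _ in range(5000)]

# Greedy coverage: repeatedly pick the node appearing in the most uncovered RR sets.
seeds, covered = [], set()
for _ in range(5):
    counts = defaultdict(int)
    for idx, rr in enumerate(rr_sets):
        if idx not in covered:
            for node in rr:
                counts[node] += 1
    best = max(counts, key=counts.get)
    seeds.append(best)
    covered |= {idx for idx, rr in enumerate(rr_sets) if best in rr}
print("seed nodes:", seeds)
```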

Journal ArticleDOI
TL;DR: Overall, the results show how brain activity in holistic vs analytical participants differs when viewing the same drama movie.
Abstract: People socialized in different cultures differ in their thinking styles. Eastern-culture people view objects more holistically by taking context into account, whereas Western-culture people view objects more analytically by focusing on them at the expense of context. Here we studied whether participants, who have different thinking styles but live within the same culture, exhibit differential brain activity when viewing a drama movie. A total of 26 Finnish participants, who were divided into holistic and analytical thinkers based on self-report questionnaire scores, watched a shortened drama movie during functional magnetic resonance imaging. We compared intersubject correlation (ISC) of brain hemodynamic activity of holistic vs analytical participants across the movie viewings. Holistic thinkers showed significant ISC in more extensive cortical areas than analytical thinkers, suggesting that they perceived the movie in a more similar fashion. Significantly higher ISC was observed in holistic thinkers in occipital, prefrontal and temporal cortices. In analytical thinkers, significant ISC was observed in right-hemisphere fusiform gyrus, temporoparietal junction and frontal cortex. Since these results were obtained in participants with similar cultural background, they are less prone to confounds by other possible cultural differences. Overall, our results show how brain activity in holistic vs analytical participants differs when viewing the same drama movie.
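The intersubject correlation (ISC) measure used in the abstract can be sketched directly: for each brain region, correlate every pair of subjects' time series and average the pairwise correlations. The data shapes below are toy assumptions rather than the study's preprocessed fMRI data.

```python
# Sketch: pairwise intersubject correlation averaged per region.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_regions = 26, 300, 100
data = rng.normal(size=(n_subjects, n_timepoints, n_regions))   # stand-in fMRI time series

isc = np.zeros(n_regions)
for region in range(n_regions):
    r_values = [np.corrcoef(data[i, :, region], data[j, :, region])[0, 1]
                for i, j in combinations(range(n_subjects), 2)]
    isc[region] = np.mean(r_values)
```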

Journal ArticleDOI
TL;DR: This work focuses on a setting where the user provides only the abstract of a new paper as input, and proposes a model to expand the semantic features of the given abstract using knowledge graphs and combine them with other features to fit a learning to rank model.
Abstract: Scholarly search engines, reference management tools, and academic social networks enable modern researchers to organize their scientific libraries. Moreover, they often provide recommendations for scientific publications that might be of interest to researchers. Because of the exponentially increasing volume of publications, effective citation recommendation is of great importance to researchers, as it reduces the time and effort spent on retrieving, understanding, and selecting research papers. In this context, we address the problem of citation recommendation, i.e., the task of recommending citations for a new paper. Current research investigates this task in different settings, including cases where rich user metadata is available (e.g., user profile, publications, citations). This work focuses on a setting where the user provides only the abstract of a new paper as input. Our proposed approach is to expand the semantic features of the given abstract using knowledge graphs and combine them with other features (e.g., indegree, recency) to fit a learning-to-rank model. This model is used to generate the citation recommendations. By evaluating on real data, we show that the expanded semantic features improve the quality of the recommendations as measured by nDCG@10.
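The evaluation metric named at the end of the abstract, nDCG@10, can be written out explicitly for binary relevance (1 = the candidate paper is actually cited). The example input is illustrative.

```python
# Sketch: nDCG@10 with binary relevance labels given in the ranked order of recommendations.
import numpy as np

def ndcg_at_10(relevances_in_ranked_order):
    rel = np.asarray(relevances_in_ranked_order, dtype=float)
    top = rel[:10]
    dcg = np.sum(top / np.log2(np.arange(2, len(top) + 2)))
    ideal = np.sort(rel)[::-1][:10]                      # best possible ordering, truncated at 10
    idcg = np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_10([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))
```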