
Showing papers by "Helsinki Institute for Information Technology" published in 2019


Journal ArticleDOI
TL;DR: SIRIUS 4 is a fast and highly accurate tool for molecular structure interpretation from mass-spectrometry-based metabolomics data and integrates CSI:FingerID for searching in molecular structure databases.
Abstract: Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets. SIRIUS 4 is a fast and highly accurate tool for molecular structure interpretation from mass-spectrometry-based metabolomics data.

620 citations


Proceedings Article
15 Apr 2019
TL;DR: This work presents an evaluation metric that can separately and reliably measure both the quality and coverage of the samples produced by a generative model and the perceptual quality of individual samples, and extends it to study latent space interpolations.
Abstract: The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.
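
The core of the metric is easy to prototype: approximate each distribution's manifold by hyperspheres whose radii reach the k-th nearest neighbour, then count how many samples of the other set fall inside. A minimal NumPy sketch of that idea (illustrative only; the published metric operates on pretrained feature embeddings and batched k-NN, and the function names here are ours):

```python
import numpy as np

def knn_radii(feats, k=3):
    """Radius of each point's k-th nearest neighbour within the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)        # column 0 is the point itself (0.0)
    return d_sorted[:, k]                # k-th neighbour, excluding self

def in_manifold(queries, support, radii):
    """A query is on the manifold if it falls inside any support hypersphere."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return (d <= radii[None, :]).any(axis=1)

def precision_recall(real, fake, k=3):
    precision = in_manifold(fake, real, knn_radii(real, k)).mean()
    recall = in_manifold(real, fake, knn_radii(fake, k)).mean()
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))             # stand-ins for feature embeddings
fake = rng.normal(loc=0.5, size=(200, 16))
print(precision_recall(real, fake))
```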

308 citations


Proceedings Article
01 Jan 2019
TL;DR: A differentiable rendering framework that allows gradients to be analytically computed for all pixels in an image by viewing foreground rasterization as a weighted interpolation of local properties and background rasterization as a distance-based aggregation of global geometry.
Abstract: Many machine learning models operate on images, but ignore the fact that images are 2D projections formed by 3D geometry interacting with light, in a process called rendering. Enabling ML models to understand image formation might be key for generalization. However, due to an essential rasterization step involving discrete assignment operations, rendering pipelines are non-differentiable and thus largely inaccessible to gradient-based ML techniques. In this paper, we present DIB-Render, a novel rendering framework through which gradients can be analytically computed. Key to our approach is to view rasterization as a weighted interpolation, allowing image gradients to back-propagate through various standard vertex shaders within a single framework. Our approach supports optimizing over vertex positions, colors, normals, light directions and texture coordinates, and allows us to incorporate various well-known lighting models from graphics. We showcase our approach in two ML applications: single-image 3D object prediction, and 3D textured object generation, both trained using exclusively 2D supervision.
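
The two rasterization views named above can be sketched for a single triangle: inside the triangle, a pixel's value is a barycentric (weighted) interpolation of vertex attributes; outside, the pixel contributes a soft, distance-based coverage term. A toy NumPy illustration (our simplification, not the paper's renderer; distance to the nearest vertex stands in crudely for the true pixel-to-face distance):

```python
import numpy as np

def barycentric(p, v):
    """Barycentric coordinates of points p (N, 2) w.r.t. triangle v (3, 2)."""
    (x0, y0), (x1, y1), (x2, y2) = v
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    l0 = ((y1 - y2) * (p[:, 0] - x2) + (x2 - x1) * (p[:, 1] - y2)) / det
    l1 = ((y2 - y0) * (p[:, 0] - x2) + (x0 - x2) * (p[:, 1] - y2)) / det
    return np.stack([l0, l1, 1.0 - l0 - l1], axis=1)

def soft_rasterize(pixels, tri, vert_colors, sigma=0.05):
    lam = barycentric(pixels, tri)
    inside = (lam >= 0).all(axis=1)
    # Foreground: weighted interpolation of vertex attributes (differentiable).
    color = lam.clip(0, None) @ vert_colors
    # Background/silhouette: distance-based aggregation; nearest-vertex
    # distance is a toy proxy for the pixel-to-face distance.
    d = np.linalg.norm(pixels[:, None, :] - tri[None, :, :], axis=-1).min(axis=1)
    alpha = np.where(inside, 1.0, np.exp(-d**2 / sigma))
    return color, alpha

pixels = np.stack(np.meshgrid(np.linspace(0, 1, 8),
                              np.linspace(0, 1, 8)), -1).reshape(-1, 2)
tri = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
colors = np.eye(3)                  # red, green, blue at the three vertices
img, alpha = soft_rasterize(pixels, tri, colors)
print(img.shape, alpha.shape)       # (64, 3) (64,)
```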

231 citations


Posted Content
TL;DR: This work builds on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improves two key aspects: image quality and training efficiency.
Abstract: We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.
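
The "blind spot" idea is simple to demonstrate: the predictor for a pixel never sees that pixel's own noisy value, so it cannot learn the identity mapping and must denoise from context. A minimal sketch with a fixed linear filter (the paper realises the blind spot inside a CNN with shifted receptive fields and combines it with an explicit noise model; none of that appears here):

```python
import numpy as np
from scipy.signal import convolve2d

# A 3x3 averaging kernel whose centre weight is forced to zero: each pixel's
# prediction depends only on its neighbours, never on its own noisy value.
kernel = np.ones((3, 3)) / 8.0
kernel[1, 1] = 0.0

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 32), (32, 1))        # smooth toy image
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

pred = convolve2d(noisy, kernel, mode="same", boundary="symm")
print(f"noisy MSE {np.mean((noisy - clean) ** 2):.4f} -> "
      f"blind-spot MSE {np.mean((pred - clean) ** 2):.4f}")
```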

149 citations


Journal ArticleDOI
TL;DR: The state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction are described and a perspective on further opportunities to make better use of high-dimensional multi-omics profiles are given.
Abstract: In-depth modeling of the complex interplay among multiple omics data measured from cancer cell lines or patient tumors is providing new opportunities toward identification of tailored therapies for individual cancer patients. Supervised machine learning algorithms are increasingly being applied to the omics profiles as they enable integrative analyses among the high-dimensional data sets, as well as personalized predictions of therapy responses using multi-omics panels of response-predictive biomarkers identified through feature selection and cross-validation. However, technical variability and frequent missingness in input “big data” require the application of dedicated data preprocessing pipelines that often lead to some loss of information and compressed view of the biological signal. We describe here the state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction and give our perspective on further opportunities to make better use of high-dimensional multi-omics profiles along with knowledge about cancer pathways targeted by anti-cancer compounds when predicting their phenotypic responses.

138 citations


Journal ArticleDOI
TL;DR: Existing non-vaccine serotypes in most GPSCs preclude the removal of these lineages by pneumococcal conjugate vaccines, leaving potential for serotype replacement, and a subset of GPSCs show increased resistance and/or serotype-independent invasiveness.

124 citations


Journal ArticleDOI
TL;DR: This work rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data and provides a method for rapidly partitioning an existing hierarchy in order to maximize the DPM model marginal likelihood.
Abstract: We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach is able to cluster datasets 10-100 times larger than the existing model-based methods, which we demonstrate by analyzing an alignment of over 110 000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy in order to maximize the DPM model marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while being significantly faster. The method is made freely available under an open source MIT licence as an easy to use R package at https://github.com/gtonkinhill/fastbaps.

123 citations


Proceedings Article
01 Jan 2019
TL;DR: This work presents Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data that can simultaneously learn the embedding of high dimensional trajectories and infer arbitrarily complex continuous-time latent dynamics.
Abstract: We present Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data. Leveraging the advances in deep generative models, ODE2VAE can simultaneously learn the embedding of high dimensional trajectories and infer arbitrarily complex continuous-time latent dynamics. Our model explicitly decomposes the latent space into momentum and position components and solves a second order ODE system, which is in contrast to recurrent neural network (RNN) based time series models and recently proposed black-box ODE techniques. In order to account for uncertainty, we propose probabilistic latent ODE dynamics parameterized by deep Bayesian neural networks. We demonstrate our approach on motion capture, image rotation, and bouncing balls datasets. We achieve state-of-the-art performance in long term motion prediction and imputation tasks.
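
The latent dynamics described above amount to a second-order ODE written as a coupled first-order system over position s and velocity (momentum) v. A minimal sketch with a toy acceleration field and Euler integration (in the model proper, the acceleration is a deep Bayesian network and the initial states come from a VAE encoder; the names here are ours):

```python
import numpy as np

def f_acc(s, v, W1, W2):
    """Toy acceleration field; in ODE2VAE this is a (Bayesian) neural net."""
    return np.tanh(s @ W1 + v @ W2)

def integrate(s0, v0, W1, W2, dt=0.01, steps=100):
    """Euler integration of the coupled first-order system:
       ds/dt = v (position),  dv/dt = f(s, v) (velocity/momentum)."""
    s, v = s0.copy(), v0.copy()
    traj = [s.copy()]
    for _ in range(steps):
        s, v = s + dt * v, v + dt * f_acc(s, v, W1, W2)
        traj.append(s.copy())
    return np.stack(traj)

rng = np.random.default_rng(0)
d = 4                                    # latent dimensionality
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
traj = integrate(rng.normal(size=d), np.zeros(d), W1, W2)
print(traj.shape)                        # (101, 4) latent positions over time
```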

105 citations


Journal ArticleDOI
TL;DR: In this paper, the authors performed genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD-related phenotypes.
Abstract: Understanding the genetic architecture of the plasma lipidome could provide better insights into lipid metabolism and its link to cardiovascular diseases (CVDs). Here, we perform genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD-related phenotypes (n = 511,700 individuals). We identify 35 lipid-species-associated loci (P < 5 × 10^-8), 10 of which associate with CVD risk, including five new loci (COL5A1, GLTPD2, SPTLC3, MBOAT7 and GALNT16; false discovery rate < 0.05). We identify loci for lipid species that are shown to predict CVD, e.g., SPTLC3 for CER(d18:1/24:1). We show that lipoprotein lipase (LPL) may hydrolyze medium-length triacylglycerides (TAGs) more efficiently than others. Polyunsaturated lipids have the highest heritability and genetic correlations, suggesting considerable genetic regulation at the fatty-acid level. We find low genetic correlations between traditional lipids and lipid species. Our results show that lipidomic profiles capture information beyond traditional lipids and identify genetic variants modifying lipid levels and risk of CVD.

103 citations


Proceedings Article
01 Jan 2019
TL;DR: In this paper, a blind-spot network is used to train a denoising model on unorganized collections of corrupted images, without access to clean reference images or explicit pairs of corrupted images.
Abstract: We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.

84 citations


Journal ArticleDOI
TL;DR: Phylogeographical genomic analysis of Neisseria gonorrhoeae uncovers its recent emergence and current distribution into two distinct lineages that are differentially associated with antibiotic resistance and sexual networks.
Abstract: The sexually transmitted pathogen Neisseria gonorrhoeae is regarded as being on the way to becoming an untreatable superbug. Despite its clinical importance, little is known about its emergence and ...

Journal ArticleDOI
TL;DR: This work demonstrates how sensitive the geographic patterns of current PSs are to small biases even within relatively homogeneous populations and provides simple tools to identify such biases.
Abstract: Polygenic scores (PSs) are becoming a useful tool to identify individuals with high genetic risk for complex diseases, and several projects are currently testing their utility for translational applications. It is also tempting to use PSs to assess whether genetic variation can explain a part of the geographic distribution of a phenotype. However, it is not well known how the population genetic properties of the training and target samples affect the geographic distribution of PSs. Here, we evaluate geographic differences, and related biases, of PSs in Finland in a geographically well-defined sample of 2,376 individuals from the National FINRISK study. First, we detect geographic differences in PSs for coronary artery disease (CAD), rheumatoid arthritis, schizophrenia, waist-hip ratio (WHR), body-mass index (BMI), and height, but not for Crohn disease or ulcerative colitis. Second, we use height as a model trait to thoroughly assess the possible population genetic biases in PSs and apply similar approaches to the other phenotypes. Most importantly, we detect suspiciously large accumulations of geographic differences for CAD, WHR, BMI, and height, suggesting bias arising from the population's genetic structure rather than from a direct genotype-phenotype association. This work demonstrates how sensitive the geographic patterns of current PSs are to small biases even within relatively homogeneous populations and provides simple tools to identify such biases. A thorough understanding of the effects of population genetic structure on PSs is essential for translational applications of PSs.
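
For reference, a polygenic score is just a weighted sum of allele dosages, so the bias the abstract describes shows up as regional drift in the score's mean. A hedged toy sketch (synthetic data; real analyses use GWAS effect estimates and genotype files):

```python
import numpy as np

def polygenic_score(dosages, effect_sizes):
    """PS_i = sum_j beta_j * g_ij, the usual weighted allele-dosage sum."""
    return dosages @ effect_sizes

rng = np.random.default_rng(0)
n_individuals, n_variants = 1000, 500
dosages = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)
betas = rng.normal(scale=0.05, size=n_variants)   # GWAS effect estimates
ps = polygenic_score(dosages, betas)

# The paper's concern: if training-sample biases correlate with geography,
# mean PS can drift between regions without any true phenotype difference.
region = rng.integers(0, 2, size=n_individuals)   # toy two-region split
print(ps[region == 0].mean(), ps[region == 1].mean())
```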

Journal ArticleDOI
TL;DR: DECREASE, an efficient machine learning model that requires only a limited set of pairwise dose–response measurements for the accurate prediction of synergistic and antagonistic drug combinations, is implemented.
Abstract: High-throughput drug combination screening provides a systematic strategy to discover unexpected combinatorial synergies in pre-clinical cell models. However, phenotypic combinatorial screening with multi-dose matrix assays is experimentally expensive, especially when the aim is to identify selective combination synergies across a large panel of cell lines or patient samples. Here we implemented DECREASE, an efficient machine learning model that requires only a limited set of pairwise dose-response measurements for accurate prediction of drug combination synergy and antagonism. Using a compendium of 23,595 drug combination matrices tested in various cancer cell lines, and malaria and Ebola infection models, we demonstrate how cost-effective experimental designs with DECREASE capture almost the same degree of information for synergy and antagonism detection as the fully-measured dose-response matrices. Measuring only the diagonal of the matrix provides an accurate and practical option for combinatorial screening. The open-source web-implementation enables applications of DECREASE to both pre-clinical and translational studies.

Journal ArticleDOI
TL;DR: This paper proposes to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and defines a loss function that measures this uncertainty and proposes to select the next evaluation location to minimise the expected loss.
Abstract: Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.
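
The key quantity is the uncertainty of the ABC posterior estimate under the surrogate. With a GP on the discrepancy Δ(θ) having mean μ(θ) and standard deviation σ(θ), the unnormalised ABC posterior is π(θ)·p(θ) with p = Φ((ε−μ)/σ), and its pointwise variance is π(θ)²·p·(1−p). The sketch below greedily simulates next where this pointwise uncertainty is largest, a simplification of the paper's expected-integrated-loss rule (toy surrogate functions and names are ours):

```python
import numpy as np
from scipy.stats import norm

def abc_posterior_stats(theta_grid, prior_pdf, gp_mean, gp_sd, eps):
    """Unnormalised ABC posterior pi(theta) * P(Delta < eps) under the GP,
    and the pointwise variance pi^2 * p * (1 - p) of that estimate."""
    p = norm.cdf((eps - gp_mean(theta_grid)) / gp_sd(theta_grid))
    dens = prior_pdf(theta_grid) * p
    var = prior_pdf(theta_grid) ** 2 * p * (1 - p)
    return dens, var

# Toy surrogate: discrepancy minimised near theta = 1, less explored far out.
gp_mean = lambda t: (t - 1.0) ** 2
gp_sd = lambda t: 0.2 + 0.3 * np.abs(t)    # more uncertainty away from data
prior_pdf = lambda t: norm.pdf(t, 0.0, 2.0)

grid = np.linspace(-4, 4, 401)
dens, var = abc_posterior_stats(grid, prior_pdf, gp_mean, gp_sd, eps=0.5)
theta_next = grid[np.argmax(var)]   # greedy stand-in for the expected-loss rule
print(f"next simulation at theta = {theta_next:.2f}")
```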

Journal ArticleDOI
TL;DR: Hypermethylated CpGs are revealed as a novel mechanism of action for DNMTi agents, and 638 hypermethylated molecular targets (CpGs) common to decitabine and azacytidine therapy are identified, suggesting that hypermethylation of CpGs should be considered when predicting DNMTi responses and side effects in cancer patients.
Abstract: DNA methyltransferase inhibitors (DNMTi) decitabine and azacytidine are approved therapies for myelodysplastic syndrome and acute myeloid leukemia, and their combinations with other anticancer agents are being tested as therapeutic options for multiple solid cancers such as colon, ovarian, and lung cancer. However, the current therapeutic challenges of DNMTis include development of resistance, severe side effects and no or partial treatment responses, as observed in more than half of the patients. Therefore, there is a critical need to better understand the mechanisms of action of these drugs. In order to discover molecular targets of DNMTi therapy, we identified 638 novel CpGs with an increased methylation in response to decitabine treatment in HCT116 cell lines and validated the findings in cell lines of multiple cancer types (e.g., bladder, ovarian, breast, and lymphoma), bone marrow mononuclear cells from primary leukemia patients, as well as peripheral blood mononuclear cells and ascites from platinum-resistant epithelial ovarian cancer patients. Azacytidine treatment also increased methylation of these CpGs in colon, ovarian, breast, and lymphoma cancer cell lines. Methylation at 166 identified CpGs strongly correlated (|r| ≥ 0.80) with corresponding gene expression in the HCT116 cell line. Differences in methylation at some of the identified CpGs and expression changes of the corresponding genes were observed in TCGA colon cancer tissue as compared to adjacent healthy tissue. Our analysis revealed that hypermethylated CpGs are involved in cancer cell proliferation and apoptosis by P53 and olfactory receptor pathways, hence influencing DNMTi responses. In conclusion, we showed hypermethylation of CpGs as a novel mechanism of action for DNMTi agents and identified 638 hypermethylated molecular targets (CpGs) common to decitabine and azacytidine therapy. These novel results suggest that hypermethylation of CpGs should be considered when predicting the DNMTi responses and side effects in cancer patients.

Journal ArticleDOI
TL;DR: The paper analyses the resource requirements of running DIDs on IoT devices, finds that even quite small devices can successfully deploy DIDs, and proposes that the most constrained devices could rely on a proxy approach.
Abstract: When IoT devices operate not only with the owner of the device but also with third parties, identifying the device using a permanent identifier, e.g., a hardware identifier, can present privacy problems due to the identifier facilitating tracking and correlation attacks. A changeable identifier can be used to reduce the risk on privacy. This paper looks at using decentralised identifiers (DIDs), an upcoming standard of self-sovereign identifiers with multiple competing implementations, with IoT devices. The paper analyses the resource requirements of running DIDs on the IoT devices and finds that even quite small devices can successfully deploy DIDs and proposes that the most constrained devices could rely on a proxy approach. Finally, the privacy benefits and limitations of using DIDs are analysed, with the conclusion that DIDs significantly improve the users’ privacy when utilised properly.

Journal ArticleDOI
01 Apr 2019
TL;DR: This work systematically review and analyse state-of-the-art protocols for the three phases of private decision tree evaluation protocols: feature selection, comparison, and path evaluation, and identifies novel combinations of these protocols that provide better tradeoffs than existing protocols.
Abstract: Decision trees and random forests are widely used classifiers in machine learning. Service providers often host classification models in a cloud service and provide an interface for clients to use the model remotely. While the model is sensitive information of the server, the input query and prediction results are sensitive information of the client. This motivates the need for private decision tree evaluation, where the service provider does not learn the client’s input and the client does not learn the model except for its size and the result. In this work, we identify the three phases of private decision tree evaluation protocols: feature selection, comparison, and path evaluation. We systematize constant-round protocols for each of these phases to identify the best available instantiations using the two main paradigms for secure computation: garbling techniques and homomorphic encryption. There is a natural tradeoff between runtime and communication considering these two paradigms: garbling techniques use fast symmetric-key operations but require a large amount of communication, while homomorphic encryption is computationally heavy but requires little communication. Our contributions are as follows: Firstly, we systematically review and analyse state-of-the-art protocols for the three phases of private decision tree evaluation. Our methodology allows us to identify novel combinations of these protocols that provide better tradeoffs than existing protocols. Thereafter, we empirically evaluate all combinations of these protocols by providing communication and runtime measures, and provide recommendations based on the identified concrete tradeoffs.
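
To make the three phases concrete, here is plaintext decision tree evaluation decomposed into exactly those steps; a private protocol must realise each step so the server never sees x and the client learns only the result (illustrative Python, structure and names are ours):

```python
from dataclasses import dataclass

@dataclass
class Node:
    feature: int = -1       # index chosen in the feature-selection phase
    threshold: float = 0.0  # compared against in the comparison phase
    left: "Node" = None
    right: "Node" = None
    label: int = -1         # leaf reached in the path-evaluation phase

def evaluate(node, x):
    while node.label == -1:
        selected = x[node.feature]                    # 1) feature selection
        go_right = selected >= node.threshold         # 2) comparison
        node = node.right if go_right else node.left  # 3) path evaluation
    return node.label

tree = Node(feature=0, threshold=0.5,
            left=Node(label=0),
            right=Node(feature=1, threshold=1.5,
                       left=Node(label=1), right=Node(label=2)))
print(evaluate(tree, [0.7, 2.0]))   # -> 2
```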

Proceedings ArticleDOI
01 Jan 2019
TL;DR: A conditional lower bound is proved stating that, for any constant ε > 0, an O(|E|^{1-ε} m)-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false.
Abstract: Exact string matching in labeled graphs is the problem of searching paths of a graph G=(V,E) such that the concatenation of their node labels is equal to the given pattern string P[1..m]. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks. We prove a conditional lower bound stating that, for any constant epsilon>0, an O(|E|^{1 - epsilon} m)-time, or an O(|E| m^{1 - epsilon})-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree two, i.e. to zig-zag matching in bidirectional strings, or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree three. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS'16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear-time solvable problems. An interesting corollary is that exact and approximate matching are equally hard (quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear-time vs quadratic-time solutions, respectively (approximate pattern matching having also a matching SETH lower bound (Backurs and Indyk, STOC'15)).
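
The upper bound that this lower bound matches is a simple dynamic program over pattern positions, shown below; the paper's result says that, under SETH, no algorithm can beat this O(|E|·m) behaviour even on binary alphabets and very sparse graphs (a sketch; variable names are ours):

```python
def graph_string_match(labels, edges, pattern):
    """O(|E| * m) dynamic program: match[v] is True after round i iff some
    path ending at node v spells pattern[:i+1]."""
    n, m = len(labels), len(pattern)
    preds = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].append(u)
    match = [labels[v] == pattern[0] for v in range(n)]
    for i in range(1, m):
        match = [labels[v] == pattern[i] and any(match[u] for u in preds[v])
                 for v in range(n)]
    return any(match)

labels = ["a", "b", "a", "b"]               # node labels
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # directed edges
print(graph_string_match(labels, edges, "abab"))   # True: path 0-1-2-3
```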

Journal ArticleDOI
TL;DR: Reduced representation bisulfite sequencing on red blood cell-derived DNA showed genome-wide temporal changes in more than 40,000 out of the 522,643 CpG sites examined, and sites that showed a temporal and treatment-specific response in DNA methylation are candidate sites of interest for future studies trying to understand the link between DNA methylation patterns and timing of reproduction.
Abstract: In seasonal environments, timing of reproduction is a trait with important fitness consequences, but we know little about the molecular mechanisms that underlie the variation in this trait. Recently ...

Posted ContentDOI
06 Feb 2019-bioRxiv
TL;DR: TCRGP, a novel Gaussian process method that predicts whether TCRs recognize certain epitopes, is developed; it outperforms other state-of-the-art methods in epitope-specificity predictions and is used to find HBV-epitope-specific T cells and their transcriptomic states in hepatocellular carcinoma patients.
Abstract: T cell receptors (TCRs) can recognize various pathogens and consequently start immune responses. TCRs can be sequenced from individuals and methods that can analyze the specificity of the TCRs can help us better understand the individual's immune status in different diseases. We have developed TCRGP, a novel Gaussian process (GP) method that can predict if TCRs recognize certain epitopes. This method can utilize different CDR sequences from both TCRα and TCRβ chains from single-cell data and learn which CDRs are important in recognizing the different epitopes. We have experimented with one previously presented and one new data set and show that TCRGP outperforms other state-of-the-art methods in predicting the epitope specificity of TCRs on both data sets. The software implementation and data sets are available at https://github.com/emmijokinen/TCRGP.

Posted Content
TL;DR: In this paper, a framework based on non-linear independent component analysis (ICA) is proposed to infer causal relationships between two or more passively observed variables in the presence of general non-linear dependencies, exploiting the non-stationarity of observations to recover the underlying sources or latent disturbances.
Abstract: We consider the problem of inferring causal relationships between two or more passively observed variables. While the problem of such causal discovery has been extensively studied especially in the bivariate setting, the majority of current methods assume a linear causal relationship, and the few methods which consider non-linear dependencies usually make the assumption of additive noise. Here, we propose a framework through which we can perform causal discovery in the presence of general non-linear relationships. The proposed method is based on recent progress in non-linear independent component analysis and exploits the non-stationarity of observations in order to recover the underlying sources or latent disturbances. We show rigorously that in the case of bivariate causal discovery, such non-linear ICA can be used to infer the causal direction via a series of independence tests. We further propose an alternative measure of causal direction based on asymptotic approximations to the likelihood ratio, as well as an extension to multivariate causal discovery. We demonstrate the capabilities of the proposed method via a series of simulation studies and conclude with an application to neuroimaging data.
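
A rough feel for the independence-test logic, in a simpler setting than the paper's: under an additive-noise model, the residual of regressing effect on cause should be nearly independent of the cause, while the reverse direction leaves dependence. This sketch uses an off-the-shelf regressor and mutual information rather than the paper's non-linear ICA with non-stationarity (our simplification):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_selection import mutual_info_regression

def residual_dependence(cause, effect, seed=0):
    """Regress effect on cause, then measure how dependent the residual
    still is on the cause (lower = more plausible causal direction)."""
    model = KNeighborsRegressor(n_neighbors=50).fit(cause.reshape(-1, 1), effect)
    resid = effect - model.predict(cause.reshape(-1, 1))
    return mutual_info_regression(cause.reshape(-1, 1), resid,
                                  random_state=seed)[0]

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.tanh(x) + 0.3 * rng.normal(size=2000)      # ground truth: x -> y

print("x->y residual dependence:", residual_dependence(x, y))
print("y->x residual dependence:", residual_dependence(y, x))
```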

Proceedings Article
01 Jan 2019
TL;DR: In this article, the authors use pointer authentication (PA) to build novel defenses against various classes of run-time attacks, including the first PA-based mechanism for data pointer integrity.
Abstract: Run-time attacks against programs written in memory-unsafe programming languages (e.g., C and C++) remain a prominent threat against computer systems. The prevalence of techniques like return-oriented programming (ROP) in attacking real-world systems has prompted major processor manufacturers to design hardware-based countermeasures against specific classes of run-time attacks. An example is the recently added support for pointer authentication (PA) in the ARMv8-A processor architecture, commonly used in devices like smartphones. PA is a low-cost technique to authenticate pointers so as to resist memory vulnerabilities. It has been shown to enable practical protection against memory vulnerabilities that corrupt return addresses or function pointers. However, so far, PA has received very little attention as a general-purpose protection mechanism to harden software against various classes of memory attacks. In this paper, we use PA to build novel defenses against various classes of run-time attacks, including the first PA-based mechanism for data pointer integrity. We present PARTS, an instrumentation framework that integrates our PA-based defenses into the LLVM compiler and the GNU/Linux operating system and show, via systematic evaluation, that PARTS provides better protection than current solutions at a reasonable performance overhead.
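
The PA primitive itself is easy to model: a pointer authentication code (PAC) is a short keyed MAC over the pointer and a context modifier, stashed in the unused upper bits and checked before use. A conceptual Python model (HMAC stands in for the hardware's cipher; key handling, bit layouts, and the compiler instrumentation that PARTS adds are all simplified away):

```python
import hmac, hashlib

KEY = b"per-boot pointer-authentication key"   # held by hardware in real PA

def pac(pointer: int, modifier: int) -> int:
    """Truncated keyed MAC over (pointer, modifier), standing in for the PAC
    that ARMv8.3 PA stores in a 64-bit pointer's unused upper bits."""
    msg = pointer.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:2],
                          "little")

def sign(pointer: int, modifier: int) -> int:
    return pointer | (pac(pointer, modifier) << 48)   # tag the upper bits

def authenticate(signed: int, modifier: int) -> int:
    pointer = signed & ((1 << 48) - 1)
    if (signed >> 48) != pac(pointer, modifier):
        raise ValueError("pointer authentication failed")  # traps in hardware
    return pointer

p = sign(0x7f00_dead_beef, modifier=42)
print(hex(authenticate(p, modifier=42)))      # valid pointer passes
try:
    authenticate(p ^ 0x1000, modifier=42)     # corrupted pointer is rejected
except ValueError as e:
    print(e)
```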

Proceedings Article
16 Apr 2019
TL;DR: This work proposes two variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance.
Abstract: Variable selection for Gaussian process models is often done using automatic relevance determination, which uses the inverse length-scale parameter of each input variable as a proxy for variable relevance. This implicitly determined relevance has several drawbacks that prevent the selection of optimal input variables in terms of predictive performance. To improve on this, we propose two novel variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance. Our empirical results on synthetic and real world data sets demonstrate improved variable selection compared to automatic relevance determination in terms of variability and predictive performance.
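
The ARD baseline the abstract criticises is one line once a GP is fitted: rank inputs by inverse length-scale. A sketch of that baseline with scikit-learn (the paper's point is that rankings derived from the model's predictions near the training data are more reliable than this proxy):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)  # dims 2, 3 irrelevant

# One length-scale per input dimension = automatic relevance determination.
kernel = RBF(length_scale=np.ones(4)) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

relevance = 1.0 / gp.kernel_.k1.length_scale      # inverse length-scales
print(np.argsort(relevance)[::-1])                # expect dims 0, 1 ranked first
```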

Journal ArticleDOI
TL;DR: Genomic sequence-based phylogenetic analyses demonstrate that Ko3 and Ko4 formed well-defined sequence clusters related to, but distinct from, Klebsiella michiganensis and K. huaxiensis, while biochemical and MALDI-ToF mass spectrometry analyses differentiate Ko3, Ko4, and Ko8 from the other K. oxytoca species.
Abstract: Klebsiella oxytoca causes opportunistic human infections and post-antibiotic haemorrhagic diarrhea. This Enterobacteriaceae species is genetically heterogeneous and is currently subdivided into seven phylogroups (Ko1 to Ko4 and Ko6 to Ko8). Here we investigated the taxonomic status of phylogroups Ko3 and Ko4. Genomic sequence-based phylogenetic analyses demonstrate that Ko3 and Ko4 formed well-defined sequence clusters related to, but distinct from, Klebsiella michiganensis (Ko1), K. oxytoca (Ko2), K. huaxiensis (Ko8), and K. grimontii (Ko6). The average nucleotide identity (ANI) of Ko3 and Ko4 were 90.7% with K. huaxiensis and 95.5% with K. grimontii, respectively. In addition, three strains of K. huaxiensis, a species so far described based on a single strain from a urinary tract infection patient in China, were isolated from cattle and human feces. Biochemical and MALDI-ToF mass spectrometry analysis allowed differentiating Ko3, Ko4, and Ko8 from the other K. oxytoca species. Based on these results, we propose the names Klebsiella spallanzanii for the Ko3 phylogroup, with SPARK_775_C1T (CIP 111695T and DSM 109531T) as type strain, and Klebsiella pasteurii for Ko4, with SPARK_836_C1T (CIP 111696T and DSM 109530T) as type strain. Strains of K. spallanzanii were isolated from human urine, cow feces, and farm surfaces, while strains of K. pasteurii were found in fecal carriage from humans, cows, and turtles.

Journal ArticleDOI
TL;DR: Application of the model-free SpydrPick method to large population genomic datasets of two major human pathogens revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
Abstract: Covariance-based discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population level covariation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which adjusts for the phylogenetic signal in the data without requiring an explicit phylogenetic tree. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Simulations demonstrate the usefulness of our method and give some insight to when this type of analysis is most likely to be successful. Application of the method to large population genomic datasets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
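
At its core, the model-free test is pairwise mutual information between alignment columns. A bare-bones sketch (SpydrPick additionally corrects for population structure by weighting sequences and keeps only outlier pairs; none of that is included here):

```python
import numpy as np
from collections import Counter

def column_mi(a, b):
    """Plug-in mutual information between two alignment columns (loci)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in pab.items():
        pxy = c / n
        mi += pxy * np.log(pxy / (pa[x] / n * pb[y] / n))
    return mi

rng = np.random.default_rng(0)
n = 500
locus1 = rng.choice(list("AC"), size=n)
locus2 = np.where(rng.random(n) < 0.9, locus1, "G")   # co-varies with locus1
locus3 = rng.choice(list("AG"), size=n)               # independent locus

print(column_mi(locus1, locus2), column_mi(locus1, locus3))
```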

Posted ContentDOI
21 Aug 2019-bioRxiv
TL;DR: In this paper, a Gaussian process method was proposed to predict if TCRs recognize certain epitopes, which can utilize CDR sequences from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes.
Abstract: T cell receptors (TCRs) can recognize various pathogens and consequently start immune responses. TCRs can be sequenced from individuals and methods analyzing the specificity of the TCRs can help us better understand individuals’ immune status in different diseases. We have developed TCRGP, a novel Gaussian process method to predict if TCRs recognize certain epitopes. This method can utilize CDR sequences from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes. We have experimented with epitope-specific data against 29 epitopes and performed a comprehensive evaluation with existing prediction methods. On this data, TCRGP outperforms other state-of-the-art methods in epitope-specificity predictions. We also propose a novel analysis approach for combined single-cell RNA and TCRαβ (scRNA+TCRαβ) sequencing data by quantifying epitope-specific TCRs with TCRGP in phenotypes identified from scRNA-seq data. With this approach, we find HBV-epitope specific T cells and their transcriptomic states in hepatocellular carcinoma patients.

Proceedings Article
16 Apr 2019
TL;DR: In this article, a novel deep learning paradigm of differential flows is proposed that learns stochastic differential equation transformations of inputs prior to a standard classification or regression function, where the key property of differential Gaussian processes is the warping of inputs through infinitely deep, but infinitesimal, differential fields.
Abstract: We propose a novel deep learning paradigm of differential flows that learn stochastic differential equation transformations of inputs prior to a standard classification or regression function. The key property of differential Gaussian processes is the warping of inputs through infinitely deep, but infinitesimal, differential fields, that generalise discrete layers into a dynamical system. We demonstrate excellent results as compared to deep Gaussian processes and Bayesian neural networks.

Journal ArticleDOI
TL;DR: In this article, the authors propose an infrastructure that allows CC researchers to build workflows that can be executed online and be easily reused by others through the workflow web address, leading to novel ways of software composition for computational purposes that were not expected in advance.
Abstract: Computational creativity (CC) is a multidisciplinary research field, studying how to engineer software that exhibits behavior that would reasonably be deemed creative. This paper shows how composition of software solutions in this field can effectively be supported through a CC infrastructure that supports user-friendly development of CC software components and workflows, their sharing, execution, and reuse. The infrastructure allows CC researchers to build workflows that can be executed online and be easily reused by others through the workflow web address. Moreover, it enables the building of procedures composed of software developed by different researchers from different laboratories, leading to novel ways of software composition for computational purposes that were not expected in advance. This capability is illustrated on a workflow that implements a Concept Generator prototype based on the Conceptual Blending framework. The prototype consists of a composition of modules made available as web services, and is explored and tested through experiments involving blending of texts from different domains, blending of images, and poetry generation.

Proceedings Article
24 May 2019
TL;DR: This paper presents gradKCCA, a large-scale sparse non-linear canonical correlation method that outperforms state-of-the-art CCA methods in terms of speed and robustness to noise both in simulated and real-world datasets.
Abstract: This paper presents gradKCCA, a large-scale sparse non-linear canonical correlation method. Like Kernel Canonical Correlation Analysis (KCCA), our method finds non-linear relations through kernel functions, but it does not rely on a kernel matrix, a known bottleneck for scaling up kernel methods. gradKCCA corresponds to solving KCCA with the additional constraint that the canonical projection directions in the kernel-induced feature space have preimages in the original data space. Firstly, this modification allows us to very efficiently maximize kernel canonical correlation through an alternating projected gradient algorithm working in the original data space. Secondly, we can control the sparsity of the projection directions by constraining the ℓ1 norm of the preimages of the projection directions, facilitating the interpretation of the discovered patterns, which is not available through KCCA. Our empirical experiments demonstrate that gradKCCA outperforms state-of-the-art CCA methods in terms of speed and robustness to noise both in simulated and real-world datasets.

Journal ArticleDOI
TL;DR: MetABF, a simple Bayesian framework for performing integrative meta‐analysis across multiple GWAS using summary statistics, is described, which can increase the power by 50% compared with standard frequentist tests when only a subset of studies have a true effect.
Abstract: Genome-wide association studies (GWAS) are a powerful tool for understanding the genetic basis of diseases and traits, but most studies have been conducted in isolation, with a focus on either a single or a set of closely related phenotypes. We describe MetABF, a simple Bayesian framework for performing integrative meta-analysis across multiple GWAS using summary statistics. The approach is applicable across a wide range of study designs and can increase the power by 50% compared with standard frequentist tests when only a subset of studies have a true effect. We demonstrate its utility in a meta-analysis of 20 diverse GWAS which were part of the Wellcome Trust Case Control Consortium 2. The novelty of the approach is its ability to explore, and assess the evidence for a range of possible true patterns of association across studies in a computationally efficient framework.
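
The building block is Wakefield's approximate Bayes factor per study; the meta-analysis then weighs evidence over the possible subsets of studies harbouring a true effect. A simplified sketch under independent effects and a uniform subset prior (our simplification of the framework; the paper also handles correlated and fixed effects):

```python
import numpy as np
from itertools import product

def wakefield_abf(beta, se, prior_sd=0.2):
    """Approximate Bayes factor for association in one study:
       beta_hat ~ N(0, V) under H0 and N(0, V + W) under H1."""
    V, W = se**2, prior_sd**2
    z2 = (beta / se) ** 2
    return np.sqrt(V / (V + W)) * np.exp(z2 * W / (2 * (V + W)))

def subset_meta_abf(betas, ses, prior_sd=0.2):
    """Average evidence over all non-null subsets of studies with a true
    effect (independent-effects simplification, uniform subset prior)."""
    bfs = np.array([wakefield_abf(b, s, prior_sd) for b, s in zip(betas, ses)])
    subsets = list(product([0, 1], repeat=len(bfs)))[1:]   # drop all-null
    return np.mean([np.prod(bfs[np.array(s, bool)]) for s in subsets])

betas = np.array([0.12, 0.10, 0.01])   # effect estimates from three GWAS
ses = np.array([0.03, 0.04, 0.03])     # their standard errors
print(subset_meta_abf(betas, ses))
```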