
Showing papers by "Helsinki Institute for Information Technology" published in 2019


Journal ArticleDOI
TL;DR: SIRIUS 4 is a fast and highly accurate tool for molecular structure interpretation from mass-spectrometry-based metabolomics data and integrates CSI:FingerID for searching in molecular structure databases.
Abstract: Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets. SIRIUS 4 is a fast and highly accurate tool for molecular structure interpretation from mass-spectrometry-based metabolomics data.

620 citations


Proceedings Article
15 Apr 2019
TL;DR: This work presents an evaluation metric that can separately and reliably measure both the quality and coverage of the samples produced by a generative model and the perceptual quality of individual samples, and extends it to study latent space interpolations.
Abstract: The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.
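
The core of the metric is easy to prototype: approximate each distribution's manifold by hyperspheres whose radii reach the k-th nearest neighbour, then count how many samples of the other set fall inside. A minimal NumPy sketch of that idea (illustrative only; the published metric operates on pretrained feature embeddings and batched k-NN, and the function names here are ours):

```python
import numpy as np

def knn_radii(feats, k=3):
    """Radius of each point's k-th nearest neighbour within the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)        # column 0 is the point itself (0.0)
    return d_sorted[:, k]                # k-th neighbour, excluding self

def in_manifold(queries, support, radii):
    """A query is on the manifold if it falls inside any support hypersphere."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return (d <= radii[None, :]).any(axis=1)

def precision_recall(real, fake, k=3):
    precision = in_manifold(fake, real, knn_radii(real, k)).mean()
    recall = in_manifold(real, fake, knn_radii(fake, k)).mean()
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))             # stand-ins for feature embeddings
fake = rng.normal(loc=0.5, size=(200, 16))
print(precision_recall(real, fake))
```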

308 citations


Proceedings Article
01 Jan 2019
TL;DR: A differentiable rendering framework that allows gradients to be analytically computed for all pixels in an image by viewing foreground rasterization as a weighted interpolation of local properties and background rasterization as a distance-based aggregation of global geometry.
Abstract: Many machine learning models operate on images, but ignore the fact that images are 2D projections formed by 3D geometry interacting with light, in a process called rendering. Enabling ML models to understand image formation might be key for generalization. However, due to an essential rasterization step involving discrete assignment operations, rendering pipelines are non-differentiable and thus largely inaccessible to gradient-based ML techniques. In this paper, we present DIB-Render, a novel rendering framework through which gradients can be analytically computed. Key to our approach is to view rasterization as a weighted interpolation, allowing image gradients to back-propagate through various standard vertex shaders within a single framework. Our approach supports optimizing over vertex positions, colors, normals, light directions and texture coordinates, and allows us to incorporate various well-known lighting models from graphics. We showcase our approach in two ML applications: single-image 3D object prediction, and 3D textured object generation, both trained using exclusively 2D supervision.
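
The two rasterization views named above can be sketched for a single triangle: inside the triangle, a pixel's value is a barycentric (weighted) interpolation of vertex attributes; outside, the pixel contributes a soft, distance-based coverage term. A toy NumPy illustration (our simplification, not the paper's renderer; distance to the nearest vertex stands in crudely for the true pixel-to-face distance):

```python
import numpy as np

def barycentric(p, v):
    """Barycentric coordinates of points p (N, 2) w.r.t. triangle v (3, 2)."""
    (x0, y0), (x1, y1), (x2, y2) = v
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    l0 = ((y1 - y2) * (p[:, 0] - x2) + (x2 - x1) * (p[:, 1] - y2)) / det
    l1 = ((y2 - y0) * (p[:, 0] - x2) + (x0 - x2) * (p[:, 1] - y2)) / det
    return np.stack([l0, l1, 1.0 - l0 - l1], axis=1)

def soft_rasterize(pixels, tri, vert_colors, sigma=0.05):
    lam = barycentric(pixels, tri)
    inside = (lam >= 0).all(axis=1)
    # Foreground: weighted interpolation of vertex attributes (differentiable).
    color = lam.clip(0, None) @ vert_colors
    # Background/silhouette: distance-based aggregation; nearest-vertex
    # distance is a toy proxy for the pixel-to-face distance.
    d = np.linalg.norm(pixels[:, None, :] - tri[None, :, :], axis=-1).min(axis=1)
    alpha = np.where(inside, 1.0, np.exp(-d**2 / sigma))
    return color, alpha

pixels = np.stack(np.meshgrid(np.linspace(0, 1, 8),
                              np.linspace(0, 1, 8)), -1).reshape(-1, 2)
tri = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
colors = np.eye(3)                  # red, green, blue at the three vertices
img, alpha = soft_rasterize(pixels, tri, colors)
print(img.shape, alpha.shape)       # (64, 3) (64,)
```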

231 citations


Posted Content
TL;DR: This work builds on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improves two key aspects: image quality and training efficiency.
Abstract: We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.
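
The "blind spot" idea is simple to demonstrate: the predictor for a pixel never sees that pixel's own noisy value, so it cannot learn the identity mapping and must denoise from context. A minimal sketch with a fixed linear filter (the paper realises the blind spot inside a CNN with shifted receptive fields and combines it with an explicit noise model; none of that appears here):

```python
import numpy as np
from scipy.signal import convolve2d

# A 3x3 averaging kernel whose centre weight is forced to zero: each pixel's
# prediction depends only on its neighbours, never on its own noisy value.
kernel = np.ones((3, 3)) / 8.0
kernel[1, 1] = 0.0

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 32), (32, 1))        # smooth toy image
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

pred = convolve2d(noisy, kernel, mode="same", boundary="symm")
print(f"noisy MSE {np.mean((noisy - clean) ** 2):.4f} -> "
      f"blind-spot MSE {np.mean((pred - clean) ** 2):.4f}")
```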

149 citations


Journal ArticleDOI
TL;DR: The state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction are described and a perspective on further opportunities to make better use of high-dimensional multi-omics profiles are given.
Abstract: In-depth modeling of the complex interplay among multiple omics data measured from cancer cell lines or patient tumors is providing new opportunities toward identification of tailored therapies for individual cancer patients. Supervised machine learning algorithms are increasingly being applied to the omics profiles as they enable integrative analyses among the high-dimensional data sets, as well as personalized predictions of therapy responses using multi-omics panels of response-predictive biomarkers identified through feature selection and cross-validation. However, technical variability and frequent missingness in input “big data” require the application of dedicated data preprocessing pipelines that often lead to some loss of information and compressed view of the biological signal. We describe here the state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction and give our perspective on further opportunities to make better use of high-dimensional multi-omics profiles along with knowledge about cancer pathways targeted by anti-cancer compounds when predicting their phenotypic responses.

138 citations


Journal ArticleDOI
TL;DR: Existing non-vaccine serotypes in most GPSCs preclude the removal of these lineages by pneumococcal conjugate vaccines, leaving potential for serotype replacement, and a subset of GPSCs show increased resistance and/or serotype-independent invasiveness.

124 citations


Journal ArticleDOI
TL;DR: This work rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data and provides a method for rapidly partitioning an existing hierarchy in order to maximize the DPM model marginal likelihood.
Abstract: We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach is able to cluster datasets 10-100 times larger than the existing model-based methods, which we demonstrate by analyzing an alignment of over 110 000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy in order to maximize the DPM model marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while being significantly faster. The method is made freely available under an open source MIT licence as an easy to use R package at https://github.com/gtonkinhill/fastbaps.

123 citations


Proceedings Article
01 Jan 2019
TL;DR: This work presents Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data that can simultaneously learn the embedding of high dimensional trajectories and infer arbitrarily complex continuous-time latent dynamics.
Abstract: We present Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data. Leveraging the advances in deep generative models, ODE2VAE can simultaneously learn the embedding of high dimensional trajectories and infer arbitrarily complex continuous-time latent dynamics. Our model explicitly decomposes the latent space into momentum and position components and solves a second order ODE system, which is in contrast to recurrent neural network (RNN) based time series models and recently proposed black-box ODE techniques. In order to account for uncertainty, we propose probabilistic latent ODE dynamics parameterized by deep Bayesian neural networks. We demonstrate our approach on motion capture, image rotation, and bouncing balls datasets. We achieve state-of-the-art performance in long term motion prediction and imputation tasks.
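
The latent dynamics described above amount to a second-order ODE written as a coupled first-order system over position s and velocity (momentum) v. A minimal sketch with a toy acceleration field and Euler integration (in the model proper, the acceleration is a deep Bayesian network and the initial states come from a VAE encoder; the names here are ours):

```python
import numpy as np

def f_acc(s, v, W1, W2):
    """Toy acceleration field; in ODE2VAE this is a (Bayesian) neural net."""
    return np.tanh(s @ W1 + v @ W2)

def integrate(s0, v0, W1, W2, dt=0.01, steps=100):
    """Euler integration of the coupled first-order system:
       ds/dt = v (position),  dv/dt = f(s, v) (velocity/momentum)."""
    s, v = s0.copy(), v0.copy()
    traj = [s.copy()]
    for _ in range(steps):
        s, v = s + dt * v, v + dt * f_acc(s, v, W1, W2)
        traj.append(s.copy())
    return np.stack(traj)

rng = np.random.default_rng(0)
d = 4                                    # latent dimensionality
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
traj = integrate(rng.normal(size=d), np.zeros(d), W1, W2)
print(traj.shape)                        # (101, 4) latent positions over time
```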

105 citations


Journal ArticleDOI
TL;DR: In this paper, the authors performed genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD-related phenotypes.
Abstract: Understanding the genetic architecture of the plasma lipidome could provide better insights into lipid metabolism and its link to cardiovascular diseases (CVDs). Here, we perform genome-wide association analyses of 141 lipid species (n = 2,181 individuals), followed by phenome-wide scans with 25 CVD-related phenotypes (n = 511,700 individuals). We identify 35 lipid-species-associated loci (P < 5 × 10^-8), 10 of which associate with CVD risk, including five new loci (COL5A1, GLTPD2, SPTLC3, MBOAT7 and GALNT16; false discovery rate < 0.05). We identify loci for lipid species that are shown to predict CVD, e.g., SPTLC3 for CER(d18:1/24:1). We show that lipoprotein lipase (LPL) may hydrolyze medium-length triacylglycerides (TAGs) more efficiently than others. Polyunsaturated lipids have the highest heritability and genetic correlations, suggesting considerable genetic regulation at the fatty-acid level. We find low genetic correlations between traditional lipids and lipid species. Our results show that lipidomic profiles capture information beyond traditional lipids and identify genetic variants modifying lipid levels and risk of CVD.

103 citations


Proceedings Article
01 Jan 2019
TL;DR: In this paper, a blind-spot network is used to train a denoising model on unorganized collections of corrupted images, without access to clean reference images or explicit pairs of corrupted images.
Abstract: We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.

84 citations


Journal ArticleDOI
TL;DR: Phylogeographical genomic analysis of Neisseria gonorrhoeae uncovers its recent emergence and current distribution into two distinct lineages that are differentially associated with antibiotic resistance and sexual networks.
Abstract: The sexually transmitted pathogen Neisseria gonorrhoeae is regarded as being on the way to becoming an untreatable superbug. Despite its clinical importance, little is known about its emergence and ...

Journal ArticleDOI
TL;DR: This work demonstrates how sensitive the geographic patterns of current PSs are to small biases even within relatively homogeneous populations and provides simple tools to identify such biases.
Abstract: Polygenic scores (PSs) are becoming a useful tool to identify individuals with high genetic risk for complex diseases, and several projects are currently testing their utility for translational applications. It is also tempting to use PSs to assess whether genetic variation can explain a part of the geographic distribution of a phenotype. However, it is not well known how the population genetic properties of the training and target samples affect the geographic distribution of PSs. Here, we evaluate geographic differences, and related biases, of PSs in Finland in a geographically well-defined sample of 2,376 individuals from the National FINRISK study. First, we detect geographic differences in PSs for coronary artery disease (CAD), rheumatoid arthritis, schizophrenia, waist-hip ratio (WHR), body-mass index (BMI), and height, but not for Crohn disease or ulcerative colitis. Second, we use height as a model trait to thoroughly assess the possible population genetic biases in PSs and apply similar approaches to the other phenotypes. Most importantly, we detect suspiciously large accumulations of geographic differences for CAD, WHR, BMI, and height, suggesting bias arising from the population's genetic structure rather than from a direct genotype-phenotype association. This work demonstrates how sensitive the geographic patterns of current PSs are to small biases even within relatively homogeneous populations and provides simple tools to identify such biases. A thorough understanding of the effects of population genetic structure on PSs is essential for translational applications of PSs.
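
For reference, a polygenic score is just a weighted sum of allele dosages, so the bias the abstract describes shows up as regional drift in the score's mean. A hedged toy sketch (synthetic data; real analyses use GWAS effect estimates and genotype files):

```python
import numpy as np

def polygenic_score(dosages, effect_sizes):
    """PS_i = sum_j beta_j * g_ij, the usual weighted allele-dosage sum."""
    return dosages @ effect_sizes

rng = np.random.default_rng(0)
n_individuals, n_variants = 1000, 500
dosages = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)
betas = rng.normal(scale=0.05, size=n_variants)   # GWAS effect estimates
ps = polygenic_score(dosages, betas)

# The paper's concern: if training-sample biases correlate with geography,
# mean PS can drift between regions without any true phenotype difference.
region = rng.integers(0, 2, size=n_individuals)   # toy two-region split
print(ps[region == 0].mean(), ps[region == 1].mean())
```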

Journal ArticleDOI
TL;DR: DECREASE, an efficient machine learning model that requires only a limited set of pairwise dose–response measurements for the accurate prediction of synergistic and antagonistic drug combinations, is implemented.
Abstract: High-throughput drug combination screening provides a systematic strategy to discover unexpected combinatorial synergies in pre-clinical cell models. However, phenotypic combinatorial screening with multi-dose matrix assays is experimentally expensive, especially when the aim is to identify selective combination synergies across a large panel of cell lines or patient samples. Here we implemented DECREASE, an efficient machine learning model that requires only a limited set of pairwise dose-response measurements for accurate prediction of drug combination synergy and antagonism. Using a compendium of 23,595 drug combination matrices tested in various cancer cell lines, and malaria and Ebola infection models, we demonstrate how cost-effective experimental designs with DECREASE capture almost the same degree of information for synergy and antagonism detection as the fully-measured dose-response matrices. Measuring only the diagonal of the matrix provides an accurate and practical option for combinatorial screening. The open-source web-implementation enables applications of DECREASE to both pre-clinical and translational studies.

Journal ArticleDOI
TL;DR: This paper proposes to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and defines a loss function that measures this uncertainty and proposes to select the next evaluation location to minimise the expected loss.
Abstract: Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but simulating from the model is possible. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables one to intelligently decide where to evaluate the model next but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to estimate this quantity accurately, and define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimise the expected loss. Experiments show that the proposed method often produces the most accurate approximations as compared to common BO strategies.
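
The key quantity is the uncertainty of the ABC posterior estimate under the surrogate. With a GP on the discrepancy Δ(θ) having mean μ(θ) and standard deviation σ(θ), the unnormalised ABC posterior is π(θ)·p(θ) with p = Φ((ε−μ)/σ), and its pointwise variance is π(θ)²·p·(1−p). The sketch below greedily simulates next where this pointwise uncertainty is largest, a simplification of the paper's expected-integrated-loss rule (toy surrogate functions and names are ours):

```python
import numpy as np
from scipy.stats import norm

def abc_posterior_stats(theta_grid, prior_pdf, gp_mean, gp_sd, eps):
    """Unnormalised ABC posterior pi(theta) * P(Delta < eps) under the GP,
    and the pointwise variance pi^2 * p * (1 - p) of that estimate."""
    p = norm.cdf((eps - gp_mean(theta_grid)) / gp_sd(theta_grid))
    dens = prior_pdf(theta_grid) * p
    var = prior_pdf(theta_grid) ** 2 * p * (1 - p)
    return dens, var

# Toy surrogate: discrepancy minimised near theta = 1, less explored far out.
gp_mean = lambda t: (t - 1.0) ** 2
gp_sd = lambda t: 0.2 + 0.3 * np.abs(t)    # more uncertainty away from data
prior_pdf = lambda t: norm.pdf(t, 0.0, 2.0)

grid = np.linspace(-4, 4, 401)
dens, var = abc_posterior_stats(grid, prior_pdf, gp_mean, gp_sd, eps=0.5)
theta_next = grid[np.argmax(var)]   # greedy stand-in for the expected-loss rule
print(f"next simulation at theta = {theta_next:.2f}")
```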

Journal ArticleDOI
TL;DR: Hypermethylated CpGs are revealed as a novel mechanism of action for DNMTi agents, and 638 hypermethylated molecular targets (CpGs) common to decitabine and azacytidine therapy are identified, suggesting that hypermethylation of CpGs should be considered when predicting DNMTi responses and side effects in cancer patients.
Abstract: DNA methyltransferase inhibitors (DNMTi) decitabine and azacytidine are approved therapies for myelodysplastic syndrome and acute myeloid leukemia, and their combinations with other anticancer agents are being tested as therapeutic options for multiple solid cancers such as colon, ovarian, and lung cancer. However, the current therapeutic challenges of DNMTis include development of resistance, severe side effects and no or partial treatment responses, as observed in more than half of the patients. Therefore, there is a critical need to better understand the mechanisms of action of these drugs. In order to discover molecular targets of DNMTi therapy, we identified 638 novel CpGs with an increased methylation in response to decitabine treatment in HCT116 cell lines and validated the findings in cell lines of multiple cancer types (e.g., bladder, ovarian, breast, and lymphoma), bone marrow mononuclear cells from primary leukemia patients, as well as peripheral blood mononuclear cells and ascites from platinum-resistant epithelial ovarian cancer patients. Azacytidine treatment also increased methylation of these CpGs in colon, ovarian, breast, and lymphoma cancer cell lines. Methylation at 166 identified CpGs strongly correlated (|r| ≥ 0.80) with corresponding gene expression in the HCT116 cell line. Differences in methylation at some of the identified CpGs and expression changes of the corresponding genes were observed in TCGA colon cancer tissue as compared to adjacent healthy tissue. Our analysis revealed that hypermethylated CpGs are involved in cancer cell proliferation and apoptosis by P53 and olfactory receptor pathways, hence influencing DNMTi responses. In conclusion, we showed hypermethylation of CpGs as a novel mechanism of action for DNMTi agents and identified 638 hypermethylated molecular targets (CpGs) common to decitabine and azacytidine therapy. These novel results suggest that hypermethylation of CpGs should be considered when predicting the DNMTi responses and side effects in cancer patients.

Journal ArticleDOI
TL;DR: The paper analyses the resource requirements of running DIDs on IoT devices, finds that even quite small devices can successfully deploy DIDs, and proposes that the most constrained devices could rely on a proxy approach.
Abstract: When IoT devices operate not only with the owner of the device but also with third parties, identifying the device using a permanent identifier, e.g., a hardware identifier, can present privacy problems due to the identifier facilitating tracking and correlation attacks. A changeable identifier can be used to reduce the risk on privacy. This paper looks at using decentralised identifiers (DIDs), an upcoming standard of self-sovereign identifiers with multiple competing implementations, with IoT devices. The paper analyses the resource requirements of running DIDs on the IoT devices and finds that even quite small devices can successfully deploy DIDs and proposes that the most constrained devices could rely on a proxy approach. Finally, the privacy benefits and limitations of using DIDs are analysed, with the conclusion that DIDs significantly improve the users’ privacy when utilised properly.

Journal ArticleDOI
01 Apr 2019
TL;DR: This work systematically review and analyse state-of-the-art protocols for the three phases of private decision tree evaluation protocols: feature selection, comparison, and path evaluation, and identifies novel combinations of these protocols that provide better tradeoffs than existing protocols.
Abstract: Decision trees and random forests are widely used classifiers in machine learning. Service providers often host classification models in a cloud service and provide an interface for clients to use the model remotely. While the model is sensitive information of the server, the input query and prediction results are sensitive information of the client. This motivates the need for private decision tree evaluation, where the service provider does not learn the client’s input and the client does not learn the model except for its size and the result. In this work, we identify the three phases of private decision tree evaluation protocols: feature selection, comparison, and path evaluation. We systematize constant-round protocols for each of these phases to identify the best available instantiations using the two main paradigms for secure computation: garbling techniques and homomorphic encryption. There is a natural tradeoff between runtime and communication considering these two paradigms: garbling techniques use fast symmetric-key operations but require a large amount of communication, while homomorphic encryption is computationally heavy but requires little communication. Our contributions are as follows: Firstly, we systematically review and analyse state-of-the-art protocols for the three phases of private decision tree evaluation. Our methodology allows us to identify novel combinations of these protocols that provide better tradeoffs than existing protocols. Thereafter, we empirically evaluate all combinations of these protocols by providing communication and runtime measures, and provide recommendations based on the identified concrete tradeoffs.
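
To make the three phases concrete, here is plaintext decision tree evaluation decomposed into exactly those steps; a private protocol must realise each step so the server never sees x and the client learns only the result (illustrative Python, structure and names are ours):

```python
from dataclasses import dataclass

@dataclass
class Node:
    feature: int = -1       # index chosen in the feature-selection phase
    threshold: float = 0.0  # compared against in the comparison phase
    left: "Node" = None
    right: "Node" = None
    label: int = -1         # leaf reached in the path-evaluation phase

def evaluate(node, x):
    while node.label == -1:
        selected = x[node.feature]                    # 1) feature selection
        go_right = selected >= node.threshold         # 2) comparison
        node = node.right if go_right else node.left  # 3) path evaluation
    return node.label

tree = Node(feature=0, threshold=0.5,
            left=Node(label=0),
            right=Node(feature=1, threshold=1.5,
                       left=Node(label=1), right=Node(label=2)))
print(evaluate(tree, [0.7, 2.0]))   # -> 2
```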

Proceedings ArticleDOI
01 Jan 2019
TL;DR: A conditional lower bound is proved stating that, for any constant ε > 0, an O(|E|^{1-ε} m)-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false.
Abstract: Exact string matching in labeled graphs is the problem of searching paths of a graph G=(V,E) such that the concatenation of their node labels is equal to the given pattern string P[1..m]. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks. We prove a conditional lower bound stating that, for any constant epsilon>0, an O(|E|^{1 - epsilon} m)-time, or an O(|E| m^{1 - epsilon})-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree two, i.e. to zig-zag matching in bidirectional strings, or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree three. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS'16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear-time solvable problems. An interesting corollary is that exact and approximate matching are equally hard (quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear-time vs quadratic-time solutions, respectively (approximate pattern matching having also a matching SETH lower bound (Backurs and Indyk, STOC'15)).
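
The upper bound that this lower bound matches is a simple dynamic program over pattern positions, shown below; the paper's result says that, under SETH, no algorithm can beat this O(|E|·m) behaviour even on binary alphabets and very sparse graphs (a sketch; variable names are ours):

```python
def graph_string_match(labels, edges, pattern):
    """O(|E| * m) dynamic program: match[v] is True after round i iff some
    path ending at node v spells pattern[:i+1]."""
    n, m = len(labels), len(pattern)
    preds = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].append(u)
    match = [labels[v] == pattern[0] for v in range(n)]
    for i in range(1, m):
        match = [labels[v] == pattern[i] and any(match[u] for u in preds[v])
                 for v in range(n)]
    return any(match)

labels = ["a", "b", "a", "b"]               # node labels
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # directed edges
print(graph_string_match(labels, edges, "abab"))   # True: path 0-1-2-3
```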

Journal ArticleDOI
TL;DR: Reduced representation bisulfite sequencing on red blood cell-derived DNA showed genome-wide temporal changes in more than 40,000 out of the 522,643 CpG sites examined, and sites that showed a temporal and treatment-specific response in DNA methylation are candidate sites of interest for future studies trying to understand the link between DNA methylation patterns and timing of reproduction.
Abstract: In seasonal environments, timing of reproduction is a trait with important fitness consequences, but we know little about the molecular mechanisms that underlie the variation in this trait. Recently ...

Posted ContentDOI
06 Feb 2019-bioRxiv
TL;DR: TCRGP, a novel Gaussian process method that predicts whether TCRs recognize certain epitopes, is developed; it outperforms other state-of-the-art methods in epitope-specificity predictions and is used to find HBV-epitope-specific T cells and their transcriptomic states in hepatocellular carcinoma patients.
Abstract: T cell receptors (TCRs) can recognize various pathogens and consequently start immune responses. TCRs can be sequenced from individuals and methods that can analyze the specificity of the TCRs can help us better understand the individual's immune status in different diseases. We have developed TCRGP, a novel Gaussian process (GP) method that can predict if TCRs recognize certain epitopes. This method can utilize different CDR sequences from both TCRα and TCRβ chains from single-cell data and learn which CDRs are important in recognizing the different epitopes. We have experimented with one previously presented and one new data set and show that TCRGP outperforms other state-of-the-art methods in predicting the epitope specificity of TCRs on both data sets. The software implementation and data sets are available at https://github.com/emmijokinen/TCRGP.

Posted Content
TL;DR: In this paper, a framework based on non-linear independent component analysis (ICA) is proposed to infer causal relationships between two or more passively observed variables in the presence of general non-linear dependencies, exploiting the non-stationarity of observations to recover the underlying sources or latent disturbances.
Abstract: We consider the problem of inferring causal relationships between two or more passively observed variables. While the problem of such causal discovery has been extensively studied especially in the bivariate setting, the majority of current methods assume a linear causal relationship, and the few methods which consider non-linear dependencies usually make the assumption of additive noise. Here, we propose a framework through which we can perform causal discovery in the presence of general non-linear relationships. The proposed method is based on recent progress in non-linear independent component analysis and exploits the non-stationarity of observations in order to recover the underlying sources or latent disturbances. We show rigorously that in the case of bivariate causal discovery, such non-linear ICA can be used to infer the causal direction via a series of independence tests. We further propose an alternative measure of causal direction based on asymptotic approximations to the likelihood ratio, as well as an extension to multivariate causal discovery. We demonstrate the capabilities of the proposed method via a series of simulation studies and conclude with an application to neuroimaging data.
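
A rough feel for the independence-test logic, in a simpler setting than the paper's: under an additive-noise model, the residual of regressing effect on cause should be nearly independent of the cause, while the reverse direction leaves dependence. This sketch uses an off-the-shelf regressor and mutual information rather than the paper's non-linear ICA with non-stationarity (our simplification):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_selection import mutual_info_regression

def residual_dependence(cause, effect, seed=0):
    """Regress effect on cause, then measure how dependent the residual
    still is on the cause (lower = more plausible causal direction)."""
    model = KNeighborsRegressor(n_neighbors=50).fit(cause.reshape(-1, 1), effect)
    resid = effect - model.predict(cause.reshape(-1, 1))
    return mutual_info_regression(cause.reshape(-1, 1), resid,
                                  random_state=seed)[0]

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.tanh(x) + 0.3 * rng.normal(size=2000)      # ground truth: x -> y

print("x->y residual dependence:", residual_dependence(x, y))
print("y->x residual dependence:", residual_dependence(y, x))
```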

Proceedings Article
01 Jan 2019
TL;DR: In this article, the authors use pointer authentication (PA) to build novel defenses against various classes of run-time attacks, including the first PA-based mechanism for data pointer integrity.
Abstract: Run-time attacks against programs written in memory-unsafe programming languages (e.g., C and C++) remain a prominent threat against computer systems. The prevalence of techniques like return-oriented programming (ROP) in attacking real-world systems has prompted major processor manufacturers to design hardware-based countermeasures against specific classes of run-time attacks. An example is the recently added support for pointer authentication (PA) in the ARMv8-A processor architecture, commonly used in devices like smartphones. PA is a low-cost technique to authenticate pointers so as to resist memory vulnerabilities. It has been shown to enable practical protection against memory vulnerabilities that corrupt return addresses or function pointers. However, so far, PA has received very little attention as a general-purpose protection mechanism to harden software against various classes of memory attacks. In this paper, we use PA to build novel defenses against various classes of run-time attacks, including the first PA-based mechanism for data pointer integrity. We present PARTS, an instrumentation framework that integrates our PA-based defenses into the LLVM compiler and the GNU/Linux operating system and show, via systematic evaluation, that PARTS provides better protection than current solutions at a reasonable performance overhead.
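
The PA primitive itself is easy to model: a pointer authentication code (PAC) is a short keyed MAC over the pointer and a context modifier, stashed in the unused upper bits and checked before use. A conceptual Python model (HMAC stands in for the hardware's cipher; key handling, bit layouts, and the compiler instrumentation that PARTS adds are all simplified away):

```python
import hmac, hashlib

KEY = b"per-boot pointer-authentication key"   # held by hardware in real PA

def pac(pointer: int, modifier: int) -> int:
    """Truncated keyed MAC over (pointer, modifier), standing in for the PAC
    that ARMv8.3 PA stores in a 64-bit pointer's unused upper bits."""
    msg = pointer.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:2],
                          "little")

def sign(pointer: int, modifier: int) -> int:
    return pointer | (pac(pointer, modifier) << 48)   # tag the upper bits

def authenticate(signed: int, modifier: int) -> int:
    pointer = signed & ((1 << 48) - 1)
    if (signed >> 48) != pac(pointer, modifier):
        raise ValueError("pointer authentication failed")  # traps in hardware
    return pointer

p = sign(0x7f00_dead_beef, modifier=42)
print(hex(authenticate(p, modifier=42)))      # valid pointer passes
try:
    authenticate(p ^ 0x1000, modifier=42)     # corrupted pointer is rejected
except ValueError as e:
    print(e)
```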

Proceedings Article
16 Apr 2019
TL;DR: This work proposes two variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance.
Abstract: Variable selection for Gaussian process models is often done using automatic relevance determination, which uses the inverse length-scale parameter of each input variable as a proxy for variable relevance. This implicitly determined relevance has several drawbacks that prevent the selection of optimal input variables in terms of predictive performance. To improve on this, we propose two novel variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance. Our empirical results on synthetic and real world data sets demonstrate improved variable selection compared to automatic relevance determination in terms of variability and predictive performance.
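
The ARD baseline the abstract criticises is one line once a GP is fitted: rank inputs by inverse length-scale. A sketch of that baseline with scikit-learn (the paper's point is that rankings derived from the model's predictions near the training data are more reliable than this proxy):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)  # dims 2, 3 irrelevant

# One length-scale per input dimension = automatic relevance determination.
kernel = RBF(length_scale=np.ones(4)) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

relevance = 1.0 / gp.kernel_.k1.length_scale      # inverse length-scales
print(np.argsort(relevance)[::-1])                # expect dims 0, 1 ranked first
```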

Journal ArticleDOI
TL;DR: Genomic sequence-based phylogenetic analyses demonstrate that Ko3 and Ko4 formed well-defined sequence clusters related to, but distinct from, Klebsiella michiganensis and K. huaxiensis, while biochemical and MALDI-ToF mass spectrometry analyses differentiate Ko3, Ko4, and Ko8 from the other K. oxytoca species.
Abstract: Klebsiella oxytoca causes opportunistic human infections and post-antibiotic haemorrhagic diarrhea. This Enterobacteriaceae species is genetically heterogeneous and is currently subdivided into seven phylogroups (Ko1 to Ko4 and Ko6 to Ko8). Here we investigated the taxonomic status of phylogroups Ko3 and Ko4. Genomic sequence-based phylogenetic analyses demonstrate that Ko3 and Ko4 formed well-defined sequence clusters related to, but distinct from, Klebsiella michiganensis (Ko1), K. oxytoca (Ko2), K. huaxiensis (Ko8), and K. grimontii (Ko6). The average nucleotide identity (ANI) of Ko3 and Ko4 were 90.7% with K. huaxiensis and 95.5% with K. grimontii, respectively. In addition, three strains of K. huaxiensis, a species so far described based on a single strain from a urinary tract infection patient in China, were isolated from cattle and human feces. Biochemical and MALDI-ToF mass spectrometry analysis allowed differentiating Ko3, Ko4, and Ko8 from the other K. oxytoca species. Based on these results, we propose the names Klebsiella spallanzanii for the Ko3 phylogroup, with SPARK_775_C1T (CIP 111695T and DSM 109531T) as type strain, and Klebsiella pasteurii for Ko4, with SPARK_836_C1T (CIP 111696T and DSM 109530T) as type strain. Strains of K. spallanzanii were isolated from human urine, cow feces, and farm surfaces, while strains of K. pasteurii were found in fecal carriage from humans, cows, and turtles.

Journal ArticleDOI
TL;DR: Application of the model-free SpydrPick method to large population genomic datasets of two major human pathogens revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
Abstract: Covariance-based discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population level covariation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which adjusts for the phylogenetic signal in the data without requiring an explicit phylogenetic tree. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Simulations demonstrate the usefulness of our method and give some insight to when this type of analysis is most likely to be successful. Application of the method to large population genomic datasets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
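
At its core, the model-free test is pairwise mutual information between alignment columns. A bare-bones sketch (SpydrPick additionally corrects for population structure by weighting sequences and keeps only outlier pairs; none of that is included here):

```python
import numpy as np
from collections import Counter

def column_mi(a, b):
    """Plug-in mutual information between two alignment columns (loci)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in pab.items():
        pxy = c / n
        mi += pxy * np.log(pxy / (pa[x] / n * pb[y] / n))
    return mi

rng = np.random.default_rng(0)
n = 500
locus1 = rng.choice(list("AC"), size=n)
locus2 = np.where(rng.random(n) < 0.9, locus1, "G")   # co-varies with locus1
locus3 = rng.choice(list("AG"), size=n)               # independent locus

print(column_mi(locus1, locus2), column_mi(locus1, locus3))
```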

Posted ContentDOI
21 Aug 2019-bioRxiv
TL;DR: In this paper, a Gaussian process method was proposed to predict if TCRs recognize certain epitopes, which can utilize CDR sequences from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes.
Abstract: T cell receptors (TCRs) can recognize various pathogens and consequently start immune responses. TCRs can be sequenced from individuals and methods analyzing the specificity of the TCRs can help us better understand individuals’ immune status in different diseases. We have developed TCRGP, a novel Gaussian process method to predict if TCRs recognize certain epitopes. This method can utilize CDR sequences from TCRα and TCRβ chains and learn which CDRs are important in recognizing different epitopes. We have experimented with epitope-specific data against 29 epitopes and performed a comprehensive evaluation with existing prediction methods. On this data, TCRGP outperforms other state-of-the-art methods in epitope-specificity predictions. We also propose a novel analysis approach for combined single-cell RNA and TCRαβ (scRNA+TCRαβ) sequencing data by quantifying epitope-specific TCRs with TCRGP in phenotypes identified from scRNA-seq data. With this approach, we find HBV-epitope specific T cells and their transcriptomic states in hepatocellular carcinoma patients.

Proceedings Article
16 Apr 2019
TL;DR: In this article, a novel deep learning paradigm of differential flows is proposed that learns stochastic differential equation transformations of inputs prior to a standard classification or regression function, where the key property of differential Gaussian processes is the warping of inputs through infinitely deep, but infinitesimal, differential fields.
Abstract: We propose a novel deep learning paradigm of differential flows that learn stochastic differential equation transformations of inputs prior to a standard classification or regression function. The key property of differential Gaussian processes is the warping of inputs through infinitely deep, but infinitesimal, differential fields, that generalise discrete layers into a dynamical system. We demonstrate excellent results as compared to deep Gaussian processes and Bayesian neural networks.

Journal ArticleDOI
TL;DR: In this article, the authors propose an infrastructure that allows CC researchers to build workflows that can be executed online and be easily reused by others through the workflow web address, leading to novel ways of software composition for computational purposes that were not expected in advance.
Abstract: Computational creativity (CC) is a multidisciplinary research field, studying how to engineer software that exhibits behavior that would reasonably be deemed creative. This paper shows how composition of software solutions in this field can effectively be supported through a CC infrastructure that supports user-friendly development of CC software components and workflows, their sharing, execution, and reuse. The infrastructure allows CC researchers to build workflows that can be executed online and be easily reused by others through the workflow web address. Moreover, it enables the building of procedures composed of software developed by different researchers from different laboratories, leading to novel ways of software composition for computational purposes that were not expected in advance. This capability is illustrated on a workflow that implements a Concept Generator prototype based on the Conceptual Blending framework. The prototype consists of a composition of modules made available as web services, and is explored and tested through experiments involving blending of texts from different domains, blending of images, and poetry generation.

Proceedings Article
24 May 2019
TL;DR: This paper presents gradKCCA, a large-scale sparse non-linear canonical correlation method that outperforms state-of-the-art CCA methods in terms of speed and robustness to noise both in simulated and real-world datasets.
Abstract: This paper presents gradKCCA, a large-scale sparse non-linear canonical correlation method. Like Kernel Canonical Correlation Analysis (KCCA), our method finds non-linear relations through kernel functions, but it does not rely on a kernel matrix, a known bottleneck for scaling up kernel methods. gradKCCA corresponds to solving KCCA with the additional constraint that the canonical projection directions in the kernel-induced feature space have preimages in the original data space. Firstly, this modification allows us to very efficiently maximize kernel canonical correlation through an alternating projected gradient algorithm working in the original data space. Secondly, we can control the sparsity of the projection directions by constraining the ℓ1 norm of the preimages of the projection directions, facilitating the interpretation of the discovered patterns, which is not available through KCCA. Our empirical experiments demonstrate that gradKCCA outperforms state-of-the-art CCA methods in terms of speed and robustness to noise both in simulated and real-world datasets.

Journal ArticleDOI
TL;DR: MetABF, a simple Bayesian framework for performing integrative meta‐analysis across multiple GWAS using summary statistics, is described, which can increase the power by 50% compared with standard frequentist tests when only a subset of studies have a true effect.
Abstract: Genome-wide association studies (GWAS) are a powerful tool for understanding the genetic basis of diseases and traits, but most studies have been conducted in isolation, with a focus on either a single or a set of closely related phenotypes. We describe MetABF, a simple Bayesian framework for performing integrative meta-analysis across multiple GWAS using summary statistics. The approach is applicable across a wide range of study designs and can increase the power by 50% compared with standard frequentist tests when only a subset of studies have a true effect. We demonstrate its utility in a meta-analysis of 20 diverse GWAS which were part of the Wellcome Trust Case Control Consortium 2. The novelty of the approach is its ability to explore, and assess the evidence for a range of possible true patterns of association across studies in a computationally efficient framework.
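
The building block is Wakefield's approximate Bayes factor per study; the meta-analysis then weighs evidence over the possible subsets of studies harbouring a true effect. A simplified sketch under independent effects and a uniform subset prior (our simplification of the framework; the paper also handles correlated and fixed effects):

```python
import numpy as np
from itertools import product

def wakefield_abf(beta, se, prior_sd=0.2):
    """Approximate Bayes factor for association in one study:
       beta_hat ~ N(0, V) under H0 and N(0, V + W) under H1."""
    V, W = se**2, prior_sd**2
    z2 = (beta / se) ** 2
    return np.sqrt(V / (V + W)) * np.exp(z2 * W / (2 * (V + W)))

def subset_meta_abf(betas, ses, prior_sd=0.2):
    """Average evidence over all non-null subsets of studies with a true
    effect (independent-effects simplification, uniform subset prior)."""
    bfs = np.array([wakefield_abf(b, s, prior_sd) for b, s in zip(betas, ses)])
    subsets = list(product([0, 1], repeat=len(bfs)))[1:]   # drop all-null
    return np.mean([np.prod(bfs[np.array(s, bool)]) for s in subsets])

betas = np.array([0.12, 0.10, 0.01])   # effect estimates from three GWAS
ses = np.array([0.03, 0.04, 0.03])     # their standard errors
print(subset_meta_abf(betas, ses))
```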