
Showing papers by "Helsinki Institute for Information Technology" published in 2016


Journal ArticleDOI
TL;DR: It is suggested that digital competence is a useful boundary concept, which can be used in various contexts and consists of technical competence, the ability to use digital technologies in a meaningful way for working, studying and in everyday life, and motivation to participate and commit in the digital culture.
Abstract: Digital competence is an evolving concept related to the development of digital technology and the political aims and expectations of citizenship in a knowledge society. It is regarded as a core competence in policy papers; in educational research it is not yet a standardized concept. We suggest that it is a useful boundary concept, which can be used in various contexts. For this study, we analysed 76 educational research articles in which digital competence, described by different terms, was investigated. As a result, we found that digital competence consists of a variety of skills and competences, and its scope is wide, as is its background: from media studies and computer science to library and literacy studies. In the article review, we found a total of 34 terms that had been used to describe the digital technology-related skills and competences; the most often used terms were digital literacy, new literacies, multiliteracy and media literacy, each with a somewhat different focus. We suggest that digital competence is defined as consisting of (1) technical competence, (2) the ability to use digital technologies in a meaningful way for working, studying and in everyday life, (3) the ability to evaluate digital technologies critically, and (4) motivation to participate and commit in the digital culture.

299 citations


Book ChapterDOI
01 Jan 2016
TL;DR: This chapter provides an application oriented view towards concept drift research, with a focus on supervised learning tasks, and constructs a reference framework for positioning application tasks within a spectrum of problems related to concept drift.
Abstract: In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Patterns and relations in such data often evolve over time, thus, models built for analyzing such data quickly become obsolete over time. In machine learning and data mining this phenomenon is referred to as concept drift. The objective is to deploy models that would diagnose themselves and adapt to changing data over time. This chapter provides an application oriented view towards concept drift research, with a focus on supervised learning tasks. First we overview and categorize application tasks for which the problem of concept drift is particularly relevant. Then we construct a reference framework for positioning application tasks within a spectrum of problems related to concept drift. Finally, we discuss some promising research directions from the application perspective, and present recommendations for application driven concept drift research and development.
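
As a minimal illustration of the kind of self-diagnosing deployment the chapter motivates (not the chapter's reference framework), the sketch below monitors a classifier's error rate in a sliding window and flags drift when recent errors clearly exceed the long-run rate; the window size and margin are arbitrary illustrative choices.

    from collections import deque

    class DriftMonitor:
        """Flag concept drift when the recent error rate exceeds the long-run rate."""

        def __init__(self, window=200, margin=0.10):
            self.window = deque(maxlen=window)   # most recent 0/1 prediction errors
            self.errors_total = 0
            self.seen_total = 0
            self.margin = margin                 # how much worse "recent" must be

        def update(self, y_true, y_pred):
            err = int(y_true != y_pred)
            self.window.append(err)
            self.errors_total += err
            self.seen_total += 1
            recent = sum(self.window) / len(self.window)
            overall = self.errors_total / self.seen_total
            return recent > overall + self.margin   # True -> model likely obsolete

    # Usage: call monitor.update(y, model.predict(x)) per incoming example and
    # retrain or adapt the model whenever it returns True.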

274 citations


Journal ArticleDOI
TL;DR: Approximate Bayesian computation refers to a family of algorithms for approximate inference that makes a minimal set of assumptions by only requiring that sampling from a model is possible.
Abstract: Bayesian inference plays an important role in phylogenetics, evolutionary biology, and in many other branches of science. It provides a principled framework for dealing with uncertainty and quantifying how it changes in the light of new evidence. For many complex models and inference problems, however, only approximate quantitative answers are obtainable. Approximate Bayesian computation (ABC) refers to a family of algorithms for approximate inference that makes a minimal set of assumptions by only requiring that sampling from a model is possible. We explain here the fundamentals of ABC, review the classical algorithms, and highlight recent developments. [ABC; approximate Bayesian computation; Bayesian inference; likelihood-free inference; phylogenetics; simulator-based models; stochastic simulation models; tree-based models.]
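
A minimal rejection-ABC sketch of the idea described above: the model is accessed only through simulation, and parameter values are kept when the simulated data lie close to the observations. The Gaussian toy model, summary statistic and tolerance are illustrative assumptions, not taken from the review.

    import numpy as np

    rng = np.random.default_rng(0)
    observed = rng.normal(loc=2.0, scale=1.0, size=100)        # stand-in for real data

    def simulate(theta, size=100):
        # The only model access ABC needs: draw a synthetic data set given theta.
        return rng.normal(loc=theta, scale=1.0, size=size)

    def discrepancy(simulated, observed):
        # Distance between summary statistics (here simply the means).
        return abs(simulated.mean() - observed.mean())

    def rejection_abc(n_proposals=100_000, tolerance=0.05):
        accepted = []
        for _ in range(n_proposals):
            theta = rng.uniform(-10, 10)                        # draw from the prior
            if discrepancy(simulate(theta), observed) < tolerance:
                accepted.append(theta)                          # keep parameters whose simulations match
        return np.array(accepted)

    posterior_sample = rejection_abc()
    print(f"ABC posterior mean {posterior_sample.mean():.2f} from {len(posterior_sample)} accepted draws")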

221 citations


Journal Article
TL;DR: This work presents MEKA: an open-source Java framework based on the well-known WEKA library, which provides interfaces to facilitate practical application, and a wealth of multi-label classifiers, evaluation metrics, and tools for multi-label experiments and development.
Abstract: Multi-label classification has rapidly attracted interest in the machine learning literature, and there are now a large number and considerable variety of methods for this type of learning. We present MEKA: an open-source Java framework based on the well-known WEKA library. MEKA provides interfaces to facilitate practical application, and a wealth of multi-label classifiers, evaluation metrics, and tools for multi-label experiments and development. It supports multi-label and multi-target data, including in incremental and semi-supervised contexts.

205 citations


Journal ArticleDOI
06 Jul 2016 - mBio
TL;DR: It is argued that this work provides a comprehensive road map illustrating the three vital components for future molecular epidemiological surveillance: (i) large-scale structured surveys, (ii) WGS, and (iii) community-oriented database infrastructure and analysis tools.
Abstract: The implementation of routine whole-genome sequencing (WGS) promises to transform our ability to monitor the emergence and spread of bacterial pathogens. Here we combined WGS data from 308 invasive Staphylococcus aureus isolates corresponding to a pan-European population snapshot, with epidemiological and resistance data. Geospatial visualization of the data is made possible by a generic software tool designed for public health purposes that is available at the project URL (http://www.microreact.org/project/EkUvg9uY?tt=rc). Our analysis demonstrates that high-risk clones can be identified on the basis of population level properties such as clonal relatedness, abundance, and spatial structuring and by inferring virulence and resistance properties on the basis of gene content. We also show that in silico predictions of antibiotic resistance profiles are at least as reliable as phenotypic testing. We argue that this work provides a comprehensive road map illustrating the three vital components for future molecular epidemiological surveillance: (i) large-scale structured surveys, (ii) WGS, and (iii) community-oriented database infrastructure and analysis tools. IMPORTANCE The spread of antibiotic-resistant bacteria is a public health emergency of global concern, threatening medical intervention at every level of health care delivery. Several recent studies have demonstrated the promise of routine whole-genome sequencing (WGS) of bacterial pathogens for epidemiological surveillance, outbreak detection, and infection control. However, as this technology becomes more widely adopted, the key challenges of generating representative national and international data sets and the development of bioinformatic tools to manage and interpret the data become increasingly pertinent. This study provides a road map for the integration of WGS data into routine pathogen surveillance. We emphasize the importance of large-scale routine surveys to provide the population context for more targeted or localized investigation and the development of open-access bioinformatic tools to provide the means to combine and compare independently generated data with publicly available data sets.

177 citations


Journal ArticleDOI
TL;DR: The proposed scheme provides important security attributes including prevention of various popular attacks, such as denial-of-service and eavesdropping attacks, and attains both computation efficiency and communication efficiency as compared with other schemes from the literature.
Abstract: The proliferation of wireless communications and information technologies has been altering human lifestyles and social interactions; the next frontier is the smart home environment. A smart home consists of low-capacity devices (e.g., sensors) and wireless networks working together as a system that needs an adequate level of security. This paper introduces a lightweight and secure session key establishment scheme for smart home environments. To establish trust within the network, every sensor and control unit uses a short authentication token and establishes a secure session key. The proposed scheme provides important security attributes including prevention of various popular attacks, such as denial-of-service and eavesdropping attacks. The preliminary evaluation and feasibility tests are demonstrated by a proof-of-concept implementation. In addition, the proposed scheme attains both computation efficiency and communication efficiency as compared with other schemes from the literature.
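
As a rough, illustrative sketch only (not the paper's scheme), the following shows the general shape of token-based session key establishment: both endpoints hold a short pre-shared authentication token and derive a fresh session key from the token together with nonces exchanged during the handshake.

    import hashlib
    import hmac
    import os

    def derive_session_key(token: bytes, nonce_device: bytes, nonce_hub: bytes) -> bytes:
        # Session key = HMAC(token, nonce_device || nonce_hub): a fresh key per session,
        # computable by any party holding the shared token and both nonces.
        return hmac.new(token, nonce_device + nonce_hub, hashlib.sha256).digest()

    token = b"123456"                                          # short token shared out of band (illustrative)
    nonce_device, nonce_hub = os.urandom(16), os.urandom(16)   # exchanged during the handshake

    key_at_device = derive_session_key(token, nonce_device, nonce_hub)
    key_at_hub = derive_session_key(token, nonce_device, nonce_hub)
    assert key_at_device == key_at_hub                         # both ends agree on the session key
    print(key_at_device.hex())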

154 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian optimization strategy is proposed that accelerates likelihood-free inference by probabilistically modelling the discrepancy between simulated and observed data, reducing the number of required simulations by several orders of magnitude.
Abstract: Our paper deals with inferring simulator-based statistical models given some observed data. A simulator-based model is a parametrized mechanism which specifies how data are generated. It is thus also referred to as generative model. We assume that only a finite number of parameters are of interest and allow the generative process to be very general; it may be a noisy nonlinear dynamical system with an unrestricted number of hidden variables. This weak assumption is useful for devising realistic models but it renders statistical inference very difficult. The main challenge is the intractability of the likelihood function. Several likelihood-free inference methods have been proposed which share the basic idea of identifying the parameters by finding values for which the discrepancy between simulated and observed data is small. A major obstacle to using these methods is their computational cost. The cost is largely due to the need to repeatedly simulate data sets and the lack of knowledge about how the parameters affect the discrepancy. We propose a strategy which combines probabilistic modeling of the discrepancy with optimization to facilitate likelihood-free inference. The strategy is implemented using Bayesian optimization and is shown to accelerate the inference through a reduction in the number of required simulations by several orders of magnitude.
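
A compact sketch of the core idea under simplifying assumptions: the discrepancy is modelled with a Gaussian process surrogate and the next simulation is placed where a lower-confidence-bound acquisition is smallest. The toy simulator, scikit-learn surrogate and all settings are illustrative and do not reproduce the paper's implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(1)
    observed = rng.normal(2.0, 1.0, size=200)                  # stand-in for real data

    def discrepancy(theta):
        simulated = rng.normal(theta, 1.0, size=200)           # one (costly) simulator run
        return abs(simulated.mean() - observed.mean())

    grid = np.linspace(-5.0, 5.0, 400).reshape(-1, 1)          # candidate parameter values
    thetas = [float(t) for t in rng.uniform(-5, 5, size=5)]    # a few initial simulations
    discrepancies = [discrepancy(t) for t in thetas]

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    for _ in range(30):                                        # only 30 further simulations
        gp.fit(np.array(thetas).reshape(-1, 1), discrepancies)
        mean, std = gp.predict(grid, return_std=True)
        theta_next = float(grid[np.argmin(mean - std), 0])     # lower-confidence-bound acquisition
        thetas.append(theta_next)
        discrepancies.append(discrepancy(theta_next))

    best = thetas[int(np.argmin(discrepancies))]
    print(f"parameter with smallest modelled discrepancy: {best:.2f}")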

150 citations


Journal ArticleDOI
TL;DR: This paper reports about the fifth edition of the ASP Competition by covering all aspects of the event, ranging from the new design of the competition to an in-depth analysis of the results, including additional analyses that were conceived for measuring the progress of the state of the art, as well as for studying aspects orthogonal to solving technology, such as the effects of modeling.

148 citations


Journal ArticleDOI
TL;DR: This article identified social norms that were formed around the prevailing sharing practices in the two sites and compared them in relation to the sharing mechanisms, and revealed that automated and manual sharing were sanctioned differently.
Abstract: “Profile work,” that is strategic self-presentation in social network sites, is configured by both the technical affordances and related social norms. In this article, we address technical and social psychological aspects that underlie acts of sharing by analyzing the social in relation to the technical. Our analysis is based on two complementary sets of qualitative data gleaned from in situ experiences of Finnish youth and young adults within the sharing mechanisms of Facebook and Last.fm. In our analysis, we identified social norms that were formed around the prevailing sharing practices in the two sites and compared them in relation to the sharing mechanisms. The analysis revealed that automated and manual sharing were sanctioned differently. We conclude that although the social norms that guide content sharing differed between the two contexts, there was an identical sociocultural goal in profile work: presentation of authenticity.

122 citations


Journal ArticleDOI
TL;DR: A systematic overview of state‐of‐the‐art techniques for visualizing different kinds of set relations is provided and these techniques are classified into six main categories according to the visual representations they use and the tasks they support.
Abstract: Sets comprise a generic data model that has been used in a variety of data analysis problems. Such problems involve analysing and visualizing set relations between multiple sets defined over the same collection of elements. However, visualizing sets is a non-trivial problem due to the large number of possible relations between them. We provide a systematic overview of state-of-the-art techniques for visualizing different kinds of set relations. We classify these techniques into six main categories according to the visual representations they use and the tasks they support. We compare the categories to provide guidance for choosing an appropriate technique for a given problem. Finally, we identify challenges in this area that need further research and propose possible directions to address these challenges. Further resources on set visualization are available at http://www.setviz.net.

115 citations


Journal ArticleDOI
TL;DR: MetaCCA as discussed by the authors is a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype, and employs a covariance shrinkage algorithm to achieve robustness.
Abstract: Motivation: A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analyzing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts, and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. Results: We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies. Availability and implementation: Code is available at https://github.com/aalto-ics-kepaco Contacts: anna.cichonska@helsinki.fi or matti.pirinen@helsinki.fi Supplementary information: Supplementary data are available at Bioinformatics online.
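
The following is not metaCCA itself, but a small illustration of its main building block: canonical correlations computed from covariance blocks, with a simple shrinkage step towards the identity for robustness; the data and shrinkage amount are synthetic assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, q = 500, 4, 3
    X = rng.normal(size=(n, p))                         # genotypes (synthetic)
    Y = 0.5 * X[:, :q] + rng.normal(size=(n, q))        # correlated phenotypes (synthetic)

    def shrink(C, gamma=0.1):
        # Shrink a covariance block towards the identity for numerical robustness.
        return (1 - gamma) * C + gamma * np.eye(C.shape[0])

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    C = np.cov(np.hstack([X, Y]), rowvar=False)         # joint covariance matrix
    Cxx, Cyy, Cxy = shrink(C[:p, :p]), shrink(C[p:, p:]), C[:p, p:]

    # Canonical correlations are the singular values of Cxx^(-1/2) Cxy Cyy^(-1/2).
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    print(np.round(np.linalg.svd(M, compute_uv=False), 3))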

Journal ArticleDOI
01 Nov 2016
TL;DR: The goal of this article is to investigate how to separate the 2 types of tasks in an IR system using easily measurable behaviors; the study shows that IR systems can distinguish the 2 search categories in the course of a search session.
Abstract: Exploratory search is an increasingly important activity yet challenging for users. Although there exists an ample amount of research into understanding exploration, most of the major information retrieval (IR) systems do not provide tailored and adaptive support for such tasks. One reason is the lack of empirical knowledge on how to distinguish exploratory and lookup search behaviors in IR systems. The goal of this article is to investigate how to separate the 2 types of tasks in an IR system using easily measurable behaviors. In this article, we first review characteristics of exploratory search behavior. We then report on a controlled study of 6 search tasks with 3 exploratory tasks (comparison, knowledge acquisition, planning) and 3 lookup tasks (fact-finding, navigational, question answering). The results are encouraging, showing that IR systems can distinguish the 2 search categories in the course of a search session. The most distinctive indicators that characterize exploratory search behaviors are query length, maximum scroll depth, and task completion time. However, 2 tasks are borderline and exhibit mixed characteristics. We assess the applicability of this finding by reporting on several classification experiments. Our results have valuable implications for designing tailored and adaptive IR systems.
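
A toy sketch of the classification experiments' setup: a logistic regression over the three indicators named above (query length, maximum scroll depth, task completion time). The session data below is synthetic; only the feature set follows the article.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 300
    # Synthetic sessions: exploratory sessions tend to have longer queries,
    # deeper scrolling and longer completion times than lookup sessions.
    lookup = np.column_stack([rng.normal(2.5, 0.8, n),      # query length (words)
                              rng.normal(2.0, 1.0, n),      # maximum scroll depth
                              rng.normal(60, 20, n)])       # task completion time (s)
    exploratory = np.column_stack([rng.normal(4.0, 1.0, n),
                                   rng.normal(5.0, 1.5, n),
                                   rng.normal(180, 60, n)])
    X = np.vstack([lookup, exploratory])
    y = np.array([0] * n + [1] * n)                         # 0 = lookup, 1 = exploratory

    clf = LogisticRegression(max_iter=1000)
    print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(2))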

Journal ArticleDOI
TL;DR: The proposed error correction method, LoRMA, is the most accurate one relying on long reads only for read sets with high coverage, and when the coverage of the read set is at least 75×, its throughput is at least 20% higher.
Abstract: Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. Results: We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. Availability and Implementation: LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/.

Journal ArticleDOI
TL;DR: This paper demonstrates empirically and theoretically with standard regression models that in order to make sure that decision models are non-discriminatory, for instance, with respect to race, the sensitive racial information needs to be used in the model building process.
Abstract: Increasing numbers of decisions about everyday life are made using algorithms. By algorithms we mean predictive models (decision rules) captured from historical data using data mining. Such models often decide prices we pay, select ads we see and news we read online, match job descriptions and candidate CVs, decide who gets a loan, who goes through an extra airport security check, or who gets released on parole. Yet growing evidence suggests that decision making by algorithms may discriminate against people, even if the computing process is fair and well-intentioned. This happens due to biased or non-representative learning data in combination with inadvertent modeling procedures. From the regulatory perspective there are two tendencies in relation to this issue: (1) to ensure that data-driven decision making is not discriminatory, and (2) to restrict overall collecting and storing of private data to a necessary minimum. This paper shows that from the computing perspective these two goals are contradictory. We demonstrate empirically and theoretically with standard regression models that in order to make sure that decision models are non-discriminatory, for instance, with respect to race, the sensitive racial information needs to be used in the model building process. Of course, after the model is ready, race should not be required as an input variable for decision making. From the regulatory perspective this has an important implication: collecting sensitive personal data is necessary in order to guarantee fairness of algorithms, and law making needs to find sensible ways to allow using such data in the modeling process.
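
A small synthetic demonstration of the proxy effect discussed above: even with the sensitive attribute excluded from the model, a correlated feature lets group-dependent predictions through, and the disparity can only be measured (or corrected) because the sensitive attribute is available at modelling time. Variable names and data are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)
    n = 5000
    race = rng.integers(0, 2, n)                          # sensitive attribute (0/1)
    neighborhood = race + rng.normal(0, 0.5, n)           # proxy correlated with race
    qualification = rng.normal(0, 1, n)                   # legitimate predictor
    outcome = qualification - 0.8 * race + rng.normal(0, 0.3, n)   # historically biased outcomes

    # "Race-blind" model: the sensitive attribute is excluded, but the proxy is not.
    X_blind = np.column_stack([qualification, neighborhood])
    blind_model = LinearRegression().fit(X_blind, outcome)
    predictions = blind_model.predict(X_blind)

    gap = predictions[race == 0].mean() - predictions[race == 1].mean()
    print(f"mean prediction gap between groups under the race-blind model: {gap:.2f}")
    # The gap is clearly non-zero: the historical bias leaks through the proxy,
    # and it can only be measured (or corrected) here because race was recorded.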

Journal ArticleDOI
TL;DR: It is demonstrated that pathway-response associations can be learned by the proposed model for the well-known EGFR and MEK inhibitors, opening up the opportunity for elucidating drug action mechanisms.
Abstract: Motivation A key goal of computational personalized medicine is to systematically utilize genomic and other molecular features of samples to predict drug responses for a previously unseen sample. Such predictions are valuable for developing hypotheses for selecting therapies tailored for individual patients. This is especially valuable in oncology, where molecular and genetic heterogeneity of the cells has a major impact on the response. However, the prediction task is extremely challenging, raising the need for methods that can effectively model and predict drug responses. Results In this study, we propose a novel formulation of multi-task matrix factorization that allows selective data integration for predicting drug responses. To solve the modeling task, we extend the state-of-the-art kernelized Bayesian matrix factorization (KBMF) method with component-wise multiple kernel learning. In addition, our approach exploits the known pathway information in a novel and biologically meaningful fashion to learn the drug response associations. Our method quantitatively outperforms the state of the art on predicting drug responses in two publicly available cancer datasets as well as on a synthetic dataset. In addition, we validated our model predictions with lab experiments using an in-house cancer cell line panel. We finally show the practical applicability of the proposed method by utilizing prior knowledge to infer pathway-drug response associations, opening up the opportunity for elucidating drug action mechanisms. We demonstrate that pathway-response associations can be learned by the proposed model for the well-known EGFR and MEK inhibitors. Availability and implementation The source code implementing the method is available at http://research.cs.aalto.fi/pml/software/cwkbmf/ Contacts muhammad.ammad-ud-din@aalto.fi or samuel.kaski@aalto.fi Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
Solveig K. Sieberts1, Zhu Fan2, Javier Garcia-Garcia3, Eli A. Stahl4, Abhishek Pratap1, Gaurav Pandey4, Dimitrios A. Pappas, Daniel Aguilar3, Bernat Anton3, Jaume Bonet3, Ridvan Eksi2, Oriol Fornes3, Emre Guney5, Hong-Dong Li2, Manuel Alejandro Marín3, Bharat Panwar2, Joan Planas-Iglesias3, Daniel Poglayen3, Jing Cui6, André O. Falcão7, Christine Suver1, Bruce Hoff1, Venkatachalapathy S. K. Balagurusamy8, Donna N. Dillenberger8, Elias Chaibub Neto1, Thea Norman1, Tero Aittokallio8, Muhammad Ammad-ud-din9, Muhammad Ammad-ud-din10, Chloé-Agathe Azencott11, Victor Bellon11, Valentina Boeva11, Kerstin Bunte9, Kerstin Bunte10, Himanshu Chheda12, Lu Cheng12, Lu Cheng10, Lu Cheng9, Jukka Corander12, Jukka Corander9, Michel Dumontier13, Anna Goldenberg14, Peddinti Gopalacharyulu12, Mohsen Hajiloo14, Daniel Hidru14, Alok Jaiswal12, Samuel Kaski12, Samuel Kaski10, Samuel Kaski9, Beyrem Khalfaoui14, Suleiman A. Khan10, Suleiman A. Khan9, Suleiman A. Khan12, Eric R. Kramer15, Pekka Marttinen10, Pekka Marttinen9, Aziz M. Mezlini14, Bhuvan Molparia15, Matti Pirinen12, Janna Saarela12, Matthias Samwald16, Véronique Stoven11, Hao Tang17, Jing Tang12, Ali Torkamani15, Jean Phillipe Vert11, Bo Wang13, Tao Wang17, Krister Wennerberg12, Nathan E. Wineinger15, Guanghua Xiao17, Yang Xie17, Rae S. M. Yeung14, Xiaowei Zhan17, Cheng Zhao14, Jeff Greenberg18, Joel M. Kremer19, Kaleb Michaud, Anne Barton, Marieke J H Coenen20, Xavier Mariette11, Corinne Miceli11, Nancy A. Shadick6, Michael E. Weinblatt6, Niek de Vries21, Paul P. Tak22, Danielle M. Gerlag22, Tom W J Huizinga23, Fina A S Kurreeman23, Cornelia F Allaart23, S. Louis Bridges24, Lindsey A. Criswell25, Larry W. Moreland26, Lars Klareskog27, Saedis Saevarsdottir27, Leonid Padyukov27, Peter K. Gregersen28, Stephen H. Friend1, Robert M. Plenge29, Gustavo Stolovitzky7, Baldo Oliva3, Yuanfang Guan2, Lara M. Mangravite1 
TL;DR: Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data.
Abstract: Rheumatoid arthritis (RA) affects millions world-wide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in ∼one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment efficacy in RA patients was performed in the context of a DREAM Challenge (http://www.synapse.org/RA_Challenge). An open challenge framework enabled the comparative evaluation of predictions developed by 73 research groups using the most comprehensive available data and covering a wide range of state-of-the-art modelling methodologies. Despite a significant genetic heritability estimate of treatment non-response trait (h² = 0.18, P value = 0.02), no significant genetic contribution to prediction accuracy is observed. Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data.

Journal ArticleDOI
TL;DR: It is suggested that the regionally arid Turkana Basin may have acted as a ‘species factory’ between 4 and 2 Ma, generating ecological adaptations in advance of the global trend, and temporally and spatially resolved estimates of temperature and precipitation are provided.
Abstract: Although ecometric methods have been used to analyse fossil mammal faunas and environments of Eurasia and North America, such methods have not yet been applied to the rich fossil mammal record of e...

Journal ArticleDOI
TL;DR: This work proposes to address the metabolite identification problem using a structured output prediction approach that is not limited to a vector output space and can handle structured output spaces such as the molecule space, achieving state-of-the-art accuracy in metabolite identification.
Abstract: Motivation: An important problem in metabolomics is to identify metabolites using tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output space and can handle structured output space such as the molecule space. Results: We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. This method approximates the spectra-molecule mapping in two phases. The first phase corresponds to a regression problem from the input space to the feature space associated to the output kernel. The second phase is a preimage problem, consisting in mapping back the predicted output feature vectors to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods. Availability and implementation: Contact: celine.brouard@aalto.fi Supplementary information: Supplementary data are available at Bioinformatics online.

Posted Content
TL;DR: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter, and the previous default choices are dubious.
Abstract: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter. We argue that the previous default choices are dubious due to their tendency to favor solutions with more unshrunk coefficients than we typically expect a priori. This can lead to bad results if this parameter is not strongly identified by data. We derive the relationship between the global parameter and the effective number of nonzeros in the coefficient vector, and show an easy and intuitive way of setting up the prior for the global parameter based on our prior beliefs about the number of nonzero coefficients in the model. The results on real world data show that one can benefit greatly -- in terms of improved parameter estimates, prediction accuracy, and reduced computation time -- from transforming even a crude guess for the number of nonzero coefficients into the prior for the global parameter using our framework.
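
As a sketch of the translation the abstract describes, the helper below maps a prior guess for the number of nonzero coefficients to a scale for the global shrinkage parameter. The specific formula used here, tau0 = p0 / (D - p0) * sigma / sqrt(n), is quoted from memory of the published work and should be verified against the paper before use.

    import math

    def horseshoe_global_scale(p0, D, sigma, n):
        """Suggested prior scale for the global shrinkage parameter tau.

        p0    : prior guess for the number of nonzero coefficients
        D     : total number of coefficients
        sigma : (approximate) noise standard deviation
        n     : number of observations
        NOTE: formula reproduced from memory; check against the paper.
        """
        return p0 / (D - p0) * sigma / math.sqrt(n)

    # Example: expecting roughly 10 relevant predictors out of 1000,
    # with unit noise and 200 observations.
    print(horseshoe_global_scale(p0=10, D=1000, sigma=1.0, n=200))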

Journal ArticleDOI
TL;DR: This paper revisits archetypal analysis from the basic principles and proposes a probabilistic framework that accommodates other observation types such as integers, binary values, and probability vectors, corroborating the proposed methodology with convincing real-world applications.
Abstract: Archetypal analysis represents a set of observations as convex combinations of pure patterns, or archetypes. The original geometric formulation of finding archetypes by approximating the convex hull of the observations assumes them to be real-valued. This, unfortunately, is not compatible with many practical situations. In this paper we revisit archetypal analysis from the basic principles, and propose a probabilistic framework that accommodates other observation types such as integers, binary, and probability vectors. We corroborate the proposed methodology with convincing real-world applications on finding archetypal soccer players based on performance data, archetypal winter tourists based on binary survey data, archetypal disaster-affected countries based on disaster count data, and document archetypes based on term-frequency data. We also present an appropriate visualization tool to summarize archetypal analysis solutions better.

Journal ArticleDOI
TL;DR: This paper reformulates the problem definition in a way that makes it possible to obtain an algorithm with a constant-factor approximation guarantee, and presents a new approach that improves over the existing techniques, both in theory and practice.
Abstract: Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs not as dense as would be possible by allowing a limited amount of overlap. Furthermore, overlaps are desirable as in most real-world graphs there are vertices that belong to more than one community, and thus, to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and practice. First, we reformulate the problem definition in a way that we are able to obtain an algorithm with a constant-factor approximation guarantee. Our approach relies on using techniques for solving the max-sum diversification problem, which, however, we need to extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency.
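
To make the density objective concrete, here is the classic greedy peeling heuristic for a single densest subgraph (density = edges/vertices), a well-known 2-approximation; it is background material, not the top-k overlapping algorithm proposed in the paper.

    def greedy_densest_subgraph(edges):
        # Repeatedly peel the minimum-degree vertex and remember the densest
        # intermediate subgraph, where density = |edges| / |vertices|.
        nodes = {u for edge in edges for u in edge}
        adj = {u: set() for u in nodes}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)

        current, m = set(nodes), len(edges)
        best, best_density = set(current), m / len(current)
        while len(current) > 1:
            u = min(current, key=lambda x: len(adj[x]))    # peel the minimum-degree vertex
            m -= len(adj[u])
            for v in adj[u]:
                adj[v].discard(u)
            current.remove(u)
            del adj[u]
            density = m / len(current)
            if density > best_density:
                best, best_density = set(current), density
        return best, best_density

    # A 4-clique {1,2,3,4} with a pendant path 4-5-6: the clique is the densest part.
    edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
    print(greedy_densest_subgraph(edges))                  # ({1, 2, 3, 4}, 1.5)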

Journal ArticleDOI
TL;DR: A new algebraic sieving technique to detect constrained multilinear monomials in multivariate polynomial generating functions given by an evaluation oracle is introduced and used to show an O*(2^k)-time polynomial-space algorithm for the k-sized Graph Motif problem.
Abstract: We introduce a new algebraic sieving technique to detect constrained multilinear monomials in multivariate polynomial generating functions given by an evaluation oracle. The polynomials are assumed to have coefficients from a field of characteristic two. As applications of the technique, we show an O*(2^k)-time polynomial-space algorithm for the k-sized Graph Motif problem. We also introduce a new optimization variant of the problem, called Closest Graph Motif, and solve it within the same time bound. The Closest Graph Motif problem encompasses several previously studied optimization variants, like Maximum Graph Motif, Min-Substitute Graph Motif, and Min-Add Graph Motif. Finally, we provide a piece of evidence that our result might be essentially tight: the existence of an O*((2-ε)^k)-time algorithm for the Graph Motif problem implies an O((2-ε')^n)-time algorithm for Set Cover.

Journal ArticleDOI
TL;DR: An overview of the answer set programming paradigm is given, its strengths are explained, and its main features are illustrated in terms of examples and an application problem.
Abstract: In this article, we give an overview of the answer set programming paradigm, explain its strengths, and illustrate its main features in terms of examples and an application problem.

Journal ArticleDOI
12 Feb 2016
TL;DR: This work provides a nearly complete computational complexity map of fixed-argument extension enforcement under various major AF semantics, with results ranging from polynomial-time algorithms to completeness for the second level of the polynomial hierarchy.
Abstract: Understanding the dynamics of argumentation frameworks (AFs) is important in the study of argumentation in AI. In this work, we focus on the so-called extension enforcement problem in abstract argumentation. We provide a nearly complete computational complexity map of fixed-argument extension enforcement under various major AF semantics, with results ranging from polynomial-time algorithms to completeness for the second-level of the polynomial hierarchy. Complementing the complexity results, we propose algorithms for NP-hard extension enforcement based on constrained optimization. Going beyond NP, we propose novel counterexample-guided abstraction refinement procedures for the second-level complete problems and present empirical results on a prototype system constituting the first approach to extension enforcement in its generality.

Proceedings Article
02 May 2016
TL;DR: A gradient-based inference method is proposed to learn the unknown function and the non-stationary model parameters of a fully non-stationary Gaussian process regression model, in which all three key parameters (i.e., noise variance, signal variance and lengthscale) can be simultaneously input-dependent, without requiring any model approximations.
Abstract: We present a novel approach for fully non-stationary Gaussian process regression (GPR), where all three key parameters -- noise variance, signal variance and lengthscale -- can be simultaneously input-dependent. We develop gradient-based inference methods to learn the unknown function and the non-stationary model parameters, without requiring any model approximations. We propose to infer full parameter posterior with Hamiltonian Monte Carlo (HMC), which conveniently extends the analytical gradient-based GPR learning by guiding the sampling with model gradients. We also learn the MAP solution from the posterior by gradient ascent. In experiments on several synthetic datasets and in modelling of temporal gene expression, the nonstationary GPR is shown to be necessary for modeling realistic input-dependent dynamics, while it performs comparably to conventional stationary or previous non-stationary GPR models otherwise.

Journal ArticleDOI
TL;DR: It is demonstrated that variability in clinical manifestations of disease is detectable in bacterial sputa signatures, and that the changing M.tb mRNA profiles 0–2 weeks into chemotherapy predict the efficacy of treatment 6 weeks later, which advocates assaying dynamic bacterial phenotypes through drug therapy as biomarkers for treatment success.
Abstract: New treatment options are needed to maintain and improve therapy for tuberculosis, which caused the death of 1.5 million people in 2013 despite potential for an 86 % treatment success rate. A greater understanding of Mycobacterium tuberculosis (M.tb) bacilli that persist through drug therapy will aid drug development programs. Predictive biomarkers for treatment efficacy are also a research priority. Genome-wide transcriptional profiling was used to map the mRNA signatures of M.tb from the sputa of 15 patients before and 3, 7 and 14 days after the start of standard regimen drug treatment. The mRNA profiles of bacilli through the first 2 weeks of therapy reflected drug activity at 3 days with transcriptional signatures at days 7 and 14 consistent with reduced M.tb metabolic activity similar to the profile of pre-chemotherapy bacilli. These results suggest that a pre-existing drug-tolerant M.tb population dominates sputum before and after early drug treatment, and that the mRNA signature at day 3 marks the killing of a drug-sensitive sub-population of bacilli. Modelling patient indices of disease severity with bacterial gene expression patterns demonstrated that both microbiological and clinical parameters were reflected in the divergent M.tb responses and provided evidence that factors such as bacterial load and disease pathology influence the host-pathogen interplay and the phenotypic state of bacilli. Transcriptional signatures were also defined that predicted measures of early treatment success (rate of decline in bacterial load over 3 days, TB test positivity at 2 months, and bacterial load at 2 months). This study defines the transcriptional signature of M.tb bacilli that have been expectorated in sputum after two weeks of drug therapy, characterizing the phenotypic state of bacilli that persist through treatment. We demonstrate that variability in clinical manifestations of disease are detectable in bacterial sputa signatures, and that the changing M.tb mRNA profiles 0–2 weeks into chemotherapy predict the efficacy of treatment 6 weeks later. These observations advocate assaying dynamic bacterial phenotypes through drug therapy as biomarkers for treatment success.

Posted Content
TL;DR: In this paper, it was shown that Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.
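
The decoding side of both results rests on computing the edit distance exactly when it is small. The sketch below is the textbook banded dynamic program for that primitive, returning the distance if it is at most K and "error" otherwise; it is not the sketching scheme of the paper.

    def edit_distance_at_most_k(x: str, y: str, K: int):
        # Banded dynamic program: only cells with |i - j| <= K can hold values <= K,
        # so each row costs O(K) and the whole computation O(|x| * K).
        if abs(len(x) - len(y)) > K:
            return "error"
        INF = K + 1
        prev = [j if j <= K else INF for j in range(len(y) + 1)]
        for i in range(1, len(x) + 1):
            lo, hi = max(0, i - K), min(len(y), i + K)
            curr = [INF] * (len(y) + 1)
            if lo == 0:
                curr[0] = i
            for j in range(max(1, lo), hi + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete x[i-1]
                              curr[j - 1] + 1,     # insert y[j-1]
                              prev[j - 1] + cost)  # substitute or match
            prev = curr
        return prev[len(y)] if prev[len(y)] <= K else "error"

    print(edit_distance_at_most_k("kitten", "sitting", K=3))   # -> 3
    print(edit_distance_at_most_k("kitten", "sitting", K=2))   # -> error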

Journal ArticleDOI
TL;DR: A Bayesian approach for joint biclustering of multiple data sources is presented, extending a recent method, Group Factor Analysis, to have a biclustering interpretation with additional sparsity assumptions; the resulting method enables data-driven detection of linear structure present in parts of the data sources.
Abstract: Motivation: Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple data sources, extending a recent method Group Factor Analysis to have a biclustering interpretation with additional sparsity assumptions. The resulting method enables data-driven detection of linear structure present in parts of the data sources. Results: Our simulation studies show that the proposed method reliably infers biclusters from heterogeneous data sources. We tested the method on data from the NCI-DREAM drug sensitivity prediction challenge, resulting in an excellent prediction accuracy. Moreover, the predictions are based on several biclusters which provide insight into the data sources, in this case on gene expression, DNA methylation, protein abundance, exome sequence, functional connectivity fingerprints and drug sensitivity. Availability and Implementation: http://research.cs.aalto.fi/pml/software/GFAsparse/ Contacts: kerstin.bunte@googlemail.com or samuel.kaski@aalto.fi

Proceedings ArticleDOI
07 Mar 2016
TL;DR: This work describes (1) a classifier that recognizes task type (lookup vs. exploratory) as a user is searching and (2) a reinforcement learning based search engine that accordingly adapts the balance of exploration/exploitation in ranking the documents.
Abstract: We present a novel adaptation technique for search engines to better support information-seeking activities that include both lookup and exploratory tasks. Building on previous findings, we describe (1) a classifier that recognizes task type (lookup vs. exploratory) as a user is searching and (2) a reinforcement learning based search engine that accordingly adapts the balance of exploration/exploitation in ranking the documents. This allows supporting both task types surreptitiously without changing the familiar list-based interface. Search results include more diverse results when users are exploring and more precise results for lookup tasks. Users found more useful results in exploratory tasks when compared to a baseline system, which is specifically tuned for lookup tasks.
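
As a toy sketch only: one way to adapt the exploration/exploitation balance by task type is to sample the ranking with a higher temperature over relevance scores for exploratory sessions and near-deterministically for lookup sessions. The paper's system uses reinforcement learning, which this simple softmax re-ranking does not reproduce; all names and values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)

    def rank(relevance_scores, exploratory, t_lookup=0.05, t_explore=1.0):
        # Softmax sampling over relevance: a high temperature yields diverse,
        # exploratory rankings; a low temperature yields precise, lookup-style rankings.
        scores = np.asarray(relevance_scores, dtype=float)
        t = t_explore if exploratory else t_lookup
        z = scores / t
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return rng.choice(len(scores), size=len(scores), replace=False, p=probs)

    scores = [3.0, 2.5, 2.0, 1.0, 0.5]                  # higher = more relevant
    print(rank(scores, exploratory=False))              # nearly always [0 1 2 3 4]
    print(rank(scores, exploratory=True))               # relevance-biased but diversified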

Proceedings ArticleDOI
19 Dec 2016
TL;DR: This work shows that in the document exchange problem, Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds x ∈ {0, 1}^n and Bob holds y ∈ {0, 1}^n, Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise. Both the encoding and decoding can be done in time Õ(n + poly(K)). This result significantly improves on the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold x and y respectively, they can compute sketches of x and y of sizes poly(K log n) bits (the encoding), and send to the referee, who can then compute the edit distance between x and y together with all the edit operations if the edit distance is no more than K, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using poly(K log n) bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using poly(K log n) bits of space.