
Showing papers by "Helsinki Institute for Information Technology" published in 2016


Journal ArticleDOI
TL;DR: It is suggested that digital competence is a useful boundary concept, which can be used in various contexts and consists of technical competence, the ability to use digital technologies in a meaningful way for working, studying and in everyday life, and motivation to participate and commit in the digital culture.
Abstract: Digital competence is an evolving concept related to the development of digital technology and the political aims and expectations of citizenship in a knowledge society. It is regarded as a core competence in policy papers; in educational research it is not yet a standardized concept. We suggest that it is a useful boundary concept, which can be used in various contexts. For this study, we analysed 76 educational research articles in which digital competence, described by different terms, was investigated. As a result, we found that digital competence consists of a variety of skills and competences, and its scope is wide, as is its background: from media studies and computer science to library and literacy studies. In the article review, we found a total of 34 terms that had been used to describe the digital technology-related skills and competences; the most often used terms were digital literacy, new literacies, multiliteracy and media literacy, each with a somewhat different focus. We suggest that digital competence is defined as consisting of (1) technical competence, (2) the ability to use digital technologies in a meaningful way for working, studying and in everyday life, (3) the ability to evaluate digital technologies critically, and (4) motivation to participate and commit in the digital culture.

299 citations


Book ChapterDOI
01 Jan 2016
TL;DR: This chapter provides an application oriented view towards concept drift research, with a focus on supervised learning tasks, and constructs a reference framework for positioning application tasks within a spectrum of problems related to concept drift.
Abstract: In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Patterns and relations in such data often evolve over time, thus, models built for analyzing such data quickly become obsolete over time. In machine learning and data mining this phenomenon is referred to as concept drift. The objective is to deploy models that would diagnose themselves and adapt to changing data over time. This chapter provides an application oriented view towards concept drift research, with a focus on supervised learning tasks. First we overview and categorize application tasks for which the problem of concept drift is particularly relevant. Then we construct a reference framework for positioning application tasks within a spectrum of problems related to concept drift. Finally, we discuss some promising research directions from the application perspective, and present recommendations for application driven concept drift research and development.
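
As a minimal illustration of the kind of self-diagnosing deployment the chapter motivates (not the chapter's reference framework), the sketch below monitors a classifier's error rate in a sliding window and flags drift when recent errors clearly exceed the long-run rate; the window size and margin are arbitrary illustrative choices.

    from collections import deque

    class DriftMonitor:
        """Flag concept drift when the recent error rate exceeds the long-run rate."""

        def __init__(self, window=200, margin=0.10):
            self.window = deque(maxlen=window)   # most recent 0/1 prediction errors
            self.errors_total = 0
            self.seen_total = 0
            self.margin = margin                 # how much worse "recent" must be

        def update(self, y_true, y_pred):
            err = int(y_true != y_pred)
            self.window.append(err)
            self.errors_total += err
            self.seen_total += 1
            recent = sum(self.window) / len(self.window)
            overall = self.errors_total / self.seen_total
            return recent > overall + self.margin   # True -> model likely obsolete

    # Usage: call monitor.update(y, model.predict(x)) per incoming example and
    # retrain or adapt the model whenever it returns True.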

274 citations


Journal ArticleDOI
TL;DR: Approximate Bayesian computation refers to a family of algorithms for approximate inference that makes a minimal set of assumptions by only requiring that sampling from a model is possible.
Abstract: Bayesian inference plays an important role in phylogenetics, evolutionary biology, and in many other branches of science. It provides a principled framework for dealing with uncertainty and quantifying how it changes in the light of new evidence. For many complex models and inference problems, however, only approximate quantitative answers are obtainable. Approximate Bayesian computation (ABC) refers to a family of algorithms for approximate inference that makes a minimal set of assumptions by only requiring that sampling from a model is possible. We explain here the fundamentals of ABC, review the classical algorithms, and highlight recent developments. [ABC; approximate Bayesian computation; Bayesian inference; likelihood-free inference; phylogenetics; simulator-based models; stochastic simulation models; tree-based models.]
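
A minimal rejection-ABC sketch of the idea described above: the model is accessed only through simulation, and parameter values are kept when the simulated data lie close to the observations. The Gaussian toy model, summary statistic and tolerance are illustrative assumptions, not taken from the review.

    import numpy as np

    rng = np.random.default_rng(0)
    observed = rng.normal(loc=2.0, scale=1.0, size=100)        # stand-in for real data

    def simulate(theta, size=100):
        # The only model access ABC needs: draw a synthetic data set given theta.
        return rng.normal(loc=theta, scale=1.0, size=size)

    def discrepancy(simulated, observed):
        # Distance between summary statistics (here simply the means).
        return abs(simulated.mean() - observed.mean())

    def rejection_abc(n_proposals=100_000, tolerance=0.05):
        accepted = []
        for _ in range(n_proposals):
            theta = rng.uniform(-10, 10)                        # draw from the prior
            if discrepancy(simulate(theta), observed) < tolerance:
                accepted.append(theta)                          # keep parameters whose simulations match
        return np.array(accepted)

    posterior_sample = rejection_abc()
    print(f"ABC posterior mean {posterior_sample.mean():.2f} from {len(posterior_sample)} accepted draws")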

221 citations


Journal Article
TL;DR: This work presents MEKA: an open-source Java framework based on the well-known WEKA library, which provides interfaces to facilitate practical application, and a wealth of multi-label classifiers, evaluation metrics, and tools for multi-label experiments and development.
Abstract: Multi-label classification has rapidly attracted interest in the machine learning literature, and there are now a large number and considerable variety of methods for this type of learning. We present MEKA: an open-source Java framework based on the well-known WEKA library. MEKA provides interfaces to facilitate practical application, and a wealth of multi-label classifiers, evaluation metrics, and tools for multi-label experiments and development. It supports multi-label and multi-target data, including in incremental and semi-supervised contexts.

205 citations


Journal ArticleDOI
06 Jul 2016 - mBio
TL;DR: It is argued that this work provides a comprehensive road map illustrating the three vital components for future molecular epidemiological surveillance: (i) large-scale structured surveys, (ii) WGS, and (iii) community-oriented database infrastructure and analysis tools.
Abstract: The implementation of routine whole-genome sequencing (WGS) promises to transform our ability to monitor the emergence and spread of bacterial pathogens. Here we combined WGS data from 308 invasive Staphylococcus aureus isolates corresponding to a pan-European population snapshot, with epidemiological and resistance data. Geospatial visualization of the data is made possible by a generic software tool designed for public health purposes that is available at the project URL (http://www.microreact.org/project/EkUvg9uY?tt=rc). Our analysis demonstrates that high-risk clones can be identified on the basis of population level properties such as clonal relatedness, abundance, and spatial structuring and by inferring virulence and resistance properties on the basis of gene content. We also show that in silico predictions of antibiotic resistance profiles are at least as reliable as phenotypic testing. We argue that this work provides a comprehensive road map illustrating the three vital components for future molecular epidemiological surveillance: (i) large-scale structured surveys, (ii) WGS, and (iii) community-oriented database infrastructure and analysis tools. IMPORTANCE The spread of antibiotic-resistant bacteria is a public health emergency of global concern, threatening medical intervention at every level of health care delivery. Several recent studies have demonstrated the promise of routine whole-genome sequencing (WGS) of bacterial pathogens for epidemiological surveillance, outbreak detection, and infection control. However, as this technology becomes more widely adopted, the key challenges of generating representative national and international data sets and the development of bioinformatic tools to manage and interpret the data become increasingly pertinent. This study provides a road map for the integration of WGS data into routine pathogen surveillance. We emphasize the importance of large-scale routine surveys to provide the population context for more targeted or localized investigation and the development of open-access bioinformatic tools to provide the means to combine and compare independently generated data with publicly available data sets.

177 citations


Journal ArticleDOI
TL;DR: The proposed scheme provides important security attributes including prevention of various popular attacks, such as denial-of-service and eavesdropping attacks, and attains both computation efficiency and communication efficiency as compared with other schemes from the literature.
Abstract: The proliferation of wireless communications and information technologies has been altering human lifestyles and social interactions; the next frontier is the smart home environment. A smart home consists of low-capacity devices (e.g., sensors) and wireless networks working together as a system that needs an adequate level of security. This paper introduces a lightweight and secure session key establishment scheme for smart home environments. To establish trust within the network, every sensor and control unit uses a short authentication token and establishes a secure session key. The proposed scheme provides important security attributes including prevention of various popular attacks, such as denial-of-service and eavesdropping attacks. The preliminary evaluation and feasibility tests are demonstrated by a proof-of-concept implementation. In addition, the proposed scheme attains both computation efficiency and communication efficiency as compared with other schemes from the literature.
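
As a rough, illustrative sketch only (not the paper's scheme), the following shows the general shape of token-based session key establishment: both endpoints hold a short pre-shared authentication token and derive a fresh session key from the token together with nonces exchanged during the handshake.

    import hashlib
    import hmac
    import os

    def derive_session_key(token: bytes, nonce_device: bytes, nonce_hub: bytes) -> bytes:
        # Session key = HMAC(token, nonce_device || nonce_hub): a fresh key per session,
        # computable by any party holding the shared token and both nonces.
        return hmac.new(token, nonce_device + nonce_hub, hashlib.sha256).digest()

    token = b"123456"                                          # short token shared out of band (illustrative)
    nonce_device, nonce_hub = os.urandom(16), os.urandom(16)   # exchanged during the handshake

    key_at_device = derive_session_key(token, nonce_device, nonce_hub)
    key_at_hub = derive_session_key(token, nonce_device, nonce_hub)
    assert key_at_device == key_at_hub                         # both ends agree on the session key
    print(key_at_device.hex())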

154 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian optimization strategy is proposed that accelerates likelihood-free inference by probabilistically modelling the discrepancy between simulated and observed data, reducing the number of required simulations by several orders of magnitude.
Abstract: Our paper deals with inferring simulator-based statistical models given some observed data. A simulator-based model is a parametrized mechanism which specifies how data are generated. It is thus also referred to as generative model. We assume that only a finite number of parameters are of interest and allow the generative process to be very general; it may be a noisy nonlinear dynamical system with an unrestricted number of hidden variables. This weak assumption is useful for devising realistic models but it renders statistical inference very difficult. The main challenge is the intractability of the likelihood function. Several likelihood-free inference methods have been proposed which share the basic idea of identifying the parameters by finding values for which the discrepancy between simulated and observed data is small. A major obstacle to using these methods is their computational cost. The cost is largely due to the need to repeatedly simulate data sets and the lack of knowledge about how the parameters affect the discrepancy. We propose a strategy which combines probabilistic modeling of the discrepancy with optimization to facilitate likelihood-free inference. The strategy is implemented using Bayesian optimization and is shown to accelerate the inference through a reduction in the number of required simulations by several orders of magnitude.
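
A compact sketch of the core idea under simplifying assumptions: the discrepancy is modelled with a Gaussian process surrogate and the next simulation is placed where a lower-confidence-bound acquisition is smallest. The toy simulator, scikit-learn surrogate and all settings are illustrative and do not reproduce the paper's implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(1)
    observed = rng.normal(2.0, 1.0, size=200)                  # stand-in for real data

    def discrepancy(theta):
        simulated = rng.normal(theta, 1.0, size=200)           # one (costly) simulator run
        return abs(simulated.mean() - observed.mean())

    grid = np.linspace(-5.0, 5.0, 400).reshape(-1, 1)          # candidate parameter values
    thetas = [float(t) for t in rng.uniform(-5, 5, size=5)]    # a few initial simulations
    discrepancies = [discrepancy(t) for t in thetas]

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    for _ in range(30):                                        # only 30 further simulations
        gp.fit(np.array(thetas).reshape(-1, 1), discrepancies)
        mean, std = gp.predict(grid, return_std=True)
        theta_next = float(grid[np.argmin(mean - std), 0])     # lower-confidence-bound acquisition
        thetas.append(theta_next)
        discrepancies.append(discrepancy(theta_next))

    best = thetas[int(np.argmin(discrepancies))]
    print(f"parameter with smallest modelled discrepancy: {best:.2f}")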

150 citations


Journal ArticleDOI
TL;DR: This paper reports about the fifth edition of the ASP Competition by covering all aspects of the event, ranging from the new design of the competition to an in-depth analysis of the results, including additional analyses that were conceived for measuring the progress of the state of the art, as well as for studying aspects orthogonal to solving technology, such as the effects of modeling.

148 citations


Journal ArticleDOI
TL;DR: This article identified social norms that were formed around the prevailing sharing practices in the two sites and compared them in relation to the sharing mechanisms, and revealed that automated and manual sharing were sanctioned differently.
Abstract: “Profile work,” that is strategic self-presentation in social network sites, is configured by both the technical affordances and related social norms. In this article, we address technical and social psychological aspects that underlie acts of sharing by analyzing the social in relation to the technical. Our analysis is based on two complementary sets of qualitative data gleaned from in situ experiences of Finnish youth and young adults within the sharing mechanisms of Facebook and Last.fm. In our analysis, we identified social norms that were formed around the prevailing sharing practices in the two sites and compared them in relation to the sharing mechanisms. The analysis revealed that automated and manual sharing were sanctioned differently. We conclude that although the social norms that guide content sharing differed between the two contexts, there was an identical sociocultural goal in profile work: presentation of authenticity.

122 citations


Journal ArticleDOI
TL;DR: A systematic overview of state‐of‐the‐art techniques for visualizing different kinds of set relations is provided and these techniques are classified into six main categories according to the visual representations they use and the tasks they support.
Abstract: Sets comprise a generic data model that has been used in a variety of data analysis problems. Such problems involve analysing and visualizing set relations between multiple sets defined over the same collection of elements. However, visualizing sets is a non-trivial problem due to the large number of possible relations between them. We provide a systematic overview of state-of-the-art techniques for visualizing different kinds of set relations. We classify these techniques into six main categories according to the visual representations they use and the tasks they support. We compare the categories to provide guidance for choosing an appropriate technique for a given problem. Finally, we identify challenges in this area that need further research and propose possible directions to address these challenges. Further resources on set visualization are available at http://www.setviz.net.

115 citations


Journal ArticleDOI
TL;DR: MetaCCA as discussed by the authors is a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype, and employs a covariance shrinkage algorithm to achieve robustness.
Abstract: Motivation: A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analyzing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts, and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. Results: We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies. Availability and implementation: Code is available at https://github.com/aalto-ics-kepaco Contacts: anna.cichonska@helsinki.fi or matti.pirinen@helsinki.fi Supplementary information: Supplementary data are available at Bioinformatics online.
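
The following is not metaCCA itself, but a small illustration of its main building block: canonical correlations computed from covariance blocks, with a simple shrinkage step towards the identity for robustness; the data and shrinkage amount are synthetic assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, q = 500, 4, 3
    X = rng.normal(size=(n, p))                         # genotypes (synthetic)
    Y = 0.5 * X[:, :q] + rng.normal(size=(n, q))        # correlated phenotypes (synthetic)

    def shrink(C, gamma=0.1):
        # Shrink a covariance block towards the identity for numerical robustness.
        return (1 - gamma) * C + gamma * np.eye(C.shape[0])

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    C = np.cov(np.hstack([X, Y]), rowvar=False)         # joint covariance matrix
    Cxx, Cyy, Cxy = shrink(C[:p, :p]), shrink(C[p:, p:]), C[:p, p:]

    # Canonical correlations are the singular values of Cxx^(-1/2) Cxy Cyy^(-1/2).
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    print(np.round(np.linalg.svd(M, compute_uv=False), 3))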

Journal ArticleDOI
01 Nov 2016
TL;DR: The goal of this article is to investigate how to separate the 2 types of tasks in an IR system using easily measurable behaviors; the study shows that IR systems can distinguish the 2 search categories in the course of a search session.
Abstract: Exploratory search is an increasingly important activity yet challenging for users. Although there exists an ample amount of research into understanding exploration, most of the major information retrieval (IR) systems do not provide tailored and adaptive support for such tasks. One reason is the lack of empirical knowledge on how to distinguish exploratory and lookup search behaviors in IR systems. The goal of this article is to investigate how to separate the 2 types of tasks in an IR system using easily measurable behaviors. In this article, we first review characteristics of exploratory search behavior. We then report on a controlled study of 6 search tasks with 3 exploratory tasks (comparison, knowledge acquisition, planning) and 3 lookup tasks (fact-finding, navigational, question answering). The results are encouraging, showing that IR systems can distinguish the 2 search categories in the course of a search session. The most distinctive indicators that characterize exploratory search behaviors are query length, maximum scroll depth, and task completion time. However, 2 tasks are borderline and exhibit mixed characteristics. We assess the applicability of this finding by reporting on several classification experiments. Our results have valuable implications for designing tailored and adaptive IR systems.
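
A toy sketch of the classification experiments' setup: a logistic regression over the three indicators named above (query length, maximum scroll depth, task completion time). The session data below is synthetic; only the feature set follows the article.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 300
    # Synthetic sessions: exploratory sessions tend to have longer queries,
    # deeper scrolling and longer completion times than lookup sessions.
    lookup = np.column_stack([rng.normal(2.5, 0.8, n),      # query length (words)
                              rng.normal(2.0, 1.0, n),      # maximum scroll depth
                              rng.normal(60, 20, n)])       # task completion time (s)
    exploratory = np.column_stack([rng.normal(4.0, 1.0, n),
                                   rng.normal(5.0, 1.5, n),
                                   rng.normal(180, 60, n)])
    X = np.vstack([lookup, exploratory])
    y = np.array([0] * n + [1] * n)                         # 0 = lookup, 1 = exploratory

    clf = LogisticRegression(max_iter=1000)
    print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(2))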

Journal ArticleDOI
TL;DR: The proposed error correction method, LoRMA, is the most accurate one relying on long reads only for read sets with high coverage, and when the coverage of the read set is at least 75×, its throughput is at least 20% higher.
Abstract: Motivation: New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. Results: We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher. Availability and Implementation: LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/.

Journal ArticleDOI
TL;DR: This paper demonstrates empirically and theoretically with standard regression models that in order to make sure that decision models are non-discriminatory, for instance, with respect to race, the sensitive racial information needs to be used in the model building process.
Abstract: Increasing numbers of decisions about everyday life are made using algorithms. By algorithms we mean predictive models (decision rules) captured from historical data using data mining. Such models often decide prices we pay, select ads we see and news we read online, match job descriptions and candidate CVs, decide who gets a loan, who goes through an extra airport security check, or who gets released on parole. Yet growing evidence suggests that decision making by algorithms may discriminate against people, even if the computing process is fair and well-intentioned. This happens due to biased or non-representative learning data in combination with inadvertent modeling procedures. From the regulatory perspective there are two tendencies in relation to this issue: (1) to ensure that data-driven decision making is not discriminatory, and (2) to restrict overall collecting and storing of private data to a necessary minimum. This paper shows that from the computing perspective these two goals are contradictory. We demonstrate empirically and theoretically with standard regression models that in order to make sure that decision models are non-discriminatory, for instance, with respect to race, the sensitive racial information needs to be used in the model building process. Of course, after the model is ready, race should not be required as an input variable for decision making. From the regulatory perspective this has an important implication: collecting sensitive personal data is necessary in order to guarantee fairness of algorithms, and law making needs to find sensible ways to allow using such data in the modeling process.
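
A small synthetic demonstration of the proxy effect discussed above: even with the sensitive attribute excluded from the model, a correlated feature lets group-dependent predictions through, and the disparity can only be measured (or corrected) because the sensitive attribute is available at modelling time. Variable names and data are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)
    n = 5000
    race = rng.integers(0, 2, n)                          # sensitive attribute (0/1)
    neighborhood = race + rng.normal(0, 0.5, n)           # proxy correlated with race
    qualification = rng.normal(0, 1, n)                   # legitimate predictor
    outcome = qualification - 0.8 * race + rng.normal(0, 0.3, n)   # historically biased outcomes

    # "Race-blind" model: the sensitive attribute is excluded, but the proxy is not.
    X_blind = np.column_stack([qualification, neighborhood])
    blind_model = LinearRegression().fit(X_blind, outcome)
    predictions = blind_model.predict(X_blind)

    gap = predictions[race == 0].mean() - predictions[race == 1].mean()
    print(f"mean prediction gap between groups under the race-blind model: {gap:.2f}")
    # The gap is clearly non-zero: the historical bias leaks through the proxy,
    # and it can only be measured (or corrected) here because race was recorded.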

Journal ArticleDOI
TL;DR: It is demonstrated that pathway-response associations can be learned by the proposed model for the well-known EGFR and MEK inhibitors, opening up the opportunity for elucidating drug action mechanisms.
Abstract: Motivation A key goal of computational personalized medicine is to systematically utilize genomic and other molecular features of samples to predict drug responses for a previously unseen sample. Such predictions are valuable for developing hypotheses for selecting therapies tailored for individual patients. This is especially valuable in oncology, where molecular and genetic heterogeneity of the cells has a major impact on the response. However, the prediction task is extremely challenging, raising the need for methods that can effectively model and predict drug responses. Results In this study, we propose a novel formulation of multi-task matrix factorization that allows selective data integration for predicting drug responses. To solve the modeling task, we extend the state-of-the-art kernelized Bayesian matrix factorization (KBMF) method with component-wise multiple kernel learning. In addition, our approach exploits the known pathway information in a novel and biologically meaningful fashion to learn the drug response associations. Our method quantitatively outperforms the state of the art on predicting drug responses in two publicly available cancer datasets as well as on a synthetic dataset. In addition, we validated our model predictions with lab experiments using an in-house cancer cell line panel. We finally show the practical applicability of the proposed method by utilizing prior knowledge to infer pathway-drug response associations, opening up the opportunity for elucidating drug action mechanisms. We demonstrate that pathway-response associations can be learned by the proposed model for the well-known EGFR and MEK inhibitors. Availability and implementation The source code implementing the method is available at http://research.cs.aalto.fi/pml/software/cwkbmf/ Contacts muhammad.ammad-ud-din@aalto.fi or samuel.kaski@aalto.fi Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
Solveig K. Sieberts1, Zhu Fan2, Javier Garcia-Garcia3, Eli A. Stahl4, Abhishek Pratap1, Gaurav Pandey4, Dimitrios A. Pappas, Daniel Aguilar3, Bernat Anton3, Jaume Bonet3, Ridvan Eksi2, Oriol Fornes3, Emre Guney5, Hong-Dong Li2, Manuel Alejandro Marín3, Bharat Panwar2, Joan Planas-Iglesias3, Daniel Poglayen3, Jing Cui6, André O. Falcão7, Christine Suver1, Bruce Hoff1, Venkatachalapathy S. K. Balagurusamy8, Donna N. Dillenberger8, Elias Chaibub Neto1, Thea Norman1, Tero Aittokallio8, Muhammad Ammad-ud-din9, Muhammad Ammad-ud-din10, Chloé-Agathe Azencott11, Victor Bellon11, Valentina Boeva11, Kerstin Bunte9, Kerstin Bunte10, Himanshu Chheda12, Lu Cheng12, Lu Cheng10, Lu Cheng9, Jukka Corander12, Jukka Corander9, Michel Dumontier13, Anna Goldenberg14, Peddinti Gopalacharyulu12, Mohsen Hajiloo14, Daniel Hidru14, Alok Jaiswal12, Samuel Kaski12, Samuel Kaski10, Samuel Kaski9, Beyrem Khalfaoui14, Suleiman A. Khan10, Suleiman A. Khan9, Suleiman A. Khan12, Eric R. Kramer15, Pekka Marttinen10, Pekka Marttinen9, Aziz M. Mezlini14, Bhuvan Molparia15, Matti Pirinen12, Janna Saarela12, Matthias Samwald16, Véronique Stoven11, Hao Tang17, Jing Tang12, Ali Torkamani15, Jean Phillipe Vert11, Bo Wang13, Tao Wang17, Krister Wennerberg12, Nathan E. Wineinger15, Guanghua Xiao17, Yang Xie17, Rae S. M. Yeung14, Xiaowei Zhan17, Cheng Zhao14, Jeff Greenberg18, Joel M. Kremer19, Kaleb Michaud, Anne Barton, Marieke J H Coenen20, Xavier Mariette11, Corinne Miceli11, Nancy A. Shadick6, Michael E. Weinblatt6, Niek de Vries21, Paul P. Tak22, Danielle M. Gerlag22, Tom W J Huizinga23, Fina A S Kurreeman23, Cornelia F Allaart23, S. Louis Bridges24, Lindsey A. Criswell25, Larry W. Moreland26, Lars Klareskog27, Saedis Saevarsdottir27, Leonid Padyukov27, Peter K. Gregersen28, Stephen H. Friend1, Robert M. Plenge29, Gustavo Stolovitzky7, Baldo Oliva3, Yuanfang Guan2, Lara M. Mangravite1 
TL;DR: Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data.
Abstract: Rheumatoid arthritis (RA) affects millions world-wide. While anti-TNF treatment is widely used to reduce disease progression, treatment fails in ∼one-third of patients. No biomarker currently exists that identifies non-responders before treatment. A rigorous community-based assessment of the utility of SNP data for predicting anti-TNF treatment efficacy in RA patients was performed in the context of a DREAM Challenge (http://www.synapse.org/RA_Challenge). An open challenge framework enabled the comparative evaluation of predictions developed by 73 research groups using the most comprehensive available data and covering a wide range of state-of-the-art modelling methodologies. Despite a significant genetic heritability estimate of treatment non-response trait (h² = 0.18, P value = 0.02), no significant genetic contribution to prediction accuracy is observed. Results formally confirm the expectations of the rheumatology community that SNP information does not significantly improve predictive performance relative to standard clinical traits, thereby justifying a refocusing of future efforts on collection of other data.

Journal ArticleDOI
TL;DR: It is suggested that the regionally arid Turkana Basin may have acted as a ‘species factory’ between 4 and 2 Ma, generating ecological adaptations in advance of the global trend, and temporally and spatially resolved estimates of temperature and precipitation are provided.
Abstract: Although ecometric methods have been used to analyse fossil mammal faunas and environments of Eurasia and North America, such methods have not yet been applied to the rich fossil mammal record of e...

Journal ArticleDOI
TL;DR: This work proposes to address the metabolite identification problem using a structured output prediction approach that is not limited to a vector output space and can handle structured output spaces such as the molecule space, achieving state-of-the-art accuracy in metabolite identification.
Abstract: Motivation: An important problem in metabolomics is to identify metabolites using tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output space and can handle structured output space such as the molecule space. Results: We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. This method approximates the spectra-molecule mapping in two phases. The first phase corresponds to a regression problem from the input space to the feature space associated to the output kernel. The second phase is a preimage problem, consisting in mapping back the predicted output feature vectors to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods. Availability and implementation: Contact: celine.brouard@aalto.fi Supplementary information: Supplementary data are available at Bioinformatics online.

Posted Content
TL;DR: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter, and the previous default choices are dubious.
Abstract: The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter. We argue that the previous default choices are dubious due to their tendency to favor solutions with more unshrunk coefficients than we typically expect a priori. This can lead to bad results if this parameter is not strongly identified by data. We derive the relationship between the global parameter and the effective number of nonzeros in the coefficient vector, and show an easy and intuitive way of setting up the prior for the global parameter based on our prior beliefs about the number of nonzero coefficients in the model. The results on real world data show that one can benefit greatly -- in terms of improved parameter estimates, prediction accuracy, and reduced computation time -- from transforming even a crude guess for the number of nonzero coefficients into the prior for the global parameter using our framework.
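
As a sketch of the translation the abstract describes, the helper below maps a prior guess for the number of nonzero coefficients to a scale for the global shrinkage parameter. The specific formula used here, tau0 = p0 / (D - p0) * sigma / sqrt(n), is quoted from memory of the published work and should be verified against the paper before use.

    import math

    def horseshoe_global_scale(p0, D, sigma, n):
        """Suggested prior scale for the global shrinkage parameter tau.

        p0    : prior guess for the number of nonzero coefficients
        D     : total number of coefficients
        sigma : (approximate) noise standard deviation
        n     : number of observations
        NOTE: formula reproduced from memory; check against the paper.
        """
        return p0 / (D - p0) * sigma / math.sqrt(n)

    # Example: expecting roughly 10 relevant predictors out of 1000,
    # with unit noise and 200 observations.
    print(horseshoe_global_scale(p0=10, D=1000, sigma=1.0, n=200))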

Journal ArticleDOI
TL;DR: This paper revisits archetypal analysis from the basic principles and proposes a probabilistic framework that accommodates other observation types such as integers, binary values, and probability vectors, corroborating the proposed methodology with convincing real-world applications.
Abstract: Archetypal analysis represents a set of observations as convex combinations of pure patterns, or archetypes. The original geometric formulation of finding archetypes by approximating the convex hull of the observations assumes them to be real-valued. This, unfortunately, is not compatible with many practical situations. In this paper we revisit archetypal analysis from the basic principles, and propose a probabilistic framework that accommodates other observation types such as integers, binary, and probability vectors. We corroborate the proposed methodology with convincing real-world applications on finding archetypal soccer players based on performance data, archetypal winter tourists based on binary survey data, archetypal disaster-affected countries based on disaster count data, and document archetypes based on term-frequency data. We also present an appropriate visualization tool to summarize archetypal analysis solutions better.

Journal ArticleDOI
TL;DR: This paper reformulates the problem definition in a way that makes it possible to obtain an algorithm with a constant-factor approximation guarantee, and presents a new approach that improves over the existing techniques, both in theory and practice.
Abstract: Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs not as dense as would be possible by allowing a limited amount of overlap. Furthermore, overlaps are desirable as in most real-world graphs there are vertices that belong to more than one community, and thus, to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and practice. First, we reformulate the problem definition in a way that we are able to obtain an algorithm with a constant-factor approximation guarantee. Our approach relies on using techniques for solving the max-sum diversification problem, which, however, we need to extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency.
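
To make the density objective concrete, here is the classic greedy peeling heuristic for a single densest subgraph (density = edges/vertices), a well-known 2-approximation; it is background material, not the top-k overlapping algorithm proposed in the paper.

    def greedy_densest_subgraph(edges):
        # Repeatedly peel the minimum-degree vertex and remember the densest
        # intermediate subgraph, where density = |edges| / |vertices|.
        nodes = {u for edge in edges for u in edge}
        adj = {u: set() for u in nodes}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)

        current, m = set(nodes), len(edges)
        best, best_density = set(current), m / len(current)
        while len(current) > 1:
            u = min(current, key=lambda x: len(adj[x]))    # peel the minimum-degree vertex
            m -= len(adj[u])
            for v in adj[u]:
                adj[v].discard(u)
            current.remove(u)
            del adj[u]
            density = m / len(current)
            if density > best_density:
                best, best_density = set(current), density
        return best, best_density

    # A 4-clique {1,2,3,4} with a pendant path 4-5-6: the clique is the densest part.
    edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
    print(greedy_densest_subgraph(edges))                  # ({1, 2, 3, 4}, 1.5)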

Journal ArticleDOI
TL;DR: A new algebraic sieving technique to detect constrained multilinear monomials in multivariate polynomial generating functions given by an evaluation oracle is introduced and used to show an O*(2^k)-time polynomial-space algorithm for the k-sized Graph Motif problem.
Abstract: We introduce a new algebraic sieving technique to detect constrained multilinear monomials in multivariate polynomial generating functions given by an evaluation oracle. The polynomials are assumed to have coefficients from a field of characteristic two. As applications of the technique, we show an O*(2^k)-time polynomial-space algorithm for the k-sized Graph Motif problem. We also introduce a new optimization variant of the problem, called Closest Graph Motif, and solve it within the same time bound. The Closest Graph Motif problem encompasses several previously studied optimization variants, like Maximum Graph Motif, Min-Substitute Graph Motif, and Min-Add Graph Motif. Finally, we provide a piece of evidence that our result might be essentially tight: the existence of an O*((2-ε)^k)-time algorithm for the Graph Motif problem implies an O((2-ε')^n)-time algorithm for Set Cover.

Journal ArticleDOI
TL;DR: An overview of the answer set programming paradigm is given, its strengths are explained, and its main features are illustrated in terms of examples and an application problem.
Abstract: In this article, we give an overview of the answer set programming paradigm, explain its strengths, and illustrate its main features in terms of examples and an application problem.

Journal ArticleDOI
12 Feb 2016
TL;DR: This work provides a nearly complete computational complexity map of fixed-argument extension enforcement under various major AF semantics, with results ranging from polynomial-time algorithms to completeness for the second level of the polynomial hierarchy.
Abstract: Understanding the dynamics of argumentation frameworks (AFs) is important in the study of argumentation in AI. In this work, we focus on the so-called extension enforcement problem in abstract argumentation. We provide a nearly complete computational complexity map of fixed-argument extension enforcement under various major AF semantics, with results ranging from polynomial-time algorithms to completeness for the second-level of the polynomial hierarchy. Complementing the complexity results, we propose algorithms for NP-hard extension enforcement based on constrained optimization. Going beyond NP, we propose novel counterexample-guided abstraction refinement procedures for the second-level complete problems and present empirical results on a prototype system constituting the first approach to extension enforcement in its generality.

Proceedings Article
02 May 2016
TL;DR: A gradient-based inference method is proposed to learn the unknown function and the non-stationary model parameters of a fully non-stationary Gaussian process regression model, in which all three key parameters (i.e., noise variance, signal variance and lengthscale) can be simultaneously input-dependent, without requiring any model approximations.
Abstract: We present a novel approach for fully non-stationary Gaussian process regression (GPR), where all three key parameters -- noise variance, signal variance and lengthscale -- can be simultaneously input-dependent. We develop gradient-based inference methods to learn the unknown function and the non-stationary model parameters, without requiring any model approximations. We propose to infer full parameter posterior with Hamiltonian Monte Carlo (HMC), which conveniently extends the analytical gradient-based GPR learning by guiding the sampling with model gradients. We also learn the MAP solution from the posterior by gradient ascent. In experiments on several synthetic datasets and in modelling of temporal gene expression, the nonstationary GPR is shown to be necessary for modeling realistic input-dependent dynamics, while it performs comparably to conventional stationary or previous non-stationary GPR models otherwise.

Journal ArticleDOI
TL;DR: It is demonstrated that variability in clinical manifestations of disease is detectable in bacterial sputa signatures, and that the changing M.tb mRNA profiles 0–2 weeks into chemotherapy predict the efficacy of treatment 6 weeks later, which advocates assaying dynamic bacterial phenotypes through drug therapy as biomarkers for treatment success.
Abstract: New treatment options are needed to maintain and improve therapy for tuberculosis, which caused the death of 1.5 million people in 2013 despite potential for an 86 % treatment success rate. A greater understanding of Mycobacterium tuberculosis (M.tb) bacilli that persist through drug therapy will aid drug development programs. Predictive biomarkers for treatment efficacy are also a research priority. Genome-wide transcriptional profiling was used to map the mRNA signatures of M.tb from the sputa of 15 patients before and 3, 7 and 14 days after the start of standard regimen drug treatment. The mRNA profiles of bacilli through the first 2 weeks of therapy reflected drug activity at 3 days with transcriptional signatures at days 7 and 14 consistent with reduced M.tb metabolic activity similar to the profile of pre-chemotherapy bacilli. These results suggest that a pre-existing drug-tolerant M.tb population dominates sputum before and after early drug treatment, and that the mRNA signature at day 3 marks the killing of a drug-sensitive sub-population of bacilli. Modelling patient indices of disease severity with bacterial gene expression patterns demonstrated that both microbiological and clinical parameters were reflected in the divergent M.tb responses and provided evidence that factors such as bacterial load and disease pathology influence the host-pathogen interplay and the phenotypic state of bacilli. Transcriptional signatures were also defined that predicted measures of early treatment success (rate of decline in bacterial load over 3 days, TB test positivity at 2 months, and bacterial load at 2 months). This study defines the transcriptional signature of M.tb bacilli that have been expectorated in sputum after two weeks of drug therapy, characterizing the phenotypic state of bacilli that persist through treatment. We demonstrate that variability in clinical manifestations of disease are detectable in bacterial sputa signatures, and that the changing M.tb mRNA profiles 0–2 weeks into chemotherapy predict the efficacy of treatment 6 weeks later. These observations advocate assaying dynamic bacterial phenotypes through drug therapy as biomarkers for treatment success.

Posted Content
TL;DR: In this paper, it was shown that Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.
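
The decoding side of both results rests on computing the edit distance exactly when it is small. The sketch below is the textbook banded dynamic program for that primitive, returning the distance if it is at most K and "error" otherwise; it is not the sketching scheme of the paper.

    def edit_distance_at_most_k(x: str, y: str, K: int):
        # Banded dynamic program: only cells with |i - j| <= K can hold values <= K,
        # so each row costs O(K) and the whole computation O(|x| * K).
        if abs(len(x) - len(y)) > K:
            return "error"
        INF = K + 1
        prev = [j if j <= K else INF for j in range(len(y) + 1)]
        for i in range(1, len(x) + 1):
            lo, hi = max(0, i - K), min(len(y), i + K)
            curr = [INF] * (len(y) + 1)
            if lo == 0:
                curr[0] = i
            for j in range(max(1, lo), hi + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # delete x[i-1]
                              curr[j - 1] + 1,     # insert y[j-1]
                              prev[j - 1] + cost)  # substitute or match
            prev = curr
        return prev[len(y)] if prev[len(y)] <= K else "error"

    print(edit_distance_at_most_k("kitten", "sitting", K=3))   # -> 3
    print(edit_distance_at_most_k("kitten", "sitting", K=2))   # -> error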

Journal ArticleDOI
TL;DR: A Bayesian approach for joint biclustering of multiple data sources is presented, extending a recent method, Group Factor Analysis, to have a biclustering interpretation with additional sparsity assumptions; the resulting method enables data-driven detection of linear structure present in parts of the data sources.
Abstract: Motivation: Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple data sources, extending a recent method Group Factor Analysis to have a biclustering interpretation with additional sparsity assumptions. The resulting method enables data-driven detection of linear structure present in parts of the data sources. Results: Our simulation studies show that the proposed method reliably infers biclusters from heterogeneous data sources. We tested the method on data from the NCI-DREAM drug sensitivity prediction challenge, resulting in an excellent prediction accuracy. Moreover, the predictions are based on several biclusters which provide insight into the data sources, in this case on gene expression, DNA methylation, protein abundance, exome sequence, functional connectivity fingerprints and drug sensitivity. Availability and Implementation: http://research.cs.aalto.fi/pml/software/GFAsparse/ Contacts: kerstin.bunte@googlemail.com or samuel.kaski@aalto.fi

Proceedings ArticleDOI
07 Mar 2016
TL;DR: This work describes (1) a classifier that recognizes task type (lookup vs. exploratory) as a user is searching and (2) a reinforcement learning based search engine that accordingly adapts the balance of exploration/exploitation in ranking the documents.
Abstract: We present a novel adaptation technique for search engines to better support information-seeking activities that include both lookup and exploratory tasks. Building on previous findings, we describe (1) a classifier that recognizes task type (lookup vs. exploratory) as a user is searching and (2) a reinforcement learning based search engine that accordingly adapts the balance of exploration/exploitation in ranking the documents. This allows supporting both task types surreptitiously without changing the familiar list-based interface. Search results include more diverse results when users are exploring and more precise results for lookup tasks. Users found more useful results in exploratory tasks when compared to a baseline system, which is specifically tuned for lookup tasks.
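
As a toy sketch only: one way to adapt the exploration/exploitation balance by task type is to sample the ranking with a higher temperature over relevance scores for exploratory sessions and near-deterministically for lookup sessions. The paper's system uses reinforcement learning, which this simple softmax re-ranking does not reproduce; all names and values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)

    def rank(relevance_scores, exploratory, t_lookup=0.05, t_explore=1.0):
        # Softmax sampling over relevance: a high temperature yields diverse,
        # exploratory rankings; a low temperature yields precise, lookup-style rankings.
        scores = np.asarray(relevance_scores, dtype=float)
        t = t_explore if exploratory else t_lookup
        z = scores / t
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return rng.choice(len(scores), size=len(scores), replace=False, p=probs)

    scores = [3.0, 2.5, 2.0, 1.0, 0.5]                  # higher = more relevant
    print(rank(scores, exploratory=False))              # nearly always [0 1 2 3 4]
    print(rank(scores, exploratory=True))               # relevance-biased but diversified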

Proceedings ArticleDOI
19 Dec 2016
TL;DR: This work shows that in the document exchange problem, Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds x ∈ {0, 1}^n and Bob holds y ∈ {0, 1}^n, Alice can send Bob a message of size O(K(log² K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise. Both the encoding and decoding can be done in time Õ(n + poly(K)). This result significantly improves on the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold x and y respectively, they can compute sketches of x and y of sizes poly(K log n) bits (the encoding), and send to the referee, who can then compute the edit distance between x and y together with all the edit operations if the edit distance is no more than K, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using poly(K log n) bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using poly(K log n) bits of space.