Showing papers by "Helsinki Institute for Information Technology" published in 2014


Journal ArticleDOI
TL;DR: LoRDEC is a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph.
Abstract: Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.

580 citations
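
The graph-traversal idea in the LoRDEC abstract above can be illustrated with a minimal sketch. It is not the LoRDEC implementation: the graph here is a plain Python set rather than a succinct structure, and the path search is a simple breadth-first search for the shortest bridge between two trusted k-mers, whereas the abstract only says that "chosen paths" are traversed. The function names `build_de_bruijn`, `successors`, and `bridge`, and the `max_len` cutoff, are hypothetical.

```python
from collections import deque


def build_de_bruijn(reads, k):
    """Collect every k-mer of the short reads; the set acts as the node set."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmer, kmers):
    """k-mers obtained by shifting one base to the right and staying in the graph."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in kmers]


def bridge(source, target, kmers, max_len):
    """Breadth-first search for a k-mer path from source to target.

    Returns the sequence spelled by the path, or None if the target is not
    reached within max_len steps. (Assumption: shortest path as the criterion.)
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        kmer, seq = queue.popleft()
        if kmer == target:
            return seq
        if len(seq) - len(source) >= max_len:
            continue  # path budget exhausted on this branch
        for nxt in successors(kmer, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None


# Short reads support the sequence ACGTTGCA; a long read whose region between
# the trusted k-mers ACG and GCA is erroneous can be patched with the bridge.
graph = build_de_bruijn(["ACGTTGCA"], 3)
corrected = bridge("ACG", "GCA", graph, max_len=10)
```

In this toy setting the erroneous stretch of a long read would be replaced by the sequence spelled along the returned path; a real corrector would also compare candidate paths against the region being replaced.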


Journal ArticleDOI
25 Apr 2014-Science
TL;DR: In this paper, the authors applied large-scale single-cell genomics to study populations of the globally abundant marine cyanobacterium Prochlorococcus and found that they are composed of hundreds of subpopulations with distinct "genomic backbones," each backbone consisting of a different set of core gene alleles linked to a small distinctive set of flexible genes.
Abstract: Extensive genomic diversity within coexisting members of a microbial species has been revealed through selected cultured isolates and metagenomic assemblies. Yet, the cell-by-cell genomic composition of wild uncultured populations of co-occurring cells is largely unknown. In this work, we applied large-scale single-cell genomics to study populations of the globally abundant marine cyanobacterium Prochlorococcus. We show that they are composed of hundreds of subpopulations with distinct "genomic backbones," each backbone consisting of a different set of core gene alleles linked to a small distinctive set of flexible genes. These subpopulations are estimated to have diverged at least a few million years ago, suggesting ancient, stable niche partitioning. Such a large set of coexisting subpopulations may be a general feature of free-living bacterial species with huge populations in highly mixed habitats.

469 citations


Journal ArticleDOI
TL;DR: In this article, the authors report whole-genome sequencing of 3,085 pneumococcal carriage isolates from a 2.4 km² refugee camp, which provides unprecedented resolution of the process of recombination and its impact on population evolution.
Abstract: Evasion of clinical interventions by Streptococcus pneumoniae occurs through selection of non-susceptible genomic variants. We report whole-genome sequencing of 3,085 pneumococcal carriage isolates from a 2.4 km² refugee camp. This sequencing provides unprecedented resolution of the process of recombination and its impact on population evolution. Genomic recombination hotspots show remarkable consistency between lineages, indicating common selective pressures acting at certain loci, particularly those associated with antibiotic resistance. Temporal changes in antibiotic consumption are reflected in changes in recombination trends, demonstrating rapid spread of resistance when selective pressure is high. The highest frequencies of receipt and donation of recombined DNA fragments were observed in non-encapsulated lineages, implying that this largely overlooked pneumococcal group, which is beyond the reach of current vaccines, may have a major role in genetic exchange and the adaptation of the species as a whole. These findings advance understanding of pneumococcal population dynamics and provide information for the design of future intervention strategies.

359 citations


Journal ArticleDOI
TL;DR: The genome of the Glanville fritillary butterfly, a widely recognized model species in metapopulation biology and eco-evolutionary research, is reported, which shows that fusion chromosomes have retained the ancestral chromosome segments and very few rearrangements have occurred across the fusion sites.
Abstract: Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) ...

216 citations


Journal ArticleDOI
TL;DR: A genome-wide association study to identify single nucleotide polymorphisms (SNPs) and indels that could confer beta-lactam non-susceptibility using 3,085 Thai and 616 USA pneumococcal isolates as independent datasets for the variant discovery.
Abstract: Traditional genetic association studies are very difficult in bacteria, as the generally limited recombination leads to large linked haplotype blocks, confounding the identification of causative variants. Beta-lactam antibiotic resistance in Streptococcus pneumoniae arises readily as the bacteria can quickly incorporate DNA fragments encompassing variants that make the transformed strains resistant. However, the causative mutations themselves are embedded within larger recombined blocks, and previous studies have only analysed a limited number of isolates, leading to the description of “mosaic genes” as being responsible for resistance. By comparing a large number of genomes of beta-lactam susceptible and non-susceptible strains, the high frequency of recombination should break up these haplotype blocks and allow the use of genetic association approaches to identify individual causative variants. Here, we performed a genome-wide association study to identify single nucleotide polymorphisms (SNPs) and indels that could confer beta-lactam non-susceptibility using 3,085 Thai and 616 USA pneumococcal isolates as independent datasets for the variant discovery. The large sample sizes allowed us to narrow the source of beta-lactam non-susceptibility from long recombinant fragments down to much smaller loci comprised of discrete or linked SNPs. While some loci appear to be universal resistance determinants, contributing equally to non-susceptibility for at least two classes of beta-lactam antibiotics, some play a larger role in resistance to particular antibiotics. All of the identified loci have a highly non-uniform distribution in the populations. They are enriched not only in vaccine-targeted, but also non-vaccine-targeted lineages, which may raise clinical concerns. Identification of single nucleotide polymorphisms underlying resistance will be essential for future use of genome sequencing to predict antibiotic sensitivity in clinical microbiology.

212 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the two major generalist lineages of C. jejuni do not show evidence of recombination with each other in nature, despite having a high degree of host niche overlap and recombining extensively with specialist lineages, suggesting ecological rather than essential barriers to recombination.
Abstract: Homologous recombination between bacterial strains is theoretically capable of preventing the separation of daughter clusters, and producing cohesive clouds of genotypes in sequence space. However, numerous barriers to recombination are known. Barriers may be essential such as adaptive incompatibility, or ecological, which is associated with the opportunities for recombination in the natural habitat. Campylobacter jejuni is a gut colonizer of numerous animal species and a major human enteric pathogen. We demonstrate that the two major generalist lineages of C. jejuni do not show evidence of recombination with each other in nature, despite having a high degree of host niche overlap and recombining extensively with specialist lineages. However, transformation experiments show that the generalist lineages readily recombine with one another in vitro. This suggests ecological rather than essential barriers to recombination, caused by a cryptic niche structure within the hosts.

128 citations


Journal ArticleDOI
TL;DR: A sparse-coding framework for activity recognition in ubiquitous and mobile computing that alleviates two fundamental problems of current supervised learning approaches is proposed, and its practical potential is shown by successfully evaluating its generalization capabilities across both domain and sensor modalities.

115 citations


Journal ArticleDOI
TL;DR: In this paper, a generic approach for reasoning over argumentation frameworks is proposed based on the concept of complexity-sensitivity, which allows instantiations of the generic framework via harnessing in an iterative way current sophisticated Boolean satisfiability solver technology for solving the considered argumentation reasoning problems.

103 citations


Journal ArticleDOI
TL;DR: A novel kernelized Bayesian matrix factorization method is applied to solve the modeling task of predicting the responses to new drugs for new cancer cell lines, and a complete global map of drug response is explored to assess treatment potential and treatment range of therapeutically interesting anticancer drugs.
Abstract: With data from recent large-scale drug sensitivity measurement campaigns, it is now possible to build and test models predicting responses for more than one hundred anticancer drugs against several hundreds of human cancer cell lines. Traditional quantitative structure-activity relationship (QSAR) approaches focus on small molecules in searching for their structural properties predictive of the biological activity in a single cell line or a single tissue type. We extend this line of research in two directions: (1) an integrative QSAR approach predicting the responses to new drugs for a panel of multiple known cancer cell lines simultaneously and (2) a personalized QSAR approach predicting the responses to new drugs for new cancer cell lines. To solve the modeling task, we apply a novel kernelized Bayesian matrix factorization method. For maximum applicability and predictive performance, the method optionally utilizes genomic features of cell lines and target information on drugs in addition to chemical drug descriptors. In a case study with 116 anticancer drugs and 650 cell lines, we demonstrate the usefulness of the method in several relevant prediction scenarios, differing in the amount of available information, and analyze the importance of various types of drug features for the response prediction. Furthermore, after predicting the missing values of the data set, a complete global map of drug response is explored to assess treatment potential and treatment range of therapeutically interesting anticancer drugs.

102 citations


Proceedings ArticleDOI
26 Apr 2014
TL;DR: A predictive model for the functional area of the thumb on a touchscreen surface that derives a quadratic formula by analyzing the kinematics of the gripping hand and can be used to infer the grips assumed by a given user interface layout.
Abstract: We present a predictive model for the functional area of the thumb on a touchscreen surface: the area of the interface reachable by the thumb of the hand that is holding the device. We derive a quadratic formula by analyzing the kinematics of the gripping hand. Model fit is high for the thumb-motion trajectories of 20 participants. The model predicts the functional area for a given 1) surface size, 2) hand size, and 3) position of the index finger on the back of the device. Designers can use this model to ensure that a user interface is suitable for interaction with the thumb. The model can also be used inversely - that is, to infer the grips assumed by a given user interface layout.

101 citations


Journal ArticleDOI
TL;DR: In this article, the authors investigated whether designers systematically prefer their own ideas in concept evaluation and found a systematic preference of self-generated concepts in evaluation tasks, and discussed the implications of this preference effect on design practice.

Journal ArticleDOI
TL;DR: This work combines fragmentation tree computations with kernel-based machine learning to predict molecular fingerprints and identify molecular structures, and introduces a family of kernels capturing the similarity of fragmentation trees, and combines these kernels using recently proposed multiple kernel learning approaches.
Abstract: Motivation: Metabolite identification from tandem mass spectrometric data is a key task in metabolomics. Various computational methods have been proposed for the identification of metabolites from tandem mass spectra. Fragmentation tree methods explore the space of possible ways in which the metabolite can fragment, and base the metabolite identification on scoring of these fragmentation trees. Machine learning methods have been used to map mass spectra to molecular fingerprints; predicted fingerprints, in turn, can be used to score candidate molecular structures. Results: Here, we combine fragmentation tree computations with kernel-based machine learning to predict molecular fingerprints and identify molecular structures. We introduce a family of kernels capturing the similarity of fragmentation trees, and combine these kernels using recently proposed multiple kernel learning approaches. Experiments on two large reference datasets show that the new methods significantly improve molecular fingerprint prediction accuracy. These improvements result in better metabolite identification, doubling the number of metabolites ranked at the top position of the candidates list. Contact: huibin.shen@aalto.fi Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this paper, the authors report lessons learnt from three parallel and complementary user studies, where motivational features for sustainable urban mobility, including social influence strategies delivered through social media, were prototyped, tested and refined.

Journal ArticleDOI
TL;DR: Deterministic models that describe the energy consumption of Wi-Fi data transmission with traffic burstiness, network performance metrics like throughput and retransmission rate, and parameters of the power saving mechanisms in use are presented.
Abstract: Wireless data transmission consumes a significant part of the overall energy consumption of smartphones, due to the popularity of Internet applications. In this paper, we investigate the energy consumption characteristics of data transmission over Wi-Fi, focusing on the effect of Internet flow characteristics and network environment. We present deterministic models that describe the energy consumption of Wi-Fi data transmission with traffic burstiness, network performance metrics like throughput and retransmission rate, and parameters of the power saving mechanisms in use. Our models are practical because their inputs are easily available on mobile platforms without modifying low-level software or hardware components. We demonstrate the practice of model-based energy profiling on Maemo, Symbian, and Android phones, and evaluate the accuracy with physical power measurement of applications including file transfer, web browsing, video streaming, and instant messaging. Our experimental results show that our models are of adequate accuracy for energy profiling and are easy to apply.

Proceedings ArticleDOI
15 Feb 2014
TL;DR: It is suggested that the frame that monetary transactions set for exchange relationships contributes to the hosts' sense of control and ease in the exchange.
Abstract: This study examines how money mediates and structures social exchange in a hospitality exchange service, and how social and economic factors guiding exchange get intertwined in this context. We present a qualitative study on the experiences of people who offer to rent out their homes, or parts of them, via the online peer-to-peer renting service Airbnb. Our study suggests that the frame that monetary transactions set for exchange relationships contributes to the hosts' sense of control and ease in the exchange. We identified two behavioral patterns that highlight the importance of reputation and trust: (1) hosts divert their accumulated reputational capital into the rental price and (2) they may price their property below "the market price", so that they can choose their exchange partners from a wider pool of candidates.

Proceedings Article
02 Apr 2014
TL;DR: This work develops a novel score-based approach to BTW-BNSL, based on casting BTW-BNSL as weighted partial maximum satisfiability, and demonstrates empirically that the approach scales notably better than a recent exact dynamic programming algorithm for BTW-BNSL.
Abstract: Bayesian network structure learning is the well-known computationally hard problem of finding a directed acyclic graph structure that optimally describes given data. A learned structure can then be used for probabilistic inference. While exact inference in Bayesian networks is in general NP-hard, it is tractable in networks with low treewidth. This provides good motivation for developing algorithms for the NP-hard problem of learning optimal bounded treewidth Bayesian networks (BTW-BNSL). In this work, we develop a novel score-based approach to BTW-BNSL, based on casting BTW-BNSL as weighted partial maximum satisfiability. We demonstrate empirically that the approach scales notably better than a recent exact dynamic programming algorithm for BTW-BNSL.

Journal ArticleDOI
TL;DR: A new approach for the problem of finding overlapping communities in graphs and social networks, particularly in graphs that have labels on their vertices, by adapting efficient approximation algorithms for the generalized maximum-coverage problem and the densest-subgraph problem.
Abstract: We present a new approach for the problem of finding overlapping communities in graphs and social networks. Our approach consists of a novel problem definition and three accompanying algorithms. We are particularly interested in graphs that have labels on their vertices, although our methods are also applicable to graphs with no labels. Our goal is to find k communities so that the total edge density over all k communities is maximized. In the case of labeled graphs, we require that each community is succinctly described by a set of labels. This requirement provides a better understanding for the discovered communities. The proposed problem formulation leads to the discovery of vertex-overlapping and dense communities that cover as many graph edges as possible. We capture these properties with a simple objective function, which we solve by adapting efficient approximation algorithms for the generalized maximum-coverage problem and the densest-subgraph problem. Our proposed algorithm is a generic greedy scheme. We experiment with three variants of the scheme, obtained by varying the greedy step of finding a dense subgraph. We validate our algorithms by comparing with other state-of-the-art community-detection methods on a variety of performance measures. Our experiments confirm that our algorithms achieve results of high quality in terms of the reported measures, and are practical in terms of performance.
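
The abstract above mentions a greedy step that finds a dense subgraph. One well-known approximation for the densest-subgraph problem is greedy peeling, which repeatedly removes a minimum-degree vertex and keeps the densest intermediate subgraph. The sketch below shows that technique on an unlabeled simple graph; whether the paper uses exactly this variant is an assumption, and the function name `densest_subgraph_peel` is hypothetical.

```python
def densest_subgraph_peel(edges):
    """Greedy peeling for the densest subgraph (density = edges / vertices).

    Assumes a simple undirected graph given as (u, v) pairs without self-loops.
    Returns (vertex set, density) of the best subgraph seen while peeling.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    nodes = set(adj)
    num_edges = sum(len(neigh) for neigh in adj.values()) // 2
    best_nodes, best_density = set(nodes), 0.0

    while nodes:
        density = num_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
        # Remove a vertex of minimum degree in the current subgraph.
        u = min(nodes, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]

    return best_nodes, best_density


# A 4-clique with one pendant vertex: peeling drops the pendant vertex first.
clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
community, density = densest_subgraph_peel(clique + [(4, 5)])
```

To obtain k possibly overlapping communities in the spirit of the abstract, such a step would be repeated with a coverage-aware objective; that outer loop is not shown here.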

Journal ArticleDOI
01 Jul 2014-PLOS ONE
TL;DR: The hypothesis that males not only prefer competitive over cooperative play, but they also exhibit more positive emotional responses during them is supported, and the emotional experiences of females do not differ between cooperation and competition, which implies that less competitiveness does not mean more cooperativeness.
Abstract: Previous research indicates that males prefer competition over cooperation, and it is sometimes suggested that females show the opposite behavioral preference. In the present article, we investigate the emotions behind the preferences: Do males exhibit more positive emotions during competitive than cooperative activities, and do females show the opposite pattern? We conducted two experiments where we assessed the emotional responses of same-gender dyads (in total 130 participants, 50 female) during intrinsically motivating competitive and cooperative digital game play using facial electromyography (EMG), skin conductance, heart rate measures, and self-reported emotional experiences. We found higher positive emotional responses (as indexed by both physiological measures and self-reports) during competitive than cooperative play for males, but no differences for females. In addition, we found no differences in negative emotions, and heart rate, skin conductance, and self-reports yielded contradictory evidence for arousal. These results support the hypothesis that males not only prefer competitive over cooperative play, but they also exhibit more positive emotional responses during them. In contrast, the results suggest that the emotional experiences of females do not differ between cooperation and competition, which implies that less competitiveness does not mean more cooperativeness. Our results pertain to intrinsically motivated game play, but might be relevant also for other kinds of activities.

Journal ArticleDOI
TL;DR: The framework can be used for supervised and semi-supervised multilabel classification and multi-output regression, by considering samples and outputs as the domains where matrix factorization operates and outperforms alternatives in predicting drug-protein interactions on two data sets.
Abstract: We extend kernelized matrix factorization with a full-Bayesian treatment and with an ability to work with multiple side information sources expressed as different kernels. Kernels have been introduced to integrate side information about the rows and columns, which is necessary for making out-of-matrix predictions. We discuss specifically binary output matrices but extensions to real-valued matrices are straightforward. We extend the state of the art in two key aspects: (i) A full-conjugate probabilistic formulation of the kernelized matrix factorization enables an efficient variational approximation, whereas full-Bayesian treatments are not computationally feasible in the earlier approaches. (ii) Multiple side information sources are included, treated as different kernels in multiple kernel learning which additionally reveals which side sources are informative. We then show that the framework can also be used for supervised and semi-supervised multilabel classification and multi-output regression, by considering samples and outputs as the domains where matrix factorization operates. Our method outperforms alternatives in predicting drug-protein interactions on two data sets. On multilabel classification, our algorithm obtains the lowest Hamming losses on 10 out of 14 data sets compared to five state-of-the-art multilabel classification algorithms. We finally show that the proposed approach outperforms alternatives in multi-output regression experiments on a yeast cell cycle data set.

Proceedings ArticleDOI
07 Apr 2014
TL;DR: In this article, the authors present the first independent study of malware infection rates and associated risk factors using data collected directly from over 55,000 Android devices and find that the malware infection rates in Android devices estimated using two malware datasets (0.28% and 0.26%), though small, are significantly higher than the previous independent estimate.
Abstract: There is little information from independent sources in the public domain about mobile malware infection rates. The only previous independent estimate (0.0009%) [11], was based on indirect measurements obtained from domain-name resolution traces. In this paper, we present the first independent study of malware infection rates and associated risk factors using data collected directly from over 55,000 Android devices. We find that the malware infection rates in Android devices estimated using two malware datasets (0.28% and 0.26%), though small, are significantly higher than the previous independent estimate. Based on the hypothesis that some application stores have a greater density of malicious applications and that advertising within applications and cross-promotional deals may act as infection vectors, we investigate whether the set of applications used on a device can serve as an indicator for infection of that device. Our analysis indicates that, while not an accurate indicator of infection by itself, the application set does serve as an inexpensive method for identifying the pool of devices on which more expensive monitoring and analysis mechanisms should be deployed. Using our two malware datasets we show that this indicator performs up to about five times better at identifying infected devices than the baseline of random checks. Such indicators can be used, for example, in the search for new or previously undetected malware. It is therefore a technique that can complement standard malware scanning. Our analysis also demonstrates a marginally significant difference in battery use between infected and clean devices.

Journal ArticleDOI
31 Jan 2014-PLOS ONE
TL;DR: The outcomes of this study contribute to the understanding of the aetiology of INMI, in particular within the framework of memory theory, and present testable hypotheses for future research on successful INMI coping strategies.
Abstract: The vast majority of people experience involuntary musical imagery (INMI) or ‘earworms’; perceptions of spontaneous, repetitive musical sound in the absence of an external source. The majority of INMI episodes are not bothersome, while some cause disruption ranging from distraction to anxiety and distress. To date, little is known about how the majority of people react to INMI, in particular whether evaluation of the experience impacts on chosen response behaviours or if attempts at controlling INMI are successful or not. The present study classified 1046 reports of how people react to INMI episodes. Two laboratories in Finland and the UK conducted an identical qualitative analysis protocol on reports of INMI reactions and derived visual descriptive models of the outcomes using grounded theory techniques. Combined analysis carried out across the two studies confirmed that many INMI episodes were considered neutral or pleasant, with passive acceptance and enjoyment being among the most popular response behaviours. A significant number of people, however, reported on attempts to cope with unwanted INMI. The most popular and effective behaviours in response to INMI were seeking out the tune in question, and musical or verbal distraction. The outcomes of this study contribute to our understanding of the aetiology of INMI, in particular within the framework of memory theory, and present testable hypotheses for future research on successful INMI coping strategies.

Book ChapterDOI
24 Sep 2014
TL;DR: This work designs a constraint propagator for the acyclicity constraint and shows how it can be incorporated in off-the-shelf SAT solvers, and proposes an embedding of directed graphs in SAT, with arcs labelled with propositional variables, and an extended SAT problem in which all clauses have to be satisfied.
Abstract: Acyclicity is a recurring property of solutions to many important combinatorial problems. In this work we study embeddings of specialized acyclicity constraints in the satisfiability problem of the classical propositional logic (SAT). We propose an embedding of directed graphs in SAT, with arcs labelled with propositional variables, and an extended SAT problem in which all clauses have to be satisfied and the subgraph consisting of arcs labelled true is acyclic. We devise a constraint propagator for the acyclicity constraint and show how it can be incorporated in off-the-shelf SAT solvers. We show that all existing encodings of acyclicity constraints in SAT are either prohibitively large or do not sanction all inferences made by the constraint propagator. Our experiments demonstrate the advantages of our solver over other approaches for handling acyclicity.
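
The extended SAT problem described above requires that the subgraph of arcs labelled true is acyclic. The sketch below only checks that condition for a complete assignment, using Kahn's topological-sort algorithm; it is not the constraint propagator from the paper, which works inside the solver on partial assignments. The data layout (a dict from arcs to variable names) and the function name `true_arcs_acyclic` are assumptions made for illustration.

```python
from collections import deque


def true_arcs_acyclic(arcs, assignment):
    """Check that the arcs whose variable is assigned True form an acyclic graph.

    arcs: dict mapping (u, v) -> propositional variable name labelling the arc.
    assignment: dict mapping variable name -> bool (missing variables count as False).
    """
    succ, indeg = {}, {}
    for (u, v), var in arcs.items():
        indeg.setdefault(u, 0)
        indeg.setdefault(v, 0)
        if assignment.get(var, False):
            succ.setdefault(u, []).append(v)
            indeg[v] += 1

    # Kahn's algorithm: repeatedly remove nodes with no incoming true arc.
    queue = deque(node for node, deg in indeg.items() if deg == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    # Every node is removed exactly when no cycle of true arcs exists.
    return removed == len(indeg)


arcs = {("a", "b"): "x1", ("b", "c"): "x2", ("c", "a"): "x3"}
cyclic = true_arcs_acyclic(arcs, {"x1": True, "x2": True, "x3": True})
acyclic = true_arcs_acyclic(arcs, {"x1": True, "x2": True, "x3": False})
```

A propagator would go further than this check, for example by forcing an arc variable to false as soon as setting it to true would close a cycle.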

Proceedings Article
27 Jul 2014
TL;DR: The lower bound is tightened by using more informed variable groupings when creating the pattern databases, and the upper bound is tightened using an anytime learning algorithm.
Abstract: A recent breadth-first branch and bound algorithm (BFBnB) for learning Bayesian network structures (Malone et al. 2011) uses two bounds to prune the search space for better efficiency; one is a lower bound calculated from pattern database heuristics, and the other is an upper bound obtained by a hill climbing search. Whenever the lower bound of a search path exceeds the upper bound, the path is guaranteed to lead to suboptimal solutions and is discarded immediately. This paper introduces methods for tightening the bounds. The lower bound is tightened by using more informed variable groupings when creating the pattern databases, and the upper bound is tightened using an anytime learning algorithm. Empirical results show that these bounds improve the efficiency of Bayesian network learning by two to three orders of magnitude.

01 Jan 2014
TL;DR: The authors conclude that adapting creative software to support human-computer co-creation requires redesigning some major aspects of the software, a finding that guides their ongoing project of building an interactive poetry composition tool.
Abstract: This paper investigates how to transform machine creativity systems into interactive tools that support human-computer co-creation. We use three case studies to identify common issues in this transformation, under the perspective of User-Centered Design. We also analyse the interactivity and creative behavior of the three platforms in terms of Wiggins’ formalization of creativity as a search. We arrive at the conclusion that adapting creative software for supporting human-computer co-creation requires redesigning some major aspects of the software, which guides our on-going project of building an interactive poetry composition tool.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper proposes a novel secure control channel architecture based on Host Identity Protocol (HIP) to protect the control channel of Software-Defined Mobile Networks from various IP (Internet Protocol) based attacks.
Abstract: Software-Defined Mobile Networks (SDMNs) are becoming popular as the next generation of telecommunication networks due to their enhanced performance, flexibility and scalability. In this paper, we study the new security challenges of the control channel of SDMNs and propose a novel secure control channel architecture based on Host Identity Protocol (HIP). IPsec tunneling and security gateways are widely used in today's mobile networks. The proposed architecture utilizes these technologies to protect the control channel of SDMNs. We implement the proposed architecture in a testbed and analyze the security features. Moreover, we measure the performance penalty of security in the proposed architecture and analyze its ability to protect the control channel from various IP (Internet Protocol) based attacks.

Journal ArticleDOI
TL;DR: Through long-term studies of the development of two contrasting IIs, the paper examines the prosumer-management strategies by which vendors manage their relationships with their diverse users.
Abstract: This paper contributes to the reworking of the traditional concepts and methods of Science and Technology Studies that is necessary in order to analyse the development and use of social media and other emerging information infrastructures (IIs). Through long-term studies of the development of two contrasting IIs, the paper examines the prosumer-management strategies by which vendors manage their relationships with their diverse users. Despite the sharp differences between our cases – an online-game with social network features and traditional enterprise systems – we find striking homologies in the ways vendors manage the tensions underpinning the design and development of mass-market products. Thus their knowledge infrastructures – the set of tools and instruments through which vendors maintain an adequate understanding of their multiple users – change in the face of competing exigencies. Market expansion may favour ‘efficient’ quantitative user assessment methods and the construction of abstract user cat...

Proceedings Article
01 Apr 2014
TL;DR: This work presents a novel CMF solution that allows each of the matrices to have a separate low-rank structure that is independent of the other matrices, as well as structures that are shared only by a subset of them.
Abstract: Collective matrix factorization (CMF) is a technique for simultaneously learning low-rank representations based on a collection of matrices with shared entities. A typical example is the joint modeling of user-item, item-property, and user-feature matrices in a recommender system. The key idea in CMF is that the embeddings are shared across the matrices, which enables transferring information between them. The existing solutions, however, break down when the individual matrices have low-rank structure not shared with others. In this work we present a novel CMF solution that allows each of the matrices to have a separate low-rank structure that is independent of the other matrices, as well as structures that are shared only by a subset of them. We compare MAP and variational Bayesian solutions based on alternating optimization algorithms and show that the model automatically infers the nature of each factor using group-wise sparsity. Our approach supports continuous, binary and count observations in a principled way and is efficient for sparse matrices involving missing data. We illustrate the solution on a number of examples, focusing in particular on an interesting use-case of augmented multi-view learning.
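
The shared-embedding idea described in the abstract can be illustrated with a minimal sketch. It is plain CMF fitted by gradient descent on squared error, not the group-sparse Bayesian model the paper proposes. The function names (`fit_cmf`, `times_transpose`, and the other helpers), the toy matrices, and the hyperparameters are all assumptions made for illustration. A user-item matrix X is approximated by U V^T and an item-property matrix Y by V W^T, so the item embeddings V receive gradient signal from both matrices.

```python
import random


def transpose(M):
    return [list(col) for col in zip(*M)]


def times_transpose(A, B):
    """Return A B^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]


def residual(M, A, B):
    """Return M - A B^T."""
    return [[m - p for m, p in zip(rm, rp)]
            for rm, rp in zip(M, times_transpose(A, B))]


def squared_error(R):
    return sum(r * r for row in R for r in row)


def add_scaled(A, D, scale):
    """Return A + scale * D elementwise."""
    return [[a + scale * d for a, d in zip(ra, rd)] for ra, rd in zip(A, D)]


def fit_cmf(X, Y, rank=2, steps=2000, lr=0.01, seed=0):
    """Fit X ~ U V^T and Y ~ V W^T with the item factors V shared (hypothetical helper)."""
    rng = random.Random(seed)

    def init(n):
        return [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]

    U, V, W = init(len(X)), init(len(Y)), init(len(Y[0]))
    losses = []
    for _ in range(steps):
        RX, RY = residual(X, U, V), residual(Y, V, W)
        losses.append(squared_error(RX) + squared_error(RY))
        # Descent directions: negative half-gradients of the summed squared error.
        dU = times_transpose(RX, transpose(V))                    # RX V
        dV = add_scaled(times_transpose(transpose(RX), transpose(U)),
                        times_transpose(RY, transpose(W)), 1.0)   # RX^T U + RY W
        dW = times_transpose(transpose(RY), transpose(V))         # RY^T V
        U, V, W = add_scaled(U, dU, lr), add_scaled(V, dV, lr), add_scaled(W, dW, lr)
    losses.append(squared_error(residual(X, U, V)) + squared_error(residual(Y, V, W)))
    return U, V, W, losses


# Toy data (assumed): 3 users x 4 items, and 4 items x 2 properties.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
U, V, W, losses = fit_cmf(X, Y)
```

Because every factor is shared in this sketch, it has no way to represent structure private to one matrix, which is the limitation the abstract says the proposed model addresses.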

Journal ArticleDOI
TL;DR: This paper considers two generalizations of the Minimum Path Cover Problem dealing with integrating constraints arising from long reads or paired-end reads, and shows that in the case of long reads (subpaths), the generalized problem can be solved in polynomial-time by a reduction to the classical MPC Problem.
Abstract: Multi-assembly problems have gathered much attention in recent years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq), or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems in which a set of subpaths, or pairs of subpaths, of the graph must also be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems.
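
For the classical, unconstrained MPC Problem mentioned in the abstract, a minimal sketch can show why acyclicity gives polynomial-time solvability. It uses a standard textbook technique (transitive closure followed by maximum bipartite matching), not the paper's reductions for subpath constraints or its min-cost circulation formulation. The function names and the example graph are assumptions for illustration, and paths are allowed to share nodes.

```python
from typing import List, Set, Tuple


def transitive_closure(n: int, edges: List[Tuple[int, int]]) -> List[Set[int]]:
    """reach[u] is the set of nodes reachable from u by a non-empty path."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    reach: List[Set[int]] = [set() for _ in range(n)]
    for s in range(n):
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in reach[s]:
                reach[s].add(v)
                stack.extend(adj[v])
    return reach


def min_path_cover(n: int, edges: List[Tuple[int, int]]) -> int:
    """Minimum number of paths covering all nodes of a DAG (paths may overlap)."""
    reach = transitive_closure(n, edges)
    match_right = [-1] * n  # match_right[v] = node currently chosen to precede v

    def try_augment(u: int, seen: Set[int]) -> bool:
        # Kuhn's augmenting-path step on the bipartite graph u -> reach[u].
        for v in reach[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(1 for u in range(n) if try_augment(u, set()))
    # Each matched pair merges two path fragments, so n - matching paths remain.
    return n - matching


# Assumed example: node 2 lies on every source-to-sink path.
edges = [(0, 2), (1, 2), (2, 3), (2, 4)]
cover_size = min_path_cover(5, edges)
```

Matching over the transitive closure rather than the original edges is what lets two paths pass through the same node; matching over the edges alone would solve the node-disjoint variant instead.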

Proceedings Article
23 Jul 2014
TL;DR: This paper observes that useful information is implicit in the potentially optimal parent sets (POPS), and shows that solving the resulting constrained subproblems significantly improves the efficiency and scalability of heuristic search-based structure learning algorithms.
Abstract: Several recent algorithms for learning Bayesian network structures first calculate potentially optimal parent sets (POPS) for all variables and then use various optimization techniques to find a set of POPS, one for each variable, that constitutes an optimal network structure. This paper makes the observation that there is useful information implicit in the POPS. Specifically, the POPS of a variable constrain its parent candidates. Moreover, the parent candidates of all variables together give a directed cyclic graph, which often decomposes into a set of strongly connected components (SCCs). Each SCC corresponds to a smaller subproblem which can be solved independently of the others. Our results show that solving the constrained subproblems significantly improves the efficiency and scalability of heuristic search-based structure learning algorithms. Further, we show that by considering only the top p POPS of each variable, we quickly find provably very high quality networks for large datasets.
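
The decomposition step described in the abstract can be sketched as follows: build the parent candidate graph from the POPS, then split it into strongly connected components. This is a minimal illustration, not the authors' implementation. The POPS table, the function names, and the use of Kosaraju's algorithm are assumptions. The sketch is recursive, so it is only suitable for small graphs.

```python
from typing import Dict, FrozenSet, List, Set


def parent_candidate_graph(pops: Dict[str, List[FrozenSet[str]]]) -> Dict[str, Set[str]]:
    """Add an edge p -> v whenever p appears in some POPS of variable v."""
    graph: Dict[str, Set[str]] = {v: set() for v in pops}
    for child, parent_sets in pops.items():
        for parent_set in parent_sets:
            for parent in parent_set:
                graph.setdefault(parent, set()).add(child)
    return graph


def strongly_connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Kosaraju's algorithm: finish-order DFS, then DFS on the reversed graph."""
    order: List[str] = []
    visited: Set[str] = set()

    def dfs(u: str) -> None:
        visited.add(u)
        for w in graph[u]:
            if w not in visited:
                dfs(w)
        order.append(u)

    for u in graph:
        if u not in visited:
            dfs(u)

    reverse: Dict[str, Set[str]] = {u: set() for u in graph}
    for u, successors in graph.items():
        for w in successors:
            reverse[w].add(u)

    components: List[Set[str]] = []
    assigned: Set[str] = set()

    def collect(u: str, component: Set[str]) -> None:
        assigned.add(u)
        component.add(u)
        for w in reverse[u]:
            if w not in assigned:
                collect(w, component)

    for u in reversed(order):
        if u not in assigned:
            component: Set[str] = set()
            collect(u, component)
            components.append(component)
    return components


# Hypothetical POPS: A and B may be each other's parents; C and D depend on earlier variables.
pops = {
    "A": [frozenset(), frozenset({"B"})],
    "B": [frozenset(), frozenset({"A"})],
    "C": [frozenset(), frozenset({"A"}), frozenset({"A", "B"})],
    "D": [frozenset(), frozenset({"C"})],
}
components = strongly_connected_components(parent_candidate_graph(pops))
```

In this toy instance the only cycle is between A and B, so the ordering among variables needs to be searched only inside the component {A, B}; C and D each form a singleton component.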

Proceedings ArticleDOI
18 Aug 2014
TL;DR: A linear embedding of logic programs is devised and the performance of answer set computation with SAT modulo acyclicity solvers is studied.
Abstract: Answer set programming (ASP) is a declarative programming paradigm for solving search problems arising in knowledge-intensive domains. One viable way to implement the computation of answer sets corresponding to problem solutions is to recast a logic program as a Boolean satisfiability (SAT) problem and to use existing SAT solver technology for the actual search. Such mappings can be obtained by augmenting Clark's completion with constraints guaranteeing the strong justifiability of answer sets. To this end, we consider an extension of SAT by graphs subject to an acyclicity constraint, called SAT modulo acyclicity. We devise a linear embedding of logic programs and study the performance of answer set computation with SAT modulo acyclicity solvers.