
Helsinki Institute for Information Technology

Facility · Espoo, Finland
About: Helsinki Institute for Information Technology is a research facility based in Espoo, Finland. It is known for research contributions in the topics Population and Bayesian network. The organization has 630 authors who have published 1,962 publications receiving 63,426 citations.


Papers
Journal ArticleDOI
TL;DR: This article answers the question of which strings can be safely reported from a genome graph G as contigs, using a model in which the genome is a circular covering walk, and gives a polynomial-time algorithm for finding such strings, which are called omnitigs.
Abstract: Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs—a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question remains: given a genome graph G (e.g., a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this article, we answer this question using a model in which the genome is a circular covering walk. We also give a polynomial-time algorithm to find such strings, which we call omnitigs. Our experiments show that omnitigs are 66%–82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.
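As context for the comparison above, unitigs (the baseline that omnitigs lengthen) are maximal non-branching paths in the de Bruijn graph. Below is a minimal sketch of unitig extraction; the graph construction, k-mer size, and function names are illustrative assumptions, and this is not the paper's omnitig algorithm, which solves the more general safe-string problem.

```python
# Minimal sketch of unitig extraction, the baseline that omnitigs extend.
# NOT the paper's omnitig algorithm; representation and names are assumed.
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of its successor (k-1)-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def unitigs(graph):
    """Report maximal non-branching paths, starting at junction nodes."""
    indeg = defaultdict(int)
    for u in list(graph):
        for v in graph[u]:
            indeg[v] += 1

    def junction(n):
        return len(graph[n]) != 1 or indeg[n] != 1

    result = []
    for u in list(graph):
        if not junction(u):
            continue
        for v in list(graph[u]):
            path = u + v[-1]
            while not junction(v):      # extend while the path is non-branching
                v = graph[v][0]
                path += v[-1]
            result.append(path)
    # dedupe parallel-edge repeats; isolated junction-free cycles are skipped
    return sorted(set(result))

print(unitigs(de_bruijn(["ACGTACGA", "GTACGAT"], 4)))
```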

27 citations

Journal ArticleDOI
TL;DR: This article introduces a framework for structured P2P overlay networks consisting of conceptual models of network hierarchy, multi-layer hierarchical DHT architectures, principles affecting the design choices, and cost models for system tradeoff analysis, performance evaluation, and scalability estimation.
Abstract: Distributed Hash Tables (DHTs) are presently used in several large-scale systems on the Internet and are envisaged as a key mechanism to provide identifier-locator separation for mobile hosts in the Future Internet. Such P2P-based systems have become increasingly complex, serving popular social networking and resource-sharing applications as well as Internet-scale infrastructures. Hierarchy is a standard mechanism for coping with heterogeneity and scalability in distributed systems. To address the shortcomings of flat DHT designs, many hierarchical P2P designs have been proposed in recent years. The latest generation is hierarchical DHTs (HDHTs), in which nodes are organized into layers and groups. This article discusses hierarchical architectures applied in structured P2P overlay networks, focusing on HDHT designs. We introduce a framework consisting of conceptual models of network hierarchy, multi-layer hierarchical DHT architectures, principles affecting the design choices, and cost models for system tradeoff analysis, performance evaluation, and scalability estimation. Based on the framework, we provide a taxonomy and survey more than 20 HDHT proposals.
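To make the layering idea concrete, here is a minimal sketch of the two-level lookup common to many HDHT designs (not any specific surveyed proposal): a top layer resolves a key to a group, and the group's own flat DHT resolves it to a node. The class names and the consistent-hashing choice are assumptions.

```python
# Illustrative two-layer DHT lookup; design details are assumptions.
import hashlib
from bisect import bisect

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Flat consistent-hashing ring: key -> responsible node."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        i = bisect(self.ring, (h(key), "")) % len(self.ring)
        return self.ring[i][1]

class HierarchicalDHT:
    """Layer 1 maps a key to a group; layer 2 is the group's own ring."""
    def __init__(self, groups):  # groups: {group_name: [node, ...]}
        self.top = Ring(groups)
        self.inner = {g: Ring(members) for g, members in groups.items()}

    def lookup(self, key):
        group = self.top.lookup(key)          # inter-group routing step
        node = self.inner[group].lookup(key)  # intra-group routing step
        return group, node

dht = HierarchicalDHT({"eu": ["node-a", "node-b"], "us": ["node-c", "node-d"]})
print(dht.lookup("some-key"))
```

Cost models of the kind the framework describes then split lookup cost into an inter-group term and an intra-group term, each scaling with the size of its own layer rather than with the whole network.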

27 citations

Proceedings Article
31 Mar 2018
TL;DR: An information-theoretic criterion for Bayesian network structure learning called quotient normalized maximum likelihood (qNML), which satisfies the property of score equivalence, is decomposable, and is completely free of adjustable hyperparameters.
Abstract: We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.
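As a reading of the abstract (symbols are assumed, not quoted from the paper), the criterion can be sketched as a quotient of two "single multinomial" NML distributions per family, obtained by treating the relevant data columns as one categorical variable:

```latex
% Hedged sketch of the qNML score; notation assumed, not quoted from the paper.
\[
  s^{\mathrm{qNML}}(G; D) \;=\; \sum_{i=1}^{n}
    \log \frac{P^{1}_{\mathrm{NML}}\bigl(D_{\{X_i\} \cup \mathrm{Pa}_i}\bigr)}
              {P^{1}_{\mathrm{NML}}\bigl(D_{\mathrm{Pa}_i}\bigr)},
  \qquad
  P^{1}_{\mathrm{NML}}(D) \;=\;
    \frac{P\bigl(D \mid \hat{\theta}(D)\bigr)}
         {\sum_{D'} P\bigl(D' \mid \hat{\theta}(D')\bigr)}.
\]
```

The normalizing sum in the denominator (the parametric complexity of a single multinomial) is the quantity the Szpankowski-Weinberger approximation mentioned in the abstract estimates.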

27 citations

Posted Content
TL;DR: In this paper, the authors extend the EPCA model toolbox by presenting the first exponential family multi-view learning methods for partial least squares and canonical correlation analysis, based on a unified representation of EPCA as a matrix factorization of the natural parameters of the exponential family.
Abstract: Exponential family extensions of principal component analysis (EPCA) have received a considerable amount of attention in recent years, demonstrating the growing need for basic modeling tools that do not assume the squared loss or Gaussian distribution. We extend the EPCA model toolbox by presenting the first exponential family multi-view learning methods for partial least squares and canonical correlation analysis, based on a unified representation of EPCA as a matrix factorization of the natural parameters of the exponential family. The models are based on a new family of priors that are generally usable for all such factorizations. We also introduce new inference strategies, and demonstrate how the methods outperform earlier ones when the Gaussianity assumption does not hold.
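The shared building block is standard EPCA (in the style of Collins et al.): each observation gets an exponential-family likelihood whose natural parameters form a low-rank matrix. A minimal sketch, with symbols assumed:

```latex
% Minimal sketch of EPCA as natural-parameter factorization; symbols assumed.
\[
  \log p(X \mid \Theta) \;=\; \sum_{i,j}
    \bigl( x_{ij}\,\theta_{ij} - F(\theta_{ij}) \bigr),
  \qquad
  \Theta \;=\; U V^{\top},
\]
```

where F is the log-partition function of the chosen family; the Gaussian case F(θ) = θ²/2 recovers the squared loss of ordinary PCA, and, as the abstract indicates, the multi-view extensions couple views through factorizations of their natural-parameter matrices.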

27 citations

Journal ArticleDOI
TL;DR: The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based classifiers, Naive Bayes, C4.5, AdaBoost, kNN, and SVM, whose accuracies had been reported on the same three corpora.
Abstract: We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which can be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. 3LM is a competitive learning algorithm that avoids the over-smoothing characteristic of centroid-based classifiers by using a different class representative, which we call a clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a "balanced state" for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by "leashing" the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated from all other classes by a hyperplane. Lifelong learning with a fixed learning rate allows 3LM to adapt to a possibly changing distribution of the data and to continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on the Reuters-21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based classifiers, Naive Bayes, C4.5, AdaBoost, kNN, and SVM, whose accuracies had been reported on the same three corpora.
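A hedged sketch of the update dynamics the abstract describes; the dot-product similarity, learning rate, and leash strength below are assumptions, not the paper's exact rule:

```python
# Sketch of a 3LM-style update on one document; parameters are assumed.
import numpy as np

def on_document(clusterheads, centroids, x, true_class, lr=0.1, leash=0.05):
    """clusterheads, centroids: dict class -> vector; x: document vector."""
    predicted = max(clusterheads, key=lambda c: clusterheads[c] @ x)
    if predicted != true_class:
        head = clusterheads[true_class]
        head += lr * (x - head)                         # drawn toward the mistake
        head += leash * (centroids[true_class] - head)  # leashed to its centroid
    return predicted

heads = {"spam": np.array([1.0, 0.2]), "ham": np.array([0.1, 1.0])}
cents = {k: v.copy() for k, v in heads.items()}
print(on_document(heads, cents, np.array([0.9, 0.8]), "ham"))
```

The leash term caps how far a clusterhead can drift from its class centroid, which is the abstract's stated mechanism against over-fitting.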

27 citations


Authors


Name | H-index | Papers | Citations
Dimitri P. Bertsekas | 94 | 332 | 85,939
Olli Kallioniemi | 90 | 353 | 42,021
Heikki Mannila | 72 | 295 | 26,500
Jukka Corander | 66 | 411 | 17,220
Jaakko Kangasjärvi | 62 | 146 | 17,096
Aapo Hyvärinen | 61 | 301 | 44,146
Samuel Kaski | 58 | 522 | 14,180
Nadarajah Asokan | 58 | 327 | 11,947
Aristides Gionis | 58 | 292 | 19,300
Hannu Toivonen | 56 | 192 | 19,316
Nicola Zamboni | 53 | 128 | 11,397
Jorma Rissanen | 52 | 151 | 22,720
Tero Aittokallio | 52 | 271 | 8,689
Juha Veijola | 52 | 261 | 19,588
Juho Hamari | 51 | 176 | 16,631
Network Information
Related Institutions (5)
Google | 39.8K papers, 2.1M citations | 93% related
Microsoft | 86.9K papers, 4.1M citations | 93% related
Carnegie Mellon University | 104.3K papers, 5.9M citations | 91% related
Facebook | 10.9K papers, 570.1K citations | 91% related

Performance Metrics

No. of papers from the Institution in previous years:

Year | Papers
2023 | 1
2022 | 4
2021 | 85
2020 | 97
2019 | 140
2018 | 127