Showing papers on "Tree (data structure)" published in 2016

Journal Article•DOI•

Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees

Ivica Letunic, Peer Bork¹•Institutions (1)

08 Jul 2016-Nucleic Acids Research

TL;DR: iTOL 3 is the first tool that supports direct visualization of the recently proposed phylogenetic placements format, and its account system has been redesigned to simplify the management of trees in user-defined workspaces and projects.

Abstract: Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. The current version was completely redesigned and rewritten, utilizing current web technologies for speedy and streamlined processing. Numerous new features were introduced and several new data types are now supported. Trees with up to 100,000 leaves can now be efficiently displayed. Full interactive control over precise positioning of various annotation features and an unlimited number of datasets allow the easy creation of complex tree visualizations. iTOL 3 is the first tool which supports direct visualization of the recently proposed phylogenetic placements format. Finally, iTOL's account system has been redesigned to simplify the management of trees in user-defined workspaces and projects, as it is heavily used and currently handles already more than 500,000 trees from more than 10,000 individual users.

4,190 citations

Journal Article•DOI•

W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis.

Jana Trifinopoulos¹, Lam Tung Nguyen¹, Arndt von Haeseler¹, Bui Quang Minh¹•Institutions (1)

Medical University of Vienna¹

08 Jul 2016-Nucleic Acids Research

TL;DR: W-IQ-TREE supports multiple sequence types in common alignment formats and a wide range of evolutionary models including mixture and partition models, performing fast model selection, partition scheme finding, efficient tree reconstruction, ultrafast bootstrapping, branch tests, and tree topology tests.

Abstract: This article presents W-IQ-TREE, an intuitive and user-friendly web interface and server for IQ-TREE, an efficient phylogenetic software for maximum likelihood analysis. W-IQ-TREE supports multiple sequence types (DNA, protein, codon, binary and morphology) in common alignment formats and a wide range of evolutionary models including mixture and partition models. W-IQ-TREE performs fast model selection, partition scheme finding, efficient tree reconstruction, ultrafast bootstrapping, branch tests, and tree topology tests. All computations are conducted on a dedicated computer cluster and the users receive the results via URL or email. W-IQ-TREE is available at http://iqtree.cibiv.univie.ac.at. It is free and open to all users and there is no login requirement.

2,488 citations

Journal Article•DOI•

A Secure and Dynamic Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data

Zhihua Xia¹, Xinhui Wang¹, Xingming Sun¹, Qian Wang²•Institutions (2)

Nanjing University of Information Science and Technology¹, Wuhan University²

01 Feb 2016-IEEE Transactions on Parallel and Distributed Systems

TL;DR: This paper constructs a special tree-based index structure and proposes a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword ranked search over encrypted cloud data, which simultaneously supports dynamic update operations like deletion and insertion of documents.

Abstract: Due to the increasing popularity of cloud computing, more and more data owners are motivated to outsource their data to cloud servers for great convenience and reduced cost in data management. However, sensitive data should be encrypted before outsourcing for privacy requirements, which obsoletes data utilization like keyword-based document retrieval. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data, which simultaneously supports dynamic update operations like deletion and insertion of documents. Specifically, the vector space model and the widely-used TF × IDF model are combined in the index construction and query generation. We construct a special tree-based index structure and propose a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword ranked search. The secure kNN algorithm is utilized to encrypt the index and query vectors, and meanwhile ensure accurate relevance score calculation between encrypted index and query vectors. In order to resist statistical attacks, phantom terms are added to the index vector for blinding search results. Due to the use of our special tree-based index structure, the proposed scheme can achieve sub-linear search time and deal with the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.
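To make the idea of a greedy depth-first search over a tree index concrete, here is a minimal plaintext sketch in Python. It is not the paper's scheme: there is no encryption and no phantom terms, and the index layout is an assumption (each internal node stores the element-wise maximum of its children's vectors, so that its score is an upper bound for every document below it). The names `Node`, `build`, and `top_k` are hypothetical, and the bound is only valid when all vector and query entries are non-negative, as TF and IDF weights are.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    vec: List[float]                # leaf: document vector; internal: element-wise max of children
    doc_id: Optional[str] = None    # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(docs: List[Tuple[str, List[float]]]) -> Node:
    """Build a roughly balanced binary index bottom-up from (doc_id, vector) pairs."""
    if not docs:
        raise ValueError("need at least one document")
    level = [Node(vec=list(v), doc_id=d) for d, v in docs]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                a, b = level[i], level[i + 1]
                merged = [max(x, y) for x, y in zip(a.vec, b.vec)]
                nxt.append(Node(vec=merged, left=a, right=b))
            else:
                nxt.append(level[i])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0]


def score(vec: List[float], query: List[float]) -> float:
    """Relevance score as the inner product of index vector and query vector."""
    return sum(x * q for x, q in zip(vec, query))


def top_k(root: Node, query: List[float], k: int) -> List[Tuple[float, str]]:
    """Visit the higher-scoring child first and prune any subtree whose
    upper bound cannot beat the current k-th best score."""
    best: List[Tuple[float, str]] = []  # min-heap of (score, doc_id)

    def visit(node: Node) -> None:
        s = score(node.vec, query)
        if len(best) == k and s <= best[0][0]:
            return  # nothing below this node can enter the top k
        if node.doc_id is not None:
            if len(best) < k:
                heapq.heappush(best, (s, node.doc_id))
            else:
                heapq.heapreplace(best, (s, node.doc_id))
            return
        children = sorted((node.left, node.right),
                          key=lambda c: score(c.vec, query), reverse=True)
        for child in children:
            visit(child)

    visit(root)
    return sorted(best, reverse=True)
```

Because whole subtrees are skipped as soon as their bound falls below the k-th best score, a search typically touches far fewer nodes than a linear scan, which is the intuition behind the sub-linear search time the abstract reports.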

976 citations

Tree Rings And Climate

Sarah Eichmann

01 Jan 2016

803 citations

Journal Article•DOI•

Fast coalescent-based computation of local branch support from quartet frequencies

Erfan Sayyari¹, Siavash Mirarab¹•Institutions (1)

University of California, San Diego¹

15 Apr 2016-Molecular Biology and Evolution

TL;DR: This article proposes a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees and evaluates the precision and recall of the local PP on a wide set of simulated and biological datasets.

Abstract: Species tree reconstruction is complicated by effects of incomplete lineage sorting, commonly modeled by the multi-species coalescent model (MSC). While there has been substantial progress in developing methods that estimate a species tree given a collection of gene trees, less attention has been paid to fast and accurate methods of quantifying support. In this article, we propose a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees. We then show how the quartet support can be used in the context of the MSC to compute (1) the local posterior probability (PP) that the branch is in the species tree and (2) the length of the branch in coalescent units. We evaluate the precision and recall of the local PP on a wide set of simulated and biological datasets, and show that it has very high precision and improved recall compared with multi-locus bootstrapping. The estimated branch lengths are highly accurate when gene tree estimation error is low, but are underestimated when gene tree estimation error increases. Computation of both the branch length and local PP is implemented as new features in ASTRAL.

578 citations

Proceedings Article•

Convolutional neural networks over tree structures for programming language processing

Lili Mou¹, Ge Li¹, Lu Zhang¹, Tao Wang², Zhi Jin¹•Institutions (2)

Peking University¹, Stanford University²

12 Feb 2016

TL;DR: In this article, a tree-based convolutional neural network (TBCNN) is proposed for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information.

Abstract: Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.

551 citations

Journal Article•DOI•

TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis

Zhicheng Ji¹, Hongkai Ji¹•Institutions (1)

Johns Hopkins University¹

27 Jul 2016-Nucleic Acids Research

TL;DR: TSCAN is a software tool developed to better support in silico pseudo-Time reconstruction in Single-Cell RNA-seq ANalysis, and quantitative measures are developed to objectively evaluate and compare different pseudo-time reconstruction methods.

Abstract: When analyzing single-cell RNA-seq data, constructing a pseudo-temporal path to order cells based on the gradual transition of their transcriptomes is a useful way to study gene expression dynamics in a heterogeneous cell population. Currently, a limited number of computational tools are available for this task, and quantitative methods for comparing different tools are lacking. Tools for Single Cell Analysis (TSCAN) is a software tool developed to better support in silico pseudo-Time reconstruction in Single-Cell RNA-seq ANalysis. TSCAN uses a cluster-based minimum spanning tree (MST) approach to order cells. Cells are first grouped into clusters and an MST is then constructed to connect cluster centers. Pseudo-time is obtained by projecting each cell onto the tree, and the ordered sequence of cells can be used to study dynamic changes of gene expression along the pseudo-time. Clustering cells before MST construction reduces the complexity of the tree space. This often leads to improved cell ordering. It also allows users to conveniently adjust the ordering based on prior knowledge. TSCAN has a graphical user interface (GUI) to support data visualization and user interaction. Furthermore, quantitative measures are developed to objectively evaluate and compare different pseudo-time reconstruction methods. TSCAN is available at https://github.com/zji90/TSCAN and as a Bioconductor package.
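The cluster-then-MST idea can be illustrated with a small Python sketch. This is not TSCAN's implementation (TSCAN is an R package): it assumes the cluster centers are already computed, uses Prim's algorithm for the MST, and assigns pseudo-time as arc length along a path of cluster indices supplied by the caller. How the real tool selects the path and projects cells is not stated in the abstract, so the helper names `mst_edges`, `project`, and `pseudotime` and the projection rule are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mst_edges(centers: Sequence[Point]) -> List[Tuple[int, int]]:
    """Prim's algorithm on the complete graph of cluster centers."""
    n = len(centers)
    in_tree = {0}
    edges: List[Tuple[int, int]] = []
    while len(in_tree) < n:
        # Cheapest edge from the tree to a center not yet in the tree.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: math.dist(centers[e[0]], centers[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def project(cell: Point, a: Point, b: Point) -> float:
    """Fractional position of `cell` projected onto segment a->b, clamped to [0, 1].
    Assumes a and b are distinct points."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, cell)]
    t = sum(u * v for u, v in zip(ab, ac)) / sum(v * v for v in ab)
    return max(0.0, min(1.0, t))


def pseudotime(cell: Point, centers: Sequence[Point], path: Sequence[int]) -> float:
    """Arc-length position of `cell` along the polyline through centers[path],
    using the path segment closest to the cell."""
    best = None
    offset = 0.0
    for u, v in zip(path, path[1:]):
        a, b = centers[u], centers[v]
        seg = math.dist(a, b)
        t = project(cell, a, b)
        foot = tuple(x + t * (y - x) for x, y in zip(a, b))
        d = math.dist(cell, foot)
        if best is None or d < best[0]:
            best = (d, offset + t * seg)
        offset += seg
    return best[1]
```

Sorting cells by the value returned from `pseudotime` gives one ordering consistent with the tree; working on a handful of cluster centers rather than on every cell is what keeps the tree small.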

468 citations

Journal Article•DOI•

Branch-and-bound algorithms

David R. Morrison, Sheldon H. Jacobson¹, Jason J. Sauppe², Edward C. Sewell³•Institutions (3)

University of Illinois at Urbana–Champaign¹, University of Wisconsin–La Crosse², Southern Illinois University Edwardsville³

01 Feb 2016-Discrete Optimization

TL;DR: A description of recent research advances in the design of B&B algorithms is presented, particularly with regards to the search strategy, the branching strategy, and the pruning rules.
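As a concrete illustration of the three design choices named above, the following Python sketch applies branch and bound to the 0/1 knapsack problem. The problem, the function name `knapsack_bb`, and the specific choices are assumptions made for illustration, not taken from the survey: the search strategy is best-first (a heap ordered by upper bound), the branching strategy is take-or-skip on items sorted by value density, and the pruning rule discards any node whose fractional-relaxation bound cannot beat the incumbent. Weights are assumed to be positive.

```python
import heapq
from typing import List


def knapsack_bb(values: List[float], weights: List[float], capacity: float) -> float:
    """Best-first branch and bound for 0/1 knapsack; returns the optimal value."""
    # Branching order: items sorted by value density (assumes positive weights).
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    n = len(items)

    def bound(i: int, value: float, room: float) -> float:
        """Upper bound from the fractional relaxation of items i..n-1."""
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w  # take a fraction of the first misfit
        return value

    best = 0  # incumbent: value of the best feasible selection seen so far
    # Search strategy: pop the node with the largest bound (negated for heapq).
    heap = [(-bound(0, 0, capacity), 0, 0, capacity)]
    while heap:
        neg_ub, i, value, room = heapq.heappop(heap)
        if -neg_ub <= best:
            break  # pruning rule: no remaining node can beat the incumbent
        if i == n:
            continue
        v, w = items[i]
        if w <= room:  # branch 1: take item i (still a feasible selection)
            best = max(best, value + v)
            heapq.heappush(
                heap, (-bound(i + 1, value + v, room - w), i + 1, value + v, room - w)
            )
        # Branch 2: skip item i.
        heapq.heappush(heap, (-bound(i + 1, value, room), i + 1, value, room))
    return best
```

Because the heap always yields the node with the largest bound, the search can stop at the first popped node whose bound does not exceed the incumbent; a depth-first strategy would instead need to keep checking each node individually.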

340 citations

Proceedings Article•DOI•

Natural Language Inference by Tree-Based Convolution and Heuristic Matching

Lili Mou¹, Rui Men², Ge Li², Yan Xu², Lu Zhang², Rui Yan², Zhi Jin²•Institutions (2)

Chinese Ministry of Education¹, Peking University²

01 Aug 2016

TL;DR: In this model, a tree-based convolutional neural network (TBCNN) captures sentence-level semantics; heuristic matching layers such as concatenation and element-wise product/difference then combine the information in individual sentences.

Abstract: In this paper, we propose the TBCNN-pair model to recognize entailment and contradiction between two sentences. In our model, a tree-based convolutional neural network (TBCNN) captures sentence-level semantics; then heuristic matching layers like concatenation, element-wise product/difference combine the information in individual sentences. Experimental results show that our model outperforms existing sentence encoding-based approaches by a large margin.

338 citations

Posted Content•

Modeling and Propagating CNNs in a Tree Structure for Visual Tracking.

Hyeonseob Nam, Mooyeol Baek, Bohyung Han

25 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: An online visual tracking algorithm that manages multiple target appearance models in a tree structure, using Convolutional Neural Networks to represent target appearances; multiple CNNs collaborate to estimate target states and determine the desirable paths for online model updates in the tree.

Abstract: We present an online visual tracking algorithm by managing multiple target appearance models in a tree structure. The proposed algorithm employs Convolutional Neural Networks (CNNs) to represent target appearances, where multiple CNNs collaborate to estimate target states and determine the desirable paths for online model updates in the tree. By maintaining multiple CNNs in diverse branches of tree structure, it is convenient to deal with multi-modality in target appearances and preserve model reliability through smooth updates along tree paths. Since multiple CNNs share all parameters in convolutional layers, it takes advantage of multiple models with little extra cost by saving memory space and avoiding redundant network evaluations. The final target state is estimated by sampling target candidates around the state in the previous frame and identifying the best sample in terms of a weighted average score from a set of active CNNs. Our algorithm illustrates outstanding performance compared to the state-of-the-art techniques in challenging datasets such as online tracking benchmark and visual object tracking challenge.

335 citations

Journal Article•DOI•

A review on machine learning principles for multi-view biological data integration.

Yifeng Li¹, Fang-Xiang Wu², Alioune Ngom³•Institutions (3)

National Research Council¹, University of Saskatchewan², University of Windsor³

22 Dec 2016-Briefings in Bioinformatics

TL;DR: It is shown that Bayesian models are able to use prior information and model measurements with various distributions, and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.

Abstract: Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.

Journal Article•DOI•

Fast Dating Using Least-Squares Criteria and Algorithms

Thu-Hien To¹, Matthieu Jung¹, Samantha Lycett², Olivier Gascuel¹•Institutions (2)

University of Montpellier¹, University of Edinburgh²

01 Jan 2016-Systematic Biology

TL;DR: Very fast dating algorithms, based on a Gaussian model closely related to the Langley–Fitch molecular-clock model, are presented, showing that this model is robust to uncorrelated violations of the molecular clock.

Abstract: Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than a thousand taxa are becoming increasingly common, notably with viruses (e.g., human immunodeficiency virus (HIV)). Dating ancestral events is one of the first, essential goals with such data. However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here, we present very fast dating algorithms, based on a Gaussian model closely related to the Langley–Fitch molecular-clock model. We show that this model is robust to uncorrelated violations of the molecular clock. Our algorithms apply to serial data, where the tips of the tree have been sampled through time. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods (e.g., midpoint). Our algorithms exploit the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra. We distinguish between an unconstrained setting and the case where the temporal precedence constraint (i.e., an ancestral node must be older than its daughter nodes) is accounted for. With rooted trees, the former is solved using linear algebra in linear computing time (i.e., proportional to the number of taxa), while the resolution of the latter, constrained setting, is based on an active-set method that runs in nearly linear time. With unrooted trees the computing time becomes (nearly) quadratic (i.e., proportional to the square of the number of taxa). In all cases, very large input trees (>10,000 taxa) can easily be processed and transformed into time-scaled trees. We compare these algorithms to standard methods (root-to-tip, r8s version of Langley–Fitch method, and BEAST). Using simulated data, we show that their estimation accuracy is similar to that of the most sophisticated methods, while their computing time is much faster. We apply these algorithms on a large data set comprising 1194 strains of Influenza virus from the pdm09 H1N1 Human pandemic. Again the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. These algorithms are implemented in the LSD software (least-squares dating), which can be downloaded from http://www.atgc-montpellier.fr/LSD/, along with all our data sets and detailed results. An Online Appendix, providing additional algorithm descriptions, tables, and figures can be found in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.968t3.

Journal Article•DOI•

rotl: an R package to interact with the Open Tree of Life data

François Michonneau¹, François Michonneau², Joseph W. Brown³, David J. Winter⁴•Institutions (4)

University of Florida¹, Florida Museum of Natural History², University of Michigan³, Arizona State University⁴

01 Dec 2016-Methods in Ecology and Evolution

TL;DR: The Open Tree of Life (OTL) project provides a digital tree that encompasses all organisms, built by combining taxonomic information and published phylogenies, as well as the source data used to build it; rotl is an R package to search and download these data directly in R.

Abstract: While phylogenies have been getting easier to build, it has been difficult to reuse, combine and synthesize the information they provide because published trees are often only available as image files, and taxonomic information is not standardized across studies. The Open Tree of Life (OTL) project addresses these issues by providing a digital tree that encompasses all organisms, built by combining taxonomic information and published phylogenies. The project also provides tools and services to query and download parts of this synthetic tree, as well as the source data used to build it. Here, we present rotl, an R package to search and download data from the Open Tree of Life directly in R. rotl uses common data structures allowing researchers to take advantage of the rich set of tools and methods that are available in R to manipulate, analyse and visualize phylogenies. Here, and in the vignettes accompanying the package, we demonstrate how rotl can be used with other R packages to analyse biodiversity data. As phylogenies are being used in a growing number of applications, rotl facilitates access to phylogenetic data and allows their integration with statistical methods and data sources available in R.

Proceedings Article•DOI•

FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory

Ismail Oukid¹, Johan Lasperas, Anisoara Nica, Thomas Willhalm², Wolfgang Lehner¹•Institutions (2)

Dresden University of Technology¹, Intel²

14 Jun 2016

TL;DR: A novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree) that achieves similar performance to DRAM-based counterparts and a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory is proposed.

Abstract: The advent of Storage Class Memory (SCM) is driving a rethink of storage systems towards a single-level architecture where memory and storage are merged. In this context, several works have investigated how to design persistent trees in SCM as a fundamental building block for these novel systems. However, these trees are significantly slower than DRAM-based counterparts since trees are latency-sensitive and SCM exhibits higher latencies than DRAM. In this paper we propose a novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree) that achieves similar performance to DRAM-based counterparts. In this novel design, leaf nodes are persisted in SCM while inner nodes are placed in DRAM and rebuilt upon recovery. The FPTree uses Fingerprinting, a technique that limits the expected number of in-leaf probed keys to one. In addition, we propose a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory. We conduct a thorough performance evaluation and show that the FPTree outperforms state-of-the-art persistent trees with different SCM latencies by up to a factor of 8.2. Moreover, we show that the FPTree scales very well on a machine with 88 logical cores. Finally, we integrate the evaluated trees in memcached and a prototype database. We show that the FPTree incurs an almost negligible performance overhead over using fully transient data structures, while significantly outperforming other persistent trees.
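The fingerprinting idea can be sketched in a few lines of Python. This is an illustration only, not the FPTree's layout: the `Leaf` class, the use of `zlib.crc32` as the hash, and `None` as the empty-slot marker are all assumptions, and persistence, concurrency, and node splits are left out. The point it shows is that a lookup scans one-byte fingerprints first and compares a full key only on a fingerprint match.

```python
import zlib
from typing import Any, List, Optional


def fingerprint(key: bytes) -> int:
    """One-byte hash of a key (the hash function is an arbitrary choice here)."""
    return zlib.crc32(key) & 0xFF


class Leaf:
    """Unsorted leaf node with a fingerprint array alongside its keys."""

    def __init__(self, capacity: int = 32) -> None:
        self.fps: List[Optional[int]] = [None] * capacity   # None marks an empty slot
        self.keys: List[Optional[bytes]] = [None] * capacity
        self.vals: List[Any] = [None] * capacity
        self.key_probes = 0  # number of full-key comparisons, for illustration

    def _find(self, key: bytes) -> int:
        fp = fingerprint(key)
        for i, slot_fp in enumerate(self.fps):
            if slot_fp == fp:            # cheap one-byte comparison first
                self.key_probes += 1     # only now is the full key touched
                if self.keys[i] == key:
                    return i
        return -1

    def put(self, key: bytes, value: Any) -> bool:
        """Insert or update; returns False if the leaf is full."""
        i = self._find(key)
        if i < 0:
            try:
                i = self.fps.index(None)
            except ValueError:
                return False             # a real tree would split the leaf here
            self.fps[i] = fingerprint(key)
            self.keys[i] = key
        self.vals[i] = value
        return True

    def get(self, key: bytes) -> Any:
        i = self._find(key)
        return self.vals[i] if i >= 0 else None
```

With a one-byte fingerprint, two different keys collide with probability about 1/256, so a successful lookup in a small leaf usually compares exactly one full key, which is the kind of behavior the abstract describes.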

Proceedings Article•

Learning to branch in Mixed Integer Programming

Elias B. Khalil¹, Pierre Le Bodic¹, Le Song¹, George L. Nemhauser¹, Bistra Dilkina¹•Institutions (1)

Georgia Institute of Technology¹

12 Feb 2016

TL;DR: This work proposes a machine learning (ML) framework for variable branching in MIP, and observes the decisions made by Strong Branching, a time-consuming strategy that produces small search trees, collecting features that characterize the candidate branching variables at each node of the tree.

Abstract: The design of strategies for branching in Mixed Integer Programming (MIP) is guided by cycles of parameter tuning and offline experimentation on an extremely heterogeneous testbed, using the average performance. Once devised, these strategies (and their parameter settings) are essentially input-agnostic. To address these issues, we propose a machine learning (ML) framework for variable branching in MIP. Our method observes the decisions made by Strong Branching (SB), a time-consuming strategy that produces small search trees, collecting features that characterize the candidate branching variables at each node of the tree. Based on the collected data, we learn an easy-to-evaluate surrogate function that mimics the SB strategy, by means of solving a learning-to-rank problem, common in ML. The learned ranking function is then used for branching. The learning is instance-specific, and is performed on-the-fly while executing a branch-and-bound search to solve the instance. Experiments on benchmark instances indicate that our method produces significantly smaller search trees than existing heuristics, and is competitive with a state-of-the-art commercial solver.

Journal Article•DOI•

Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting.

Siavash Mirarab¹, Shamsuzzoha Bayzid¹, Tandy Warnow², Tandy Warnow¹•Institutions (2)

University of Texas at Austin¹, University of Illinois at Urbana–Champaign²

01 May 2016-Systematic Biology

TL;DR: A study to evaluate MP-EST, one of the most popular species tree estimation methods designed to address ILS, as well as concatenation under maximum likelihood, the greedy consensus, and two supertree methods (Matrix Representation with Parsimony and Matrix Representation with Likelihood).

Abstract: Species tree estimation is complicated by processes, such as gene duplication and loss and incomplete lineage sorting (ILS), that cause discordance between gene trees and the species tree. Furthermore, while concatenation, a traditional approach to tree estimation, has excellent performance under many conditions, the expectation is that the best accuracy will be obtained through the use of species tree estimation methods that are specifically designed to address gene tree discordance. In this article, we report on a study to evaluate MP-EST, one of the most popular species tree estimation methods designed to address ILS, as well as concatenation under maximum likelihood, the greedy consensus, and two supertree methods (Matrix Representation with Parsimony and Matrix Representation with Likelihood). Our study shows that several factors impact the absolute and relative accuracy of methods, including the number of gene trees, the accuracy of the estimated gene trees, and the amount of ILS. Concatenation can be more accurate than the best summary methods in some cases (mostly when the gene trees have poor phylogenetic signal or when the level of ILS is low), but summary methods are generally more accurate than concatenation when there are an adequate number of sufficiently accurate gene trees. Our study suggests that coalescent-based species tree methods may be key to estimating highly accurate species trees from multiple loci.

Book Chapter•DOI•

Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

Randal S. Olson¹, Ryan J. Urbanowicz¹, Peter C. Andrews¹, Nicole A. Lavender², LaCreis R. Kidd², Jason H. Moore¹•Institutions (2)

University of Pennsylvania¹, University of Louisville²

30 Mar 2016

TL;DR: This work implements a Tree-based Pipeline Optimization Tool (TPOT) and shows that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets.

Abstract: Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning—pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.

Journal Article•DOI•

AC-Feasibility on Tree Networks is NP-Hard

Karsten Lehmann¹, Alban Grastien¹, Pascal Van Hentenryck¹•Institutions (1)

Australian National University¹

01 Jan 2016-IEEE Transactions on Power Systems

TL;DR: This paper shows that AC-feasibility, i.e., deciding whether some generator dispatch can satisfy a given demand, is NP-hard for tree networks.

Abstract: Recent years have witnessed significant interest in convex relaxations of the power flows, with several papers showing that the second-order cone relaxation is tight for tree networks under various conditions on loads or voltages. This paper shows that ac-feasibility, i.e., to find whether some generator dispatch can satisfy a given demand, is NP-hard for tree networks.

Phylogenetic Inference for Binary Data on Dendograms Using Markov

Bob Mau, Michael A. Newton

01 Jan 2016

TL;DR: Using a stochastic model for the evolution of discrete characters among a group of organisms, this paper derived a Markov chain that simulates a Bayesian posterior distribution on the space of dendograms.

Abstract: Using a stochastic model for the evolution of discrete characters among a group of organisms, we derive a Markov chain that simulates a Bayesian posterior distribution on the space of dendograms. A transformation of the tree into a canonical cophenetic matrix form, with distinct entries along its superdiagonal, suggests a simple proposal distribution for selecting candidate trees "close" to the current tree in the chain. We apply the consequent Metropolis algorithm to published restriction site data on nine species of plants. The Markov chain mixes well from random starting trees, generating reproducible estimates and confidence sets for the path of evolution.

Journal Article•DOI•

Privately Evaluating Decision Trees and Random Forests

David J. Wu¹, Tony Feng², Michael Naehrig³, Kristin E. Lauter³•Institutions (3)

Stanford University¹, Massachusetts Institute of Technology², Microsoft³

01 Oct 2016

TL;DR: In this article, two protocols for private evaluation of decision trees and random forests are presented, where the server holds a model (either a tree or a forest), and the client holds an input (a feature vector). At the conclusion of the protocol, the client learns only the model's output on its input and a few generic parameters concerning the model; the server learns nothing.

Abstract: Decision trees and random forests are common classifiers with widespread use. In this paper, we develop two protocols for privately evaluating decision trees and random forests. We operate in the standard two-party setting where the server holds a model (either a tree or a forest), and the client holds an input (a feature vector). At the conclusion of the protocol, the client learns only the model's output on its input and a few generic parameters concerning the model; the server learns nothing. The first protocol we develop provides security against semi-honest adversaries. We then give an extension of the semi-honest protocol that is robust against malicious adversaries. We implement both protocols and show that both variants are able to process trees with several hundred decision nodes in just a few seconds and a modest amount of bandwidth. Compared to previous semi-honest protocols for private decision tree evaluation, we demonstrate a tenfold improvement in computation and bandwidth.

Journal Article•DOI•

Challenges in species tree estimation under the multispecies coalescent model

Bo Xu¹, Ziheng Yang¹,²•Institutions (2)

Beijing Institute of Genomics¹, University College London²

01 Dec 2016-Genetics

TL;DR: The challenges and strategies of species tree inference for distantly related species when the molecular clock is violated are discussed, and the need for improving the computational efficiency and model realism of the likelihood methods as well as the statistical efficiency of the summary methods is highlighted.

Abstract: The multispecies coalescent (MSC) model has emerged as a powerful framework for inferring species phylogenies while accounting for ancestral polymorphism and gene tree-species tree conflict. A number of methods have been developed in the past few years to estimate the species tree under the MSC. The full likelihood methods (including maximum likelihood and Bayesian inference) average over the unknown gene trees and accommodate their uncertainties properly but involve intensive computation. The approximate or summary coalescent methods are computationally fast and are applicable to genomic datasets with thousands of loci, but do not make an efficient use of information in the multilocus data. Most of them take the two-step approach of reconstructing the gene trees for multiple loci by phylogenetic methods and then treating the estimated gene trees as observed data, without accounting for their uncertainties appropriately. In this article we review the statistical nature of the species tree estimation problem under the MSC, and explore the conceptual issues and challenges of species tree estimation by focusing mainly on simple cases of three or four closely related species. We use mathematical analysis and computer simulation to demonstrate that large differences in statistical performance may exist between the two classes of methods. We illustrate that several counterintuitive behaviors may occur with the summary methods but they are due to inefficient use of information in the data by summary methods and vanish when the data are analyzed using full-likelihood methods. These include (i) unidentifiability of parameters in the model, (ii) inconsistency in the so-called anomaly zone, (iii) singularity on the likelihood surface, and (iv) deterioration of performance upon addition of more data. 
We discuss the challenges and strategies of species tree inference for distantly related species when the molecular clock is violated, and highlight the need for improving the computational efficiency and model realism of the likelihood methods as well as the statistical efficiency of the summary methods.

Proceedings Article•DOI•

A Decision Tree Approach to Data Classification using Signal Temporal Logic

Giuseppe Bombara¹, Cristian-Ioan Vasile¹, Francisco Penedo¹, Hirotoshi Yasuoka², Calin Belta¹•Institutions (2)

Boston University¹, Denso²

11 Apr 2016

TL;DR: This paper introduces a framework for inference of timed temporal logic properties from data, and proposes extensions of the usual impurity measures from machine learning literature to handle classification of system traces by leveraging upon the robustness degree concept.

Abstract: This paper introduces a framework for inference of timed temporal logic properties from data. The dataset is given as a finite set of pairs of finite-time system traces and labels, where the labels indicate whether the traces exhibit some desired behavior (e.g., a ship traveling along a safe route). We propose a decision-tree based approach for learning signal temporal logic classifiers. The method produces binary decision trees that represent the inferred formulae. Each node of the tree contains a test associated with the satisfaction of a simple formula, optimally tuned from a predefined finite set of primitives. Optimality is assessed using heuristic impurity measures, which capture how well the current primitive splits the data with respect to the traces' labels. We propose extensions of the usual impurity measures from machine learning literature to handle classification of system traces by leveraging upon the robustness degree concept. The proposed incremental construction procedure greatly improves the execution time and the accuracy compared to existing algorithms. We present two case studies that illustrate the usefulness and the computational advantages of the algorithms. The first is an anomaly detection problem in a maritime environment. The second is a fault detection problem in an automotive powertrain system.
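
The role of an impurity measure in choosing a node test can be illustrated with a standard entropy-based information gain. This is a minimal sketch and not the paper's extended measures: it assumes each trace has already been reduced to a signed score for a candidate primitive (non-negative meaning the trace satisfies the formula), and the names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(scores, labels):
    """Impurity reduction when traces are split by the sign of their score.

    scores[i] >= 0 sends trace i to the "satisfied" branch,
    scores[i] < 0 sends it to the "violated" branch.
    """
    satisfied = [y for s, y in zip(scores, labels) if s >= 0]
    violated = [y for s, y in zip(scores, labels) if s < 0]
    n = len(labels)
    weighted = (len(satisfied) / n) * entropy(satisfied) \
        + (len(violated) / n) * entropy(violated)
    return entropy(labels) - weighted


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    # A primitive whose score separates the classes perfectly.
    print(information_gain([1.0, 0.5, -0.2, -3.0], labels))
    # A primitive that every trace satisfies carries no information.
    print(information_gain([1.0, 1.0, 1.0, 1.0], labels))
```

A tree learner would evaluate such a gain for every candidate primitive and parameter setting at a node and keep the best one; how the scores are computed from signal temporal logic formulae is outside this sketch.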

Journal Article•DOI•

Generalized random shapelet forests

Isak Karlsson¹, Panagiotis Papapetrou¹, Henrik Boström¹•Institutions (1)

Stockholm University¹

01 Sep 2016

TL;DR: A novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm, which yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster.

Abstract: Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
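
A shapelet is usually compared to a time series through the smallest distance between the shapelet and any equally long window of the series. The following is a minimal sketch of that commonly used definition, assuming plain Euclidean distance without normalization; the function name `shapelet_distance` is illustrative and not taken from the paper.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and any window of a series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    # Slide a window of the shapelet's length over the series.
    for start in range(len(series) - m + 1):
        dist = math.sqrt(
            sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        )
        best = min(best, dist)
    return best


if __name__ == "__main__":
    series = [0, 1, 2, 3, 2, 1]
    print(shapelet_distance(series, [2, 3, 2]))  # the pattern occurs exactly
    print(shapelet_distance([0, 0, 0, 0], [1, 1]))
```

A shapelet-based tree node then thresholds this distance to route a series left or right. The exhaustive window scan shown here is the costly part that motivates randomizing the choice of shapelets.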

Journal Article•DOI•

The Impact of the Tree Prior on Molecular Dating of Data Sets Containing a Mixture of Inter- and Intraspecies Sampling

Andrew M. Ritchie¹, Nathan Lo¹, Simon Y. W. Ho¹•Institutions (1)

University of Sydney¹

26 Oct 2016-Systematic Biology

TL;DR: The results suggest that tree priors do not strongly affect Bayesian molecular dating results in most cases, even when severely misspecified; however, the choice of tree prior can be significant for the accuracy of dating results in the case of data sets with mixed inter- and intraspecies sampling.

Abstract: In Bayesian phylogenetic analyses of genetic data, prior probability distributions need to be specified for the model parameters, including the tree. When Bayesian methods are used for molecular dating, available tree priors include those designed for species-level data, such as the pure-birth and birth-death priors, and coalescent-based priors designed for population-level data. However, molecular dating methods are frequently applied to data sets that include multiple individuals across multiple species. Such data sets violate the assumptions of both the speciation and coalescent-based tree priors, making it unclear which should be chosen and whether this choice can affect the estimation of node times. To investigate this problem, we used a simulation approach to produce data sets with different proportions of within- and between-species sampling under the multispecies coalescent model. These data sets were then analyzed under pure-birth, birth-death, constant-size coalescent, and skyline coalescent tree priors. We also explored the ability of Bayesian model testing to select the best-performing priors. We confirmed the applicability of our results to empirical data sets from cetaceans, phocids, and coregonid whitefish. Estimates of node times were generally robust to the choice of tree prior, but some combinations of tree priors and sampling schemes led to large differences in the age estimates. In particular, the pure-birth tree prior frequently led to inaccurate estimates for data sets containing a mixture of inter- and intraspecific sampling, whereas the birth-death and skyline coalescent priors produced stable results across all scenarios. Model testing provided an adequate means of rejecting inappropriate tree priors. Our results suggest that tree priors do not strongly affect Bayesian molecular dating results in most cases, even when severely misspecified. However, the choice of tree prior can be significant for the accuracy of dating results in the case of data sets with mixed inter- and intraspecies sampling. [Bayesian phylogenetic methods; model testing; molecular dating; node time; tree prior]

Journal Article•DOI•

A decision tree based data-driven diagnostic strategy for air handling units

Rui Yan¹, Zhenjun Ma¹, Yang Zhao², Georgios Kokogiannakis¹•Institutions (2)

University of Wollongong¹, Zhejiang University²

01 Dec 2016-Energy and Buildings

TL;DR: A decision tree based data-driven diagnostic strategy for air handling units (AHUs) is presented, in which the classification and regression tree (CART) algorithm is used for decision tree induction; the strategy is shown to achieve good diagnostic performance, with an average F-measure of 0.97.
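
For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below shows the standard definition; it assumes the balanced case (beta = 1) by default, which may or may not match the exact variant used in the paper, and the counts in the example are made up for illustration.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical confusion counts for one fault class.
    print(f_measure(tp=8, fp=2, fn=2))
```

An average F-measure over several fault classes would be the mean of this value computed per class.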

Journal Article•DOI•

Mining High Utility Patterns in One Phase without Generating Candidates

Junqiang Liu¹, Ke Wang², Benjamin C. M. Fung³•Institutions (3)

Zhejiang Gongshang University¹, Simon Fraser University², McGill University³

01 May 2016-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes a novel algorithm that finds high utility patterns in a single phase without generating candidates, using a linear data structure that yields tight bounds for pruning and identifies high utility patterns in an efficient and scalable way.

Abstract: Utility mining is a new development of data mining technology. Among utility mining problems, utility mining with the itemset share framework is a hard one, as no anti-monotonicity property holds with the interestingness measure. Prior works on this problem all employ a two-phase, candidate generation approach, with one exception that is, however, inefficient and not scalable with large databases. The two-phase approach suffers from scalability issues due to the huge number of candidates. This paper proposes a novel algorithm that finds high utility patterns in a single phase without generating candidates. The novelties lie in a high utility pattern growth approach, a lookahead strategy, and a linear data structure. Concretely, our pattern growth approach is to search a reverse set enumeration tree and to prune search space by utility upper bounding. We also look ahead to identify high utility patterns without enumeration by a closure property and a singleton property. Our linear data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility patterns in an efficient and scalable way, which targets the root cause of the scalability problems in prior algorithms. Extensive experiments on sparse and dense, synthetic and real world data suggest that our algorithm is 1 to 3 orders of magnitude more efficient and is more scalable than the state-of-the-art algorithms.
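
To make the problem statement concrete, the sketch below computes itemset utilities by brute force on a toy database. It is not the paper's algorithm (which avoids enumerating candidates); it only illustrates the usual definition of utility as quantity times unit profit, summed over the transactions containing the whole itemset. The data and the names `itemset_utility` and `high_utility_itemsets` are hypothetical.

```python
from itertools import combinations


def itemset_utility(itemset, transactions, unit_profit):
    """Total utility of an itemset over all transactions that contain it."""
    total = 0
    for tx in transactions:  # tx maps item -> purchased quantity
        if all(item in tx for item in itemset):
            total += sum(tx[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Exhaustively list itemsets whose utility reaches min_utility."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= min_utility:
                result[itemset] = utility
    return result


if __name__ == "__main__":
    transactions = [{"a": 1, "b": 2}, {"a": 2, "c": 1}, {"b": 1, "c": 3}]
    unit_profit = {"a": 5, "b": 2, "c": 1}
    print(high_utility_itemsets(transactions, unit_profit, min_utility=10))
```

In this toy database the single item `c` has utility 4, below the threshold of 10, while its superset `("a", "c")` has utility 11. This is the lack of anti-monotonicity mentioned in the abstract: a low-utility itemset cannot be pruned simply because it is low, which is why utility upper bounds are needed.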

Journal Article•DOI•

MediBoost: a Patient Stratification Tool for Interpretable Decision Making in the Era of Precision Medicine.

Gilmer Valdes¹,², José Marcio Luna¹, Eric Eaton², Charles B. Simone², Lyle H. Ungar², Timothy D. Solberg¹,²•Institutions (2)

University of California, San Francisco¹, University of Pennsylvania²

30 Nov 2016-Scientific Reports

TL;DR: MediBoost is presented, a novel framework for constructing decision trees that retain interpretability while having accuracy similar to ensemble methods, and its performance is compared to that of conventional decision trees and ensemble methods on 13 medical classification problems.

Abstract: Machine learning algorithms that are both interpretable and accurate are essential in applications such as medicine where errors can have a dire consequence. Unfortunately, there is currently a tradeoff between accuracy and interpretability among state-of-the-art methods. Decision trees are interpretable and are therefore used extensively throughout medicine for stratifying patients. Current decision tree algorithms, however, are consistently outperformed in accuracy by other, less-interpretable machine learning models, such as ensemble methods. We present MediBoost, a novel framework for constructing decision trees that retain interpretability while having accuracy similar to ensemble methods, and compare MediBoost's performance to that of conventional decision trees and ensemble methods on 13 medical classification problems. MediBoost significantly outperformed current decision tree algorithms in 11 out of 13 problems, giving accuracy comparable to ensemble methods. The resulting trees are of the same type as decision trees used throughout clinical practice but have the advantage of improved accuracy. Our algorithm thus gives the best of both worlds: it grows a single, highly interpretable tree that has the high accuracy of ensemble methods.

Meta-analysis Reveals that Hydraulic Traits Explain Cross-Species Patterns of Drought-Induced Tree Mortality across the Globe

William R. L. Anderegg

01 Dec 2016

TL;DR: A meta-analysis of species’ mortality rates across 475 species finds that species-specific mortality anomalies from community mortality rate in a given drought were associated with plant hydraulic traits, providing broad support for the hypothesis that hydraulic traits capture key mechanisms determining tree death.

Abstract: Significance: Predicting the impacts of climate extremes on plant communities is a central challenge in ecology. Physiological traits may improve prediction of drought impacts on forests globally. We perform a meta-analysis across 33 studies that span all forested biomes and find that, among the examined traits, hydraulic traits explain cross-species patterns in mortality from drought. Gymnosperm and angiosperm mortality was associated with different hydraulic traits, giving insight into the relative weights of different traits and mechanisms in mortality prediction. Our results provide a foundation for more mechanistic predictions of drought-induced tree mortality across Earth’s diverse forests.
Drought-induced tree mortality has been observed globally and is expected to increase under climate change scenarios, with large potential consequences for the terrestrial carbon sink. Predicting mortality across species is crucial for assessing the effects of climate extremes on forest community biodiversity, composition, and carbon sequestration. However, the physiological traits associated with elevated risk of mortality in diverse ecosystems remain unknown, although these traits could greatly improve understanding and prediction of tree mortality in forests. We performed a meta-analysis on species’ mortality rates across 475 species from 33 studies around the globe to assess which traits determine a species’ mortality risk. We found that species-specific mortality anomalies from community mortality rate in a given drought were associated with plant hydraulic traits. Across all species, mortality was best predicted by a low hydraulic safety margin (the difference between typical minimum xylem water potential and that causing xylem dysfunction) and xylem vulnerability to embolism. Angiosperms and gymnosperms experienced roughly equal mortality risks.
Our results provide broad support for the hypothesis that hydraulic traits capture key mechanisms determining tree death and highlight that physiological traits can improve vegetation model prediction of tree mortality during climate extremes.
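
The hydraulic safety margin described in this abstract can be written compactly. The symbols below are illustrative notation, not taken from the paper: $\Psi_{\min}$ stands for the typical minimum xylem water potential and $\Psi_{\mathrm{crit}}$ for the water potential that causes xylem dysfunction.

```latex
\mathrm{HSM} = \Psi_{\min} - \Psi_{\mathrm{crit}}
```

Because water potentials are negative, a small or negative margin means the species routinely operates close to, or beyond, the potential at which its xylem fails.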

Journal Article•DOI•

A generalized item response tree model for psychological assessments.

Minjeong Jeon¹, Paul De Boeck¹•Institutions (1)

Ohio State University¹

01 Sep 2016-Behavior Research Methods

TL;DR: A generalized item response tree model with a flexible parametric form, dimensionality, and choice of covariates for modeling item response processes with a tree structure is presented.

Abstract: A new item response theory (IRT) model with a tree structure has been introduced for modeling item response processes with a tree structure. In this paper, we present a generalized item response tree model with a flexible parametric form, dimensionality, and choice of covariates. The utilities of the model are demonstrated with two applications in psychological assessments for investigating Likert scale item responses and for modeling omitted item responses. The proposed model is estimated with the freely available R package flirt (Jeon et al., 2014b).

Applied Forest Tree Improvement

Anne Kuefer

01 Jan 2016

TL;DR: Applied Forest Tree Improvement is available in a digital library with public online access, so it can be downloaded instantly and read on any device.

Abstract: Thank you very much for downloading Applied Forest Tree Improvement. As you may know, people have looked hundreds of times for their favorite readings like this one, but end up with infected downloads. Rather than reading a good book with a cup of coffee in the afternoon, they instead cope with malicious bugs on their desktop computers. Applied Forest Tree Improvement is available in our digital library; online access to it is set as public, so you can download it instantly. Our digital library spans multiple locations, giving you the lowest latency when downloading any of our books like this one. In short, Applied Forest Tree Improvement is compatible with any device for reading.
