scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Visual Exploration of Parameter Influence on Phylogenetic Trees

TL;DR: A proposed visual-analytics approach can reveal the parameters' impact on the trees and reveal which parameters affected the tree structures, leading to a more reliable selection of the best trees.
Abstract: Evolutionary relationships between organisms are frequently derived as phylogenetic trees inferred from multiple sequence alignments (MSAs). The MSA parameter space is exponentially large, so tens of thousands of potential trees can emerge for each dataset. A proposed visual-analytics approach can reveal the parameters' impact on the trees. Given input trees created with different parameter settings, it hierarchically clusters the trees according to their structural similarity. The most important clusters of similar trees are shown together with their parameters. This view offers interactive parameter exploration and automatic identification of relevant parameters. Biologists applied this approach to real data of 16S ribosomal RNA and protein sequences of ion channels. It revealed which parameters affected the tree structures. This led to a more reliable selection of the best trees.
Citations
More filters
Proceedings ArticleDOI
01 Oct 2014
TL;DR: This work introduces a framework for a feedback-driven view exploration, inspired by relevance feedback approaches used in Information Retrieval, and presents an instantiation of the framework for exploration of Scatter Plot Spaces based on visual features.
Abstract: The extraction of relevant and meaningful information from multivariate or high-dimensional data is a challenging problem. One reason for this is that the number of possible representations, which might contain relevant information, grows exponentially with the amount of data dimensions. Also, not all views from a possibly large view space, are potentially relevant to a given analysis task or user. Focus+Context or Semantic Zoom Interfaces can help to some extent to efficiently search for interesting views or data segments, yet they show scalability problems for very large data sets. Accordingly, users are confronted with the problem of identifying interesting views, yet the manual exploration of the entire view space becomes ineffective or even infeasible. While certain quality metrics have been proposed recently to identify potentially interesting views, these often are defined in a heuristic way and do not take into account the application or user context. We introduce a framework for a feedback-driven view exploration, inspired by relevance feedback approaches used in Information Retrieval. Our basic idea is that users iteratively express their notion of interestingness when presented with candidate views. From that expression, a model representing the user's preferences, is trained and used to recommend further interesting view candidates. A decision support system monitors the exploration process and assesses the relevance-driven search process for convergence and stability. We present an instantiation of our framework for exploration of Scatter Plot Spaces based on visual features. We demonstrate the effectiveness of this implementation by a case study on two real-world datasets. We also discuss our framework in light of design alternatives and point out its usefulness for development of user- and context-dependent visual exploration systems.

78 citations

Journal ArticleDOI
TL;DR: CorBLOSUM type matrices outperform the BLOSUM matrices on a statistically significant level in most of the cases, especially on up-to-date databases such as ASTRAL ≥2.01.
Abstract: BLOSUM matrices belong to the most commonly used substitution matrix series for protein homology search and sequence alignments since their publication in 1992. In 2008, Styczynski et al. discovered miscalculations in the clustering step of the matrix computation. Still, the RBLOSUM64 matrix based on the corrected BLOSUM code was reported to perform worse at a statistically significant level than the BLOSUM62. Here, we present a further correction of the (R)BLOSUM code and provide a thorough performance analysis of BLOSUM-, RBLOSUM- and the newly derived CorBLOSUM-type matrices. Thereby, we assess homology search performance of these matrix-types derived from three different BLOCKS databases on all versions of the ASTRAL20, ASTRAL40 and ASTRAL70 subsets resulting in 51 different benchmarks in total. Our analysis is focused on two of the most popular BLOSUM matrices — BLOSUM50 and BLOSUM62. Our study shows that fixing small errors in the BLOSUM code results in substantially different substitution matrices with a beneficial influence on homology search performance when compared to the original matrices. The CorBLOSUM matrices introduced here performed at least as good as their BLOSUM counterparts in ∼75 % of all test cases. On up-to-date ASTRAL databases BLOSUM matrices were even outperformed by CorBLOSUM matrices in more than 86 % of the times. In contrast to the study by Styczynski et al., the tested RBLOSUM matrices also outperformed the corresponding BLOSUM matrices in most of the cases. Comparing the CorBLOSUM with the RBLOSUM matrices revealed no general performance advantages for either on older ASTRAL releases. On up-to-date ASTRAL databases however CorBLOSUM matrices performed better than their RBLOSUM counterparts in ∼74 % of the test cases. Our results imply that CorBLOSUM type matrices outperform the BLOSUM matrices on a statistically significant level in most of the cases, especially on up-to-date databases such as ASTRAL ≥2.01. Additionally, CorBLOSUM matrices are closer to those originally intended by Henikoff and Henikoff on a conceptual level. Hence, we encourage the usage of CorBLOSUM over (R)BLOSUM matrices for the task of homology search.

20 citations


Cites background from "Visual Exploration of Parameter Inf..."

  • ...The selection of the parameters in these models is a non-trivial task and an important step in homology search [5–7] and phylogeny [8, 9]....

    [...]

Journal ArticleDOI
TL;DR: This article presents a new characterization of the requirements for pipeline design and optimization, based on both a review of the literature and first-hand assessment of eight application case studies, and identifies seven future challenges for visualization research that are not met by the capabilities of today's systems.
Abstract: The rising quantity and complexity of data creates a need to design and optimize data processing pipelines—the set of data processing steps, parameters and algorithms that perform operations on the data. Visualization can support this process but, although there are many examples of systems for visual parameter analysis, there remains a need to systematically assess users’ requirements and match those requirements to exemplar visualization methods. This article presents a new characterization of the requirements for pipeline design and optimization. This characterization is based on both a review of the literature and first-hand assessment of eight application case studies. We also match these requirements with exemplar functionality provided by existing visualization tools. Thus, we provide end-users and visualization developers with a way of identifying functionality that addresses data processing problems in an application. We also identify seven future challenges for visualization research that are not met by the capabilities of today's systems.

19 citations


Cites background or methods from "Visual Exploration of Parameter Inf..."

  • ...A third option is to group similar outputs and then showonly representatives of each group [29], [43]....

    [...]

  • ...output tree to the input parameters, datasets and algorithms that are used [29], [30]....

    [...]

  • ...6) and the phylogenetic trees case study [29])....

    [...]

  • ...Therefore new visual tree comparison and parameter sensitivity exploration tools needed to be developed [29]....

    [...]

  • ...Complex output analysis after performing several pipeline steps is the main bottleneck during the comparison of phylogenetic trees [29]....

    [...]

Journal ArticleDOI
TL;DR: The problem of estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy, and medicine as discussed by the authors, although trees are estimated, their uncertainties are generally discarded in stati...
Abstract: Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy, and medicine. Although trees are estimated, their uncertainties are generally discarded in stati...

17 citations

Journal ArticleDOI
TL;DR: This study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices, and implies that PFASum matrices improve homological search performance as well as MSA quality in many cases whenCompared to conventional substitution matrices.
Abstract: Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space. We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate even increases to at least 76% for MSAs containing similar sequences. We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks.

17 citations


Cites background from "Visual Exploration of Parameter Inf..."

  • ...The selection of the right substitution matrix and corresponding gap parameters is an intricate task essential in homology search, construction of MSAs and phylogeny [1, 18, 24, 35, 37]....

    [...]

References
More filters
Proceedings ArticleDOI
15 Nov 2004
TL;DR: This work proposes an efficient algorithm, the L method, that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph, using the knee is well-known, but is not a particularly well-understood method to determine the number of clusters.
Abstract: Many clustering and segmentation algorithms both suffer from the limitation that the number of clusters/segments is specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. We investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph. Using the knee is well-known, but is not a particularly well-understood method to determine the number of clusters. We explore the feasibility of this method, and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods based on the accuracy of the number of clusters that are determined and efficiency. Our results show favorable performance for these criteria compared to the existing methods that were evaluated.

711 citations

Proceedings ArticleDOI
01 Jul 2003
TL;DR: The idea of "guaranteed visibility", where highlighted areas are treated as landmarks that must remain visually apparent at all times, is introduced in TreeJuxtaposer, a system designed to support the comparison task for large trees of several hundred thousand nodes.
Abstract: Structural comparison of large trees is a difficult task that is only partially supported by current visualization techniques, which are mainly designed for browsing. We present TreeJuxtaposer, a system designed to support the comparison task for large trees of several hundred thousand nodes. We introduce the idea of "guaranteed visibility", where highlighted areas are treated as landmarks that must remain visually apparent at all times. We propose a new methodology for detailed structural comparison between two trees and provide a new nearly-linear algorithm for computing the best corresponding node from one tree to another. In addition, we present a new rectilinear Focus+Context technique for navigation that is well suited to the dynamic linking of side-by-side views while guaranteeing landmark visibility and constant frame rates. These three contributions result in a system delivering a fluid exploration experience that scales both in the size of the dataset and the number of pixels in the display. We have based the design decisions for our system on the needs of a target audience of biologists who must understand the structural details of many phylogenetic, or evolutionary, trees. Our tool is also useful in many other application domains where tree comparison is needed, ranging from network management to call graph optimization to genealogy.

336 citations

Journal ArticleDOI
TL;DR: The use of multidimensional scaling of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees is explored and found to be useful for exploring "tree islands", for comparing sets of trees obtained from bootstrapping and Bayesian sampling, and for comparing multiple Bayesian analyses.
Abstract: We explored the use of multidimensional scaling (MDS) of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees. We found the technique to be useful for exploring "tree islands" (sets of topologically related trees among larger sets of near-optimal trees), for comparing sets of trees obtained from bootstrapping and Bayesian sampling, for comparing trees obtained from the analysis of several different genes, and for comparing multiple Bayesian analyses. The technique was also useful as a teaching aid for illustrating the progress of a Bayesian analysis and as an exploratory tool for examining large sets of phylogenetic trees. We also identified some limitations to the method, including distortions of the multidimensional tree space into two dimensions through the MDS technique, and the definition of the MDS-defined space based on a limited sample of trees. Nonetheless, the technique is a useful approach for the analysis of large sets of phylogenetic trees.

227 citations

Journal ArticleDOI
TL;DR: The spectrum of current representation techniques used on single trees, pairs of trees and finally multiple trees are discussed, in order to identify which representations are best suited to particular tasks and to find gaps in the representation space.
Abstract: This article summarises the current state of research into multiple tree visualisations. It discusses the spectrum of current representation techniques used on single trees, pairs of trees and finally multiple trees, in order to identify which representations are best suited to particular tasks and to find gaps in the representation space, in which opportunities for future multiple tree visualisation research may exist. The application areas from where multiple tree data are derived are enumerated, and the distinct structures that multiple trees make in combination with each other and the effect on subsequent approaches to their visualisation are discussed, along with the basic high-level goals of existing multiple tree visualisations.

140 citations

Journal ArticleDOI
TL;DR: This paper replaces a tedious manual process that is often based on guess-work and luck by a principled approach that systematically explores the parameter space and relies on existing ground-truth images in order to evaluate the "goodness" of an image segmentation technique.
Abstract: In this paper we address the difficult problem of parameter-finding in image segmentation. We replace a tedious manual process that is often based on guess-work and luck by a principled approach that systematically explores the parameter space. Our core idea is the following two-stage technique: We start with a sparse sampling of the parameter space and apply a statistical model to estimate the response of the segmentation algorithm. The statistical model incorporates a model of uncertainty of the estimation which we use in conjunction with the actual estimate in (visually) guiding the user towards areas that need refinement by placing additional sample points. In the second stage the user navigates through the parameter space in order to determine areas where the response value (goodness of segmentation) is high. In our exploration we rely on existing ground-truth images in order to evaluate the "goodness" of an image segmentation technique. We evaluate its usefulness by demonstrating this technique on two image segmentation algorithms: a three parameter model to detect microtubules in electron tomograms and an eight parameter model to identify functional regions in dynamic Positron Emission Tomography scans.

133 citations