
Showing papers on "Dendrogram" published in 2020


Journal ArticleDOI
TL;DR: These findings provide new perspectives on the diverse nature of marine phylosymbioses and the complex roles of the microbiome in the evolution of marine invertebrates.
Abstract: Microbiome assemblages of plants and animals often show a degree of correlation with host phylogeny; an eco-evolutionary pattern known as phylosymbiosis. Using 16S rRNA gene sequencing to profile the microbiome, paired with COI, 18S rRNA and ITS1 host phylogenies, phylosymbiosis was investigated in four groups of coral reef invertebrates (scleractinian corals, octocorals, sponges and ascidians). We tested three commonly used metrics to evaluate the extent of phylosymbiosis: (a) intraspecific versus interspecific microbiome variation, (b) topological comparisons between host phylogeny and hierarchical clustering (dendrogram) of host-associated microbial communities, and (c) correlation of host phylogenetic distance with microbial community dissimilarity. In all instances, intraspecific variation in microbiome composition was significantly lower than interspecific variation. Similarly, topological congruency between host phylogeny and the associated microbial dendrogram was more significant than would be expected by chance across all groups, except when using unweighted UniFrac distance (compared with weighted UniFrac and Bray-Curtis dissimilarity). Interestingly, all but the ascidians showed a significant positive correlation between host phylogenetic distance and associated microbial dissimilarity. Our findings provide new perspectives on the diverse nature of marine phylosymbioses and the complex roles of the microbiome in the evolution of marine invertebrates.
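Metric (c) above, the correlation of host phylogenetic distance with microbial community dissimilarity, is often assessed with a Mantel-style permutation test. The abstract does not say which test the authors used, so the following is only a minimal sketch under that assumption. The two distance matrices are invented toy values, and the function names are illustrative.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle, taking rows and columns in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(host_dist, microbe_dist, permutations=999, seed=0):
    """Return (r, p): observed matrix correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(host_dist)))
    x = upper_triangle(host_dist, identity)
    r_obs = pearson(x, upper_triangle(microbe_dist, identity))
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the hosts of one matrix only
        if pearson(x, upper_triangle(microbe_dist, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): five hosts placed on a line, so distances are |a - b|.
host_pos = [0, 1, 2, 6, 7]
microbe_pos = [0.0, 1.2, 1.9, 6.5, 7.1]
host = [[abs(a - b) for b in host_pos] for a in host_pos]
microbe = [[abs(a - b) for b in microbe_pos] for a in microbe_pos]

r, p = mantel(host, microbe)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

Permuting the labels of one matrix, rather than shuffling individual distances, keeps the dependence among distances that share a host, which is why this kind of test is preferred over an ordinary correlation test on the flattened values.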

29 citations


Journal ArticleDOI
01 Sep 2020
TL;DR: The code presented in this study can be used to sort datasets in similar research that groups data by attribute similarity, and the dendrograms demonstrate an effective representation of data sorting, grouping and classification by machine learning algorithms.
Abstract: The paper compares the classification tools of two languages, Python and R, and demonstrates the differences in their syntax and graphical output. It presents the R package {dendextend} and the Python module scipy.cluster as effective tools for dendrogram modelling with algorithms for sorting and ranking datasets. Both languages were tested on a sample dataset of marine geological measurements. The work aims to detect how bathymetric data change along 25 bathymetric profiles digitized across the Mariana Trench. The methodology includes hierarchical cluster analysis with dendrograms and a clustermap plotted with marginal dendrograms. The libraries used include Matplotlib, SciPy, NumPy and Pandas in Python and {dendextend}, {pvclust} and {magrittr} in R. The dendrograms were compared by the model-simulated clusters of the bathymetric ranges. The results show three distinct groups of profiles sorted by elevation range, with maximal depths detected in the group of profiles 19-21. The dendrogram visualization in cluster analysis demonstrates an effective representation of data sorting, grouping and classification by machine learning algorithms. The code presented in this study can be used to sort datasets in similar research that groups data by attribute similarity. Effective visualization by dendrograms is a useful modelling tool for geospatial management where data ranking is required. Compared with Python, plotting dendrograms in R offered functional and sophisticated algorithms, refined design control and fine graphical output. The interdisciplinary nature of this work lies in applying coding algorithms to spatial data analysis.
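As a rough illustration of the Python side of this comparison, the sketch below runs hierarchical clustering with scipy.cluster on stand-in data. The nine "profiles" and their depth values are invented for the example, and Ward linkage is an assumption, since the abstract does not name the linkage method used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical stand-in data: 9 profiles, each summarised as
# (mean depth, maximum depth) in metres. Not the paper's dataset.
profiles = np.array([
    [-3000, -5200], [-3100, -5300], [-2900, -5100],
    [-5000, -8000], [-5100, -8200], [-4900, -7900],
    [-7000, -10800], [-7100, -10900], [-6900, -10700],
], dtype=float)

# Build the agglomerative merge tree (Ward linkage is an assumed choice).
Z = linkage(profiles, method="ward")

# Cut the tree into at most three flat groups.
labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout without requiring Matplotlib.
tree = dendrogram(Z, no_plot=True)

print("cluster labels:", labels)
print("leaf order:", tree["ivl"])
```

Passing the same `Z` to `dendrogram` without `no_plot=True` draws the tree with Matplotlib, which is the kind of graphical output the paper compares against R.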

24 citations


Journal ArticleDOI
TL;DR: Both clustering analyses demonstrated a relationship between known Virgibacillus strains and other related bacteria based on profiling of their synthesized proteins, so larger populations of bacteria in mats can be easily screened for their potential to exhibit certain activities, which is of ecological, environmental and biotechnological significance.
Abstract: Occurrence of mineral forming and other bacteria in mats is well demonstrated. However, their high diversity shown by ribotyping has not been explained, although it could explain the diversity of formed minerals. Common biomarkers as well as phylogenic relationships are useful tools for clustering the isolates and predicting their potential role in the natural niche. In this study, a combination of MALDI-TOF MS with PCA was shown to be a powerful tool to categorize 35 mineral forming bacterial isolates from Dohat Faishakh sabkha, northwest of Qatar (23 from decaying mats and 12 from living ones). The 23 strains from decaying mats belong to the Virgibacillus genus as identified by ribotyping and are shown to be highly involved in the formation of protodolomite and a diversity of minerals. They were used as internal references for the categorization of sabkha bacteria. Combining the isolation of bacteria on selective mineral forming media, their MALDI-TOF MS protein profiling and PCA established their relationship in a phylloproteomic dendrogram based on protein biomarkers including m/z 4905, 3265, 5240, 6430, 7765, and 9815. PCA clustered the studied isolates into 3 major clusters, showing strong correspondence to the 3 phylloproteomic groups that were established by the dendrogram. Both clustering analyses demonstrated a relationship between known Virgibacillus strains and other related bacteria based on profiling of their synthesized proteins. Thus, larger populations of bacteria in mats can be easily screened for their potential to exhibit certain activities, which is of ecological, environmental and biotechnological significance.

18 citations


Journal ArticleDOI
TL;DR: It is proved that CytP450 based marker system is efficient in the elucidation of genetic diversity in M. oleifera accessions.
Abstract: Drumstick (Moringa oleifera Lam.) is an important vegetable as well as forage crop of arid and semi-arid zones of the tropics. The leaves and pods of the plant are rich sources of minerals and vitamins. In the present work, a genetic diversity study of 23 genotypes of M. oleifera collected from the Kerala, Tamil Nadu and Karnataka states of India was carried out using seven pairs of cytochrome P450 (CytP450) gene-based markers, which recorded 88.25% polymorphism among the 23 sampled genotypes. The Polymorphic Information Content (PIC), Marker Index (MI) and Resolving Power obtained for the seven primers were estimated at 0.23, 2.96 and 9.83, respectively. The Unweighted Pair Group Method with Arithmetic mean (UPGMA) dendrogram based on this marker data indicates that genotypes from different geographical regions are placed in the same clusters. The dendrogram and Principal Coordinates Analysis (PCoA) plots derived from the binary data matrices were highly concordant. The investigation, in brief, proved that the CytP450 based marker system is efficient in the elucidation of genetic diversity in M. oleifera accessions.

13 citations


Journal ArticleDOI
TL;DR: A Hierarchical Agglomerative Clustering algorithm which automatically estimates the numbers of natural clusters and gives the associated clustering solutions along with dendrograms for visualizing the clustering structure can be a better alternative to contemporary approaches for identifying potential novel subtypes of cancers from genomic data.
Abstract: Identifying potential novel subtypes of cancers from genomic data requires techniques to estimate the number of natural clusters in the data. Determining the number of natural clusters in a dataset has been a challenging problem in Machine Learning. Employing an internal cluster validity index such as Silhouette Index together with a clustering algorithm has been a widely used technique for estimating the number of natural clusters, which has limitations. We propose a Hierarchical Agglomerative Clustering algorithm which automatically estimates the numbers of natural clusters and gives the associated clustering solutions along with dendrograms for visualizing the clustering structure. The algorithm has a Silhouette Index-based criterion for selecting the pair of clusters to merge, in the process of building the clustering hierarchy. The proposed algorithm could find decent estimates for the number of natural clusters, and the associated clustering solutions when applied to a collection of cancer gene expression datasets and general datasets. The proposed method showed better overall performance when compared to that of a set of prominent methods for estimating the number of natural clusters, which are used for cancer subtype discovery from genomic data. The proposed method is deterministic. It can be a better alternative to contemporary approaches for identifying potential novel subtypes of cancers from genomic data.

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors investigate the time evolution of dense cores identified in molecular cloud simulations using dendrograms, which are a common tool to identify hierarchical structure in simulations and observations of star formation.
Abstract: We investigate the time evolution of dense cores identified in molecular cloud simulations using dendrograms, which are a common tool to identify hierarchical structure in simulations and observations of star formation. We develop an algorithm to link dendrogram structures through time using the three-dimensional density field from magnetohydrodynamical simulations, thus creating histories for all dense cores in the domain. We find that the population-wide distributions of core properties are relatively invariant in time, and quantities like the core mass function match with observations. Despite this consistency, an individual core may undergo large (>40%), stochastic variations due to the redefinition of the dendrogram structure between timesteps. This variation occurs independent of environment and stellar content. We identify a population of short-lived (<200 kyr) overdensities masquerading as dense cores that may comprise ~20% of any time snapshot. Finally, we note the importance of considering the full history of cores when interpreting the origin of the initial mass function; we find that, especially for systems containing multiple stars, the core mass defined by a dendrogram leaf in a snapshot is typically less than the final system stellar mass. This work reinforces that there is no time-stable density contour that defines a star-forming core. The dendrogram itself can induce significant structure variation between timesteps due to small changes in the density field. Thus, one must use caution when comparing dendrograms of regions with different ages or environment properties because differences in dendrogram structure may not come solely from the physical evolution of dense cores.

10 citations



Journal ArticleDOI
TL;DR: A generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions is developed, and a vector-based representation of the inferred distances is computed.
Abstract: We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with single linkage criterion, with defining specific forms of a level function and a distance function over that. Therefore, we extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances respectively in solution space and in representation space in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
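The correspondence stated above between Minimax distances and a single-linkage dendrogram can be checked numerically on a small example. The sketch below is not the authors' framework. It only compares all-pairs minimax path distances (the smallest achievable "largest hop" between two points) with the height at which each pair first merges under single linkage. The toy points and function names are assumptions made for illustration.

```python
from itertools import combinations


def minimax_distances(d):
    """All-pairs minimax path distances via a Floyd-Warshall-style update."""
    n = len(d)
    m = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Best path through k costs the larger of its two legs.
                m[i][j] = min(m[i][j], max(m[i][k], m[k][j]))
    return m


def single_linkage_levels(d):
    """Merge height of every pair in a single-linkage dendrogram (naive agglomeration)."""
    n = len(d)
    clusters = [[i] for i in range(n)]
    level = [[0] * n for _ in range(n)]
    while len(clusters) > 1:
        # Single linkage: cluster distance is the smallest point-to-point distance.
        h, a, b = min(
            (min(d[i][j] for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        for i in clusters[a]:
            for j in clusters[b]:
                level[i][j] = level[j][i] = h  # height at which i and j first join
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return level


# Toy data (hypothetical): five points on a line with gaps 1, 2, 4, 1.
points = [0, 1, 3, 7, 8]
d = [[abs(p - q) for q in points] for p in points]

minimax = minimax_distances(d)
levels = single_linkage_levels(d)
print(minimax[0][4], levels[0][4])
```

On this example both matrices agree entry by entry, which is the special case that the paper generalises to other linkage criteria, level functions and distance functions.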

6 citations


Journal Article
TL;DR: Thirty-six rice genotypes were assessed for genetic diversity using D² analysis, and the dendrogram analysis helped to choose the improved mutants from the original variety.
Abstract: Thirty-six rice genotypes were used to estimate the magnitude of their genetic diversity using D² analysis. The analysis of 17 quantitative traits grouped the 36 rice genotypes into 8 clusters. The maximum number of genotypes was grouped in Cluster I, whereas only one genotype each was associated with Clusters VII and VIII. Cluster I had the maximum cluster mean for the number of grains per panicle and the minimum for days to maturity, whereas Cluster VII exhibited a better grain type with the maximum cluster mean for grain yield. Hence the selection of genotypes from these clusters would give better breeding lines with higher grain yield. The dendrogram analysis helped to choose the improved mutants from the original variety.

4 citations


Journal ArticleDOI
20 May 2020
TL;DR: The results showed that the studied SSR markers provided sufficient polymorphism and reproducible fingerprinting profiles for evaluating genetic diversity of wheat landraces, which showed a good level of genetic diversity at the molecular level.
Abstract: To increase genetic progress in wheat (Triticum aestivum L.) yield, breeders search for germplasm of high genetic diversity, such as landraces. The present study aimed at evaluating the genetic diversity of 20 Egyptian wheat landraces and two cultivars using microsatellite markers (SSRs). Ten SSR markers amplified a total of 27 alleles in the set of 22 wheat accessions, of which 23 alleles (85.2%) were polymorphic. The majority of the markers showed high polymorphism information content (PIC) values (0.67-0.94), indicating the diverse nature of the wheat accessions and/or the highly informative SSR markers used in this study. The genotyping data of the SSR markers were used to assess genetic variation in the wheat accessions by dendrogram. The highest genetic distance was found between G21 (Sakha 64; an Egyptian cultivar) and the landrace accession No. 9120 (G11). These two genotypes could be used as parents in a hybridization program followed by selection in the segregating generations, to identify some transgressive segregates of higher grain yield than both parents. The clustering assigned the wheat genotypes into four groups based on SSR markers. The results showed that the studied SSR markers provided sufficient polymorphism and reproducible fingerprinting profiles for evaluating genetic diversity of wheat landraces. The analyzed wheat landraces showed a good level of genetic diversity at the molecular level. Molecular variation evaluated in this study of wheat landraces can be useful in traditional and molecular breeding programs. (Original Research Article: Al-Naggar et al.; AJBGMB, 3(4): 46-58, 2020; Article no. AJBGMB.56398.)

4 citations


Journal ArticleDOI
30 Jun 2020
TL;DR: This study showed that ISSRs would be useful to determine the genetic diversity in the Brassicaceae family.
Abstract: Brassicaceae is one of the largest families, with thousands of species around the world. To use wild mustard in breeding, its genetic kinship levels must be defined. Inter simple sequence repeats (ISSRs) are among the common markers used to evaluate genetic diversity. Here, 28 mustard genotypes representing four taxa (17 of Brassica juncea, 2 of B. nigra, 2 of B. rapa, and 7 of B. arvensis) were investigated with seven ISSR primers. In total, 160 bands were scored, of which 88.75% showed polymorphism. The polymorphism information content (PIC) varied from 0.25 to 0.40. The average heterozygosity (Hav), multiplex ratio (MR), marker index (MI), and resolving power (Rp) were calculated as 0.33, 9.07, 2.99, and 8.29, respectively. STRUCTURE (v2.3.4) analysis revealed two subpopulations (K=2). The dendrogram, constructed from the Jaccard similarity coefficient using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), supported the results of the STRUCTURE analysis: its first branch consisted of B. juncea, B. nigra and B. rapa, and its second branch consisted of B. arvensis. Additionally, principal component analysis (PCA) supported the dendrogram and clearly separated the four taxa. This study showed that ISSRs would be useful to determine the genetic diversity in the Brassicaceae family.
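A dendrogram of the kind described here, UPGMA on Jaccard coefficients from a binary band matrix, can be sketched with SciPy as follows. The band matrix is invented for illustration (six genotypes, eight bands) and is not the study's data. SciPy works with the Jaccard distance, that is, one minus the similarity.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical band matrix: rows are genotypes, columns are scored bands
# (True = band present). Values are made up for this example.
bands = np.array([
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 1],
], dtype=bool)

# Condensed vector of pairwise Jaccard distances (1 - Jaccard similarity).
jaccard = pdist(bands, metric="jaccard")

# In SciPy, method="average" is the UPGMA linkage rule.
Z = linkage(jaccard, method="average")

# Cut the tree into two branches, analogous to comparing against K = 2.
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)
```

Comparing `groups` with an independent grouping, such as cluster assignments from a population-structure analysis, is one simple way to check whether the two analyses support each other.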

Journal ArticleDOI
TL;DR: Genetic variation was observed in 65 pineapple accessions from the germplasm collection of the Malaysian Agriculture Research and Development Institute (MARDI) using 15 simple sequence repeat (SSR) markers; 59 alleles were detected, ranging from 2.0 to 6.0 alleles per locus with a mean of 3.9, indicating a moderate level of polymorphism across samples.
Abstract: Assessments of genetic diversity are reported to be highly useful for utilising and managing genetic resources in breeding programmes. In this study, genetic variation was assessed in 65 pineapple accessions from the germplasm collection of the Malaysian Agriculture Research and Development Institute (MARDI) in Pontian, Johor, using 15 simple sequence repeat (SSR) markers. A total of 59 alleles were detected, ranging from 2.0 to 6.0 alleles per locus with a mean of 3.9, indicating a moderate level of polymorphism across samples. The polymorphic information content (PIC) ranged from 0.104 (TsuAC035) to 0.697 (Acom_9.9), with an average of 0.433. The expected and observed heterozygosity of each locus ranged from 0.033 to 0.712 and from 0.033 to 0.885, with averages of 0.437 and 0.511, respectively. Population structure analysis using the delta K (ΔK) method, along with the mean L(K) method, revealed that individuals from the germplasm could be divided into two major genetic clusters (K = 2), Group 1 and Group 2. Five accessions (Yankee, SRK Chalok, SCK Giant India, SC KEW5 India and SC1 Thailand) were clustered in Group 1, while the rest were clustered in Group 2. These outcomes were also supported by the dendrogram generated with the unweighted pair group method with arithmetic mean (UPGMA). These analyses can help breeders maintain and manage their germplasm collections. The data gathered in this study can also help breeders exploit genetic diversity when estimating the level of heterosis.
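For readers unfamiliar with the per-locus statistics reported here, the sketch below computes expected heterozygosity and PIC from allele frequencies using the formulas commonly applied to codominant markers such as SSRs. Whether the study used exactly these formulas is an assumption, and the allele frequencies are hypothetical.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2) for the allele frequencies p_i at one locus."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - cross


# Hypothetical allele frequencies at one locus; they must sum to 1.
locus = [0.5, 0.3, 0.2]

print(f"He  = {expected_heterozygosity(locus):.3f}")
print(f"PIC = {pic(locus):.3f}")
```

Because the cross term is never negative, PIC cannot exceed the expected heterozygosity at the same locus, which matches the averages reported above (0.433 versus 0.437).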

Journal ArticleDOI
TL;DR: The biochemical characterization revealed more precise discrimination among the 27 cowpea varieties studied, and grouped the varieties into seven clusters at 52% similarity coefficient.
Abstract: The shortcomings of genotype x environment interaction necessitated the use of molecular methods in characterizing many plant species and in determining their phylogenetic relationships. In this study, some selected cowpea lines (27 varieties) from Obafemi Awolowo University, Ile – Ife, the Institute of Agricultural Research (IAR), Samaru, Kaduna and Genetic Resource Centre, IITA, Ibadan were characterized using sodium dodecyl sulphate polyacrylamide gel electrophoresis (SDS-PAGE) profiling. The protein banding profiles of the 27 cowpea varieties were scored and subjected to cluster analysis using Ward's minimum-variance method (WMVM) for dendrogram grouping. The dendrogram generated from the SDS-PAGE profiles grouped the varieties into seven clusters at 52% similarity coefficient. Hence, the biochemical characterization revealed more precise discrimination among the 27 cowpea varieties studied. Keywords: Cowpea, electrophoretic banding profiles, dendrogram grouping, total proteins

Posted Content
TL;DR: It is shown that statistical features drawn from these bottleneck distance distributions detect artefacts of, and can be tapped to recover higher dimensional shape characteristics of point cloud data.
Abstract: We show that specific higher dimensional shape information of point cloud data can be recovered by observing lower dimensional hierarchical clustering dynamics. We generate multiple point samples from point clouds and perform hierarchical clustering within each sample to produce dendrograms. From these dendrograms, we take cluster evolution and merging data that capture clustering behavior to construct simplified diagrams that record the lifetime of clusters akin to what zero dimensional persistence diagrams do in topological data analysis. We compare differences between these diagrams using the bottleneck metric, and examine the resulting distribution. Finally, we show that statistical features drawn from these bottleneck distance distributions detect artefacts of, and can be tapped to recover higher dimensional shape characteristics.

Journal ArticleDOI
TL;DR: In this article, the authors compared and evaluated the floristics, structure and degree of conservation of two adjacent areas of Cerrado, one of conserved savanna (CC) and one of anthropized savanna (CA).
Abstract: The strong anthropic pressure in the Cerrado makes studies necessary to ascertain the level of conservation of areas subject to this activity. Thus, the objective of the work was to compare and evaluate the floristics, structure and degree of conservation of two adjacent areas of Cerrado, one of conserved savanna (CC) and one of anthropized savanna (CA). Conventional phytosociological parameters were used: relative density (DR), relative dominance (DoR), relative frequency (FR) and importance value (VI), in addition to the Shannon-Wiener diversity index and Pielou's equitability. Ordination analyses were performed using the DCA method, and classification used a dendrogram (UPGMA) based on the Sorensen-Dice and Bray-Curtis indices, in addition to the dichotomous hierarchical division by TWINSPAN. Student's t-test was used to compare variables, diametric classes and height classes. The conserved savanna presented higher values for the number of individuals, diametric classes, basal area and height. The diversity index by plots, number of species, individuals and total basal area were statistically different between the two areas. Multivariate analyses made it possible to segregate the plots by degree of conservation, both in the ordination analysis (DCA) and in the classification (dendrogram). The methods used were efficient in demonstrating the differences between the two areas; however, the multivariate analysis provided greater detail in the differentiation.

Posted ContentDOI
29 Apr 2020-bioRxiv
TL;DR: This work takes a model selection perspective to clustering and proposes a shape clustering method through linear models defined on Spherical Harmonics expansions of shapes, and introduces a BIC-type criterion, called CLUSBIC, and study consistency of the criterion.
Abstract: Shape is an important phenotype of living species that contain different environmental and genetic information. Clustering living cells using their shape information can provide a preliminary guide to their functionality and evolution. Hierarchical clustering and dendrograms, as a visualization tool for hierarchical clustering, are commonly used by practitioners for classification and clustering. The existing hierarchical shape clustering methods are distance based. Such methods often lack a proper statistical foundation to allow for making inference on important parameters such as the number of clusters, often of prime interest to practitioners. We take a model selection perspective to clustering and propose a shape clustering method through linear models defined on Spherical Harmonics expansions of shapes. We introduce a BIC-type criterion, called CLUSBIC, and study consistency of the criterion. Special attention is paid to the notions of over- and under-specified models, important in studying model selection criteria and naturally defined in model selection literature. These notions do not automatically extend to shape clustering when a model selection perspective is adopted for clustering. To this end we take a novel approach using hypothesis testing. We apply our proposed criterion to cluster a set of real 3D images from HeLa cell line.

Journal ArticleDOI
TL;DR: In this article, a review is presented on clustering methods used with binary data, and an evaluation of the linkage methods and the corresponding appropriate distances is attempted using binary data resulting from molecular markers applied to five populations of the wild mustard species Sinapis arvensis.
Abstract: Data from molecular markers used for constructing dendrograms, which are based on genetic distances between different plant species, are encoded as binary data. For dendrogram construction, the most commonly used linkage method is UPGMA in combination with the squared Euclidean distance. In this scientific field, this seems to be the 'golden standard' clustering method. In this study, a review is presented on clustering methods used with binary data. Furthermore, an evaluation of the linkage methods and the corresponding appropriate distances (comparison of 163 clustering methods) is attempted using binary data resulting from molecular markers applied to five populations of the wild mustard species Sinapis arvensis. The validation of the various cluster solutions was tested using external criteria. The results showed that the 'golden standard' is not a 'panacea' for dendrogram construction based on binary data derived from molecular markers. Thirty-seven other hierarchical clustering methods could be used.

Posted Content
TL;DR: This work presents a method for hierarchical clustering of directed acyclic graphs and other strictly partially ordered data that preserves the data structure and uses standard linkage functions, such as single- and complete linkage, and is a generalisation of hierarchical clusters of non-ordered sets.
Abstract: We present a method for hierarchical clustering of directed acyclic graphs and other strictly partially ordered data that preserves the data structure. In particular, if we have $a < b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, we get $[a] < [b]$ in the produced clustering. The clustering uses standard linkage functions, such as single- and complete linkage, and is a generalisation of hierarchical clustering of non-ordered sets. To achieve this, we define the output from running hierarchical clustering algorithms on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is now defined as follows: Given a collection of partial dendrograms, the optimal clustering is the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the $p$-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting.

Posted Content
TL;DR: This work develops a novel theory that extends classical hierarchical clustering to strictly partially ordered sets and constructs an embedding of partial dendrograms over a set into the family of ultrametrics over the same set.
Abstract: We present a method for hierarchical clustering of directed acyclic graphs and other strictly partially ordered data that preserves the data structure. In particular, if we have $a

Proceedings ArticleDOI
06 Jan 2020
TL;DR: This study explores clustering algorithms and provides methods for evaluating clustering characteristics and selecting clustering dimensions, to help users understand the impact and meaning of clustering parameters and data dimensions on clustering and thereby strengthen the use of clustering algorithms.
Abstract: The purpose of this study is to explore clustering algorithms and provide methods for evaluating clustering characteristics and selecting clustering dimensions, in order to help users understand the impact and meaning of clustering parameters and data dimensions on clustering, thereby strengthening the use of clustering algorithms. In previous studies, many scholars have proposed various types of clustering algorithms. Most of these algorithms require clustering parameters to be set, and the choice of parameters affects the clustering results. Therefore, users must fully understand the meaning of the clustering parameters and select appropriate values before a clustering algorithm can be used effectively to help solve decision-making problems. Accordingly, this study focuses on further analysis and description of the meaning of the clustering data distribution, the meaning of the parameters for the clusters, and the relationships among the clusters. It identifies the important clustering features and proposes a new clustering evaluation formula, which is expected to help decision-makers find appropriate clustering parameters effectively.

Posted Content
TL;DR: The root cause of this issue is identified, and the use of a data-dependent kernel (instead of a distance or an existing kernel) provides an effective means to address it, leading to a new approach to kernelise existing hierarchical clustering algorithms such as traditional AHC algorithms, HDBSCAN, GDL and PHA.
Abstract: Agglomerative hierarchical clustering (AHC) is one of the popular clustering approaches. Existing AHC methods, which are based on a distance measure, have one key issue: they have difficulty identifying adjacent clusters with varied densities, regardless of the cluster extraction method applied to the resultant dendrogram. In this paper, we identify the root cause of this issue and show that the use of a data-dependent kernel (instead of a distance or an existing kernel) provides an effective means to address it. We analyse the condition under which existing AHC methods fail to extract clusters effectively, and the reason why the data-dependent kernel is an effective remedy. This leads to a new approach to kernelise existing hierarchical clustering algorithms such as traditional AHC algorithms, HDBSCAN, GDL and PHA. In each of these algorithms, our empirical evaluation shows that a recently introduced Isolation Kernel produces a higher quality or purer dendrogram than distance, Gaussian Kernel and adaptive Gaussian Kernel.

Journal ArticleDOI
TL;DR: In this article, the authors used the Chebyshev distance and the McQuitty linkage method to generate a dendrogram whose validation was based on a cophenetic coefficient of 0.8.
Abstract: The objective was the genetic selection of cassava (Manihot esculenta Crantz) genotypes for the formation of groups. Initially, 23 quantitative variables from 28 genotypes were used. After principal-components analysis, this number was reduced to 7 variables (stalk diameter, number of maniva-seeds per plant, root length, root diameter, leaf water potential in the morning, leaf water potential at midday, and productivity of the aerial part). A cluster analysis using the Chebyshev distance and the McQuitty linkage method was used to generate a dendrogram, whose validation was based on a cophenetic coefficient of 0.8. Two groups composed of 24 and 4 cassava genotypes were formed. These were indicated in the dendrogram using the Ratkowsky, McClain and KL indexes. The second group, formed by the Tussuma, Caititi, Poti Branca and Mulatinha genotypes, showed a higher frequency of the variables that describe the upper part of cassava. This information is important for the creation of a database for rustic species and their improvement.
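
The workflow described in this abstract (Chebyshev distance, McQuitty linkage, cophenetic validation, two groups) can be sketched with SciPy. This is a minimal illustration and not the authors' code: the trait matrix is synthetic, the standardization step is an assumption, and SciPy's `weighted` linkage (WPGMA) is used as the equivalent of McQuitty's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

# Synthetic stand-in for the study's data: 28 genotypes x 7 traits,
# built as one group of 24 and one well-separated group of 4.
rng = np.random.default_rng(0)
traits = np.vstack([
    rng.normal(0.0, 1.0, size=(24, 7)),
    rng.normal(10.0, 1.0, size=(4, 7)),
])

# Assumption: traits are standardized so no single unit dominates.
z = (traits - traits.mean(axis=0)) / traits.std(axis=0)

dist = pdist(z, metric="chebyshev")        # condensed Chebyshev distances
tree = linkage(dist, method="weighted")    # WPGMA, i.e. McQuitty linkage
coph_corr, _ = cophenet(tree, dist)        # cophenetic correlation coefficient
groups = fcluster(tree, t=2, criterion="maxclust")  # cut into two groups

print("cophenetic coefficient:", round(float(coph_corr), 3))
print("group sizes:", np.bincount(groups)[1:])
```

A cophenetic coefficient close to 1 indicates that the dendrogram preserves the original pairwise distances well; the abstract reports 0.8 for the real data.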

DOI
30 Oct 2020
TL;DR: There was a considerable amount of genetic variability among passion fruit accessions grown in Uasin Gishu County of Kenya and there was general agreement between the population sub-divisions and the genetic relationships among accessions.
Abstract: Purple passion fruit (Passiflora edulis Sims) is the third most important fruit crop in Kenya and is produced for both local and export markets. In Uasin Gishu County, passion fruit has recently emerged as an important cash crop for small-holder farmers. Understanding the structure and diversity of species is very important in plant breeding and in activities related to the conservation of genetic resources. This study was carried out in 2017-2018 to determine the genetic diversity of purple passion fruit genotypes grown in Uasin Gishu County, Kenya, using SSR markers. Among the 50 purple passion fruit accessions used in this study, the genetic distance coefficients among accessions ranged from 0.24 to 0.72, with an average of 0.48. The results of STRUCTURE analysis suggested that the 50 accessions could be grouped into five sub-populations. The clustering was based on the unweighted pair-group method of arithmetic averages (UPGMA), where accessions were divided into three major clusters. The UPGMA dendrogram revealed that accessions from identical or adjacent areas were generally, but not entirely, clustered into the same cluster. Comparison of the UPGMA dendrogram and the Bayesian STRUCTURE analysis showed general agreement between the population sub-divisions and the genetic relationships among accessions. Principal coordinate analysis (PCoA) with SSR markers revealed a similar grouping of accessions to the UPGMA dendrogram and STRUCTURE analysis. Analysis of molecular variance (AMOVA) indicated that 16% of the total variation was attributed to the diversity among sub-populations, while 84% was associated with differences within sub-populations. Overall, there was a considerable amount of genetic variability among passion fruit accessions grown in Uasin Gishu County of Kenya.
The study represents a comprehensive investigation of the genetic diversity of passion fruit accessions, which would be valuable for germplasm collection, genetic improvement, and efficient utilization.

Journal ArticleDOI
09 Sep 2020
TL;DR: Ebeniro et al. as discussed by the authors assessed tree species composition and classification in a degraded tropical rainforest in Southwest Nigeria, where data was collected from the Olukayode compartment of the study area.
Abstract: Tree species information is essential for forest studies such as forest meteorology, botany and ecology, and across the relevant fields new techniques efficient for classifying tree species are desperately in demand. This study assessed tree species composition and classification in a degraded tropical rainforest in Southwest Nigeria. Data was collected from the Olukayode compartment of the study area of size 2 ha. Eight (8) temporary sample plots of size 50 m x 50 m were laid out using systematic line transects at 100 m intervals in the compartment. Hierarchical clustering in SPSS was used to find clusters of patterns in the measurement space. Tree species such as Eucalyptus camaldulensis, Eucalyptus tereticornis, Khaya ivorensis, Khaya senegalensis, Nauclea diderrichii, Terminalia randii, and Terminalia superba, with a total frequency of 60, were identified, belonging to 3 different families. At similarity 5.0 from the dendrogram using Ward linkage, samples 48 6 formed the first cluster, samples 28 9 constituted the second cluster, while samples 20 13 constituted the third cluster. From the dendrogram using centroid linkage, at similarity 5.0, samples 59 7 formed the first cluster, samples 32 31 constituted the second cluster, and samples 8 28 formed the third cluster, while the fourth cluster combined samples 17 21, which is a combination of trees from the three families. A histogram was used to show the diameter at breast height and total height distribution. (Original Research Article: Ebeniro and Wali; AJRAF, 6(3): 41-48, 2020; Article no. AJRAF.59790.)
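
The comparison of Ward and centroid linkage at a fixed cut level can be illustrated as follows. This is only a sketch under stated assumptions: the tree measurements are synthetic, SciPy stands in for SPSS, and the cut at 20% of the tallest merge is a hypothetical stand-in for the abstract's "similarity 5.0", whose exact scale the abstract does not define.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical measurements for 60 trees: diameter at breast height (cm)
# and total height (m). Synthetic values, not the study's data.
rng = np.random.default_rng(1)
trees = np.column_stack([rng.uniform(10, 80, 60), rng.uniform(5, 40, 60)])
z = (trees - trees.mean(axis=0)) / trees.std(axis=0)  # standardize both columns

labels = {}
for method in ("ward", "centroid"):
    tree = linkage(z, method=method)     # Euclidean distance on the rows of z
    cut = 0.2 * tree[:, 2].max()         # assumed cut: 20% of the tallest merge
    labels[method] = fcluster(tree, t=cut, criterion="distance")
    print(method, "clusters at cut:", int(labels[method].max()))
```

The two linkage rules generally produce different numbers of clusters at the same relative cut, which matches the abstract's observation of three clusters under Ward linkage and four under centroid linkage.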

Posted Content
TL;DR: In this article, the average dendrogram is based on the Pearson correlation between resting-state networks and two methodologies are proposed to better understand the inter-individual variability of resting-state functional Magnetic Resonance Imaging (fMRI) brain data.
Abstract: We propose two methodologies in order to better understand the inter-individual variability of resting-state functional Magnetic Resonance Imaging (fMRI) brain data. The aim of the study was to quantify whether the average dendrogram is representative of the initial population and to identify its possible sources of instability. The average dendrogram is based on the Pearson correlation between resting-state networks. The first method identifies networks that can lead to unstable partitions of the average dendrogram. The second method identified homogeneous sub-samples of participants for whom their associated average dendrograms were more stable than that of the whole sample. The two suggested methods have shown significant quantifiable behavioral data results with regards to detecting an unstable network or presence of subpopulations when the noise level does not conceal the structure of the data. These two methods have been successfully applied to establish a cerebral atlas for late adulthood. The first method made it clear that there was no unstable network among the atlas networks. The second method highlighted the presence of two distinct sub-populations with different age-related brain organizations.
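
One plausible way to build a dendrogram from Pearson correlations averaged over participants is sketched below. The abstract does not specify how the average dendrogram is constructed, so everything here is an assumption: the signals are synthetic, the correlation matrices are averaged before clustering, `1 - r` is used as the dissimilarity, and average linkage is chosen arbitrarily.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
n_subjects, n_timepoints, n_networks = 10, 200, 6

# Accumulate one network-by-network Pearson correlation matrix per participant.
corr_sum = np.zeros((n_networks, n_networks))
for _ in range(n_subjects):
    signals = rng.normal(size=(n_timepoints, n_networks))
    signals[:, 1] += signals[:, 0]        # make networks 0 and 1 correlated
    corr_sum += np.corrcoef(signals, rowvar=False)
mean_corr = corr_sum / n_subjects

# Assumed dissimilarity: 1 - mean correlation, with an exact zero diagonal.
dist = 1.0 - mean_corr
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")

print("first merge joins networks:", sorted(tree[0, :2].astype(int).tolist()))
```

Repeating the construction on sub-samples of participants and comparing the resulting trees would be one way to probe the stability questions the abstract raises.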

Journal Article
TL;DR: Fifty-two accessions of sweet potato collected from eastern states of India and maintained in the National Active Germplasm Site at ICAR-CTCRI, along with two wild species Ipomoea triloba L. and I. aquatica Forssk., were evaluated using eighteen vegetative morphological and eleven Inter Simple Sequence Repeats (ISSR) markers.
Abstract: Fifty-two accessions of sweet potato collected from eastern states of India and maintained in the National Active Germplasm Site at ICAR-CTCRI, along with two wild species Ipomoea triloba L. and Ipomoea aquatica Forssk., were evaluated using eighteen vegetative morphological and eleven Inter Simple Sequence Repeats (ISSR) markers. The dendrogram obtained using phenotypic characters separated the genotypes into two major clusters and an outlier at a Euclidean distance of 1.20. The first three principal components of the data accounted for 67.50% of the total variance among accessions. Traits like predominant vine colour and leaf lobe type were found to be of great importance in distinguishing the accessions. The cluster diagram based on morphological data revealed that the accessions exhibited a greater degree of genetic variation for the 18 different morphological traits observed. According to the morphological data, there were no duplicate accessions, and S1439 and S1442 were found to be highly similar among the accessions studied. The hierarchical clustering using the ISSR profile, based on Jaccard’s similarity coefficient, separated the accessions into three principal clusters at a similarity coefficient of 0.56. The first principal cluster consisted of 37 accessions with many sub-clusters showing high intra-cluster variability, indicating the variability in the sweet potato accessions selected for the study. Accessions collected from the same geographical area were grouped together in a single cluster. The second principal cluster comprised 15 accessions, with one set of two accessions showing 89% similarity, both collected from Bihar. This grouping was similar to that obtained with morphological data. I. triloba L. and I. aquatica Forssk. were grouped as a third cluster, showing their species specificity. A Mantel test indicated significant correlation between morphological and molecular marker information.
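
The Mantel test mentioned at the end of this abstract compares two distance matrices computed on the same accessions. The following is a minimal permutation-based sketch, not the authors' procedure: the band and trait data are synthetic, `mantel` is a hypothetical helper written for this illustration, and Jaccard distance (1 minus the Jaccard similarity coefficient) is used for the binary ISSR-like profiles.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mantel(d1, d2, n_perm=999, seed=0):
    """One-sided permutation Mantel test between two square distance matrices."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(d1, k=1)            # upper triangle, no diagonal
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(d1.shape[0])          # relabel objects in d1 only
        r_perm = np.corrcoef(d1[p][:, p][iu], d2[iu])[0, 1]
        hits += int(r_perm >= r_obs)
    return r_obs, (hits + 1) / (n_perm + 1)

# Synthetic data: 20 accessions in two groups of 10, sharing one structure.
rng = np.random.default_rng(4)
group = np.repeat([0, 1], 10)
band_half = np.arange(30) < 15                    # first 15 of 30 marker bands
p_present = np.where(band_half[None, :] == (group[:, None] == 0), 0.9, 0.1)
bands = rng.random((20, 30)) < p_present          # binary band presence
morph = rng.normal(size=(20, 5)) + 3.0 * group[:, None]  # 5 morphological traits

d_mol = squareform(pdist(bands, metric="jaccard"))
d_morph = squareform(pdist(morph, metric="euclidean"))
r_obs, p_value = mantel(d_morph, d_mol)
print("Mantel r:", round(float(r_obs), 3), "p-value:", p_value)
```

Permuting rows and columns together keeps each matrix internally consistent while breaking the pairing between the two, which is what makes the null distribution valid for distance data.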

Posted Content
TL;DR: A new strategy is proposed for building easy to interpret predictive models in the context of a high-dimensional dataset, with a large number of highly correlated explanatory variables, based on a first step of variables clustering using the CLustering of Variables around Latent Variables (CLV) method.
Abstract: A new strategy is proposed for building easy to interpret predictive models in the context of a high-dimensional dataset, with a large number of highly correlated explanatory variables. The strategy is based on a first step of variables clustering using the CLustering of Variables around Latent Variables (CLV) method. The exploration of the hierarchical clustering dendrogram is undertaken in order to sequentially select the explanatory variables in a group-wise fashion. For model setting implementation, the dendrogram is used as the base-learner in an L2-boosting procedure. The proposed approach, named lmCLV, is illustrated on the basis of a toy-simulated example when the clusters and predictive equation are already known, and on a real case study dealing with the authentication of orange juices based on 1H-NMR spectroscopic analysis. In both illustrative examples, this procedure was shown to have similar predictive efficiency to other methods, with additional interpretability capacity. It is available in the R package ClustVarLV.

Journal ArticleDOI
TL;DR: A collection of sea buckthorn samples of different ecological and geographical origin, which consisted of herbarium samples and samples harvested in natural populations, was prepared and revealed a high level of intraspecific polymorphism.
Abstract: Sea buckthorn (Hippophae rhamnoides L.) is a perennial dioecious shrub or tree whose juvenile period lasts from 3 to 5 years. Sea buckthorn grows in many European and Asian countries and is widely used in cosmetology, biotechnology, pharmacology, and agriculture. This crop is a source of valuable vitamins, minerals, carotenoids, and flavonoids. Its phenotypic description does not provide accurate characteristics of a taxon at the species, subspecies, variety, and population levels; therefore, one should use molecular analysis. For this study, a collection of sea buckthorn samples of different ecological and geographical origin, which consisted of herbarium samples and samples harvested in natural populations, was prepared. ISSR markers efficiently revealing genotypic differences between the samples were selected, and binary matrices were created for the further dendrogram construction. The performed analysis revealed a high level of intraspecific polymorphism and showed that samples of a similar ecological and geographical origin tended to group into clusters and were located at different genetic distances. The results of the study can be used to evaluate the diversity of forms of this crop, relations between them, their ecological and geographical origin, and the evolution of certain taxa. These results may also be used for genetic passportization and identification of varieties before they reach the generative period, and for pair selection for crossing.

Book ChapterDOI
01 Jan 2020
TL;DR: Wang et al. as discussed by the authors applied functional clustering analysis to a milk dataset, which contains milk concentration, sum of rain, average speed of wind, and average temperature, to reveal the relationship between milk concentration and other variables.
Abstract: In this research, we study characteristics of concentration variation with functional data analysis. Functional data analysis, which combines traditional data analysis with the characteristics of functions, is suitable for analyzing the changing trend of observed data through the use of derivatives. We applied functional clustering analysis to a milk dataset, which contains milk concentration, sum of rain, average speed of wind, and average temperature, to reveal the relationship between milk concentration and other variables. We obtain eight dendrograms in the functional clustering analysis. The cophenetic correlation coefficient is used to measure the similarities among the dendrograms. Multidimensional scaling is used to visualize the dissimilarities of the dendrograms clearly. As a result, we find that the trend of milk concentration is related to those of sum of rain and of average speed of wind.
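
Comparing dendrograms by cophenetic correlation and then placing them on a map with multidimensional scaling can be sketched as below. This is an illustration under assumptions rather than the authors' pipeline: the data are random, four linkage methods stand in for the study's eight dendrograms, `1 - r` is used as the dissimilarity between dendrograms, and classical (Torgerson) MDS is implemented directly with NumPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic objects; each linkage method yields one dendrogram over them.
rng = np.random.default_rng(3)
data = rng.normal(size=(15, 4))
d = pdist(data)
methods = ["single", "complete", "average", "ward"]

# Condensed cophenetic distance vector of each dendrogram.
coph = [cophenet(linkage(d, method=m)) for m in methods]

# Cophenetic correlation between every pair of dendrograms.
similarity = np.corrcoef(np.vstack(coph))
dissim = 1.0 - similarity

# Classical MDS: double-center the squared dissimilarities, then eigendecompose.
n = dissim.shape[0]
centering = np.eye(n) - np.ones((n, n)) / n
b = -0.5 * centering @ (dissim ** 2) @ centering
vals, vecs = np.linalg.eigh(b)
order = np.argsort(vals)[::-1][:2]                 # two largest eigenvalues
coords = vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

for name, xy in zip(methods, coords):
    print(name, np.round(xy, 3))
```

Dendrograms that land close together in the two-dimensional map impose similar hierarchies on the objects, which is the kind of relationship the abstract reads off between milk concentration, rain and wind speed.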

Journal ArticleDOI
03 Nov 2020
TL;DR: It is shown that strains of A. ferrooxidans isolated from various dumps have wide genetic diversity that depends on the primary location of isolation; the coefficients of similarity between them vary from 0.182 to 0.80, and node probabilities of 65.0 to 76.0% confirm the high degree of similarity among strains grouped by clusters.
Abstract: This article reports a study of the genetic variability of Acidithiobacillus ferrooxidans strains that were first isolated from waste of the coal and energy industries of Ukraine. According to the results of previous studies, these strains are fully consistent with the biological properties of A. ferrooxidans bacteria given in Bergey’s Manual of Determinative Bacteriology and other original works [8, 10, 12, 18]; in addition, the strains studied, regardless of habitat, were resistant to temperature and pH, had a mixed type of nutrition, used similar energy sources, etc. [1, 10, 11]. PCR confirmed that the acidophilic chemolithotrophic strains isolated from dumps of different origin belong to the species A. ferrooxidans. Genetic polymorphism of the strains was studied by RAPD-PCR using the universal primer M13. It was shown that strains of A. ferrooxidans isolated from various dumps have wide genetic diversity. Comparative analysis of the obtained RAPD profiles showed variability among the strains that coincides with their main phenotypic properties, as described earlier [11]. The most heterogeneous profiles were characteristic of A. ferrooxidans DTV 1, A. ferrooxidans Lad 5 and A. ferrooxidans Lad 27. The obtained RAPD profiles served as the basis for generating the dendrogram, constructed using the Neighbor-Joining method and a similarity matrix calculated from the Nei & Li coefficient of similarity. The obtained dendrogram shows the formation of two clusters that combine similar strains. The node probabilities of the constructed dendrogram range from 65.0 to 76.0%, which confirms the high degree of similarity between strains grouped by clusters.
It is also shown that the first cluster includes strains that were isolated from coal and waste from its enrichment, and the second cluster includes strains that were isolated from coal waste. The obtained data confirm that the genetic variability of the strains depends on the primary location of isolation of the strains, and the coefficients of similarity between them vary from 0.182 to 0.80.
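
The Nei & Li similarity coefficient used for the RAPD profiles is twice the number of shared bands divided by the total number of bands in the two profiles. A small sketch follows; the band profiles and strain labels are hypothetical, and `nei_li` is a helper name introduced only for this illustration.

```python
def nei_li(a, b):
    """Nei & Li similarity: 2 * shared bands / (bands in a + bands in b)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * shared / total if total else 0.0

# Hypothetical binary band profiles (1 = band present) for three strains.
profiles = {
    "strain_A": [1, 1, 0, 1, 0, 1],
    "strain_B": [1, 0, 0, 1, 1, 1],
    "strain_C": [0, 0, 1, 0, 1, 0],
}

names = list(profiles)
similarity = [[nei_li(profiles[p], profiles[q]) for q in names] for p in names]

# 1 - similarity is the usual distance handed to a tree-building method
# such as Neighbor-Joining.
distance = [[1.0 - s for s in row] for row in similarity]

for name, row in zip(names, similarity):
    print(name, [round(s, 3) for s in row])
```

For strain_A and strain_B, three bands are shared and each profile has four bands, giving 2 * 3 / (4 + 4) = 0.75, which falls inside the 0.182 to 0.80 range reported for the real strains.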