Open Access Journal Article

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata.

TLDR
The intuition behind cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can then be found easily.
Abstract
The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because there is no structured vocabulary to guide submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete descriptions of the data. In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. The algorithm identifies metadata keys that are similar to each other in (i) name, (ii) core concept and (iii) values, and groups them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were correctly identified as related to their cluster, with the exception of 13 keys that were not related to the cluster in which they were placed. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-score (0.63).
Our algorithm identified keys that were similar to each other and grouped them together. The intuition behind cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can then be found easily. Our algorithm can also be applied to other biomedical data types.
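
To make the idea of grouping similar keys concrete, the following is a minimal sketch, not the authors' algorithm. It assumes only a name-based similarity (the abstract also mentions core-concept and value similarities, which are not modelled here), uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, and applies naive single-linkage agglomerative merging. The function names `name_similarity` and `cluster_keys` and the threshold of 0.7 are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized key names.

    Assumption: a simple stand-in for a name-based measure; the paper's
    actual similarity measures are not reproduced here.
    """
    norm_a = a.lower().replace("_", " ").strip()
    norm_b = b.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def cluster_keys(keys, threshold=0.7):
    """Naive single-linkage agglomerative clustering of metadata keys.

    Starts with one cluster per key and repeatedly merges the two most
    similar clusters until no pair reaches the (hypothetical) threshold.
    This brute-force loop is for illustration only and is not scalable.
    """
    clusters = [[k] for k in keys]

    def linkage(c1, c2):
        # Single linkage: similarity of the closest pair across clusters.
        return max(name_similarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best_score, best_i, best_j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        # best_i < best_j, so deleting best_j leaves best_i valid.
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters


if __name__ == "__main__":
    example_keys = ["age", "Age", "cell line", "cell_line", "tissue"]
    for cluster in cluster_keys(example_keys):
        print(cluster)
```

In this toy input, spelling variants such as "cell line" and "cell_line" end up in the same cluster, which is where duplicates and errors would then be inspected. A realistic pipeline would need a more scalable merging strategy and the additional similarity measures described in the abstract.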


Citations

The Conundrum of Sharing Research Data

TL;DR: Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: to reproduce or to verify research, to make results of publicly funded research available to the public, to enable others to ask new questions of extant data, and to advance the state of research and innovation.
Journal Article

Predicting three-dimensional genome organization with chromatin states.

TL;DR: Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations.
Journal Article

The variable quality of metadata about biological samples used in biomedical experiments.

TL;DR: Overall, the metadata the authors analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements, and significant aberrancies that are likely to impede search and secondary use of the associated datasets.
Journal Article

A dataset of egg size and shape from more than 6,700 insect species

TL;DR: A dataset of 10,449 morphological descriptions of insect eggs, with records for 6,706 unique insect species and representatives from every extant hexapod order is presented, created by partially automating the extraction of egg traits from the primary literature.
References
Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Journal Article

Cluster analysis and display of genome-wide expression patterns

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.
Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.
Journal Article

NCBI GEO: archive for functional genomics data sets—update

TL;DR: The Gene Expression Omnibus is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community and supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable.
Related Papers (5)

The FAIR Guiding Principles for scientific data management and stewardship