Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata.

doi:10.1186/S12859-017-1832-4

Open Access, Journal Article

Wei Hu, +3 more

- 18 Sep 2017 -

BMC Bioinformatics

- Vol. 18, Iss: 1, pp 1-12

TLDR

The intuition underpinning cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found.

Abstract:

The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because no structured vocabulary guides submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured, and complete descriptions of the data. In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three kinds of similarity measures for comparing metadata keys. Second, we design a scalable agglomerative clustering algorithm to group similar keys together. The algorithm identified metadata keys that were similar to each other based on (i) name, (ii) core concept, and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue, and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, 13 keys were not related to the cluster in which they were placed; the remaining keys were correctly identified as related to their cluster. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63).
Our algorithm identified keys that were similar to each other and grouped them together. The intuition underpinning cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.
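
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses only a name-based similarity (Python's `difflib.SequenceMatcher` as a stand-in), omits the core-concept and value similarities, and makes no attempt at scalability. The function names `name_similarity` and `cluster_keys`, the average-linkage rule, and the threshold of 0.6 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two key names after light normalization.

    Assumption: a plain string ratio stands in for the paper's name measure.
    """
    def norm(s: str) -> str:
        return s.lower().replace("_", " ").strip()

    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def cluster_keys(keys, threshold=0.6):
    """Naive average-linkage agglomerative clustering of metadata keys.

    Starts with one cluster per key and repeatedly merges the two most
    similar clusters until no pair reaches the (hypothetical) threshold.
    """
    clusters = [[k] for k in keys]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(name_similarity(a, b) for a, b in pairs) / len(pairs)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # remaining clusters are too dissimilar to merge
        i, j = best_pair  # j > i, so deleting j leaves index i valid
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    example_keys = ["age", "Age", "tissue", "tissue type", "cell line", "cell_line"]
    for cluster in cluster_keys(example_keys):
        print(cluster)
```

Once keys are grouped this way, a curator only needs to inspect each cluster for duplicates and misspellings rather than scanning the full key list, which is the intuition the abstract describes.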

Citations

The Conundrum of Sharing Research Data

Christine L Borgman

TL;DR: Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: to reproduce or to verify research, to make results of publicly funded research available to the public, to enable others to ask new questions of extant data, and to advance the state of research and innovation.

Journal Article

Predicting three-dimensional genome organization with chromatin states.

Yifeng Qi, +1 more

- 10 Jun 2019 -

PLOS Computational Biology

TL;DR: Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations.

Journal Article

The variable quality of metadata about biological samples used in biomedical experiments.

Rafael S. Gonçalves, +1 more

- 19 Feb 2019 -

Scientific Data

TL;DR: Overall, the metadata the authors analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements, and significant aberrancies that are likely to impede search and secondary use of the associated datasets.

Journal Article

A dataset of egg size and shape from more than 6,700 insect species

Samuel H. Church, +4 more

- 03 Jul 2019 -

Scientific Data

TL;DR: A dataset of 10,449 morphological descriptions of insect eggs, with records for 6,706 unique insect species and representatives from every extant hexapod order is presented, created by partially automating the extraction of egg traits from the primary literature.

Journal Article

Artificial intelligence enables comprehensive genome interpretation and nomination of candidate diagnoses for rare genetic diseases.

Francisco M. De La Vega, +29 more

- 14 Oct 2021 -

Genome Medicine

TL;DR: In this article, the authors assess the diagnostic performance of Fabric GEM, a new, AI-based, clinical decision support tool for expediting the diagnosis of rare genetic diseases.

References

Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

Martin Ester, +3 more

TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.

Journal Article

Cluster analysis and display of genome-wide expression patterns

Michael B. Eisen, +3 more

- 08 Dec 1998 -

Proceedings of the National Academy of S...

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.

Proceedings Article

A density-based algorithm for discovering clusters in large spatial databases with noise

Martin Ester, +3 more

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.

Journal Article

NCBI GEO: archive for functional genomics data sets—update

Tanya Barrett, +16 more

- 27 Nov 2012 -

Nucleic Acids Research

TL;DR: The Gene Expression Omnibus is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community and supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable.

Journal Article

Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.

Alvis Brazma, +23 more

- 01 Dec 2001 -

Nature Genetics

TL;DR: The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools.
