Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata.
TLDR
The intuition underpinning cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can then easily be found.
Abstract:
The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were unrelated to the cluster in which they were placed. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63).
The intuition that underpins cleaning by clustering is that dividing keys into clusters resolves the scalability issues of data observation and cleaning, and that duplicates and errors among keys in the same cluster can then easily be found. Our algorithm can also be applied to other biomedical data types.
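As a rough illustration of the idea in the abstract, the sketch below groups metadata keys by agglomerative clustering. It is not the authors' algorithm: it uses only a name-based similarity (the abstract also mentions core-concept and value similarities), computes it with Python's standard `difflib.SequenceMatcher`, and merges clusters by average linkage until no pair exceeds a threshold. The function names, the normalization step, and the threshold of 0.6 are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two key names.

    Assumption: keys are compared after lowercasing and treating
    underscores as spaces; the paper's name measure may differ.
    """
    def norm(s: str) -> str:
        return s.lower().replace("_", " ").strip()

    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def cluster_keys(keys, threshold=0.6):
    """Naive average-linkage agglomerative clustering of key names.

    Starts with one cluster per key and repeatedly merges the most
    similar pair of clusters until the best average similarity falls
    below `threshold` (a hypothetical cut-off chosen for this sketch).
    """
    clusters = [[k] for k in keys]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(name_similarity(a, b) for a, b in pairs) / len(pairs)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    example_keys = ["age", "Age", "tissue", "tissue type", "cell line", "cell_line"]
    for cluster in cluster_keys(example_keys):
        print(cluster)
```

This version recomputes all pairwise scores at every step, so it only suits small key sets; the scalability the abstract refers to would need a more efficient design. Once keys are grouped, duplicates such as `cell line` and `cell_line` sit side by side and can be reviewed within one cluster.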
Citations
The Conundrum of Sharing Research Data
TL;DR: Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: to reproduce or to verify research, to make results of publicly funded research available to the public, to enable others to ask new questions of extant data, and to advance the state of research and innovation.
Journal Article
Predicting three-dimensional genome organization with chromatin states.
Yifeng Qi, Bin Zhang
TL;DR: Analysis of the model’s energy function uncovers distinct mechanisms for chromatin folding at various length scales and suggests a need to go beyond simple A/B compartment types to predict specific contacts between regulatory elements using polymer simulations.
Journal Article
The variable quality of metadata about biological samples used in biomedical experiments.
TL;DR: Overall, the metadata the authors analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements, and significant aberrancies that are likely to impede search and secondary use of the associated datasets.
Journal Article
A dataset of egg size and shape from more than 6,700 insect species
TL;DR: A dataset of 10,449 morphological descriptions of insect eggs, with records for 6,706 unique insect species and representatives from every extant hexapod order is presented, created by partially automating the extraction of egg traits from the primary literature.
Journal Article
Artificial intelligence enables comprehensive genome interpretation and nomination of candidate diagnoses for rare genetic diseases.
Francisco M. De La Vega, Shimul Chowdhury, Barry Moore, Erwin Frise, Jeanette McCarthy, Edgar Javier Hernandez, Terence C. Wong, Kiely N. James, Lucia Guidugli, Pankaj B. Agrawal, Casie A. Genetti, Catherine A. Brownstein, Alan H. Beggs, Britt Sabina Löscher, Andre Franke, Braden E. Boone, Shawn Levy, Katrin Õunap, Sander Pajusalu, Matthew J. Huentelman, Keri Ramsey, Marcus Naymik, Vinodh Narayanan, Narayanan Veeraraghavan, Paul Billings, Martin G. Reese, Mark Yandell, Stephen F. Kingsmore
TL;DR: In this article, the authors assess the diagnostic performance of Fabric GEM, a new, AI-based, clinical decision support tool for expediting the diagnosis of rare genetic diseases.
References
Proceedings Article
A density-based algorithm for discovering clusters in large spatial databases with noise
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Journal Article
Cluster analysis and display of genome-wide expression patterns
TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.
Proceedings Article
A density-based algorithm for discovering clusters in large spatial databases with noise
TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.
Journal Article
NCBI GEO: archive for functional genomics data sets—update
Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F. Kim, Maxim Tomashevsky, Kimberly A. Marshall, Katherine Phillippy, Patti M. Sherman, Michelle Holko, Andrey Yefanov, Hye Seung Lee, Naigong Zhang, Cynthia L. Robertson, Nadezhda Serova, Sean Davis, Alexandra Soboleva
TL;DR: The Gene Expression Omnibus is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community and supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable.
Journal Article
Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul T. Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, Terry Gaasterland, Patrick Glenisson, Frank C. P. Holstege, Irene F. Kim, Victor Markowitz, John C. Matese, Helen Parkinson, Alan J. Robinson, Ugis Sarkans, Steffen Schulze-Kremer, Jason E. Stewart, Ronald C. Taylor, Jaak Vilo, Martin Vingron
TL;DR: The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools.
Related Papers (5)
The FAIR Guiding Principles for scientific data management and stewardship
Mark Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Olavo Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Timothy Clark, Mercè Crosas, Ingrid Dillo, Olivier G. Dumon, Scott C. Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A C 't Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost N. Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Anthony Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik M. van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, Barend Mons