Proceedings ArticleDOI

MetaStore: A metadata framework for scientific data repositories

TL;DR: To handle heterogeneous metadata models and standards, the MetaStore framework automatically generates the necessary software code (services) and extends its own functionality; it also allows full-text search over metadata through automated creation of indexes.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data and metadata. Metadata is critical for scientific research, as it enables discovering, analysing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments is heterogeneous and subject to frequent changes, demanding a flexible data model. Currently, no adaptive, generic solution exists that is capable of handling heterogeneous metadata models. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database. To handle heterogeneous metadata models and standards, MetaStore automatically generates the necessary software code (services) and extends the functionality of the framework. To leverage the functionality of NoSQL databases, the MetaStore framework allows full-text search over metadata through automated creation of indexes. Finally, a dedicated REST service is provided for efficient harvesting (sharing) of metadata using the METS metadata standard over the OAI-PMH protocol.
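As a rough illustration of the harvesting interface mentioned in the abstract, the sketch below shows how a client might pull METS records from an OAI-PMH endpoint and follow resumption tokens. The base URL is a hypothetical placeholder; the actual MetaStore endpoint, supported metadata prefixes, and sets are not given in the text.

```python
# Minimal OAI-PMH harvesting sketch (assumed endpoint; not MetaStore's real URL).
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://example.org/metastore/oaipmh"  # hypothetical endpoint


def list_records(metadata_prefix="mets"):
    """Yield (identifier, metadata element) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(BASE_URL, params=params).content)
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            yield header.findtext(f"{OAI}identifier"), record.find(f"{OAI}metadata")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}


for identifier, mets_record in list_records():
    print(identifier)
```

ListRecords with a metadataPrefix of mets and resumption-token paging is standard OAI-PMH behaviour; everything specific to MetaStore here is assumed.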
Citations
01 Jan 2006
TL;DR: The main purpose of this paper is to provide an initial basis for establishing clear recommendations for the use of SKOS Core and DCMI Metadata Terms in combination.
Abstract: This paper introduces SKOS Core, an RDF vocabulary for expressing the basic structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary). SKOS Core is published and maintained by the W3C Semantic Web Best Practices and Deployment Working Group. The main purpose of this paper is to provide an initial basis for establishing clear recommendations for the use of SKOS Core and DCMI Metadata Terms in combination. Also discussed are management policies for SKOS Core and other RDF vocabularies, and the relationship between a “SKOS concept scheme” and an “RDFS/OWL Ontology”.
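To make the "concept scheme" notion concrete, here is a small sketch that builds a two-concept SKOS scheme with rdflib and serialises it as Turtle. The vocabulary terms and the example.org namespace are invented for illustration and are not taken from the paper.

```python
# Tiny SKOS concept scheme built with rdflib (invented example vocabulary).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, SKOS

EX = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

scheme = EX.instrumentTypes
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, DCTERMS.title, Literal("Instrument types", lang="en")))

microscope = EX.microscope
g.add((microscope, RDF.type, SKOS.Concept))
g.add((microscope, SKOS.prefLabel, Literal("Microscope", lang="en")))
g.add((microscope, SKOS.inScheme, scheme))

electron = EX.electronMicroscope
g.add((electron, RDF.type, SKOS.Concept))
g.add((electron, SKOS.prefLabel, Literal("Electron microscope", lang="en")))
g.add((electron, SKOS.broader, microscope))  # hierarchy within the scheme
g.add((electron, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```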

68 citations

Proceedings ArticleDOI
17 Jan 2021
TL;DR: In this paper, the authors provide an overview of state-of-the-art approaches at the foundations of big data lake research, along with open problems and issues that drive future research directions.
Abstract: Nowadays, big data lakes are prominent components of emerging big data architectures. Big data lakes are the natural evolution of data warehousing systems in the big data context, and they address several requirements deriving from the well-known 3V nature of big data. Along with the emergence of the big data lake research initiative, several issues have appeared, such as: (i) big data lake models; (ii) big data lake frameworks; (iii) big data lake techniques. In line with this research perspective, this paper provides an overview of state-of-the-art approaches at the foundations of big data lake research, together with open problems and issues that drive future research directions for advancing the big data lake research trend.

15 citations

Journal ArticleDOI
TL;DR: The MASi research data management service is currently being prepared to go into production to satisfy the complex and varying requirements in an efficient, useable and sustainable way.

13 citations

Journal ArticleDOI
TL;DR: MetaStore is an adaptive metadata management framework based on a NoSQL database and an RDF triple store that automatically segregates the different categories of metadata into their corresponding data models to maximize the utilization of the data models supported by NoSQL databases.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data, and the handling of associated metadata is critical, as it enables discovering, analyzing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments are heterogeneous and subject to frequent changes, demanding a flexible data model. Existing metadata management systems provide a broad range of features for handling scientific metadata. However, the principal limitation of these systems is an architecture design that is restricted to a single or, at most, a few standard metadata models. Support for handling different types of metadata models, i.e., administrative, descriptive, structural, and provenance metadata, as well as community-specific metadata models, is not possible with these systems. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database and an RDF triple store. MetaStore provides a set of core functionalities to handle heterogeneous metadata models by automatically generating the necessary software code (services) and extending the functionality of the framework on the fly. To handle dynamic metadata and to control metadata quality, MetaStore also provides an extended set of functionalities, such as enabling annotation of images and text by integrating the Web Annotation Data Model, allowing communities to define discipline-specific vocabularies using the Simple Knowledge Organization System (SKOS), and providing advanced search and analytical capabilities by integrating ElasticSearch. To maximize the utilization of the data models supported by NoSQL databases, MetaStore automatically segregates the different categories of metadata into their corresponding data models. Complex provenance graphs and dynamic metadata are modeled and stored in an RDF triple store, whereas the static metadata is stored in a NoSQL database. To enable large-scale harvesting (sharing) of metadata using the METS standard over the OAI-PMH protocol, MetaStore is designed to be OAI-compliant. Finally, to show the practical usability of the MetaStore framework and that the requirements of the research communities have been realized, we describe our experience in the adoption of MetaStore for three communities.
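The segregation described in this abstract, routing static metadata to a document store and provenance or other dynamic metadata to a triple store, can be pictured with the following sketch. The in-memory dict and rdflib Graph stand in for the actual NoSQL database and RDF triple store, and the category names and routing rule are simplified assumptions rather than MetaStore's real interface.

```python
# Sketch of category-based metadata segregation (stand-in backends, assumed categories).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")

document_store = {}      # stand-in for the NoSQL document database
triple_store = Graph()   # stand-in for the RDF triple store
triple_store.bind("prov", PROV)

STATIC_CATEGORIES = {"administrative", "descriptive", "structural"}


def ingest(record_id, category, payload):
    """Route one metadata record to the backend that fits its category."""
    if category in STATIC_CATEGORIES:
        # Static metadata is kept as a document, one per record and category.
        document_store.setdefault(record_id, {})[category] = payload
    elif category == "provenance":
        # Dynamic provenance metadata becomes triples about an activity node.
        activity = URIRef(f"http://example.org/activity/{record_id}")
        triple_store.add((activity, RDF.type, PROV.Activity))
        for key, value in payload.items():
            triple_store.add((activity, PROV[key], Literal(value)))
    else:
        raise ValueError(f"unknown metadata category: {category}")


ingest("exp-001", "descriptive", {"title": "Tomography scan", "creator": "Lab A"})
ingest("exp-001", "provenance", {"wasAssociatedWith": "beamline-controller"})

print(document_store)
print(triple_store.serialize(format="turtle"))
```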

10 citations


Cites methods from "MetaStore: A metadata framework for..."

  • ...Contributions In the previous version of MetaStore, we established the core functionality of the framework required for handling static metadata [20]....

    [...]

References
Book ChapterDOI

[...]

01 Jan 2012

139,059 citations


"MetaStore: A metadata framework for..." refers methods in this paper

  • ...The XMC Cat metadata catalog for the LEAD cyberinfrastructure [20] follows a hybrid XML/relational approach that stores the XML metadata as a Character Large Object (CLOB) and further shreds the XML using inlining [21] and stores it in an RDBMS schema to enable execution of complex queries....

    [...]

01 Jan 2007
TL;DR: The continuity of the basic conceptual model between Abstract and Executable Processes in WSBPEL makes it possible to export and import the public aspects embodied in Abstract Processes as process or role templates while maintaining the intent and structure of the observable behavior.

2,640 citations



Proceedings Article
07 Sep 1999
TL;DR: It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases.
Abstract: XML is fast emerging as the dominant standard for representing data in the World Wide Web. Sophisticated query engines that allow users to effectively tap the data stored in XML documents will be crucial to exploiting the full power of XML. While there has been a great deal of activity recently proposing new semistructured data models and query languages for this purpose, this paper explores the more conservative approach of using traditional relational database engines for processing XML documents conforming to Document Type Descriptors (DTDs). To this end, we have developed algorithms and implemented a prototype system that converts XML documents to relational tuples, translates semi-structured queries over XML documents to SQL queries over tables, and converts the results to XML. We have qualitatively evaluated this approach using several real DTDs drawn from diverse domains. It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases. We identify the causes for these limitations and propose certain extensions to the relational model that would make it more appropriate for processing queries over XML documents.
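The relational approach this abstract describes can be illustrated with a toy shredding step: element content is flattened into rows so that a path-style query becomes SQL. The sample document and table layout below are invented; real inlining algorithms derive the schema from the DTD.

```python
# Toy XML-to-relational shredding example (invented document and schema).
import sqlite3
import xml.etree.ElementTree as ET

XML_DOC = """
<catalog>
  <book id="b1"><title>XML and Databases</title><year>1999</year></book>
  <book id="b2"><title>Query Processing</title><year>2001</year></book>
</catalog>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id TEXT PRIMARY KEY, title TEXT, year INTEGER)")

# Shred: one row per <book> element, with child elements inlined as columns.
for book in ET.fromstring(XML_DOC).findall("book"):
    conn.execute(
        "INSERT INTO book VALUES (?, ?, ?)",
        (book.get("id"), book.findtext("title"), int(book.findtext("year"))),
    )

# A path query such as /catalog/book[year > 2000]/title becomes plain SQL.
for (title,) in conn.execute("SELECT title FROM book WHERE year > 2000"):
    print(title)
```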

1,111 citations

Posted Content
TL;DR: Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.

476 citations


"MetaStore: A metadata framework for..." refers background in this paper

  • ...Metadata is crucial for managing the complete life-cycle of scientific data, for example, automating scientific analysis workflows, enabling data access, enhancing data interpretation by visual exploration and creating metadata-aware scientific tools [1]....

    [...]

Journal ArticleDOI
01 Dec 2005
TL;DR: In this article, the authors propose algorithms that can simultaneously deal with huge datasets and that can find very subtle effects, finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: Scientific instruments and computer simulations are creating vast data stores that require new scientific methods to analyze and organize the data. Data volumes are approximately doubling each year. Since these new instruments have extraordinary precision, the data quality is also rapidly improving. Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.

432 citations