Proceedings ArticleDOI

MetaStore: A metadata framework for scientific data repositories

TL;DR: To handle heterogeneous metadata models and standards, the MetaStore framework automatically generates the necessary software code (services) and extends its own functionality; it also allows full-text search over metadata through automated creation of indexes.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data and metadata. Metadata is critical for scientific research, as it enables discovering, analysing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments is heterogeneous and subject to frequent changes, demanding a flexible data model. Currently, no adaptive, generic solution exists that is capable of handling heterogeneous metadata models. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database. To handle heterogeneous metadata models and standards, MetaStore automatically generates the necessary software code (services) and extends the functionality of the framework. To leverage the functionality of NoSQL databases, the MetaStore framework allows full-text search over metadata through automated creation of indexes. Finally, a dedicated REST service is provided for efficient harvesting (sharing) of metadata using the METS metadata standard over the OAI-PMH protocol.
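As a rough illustration of the harvesting interface mentioned in the abstract, the sketch below shows how a client might pull METS records from an OAI-PMH endpoint and follow resumption tokens. The base URL is a hypothetical placeholder; the actual MetaStore endpoint, supported metadata prefixes, and sets are not given in the text.

```python
# Minimal OAI-PMH harvesting sketch (assumed endpoint; not MetaStore's real URL).
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://example.org/metastore/oaipmh"  # hypothetical endpoint


def list_records(metadata_prefix="mets"):
    """Yield (identifier, metadata element) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(BASE_URL, params=params).content)
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            yield header.findtext(f"{OAI}identifier"), record.find(f"{OAI}metadata")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}


for identifier, mets_record in list_records():
    print(identifier)
```

ListRecords with a metadataPrefix of mets and resumption-token paging is standard OAI-PMH behaviour; everything specific to MetaStore here is assumed.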
Citations
01 Jan 2006
TL;DR: The main purpose of this paper is to provide an initial basis for establishing clear recommendations for the use of SKOS Core and DCMI Metadata Terms in combination.
Abstract: This paper introduces SKOS Core, an RDF vocabulary for expressing the basic structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary). SKOS Core is published and maintained by the W3C Semantic Web Best Practices and Deployment Working Group. The main purpose of this paper is to provide an initial basis for establishing clear recommendations for the use of SKOS Core and DCMI Metadata Terms in combination. Also discussed are management policies for SKOS Core and other RDF vocabularies, and the relationship between a “SKOS concept scheme” and an “RDFS/OWL Ontology”.
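To make the "concept scheme" notion concrete, here is a small sketch that builds a two-concept SKOS scheme with rdflib and serialises it as Turtle. The vocabulary terms and the example.org namespace are invented for illustration and are not taken from the paper.

```python
# Tiny SKOS concept scheme built with rdflib (invented example vocabulary).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, SKOS

EX = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

scheme = EX.instrumentTypes
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, DCTERMS.title, Literal("Instrument types", lang="en")))

microscope = EX.microscope
g.add((microscope, RDF.type, SKOS.Concept))
g.add((microscope, SKOS.prefLabel, Literal("Microscope", lang="en")))
g.add((microscope, SKOS.inScheme, scheme))

electron = EX.electronMicroscope
g.add((electron, RDF.type, SKOS.Concept))
g.add((electron, SKOS.prefLabel, Literal("Electron microscope", lang="en")))
g.add((electron, SKOS.broader, microscope))  # hierarchy within the scheme
g.add((electron, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```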

68 citations

Proceedings ArticleDOI
17 Jan 2021
TL;DR: In this paper, the authors provide an overview of state-of-the-art approaches at the foundations of big data lake research, along with open problems and issues that drive future research directions.
Abstract: Nowadays, big data lakes are prominent components of emerging big data architectures. Big data lakes are the natural evolution of data warehousing systems in the big data context, and they address several requirements deriving from the well-known 3V nature of big data. Along with the emergence of the big data lake research initiative, several issues have appeared, such as: (i) big data lake models; (ii) big data lake frameworks; (iii) big data lake techniques. In line with this research perspective, this paper provides an overview of state-of-the-art approaches at the foundations of big data lake research, together with open problems and issues that drive future research directions for advancing the big data lake research trend.

15 citations

Journal ArticleDOI
TL;DR: The MASi research data management service is currently being prepared to go into production to satisfy the complex and varying requirements in an efficient, useable and sustainable way.

13 citations

Journal ArticleDOI
TL;DR: MetaStore is an adaptive metadata management framework based on a NoSQL database and an RDF triple store that automatically segregates the different categories of metadata into their corresponding data models to maximize the utilization of the data models supported by NoSQL databases.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data, and the handling of associated metadata is critical, as it enables discovering, analyzing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments are heterogeneous and subject to frequent changes, demanding a flexible data model. Existing metadata management systems provide a broad range of features for handling scientific metadata. However, the principal limitation of these systems is an architecture design that is restricted to a single or, at most, a few standard metadata models. Support for handling different types of metadata models, i.e., administrative, descriptive, structural, and provenance metadata, as well as community-specific metadata models, is not possible with these systems. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database and an RDF triple store. MetaStore provides a set of core functionalities to handle heterogeneous metadata models by automatically generating the necessary software code (services) and extending the functionality of the framework on the fly. To handle dynamic metadata and to control metadata quality, MetaStore also provides an extended set of functionalities, such as enabling annotation of images and text by integrating the Web Annotation Data Model, allowing communities to define discipline-specific vocabularies using the Simple Knowledge Organization System (SKOS), and providing advanced search and analytical capabilities by integrating ElasticSearch. To maximize the utilization of the data models supported by NoSQL databases, MetaStore automatically segregates the different categories of metadata into their corresponding data models. Complex provenance graphs and dynamic metadata are modeled and stored in an RDF triple store, whereas the static metadata is stored in a NoSQL database. To enable large-scale harvesting (sharing) of metadata using the METS standard over the OAI-PMH protocol, MetaStore is designed to be OAI-compliant. Finally, to show the practical usability of the MetaStore framework and that the requirements of the research communities have been realized, we describe our experience in the adoption of MetaStore for three communities.
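The segregation described in this abstract, routing static metadata to a document store and provenance or other dynamic metadata to a triple store, can be pictured with the following sketch. The in-memory dict and rdflib Graph stand in for the actual NoSQL database and RDF triple store, and the category names and routing rule are simplified assumptions rather than MetaStore's real interface.

```python
# Sketch of category-based metadata segregation (stand-in backends, assumed categories).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")

document_store = {}      # stand-in for the NoSQL document database
triple_store = Graph()   # stand-in for the RDF triple store
triple_store.bind("prov", PROV)

STATIC_CATEGORIES = {"administrative", "descriptive", "structural"}


def ingest(record_id, category, payload):
    """Route one metadata record to the backend that fits its category."""
    if category in STATIC_CATEGORIES:
        # Static metadata is kept as a document, one per record and category.
        document_store.setdefault(record_id, {})[category] = payload
    elif category == "provenance":
        # Dynamic provenance metadata becomes triples about an activity node.
        activity = URIRef(f"http://example.org/activity/{record_id}")
        triple_store.add((activity, RDF.type, PROV.Activity))
        for key, value in payload.items():
            triple_store.add((activity, PROV[key], Literal(value)))
    else:
        raise ValueError(f"unknown metadata category: {category}")


ingest("exp-001", "descriptive", {"title": "Tomography scan", "creator": "Lab A"})
ingest("exp-001", "provenance", {"wasAssociatedWith": "beamline-controller"})

print(document_store)
print(triple_store.serialize(format="turtle"))
```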

10 citations


Cites methods from "MetaStore: A metadata framework for..."

  • ...Contributions In the previous version of MetaStore, we established the core functionality of the framework required for handling static metadata [20]....

    [...]

References
Book ChapterDOI

[...]

01 Jan 2012

139,059 citations


"MetaStore: A metadata framework for..." refers methods in this paper

  • ...The XMC Cat metadata catalog for the LEAD cyberinfrastructure [20] follows a hybrid XML/relational approach that stores the XML metadata as a Character Large Object (CLOB) and further shreds the XML using inlining [21] and stores it in an RDBMS schema to enable execution of complex queries....

    [...]

01 Jan 2007
TL;DR: The continuity of the basic conceptual model between Abstract and Executable Processes in WSBPEL makes it possible to export and import the public aspects embodied in Abstract Processes as process or role templates while maintaining the intent and structure of the observable behavior.

2,640 citations



Proceedings Article
07 Sep 1999
TL;DR: It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases.
Abstract: XML is fast emerging as the dominant standard for representing data in the World Wide Web. Sophisticated query engines that allow users to effectively tap the data stored in XML documents will be crucial to exploiting the full power of XML. While there has been a great deal of activity recently proposing new semistructured data models and query languages for this purpose, this paper explores the more conservative approach of using traditional relational database engines for processing XML documents conforming to Document Type Descriptors (DTDs). To this end, we have developed algorithms and implemented a prototype system that converts XML documents to relational tuples, translates semi-structured queries over XML documents to SQL queries over tables, and converts the results to XML. We have qualitatively evaluated this approach using several real DTDs drawn from diverse domains. It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases. We identify the causes for these limitations and propose certain extensions to the relational model that would make it more appropriate for processing queries over XML documents.
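The relational approach this abstract describes can be illustrated with a toy shredding step: element content is flattened into rows so that a path-style query becomes SQL. The sample document and table layout below are invented; real inlining algorithms derive the schema from the DTD.

```python
# Toy XML-to-relational shredding example (invented document and schema).
import sqlite3
import xml.etree.ElementTree as ET

XML_DOC = """
<catalog>
  <book id="b1"><title>XML and Databases</title><year>1999</year></book>
  <book id="b2"><title>Query Processing</title><year>2001</year></book>
</catalog>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id TEXT PRIMARY KEY, title TEXT, year INTEGER)")

# Shred: one row per <book> element, with child elements inlined as columns.
for book in ET.fromstring(XML_DOC).findall("book"):
    conn.execute(
        "INSERT INTO book VALUES (?, ?, ?)",
        (book.get("id"), book.findtext("title"), int(book.findtext("year"))),
    )

# A path query such as /catalog/book[year > 2000]/title becomes plain SQL.
for (title,) in conn.execute("SELECT title FROM book WHERE year > 2000"):
    print(title)
```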

1,111 citations

Posted Content
TL;DR: Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.

476 citations


"MetaStore: A metadata framework for..." refers background in this paper

  • ...Metadata is crucial for managing the complete life-cycle of scientific data, for example, automating scientific analysis workflows, enabling data access, enhancing data interpretation by visual exploration and creating metadata-aware scientific tools [1]....

    [...]

Journal ArticleDOI
01 Dec 2005
TL;DR: In this article, the authors propose algorithms that can simultaneously deal with huge datasets and that can find very subtle effects, finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.
Abstract: Scientific instruments and computer simulations are creating vast data stores that require new scientific methods to analyze and organize the data. Data volumes are approximately doubling each year. Since these new instruments have extraordinary precision, the data quality is also rapidly improving. Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.

432 citations