
Showing papers on "Meta Data Services published in 2016"


Journal ArticleDOI
TL;DR: In this paper, the authors describe their experience developing data commons, interoperable infrastructure that collocates data, storage, and compute with common analysis tools; common requirements include persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons.
Abstract: As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science life cycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and reanalysis as new data is added and scientific pipelines are refined. The authors describe their experience developing data commons: interoperable infrastructure that collocates data, storage, and compute with common analysis tools. Across the presented case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Although many challenges, including sustainability and developing appropriate standards, remain, interoperable data commons bring us one step closer to effective data science as a service for the scientific research community.

55 citations


Journal ArticleDOI
30 Dec 2016
TL;DR: GEMMS, a Generic and Extensible Metadata Management System for data lakes, aims at the automatic extraction of metadata from a wide variety of data sources and manages it in an extensible metamodel that distinguishes structural and semantic metadata.
Abstract: In addition to volume and velocity, Big Data is also characterized by its variety. Variety in structure and semantics requires new integration approaches which can resolve the integration challenges also for large volumes of data. Data lakes should reduce the upfront integration costs and provide a more flexible way for data integration and analysis, as source data is loaded in its original structure into the data lake repository. Some syntactic transformation might be applied to enable access to the data in one common repository; however, a deep semantic integration is done only after the initial loading of the data into the data lake. Thereby, data is easily made available and can be restructured, aggregated, and transformed as required by later applications. Metadata management is a crucial component in a data lake, as the source data needs to be described by metadata to capture its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (called GEMMS) that aims at the automatic extraction of metadata from a wide variety of data sources. Furthermore, the metadata is managed in an extensible metamodel that distinguishes structural and semantic metadata. The use case applied for evaluation is from the life science domain, where the data is often stored only in files, which hinders data access and efficient querying. The GEMMS framework has been proven to be useful in this domain. Especially, the extensibility and flexibility of the framework are important, as data and metadata structures in scientific experiments cannot be defined a priori.
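
A minimal sketch of a metamodel that separates structural from semantic metadata per ingested source, in the spirit of the distinction GEMMS draws. The class names, fields, and example values are invented for illustration and are not the actual GEMMS metamodel.

    # Hypothetical sketch of a metamodel separating structural from semantic
    # metadata for files in a data lake (not the actual GEMMS code).
    from dataclasses import dataclass, field
    from typing import Dict, List


    @dataclass
    class StructuralMetadata:
        """Describes how the source data is laid out (schema-level facts)."""
        file_format: str                                       # e.g. "csv", "xlsx"
        fields: Dict[str, str] = field(default_factory=dict)   # column -> type


    @dataclass
    class SemanticMetadata:
        """Describes what the data means (domain-level annotations)."""
        annotations: Dict[str, str] = field(default_factory=dict)  # term -> vocabulary IRI


    @dataclass
    class DataUnit:
        """One ingested source file plus its extracted metadata."""
        source_path: str
        structure: StructuralMetadata
        semantics: SemanticMetadata


    lake_catalog: List[DataUnit] = [
        DataUnit(
            source_path="s3://lake/raw/experiment_042.csv",
            structure=StructuralMetadata("csv", {"sample_id": "string", "ph": "float"}),
            semantics=SemanticMetadata({"ph": "http://example.org/ontology/pH"}),
        )
    ]
    print(lake_catalog[0].structure.fields)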

51 citations


Journal ArticleDOI
TL;DR: Conceptual and technical guidance is provided to overcome the challenges associated with the collection, organization, and storage of metadata in a neurophysiology laboratory, and an adaptable workflow to accumulate, structure, and store metadata from different sources is suggested.
Abstract: To date, non-reproducibility of neurophysiological research is a matter of intense discussion in the scientific community. A crucial component to enhance reproducibility is to comprehensively collect and store metadata, that is, all information about the experiment, the data, and the applied preprocessing steps on the data, such that they can be accessed and shared in a consistent and simple manner. However, the complexity of experiments, the highly specialized analysis workflows, and a lack of knowledge on how to make use of supporting software tools often overburden researchers attempting such detailed documentation. For this reason, the collected metadata are often incomplete, incomprehensible to outsiders, or ambiguous. Based on our research experience in dealing with diverse datasets, we here provide conceptual and technical guidance to overcome the challenges associated with the collection, organization, and storage of metadata in a neurophysiology laboratory. Through the concrete example of managing the metadata of a complex experiment that yields multi-channel recordings from monkeys performing a behavioral motor task, we practically demonstrate the implementation of these approaches and solutions with the intention that they may be generalized to other projects. Moreover, we detail five use cases that demonstrate the resulting benefits of constructing a well-organized metadata collection when processing or analyzing the recorded data, in particular when these are shared between laboratories in a modern scientific collaboration. Finally, we suggest an adaptable workflow to accumulate, structure, and store metadata from different sources using, by way of example, the odML metadata framework.
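
As a concrete illustration of the kind of hierarchical metadata collection suggested above, here is a minimal sketch using the python-odml package that implements the odML framework. The example sections, properties, and values are hypothetical, and the argument names follow python-odml around version 1.4, so they may differ in other versions.

    # Minimal sketch of collecting experiment metadata with odML, assuming the
    # python-odml package (API ~1.4; argument names may differ across versions).
    import odml

    doc = odml.Document(author="Lab A", version="1.0")

    recording = odml.Section(name="Recording", type="experiment", parent=doc)
    odml.Property(name="Subject", values="monkey L", parent=recording)
    odml.Property(name="NumChannels", values=96, parent=recording)

    preprocessing = odml.Section(name="Preprocessing", type="analysis", parent=doc)
    odml.Property(name="BandpassHz", values=[0.5, 300.0], parent=preprocessing)

    # Walk the hierarchical structure that would later be shared with the data.
    for section in doc.sections:
        for prop in section.properties:
            print(section.name, prop.name, prop.values)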

34 citations


Patent
12 Dec 2016
TL;DR: In this article, techniques for metadata processing that can be used to encode an arbitrary number of security policies for code running on a processor are described, along with aspects and techniques of metadata processing in an embodiment based on the RISC-V architecture.
Abstract: Techniques are described for metadata processing that can be used to encode an arbitrary number of security policies for code running on a processor. Metadata may be added to every word in the system and a metadata processing unit may be used that works in parallel with data flow to enforce an arbitrary set of policies. In one aspect, the metadata may be characterized as unbounded and software programmable to be applicable to a wide range of metadata processing policies. Techniques and policies have a wide range of uses including, for example, safety, security, and synchronization. Additionally, described are aspects and techniques in connection with metadata processing in an embodiment based on the RISC-V architecture.
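
A toy sketch of the general idea of policy-driven metadata processing: every word carries a tag, and a software-defined policy decides whether an operation is allowed and what tag its result receives. This is an illustration of the concept only, not the patented hardware design or its RISC-V embodiment; the taint policy and all names are invented.

    # Toy illustration (not the patented design): every word carries a metadata
    # tag, and a software-defined policy decides whether an operation is allowed
    # and which tag its result gets.
    from typing import Dict, Tuple

    Tag = frozenset  # a set of policy labels attached to a word

    def taint_policy(op: str, tag_a: Tag, tag_b: Tag) -> Tuple[bool, Tag]:
        """Example policy: taint propagates, and tainted words may not be jump targets."""
        result_tag = Tag(tag_a | tag_b)
        if op == "jump" and "tainted" in tag_a:
            return False, result_tag        # policy violation -> trap
        return True, result_tag

    memory_tags: Dict[int, Tag] = {0x1000: Tag({"tainted"}), 0x1004: Tag()}

    allowed, out_tag = taint_policy("add", memory_tags[0x1000], memory_tags[0x1004])
    print(allowed, sorted(out_tag))          # True ['tainted'] -- taint propagates
    allowed, _ = taint_policy("jump", memory_tags[0x1000], Tag())
    print(allowed)                           # False: tainted jump target rejected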

32 citations


Proceedings ArticleDOI
01 Dec 2016
TL;DR: This work formally defines a metadata management process which identifies the key activities required to effectively handle information profiling, and demonstrates the value and feasibility of this approach using a prototype implementation handling a real-life case study from the OpenML DL.
Abstract: There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently a lack of a systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to effectively handle information profiling. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
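
A rough sketch of what an information profile for one raw file could look like, computed here with pandas and kept as metadata for later analysis. The profiled attributes (dtype, distinct count, null ratio) are illustrative choices, not the profiling process defined in the paper.

    # Rough sketch of "information profiling": compute simple per-column profiles
    # for a raw file and keep them as metadata that supports later analysis.
    import json
    import pandas as pd

    def profile_dataframe(df: pd.DataFrame) -> dict:
        return {
            column: {
                "dtype": str(df[column].dtype),
                "distinct": int(df[column].nunique()),
                "null_ratio": float(df[column].isna().mean()),
            }
            for column in df.columns
        }

    df = pd.DataFrame({"city": ["Oslo", "Lyon", None], "temp_c": [3.1, 7.4, 5.0]})
    profile = profile_dataframe(df)
    print(json.dumps(profile, indent=2))   # stored alongside the file as metadata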

26 citations


Posted Content
TL;DR: An architecture for data commons is described, as well as some lessons learned from operating several large-scale data commons.
Abstract: As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis, as new data are added and as scientific pipelines are refined. We describe our experience developing data commons (interoperable infrastructure that co-locates data, storage, and compute with common analysis tools) and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Though many challenges, including sustainability and developing appropriate standards, remain, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.

24 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper presents a comprehensive classification of all the metadata required to provide user support in KDD and presents the implementation of a metadata repository for storing and managing this metadata and explains its benefits in a real Big Data analytics project.
Abstract: Once analyzed correctly, data can yield substantial benefits. The process of analyzing the data and transforming it into knowledge is known as Knowledge Discovery in Databases (KDD). The plethora and subtleties of algorithms in the different steps of KDD render it challenging. Effective user support is of crucial importance, even more so now, when the analysis is performed on Big Data. Metadata is the necessary component to drive user support. In this paper we study the metadata required to provide user support at every stage of the KDD process. We show that intelligent systems addressing the problem of user assistance in KDD are incomplete in this regard. They do not use the whole potential of metadata to enable assistance during the whole process. We present a comprehensive classification of all the metadata required to provide user support. Furthermore, we present our implementation of a metadata repository for storing and managing this metadata and explain its benefits in a real Big Data analytics project.

24 citations


01 Jan 2016
TL;DR: This paper introduces an approach that relies on declarative descriptions of (i) mapping rules, specifying how the RDF data is generated, and of (ii) raw data access interfaces to automatically and incrementally generate provenance and metadata information.
Abstract: Provenance and other metadata are essential for determining ownership and trust. Nevertheless, no systematic approaches were introduced so far in the Linked Data publishing workflow to capture them. Defining such metadata remained independent of the RDF data generation and publishing. In most cases, metadata is manually defined by the data publishers (person-agents), rather than produced by the involved applications (software-agents). Moreover, the generated RDF data and the published one are considered to be one and the same, which is not always the case, leading to pure, condensed and often seductive information. This paper introduces an approach that relies on declarative descriptions of (i) mapping rules, specifying how the RDF data is generated, and of (ii) raw data access interfaces to automatically and incrementally generate provenance and metadata information. This way, it is assured that the metadata information is accurate, consistent and complete.

22 citations


Journal ArticleDOI
TL;DR: This paper presents a method to construct such a resume and illustrates the framework with current Semantic Web technologies, such as RDF and SPARQL for representing and querying semantic metadata, and shows the benefits of indexing and retrieving multimedia contents without centralizing multimedia contents or their associated metadata.
Abstract: Currently, many multimedia contents are acquired and stored in real time and at different locations. In order to retrieve the desired information efficiently and to avoid centralizing all metadata, we propose to compute a centralized metadata resume, i.e., a concise version of the whole metadata, which locates the desired multimedia contents on remote servers. The originality of this resume is that it is automatically constructed based on the extracted metadata. In this paper, we present a method to construct such a resume and illustrate our framework with current Semantic Web technologies, such as RDF and SPARQL for representing and querying semantic metadata. Some experimental results are provided in order to show the benefits of indexing and retrieving multimedia contents without centralizing the multimedia contents or their associated metadata, and to prove the efficiency of a metadata resume.
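
A minimal sketch of how a metadata resume could be represented and queried with RDF and SPARQL using rdflib. The vocabulary (ex:topic, ex:storedAt) and the example resources are invented for illustration; they are not the schema used in the paper.

    # Minimal sketch of representing and querying a metadata "resume" with RDF
    # and SPARQL via rdflib. The ex: vocabulary is made up for illustration.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    EX = Namespace("http://example.org/media#")
    g = Graph()

    video = URIRef("http://example.org/media/video42")
    g.add((video, DC.title, Literal("Harbor surveillance, camera 3")))
    g.add((video, EX.topic, Literal("harbor")))
    g.add((video, EX.storedAt, URIRef("http://server-lyon.example.org/store")))

    # Ask the resume which remote server holds content about a given topic.
    results = g.query("""
        PREFIX ex: <http://example.org/media#>
        SELECT ?content ?server WHERE {
            ?content ex:topic "harbor" ;
                     ex:storedAt ?server .
        }
    """)
    for content, server in results:
        print(content, "->", server)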

19 citations


27 Dec 2016
TL;DR: The main difficulty in generating LOM documents lies in the educational part of the metadata, so existing metadata extraction methods based on content analysis cannot fully serve LOM generation.
Abstract: Generation of learning object metadata
• Understanding the issues related to the generation of learning object metadata.
• Identifying the opportunities and drawbacks of using automatic techniques for generating learning object metadata.
Validation of learning object metadata
• Understanding the validation of learning object metadata.
• Identifying the opportunities and drawbacks of automatic techniques for validating learning object metadata.
Learning object retrieval
• Understanding the retrieval of learning objects using learning object metadata.
• Identifying the opportunities and drawbacks of automatic query generation for enabling semantically rich retrieval of learning objects.
Use of learning object metadata
• Visualizing the impact of automatic processes on the practical use of learning object metadata.
Executive Summary: Reuse of learning material has recently become a leitmotiv for research on computer-aided education. The most obvious motivation is the economic interest of reusing learning material instead of repeatedly authoring it. Other motivations can be found in the pedagogical area, since learner-centric teaching theories invite instructors to use a wide variety of didactic material. Since sharing and retrieving learning material is a basic requirement for easing learning material reuse, it is not surprising to see the research community especially focusing on these topics. Learning material retrieval should not only rely on common document characteristics, like the Dublin Core Metadata Initiative (DCMI, 2005), but also on specific educational data in order to be pedagogically relevant. The Learning Object Metadata (LOM) standard includes such data. Consequently, Learning Object Repositories typically use this metadata for storage and retrieval of learning objects. However, creating a LOM document means instantiating the almost 60 metadata attributes of the IEEE LTSC LOM specification (LOM, 2005). Such a fastidious task is not compatible with making learning material sharing a customary activity for regular teachers. Therefore, several researchers seriously focus on the metadata generation issue (Downes, 2004; Duval et al., 2004; Simon et al., 2004). The main difficulty in generating LOM documents lies in the educational part of the metadata. Educational information is generally implicit in the learning objects. Therefore, existing metadata extraction methods based on content analysis cannot fully serve LOM generation. The same situation occurs with learning material retrieval. Retrieval effectiveness depends on LOM-based query precision. Consequently, generating effective queries could rapidly become as complex as generating LOM documents. In such a context, the future of learning object repositories will definitely depend on the ability of current systems to facilitate the generation of metadata as well as …

17 citations


Book ChapterDOI
19 Nov 2016
TL;DR: To address a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings, such a format is described, together with open-source Web-based tools that support the acquisition, search, and management of metadata.
Abstract: The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this paper, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.

Journal ArticleDOI
TL;DR: The examples included in the distribution implement profiles of the ISO 19139 standard for geographic information, such as core INSPIRE metadata, as well as the OGC standard for sensor description, SensorML.
Abstract: EDI is a general purpose, template-driven metadata editor for creating XML-based descriptions. Originally aimed at defining rich and standard metadata for geospatial resources, it can be easily customised in order to comply with a broad range of schemata and domains. EDI creates HTML5 [9] metadata forms with advanced assisted editing capabilities and compiles them into XML files. The examples included in the distribution implement profiles of the ISO 19139 standard for geographic information [14], such as core INSPIRE metadata [10], as well as the OGC [8] standard for sensor description, SensorML [11]. Templates (the blueprints for a specific metadata format) drive form behaviour by element data types and provide advanced features like codelists underlying combo boxes or autocompletion functionalities. The editing of virtually any metadata format can be supported by creating a specific template. EDI is stored on GitHub at https://github.com/SP7-Ritmare/EDI-NG_client and https://github.com/SP7-Ritmare/EDI-NG_server .

Proceedings ArticleDOI
Runsha Dong, Fei Su, Shan Yang, Lexi Xu, Xinzhou Cheng, Weiwei Chen
01 Sep 2016
TL;DR: In this paper, metadata and its management system are illustrated from several aspects, and a framework and an implementation of a metadata management system are proposed to give guidance to enterprises.
Abstract: With the development of information technology, managers of enterprises pay attention to how to make decisions with the large scale of data stored in data warehouses. The data warehouse is an important component in the Information Supply Chain. For an efficient information flow running through the Information Supply Chain, metadata and its management system guarantee the high quality of data sharing. In this paper, metadata and its management system are illustrated from several aspects. A framework and an implementation of a metadata management system are proposed to give guidance to enterprises. Finally, the prospects for the application of metadata management are discussed.

Journal ArticleDOI
TL;DR: A proof of concept based on an interoperable workflow between a data publication server and a metadata catalog to automatically generate ISO-compliant metadata is presented, which facilitates metadata creation by embedding this task in daily data management workflows and significantly reduces the obstacles of metadata production.

Proceedings ArticleDOI
05 Dec 2016
TL;DR: It is shown that efficient management of hot metadata improves the performance of SWfMS, reducing the workflow execution time by up to 50% for highly parallel jobs and avoiding unnecessary cold metadata operations.
Abstract: Large-scale scientific applications are often expressed as workflows that help define data dependencies between their different components. Several such workflows have huge storage and computation requirements, and so they need to be processed in multiple (cloud-federated) datacenters. It has been shown that efficient metadata handling plays a key role in the performance of computing systems. However, most of this evidence concerns only single-site, HPC systems to date. In this paper, we present a hybrid decentralized/distributed model for handling hot metadata (frequently accessed metadata) in multisite architectures. We couple our model with a scientific workflow management system (SWfMS) to validate and tune its applicability to different real-life scientific scenarios. We show that efficient management of hot metadata improves the performance of SWfMS, reducing the workflow execution time by up to 50% for highly parallel jobs and avoiding unnecessary cold metadata operations.

Proceedings ArticleDOI
24 Mar 2016
TL;DR: An approach for tag reconciliation in Open Data Portals is developed and implemented, encompassing local actions related to individual portals, and global actions for adding a semantic metadata layer above individual portals.
Abstract: This paper presents an approach for metadata reconciliation, curation and linking for Open Governmental Data Portals (ODPs). ODPs have lately been the standard solution for governments willing to make their public data available to society. Portal managers use several types of metadata to organize the datasets, one of the most important ones being the tags. However, the tagging process is subject to many problems, such as synonyms, ambiguity or incoherence, among others. As our empirical analysis of ODPs shows, these issues are currently prevalent in most ODPs and effectively hinder the reuse of Open Data. In order to address these problems, we develop and implement an approach for tag reconciliation in Open Data Portals, encompassing local actions related to individual portals, and global actions for adding a semantic metadata layer above individual portals. The local part aims to enhance the quality of tags in a single portal, and the global part is meant to interlink ODPs by establishing relations between tags.
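
A toy sketch of the kind of local tag-cleaning step described above: normalise portal tags and fold near-duplicates onto a canonical form using the standard-library difflib. This is only an illustration; the paper's reconciliation approach also covers synonyms, incoherence, and the global semantic layer.

    # Toy sketch of the "local" tag-cleaning step: normalise portal tags and
    # fold near-duplicates onto a canonical form. Illustrative only.
    from difflib import get_close_matches

    raw_tags = ["Environment", "environment", "enviroment", "transport", "Transports"]

    canonical: list[str] = []
    mapping: dict[str, str] = {}

    for tag in raw_tags:
        cleaned = tag.strip().lower()
        match = get_close_matches(cleaned, canonical, n=1, cutoff=0.85)
        if match:
            mapping[tag] = match[0]        # fold onto an existing canonical tag
        else:
            canonical.append(cleaned)
            mapping[tag] = cleaned

    print(mapping)
    # {'Environment': 'environment', 'environment': 'environment',
    #  'enviroment': 'environment', 'transport': 'transport', 'Transports': 'transport'}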

Journal ArticleDOI
TL;DR: A sample of U.S. repository administrators from the OpenDOAR initiative was surveyed to understand aspects of the quality and creation of their metadata, and how their metadata could improve.
Abstract: Digital repositories require good metadata, created according to community-based principles that include provisions for interoperability. When metadata is of high quality, digital objects become sharable and metadata can be harvested and reused outside of the local system. A sample of U.S.-based repository administrators from the OpenDOAR initiative was surveyed to understand aspects of the quality and creation of their metadata, and how their metadata could improve. Most respondents (65%) thought their metadata was of average quality; none thought their metadata was of high or poor quality. The discussion argues that increased strategic staffing will alleviate many perceived issues with metadata quality.

Journal ArticleDOI
TL;DR: Findings from interviews with gravitational wave researchers, designed to gather user requirements for developing a metadata model, show that the metadata needed tends to differ from that currently available in metadata standards.
Abstract: The complexity of computationally-intensive scientific research poses great challenges for both research data management and research reproducibility. What metadata needs to be captured for tracking, reproducing, and reusing computational results is the starting point in developing metadata models to fulfil these functions of data management. This paper reports the findings from interviews with gravitational wave (GW) researchers, which were designed to gather user requirements to develop a metadata model. Motivations for keeping documentation of data and analysis results include trust, accountability and continuity of work. Research reproducibility relies on metadata that represents code dependencies and versions and has good documentation for verification. Metadata specific to GW data, workflows and outputs tend to differ from those currently available in metadata standards. The paper also discusses the challenges in representing code dependencies and workflows.

Journal ArticleDOI
TL;DR: The results support the ability of the proposed methodology to assess the impact of spatial inconsistency on the retrievability and visibility of metadata records and to improve their spatial consistency.
Abstract: Consistency is an essential aspect of the quality of metadata. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and may even not contain some resources considered obvious. This is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems and information-retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach was able to automatically identify inconsistent metadata records (870 out of 10,575) and propose fixes to most of them (91.5%). These results support the ability of the proposed methodology to assess the impact of spatial inconsistency in the retrievability and visibility of metadata records and improve their spatial consistency.
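
A simplified illustration of one spatial-consistency check: flag a record whose coordinates fall outside the bounding box of the place named in its metadata. The gazetteer below is a hard-coded stand-in with approximate boxes, not the reverse geocoding and knowledge organization systems combined in the paper.

    # Simplified spatial-consistency check: does the record's coordinate fall
    # inside the bounding box of the place its metadata names?
    GAZETTEER = {  # place -> (min_lon, min_lat, max_lon, max_lat), approximate
        "Spain": (-9.4, 35.9, 3.4, 43.8),
        "France": (-5.2, 41.3, 9.6, 51.1),
    }

    def is_consistent(place: str, lon: float, lat: float) -> bool:
        min_lon, min_lat, max_lon, max_lat = GAZETTEER[place]
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    record = {"title": "Map of Madrid", "place": "Spain", "lon": -3.7, "lat": 40.4}
    bad    = {"title": "Map of Madrid", "place": "France", "lon": -3.7, "lat": 40.4}

    print(is_consistent(record["place"], record["lon"], record["lat"]))  # True
    print(is_consistent(bad["place"], bad["lon"], bad["lat"]))           # False -> candidate fix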

Patent
08 Jan 2016
TL;DR: In this paper, a computing device determines a data manipulation from a job specification, and determines a corresponding data processing instruction using data-source metadata, and decides and executes a corresponding query.
Abstract: In some examples, a computing device determines a data manipulation from a job specification. The device determines a corresponding data-processing instruction using data-source metadata, and determines and executes a corresponding query. In some examples, a device receives search keys. The device searches data-source metadata using the search keys. The device weights a first data source based on producer-consumer relationships between data sources, and ranks the first data source using the weight. In some examples, a device determines structural and content information of a data record. The device determines a data-source identifier from the structural information and stores the content information with the data-source identifier in a database. In some examples, via a user interface, a device receives a job specification and annotation data. The device stores the spec and the annotation data in a metadata repository.

Journal ArticleDOI
TL;DR: This work presents the essentials on customisation of the editor by means of two use cases and demonstrates the novel capabilities enabled by RDF-based metadata representation with respect to traditional metadata management in the geospatial domain.
Abstract: Metadata management is an essential enabling factor for geospatial assets because discovery, retrieval, and actual usage of the latter are tightly bound to the quality of these descriptions. Unfortunately, the multi-faceted landscape of metadata formats, requirements, and conventions makes it difficult to identify editing tools that can be easily tailored to the specificities of a given project, workgroup, and Community of Practice. Our solution is a template-driven metadata editing tool that can be customised to any XML-based schema. Its output is constituted by standards-compliant metadata records that also have a semantics-aware counterpart eliciting novel exploitation techniques. Moreover, external data sources can easily be plugged in to provide autocompletion functionalities on the basis of the data structures made available on the Web of Data. Beside presenting the essentials on customisation of the editor by means of two use cases, we extend the methodology to the whole life cycle of geospatial metadata. We demonstrate the novel capabilities enabled by RDF-based metadata representation with respect to traditional metadata management in the geospatial domain.

Proceedings ArticleDOI
14 Mar 2016
TL;DR: A collection of a large number of user responses regarding identification of spreadsheet metadata from participants of a MOOC is described, to understand how users identify metadata in spreadsheets, and to evaluate two existing approaches of automatic metadata extraction from spreadsheets.
Abstract: Spreadsheets are popular end-user computing applications and one reason behind their popularity is that they offer a large degree of freedom to their users regarding the way they can structure their data. However, this flexibility also makes spreadsheets difficult to understand. Textual documentation can address this issue, yet for supporting automatic generation of textual documentation, an important pre-requisite is to extract metadata inside spreadsheets. It is a challenge though, to distinguish between data and metadata due to the lack of universally accepted structural patterns in spreadsheets. Two existing approaches for automatic extraction of spreadsheet metadata were not evaluated on large datasets consisting of user inputs. Hence in this paper, we describe the collection of a large number of user responses regarding identification of spreadsheet metadata from participants of a MOOC. We describe the use of this large dataset to understand how users identify metadata in spreadsheets, and to evaluate two existing approaches of automatic metadata extraction from spreadsheets. The results provide us with directions to follow in order to improve metadata extraction approaches, obtained from insights about user perception of metadata. We also understand on what type of spreadsheet patterns the existing approaches perform well and on what type poorly, and thus which problem areas to focus on in order to improve.
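
A toy heuristic showing the kind of structural cue involved in separating metadata (header-like rows) from data in a spreadsheet: leading rows without numeric cells are treated as metadata. This is an illustration only, not one of the two extraction approaches evaluated in the paper.

    # Toy heuristic for separating metadata (header-like rows) from data rows.
    def looks_numeric(cell) -> bool:
        try:
            float(cell)
            return True
        except (TypeError, ValueError):
            return False

    def split_metadata_rows(rows):
        """Leading rows with no numeric cells are treated as metadata."""
        header, data = [], []
        still_header = True
        for row in rows:
            if still_header and not any(looks_numeric(c) for c in row):
                header.append(row)
            else:
                still_header = False
                data.append(row)
        return header, data

    sheet = [
        ["Quarterly sales report", "", ""],
        ["Region", "Units", "Revenue"],
        ["North", 120, 8400.0],
        ["South", 95, 6650.0],
    ]
    metadata_rows, data_rows = split_metadata_rows(sheet)
    print(metadata_rows)   # the two header-like rows
    print(data_rows)       # the numeric records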

Journal ArticleDOI
Le Yang
TL;DR: Findings of the study indicate that three specific metadata elements are effective in enhancing discoverability of digital collections through Internet search engines, including Dublin Core metadata elements Title, Description, and Subject.
Abstract: This study analyzed digital item metadata and keywords from Internet search engines to learn what metadata elements actually facilitate discovery of digital collections through Internet keyword searching and how significantly each metadata element affects the discovery of items in a digital repository. The study found that keywords from Internet search engines matched values in eight metadata elements and resulted in landing visits to the digital repository. Findings of the study indicate that three specific metadata elements are effective in enhancing discoverability of digital collections through Internet search engines, including Dublin Core metadata elements Title, Description, and Subject.

Proceedings ArticleDOI
01 Jun 2016
TL;DR: Replichard provides metadata services through a cluster of metadata servers, in which a flexible consistency scheme is adopted: strict consistency for non-idempotent operations with dynamic write-lock sharding, and relaxed consistency, with accuracy estimations of return values, for idempotent requests in order to achieve high throughput.
Abstract: Metadata scalability is critical for distributed systems as the storage scale is growing rapidly. Because of the strict consistency requirement of metadata, many existing metadata services utilize a fundamentally unscalable design for the sake of easy management, while others provide improved scalability but lead to unacceptable latency and management complexity. Without delivering scalable performance, metadata will be the bottleneck of the entire system. Based on the observation that real file dependencies are few, and there are usually more idempotent than non-idempotent operations, we propose a practical strategy, Replichard, allowing a tradeoff between metadata consistency and scalable performance. Replichard provides metadata services through a cluster of metadata servers, in which a flexible consistency scheme is adopted: strict consistency for non-idempotent operations with dynamic write-lock sharding, and relaxed consistency with accuracy estimations of return values where consistency for idempotent requests is relaxed to achieve high throughput. Write-locks are dynamically created at subtree-level and designated to independent metadata servers in an application-oriented manner. A subtree metadata update that occurs on a particular server is replicated to all metadata servers conforming to the application "start-end" semantics, resulting in an eventually consistent namespace. An asynchronous notification mechanism is also devised to enable users to deal with potential stale reads from operations of relaxed consistency. A prototype was implemented based on HDFS, and the experimental results show promising scalability and performance for both micro benchmarks and various real-world applications written in Pig, Hive and MapReduce.
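
A toy sketch of the routing idea: idempotent operations may go to any metadata server under relaxed consistency, while non-idempotent operations must go to the server holding the write lock for the enclosing subtree. The server names and lock table are invented, and the sketch omits lock lifetimes, replication, and the notification mechanism of the actual prototype.

    # Toy sketch of consistency-aware request routing in the spirit of
    # Replichard (illustrative only, not the HDFS-based prototype).
    IDEMPOTENT = {"stat", "list", "read"}
    NON_IDEMPOTENT = {"create", "delete", "rename"}

    subtree_lock_owner = {"/proj/app1": "mds-2", "/proj/app2": "mds-3"}
    all_servers = ["mds-1", "mds-2", "mds-3"]

    def route(op: str, path: str, client_hint: int = 0) -> str:
        if op in IDEMPOTENT:
            # Relaxed consistency: any replica may answer (possibly stale).
            return all_servers[client_hint % len(all_servers)]
        for subtree, owner in subtree_lock_owner.items():
            # Strict consistency: go to the write-lock owner of the subtree.
            if path.startswith(subtree):
                return owner
        raise RuntimeError("no write-lock shard covers " + path)

    print(route("list", "/proj/app1/logs", client_hint=4))   # any server, e.g. mds-2
    print(route("create", "/proj/app1/logs/run7"))           # must go to mds-2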

Proceedings ArticleDOI
01 Sep 2016
TL;DR: This study first identifies the challenges presented by the underlying infrastructure in supporting scalable, high-performance rich metadata management, and presents GraphMeta, a graph-based engine designed for managing large-scale rich metadata.
Abstract: High-performance computing (HPC) systems face increasingly critical metadata management challenges, especially in the approaching exascale era. These challenges arise not only from exploding metadata volumes but also from increasingly diverse metadata, which contains data provenance and user-defined attributes in addition to traditional POSIX metadata. This "rich" metadata is critical to support many advanced data management functions such as data auditing and validation. In our prior work, we presented a graph-based model that could be a promising solution to uniformly manage such rich metadata because of its flexibility and generality. At the same time, however, graph-based rich metadata management introduces significant challenges. In this study, we first identify the challenges presented by the underlying infrastructure in supporting scalable, high-performance rich metadata management. To tackle these challenges, we then present GraphMeta, a graph-based engine designed for managing large-scale rich metadata. We also utilize a series of optimizations designed for rich metadata graphs. We evaluate GraphMeta with both synthetic and real HPC metadata workloads and compare it with other approaches. The results show its advantages in terms of rich metadata management in HPC systems, including better performance and scalability compared with existing solutions.
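
A small sketch of rich metadata held as a property graph (files, users, jobs, plus provenance and user-defined relations), built here with networkx purely for illustration; it is not GraphMeta's engine or schema.

    # Rich metadata as a property graph: POSIX attributes on nodes, provenance
    # and user-defined relations on edges (illustration, not GraphMeta itself).
    import networkx as nx

    g = nx.MultiDiGraph()

    # Nodes: files, users, and jobs, each carrying an attribute dictionary.
    g.add_node("file:/data/run7.h5", kind="file", size=2_147_483_648, mode=0o640)
    g.add_node("user:alice", kind="user")
    g.add_node("job:sim-1843", kind="job", walltime_s=5400)

    # Edges: traditional ownership plus provenance relations.
    g.add_edge("user:alice", "job:sim-1843", relation="submitted")
    g.add_edge("job:sim-1843", "file:/data/run7.h5", relation="generated")
    g.add_edge("user:alice", "file:/data/run7.h5", relation="owns")

    # Traverse the graph, e.g. to reconstruct how a file came to exist.
    for src, dst, attrs in g.edges(data=True):
        print(src, "-[", attrs["relation"], "]->", dst)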

Proceedings ArticleDOI
01 Dec 2016
TL;DR: To handle heterogeneous metadata models and standards, the MetaStore framework automatically generates the necessary software code (services) and extends its own functionality, and it allows full-text search over metadata through automated creation of indexes.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data and metadata. Metadata is critical for scientific research, as it enables discovering, analysing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments is heterogeneous and subject to frequent changes, demanding a flexible data model. Currently, there does not exist an adaptive and generic solution that is capable of handling heterogeneous metadata models. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database. To handle heterogeneous metadata models and standards, MetaStore automatically generates the necessary software code (services) and extends the functionality of the framework. To leverage the functionality of NoSQL databases, the MetaStore framework allows full-text search over metadata through automated creation of indexes. Finally, a dedicated REST service is provided for efficient harvesting (sharing) of metadata using the METS metadata standard over the OAI-PMH protocol.

Journal ArticleDOI
01 Apr 2016
TL;DR: This work describes the components and workflow of the framework for computer-aided management of sensor metadata, and provides a user-friendly, template-driven metadata authoring tool composed of a backend web service and an HTML5/javascript client.
Abstract: The need for continuous, accurate, and comprehensive environmental knowledge has led to an increase in sensor observation systems and networks. The Sensor Web Enablement (SWE) initiative has been promoted by the Open Geospatial Consortium (OGC) to foster interoperability among sensor systems. The provision of metadata according to the prescribed SensorML schema is a key component for achieving this; nevertheless, the availability of correct and exhaustive metadata cannot be taken for granted. On the one hand, it is awkward for users to provide sensor metadata because of the lack of user-oriented, dedicated tools. On the other, the specification of invariant information for a given sensor category or model (e.g., observed properties and units of measurement, manufacturer information, etc.) can be labor- and time-consuming. Moreover, the provision of these details is error prone and subjective, i.e., it may differ greatly across distinct descriptions for the same system. We provide a user-friendly, template-driven metadata authoring tool composed of a backend web service and an HTML5/javascript client. This results in a form-based user interface that conceals the high complexity of the underlying format. This tool also allows for plugging in external data sources providing authoritative definitions for the aforementioned invariant information. Leveraging these functionalities, we compiled a set of SensorML profiles, that is, sensor metadata blueprints allowing end users to focus only on the metadata items that are related to their specific deployment. The natural extension of this scenario is the involvement of end users and sensor manufacturers in the crowd-sourced evolution of this collection of prototypes. We describe the components and workflow of our framework for computer-aided management of sensor metadata.

Journal ArticleDOI
01 Apr 2016
TL;DR: The workflow for metadata management envisages editing via customizable web-based forms, encoding of records in any XML application profile, translation into RDF (involving the semantic lift of metadata records), and storage of the metadata as RDF and back-translation into the original XML format with added semantics-aware features.
Abstract: In the geospatial realm, data annotation and discovery rely on a number of ad-hoc formats and protocols. These have been created to enable domain-specific use cases for which generalized search is not feasible. Metadata are at the heart of the discovery process, and nevertheless they are often neglected or encoded in formats that either are not aimed at efficient retrieval of resources or are plainly outdated. In particular, the quantum leap represented by the Linked Open Data (LOD) movement has not so far induced a consistent, interlinked baseline in the geospatial domain. In a nutshell, datasets, scientific literature related to them, and ultimately the researchers behind these products are only loosely connected; the corresponding metadata are intelligible only to humans and duplicated on different systems, seldom consistently. Instead, our workflow for metadata management envisages i) editing via customizable web-based forms, ii) encoding of records in any XML application profile, iii) translation into RDF (involving the semantic lift of metadata records), and finally iv) storage of the metadata as RDF and back-translation into the original XML format with added semantics-aware features. Phase iii) hinges on relating resource metadata to RDF data structures that represent keywords from code lists and controlled vocabularies, toponyms, researchers, institutes, and virtually any description one can retrieve (or directly publish) in the LOD Cloud. In the context of a distributed Spatial Data Infrastructure (SDI) built on free and open-source software, we detail phases iii) and iv) of our workflow for the semantics-aware management of geospatial metadata.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The results show that the approach presented is effective in collecting and storing provenance metadata and allows the query of an entire provenance of datasets and data products, thus enabling reuse, discovery, and visualization of raw data, processes, and scientists involved in its generation and evolution.
Abstract: Long-term research and environmental monitoring are essential for the improved management of ecosystems and natural resources. However, to reuse this data for new experiments and decision-making processes, and to integrate these data with other long-term initiatives, scientists need more information related to data creation and its evolution, intellectual property rights, and technical information in order to evaluate the use of this data. Provenance metadata emerges as a way to evaluate the quality and reliability of data, audit processes and data versioning, while enabling data reuse and the reproducibility of experiments and analysis. However, most solutions for the capture and management of provenance metadata are based on specific tools, have restricted scopes, and are difficult to apply in distributed and heterogeneous environments. In this paper, we present an approach for capturing, managing, and publishing the provenance metadata generated in environmental monitoring processes. Our computational architecture comprises three main components: (1) a data model based on PROV-DM and Dublin Core, (2) a repository of RDF graphs, and (3) a Web API that provides services for collecting, storing, and querying provenance metadata. We demonstrate the application of our approach and show its practical usefulness by evaluating this architecture to manage provenance metadata generated during an environmental monitoring simulation. The results show that our approach is effective in collecting and storing provenance metadata and allows the query of an entire provenance of datasets and data products, thus enabling reuse, discovery, and visualization of raw data, processes, and scientists involved in its generation and evolution.
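
A minimal sketch of recording provenance as RDF in the spirit of the architecture above, using rdflib's PROV and Dublin Core terms (assuming a recent rdflib that ships these namespaces). The IRIs and the particular properties chosen are illustrative, not the paper's data model.

    # Minimal sketch: provenance of a derived dataset expressed with PROV and
    # Dublin Core terms via rdflib (example IRIs are invented).
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, PROV, RDF

    EX = Namespace("http://example.org/monitoring/")
    g = Graph()

    raw = EX["sensor-readings-2016-06"]
    derived = EX["monthly-aggregate-2016-06"]
    aggregation = EX["aggregation-run-81"]

    g.add((raw, RDF.type, PROV.Entity))
    g.add((derived, RDF.type, PROV.Entity))
    g.add((aggregation, RDF.type, PROV.Activity))
    g.add((derived, PROV.wasGeneratedBy, aggregation))
    g.add((aggregation, PROV.used, raw))
    g.add((derived, DCTERMS.creator, Literal("Monitoring Station 12")))

    print(g.serialize(format="turtle"))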

Proceedings ArticleDOI
07 Nov 2016
TL;DR: This work explores XSLT to provide a solution for this transition of descriptions to educational metadata, converting OBAA metadata in XML to OWL format, which adds more semantics and improves reasoners' inference capabilities.
Abstract: Metadata is broadly used to describe learning objects, and its standards have been developed to improve interoperability. With the current extension of the Web by the Semantic Web, learning object metadata has been migrating from XML-based metadata standards to semantic metadata standards. This work explores XSLT to provide a solution for this transition of descriptions to educational metadata. The approach consists of retrieving, converting, and storing metadata in a Semantic Web context. In this work, we converted OBAA metadata in XML to OWL format. The OWL format aims to add more semantics and improve reasoners' inference capabilities. The converted metadata can be managed by different applications and systems. In addition, we have been working on OWL-to-OWL mappings through aligned ontologies.
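
A minimal sketch of an XML-to-OWL/RDF conversion driven by XSLT, applied here with lxml. The input element names, the target ontology IRI, and the stylesheet are invented for illustration; they are not the OBAA schema or the mapping used in this work.

    # Minimal sketch of an XML-to-RDF/OWL conversion with XSLT via lxml.
    from lxml import etree

    xslt = etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:ex="http://example.org/lo#">
      <xsl:template match="/learningObject">
        <rdf:RDF>
          <ex:LearningObject rdf:about="http://example.org/lo/{@id}">
            <ex:title><xsl:value-of select="title"/></ex:title>
          </ex:LearningObject>
        </rdf:RDF>
      </xsl:template>
    </xsl:stylesheet>
    """)

    source = etree.XML(b"""
    <learningObject id="42">
      <title>Introduction to metadata</title>
    </learningObject>
    """)

    transform = etree.XSLT(xslt)          # compile the stylesheet
    result = transform(source)            # apply it to the XML metadata record
    print(etree.tostring(result, pretty_print=True).decode())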