
Showing papers on "Metadata repository" published in 2017


Journal ArticleDOI
01 Jun 2017
TL;DR: In this article, the authors provide information about the logistics of this network, including real-time applications of the collected data as well as information on the quality control protocols, the construction of the station data and metadata repository and the means through which the data are made available to users.
Abstract: During the last 10 years, the Institute for Environmental Research and Sustainable Development of the National Observatory of Athens has developed and operates a network of automated weather stations across Greece. The motivation behind the network development is the monitoring of weather conditions in Greece, with the aim of supporting not only research needs (weather monitoring and analysis, weather forecast skill evaluation) but also the needs of various communities in the production sector (agriculture, construction, leisure and tourism, etc.). By the end of 2016, 335 weather stations were in operation, providing real-time data at 10-min intervals. This paper provides information about the logistics of this network, including real-time applications of the collected data as well as information on the quality control protocols, the construction of the station data and metadata repository and the means through which the data are made available to users.

135 citations


Journal ArticleDOI
06 Nov 2017
TL;DR: A key element of this work is the definition of hierarchical metadata describing state-of-the-art electronic-structure calculations, which was agreed upon by two teams and is presented in this perspective paper.
Abstract: With big-data-driven materials research, the new paradigm of materials science, sharing and wide accessibility of data are becoming crucial aspects. Obviously, a prerequisite for data exchange and big-data analytics is standardization, which means using consistent and unique conventions for, e.g., units, zero base lines, and file formats. There are two main strategies to achieve this goal. One accepts the heterogeneous nature of the community, which comprises scientists from physics, chemistry, bio-physics, and materials science, by complying with the diverse ecosystem of computer codes, and thus develops “converters” for the input and output files of all important codes. These converters then translate the data of each code into a standardized, code-independent format. The other strategy is to provide standardized open libraries that code developers can adopt for shaping their inputs, outputs, and restart files directly into the same code-independent format. In this perspective paper, we present both strategies and argue that they can and should be regarded as complementary, if not synergistic. The presented format and conventions were agreed upon by two teams, the Electronic Structure Library (ESL) of the European Center for Atomic and Molecular Computations (CECAM) and the NOvel MAterials Discovery (NOMAD) Laboratory, a European Centre of Excellence (CoE). A key element of this work is the definition of hierarchical metadata describing state-of-the-art electronic-structure calculations.

91 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: SoMeta is presented, a scalable and decentralized metadata management approach for object-centric storage in HPC systems that provides a flat namespace that is dynamically partitioned, a tagging approach to manage metadata that can be efficiently searched and updated, and a light-weight and fault tolerant management strategy.
Abstract: Scientific data sets, which are growing rapidly in volume, are often accompanied by plentiful metadata, such as information about the associated experiment or simulation. Without effective metadata management, these data sets become difficult to utilize and their value is lost over time. Ideally, metadata should be managed along with its corresponding data by a single storage system, and should be directly accessible and updatable. However, existing storage systems in high-performance computing (HPC) environments, such as the Lustre parallel file system, still use a static metadata structure composed of a fixed, non-extensible set of attributes. The burden of metadata management therefore falls upon end users and requires ad hoc metadata management software to be developed. With the advent of "object-centric" storage systems, there is an opportunity to solve this issue. In this paper, we present SoMeta, a scalable and decentralized metadata management approach for object-centric storage in HPC systems. It provides a flat namespace that is dynamically partitioned, a tagging approach to manage metadata that can be efficiently searched and updated, and a light-weight and fault-tolerant management strategy. In our experiments, SoMeta achieves up to 3.7X speedup over Lustre in performing common metadata operations, and is up to 16X faster than SciDB and MongoDB for advanced metadata operations, such as adding and searching tags. Additionally, in contrast to existing storage systems, SoMeta offers scalable user-space metadata management by allowing users to specify the number of metadata servers depending on their workload.

35 citations
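To make the tagging approach concrete, the following toy sketch (hypothetical class and method names, not SoMeta's actual API) shows how free-form tags attached to objects in a flat namespace can be searched and updated.

```python
# Hypothetical sketch of tag-style object metadata (not SoMeta's actual API).
from collections import defaultdict

class MetadataStore:
    """Flat namespace of objects, each described by free-form key/value tags."""

    def __init__(self):
        self.tags = {}                    # object id -> {tag: value}
        self.index = defaultdict(set)     # (tag, value) -> object ids, for fast search

    def add_tags(self, obj_id, **tags):
        entry = self.tags.setdefault(obj_id, {})
        for tag, value in tags.items():
            entry[tag] = value
            self.index[(tag, value)].add(obj_id)

    def search(self, **tags):
        """Return object ids carrying all of the given tag/value pairs."""
        sets = [self.index.get(item, set()) for item in tags.items()]
        return set.intersection(*sets) if sets else set()

store = MetadataStore()
store.add_tags("run_042/output.h5", experiment="plasma", resolution="10km")
store.add_tags("run_043/output.h5", experiment="plasma", resolution="5km")
print(store.search(experiment="plasma", resolution="5km"))  # {'run_043/output.h5'}
```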


Book ChapterDOI
21 Oct 2017
TL;DR: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The CEDAR Workbench is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, fill in templates to generate high-quality metadata, and share and manage these resources.
Abstract: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed—the CEDAR Workbench—is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.

30 citations
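As an illustration of what ontology-enriched, JSON-LD-style metadata can look like, here is a minimal hypothetical record; the property IRI and the overall shape are invented for this sketch and are far simpler than CEDAR's actual template model.

```python
# A minimal, hypothetical JSON-LD metadata record annotated with an ontology term.
# Field names and the property IRI are illustrative only.
import json

record = {
    "@context": {
        "tissue": "http://example.org/schema/tissue",   # hypothetical property IRI
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    "tissue": {
        "@id": "http://purl.obolibrary.org/obo/UBERON_0002107",  # UBERON term for liver
        "rdfs:label": "liver",
    },
}

print(json.dumps(record, indent=2))  # ready to submit to a repository's REST API
```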


21 Aug 2017
TL;DR: This paper presents the WASABI project, started in 2017, which aims at the construction of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from the audio analysis.
Abstract: This paper presents the WASABI project, started in 2017, which aims at (1) the construction of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from the audio analysis, and (2) the development of semantic applications with high added value to exploit this semantic database. A preliminary version of the WASABI database is already online and will be enriched throughout the project. The main originality of this project is the collaboration between the algorithms that extract semantic metadata from the Web and from song lyrics and the algorithms that work on the audio. The following WebAudio-enhanced applications are planned as companions for the WASABI database and will be associated with each song: an online mixing table, guitar amp simulations with a virtual pedal-board, audio analysis visualization tools, annotation tools, and a similarity search tool that works by uploading audio extracts or playing a melody on a MIDI device.

24 citations


Proceedings ArticleDOI
01 Jun 2017
TL;DR: A new project-based multi-tenancy model for Hadoop is presented that provides a distributed database backend for the HDFS metadata layer and extends Hadoop's metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience.
Abstract: Hadoop is a popular system for storing, managing, and processing large volumes of data, but it has bare-bones internal support for metadata, as metadata is a bottleneck and less means more scalability. The result is a scalable platform with rudimentary access control that is neither user- nor developer-friendly. Also, metadata services that are built on Hadoop, such as SQL-on-Hadoop, access control, data provenance, and data governance are necessarily implemented as eventually consistent services, resulting in increased development effort and more brittle software. In this paper, we present a new project-based multi-tenancy model for Hadoop, built on a new distribution of Hadoop that provides a distributed database backend for the Hadoop Distributed Filesystem's (HDFS) metadata layer. We extend Hadoop's metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience. As our metadata service is backed by a transactional database, developers can easily extend metadata by adding new tables and ensure the strong consistency of extended metadata using both transactions and foreign keys.

24 citations
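The idea of extending metadata with new tables whose consistency is guaranteed by transactions and foreign keys can be pictured with a toy example. The sketch below uses SQLite purely for illustration (HopsFS relies on a distributed database), and the dataset_tags table is hypothetical.

```python
# Toy illustration (SQLite, not HopsFS's distributed database) of extending file
# metadata with a new table kept consistent via foreign keys and transactions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE dataset_tags (                 -- hypothetical extended metadata
        inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,
        tag      TEXT NOT NULL
    );
""")

with conn:  # one transaction: the file and its extended metadata commit together
    cur = conn.execute("INSERT INTO inodes (path) VALUES (?)",
                       ("/projects/demo/data.csv",))
    conn.execute("INSERT INTO dataset_tags (inode_id, tag) VALUES (?, ?)",
                 (cur.lastrowid, "genomics"))

# Deleting the inode removes its extended metadata too, so it never dangles.
conn.execute("DELETE FROM inodes WHERE path = ?", ("/projects/demo/data.csv",))
print(conn.execute("SELECT COUNT(*) FROM dataset_tags").fetchone()[0])  # 0
```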


Journal ArticleDOI
TL;DR: The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data inspection and cleaning, and that duplicates and errors among keys in the same cluster can easily be found.
Abstract: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide submitters regarding the metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. As a result, the algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Most keys in the 18 clusters were correctly assigned, but 13 keys were not related to the cluster they were placed in. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-score (0.63). Our algorithm identified keys that were similar to each other and grouped them together. The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data inspection and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.

22 citations
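A rough sketch of the clustering idea is shown below. It uses only a simple token-overlap name similarity and a greedy single-linkage threshold, whereas the paper combines name, core-concept and value similarities inside a scalable agglomerative algorithm.

```python
# Rough sketch: grouping metadata keys with a token-overlap name similarity and a
# greedy single-linkage threshold. The paper additionally uses core-concept and
# value similarities in a scalable agglomerative clustering algorithm.
def name_similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_keys(keys, threshold=0.5):
    clusters = []
    for key in keys:
        for cluster in clusters:
            if any(name_similarity(key, member) >= threshold for member in cluster):
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters

keys = ["age", "age (years)", "patient age", "tissue", "tissue type", "cell line"]
print(cluster_keys(keys))
# [['age', 'age (years)', 'patient age'], ['tissue', 'tissue type'], ['cell line']]
```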


Book
17 Jan 2017
TL;DR: This book presents a step-by-step guide to establishing and running a data curation service, covering receiving, appraising, processing, ingesting, describing, providing access to, preserving, and supporting the reuse of research data.
Abstract: Table of Contents:
Acknowledgments; Foreword
Preliminary Step 0: Establish Your Data Curation Service
Step 1.0: Receive the Data (1.1 Recruit Data for Your Curation Service; 1.2 Negotiate Deposit; 1.3 Transfer Rights (Deposit Agreements); 1.4 Facilitate Data Transfer; 1.5 Obtain Available Metadata and Documentation; 1.6 Receive Notification of Data Arrival)
Step 2.0: Appraisal and Selection Techniques that Mitigate Risks Inherent to Data (2.1 Appraisal; 2.2 Risk Factors for Data Repositories; 2.3 Inventory; 2.4 Selection; 2.5 Assign)
Step 3.0: Processing and Treatment Actions for Data (3.1 Secure the Files; 3.2 Create a Log of Actions Taken; 3.3 Inspect the File Names and Structure; 3.4 Open the Data Files; 3.5 Attempt to Understand and Use the Data; 3.6 Work with Author to Enhance the Submission; 3.7 Consider the File Formats; 3.8 File Arrangement and Description)
Step 4.0: Ingest and Store Data in Your Repository (4.1 Ingest the Files; 4.2 Store the Assets Securely; 4.3 Develop Trust in Your Digital Repository)
Step 5.0: Descriptive Metadata (5.1 Create and Apply Appropriate Metadata; 5.2 Consider Disciplinary Metadata Standards for Data)
Step 6.0: Access (6.1 Determine Appropriate Levels of Access; 6.2 Apply the Terms of Use and Any Relevant Licenses; 6.3 Contextualize the Data; 6.4 Increase Exposure and Discovery; 6.5 Apply Any Necessary Access Controls; 6.6 Ensure Persistent Access and Encourage Appropriate Citation; 6.7 Release Data for Access and Notify Author)
Step 7.0: Preservation of Data for the Long Term (7.1 Preservation Planning for Long-Term Reuse; 7.2 Monitor Preservation Needs and Take Action)
Step 8.0: Reuse (8.1 Monitor Data Reuse; 8.2 Collect Feedback about Data Reuse and Quality Issues; 8.3 Provide Ongoing Support as Long as Necessary; 8.4 Cease Data Curation)
Brief Concluding Remarks and a Call to Action
Bibliography; Biographies

20 citations


Proceedings Article
01 Jan 2017
TL;DR: A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata.
Abstract: In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.

19 citations
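The core idea of value recommendation from previously entered metadata can be sketched very simply: rank candidate values for a target field by how often they co-occur with the field/value pairs already filled in. The snippet below is a toy illustration of that idea, not the paper's actual recommendation framework.

```python
# Toy sketch of value recommendation from previously entered metadata records.
# Candidate values for a target field are ranked by how often they co-occur with
# the fields the user has already filled in. Not the paper's actual algorithm.
from collections import Counter

past_records = [
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "female"},
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "male"},
    {"organism": "Mus musculus", "tissue": "brain", "sex": "male"},
]

def recommend(target_field, filled, records, top_n=3):
    counts = Counter(
        r[target_field]
        for r in records
        if target_field in r and all(r.get(k) == v for k, v in filled.items())
    )
    return counts.most_common(top_n)

print(recommend("tissue", {"organism": "Homo sapiens"}, past_records))
# [('liver', 2)]
```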


Journal ArticleDOI
TL;DR: The design and use of a metadata-driven data repository for research data management is described, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.
Abstract: The design and use of a metadata-driven data repository for research data management is described. Metadata is collected automatically during the submission process whenever possible and is registered with DataCite in accordance with their current metadata schema, in exchange for a persistent digital object identifier. Two examples of data preview are illustrated, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.

15 citations
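To give a sense of what can be assembled automatically at submission time, here is an approximate sketch of a DataCite-style metadata record. The field layout loosely follows the DataCite metadata kernel (creators, titles, publisher, publication year, resource type), and the publisher name is hypothetical; the current schema documentation should be consulted before registering real DOIs.

```python
# Approximate sketch of a DataCite-style metadata record assembled at submission
# time. The field layout loosely follows the DataCite kernel; check the current
# schema before registering a DOI for real.
import json

def build_datacite_metadata(submission):
    return {
        "creators": [{"name": n} for n in submission["authors"]],
        "titles": [{"title": submission["title"]}],
        "publisher": "Example University Research Data Repository",  # hypothetical
        "publicationYear": submission["year"],
        "types": {"resourceTypeGeneral": "Dataset"},
    }

submission = {"authors": ["Doe, Jane"],
              "title": "Raman spectra of test samples",
              "year": 2017}
print(json.dumps(build_datacite_metadata(submission), indent=2))
```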


Journal ArticleDOI
TL;DR: A new organization of NeuroMorpho.Org metadata grounded on a set of interconnected hierarchies focusing on the main dimensions of animal species, anatomical regions, and cell types is presented, explicitly resolving all ambiguities caused by synonymy and homonymy.
Abstract: Neuronal morphology is extremely diverse across and within animal species, developmental stages, brain regions, and cell types. This diversity is functionally important because neuronal structure strongly affects synaptic integration, spiking dynamics, and network connectivity. Digital reconstructions of axonal and dendritic arbors are thus essential to quantify and model information processing in the nervous system. NeuroMorpho.Org is an established repository containing tens of thousands of digitally reconstructed neurons shared by several hundred laboratories worldwide. Each neuron is annotated with specific metadata based on the published references and additional details provided by data owners. The number of represented metadata concepts has grown over the years in parallel with the increase of available data. Until now, however, the lack of standardized terminologies and of an adequately structured metadata schema limited the effectiveness of user searches. Here we present a new organization of NeuroMorpho.Org metadata grounded on a set of interconnected hierarchies focusing on the main dimensions of animal species, anatomical regions, and cell types. We have comprehensively mapped each metadata term in NeuroMorpho.Org to this formal ontology, explicitly resolving all ambiguities caused by synonymy and homonymy. Leveraging this consistent framework, we introduce OntoSearch, a powerful functionality that seamlessly enables retrieval of morphological data based on expert knowledge and logical inferences through an intuitive string-based user interface with auto-complete capability. In addition to returning the data directly matching the search criteria, OntoSearch also identifies a pool of possible hits by taking into consideration incomplete metadata annotation.
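The kind of logical inference OntoSearch performs can be pictured with a tiny hypothetical hierarchy: a query for a broad brain region also returns neurons annotated with its sub-regions, and synonyms resolve to a single canonical term. The terms, relations, and neuron IDs below are illustrative only.

```python
# Tiny hypothetical hierarchy illustrating ontology-aware search: a query for a
# broad region also returns neurons annotated with its sub-regions, and synonyms
# resolve to one canonical term. Terms and relations are illustrative only.
CHILDREN = {"hippocampus": ["CA1", "CA3", "dentate gyrus"],
            "CA1": [], "CA3": [], "dentate gyrus": []}
SYNONYMS = {"ammon's horn field ca1": "CA1"}
ANNOTATIONS = {"neuron_001": "CA1", "neuron_002": "dentate gyrus", "neuron_003": "CA3"}

def canonical(term):
    return SYNONYMS.get(term.lower(), term)

def descendants(term):
    found, stack = set(), [canonical(term)]
    while stack:
        t = stack.pop()
        found.add(t)
        stack.extend(CHILDREN.get(t, []))
    return found

def onto_search(term):
    wanted = descendants(term)
    return [nid for nid, region in ANNOTATIONS.items() if region in wanted]

print(onto_search("hippocampus"))              # all three neurons
print(onto_search("ammon's horn field ca1"))   # ['neuron_001'] via synonym resolution
```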

Journal ArticleDOI
TL;DR: This work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms, which has implications for both prospective and retrospective augmentation of metadata quality, geared towards making data easier to find and reuse.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work proposes a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models, and shows that this approach can lead to substantial performance gains with the simple addition of network and geographic features.
Abstract: Low-dimensional vector representations of social media users can benefit applications like recommendation systems and user attribute inference. Recent work has shown that user embeddings can be improved by combining different types of information, such as text and network data. We propose a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models. Experimenting with the task of friend recommendation on a dataset of 5,019 Twitter users, we show that our approach can lead to substantial performance gains with the simple addition of network and geographic features.
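A rough sketch of the general data-augmentation idea (not the authors' exact procedure) is to discretize non-textual features, such as network neighbours and coarse location, into pseudo-tokens appended to each user's text so that an off-the-shelf text-embedding model can consume them unchanged.

```python
# Rough sketch of the general idea: encode non-textual features (network
# neighbours, coarse location) as pseudo-tokens appended to each user's text so
# an off-the-shelf text-embedding model can use them. Not the authors' exact
# procedure; the token naming and grid size are arbitrary choices here.
def augment_user_document(text_tokens, friend_ids, lat, lon, grid=1.0):
    network_tokens = [f"__friend_{fid}" for fid in friend_ids]
    geo_token = [f"__geo_{int(lat // grid)}_{int(lon // grid)}"]
    return text_tokens + network_tokens + geo_token

doc = augment_user_document(
    ["coffee", "marathon", "training"],
    friend_ids=[17, 204],
    lat=40.7, lon=-74.0,
)
print(doc)
# ['coffee', 'marathon', 'training', '__friend_17', '__friend_204', '__geo_40_-74']
```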

Journal ArticleDOI
TL;DR: This article presents an automatic metadata extraction approach that creates metadata from optical data acquired by various satellite missions of scientific interest, based on an extended model of the ISO 19115 standard.
Abstract: Scientists as well as public institutions dealing with geospatial data often work with large amounts of heterogeneous data deriving from different sources. Without a well-defined, organized structure they face problems in finding and reusing existing data, and as a consequence this may cause data inconsistency and storage problems. A catalog system based on the metadata of spatial data facilitates the management of large amounts of data and offers services to retrieve, discover and exchange geographic data in a quick and easy fashion. Currently, most online catalogs focus on geographic data, and there has been little interest in cataloguing Earth observation data, for which the acquisition information also matters. This article presents an automatic metadata extraction approach that creates metadata from optical data acquired by various satellite missions of scientific interest (i.e. MODIS, LANDSAT, RapidEye, Suomi-NPP VIIRS, Sentinel-1A, Sentinel-2A), based on an extended model of the ISO 19115 standard. The XML schema ISO 19139-2, with the support for gridded and imagery information defined in ISO 19115-2, was examined, and based on the requirements of experts working in the research field of Earth observation the schema was extended. The XML schema ISO 19139-2 and its extension have been deployed as a new schema plugin in the spatial catalog GeoNetwork Open Source in order to store all relevant metadata about satellite data and the appropriate acquisition and processing information in an online catalog. A real-world scenario in productive use at the EURAC Institute for Applied Remote Sensing illustrates a workflow management for Earth observation data including data processing, metadata extraction, generation and distribution.
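A drastically simplified sketch of emitting such metadata is shown below; the element names are inspired by ISO 19115/19139 (gmd:MD_Metadata, gmd:fileIdentifier), but a real record contains many more elements and must validate against the schema, so treat this purely as an illustration.

```python
# Drastically simplified sketch of writing a metadata stub with element names
# inspired by ISO 19115/19139 (gmd:MD_Metadata, gmd:fileIdentifier,
# gco:CharacterString). Real records carry many more elements and must validate
# against the schema; this is illustrative only.
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)

def metadata_stub(scene):
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    ident = ET.SubElement(root, f"{{{GMD}}}fileIdentifier")
    ET.SubElement(ident, f"{{{GCO}}}CharacterString").text = scene["id"]
    date = ET.SubElement(root, f"{{{GMD}}}dateStamp")
    ET.SubElement(date, f"{{{GCO}}}Date").text = scene["acquired"]
    return ET.tostring(root, encoding="unicode")

print(metadata_stub({"id": "S2A_MSIL1C_20170601T101031", "acquired": "2017-06-01"}))
```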

Patent
14 Sep 2017
TL;DR: In this article, a computer-implemented method of managing data in a data repository is disclosed, which comprises maintaining a data repository that stores data imported from one or more data sources.
Abstract: A computer-implemented method of managing data in a data repository is disclosed. The method comprises maintaining a data repository, the data repository storing data imported from one or more data sources. A database entity added to the data repository is identified, and a metadata object for storing metadata relating to the database entity is created and stored in a metadata repository. The metadata object is also added to a documentation queue. Metadata for the metadata object is received from a user via a metadata management user interface, and the received metadata is stored in the metadata repository and associated with the metadata object.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: In two prototype implementations, object labels, gaze data from eye-tracking, and the corresponding video are embedded into a single multimedia container and visualized using a media player, to facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion.
Abstract: There is an ever-increasing number of video data sets that comprise additional metadata, such as object labels, tagged events, or gaze data. Unfortunately, metadata are usually stored in separate files in custom-made data formats, which reduces accessibility even for experts and makes the data inaccessible for non-experts. Consequently, we still lack interfaces for many common use cases, such as visualization, streaming, data analysis, machine learning, high-level understanding and semantic web integration. To bridge this gap, we want to promote the use of existing multimedia container formats to establish a standardized method of incorporating content and metadata. This will facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion, as shown in the attached demonstration video and files. In two prototype implementations, we embed object labels, gaze data from eye-tracking and the corresponding video into a single multimedia container and visualize this data using a media player. Based on this prototype, we discuss the benefit of our approach as a possible standard. Finally, we argue for the inclusion of MPEG-7 in multimedia containers as a further improvement.
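As a simple, related illustration (an assumption for this sketch, not the authors' exact pipeline), gaze samples can be serialized as a WebVTT track, a textual format that common muxing tools can carry alongside the video inside a container and that standard players understand.

```python
# Simple related illustration (not the authors' exact pipeline): serializing gaze
# samples as WebVTT cues, a textual sidecar/stream format that common muxing tools
# can embed alongside video in a multimedia container.
import json

def vtt_timestamp(seconds):
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def gaze_to_webvtt(samples, cue_length=0.1):
    lines = ["WEBVTT", ""]
    for t, x, y in samples:
        lines += [f"{vtt_timestamp(t)} --> {vtt_timestamp(t + cue_length)}",
                  json.dumps({"x": x, "y": y}), ""]
    return "\n".join(lines)

print(gaze_to_webvtt([(0.0, 0.41, 0.63), (0.1, 0.44, 0.60)]))
```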

Journal ArticleDOI
TL;DR: This work designs Dindex, a distributed indexing service for metadata that incorporates a hierarchy of coarse-grained aggregation and horizontal key-coalition and demonstrates that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.
Abstract: In the Big Data era, applications are generating orders of magnitude more data in both volume and quantity. While many systems have emerged to address this data explosion, the fact that the descriptors of these data, i.e., metadata, are also "big" is often overlooked. The conventional approach to the big metadata issue is to disperse metadata across multiple machines. However, it is extremely difficult to preserve both load balance and data locality in this approach. To this end, in this work we propose hierarchical indirection layers for indexing the underlying distributed metadata. By doing this, data locality is achieved efficiently by the indirection while load balance is preserved. Three key challenges exist in this approach, however: first, how to achieve high resilience; second, how to ensure flexible granularity; third, how to restrain performance overhead. To address the above challenges, we design Dindex, a distributed indexing service for metadata. Dindex incorporates a hierarchy of coarse-grained aggregation and horizontal key-coalition. Theoretical analysis shows that the overhead of building Dindex is compensated by only two or three queries. Dindex has been implemented on a lightweight distributed key-value store and integrated into a fully-fledged distributed filesystem. Experiments demonstrated that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.

Proceedings ArticleDOI
27 Jun 2017
TL;DR: It is shown that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.
Abstract: Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma---an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.


Proceedings ArticleDOI
24 Sep 2017
TL;DR: A systematic mapping study of approaches and tools labeling source code elements with metadata and presenting them to developers in various forms, forming a taxonomy with four dimensions — source, target, presentation and persistence.
Abstract: Source code is a primary artifact where programmers are looking when they try to comprehend a program. However, to improve program comprehension efficiency, tools often associate parts of source code with metadata collected from static and dynamic analysis, communication artifacts and many other sources. In this article, we present a systematic mapping study of approaches and tools labeling source code elements with metadata and presenting them to developers in various forms. We selected 25 from more than 2,000 articles and categorized them. A taxonomy with four dimensions — source, target, presentation and persistence — was formed. Based on the survey results, we also identified interesting future research challenges.

Journal ArticleDOI
TL;DR: With OSSE, the foundation is laid to operate linked patient registries while respecting strong data protection regulations; the feedback given by users will influence further development of OSSE.
Abstract: Meager amounts of data stored locally, a small number of experts, and a broad spectrum of technological solutions incompatible with each other characterize the landscape of registries for rare diseases in Germany. Hence, the free software Open Source Registry for Rare Diseases (OSSE) was created to unify and streamline the process of establishing specific rare disease patient registries. The data to be collected are specified based on metadata descriptions within the registry framework's so-called metadata repository (MDR), which was developed according to the ISO/IEC 11179 standard. The use of a central MDR allows the same data elements to be shared across any number of registries, thus providing a technical prerequisite for making data comparable and mergeable between registries and promoting interoperability. With OSSE, the foundation is laid to operate linked patient registries while respecting strong data protection regulations. Using the federated search feature, data for clinical studies can be identified across registries. Data integrity, however, remains intact, since no actual data leave the premises without the owner's consent. Additionally, registry solutions other than OSSE can participate via the OSSE bridgehead, which acts as a translator between OSSE registry networks and non-OSSE registries. The pseudonymization service Mainzelliste adds further data protection. Currently, more than 10 installations are under construction in clinical environments (including university hospitals in Frankfurt, Hamburg, Freiburg and Münster). The feedback given by the users will influence further development of OSSE. As an example, the installation process of the registry for undiagnosed patients at University Hospital Frankfurt is described in more detail.
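ISO/IEC 11179 describes data elements through designations, definitions, and value domains. The dataclass below is a loose, simplified illustration of the kind of shared element description a central MDR could hold; it is not OSSE's actual data model.

```python
# Loose illustration of an ISO/IEC 11179-style data element description that a
# central metadata repository (MDR) could share across registries. Field names
# are simplified; this is not OSSE's actual data model.
from dataclasses import dataclass, field

@dataclass
class DataElement:
    designation: str            # human-readable name
    definition: str             # precise meaning of the element
    data_type: str              # e.g. "integer", "string", "date"
    permissible_values: list = field(default_factory=list)  # enumerated value domain

age_at_diagnosis = DataElement(
    designation="Age at diagnosis",
    definition="Age of the patient, in completed years, at first diagnosis.",
    data_type="integer",
)

sex = DataElement(
    designation="Administrative sex",
    definition="Sex of the patient as recorded administratively.",
    data_type="string",
    permissible_values=["female", "male", "other", "unknown"],
)

print(sex.permissible_values)
```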

Journal ArticleDOI
TL;DR: This paper presents a metadata reporting framework (FRAMES) that enables the management and synthesis of observational data essential to advancing a predictive understanding of earth systems; it utilizes best practices for data and metadata organization, enabling consistent data reporting and compatibility with a variety of standardized data protocols.

Proceedings ArticleDOI
12 Nov 2017
TL;DR: EMPRESS provides a simple example of the next step in this evolution of application-level metadata management by integrating per-process metadata with the storage system itself, making it more broadly useful than single file or application formats.
Abstract: Significant challenges exist in the efficient retrieval of data from extreme-scale simulations. An important and evolving method of addressing these challenges is application-level metadata management. Historically, HDF5 and NetCDF have eased data retrieval by offering rudimentary attribute capabilities that provide basic metadata. ADIOS simplified data retrieval by utilizing metadata for each process' data. EMPRESS provides a simple example of the next step in this evolution by integrating per-process metadata with the storage system itself, making it more broadly useful than single file or application formats. Additionally, it allows for more robust and customizable metadata.

Journal ArticleDOI
TL;DR: Evaluating OntoSoft for organizing the metadata associated with a data pre-processing software workflow used with the Variable Infiltration Capacity (VIC) hydrologic model suggests that past efforts to document this software captured key model metadata in unstructured files that could be formalized into a machine-readable form using the OntoSoft Ontology.
Abstract: Metadata for hydrologic models is rarely organized in machine-readable forms. This lack of formal metadata is important because it limits the ability to catalog, identify, attribute, and understand unique model software; ultimately, it hinders the ability to reproduce past computational studies. Researchers have recently proposed an ontology for scientific software metadata, called OntoSoft, to address this problem. The objective of this research is to evaluate OntoSoft for organizing the metadata associated with a data pre-processing software workflow used in association with the Variable Infiltration Capacity (VIC) hydrologic model. This is accomplished by exploring what metadata are available from online resources and how this metadata aligns with the OntoSoft Ontology. The results suggest that past efforts to document this software captured key model metadata in unstructured files that could be formalized into a machine-readable form using the OntoSoft Ontology. Highlights: The OntoSoft Ontology and Portal are evaluated for capturing and sharing metadata for hydrologic modeling software. A data pre-processing software workflow for the Variable Infiltration Capacity (VIC) hydrologic model is used as a case study. 90% of the required OntoSoft metadata was obtained for 13 of the 15 software resources. Metadata divided across six sources can now be organized in a consistent, machine-readable form.


Patent
Norie Iwasaki, Matsui Sosuke, Tsuyoshi Miyamura, Terue Watanabe, Yamamoto Noriko
09 Feb 2017
TL;DR: An information processing apparatus, backup method, and program product that enable efficient differential backup of files stored in a storage device are presented, comprising a metadata management unit, a map generation unit, and a backup management unit that scans metadata to detect files created, modified, or deleted since the last backup.
Abstract: An information processing apparatus, backup method, and program product that enable efficient differential backup. In one embodiment, an information processing apparatus for files stored in a storage device includes: a metadata management unit for managing metadata of files stored in the storage device; a map generation unit for generating a map which indicates whether metadata associated with an identification value uniquely identifying a file in the storage device is present or absent; and a backup management unit for scanning the metadata to detect files that have been created, modified, or deleted since the last backup, and storing at least a data block and the metadata for a detected file in a backup storage device as backup information in association with the identification value.
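The general idea of detecting created, modified, or deleted files by comparing current file metadata against a snapshot taken at the previous backup can be sketched as follows; files are keyed by path here for simplicity, whereas the patent keys them by a unique identification value.

```python
# Minimal sketch of detecting created/modified/deleted files by comparing current
# file metadata against a snapshot taken at the previous backup. Files are keyed
# by path here; the patent keys them by a unique identification value.
import os

def scan_metadata(root):
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot

def diff_since_last_backup(previous, current):
    created  = [p for p in current if p not in previous]
    deleted  = [p for p in previous if p not in current]
    modified = [p for p in current if p in previous and current[p] != previous[p]]
    return created, modified, deleted

# previous = load_snapshot("last_backup.json")   # hypothetical: stored with the last backup
# created, modified, deleted = diff_since_last_backup(previous, scan_metadata("/data"))
```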

Journal ArticleDOI
TL;DR: This work extends previous research, in which a publication service has been designed in the framework of the European Directive Infrastructure for Spatial Information in Europe (INSPIRE) as a solution to assist users in automatically publishing geospatial data and metadata in order to improve SDI maintenance and usability.
Abstract: Nowadays, the existence of metadata is one of the most important aspects of effective discovery of geospatial data published in Spatial Data Infrastructures (SDIs). However, due to the lack of efficient mechanisms, integrated in the data workflow, to assist users in metadata generation, a lot of low-quality and outdated metadata are stored in the catalogues. This paper presents a mechanism for generating and publishing metadata through a publication service. This mechanism is provided as a web service implemented with a standard interface, called a web processing service, which improves interoperability with other SDI components. This work extends previous research, in which a publication service was designed in the framework of the European Directive Infrastructure for Spatial Information in Europe (INSPIRE) as a solution to assist users in automatically publishing geospatial data and metadata in order to improve, among other aspects, SDI maintenance and usability. This work also adds extra features in order to support more geospatial formats, such as sensor data.

Journal ArticleDOI
TL;DR: A set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files is reported; it reduces the time needed to capture metadata substantially, is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets.
Abstract: Summary: Submission to the MetaboLights repository for metabolomics data currently places the burden of reporting instrument and acquisition parameters in ISA-Tab format on users, who have to do it manually, a process that is time consuming and prone to user input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA reduces the time needed to capture metadata substantially (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets. Availability and Implementation: mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/. Contact: reza.salek@ebi.ac.uk or isatools@googlegroups.com. Supplementary information: Supplementary data are available at Bioinformatics online.
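For a sense of where this acquisition metadata lives, the sketch below pulls an instrument cvParam out of a heavily trimmed, inline mzML-like snippet; the real formats are far richer, and mzML2ISA/nmrML2ISA are the appropriate tools for producing the actual ISA-Tab output.

```python
# Illustration of where acquisition metadata lives in mzML: instrument settings
# appear as controlled-vocabulary <cvParam> entries. The inline document below is
# heavily trimmed; mzML2ISA/nmrML2ISA handle the real formats and ISA-Tab output.
import xml.etree.ElementTree as ET

MZML_SNIPPET = """\
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <cvParam cvRef="MS" accession="MS:1000031" name="instrument model"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>
"""

NS = {"mz": "http://psi.hupo.org/ms/mzml"}
root = ET.fromstring(MZML_SNIPPET)
for param in root.findall(".//mz:instrumentConfiguration/mz:cvParam", NS):
    print(param.get("accession"), param.get("name"))
```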

Journal ArticleDOI
TL;DR: DiNoDB is proposed, an interactive-speed query engine for ad-hoc queries on temporary data that avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions.
Abstract: As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad-hoc queries compared to alternatives.
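The piggybacking idea can be sketched in a toy form: while a batch job streams through a file once, it also records lightweight positional metadata (here, byte offsets per key) that later ad-hoc queries can use to seek directly instead of rescanning. This is an illustration of the concept, not DiNoDB's actual metadata format.

```python
# Toy sketch of piggybacking metadata creation on a batch pass: while streaming a
# CSV once for the batch job, also record byte offsets per key so a later ad-hoc
# query can seek directly instead of rescanning. Not DiNoDB's actual metadata format.
def batch_pass_with_index(path, key_column=0):
    offsets = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.decode("utf-8").rstrip("\n").split(",")[key_column]
            offsets.setdefault(key, []).append(pos)
            # ... the batch job's own processing of `line` would happen here ...
    return offsets

def interactive_lookup(path, offsets, key):
    rows = []
    with open(path, "rb") as f:
        for pos in offsets.get(key, []):
            f.seek(pos)
            rows.append(f.readline().decode("utf-8").rstrip("\n"))
    return rows

# offsets = batch_pass_with_index("events.csv")     # built during the batch phase
# print(interactive_lookup("events.csv", offsets, "user_42"))
```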

Journal ArticleDOI
TL;DR: The analysis results demonstrate that the model makes online metadata rebalance feasible without obstructing normal operation and increases the chances of maintaining balance in a huge cluster of metadata servers.
Abstract: This paper presents an effective method of metadata rebalance in exascale distributed file systems. Exponential data growth has led to the need for an adaptive and robust distributed file system, whose typical architecture is composed of a large cluster of metadata servers and data servers. Though each metadata server may start with an equally divided subset of the entire metadata set, a global imbalance in the placement of metadata among metadata servers eventually emerges, and this imbalance worsens over time. To ensure that disproportionate metadata placement does not have a negative effect on the intrinsic performance of a metadata server cluster, it is necessary to recover the balanced performance of the cluster periodically. However, this cannot be easily done, because rebalancing seriously hampers the normal operation of a file system. This situation continues to get worse with both an ever-present heavy workload on the file system and frequent failures of server components at exascale. One of the primary reasons for such degraded performance is that file system clients frequently fail to look up metadata from the metadata server cluster during the period of metadata rebalance; thus, metadata operations cannot proceed at their normal speed. We propose a metadata rebalance model that minimizes failures of metadata operations during the metadata rebalance period and validate the proposed model through a cost analysis. The analysis results demonstrate that our model makes online metadata rebalance feasible without obstructing normal operation and increases the chances of maintaining balance in a huge cluster of metadata servers.