
Showing papers on "Metadata repository" published in 2018


Journal ArticleDOI
TL;DR: MIRACUM, as discussed by the authors, is a consortium of academic and hospital partners as well as one industrial partner in eight German cities which have joined forces to create interoperable data integration centres (DIC) and make data within those DIC available for innovative new IT solutions in patient care and medical research.
Abstract: Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on the German Medical Informatics Initiative. Similar to other large international data sharing networks (e.g. OHDSI, PCORnet, eMerge, RD-Connect), MIRACUM is a consortium of academic and hospital partners as well as one industrial partner in eight German cities which have joined forces to create interoperable data integration centres (DIC) and make data within those DIC available for innovative new IT solutions in patient care and medical research. Objectives: Sharing data shall be supported by common interoperable tools and services, in order to leverage the power of such data for biomedical discovery and to move towards a learning health system. This paper aims at illustrating the major building blocks and concepts which MIRACUM will apply to achieve this goal. Governance and Policies: Besides establishing an efficient governance structure within the MIRACUM consortium (based on the steering board, a central administrative office, the general MIRACUM assembly, six working groups and the international scientific advisory board), defining DIC governance rules and data sharing policies, as well as establishing (at each MIRACUM DIC site, but also for MIRACUM in total) use and access committees, are major building blocks for the success of such an endeavor. Architectural Framework and Methodology: The MIRACUM DIC architecture builds on a comprehensive ecosystem of reusable open source tools (MIRACOLIX), which are linkable and interoperable amongst each other, but also with the existing software environment of the MIRACUM hospitals. Efficient data protection measures, considering patient consent, data harmonization and a MIRACUM metadata repository, as well as a common data model are major pillars of this framework. The methodological approach for shared data usage relies on a federated querying and analysis concept. Use Cases: MIRACUM aims at proving the value of its DIC with three use cases: IT support for patient recruitment into clinical trials, the development and routine care implementation of a clinico-molecular predictive knowledge tool, and molecular-guided therapy recommendations in molecular tumor boards. Results: Based on the MIRACUM DIC release in the nine-month conceptual phase, first large-scale analyses for stroke and colorectal cancer cohorts have been pursued. Discussion: Beyond all technological challenges, successfully applying the MIRACUM tools to enrich our knowledge about diagnostic and therapeutic concepts, and thus supporting the concept of a Learning Health System, will be crucial for acceptance and sustainability in the medical community and the MIRACUM university hospitals.
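
To make the federated querying and analysis concept concrete, the following minimal sketch (with hypothetical site names and a simplified criteria format, not the actual MIRACUM query interface) shows how each DIC could evaluate a cohort query locally so that only aggregate counts leave the site:

```python
# Minimal sketch of a federated count query across data integration centres (DIC).
# Site names, patient records and the query format are hypothetical illustrations;
# in a real deployment each site evaluates the query inside its own infrastructure
# and returns only aggregate results.

def local_cohort_count(site_data, criteria):
    """Evaluate inclusion criteria locally and return only an aggregate count."""
    return sum(1 for patient in site_data
               if all(patient.get(k) == v for k, v in criteria.items()))

def federated_count(sites, criteria):
    """Collect per-site counts; individual-level data never leaves a site."""
    return {name: local_cohort_count(data, criteria) for name, data in sites.items()}

if __name__ == "__main__":
    sites = {
        "DIC_A": [{"diagnosis": "stroke", "age_over_65": True},
                  {"diagnosis": "colorectal_cancer", "age_over_65": False}],
        "DIC_B": [{"diagnosis": "stroke", "age_over_65": True}],
    }
    print(federated_count(sites, {"diagnosis": "stroke"}))  # {'DIC_A': 1, 'DIC_B': 1}
```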

70 citations


Journal ArticleDOI
TL;DR: The so-called FAIR Data Point was integrated into OSSE to provide a description of metadata in a FAIR manner, which is an important step towards unified documentation across multiple registries.
Abstract: The Open Source Registry for Rare Diseases (OSSE) provides a concept and a software for the management of registries for patients with rare diseases. A disease is defined as rare if less than 5 out of 10,000 people are affected. To date, approximately 6,000 rare diseases have been catalogued. Networking and data exchange for research purposes remain challenging due to the lack of interoperability and the fact that small data stocks are stored locally. The so-called "Findable, Accessible, Interoperable, Reusable" (FAIR) Data Principles have been developed to improve research in the field of rare diseases. Subsequently, the OSSE architecture was adapted to implement the FAIR Data Principles. Therefore, the so-called FAIR Data Point was integrated into OSSE to provide a description of metadata in a FAIR manner. OSSE relies on the existing metadata repository (MDR), which is used to define data elements in the system. This is an important step towards unified documentation across multiple registries. The integration and use of new procedures to improve interoperability plays an important role in the context of registries for rare diseases.
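
As a rough illustration of exposing registry metadata in a FAIR, machine-readable way, the sketch below uses rdflib and the DCAT vocabulary; the catalog URI and the chosen properties are illustrative assumptions and do not reproduce the actual OSSE/FAIR Data Point layout.

```python
# Minimal sketch of publishing registry metadata as machine-readable RDF, in the
# spirit of a FAIR Data Point. Requires rdflib; the URI and properties below are
# invented for the example.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
catalog = URIRef("https://example.org/fdp/catalog/rare-disease-registry")  # hypothetical
g.add((catalog, RDF.type, DCAT.Catalog))
g.add((catalog, DCTERMS.title, Literal("Example rare disease registry")))
g.add((catalog, DCTERMS.description, Literal("Registry metadata exposed for findability")))

print(g.serialize(format="turtle"))
```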

12 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: OntoSoft-VFF (Ontology for Software Version, Function and Functionality), a software metadata repository designed to capture information about software and workflow components that is important for managing workflow exploration and evolution, is proposed and implemented.
Abstract: Scientific workflow management systems play a major role in the design, execution and documentation of computational experiments. However, they have limited support for managing workflow evolution and exploration because they lack rich metadata for the software that implements workflow components. Such metadata could be used to support scientists in exploring local adjustments to a workflow, replacing components with similar software, or upgrading components upon release of newer software versions. To address this challenge, we propose OntoSoft-VFF (Ontology for Software Version, Function and Functionality), a software metadata repository designed to capture information about software and workflow components that is important for managing workflow exploration and evolution. Our approach uses a novel ontology to describe the functionality and evolution through time of any software used to create workflow components. OntoSoft-VFF is implemented as an online catalog that stores semantic metadata for software to enable workflow exploration through understanding of software functionality and evolution. The catalog also supports comparison and semantic search of software metadata. We showcase OntoSoft-VFF using machine learning workflow examples. We validate our approach by testing that a workflow system could compare differences in software metadata, explain software updates and describe the general functionality of workflow steps.
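
A simple way to picture the comparison of software metadata across versions (not the actual OntoSoft-VFF ontology or data model) is a field-level diff between two version records:

```python
# Illustrative sketch: comparing the metadata of two versions of a workflow
# component to explain what changed between releases. Field names are invented.

def diff_software_metadata(old, new):
    """Return added, removed and changed metadata fields between two versions."""
    added   = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"name": "classifier", "version": "1.0", "functionality": "decision tree"}
v2 = {"name": "classifier", "version": "2.0", "functionality": "random forest",
      "parameter": "n_estimators"}
print(diff_software_metadata(v1, v2))
```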

12 citations


Journal ArticleDOI
TL;DR: The structure and features of the Samply.MDR as well as its flexible usability are presented by giving an overview of its application in various projects.
Abstract: Collaboration in medical research is becoming common, especially for collecting relevant cases across institutional boundaries. If the data, which is usually very heterogeneously formalized and structured, can be integrated, such a collaboration can facilitate research. An absolute prerequisite for this is an extensive description of the formalization and exact meaning of every data element contained in a dataset. This information is commonly known as metadata. Various research networking projects tackle this challenge with the development of concepts and IT tools. The Samply Metadata Repository (Samply.MDR) is a solution for managing and publishing such metadata in a standardized and reusable way. In this article we present the structure and features of the Samply.MDR as well as its flexible usability by giving an overview of its application in various projects.

10 citations


Journal ArticleDOI
TL;DR: MetaStore is an adaptive metadata management framework based on a NoSQL database and an RDF triple store that automatically segregates the different categories of metadata into their corresponding data models to maximize the utilization of the data models supported by NoSQL databases.
Abstract: In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data, and the handling of associated metadata is critical, as it enables discovering, analyzing, reusing, and sharing of scientific data. Moreover, metadata produced by scientific experiments are heterogeneous and subject to frequent changes, demanding a flexible data model. Existing metadata management systems provide a broad range of features for handling scientific metadata. However, the principal limitation of these systems is an architecture design that is restricted to either a single or at most a few standard metadata models. Support for handling different types of metadata models, i.e., administrative, descriptive, structural, and provenance metadata, as well as community-specific metadata models, is not possible with these systems. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database and an RDF triple store. MetaStore provides a set of core functionalities to handle heterogeneous metadata models by automatically generating the necessary software code (services) and extends the functionality of the framework on the fly. To handle dynamic metadata and to control metadata quality, MetaStore also provides an extended set of functionalities, such as enabling annotation of images and text by integrating the Web Annotation Data Model, allowing communities to define discipline-specific vocabularies using the Simple Knowledge Organization System, and providing advanced search and analytical capabilities by integrating Elasticsearch. To maximize the utilization of the data models supported by NoSQL databases, MetaStore automatically segregates the different categories of metadata into their corresponding data models. Complex provenance graphs and dynamic metadata are modeled and stored in an RDF triple store, whereas the static metadata is stored in a NoSQL database. To enable large-scale harvesting (sharing) of metadata using the METS standard over the OAI-PMH protocol, MetaStore is designed to be OAI-compliant. Finally, to show the practical usability of the MetaStore framework and that the requirements from the research communities have been realized, we describe our experience in the adoption of MetaStore for three communities.
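
The segregation idea can be sketched in a few lines; the in-memory structures below merely stand in for the NoSQL database and the RDF triple store, and the record and predicate names are invented for the example.

```python
# Simplified illustration of MetaStore's segregation idea: static/descriptive
# metadata goes to a document (NoSQL-style) store, while provenance and other
# graph-shaped metadata go to a triple store.

document_store = {}   # stands in for a NoSQL database
triple_store = []     # stands in for an RDF triple store

def ingest(record_id, descriptive_metadata, provenance_triples):
    """Route each category of metadata to its most suitable data model."""
    document_store[record_id] = descriptive_metadata   # static descriptive metadata
    triple_store.extend(provenance_triples)            # provenance as triples

ingest(
    "dataset-42",
    {"title": "Beamline scan", "creator": "Lab A", "format": "HDF5"},
    [("dataset-42", "prov:wasGeneratedBy", "scan-run-7"),
     ("scan-run-7", "prov:used", "sample-13")],
)
print(document_store["dataset-42"]["title"], len(triple_store))
```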

10 citations


07 Feb 2018
TL;DR: In this paper, the authors developed a specific innovative methodology based on recent advances in "big data" intelligent databases applied to the growing amount of high-spatial and multi-wavelength resolution, high-cadence data from NASA's missions and supporting ground-based observatories.
Abstract: The fundamental motivation of the project is that the scientific output of solar research can be greatly enhanced by better exploitation of the existing solar/heliosphere space-data products jointly with ground-based observations. Our primary focus is on developing a specific innovative methodology based on recent advances in "big data" intelligent databases applied to the growing amount of high-spatial and multi-wavelength resolution, high-cadence data from NASA's missions and supporting ground-based observatories. Our flare database is not simply a manually searchable time-based catalog of events or list of web links pointing to data. It is a preprocessed metadata repository enabling fast search and automatic identification of all recorded flares sharing a specifiable set of characteristics, features, and parameters. The result is a new and unique database of solar flares and data search and classification tools for the Heliophysics community, enabling multi-instrument/multi-wavelength investigations of flare physics and supporting further development of flare-prediction methodologies.
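
The kind of parametric search such a preprocessed metadata repository enables can be illustrated with a toy example; the field names are invented for the sketch and do not reflect the project's actual schema.

```python
# Toy illustration of searching a flare metadata repository for all events
# sharing a specifiable set of characteristics and parameters.

def find_flares(repository, **criteria):
    """Return all flare records whose metadata match every given criterion."""
    return [f for f in repository
            if all(f.get(key) == value for key, value in criteria.items())]

flares = [
    {"id": "F1", "goes_class": "M1.0", "has_euv_coverage": True},
    {"id": "F2", "goes_class": "X2.2", "has_euv_coverage": True},
    {"id": "F3", "goes_class": "M1.0", "has_euv_coverage": False},
]
print(find_flares(flares, goes_class="M1.0", has_euv_coverage=True))  # only F1 matches
```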

8 citations


Journal ArticleDOI
TL;DR: The solution, presented in this work, provides extensibility to simple and complex data types, unary and binary operations, type conversions, functions and visuals, thus enabling developers to seamlessly add new features to SLGeometry by implementing them as C# classes annotated with metadata.

7 citations


Patent
23 May 2018
TL;DR: In this article, a closed-loop unified metadata architecture is proposed to provide a meaningful, consistent and normalized view of the metadata that describes the information, as well as to determine data lineage and meaningful data quality metrics.
Abstract: There has been exponential growth in the capture and retention of immense quantities of information in a globally distributed manner. A closed-loop unified metadata architecture includes a universal metadata repository and implements data quality and data lineage analyses. The architecture solves significant technical challenges to provide a meaningful, consistent and normalized view of the metadata that describes the information, as well as to determine data lineage and meaningful data quality metrics.

6 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This work proposes to use data mining techniques to automatically identify similar structures of relational databases by comparing their metadata, which is composed of physical details of the databases, and shows that this solution is flexible, supporting a variety of schema sizes and DBMS.
Abstract: With the expanding diversity of database technologies and database sizes, it is becoming increasingly hard to identify similar relational databases among many large databases stored in different Database Management Systems (DBMS). Therefore, we propose to use data mining techniques to automatically identify similar structures of relational databases by comparing their metadata, which is composed of physical details of the databases. The amount of metadata is proportional to the size of the schema structure, and the number of possible comparison combinations is quadratic in the number of schemas analyzed. Looking for the most efficient technique, we propose to calculate schema similarity by evaluating the distance of all schemas to just one schema, which serves as a starting point. Schemas with small distances to this reference are more similar to it than schemas with larger distances. We compare this proposal against two other approaches. The first approach compares every schema against every other schema, excluding inverse comparisons. The second approach compares schemas within groups of schemas of similar size. To validate our proposal, an experiment was performed with 354 real schemas ranging in size from 2 to 20 thousand metadata elements, together totaling more than 26 thousand tables and 238 thousand columns. These schemas came from 5 different DBMS. The extracted metadata was transformed and formatted for comparing pairs of schemas. Textual features are compared using cosine distance and numerical features using Euclidean distance. Then, hierarchical clustering is used to facilitate visualization of the schemas that most closely resemble one another. Results showed that our approach was the most efficient: it compared all schemas and identified the most structurally similar ones in less than 2 minutes. The extracted metadata was used to create the first version of the metadata repository and an initial version of a data catalog, which contributed to the knowledge of existing data. Using this procedure, duplicated schemas were discovered and then discontinued, resulting in cost savings of 10% while freeing up infrastructure resources. This solution is flexible; it supports a variety of schema sizes and DBMS.
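
A minimal sketch of the distance-to-a-reference-schema idea, with deliberately simplified feature extraction compared with real database metadata, might look as follows:

```python
# Textual features compared with cosine distance over token counts, numeric
# features with Euclidean distance, and every schema measured against a single
# reference schema (the "starting point"). Feature choice is illustrative only.
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def euclidean_distance(nums_a, nums_b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(nums_a, nums_b)))

def distance_to_reference(reference, schemas):
    """Compare every schema against one reference schema and sort by distance."""
    results = {}
    for name, schema in schemas.items():
        d_text = cosine_distance(reference["columns"], schema["columns"])
        d_num = euclidean_distance(reference["sizes"], schema["sizes"])
        results[name] = d_text + d_num
    return sorted(results.items(), key=lambda kv: kv[1])

reference = {"columns": "id name created_at", "sizes": [3, 1]}   # column count, table count
schemas = {
    "hr_db":    {"columns": "id name created_at salary", "sizes": [4, 1]},
    "sales_db": {"columns": "order_id amount", "sizes": [2, 1]},
}
print(distance_to_reference(reference, schemas))  # hr_db is closest to the reference
```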

6 citations


Journal ArticleDOI
TL;DR: The goal of the study is to quantitatively measure completeness of metadata records and to determine if metadata developed by LTER is more complete with respect to the recommendation than other collections in EML and in CSDGM.

6 citations


Patent
29 Mar 2018
TL;DR: In this article, the authors describe a hybrid data management system that operates by receiving, from a user interface, a modification to a field of data, which is transmitted to the decentralized data management systems.
Abstract: Disclosed herein are system, method, and computer program product embodiments for a hybrid data management system. An embodiment operates by receiving, from a user interface, a modification to a field of data. It is determined that the field of data corresponds to a decentralized data management system based on a look-up to a metadata repository. The modification is transmitted to the decentralized data management system. From the decentralized data management system, an asset identifier corresponding to the modification is received. The asset identifier is stored in a centralized database. Via the user interface, an indication that the field of data has been modified is provided.
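
A hedged sketch of the claimed flow, with hypothetical class and field names, could look like this: the metadata repository decides whether a field lives in the decentralized system, and the returned asset identifier is kept in a centralized database.

```python
# Illustrative sketch only; not the patent's actual implementation.

metadata_repository = {"patient_consent": "decentralized", "display_name": "centralized"}
centralized_db = {}

class DecentralizedSystem:
    def submit(self, field, value):
        """Record the modification and return an asset identifier for it."""
        return f"asset-{abs(hash((field, value))) % 10_000}"

def apply_modification(field, value, decentralized=DecentralizedSystem()):
    if metadata_repository.get(field) == "decentralized":
        asset_id = decentralized.submit(field, value)     # modification routed out
        centralized_db[field] = {"asset_id": asset_id, "modified": True}
    else:
        centralized_db[field] = {"value": value, "modified": True}
    return centralized_db[field]

print(apply_modification("patient_consent", "granted"))
```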

Journal Article
TL;DR: In this article, the authors evaluate how the on-board techniques of a graph database can be used for matching and mapping, applying algorithms for metadata management to different cancer-related datasets.
Abstract: To exchange data across several sites or to interpret it at a later point in time, it is necessary to create a general understanding of the data. As standard practice, this understanding is achieved through metadata. These metadata are usually stored in relational databases, so-called metadata repositories (MDR). Typical functions of such an MDR include pure storage, administration and other specific metadata functionalities such as finding relations among data elements. This results in a multitude of connections between the data elements, which can be described as highly interconnected graphs. Previous studies have already shown that using alternative databases such as graph databases for modelling and visualisation can be beneficial. The objective of this work is to evaluate how the on-board techniques of a graph database can be used for matching and mapping. Different datasets relating to cancer were entered, and algorithms for metadata management were applied.

Patent
27 Dec 2018
TL;DR: In this paper, a plurality of data elements from a data lake associated with an organization are registered with one or more metadata objects through a metadata registration, which is performed using a graphical user interface by either receiving a manual input from a user or using a REST application programming interface.
Abstract: Embodiments provide data handling methods and systems for data lakes. In an embodiment, the method includes accessing a plurality of data elements from a data lake associated with an organization. Each data element is registered with one or more metadata objects through a metadata registration. The metadata registration is performed using a graphical user interface by either receiving a manual input from a user or using a REST application programming interface. A unified metadata repository is formed based on the metadata registration of the plurality of data elements. Moreover, complex computations of the plurality of data elements for various data processing operations and business rules are performed. Graphical processing of the plurality of data elements in the data lake is performed for analyzing entities and their relationships to generate insights. The method further includes performing an analytical operation based at least on machine learning algorithms and deep learning techniques.
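
A minimal sketch of the registration step and the resulting unified metadata repository, with an invented payload format and a placeholder endpoint rather than anything documented in the patent, might look as follows:

```python
# Registering data lake elements with metadata objects and forming a unified
# metadata repository from those registrations. Illustrative sketch only.
import json

unified_metadata_repository = {}

def register_data_element(element_id, metadata_objects):
    """Attach metadata objects to a data element and record the registration."""
    unified_metadata_repository.setdefault(element_id, []).extend(metadata_objects)
    # In a real system this payload could also be POSTed to a registration
    # endpoint, e.g. https://example.org/metadata/register (hypothetical).
    return json.dumps({"element": element_id, "metadata": metadata_objects})

register_data_element("s3://lake/orders/2018.parquet",
                      [{"type": "schema", "fields": ["order_id", "amount"]},
                       {"type": "owner", "team": "sales-analytics"}])
print(unified_metadata_repository)
```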


Proceedings Article
13 Sep 2018
TL;DR: A method for semantically enhancing the metadata stored in a medical multimedia data warehouse is presented, which allows the system to speed up the execution of a query, by computing the results of new, unforeseen queries, from the fact data already stored in the data warehouse.
Abstract: Data warehouses are versatile systems capable of storing and processing large quantities of data. They are most suited for aggregating and reporting. The data managed by these systems vary from simple, numeric data, to more complex, multimedia data. One of the domains in which multimedia data is intensively produced is medicine. We present a method for semantically enhancing the metadata stored in a medical multimedia data warehouse. This semantically rich environment will gain in autonomy, reducing the dependence on human intervention to resolve new, unforeseen queries. Furthermore, the use of the semantic relations defined in the ontology allows the system to speed up the execution of a query, by computing the results of new, unforeseen queries, from the fact data already stored in the data warehouse.
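
As a toy illustration of answering an unforeseen query from facts already stored, consider rolling counts up an "is-a" hierarchy defined in an ontology; the concept names below are invented for the example.

```python
# Counts recorded for specific concepts are aggregated through semantic
# relations to answer a query about a broader concept, with no precomputed
# aggregate for that concept.

is_a = {"MRI": "imaging_exam", "CT": "imaging_exam", "imaging_exam": "exam"}
fact_counts = {"MRI": 120, "CT": 80, "blood_test": 400}

def descends_from(concept, ancestor):
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

def count_for(concept):
    """Aggregate stored facts for every concept subsumed by the queried one."""
    return sum(n for c, n in fact_counts.items() if descends_from(c, concept))

print(count_for("imaging_exam"))  # 200, derived from the MRI and CT facts
```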

Journal Article
TL;DR: The goal of this work is to provide a way to "inject" the meaning of metadata keys into the web-based frontend of an application to make it "metadata aware".
Abstract: Whenever medical data is integrated from multiple sources, it is regarded good practice to separate data from information about its meaning, such as designations, definitions or permissible values (in short: metadata). However, the ways in which applications work with metadata are imperfect: Many applications do not support fetching metadata from externalized sources such as metadata repositories. In order to display human-readable metadata in any application, we propose not to change the application, but to provide a library that makes a change to the user interface. The goal of this work is to provide a way to "inject" the meaning of metadata keys into the web-based frontend of an application to make it "metadata aware".

Patent
22 Mar 2018
TL;DR: In this paper, a metadata collection system may be executed to automatically populate a metadata template based on the set of potential metadata entries, and the system may update entries in the metadata template using a translation tool and validate the updated entries to ensure that required data elements are present.
Abstract: A back-end application computer server may access a potential metadata entries data store containing a set of potential metadata entries, each entry including at least a data element name and a data element definition. A metadata collection system may be executed to automatically populate a metadata template based on the set of potential metadata entries. The system may update entries in the metadata template using a translation tool and validate the updated entries in the metadata template to ensure that required data elements are present. The system may also certify the validated entries and load the set of certified metadata entries, including the certified data element names and certified data element definitions, into an enterprise metadata repository data store. Electronic messages may be exchanged to support at least one interactive user interface display associated with certification of the metadata template.
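
A hedged sketch of the validate-then-certify step, with illustrative field names rather than the patent's actual data model, could look like this:

```python
# Entries in a metadata template are checked for the required data elements
# before being certified and loaded into an enterprise metadata repository.

REQUIRED_FIELDS = ("data_element_name", "data_element_definition")

def validate_entries(entries):
    """Split template entries into those carrying all required fields and those that do not."""
    valid, invalid = [], []
    for entry in entries:
        (valid if all(entry.get(f) for f in REQUIRED_FIELDS) else invalid).append(entry)
    return valid, invalid

def certify_and_load(entries, repository):
    valid, invalid = validate_entries(entries)
    repository.extend({**e, "certified": True} for e in valid)
    return len(valid), len(invalid)

repository = []
template = [{"data_element_name": "policy_id", "data_element_definition": "Unique policy key"},
            {"data_element_name": "premium"}]  # missing definition, will be rejected
print(certify_and_load(template, repository), repository)
```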

Journal ArticleDOI
TL;DR: This work introduces IntegrityCatalog, a novel software system that can be integrated into any digital repository and introduces a treap‐based persistent authenticated dictionary managing arbitrary length key/value pairs, which it uses to store all integrity metadata.
Abstract: Digital repositories must periodically check the integrity of stored objects to assure users of their correctness. Prior solutions calculate integrity metadata and require the repository to store it alongside the actual data objects. To safeguard and detect damage to this metadata, prior solutions rely on widely visible media (unaffiliated third parties) to store and provide back digests of the metadata to verify it is intact. However, they do not address recovery of the integrity metadata in case of damage or adversarial attack. We introduce IntegrityCatalog, a novel software system that can be integrated into any digital repository. It collects all integrity-related metadata in a single component and treats them as first class objects, managing both their integrity and their preservation. We introduce a treap-based persistent authenticated dictionary managing arbitrary length key/value pairs, which we use to store all integrity metadata, accessible simply by object name. Additionally, IntegrityCatalog is a distributed system that includes a network protocol that manages both corruption detection and preservation of this metadata, using administrator-selected network peers with 2 possible roles. Verifiers store and offer attestations on digests and have minimal storage requirements, while preservers efficiently synchronize a complete copy of the catalog to assist in recovery in case of a detected catalog compromise on the local system. We present our approach in developing the prototype implementation, measure its performance experimentally, and demonstrate its effectiveness in real-world situations. We believe the implementation techniques of our open-source IntegrityCatalog will be useful in the construction of next-generation digital repositories.
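
A greatly simplified sketch of the catalog idea follows; a plain dictionary plus a recomputed digest stands in for the paper's treap-based persistent authenticated dictionary, and no network protocol is modeled.

```python
# Integrity metadata is stored under the object name, and the whole catalog is
# summarized by a single digest that a remote verifier could attest to.
import hashlib

class ToyIntegrityCatalog:
    def __init__(self):
        self._entries = {}

    def put(self, object_name, object_digest):
        self._entries[object_name] = object_digest

    def get(self, object_name):
        return self._entries.get(object_name)

    def root_digest(self):
        """Digest over all entries in canonical order; any change alters it."""
        h = hashlib.sha256()
        for name in sorted(self._entries):
            h.update(name.encode())
            h.update(self._entries[name].encode())
        return h.hexdigest()

catalog = ToyIntegrityCatalog()
catalog.put("report.pdf", hashlib.sha256(b"file bytes").hexdigest())
attested = catalog.root_digest()          # what a verifier would store
print(catalog.root_digest() == attested)  # True until the catalog is tampered with
```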

Patent
14 Jun 2018
TL;DR: In this paper, a system and method for building a hyperdata hub to access an enriched data model is presented, where one or more data models are built based on user input to a user interface, and query definitions are built on the user input.
Abstract: A system and method for building a hyperdata hub to access an enriched data model is presented. One or more data models are built based on user input to a user interface, and one or more query definitions are built based on the user input to the user interface. Data is collected from external data sources and internal data sources, and contextual data is extracted based on the collected data according to the one or more data models and the one or more query definitions. The metadata associated with the one or more data models and one or more query definitions are stored, and data is matched with the contextual data associated with the hyperdata metadata repository.


10 Dec 2018
TL;DR: The Earthdata Search End-to-End Services (E2ES) workflow as mentioned in this paper leverages the Common Metadata Repository's (CMR) newly implemented Unified Metadata Models for Services and Variables as well as a new service broker to expose and seamlessly integrate a collection's service capabilities and variables.
Abstract: The goal of NASA's Earthdata Search End-to-End Services workflow is to take the pain and headache out of searching for data and getting that data back in a usable format that contains only the data relevant to you. For too long scientists have had to jump through endless hoops, use tools that only offer specific data or specific services, and perform any number of other non-science tasks just to get started on their actual project. Earthdata Search leverages the Common Metadata Repository's (CMR) newly implemented Unified Metadata Models for Services and Variables as well as a new service broker to expose and seamlessly integrate a collection's service capabilities and variables into an intuitive user interface. Using the new End-to-End Services workflow, scientists will be able to quickly see what data is available to be customized, what customization options are available, and actually perform those customizations on the data, all within Earthdata Search, regardless of who the data provider is. This talk will demonstrate the simple workflow that will be available to end users and also give an overview covering how the workflow is enabled by the metadata stored within the CMR. (https://search.earthdata.nasa.gov/)
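
For readers who want to explore the underlying metadata, the CMR exposes a public search interface; the snippet below sketches a keyword query against it (it requires the requests package, and the exact URL, parameters and response fields should be treated as assumptions that may change).

```python
# Sketch of querying the Common Metadata Repository's search API for collections
# matching a keyword and printing their titles.
import requests

def search_cmr_collections(keyword, page_size=5):
    resp = requests.get(
        "https://cmr.earthdata.nasa.gov/search/collections.json",  # assumed public endpoint
        params={"keyword": keyword, "page_size": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    return [entry.get("title") for entry in resp.json().get("feed", {}).get("entry", [])]

if __name__ == "__main__":
    print(search_cmr_collections("sea surface temperature"))
```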

Proceedings ArticleDOI
06 Jul 2018
TL;DR: The authors' Universal Metadata Repository (UMR) applied to three in-flight use cases which combine the power of a technical and business view using knowledge graphs for: searching, inferencing, traceability, administration, enforcing accessibility standards, and providing consistent organizational architecture are presented.
Abstract: Managing ever-growing content from heterogeneous data sources is a significant challenge in enterprise environments. Many data analysis tools work in isolation to capture various statistical, quality, and provenance information within the enterprise. Yielding meaningful and consistent information from a landscape of different vendor tools requires a holistic and transparent view over all existing extracted metadata. In this paper, we present our Universal Metadata Repository (UMR) applied to three in-flight use cases which combine the power of a technical and business view using knowledge graphs for: searching, inferencing, traceability, administration, enforcing accessibility standards, and providing consistent organizational architecture.

Patent
01 Mar 2018
TL;DR: In this paper, the authors propose an approach for managing data replication between first and second sites of a distributed computing environment by one or more processors based on an identified data block-set for replication.
Abstract: Embodiments for, in a shared storage environment, managing data replication between first and second sites of a distributed computing environment by one or more processors. Based on an identified data block-set for replication, a unique metadata map is generated as a computed snapshot of the identified data block-set, the metadata map accounting for a predetermined block-size for transfer. The unique metadata map is transferred to the second site. The second site adds the unique metadata map to a global metadata repository.
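
A hedged sketch of building such a metadata map, with an assumed block size and map structure rather than the patent's actual format, might look as follows:

```python
# The map records one digest per block of the identified block-set at a
# predetermined block size, so the second site can add it to a global metadata
# repository and reason about what needs to be transferred.
import hashlib

BLOCK_SIZE = 4096  # predetermined block size for transfer (assumed value)

def build_metadata_map(data: bytes, block_size: int = BLOCK_SIZE):
    """Snapshot the block-set as {block index: digest of that block}."""
    return {
        i // block_size: hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }

global_metadata_repository = {}             # as kept at the second site
snapshot = build_metadata_map(b"x" * 10000)
global_metadata_repository["volume-1"] = snapshot
print(len(snapshot))                        # 3 blocks for 10000 bytes at 4096-byte blocks
```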

Patent
29 Nov 2018
TL;DR: In this article, a system, method, and computer-readable medium are disclosed for performing a deployment operation, comprising: receiving an application module command request; accessing a metadata repository for application modules to obtain metadata corresponding to the application module; determining whether a module corresponding to a command request is loaded within an application based upon metadata corresponding with the application modules.
Abstract: A system, method, and computer-readable medium are disclosed for performing a deployment operation, comprising: receiving an application module command request; accessing a metadata repository for application modules to obtain metadata corresponding to the application module; determining whether an application module corresponding to the application module command request is loaded within an application based upon metadata corresponding to the application module; contacting a package manager to download an application module package if the application module is not loaded within the application or an update to the application module exists; loading the application module package; and, providing an invocation to an entry point of the application module.
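
A hedged sketch of the claimed flow, in which every name is a hypothetical stand-in, could look like this: the module is looked up in a metadata repository, its package is downloaded if it is missing or outdated, and its entry point is then invoked.

```python
# Illustrative sketch only; the package manager and module loading are faked.

metadata_repository = {
    "reporting": {"package": "reporting-module", "version": "2.1", "entry_point": "run"},
}
loaded_modules = {"reporting": "1.0"}   # versions currently loaded in the application

def download_and_load(package, version):
    print(f"downloading and loading {package}=={version} (placeholder for a package manager)")
    return {"run": lambda: f"{package} {version} started"}   # fake module exposing its entry point

def handle_command(module_name):
    meta = metadata_repository[module_name]
    if loaded_modules.get(module_name) != meta["version"]:   # not loaded, or an update exists
        module = download_and_load(meta["package"], meta["version"])
        loaded_modules[module_name] = meta["version"]
    else:
        module = {"run": lambda: f"{meta['package']} already loaded"}
    return module[meta["entry_point"]]()                     # invoke the entry point

print(handle_command("reporting"))
```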