
Showing papers on "Metadata repository" published in 2019


Journal ArticleDOI
TL;DR: It is shown that comparable data quality assessment across different partners of a distributed research network is feasible when a central metadata repository is combined with locally installed assessment processes.
Abstract: Background With the increasing personalization of clinical therapies, translational research is ever more dependent on multisite research cooperations to obtain sufficient data and biomaterial. Distributed research networks rely on the availability of high-quality data stored in local databases operated by their member institutions. However, reusing data documented by independent health providers for the purpose of care, rather than research (“secondary use”), reveals a high variability in terms of data formats, as well as poor data quality, across network sites. Objectives The aim of this work is the provision of a process for the assessment of data quality with regard to completeness and syntactic accuracy across independently operated data warehouses using common definitions stored in a central (network-wide) metadata repository (MDR). Methods For assessment of data quality across multiple sites, we employ a framework of so-called bridgeheads. These are federated data warehouses, which allow the sites to participate in a research network. A central MDR is used to store the definitions of the commonly agreed data elements and their permissible values. Results We present the design for a generator of quality reports within a bridgehead, allowing the validation of data in the local data warehouse against a research network's central MDR. A standardized quality report can be produced at each network site, providing a means to compare data quality across sites, as well as to channel feedback to the local data source systems and local documentation personnel. A reference implementation for this concept has been successfully utilized at 10 sites across the German Cancer Consortium. Conclusions We have shown that comparable data quality assessment across different partners of a distributed research network is feasible when a central metadata repository is combined with locally installed assessment processes. To achieve this, we designed a quality report and the process for generating such a report. The final step was the implementation in a German research network.
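
A minimal sketch of the kind of check such a report generator performs, assuming a hypothetical MDR export of data-element definitions (permissible value lists or format patterns); the element names and values below are invented for illustration and are not taken from the paper:

```python
import re

# Hypothetical MDR export: data element -> definition (permissible values or a format regex)
MDR_DEFINITIONS = {
    "tnm_t_stage": {"permissible_values": {"T0", "T1", "T2", "T3", "T4", "TX"}},
    "date_of_diagnosis": {"format": r"^\d{4}-\d{2}-\d{2}$"},  # ISO 8601 date
}

def quality_report(records):
    """Count completeness and syntactic accuracy per data element."""
    report = {de: {"present": 0, "missing": 0, "valid": 0, "invalid": 0}
              for de in MDR_DEFINITIONS}
    for record in records:
        for de, definition in MDR_DEFINITIONS.items():
            value = record.get(de)
            if value in (None, ""):
                report[de]["missing"] += 1        # completeness
                continue
            report[de]["present"] += 1
            if "permissible_values" in definition:
                ok = value in definition["permissible_values"]
            else:
                ok = re.match(definition["format"], value) is not None
            report[de]["valid" if ok else "invalid"] += 1   # syntactic accuracy
    return report

# Example: one complete, valid record and one with a missing and an invalid value
print(quality_report([
    {"tnm_t_stage": "T2", "date_of_diagnosis": "2018-05-03"},
    {"tnm_t_stage": "T9", "date_of_diagnosis": ""},
]))
```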

15 citations


Journal ArticleDOI
TL;DR: A uniform query interface for metadata repositories is introduced, combining the ISO 11179 standard for metadata repositories and the GraphQL query language; it facilitates access to metadata, enables better interaction with metadata, and provides a basis for connecting existing repositories.
Abstract: Heterogeneous healthcare instance data can hardly be integrated without harmonizing its schema-level metadata. Many medical research projects and organizations use metadata repositories to edit, store and reuse data elements. However, existing metadata repositories differ regarding software implementation and have shortcomings when it comes to exchanging metadata. This work aims to define a uniform interface with a technical interlingua between the different MDR implementations in order to enable and facilitate the exchange of metadata, to query over distributed systems and to promote cooperation. To design a unified interface for multiple existing MDRs, a standardized data model must be agreed on. ISO 11179 is an international standard for the representation of metadata, and since most MDR systems claim to be at least partially compliant, it is suitable for defining an interface thereupon. Therefore, each repository must be able to define which parts can be served and the interface must be able to handle highly linked data. GraphQL is a data access layer and defines query techniques designed to navigate easily through complex data structures. We propose QL4MDR, an ISO 11179-3 compatible GraphQL query language. The GraphQL schema for QL4MDR is derived from the ISO 11179 standard and defines objects, fields, queries and mutation types. Entry points within the schema define the path through the graph to enable search functionalities, but the exchange is also promoted by mutation types, which allow the creation, updating and deletion of metadata. QL4MDR is the foundation for the uniform interface, which is implemented in a modern web-based interface prototype. We have introduced a uniform query interface for metadata repositories combining the ISO 11179 standard for metadata repositories and the GraphQL query language. A reference implementation based on the existing Samply.MDR was implemented. The interface facilitates access to metadata, enables better interaction with metadata, and provides a basis for connecting existing repositories. We invite other ISO 11179-based metadata repositories to take this approach into account.
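
For illustration only, a client query against such an interface might look as follows; the endpoint URL and the field names (dataElements, designation, valueDomain, etc.) are assumptions made for this sketch and are not the published QL4MDR schema:

```python
import json
import urllib.request

# Hypothetical QL4MDR endpoint; field names are illustrative only.
ENDPOINT = "https://mdr.example.org/ql4mdr"

QUERY = """
query SearchDataElements($term: String!) {
  dataElements(search: $term) {
    identifier
    designation
    definition
    valueDomain { datatype permissibleValues { value meaning } }
  }
}
"""

def search_data_elements(term):
    """POST a GraphQL query and return the decoded JSON response."""
    payload = json.dumps({"query": QUERY, "variables": {"term": term}}).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (requires a running endpoint): look up data elements describing tumour stage
# print(search_data_elements("tumour stage"))
```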

11 citations


Journal ArticleDOI
TL;DR: In this article, a lexical bag-of-words matcher was developed to semiautomatically map local biobank terms to the central ADOPT BBMRI-ERIC terminology.
Abstract: Background High-quality clinical data and biological specimens are key for medical research and personalized medicine. The Biobanking and Biomolecular Resources Research Infrastructure-European Research Infrastructure Consortium (BBMRI-ERIC) aims to facilitate access to such biological resources. The accompanying ADOPT BBMRI-ERIC project kick-started BBMRI-ERIC by collecting colorectal cancer data from European biobanks. Objectives To transform these data into a common representation, a uniform approach for data integration and harmonization had to be developed. This article describes the design and the implementation of a toolset for this task. Methods Based on the semantics of a metadata repository, we developed a lexical bag-of-words matcher, capable of semiautomatically mapping local biobank terms to the central ADOPT BBMRI-ERIC terminology. Its algorithm supports fuzzy matching, utilization of synonyms, and sentiment tagging. To process the anonymized instance data based on these mappings, we also developed a data transformation application. Results The implementation was used to process the data from 10 European biobanks. The lexical matcher automatically and correctly mapped 78.48% of the 1,492 local biobank terms, and human experts were able to complete the remaining mappings. We used the expert-curated mappings to successfully process 147,608 data records from 3,415 patients. Conclusion A generic harmonization approach was created and successfully used for cross-institutional data harmonization across 10 European biobanks. The software tools were made available as open source.
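
A toy sketch of the matching idea described here (bag-of-words overlap with fuzzy token comparison and a synonym table); the threshold, the synonym entries and the example terms are invented for illustration and do not reproduce the paper's algorithm:

```python
from difflib import SequenceMatcher

# Illustrative synonym table; entries are invented for the sketch.
SYNONYMS = {"tumour": "tumor", "carcinoma": "cancer"}

def tokens(term):
    """Lower-cased, synonym-normalized bag of words for a term."""
    return {SYNONYMS.get(t, t) for t in term.lower().replace("-", " ").split()}

def fuzzy_overlap(bag_a, bag_b, threshold=0.85):
    """Count tokens of bag_a that fuzzily match some token of bag_b."""
    hits = 0
    for a in bag_a:
        if any(SequenceMatcher(None, a, b).ratio() >= threshold for b in bag_b):
            hits += 1
    return hits

def best_match(local_term, central_terms):
    """Return the central term whose token bag overlaps most with the local term."""
    local = tokens(local_term)
    scored = [(fuzzy_overlap(local, tokens(c)) / max(len(local), 1), c) for c in central_terms]
    score, term = max(scored)
    return term if score > 0 else None

# Maps the local biobank term to the central terminology entry despite spelling variants
print(best_match("Tumour localisation", ["Tumor localization", "Date of diagnosis"]))
```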

9 citations


Journal ArticleDOI
TL;DR: The authors' systematic UMLS-based analysis revealed the existence of a core data set; an exemplary reusable implementation for harmonized data capture is available on an established metadata repository.
Abstract: Background: For cancer domains such as acute myeloid leukemia (AML), a large set of data elements is obtained from different institutions with heterogeneous data definitions within one patient course. The lack of clinical data harmonization impedes cross-institutional electronic data exchange and future meta-analyses. Objective: This study aimed to identify and harmonize a semantic core of common data elements (CDEs) in clinical routine and research documentation, based on a systematic metadata analysis of existing documentation models. Methods: Lists of relevant data items were collected and reviewed by hematologists from two university hospitals regarding routine documentation and several case report forms of clinical trials for AML. In addition, existing registries and international recommendations were included. Data items were coded to medical concepts via the Unified Medical Language System (UMLS) by a physician and reviewed by another physician. On the basis of the coded concepts, the data sources were analyzed for concept overlaps and identification of most frequent concepts. The most frequent concepts were then implemented as data elements in the standardized format of the Operational Data Model by the Clinical Data Interchange Standards Consortium. Results: A total of 3265 medical concepts were identified, of which 1414 were unique. Among the 1414 unique medical concepts, the 50 most frequent ones cover 26.98% of all concept occurrences within the collected AML documentation. The top 100 concepts represent 39.48% of all concepts’ occurrences. Implementation of CDEs is available on a European research infrastructure and can be downloaded in different formats for reuse in different electronic data capture systems. Conclusions: Information management is a complex process for research-intensive disease entities such as AML, which is associated with a large set of lab-based diagnostics and different treatment options. Our systematic UMLS-based analysis revealed the existence of a core data set, and an exemplary reusable implementation for harmonized data capture is available on an established metadata repository.
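
The coverage figures reported here can be reproduced in spirit with a simple frequency count; a sketch assuming the coded concepts are available as a flat list of UMLS CUIs (the CUIs shown are placeholders, not data from the study):

```python
from collections import Counter

def coverage_of_top_concepts(coded_concepts, top_n):
    """Share of all concept occurrences covered by the top_n most frequent concepts."""
    counts = Counter(coded_concepts)
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total

# Placeholder CUIs standing in for the coded concept occurrences of the study
occurrences = ["C0023418", "C0023418", "C0012345", "C0023418", "C0054321"]
print(f"Top-1 concept covers {coverage_of_top_concepts(occurrences, 1):.2%} of occurrences")
```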

7 citations


Journal ArticleDOI
TL;DR: The developed framework provides an interactive and lightweight visualization of high-resolution 3D models in a web browser based on 3D Heritage Online Presenter and Three.js library, implemented on top of WebGL API.
Abstract: In the last decade 3D datasets of the Cultural Heritage field have become extremely rich and highly detailed due to the evolution of the technologies they derive from. However, their online deployment, both for scientific and general public purposes, is usually deficient in user interaction and multimedia integration. A single solution that efficiently addresses these issues is presented in this paper. The developed framework provides an interactive and lightweight visualization of high-resolution 3D models in a web browser. It is based on the 3D Heritage Online Presenter (3DHOP) and the Three.js library, implemented on top of the WebGL API. 3DHOP capabilities are fully exploited and enhanced with new, high-level functionalities. The approach is especially suited to complex geometry and it is adapted to archaeological and architectural environments. Thus, the multi-dimensional documentation of the archaeological site of Meteora, in central Greece, is chosen as the case study. Various navigation paradigms are implemented and the data structure is enriched with the incorporation of multiple 3D model viewers. Furthermore, a metadata repository, comprising ortho-images, photographic documentation, video and text, is accessed straightforwardly through the inspection of the main 3D scene of Meteora by a system of interconnections.

5 citations


Patent
02 Jul 2019
TL;DR: In this paper, a tag metadata database module stores tag metadata received over a network connection in a tag metadata database and retrieves tag metadata in response to requests received over the network and from within the historian system.
Abstract: A historian system enables the creation, storage, and retrieval of extended metadata properties. A tag metadata database module of the historian system stores tag metadata received over a network connection in a tag metadata database and retrieves tag metadata in response to requests received over the network and from within the historian system. An extended property database module creates extended properties associated with a tag metadata instance in response to requests, stores the created extended properties, and retrieves the stored extended properties in response to requests. The extended property search index module indexes extended properties as they are created, searches the indexed extended properties in response to requests, and provides the indexes of extended properties to enable location of the extended properties in the extended property database.
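
A toy, in-memory analogue of the extended-property database and search index described above; the class and method names are hypothetical and stand in for the patent's modules:

```python
from collections import defaultdict

class ExtendedPropertyStore:
    """Toy in-memory sketch of an extended-property database plus search index."""

    def __init__(self):
        self.properties = defaultdict(dict)   # tag name -> {property name: value}
        self.index = defaultdict(set)         # (property name, value) -> set of tag names

    def create(self, tag, name, value):
        """Create an extended property for a tag and index it as it is created."""
        self.properties[tag][name] = value
        self.index[(name, value)].add(tag)

    def get(self, tag):
        """Retrieve all extended properties stored for a tag."""
        return dict(self.properties.get(tag, {}))

    def search(self, name, value):
        """Locate tags whose extended property matches the given name/value pair."""
        return sorted(self.index.get((name, value), set()))

store = ExtendedPropertyStore()
store.create("Reactor1.Temperature", "unit", "degC")
store.create("Reactor2.Temperature", "unit", "degC")
print(store.search("unit", "degC"))   # ['Reactor1.Temperature', 'Reactor2.Temperature']
```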

4 citations


Journal ArticleDOI
TL;DR: An enhanced composition of standardized RDF statements for detailed provenance representation is proposed for collaborative metadata development, and an algorithm that extracts and translates provenance data from the repository into the proposed RDF statements is developed.
Abstract: The German Center for Lung Research (DZL) is a research network with the aim of researching respiratory diseases. In order to enable consortium-wide retrospective research and prospective patient recruitment, we perform data integration into a central data warehouse. The enhancement of the underlying ontology is an ongoing process for which we developed the Collaborative Metadata Repository (CoMetaR) tool. Its technical infrastructure is based on the Resource Description Framework (RDF) for ontology representation and the distributed version control system Git for storage and versioning. Ontology development involves a considerable amount of data curation. Data provenance improves its feasibility and quality. Especially in collaborative metadata development, a comprehensive annotation about "who contributed what, when and why" is essential. Although RDF and Git versioning repositories are commonly used, no existing solution captures metadata provenance information in sufficient detail. We propose an enhanced composition of standardized RDF statements for detailed provenance representation. Additionally, we developed an algorithm that extracts and translates provenance data from the repository into the proposed RDF statements.
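
The general idea of expressing a Git commit as provenance statements can be sketched with rdflib and the W3C PROV-O vocabulary; this does not reproduce the paper's enhanced statement composition, and the namespace and example values are placeholders:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/cometar/")   # placeholder namespace

def commit_to_prov(commit_id, author, message, timestamp, changed_concept_uri):
    """Translate one repository commit into PROV-O statements (illustrative subset only)."""
    g = Graph()
    g.bind("prov", PROV)
    activity = EX[f"commit/{commit_id}"]
    agent = EX[f"agent/{author}"]
    entity = URIRef(changed_concept_uri)
    g.add((activity, RDF.type, PROV.Activity))            # the commit as an activity
    g.add((agent, RDF.type, PROV.Agent))                  # the committer as an agent
    g.add((entity, RDF.type, PROV.Entity))                # the edited concept as an entity
    g.add((activity, PROV.wasAssociatedWith, agent))      # who contributed
    g.add((entity, PROV.wasGeneratedBy, activity))        # what was contributed
    g.add((activity, PROV.endedAtTime,
           Literal(timestamp, datatype=XSD.dateTime)))    # when
    g.add((activity, RDFS.label, Literal(message)))       # why (the commit message)
    return g

g = commit_to_prov("a1b2c3d", "jdoe", "Add dyspnoea severity concept",
                   "2019-06-01T12:00:00", "https://example.org/cometar/concept/dyspnoea")
print(g.serialize(format="turtle"))
```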

3 citations


Patent
08 Oct 2019
TL;DR: In this article, a source code file is determined to include source code with first comment text having an associated first digital signature; after that signature is authenticated, the file is provided for display in the IDE, and when received input data includes the authoring of new comment text, comment metadata and a second digital signature associated with that text are automatically provided and stored in a comment metadata repository.
Abstract: Methods, systems, and computer-readable storage media for receiving a request to open a source code file for editing within an integrated development environment (IDE), determining that the source code file includes source code with first comment text having a first digital signature associated therewith, authenticating the first digital signature, and in response, providing the source code file for display in the IDE, receiving input data, determining that the input data includes authoring of comment text within the source code file, and in response, automatically: providing comment metadata that is associated with the comment text and providing a second digital signature that is associated with the comment text, and storing the comment text, the comment metadata, and the second digital signature in a comment metadata repository.
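
The sign-and-store flow can be illustrated with a short sketch; here HMAC-SHA256 merely stands in for whatever signature scheme the patent envisions, and the key, function and field names are hypothetical:

```python
import hashlib
import hmac
import time

SIGNING_KEY = b"dev-only-secret"   # placeholder key; a real system would use per-author keys

def sign(text):
    """Produce a signature (here: HMAC-SHA256) over the comment text."""
    return hmac.new(SIGNING_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()

def store_comment(repository, file_path, line, comment_text, author):
    """Store comment text, its metadata and its signature in a comment metadata repository."""
    entry = {
        "file": file_path,
        "line": line,
        "author": author,
        "created": time.time(),
        "comment": comment_text,
        "signature": sign(comment_text),
    }
    repository.append(entry)
    return entry

def verify_comment(entry):
    """Authenticate a stored comment by recomputing its signature."""
    return hmac.compare_digest(entry["signature"], sign(entry["comment"]))

repo = []
entry = store_comment(repo, "src/app.py", 42, "TODO: validate input", "alice")
print(verify_comment(entry))   # True as long as the comment text is unchanged
```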

3 citations


Journal ArticleDOI
TL;DR: The paper reports the findings of a study that investigated the distribution of date elements in the metadata aggregated in the Polish Federation of Digital Libraries and related it to the types of libraries.
Abstract: Large-scale distributed digital library systems with aggregated metadata provide platforms for resource discovery and retrieval. For researchers, aggregated metadata offers a potential for big data...

3 citations


Journal ArticleDOI
TL;DR: Comparisons indicate improved performance of up to 4% in terms of precision, up to 18% in terms of recall and up to 11% in F1 score, demonstrating the effectiveness of the proposed CRAFT model.
Abstract: Aspect-based opinion mining aims to provide results that aid in effective business decision making. Identifying the aspects and their major and minor causes proves to be the major challenge in this domain. This paper presents a cause-related aspect formulation technique (CRAFT) to perform opinion mining. The CRAFT model incorporates an enhanced aspect extraction module; ontology creation based on aspects and aspect categories; creation and maintenance of an aspect and aspect-category metadata repository; and a decision tree-based parallelized boosted ensemble. The proposed CRAFT model is implemented in Spark to incorporate parallelism in the architecture. The processes of ontology creation and metadata repository creation aid in the effective identification of both implicit and explicit aspects. Experiments were conducted using a customer review benchmark dataset incorporating reviews about five varied products. Comparisons were performed with the state-of-the-art models CNN+LP, Popscu and TF-RBM. Comparisons indicate improved performance of up to 4% in terms of precision, up to 18% in terms of recall and up to 11% in F1 score, indicating the effectiveness of the proposed CRAFT model.

1 citations


Journal ArticleDOI
TL;DR: A software tool is proposed that builds on existing data integration infrastructures and provides a visually supported validation routine for data integration rules, enabling data providers to understand the rules regarding their own data by presenting the rules and available context visually.
Abstract: Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. In medical informatics, such a unified view enables retrospective analyses based on more facts and prospective recruitment of more patients than any single data collection by itself. The technical part of data integration is based on rules interpreted by software. These rules define how to perform the translation of source database schemata into the target database schema. Translation rules are formulated by data managers, who usually do not have the knowledge about the meaning and acquisition methods of the data they handle. The professionals (data providers) who collect the source data and have this knowledge, in turn, usually lack a sufficient technical background. Since data providers are neither able to formulate the transformation rules themselves nor able to validate them, the whole process is fault-prone. Additionally, in the continuous development and maintenance of (meta-)data repositories, data structures are subject to changes, which may lead to outdated transformation rules. We did not find any technical solution that enables data providers to formulate transformation rules themselves or that provides an understandable reflection of given rules. Our approach is to enable data providers to understand the rules regarding their own data by presenting the rules and available context visually. Context information is fetched from a metadata repository. In this paper, we propose a software tool that builds on existing data integration infrastructures. The tool provides a visually supported validation routine for data integration rules. As a first step towards its evaluation, we integrate the tool into the DZL data integration process and verify the correct presentation of transformation rules.
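
A minimal sketch of the underlying idea of rendering a transformation rule in a provider-readable form using context fetched from a metadata repository; the rule format, element names and value codes are invented for illustration:

```python
# Hypothetical MDR context: data element id -> human-readable designation and value labels
MDR_CONTEXT = {
    "smoking_status": {
        "designation": "Smoking status",
        "values": {"1": "Current smoker", "2": "Former smoker", "3": "Never smoked"},
    }
}

# Hypothetical transformation rule: map source codes onto the target element's codes
RULE = {"source_field": "RAUCHER", "target_element": "smoking_status",
        "value_map": {"ja": "1", "frueher": "2", "nein": "3"}}

def describe_rule(rule, mdr):
    """Render a transformation rule as readable lines using MDR designations and value labels."""
    context = mdr[rule["target_element"]]
    lines = [f"Source field '{rule['source_field']}' fills '{context['designation']}':"]
    for source_value, target_code in rule["value_map"].items():
        label = context["values"].get(target_code, "unknown code")
        lines.append(f"  '{source_value}'  ->  {target_code} ({label})")
    return "\n".join(lines)

print(describe_rule(RULE, MDR_CONTEXT))
```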

01 Oct 2019
TL;DR: The status of the Euclid SGS software infrastructure, the prototypes developed and the continuous system integration and testing performed through the Euclid “SGS Challenges” are presented.
Abstract: The Science Ground Segment (SGS) of the Euclid mission provides distributed and redundant data storage and processing, federating nine Science Data Centres (SDCs) and a Science Operations Centre. The SGS reference architecture is based on loosely coupled systems and services, broadly organized into a common infrastructure of transverse software components and the scientific data Processing Functions. The SGS common infrastructure includes: 1) the Euclid Archive System (EAS), a central metadata repository which inventories, indexes and localizes the huge amount of distributed data; 2) a Distributed Storage System of EAS, providing a unified view of the SDCs storage systems and supporting several transfer protocols; 3) an Infrastructure Abstraction Layer, isolating the scientific data processing software from the underlying IT infrastructure and providing a common, lightweight workflow management system; 4) a Common Orchestration System, performing a balanced distribution of data and processing among the SDCs. Virtualization is another key element of the SGS infrastructure. We present the status of the Euclid SGS software infrastructure, the prototypes developed and the continuous system integration and testing performed through the Euclid “SGS Challenges”.

Proceedings ArticleDOI
25 Oct 2019
TL;DR: A method for developing a metadata repository, in terms of the metadata responsible for describing business objects and the relationships between them, is discussed; it allows data storage within the data warehouse to be organized using a metadata repository based on the multidimensional organization principle.
Abstract: When organizing automated data collection in a data warehouse under conditions of increasing data volume and a growing complexity of the enterprise business model, control of the information system's data model becomes one of the priority tasks. The article discusses a method for developing a metadata repository in terms of the metadata responsible for describing business objects and the relationships between them. The choice of "Data vault" determines the construction of a data warehouse within the framework of an information system based on the classical design approach with a three-level data presentation architecture, which includes a data preparation area (online data warehouse), the data warehouse itself and thematic data marts. The proposed approach allows data storage within the data warehouse to be organized using a metadata repository based on the multidimensional organization principle. The metadata repository is responsible for the data collection process, the data storage process and the presentation of data for analysis. It is presented in the form of a metamodel that is semantically related to the domain of the system, is easily reconstructed in case of changes in the business model of the domain, and allows data marts to be created with the structure of a multidimensional data model based on the star relational schema. This allows human-computer interaction to be organized when describing the metamodel, using mainly knowledge about the structure of the subject area. When describing the metamodel, the first-order predicate calculus language is used, which makes it possible to control the metamodel using a declarative programming style (the Prolog language). The key point in the structure of the information system is the way of transitioning from the "Data vault" model to a multidimensional data representation model based on associative rules of dependence between information objects.

Patent
19 Jul 2019
TL;DR: In this article, a capsule stores identification information such as a URL and a URN in the structure information of the metadata part thereof, in which identification information is to be stored, and a capsule engine unit decodes the identification information and, if a URL, directly acquires the entity of data or program constituting the content from a server that is an external storage.
Abstract: In the present invention, when encapsulating content and providing a user with the encapsulated content, a capsule stores identification information such as a URL or a URN in the structure information of the metadata part thereof, in which identification information is to be stored. A capsule engine unit decodes the identification information and, if it is a URL, directly acquires the entity of the data or program constituting the content from a server that is an external storage or, if it is a URN, first inquires of a server of a dictionary such as a metadata repository about the URL and acquires the entity from said server. Therefore, there is no need to install the entity of the data or program in a data cache unit, and it is possible to readily deliver or distribute a capsule. Furthermore, there is no need to install all of the software from the beginning in the information processing device of a content provider or user, and it is possible to start e-learning easily.

Book ChapterDOI
08 Jul 2019
TL;DR: This paper conducts a study on augmenting the current capabilities of intelligent urban mobility and road transport along the analytics dimension, focusing on data mining and big data analytics methodologies.
Abstract: This paper conducts a study on augmenting the current capabilities of intelligent urban mobility and road transport along the analytics dimension, focusing on data mining and big data analytics methodologies. A federated or a hybrid approach leverages the strengths and mitigates the weaknesses of both the data warehouse and big data analytics. We discuss the challenges, requirements, integrated models, components, scenarios and proposed solutions to the performance, efficiency, availability, security and privacy concerns in the context of smart cities. Our approach relies on several layers that run in parallel to collect and manage all collected data and to create several scenarios that will be used to assist urban mobility. The data warehouse and big data analytics can serve as means to support clustering, classification, recommender systems and frequent itemset mining. The challenge here is to populate the repository architecture with the schema, view definitions and metadata, and to specify and integrate the type of this architecture (centralized, distributed, federated or hybrid metadata repository).

Patent
31 Oct 2019
TL;DR: In this paper, a customer exposure management system is presented, which comprises: a first input configured to receive operational data from a data ecosystem; a second input configured to receive risk-derived data from a risk data source; a metadata repository; and a rule engine comprising a processor coupled to the first input, the second input and the metadata repository and configured to execute rules to merge the operational data and risk-derived data into a composite file on an account or customer basis, create one or more global attributes, identify an optimal income for the composite file, calculate a customer exposure strategy metric that defines an optimal exposure, select and apply a strategy implementing the global attributes, and execute a corresponding account action.
Abstract: The invention relates to a customer exposure management system. The system comprises: a first input configured to receive operational data from a data ecosystem; a second input configured to receive risk derived data from a risk data source; a metadata repository; and a rule engine comprising a processor coupled to the first input, the second input and metadata repository and further configured to execute rules to: merge the operational data and risk derived data to generate a composite file on an account or customer basis; create one or more global attributes; identify an optimal income for the composite file; calculate a customer exposure strategy metric that defines an optimal exposure; select a strategy from a plurality of strategies wherein the strategy implements the one or more global attributes; apply the selected strategy to the composite file; and execute a corresponding account action.

Patent
22 Aug 2019
TL;DR: In this article, the authors present system, method, and computer program product embodiments for an ETL (extract-transform-load) system, which operates by receiving, at a processor, a message including a request to move data from a source database to a target database.
Abstract: Disclosed herein are system, method, and computer program product embodiments for an ETL (extract-transform-load) system. An embodiment operates by receiving, at a processor, a message including a request to move data from a source database to a target database. The data is retrieved from the source database. One or more operations to perform on the data that convert the data from a source format associated with the source database to a target format associated with the target database are determined from the message. The one or more operations are executed on the data. The data is stored on the target database in the target format.
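
A toy sketch of this message-driven extract-transform-load flow; the message format, operation names and in-memory "databases" are invented for illustration and are not the patent's implementation:

```python
# Invented operation catalogue for the sketch
OPERATIONS = {
    "uppercase": lambda row, field: {**row, field: row[field].upper()},
    "rename":    lambda row, old, new: {new if k == old else k: v for k, v in row.items()},
}

def run_etl(message, source_db, target_db):
    """Extract rows from the source, apply the operations named in the message, load into target."""
    rows = source_db[message["source_table"]]                 # extract
    for op in message["operations"]:                          # transform
        name, args = op["name"], op.get("args", [])
        rows = [OPERATIONS[name](row, *args) for row in rows]
    target_db[message["target_table"]] = rows                 # load
    return rows

source = {"customers": [{"id": 1, "name": "ada"}]}
target = {}
print(run_etl({"source_table": "customers", "target_table": "dim_customer",
               "operations": [{"name": "uppercase", "args": ["name"]},
                              {"name": "rename", "args": ["name", "customer_name"]}]},
              source, target))
```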

Patent
05 Mar 2019
TL;DR: In this article, a problematic stage within the plurality of stages of a media delivery pipeline is identified based at least in part on analysis of the tracing metadata, which comprises a content identifier, a segment identifier, and a stage identifier.
Abstract: Methods, systems, and computer-readable media for monitoring of media pipeline health using tracing are disclosed. At a plurality of stages of a media delivery pipeline, tracing metadata is generated for elements of a media stream. The tracing metadata comprises a content identifier, a segment identifier, and a stage identifier. The tracing metadata is generated from the plurality of stages and sent to a metadata repository using instrumentation of components that implement the plurality of stages. A problematic stage within the plurality of stages is identified based at least in part on analysis of the tracing metadata.
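
A simplified sketch of how the three identifiers named here can be used to flag a problematic stage: a stage is suspected when segments stop producing tracing metadata at it. The record layout and stage names are assumptions for the sketch:

```python
from collections import defaultdict

def find_problematic_stage(tracing_records, expected_stages):
    """Return the stage at which the most segments stop producing tracing metadata."""
    seen = defaultdict(set)   # (content_id, segment_id) -> set of stages reached
    for rec in tracing_records:
        seen[(rec["content_id"], rec["segment_id"])].add(rec["stage_id"])
    suspects = defaultdict(int)
    for stages_reached in seen.values():
        for stage in expected_stages:
            if stage not in stages_reached:
                suspects[stage] += 1   # segment never reached this stage
                break
    return max(suspects, key=suspects.get) if suspects else None

records = [
    {"content_id": "movie1", "segment_id": 1, "stage_id": "ingest"},
    {"content_id": "movie1", "segment_id": 2, "stage_id": "ingest"},
    {"content_id": "movie1", "segment_id": 3, "stage_id": "ingest"},
    {"content_id": "movie1", "segment_id": 3, "stage_id": "transcode"},
]
print(find_problematic_stage(records, ["ingest", "transcode", "package"]))  # 'transcode'
```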

Patent
06 Jun 2019
TL;DR: In this paper, an abstraction engine is used to determine a data manipulation definition generated by a creator platform to access a database via a first connection protocol (e.g., ODBC).
Abstract: According to some embodiments, an abstraction engine may determine a data manipulation definition generated by a creator platform to access a database via a first connection protocol (e.g., ODBC). The abstraction engine may then automatically analyze the data manipulation definition to discover a connectivity parameter (e.g., a DSN, a DBMS type, a DBMS host name, a port, etc.) associated with the access to the database via the first connection protocol. The data manipulation definition may then be stored along with the connectivity parameter as a meta-connection in a metadata repository. A consuming platform may retrieve the meta-connection from the metadata repository and translate it into the data manipulation definition to access the database via a second connection protocol (e.g., JDBC).
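
A toy meta-connection flow under stated assumptions: the parameter names mirror those mentioned in the abstract (DSN, DBMS type, host, port), but the repository structure, connection-string format and translation function are invented for the sketch:

```python
# Invented in-memory metadata repository for the sketch
metadata_repository = {}

def store_meta_connection(name, definition, odbc_connection_string):
    """Discover connectivity parameters from an ODBC-style string and store them with the definition."""
    params = dict(part.split("=", 1) for part in odbc_connection_string.split(";") if part)
    metadata_repository[name] = {"definition": definition, "connectivity": params}

def to_jdbc_url(name):
    """Translate a stored meta-connection into a JDBC-style URL for a consuming platform."""
    conn = metadata_repository[name]["connectivity"]
    return f"jdbc:{conn['DRIVER'].lower()}://{conn['SERVER']}:{conn['PORT']}/{conn['DATABASE']}"

store_meta_connection(
    "sales_query",
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "DRIVER=PostgreSQL;SERVER=db.internal;PORT=5432;DATABASE=sales",
)
print(to_jdbc_url("sales_query"))   # jdbc:postgresql://db.internal:5432/sales
```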

Patent
28 Mar 2019
TL;DR: An electronic programme guide (EPG) media system (10) and a computer-implemented media provision method for an electronic programme guide (100) are described in this paper; the system comprises a data repository (20), a metadata repository (30), a media server (40) and a processor.
Abstract: An electronic programme guide, EPG, media system (10) and a computer-implemented media provision method for an electronic programme guide are disclosed. The system (10) comprises a data repository (20), a metadata repository (30), a media server (40) and a processor. The data repository (20) stores an encoded video file corresponding to each of a plurality of media assets, and the metadata repository (30) stores metadata on each media asset linked to the encoded video file corresponding to the respective media asset. In response to the system (10) receiving a metadata query from an EPG (100) of a remote user device, a media item matching the metadata query is determined from the metadata in the metadata repository (30), and a link to the corresponding encoded video file is generated and communicated to the EPG (100). The link is operable to cause the media server (40) to serve the encoded video file to the remote user device for display as part of the EPG (100).

Patent
31 Oct 2019
TL;DR: In this article, the authors present a method, device and computer program product for flushing metadata in a multi-core system, which comprises: moving a metadata identifier included in a sub-list of a first list to a corresponding sub list of a second list, the sublist of the first list and the corresponding sub-lists of the second list being associated with the same processing unit.
Abstract: Embodiments of the present disclosure provide a method, device and computer program product for flushing metadata in a multi-core system. The method comprises: moving a metadata identifier included in a sub-list of a first list to a corresponding sub-list of a second list, the sub-list of the first list and the corresponding sub-list of the second list being associated with the same processing unit; moving the metadata identifier from the corresponding sub-list of the second list to a third list based on a storage position of the metadata identifier; and determining metadata to be flushed from the third list to a metadata repository. By means of the method and device for flushing metadata proposed in the present disclosure, metadata synchronization contention can be reduced, IO efficiency can be improved, response time can be decreased, and the cache hit rate can be increased.
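
A deliberately simplified, single-threaded sketch of the list structure described above (per-core sub-lists, a staging list and a global list ordered by storage position); the class and method names are invented and no synchronization is modelled:

```python
from collections import defaultdict

class FlushLists:
    """Toy, single-threaded sketch of the per-core flush-list structure."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.first = defaultdict(list)    # per-core sub-lists of dirty metadata ids
        self.second = defaultdict(list)   # per-core sub-lists staged for flushing
        self.third = []                   # global list ordered by storage position

    def mark_dirty(self, core, metadata_id):
        """Record a dirty metadata identifier on the sub-list of the owning core."""
        self.first[core].append(metadata_id)

    def stage(self, core):
        """Move a core's dirty identifiers to its corresponding second-list sub-list."""
        self.second[core].extend(self.first.pop(core, []))

    def collect(self, storage_position):
        """Move staged identifiers to the third list, sorted by storage position for flushing."""
        for ids in self.second.values():
            self.third.extend(ids)
        self.second.clear()
        self.third.sort(key=storage_position)
        return list(self.third)   # metadata to be flushed to the metadata repository

lists = FlushLists(num_cores=2)
lists.mark_dirty(0, "meta-7")
lists.mark_dirty(1, "meta-3")
lists.stage(0); lists.stage(1)
print(lists.collect(storage_position=lambda mid: int(mid.split("-")[1])))  # ['meta-3', 'meta-7']
```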