
Showing papers on "Metadata repository" published in 2023



Journal ArticleDOI
TL;DR: ProMetaS as discussed by the authors is a Process Engineering/Industry Metadata Schema that defines various metadata categories adhering to available metadata standards; its implementation into the KEEN data platform is also described.
Abstract: Data collection in process industry yields a variety of data types. To maintain the knowledge about the data and to enable finding and reusing them (FAIR data), a common description with metadata is necessary. In the project KEEN, ProMetaS – the Process Engineering/Industry Metadata Schema – was developed. It defines various metadata categories adhering to available metadata standards. In this article, ProMetaS, its implementation into the KEEN data platform, and its application in two use cases are described.

2 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of the current state of the art in big data, cloud native architectures and streaming platforms, which includes DevOps, CI/CD, DataOps and data fabric.
Abstract: Implementations of metadata tend to favor centralized, static metadata. This depiction is at variance with the past decade of focus on big data, cloud native architectures and streaming platforms. Big data velocity can demand a correspondingly dynamic view of metadata. These trends, which include DevOps, CI/CD, DataOps and data fabric, are surveyed. Several specific cloud native tools are reviewed and weaknesses in their current metadata use are identified. Implementations are suggested which better exploit capabilities of streaming platform paradigms, in which metadata is continuously collected in dynamic contexts. Future cloud native software features are identified which could enable streamed metadata to power real time data fusion or fine tune automated reasoning through real time ontology updates.
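The dynamic, continuously collected metadata that the survey argues for can be sketched minimally in Python. The topic name, event fields, and the publish stand-in below are hypothetical; a real deployment would hand the event to a streaming-platform producer (for example a Kafka client) rather than printing it.

```python
import json
import time
from datetime import datetime, timezone

def publish(topic: str, event: dict) -> None:
    """Stand-in for a streaming-platform producer; here we just print the serialized event."""
    print(topic, json.dumps(event))

def emit_metadata(dataset_id: str, record_count: int, schema_version: str) -> None:
    """Emit one dynamic metadata event describing the current state of a dataset."""
    event = {
        "dataset_id": dataset_id,
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "record_count": record_count,
        "schema_version": schema_version,
    }
    publish("metadata-events", event)

if __name__ == "__main__":
    # Simulate metadata collected continuously alongside the data stream.
    for batch in range(3):
        emit_metadata("sensor-feed-42", record_count=1000 * (batch + 1), schema_version="1.0")
        time.sleep(0.1)
```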

1 citation


Journal ArticleDOI
TL;DR: This editorial introduces IASSIST Quarterly 47(1), an issue with extra emphasis on the FAIR principles (Findable, Accessible, Interoperable, and Reusable), featuring articles on a shared vocabulary for cross-domain conversations on data and metadata that draw on the Data Documentation Initiative's Cross Domain Integration (DDI-CDI).
Abstract: Welcome to the first issue of IASSIST Quarterly for the year 2023 - IQ vol. 47(1). The last article in this issue has the FAIR acronym in its title, which stands for Findable, Accessible, Interoperable, and Reusable. These are the concepts most often focused on by our articles in the IQ, and FAIR has extra emphasis in this issue. The first article introduces and demonstrates a shared vocabulary for data points, where the need arose after confusion about data and metadata. Basically, I find that the most valuable virtue of well-structured data – I deliberately use a fuzzy term to save you from long excursions here in the editor's notes – is that other well-structured data can benefit from use of the same software. Similarly, well-structured metadata can benefit from the same software. I also see this as the driver for the second article, on time series data and description. Sometimes the software mentioned is the same in both instances, as metadata is treated as data or vice versa. This allows for new levels of data-driven machine actions. These days universities are busy investigating and discussing the latest chatbots. I find many of the approaches restrictive and prefer to support the inclusive ones. Likewise, I also expect and look forward to bots having great relevance for the future implementation of FAIR principles. The first article, on data and metadata, is by George Alter, Flavio Rizzolo, and Kathi Schleidt and has the title ‘View points on data points: A shared vocabulary for cross-domain conversations on data and metadata’. The authors have observed that sharing data across scientific domains is often impeded by differences in the language used to describe data and metadata. To avoid confusion, the authors develop a terminology. Part of the confusion concerns disagreement about the boundaries between data and metadata, and the fact that what is metadata in one domain can be data in another. The shift between data and metadata is what they name ‘semantic transposition’. I find that such shifts are a virtue and a strength, and, as the authors say, there is no fixed boundary between data and metadata; both can be acted upon by people and machines. The article draws on and refers to many other standards and developments, the most cited being the data model of Observations and Measurements (ISO 19156) and the tools of the Data Documentation Initiative’s Cross Domain Integration (DDI-CDI). The article is thorough and explanatory, with many examples and diagrams for learning, including examples of transformations between the formats: wide, long, and multidimensional. The long format of entity-attribute-value has the value domain restricted by the attribute, and in the examples time and source are added, which demonstrates how further metadata enter the format. When transposed to the wide format, the result is the more familiar data matrix, where the same value domain applies to a complete column. The multidimensional format with facets is, for most readers, the familiar aggregations published by statistical agencies. The authors argue that their domain-independent vocabulary enables cross-domain conversation. George Alter is Research Professor Emeritus in the Institute for Social Research at the University of Michigan; Flavio Rizzolo is Senior Data Science Architect for Statistics Canada; and Kathi Schleidt is a data scientist and the founder of DataCove.
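As an aside to the formats discussed above, the transposition from the long (entity-attribute-value) format to the wide format can be illustrated with a small pandas sketch; the entities, attributes, and values are invented for illustration and are not taken from the article.

```python
import pandas as pd

# Long (entity-attribute-value) format: the value domain is restricted by the attribute,
# and extra columns such as time and source carry further metadata.
long_df = pd.DataFrame({
    "entity":    ["person1", "person1", "person2", "person2"],
    "attribute": ["age", "income", "age", "income"],
    "value":     [34, 52000, 29, 48000],
    "time":      ["2023", "2023", "2023", "2023"],
    "source":    ["survey A"] * 4,
})

# Transpose to the wide format: one column per attribute, so the same value
# domain now applies to a complete column.
wide_df = long_df.pivot(index="entity", columns="attribute", values="value")
print(wide_df)
```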
The format discussion in the first article is also the point of the second paper, ‘Modernizing data management at the US Bureau of Labor Statistics’. The US Bureau of Labor Statistics (BLS) has a focus on time series, and Daniel W. Gillman and Clayton Waring (both from the BLS) view time series data as a combination of three components: a measure element; an element for people, places, and things (PPT); and a time element. In the paper, Gillman and Waring also describe the conceptual model (UML) and the design and features of the system. First, they go back in history to the 1970s and the Codd relational model, and to the standards developed and refined after 2000. You will not be surprised to find the Data Documentation Initiative’s Cross Domain Integration (DDI-CDI) among the references here as well. The mission is ‘to find a simple and intuitive way to store and organize statistical data with the goal of making it easy to find and use the data’. A semantic approach is adopted, i.e. the focus is on the meaning of the data based upon the ‘Measures / People-Places-Things / Time’ model. Detailed examples show how PPT are categories of dimensions; for instance, ‘nurse’ is in the Standard Occupational Classification and 'hospital' in the North American Industry Classification System. The paper – like the first paper – also refers to multidimensional structures. The modernization described at BLS is expected to be released in early 2023.
The third paper is by João Aguiar Castro, Joana Rodrigues, Paula Mena Matos, Célia Sales, and Cristina Ribeiro, all affiliated with the University of Porto. Like the earlier articles, this one also references the Data Documentation Initiative (DDI), with a focus on the concepts behind the FAIR acronym: Findable, Accessible, Interoperable, and Reusable. The title is ‘Getting in touch with metadata: a DDI subset for FAIR metadata production in clinical psychology’. Clinical psychology is not an area frequently occurring in IASSIST Quarterly, but the project described started with interviews and data description sessions with research groups in the Social Sciences to identify a manageable DDI subset. The project also draws on other projects such as TAIL, TOGETHER, and Dendro. The TAIL project concerned the integration of metadata tools into the research workflow and assessed the requirements of researchers from different domains. TOGETHER was a project in the psycho-oncology domain on family-centered care for hereditary cancer. As most researchers turned out to be inexperienced with metadata, the authors concentrated on a DDI subset that would still make FAIR metadata available for deposit. Support for researchers is essential, as they have the domain expertise and can create highly detailed descriptions. On the other hand, data curators can ensure that the metadata follow the rules of FAIR. This was achieved by embedding the Dendro platform in the research workflow, where metadata is created through incremental description of the data. The article includes screenshots of the user interface showing the choice of vocabularies. The approach and the adoption of a DDI subset produced more comprehensive metadata than is usually available.
Submissions of papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations, or papers especially written for the IQ.
When you are preparing such a presentation, give a thought to turning your one-time presentation into a lasting contribution. Doing that after the event also gives you the opportunity to improve your work after feedback. We encourage you to login or create an author profile at https://www.iassistquarterly.com (our Open Journal System application). We permit authors to have 'deep links' into the IQ as well as to deposit the paper in your local repository. Chairing a conference session or workshop with the purpose of aggregating and integrating papers for a special issue of the IQ is also much appreciated, as the information reaches many more people than the limited number of session participants and will be readily available on the IASSIST Quarterly website at https://www.iassistquarterly.com. Authors are very welcome to take a look at the instructions and layout: https://www.iassistquarterly.com/index.php/iassist/about/submissions Authors can also contact me directly via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s), I will also be delighted to hear from you. Karsten Boye Rasmussen - March 2023

Book ChapterDOI
01 Feb 2023
TL;DR: As discussed by the authors, the cultural heritage community has become introspective, applying critical theory to metadata practices to uncover and attempt to address bias inherent in the descriptive process, while cultural heritage organizations experiment with reintroducing highly structured metadata in the form of linked data, examine the utility of specialized metadata formats for different communities, and apply both established and novel textual analysis tools to discovery systems.
Abstract: Descriptive metadata allows users to probe digital repositories and find relevant information. While today’s metadata formats have strong roots in historic classification, categorization, and description practices of libraries in the print world, the world wide web has allowed digital repositories to flourish using the metadata-based discovery of digital content. Old and new approaches mix as cultural heritage organizations experiment with reintroducing highly structured metadata in the form of linked data, examine the utility of specialized metadata formats for different communities, and apply both established and novel textual analysis tools to discovery systems. Metadata practices have matured to the extent that the cultural heritage community has become introspective, applying critical theory to metadata practices to uncover and attempt to address bias inherent in the descriptive process.

Posted ContentDOI
15 May 2023
TL;DR: The FAIR WISH sample description template as mentioned in this paper, developed in the project ‘FAIR Workflows to establish IGSN for Samples in the Helmholtz Association’, can be used to register IGSNs for physical samples.
Abstract: The International Generic Sample Number (IGSN) is a unique and persistent identifier for physical objects that was originally developed in the Geosciences. In 2022, after 10 years of service operation and more than 10 million registered samples worldwide, IGSN e.V. and DataCite have agreed on a strategic partnership. As a result, all IGSNs are now registered as DataCite DOIs and the IGSN metadata schema will be mapped to the DataCite Metadata Schema according to agreed guidelines. This will, on the one hand, enrich the very limited mandatory information shared by IGSN allocating agents so far. On the other hand, the DataCite metadata schema is not designed for the comprehensive description of physical objects and their provenance. The IGSN Metadata Schema is modular: the mandatory Registration Schema only includes information on the IGSN identifier, the minting agent and a date - complemented by the IGSN Description Schema (for data discovery) and additional extensions by the allocating agents to customise the sample description according to their sample’s subdomain. Within the project “FAIR Workflows to establish IGSN for Samples in the Helmholtz Association (FAIR WISH)”, funded by the Helmholtz Metadata Collaboration Platform (HMC), we (1) customised the GFZ-specific schema to describe water, soil and vegetation samples and (2) support the metadata collection by the individual researcher with a user-friendly, easy-to-use batch registration template in MS Excel. The information collected with the template can directly be converted to XML files (or JSON in the future) following the IGSN Metadata Schema that is required to generate IGSN landing pages. The template is also the source for the generation of DataCite metadata. The integration of linked data vocabularies (RDF, SKOS) in the metadata is an essential step in harmonising information across different research groups and institutions and important for the implementation of the FAIR Principles (Findable, Accessible, Interoperable, Reusable) for sample descriptions. More information on these controlled vocabularies can be found in the FAIR WISH D1 List of identified linked open data vocabularies to be included in IGSN metadata (https://doi.org/10.5281/zenodo.6787200). The template to register IGSNs for samples should ideally fit various sample types. In a first step, we created templates for samples from surface water and vegetation from AWI polar expeditions on land (AWI Use Case) and incorporated the two other FAIR WISH use cases with core material from the Ketzin coring site (Ketzin Use Case) and a wide range of marine biogeochemical samples (Hereon Use Case). The template comprises few mandatory and many optional variables to describe a sample, the sampling activity, the location and so on. Users can easily create their Excel template, including only the variables needed to describe a sample. A tutorial on how to use the FAIR WISH: Sample description template (https://doi.org/10.5281/zenodo.7520016) can be found in the FAIR WISH D3 Video Tutorial for the FAIR SAMPLES Template (https://doi.org/10.5281/zenodo.7381390). As our registration template is still a work in progress, we also welcome user feedback (https://doi.org/10.5281/zenodo.7377904). Here we will present the template and discuss its applicability for sample registration.
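A minimal sketch of the batch-template idea: rows collected in a spreadsheet are turned into per-sample metadata records (shown here as JSON; the actual workflow also produces XML following the IGSN Metadata Schema). The column and field names below are illustrative placeholders, not the template's actual variables.

```python
import json

# Illustrative rows as they might be exported from the Excel batch template;
# the variable names are placeholders, not the template's actual column headers.
rows = [
    {"sample_name": "AWI-2023-001", "sample_type": "surface water",
     "latitude": 78.92, "longitude": 11.93, "collection_date": "2023-05-15"},
    {"sample_name": "AWI-2023-002", "sample_type": "vegetation",
     "latitude": 78.90, "longitude": 11.95, "collection_date": "2023-05-16"},
]

def row_to_metadata(row: dict) -> dict:
    """Map one template row to a simple sample-description record."""
    return {
        "title": row["sample_name"],
        "materialType": row["sample_type"],
        "geoLocation": {"lat": row["latitude"], "lon": row["longitude"]},
        "collected": row["collection_date"],
    }

records = [row_to_metadata(r) for r in rows]
print(json.dumps(records, indent=2))
```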

Journal ArticleDOI
TL;DR: In this paper, the authors developed a self-service system that automatically extracts metadata from a data lake and enables business analysts to explore the metadata through an easy-to-use interface.
Abstract: Data catalogs represent a promising solution for semantically classifying and organizing data sources and enriching raw data with metadata. However, recent research has shown that data catalogs are difficult to implement due to the complexity of the data landscape or issues with data governance. Moreover, data catalogs struggle to enable business analysts to find the data they need for their use cases. Against this backdrop, we develop a self-service system that automatically extracts metadata from a data lake and enables business analysts to explore the metadata through an easy-to-use interface. Specifically, instead of implementing the data catalog top-down, our system derives metadata from user queries bottom-up. To this end, we conduct 15 interviews with business analysts to derive the underlying requirements of the system and evaluate its features with a focus group. Our findings illustrate that participants especially valued the possibility to reuse queries from other users and appreciated the support in query validation, as data preparation is a complex and time-consuming endeavour.
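The bottom-up idea — deriving catalog metadata from the queries analysts actually run — can be sketched as follows. The regular-expression parsing and the example queries are deliberate simplifications and assumptions; the paper does not describe its system at this level of detail.

```python
import re
from collections import Counter

# Example analyst queries; the table names are invented for illustration.
queries = [
    "SELECT customer_id, order_date FROM sales.orders WHERE order_date > '2023-01-01'",
    "SELECT product_id, SUM(quantity) FROM sales.order_items GROUP BY product_id",
    "SELECT customer_id, region FROM crm.customers",
]

def tables_used(query: str) -> list[str]:
    """Very rough extraction of table names referenced after FROM/JOIN."""
    return re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", query, flags=re.IGNORECASE)

usage = Counter(t for q in queries for t in tables_used(q))
# Usage counts become bottom-up metadata: which sources analysts actually touch.
for table, count in usage.most_common():
    print(f"{table}: referenced in {count} queries")
```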

Posted ContentDOI
15 May 2023
TL;DR: The AuScope 3D Geomodels Portal as mentioned in this paper is a website designed to display a variety of geological models and associated datasets and information from all over the Australian continent, imported from publicly available sources such as Australian government geological surveys and research organisations.
Abstract: The AuScope 3D Geomodels Portal is a website designed to display a variety of geological models and associated datasets and information from all over the Australian continent. The models are imported from publicly available sources, namely Australian government geological surveys and research organisations. Often the models come in the form of downloadable file packages designed to be viewed in specialised geological software applications. They usually contain enough information to view the model’s structural geometry and datasets and a minimal amount of geological textual information. Seldom do they contain substantial metadata; often they were created before the term ‘FAIR’ was coined or the importance of metadata had dawned upon many of us. This creates challenges for data providers and aggregators trying to maintain a certain standard of FAIR compliance across all their offerings. How can the standard of FAIR compliance of metadata extracted from these models be improved? How can these models be integrated into existing metadata infrastructure? For the Geomodels Portal, these concerns are addressed within the automated model transformation software. This software transforms the source file packages into a format suitable for display in a modern WebGL-compliant browser. Owing to the nature of the model source files, only a very modest amount of metadata can be extracted. Hence other sources of metadata must be introduced. For example, the dataset provider will often publish a downloadable PDF report file or a description on a web page associated with the model. Automated textual analysis is used to extract more information from these sources. At the end of the transformation process, an ISO-compliant metadata record is created for importing into a GeoNetwork catalogue. The GeoNetwork catalogue record can be used for integration with other applications. For example, AuScope’s flagship portal, the AuScope Portal, displays information, download links and a geospatial footprint of models on a map. The metadata can also be displayed in the Geomodels Portal.
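A rough sketch of generating a catalogue-ready XML record from the few fields the transformation can recover; the element names are a simplified stand-in and not the full ISO 19115/19139 structure that a GeoNetwork catalogue would ingest.

```python
import xml.etree.ElementTree as ET

# Fields recoverable from a model package plus an associated report (illustrative values).
model_meta = {
    "title": "Example 3D geological model",
    "abstract": "Structural model derived from a published survey report.",
    "west": 115.0, "east": 129.0, "south": -35.0, "north": -13.0,
}

record = ET.Element("metadata")  # simplified stand-in for an ISO-compliant record
ET.SubElement(record, "title").text = model_meta["title"]
ET.SubElement(record, "abstract").text = model_meta["abstract"]
extent = ET.SubElement(record, "geographicExtent")
for key in ("west", "east", "south", "north"):
    ET.SubElement(extent, key).text = str(model_meta[key])

print(ET.tostring(record, encoding="unicode"))
```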

Book ChapterDOI
01 Jan 2023
TL;DR: In this article, classification schemes are used to construct the metadata of a multidimensional information system; building a hierarchy of elements of a classification scheme requires identifying groups of members that have a semantic connection with groups of members of other dimensions.
Abstract: The task of determining the metadata of a multidimensional information system corresponds to the description of the parameters of cells that contain information about the facts included in the multidimensional data cube. Classification schemes can be used when constructing metadata. A classification scheme corresponds to a certain structural component of the observed phenomenon. The cell parameters are presented in the classification scheme in a hierarchical form and are combined in metadata when several classification schemes are connected. To construct a hierarchy of elements of the classification scheme, it is necessary to identify groups of members for which there is a semantic connection with groups of members of other dimensions. The Cartesian product can be applied to groups of members. As a result, clusters of member combinations will be formed in the metadata. The complete metadata structure can be achieved by combining all clusters. In the case of a large number of aspects of analysis, a multidimensional data cube has specific properties related to sparsity. The use of classification schemes makes it possible to identify parts of the metadata that correspond to individual structural components of the observed phenomenon. If a multidimensional data cube is constructed in the process of automated data collection, the “Data Vault” methodology can be used to describe the metadata. This methodology makes it possible to reflect the relationships between business objects in the metadata.
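The Cartesian-product step can be sketched directly with Python's itertools: groups of members from two classification schemes are combined into a cluster of member combinations, and the union of such clusters forms the cube's metadata. The dimensions and members are invented for illustration.

```python
from itertools import product

# Groups of members from two classification schemes (illustrative).
region_group = ["North", "South", "East"]
product_group = ["Food", "Clothing"]

# The Cartesian product of the groups yields a cluster of member combinations;
# the complete metadata structure is the union of all such clusters.
cluster = list(product(region_group, product_group))
for region, category in cluster:
    print(f"cell parameters: region={region}, category={category}")
```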

Journal ArticleDOI
TL;DR: In this article, a standardization framework for managing data in Data Lakes that combines the 5Vs Big Data characteristics and blueprint ontologies is presented, addressing the lack of a disciplined approach to collect, store and retrieve data to support predictive and prescriptive analytics.
Abstract: Smart processing of Big Data has recently emerged as a field that provides quite a few challenges related to how multiple heterogeneous data sources that produce massive amounts of structured, semi-structured and unstructured data may be handled. One solution to this problem is to manage this fusion of disparate data sources through Data Lakes. The latter, though, suffer from the lack of a disciplined approach to collect, store and retrieve data to support predictive and prescriptive analytics. This chapter tackles this challenge by introducing a novel standardization framework for managing data in Data Lakes that mainly combines the 5Vs Big Data characteristics and blueprint ontologies. It organizes a Data Lake using a ponds architecture and describes a metadata semantic enrichment mechanism that enables fast storage and efficient retrieval. The mechanism supports Visual Querying and offers increased security via Blockchain and Non-Fungible Tokens. The proposed approach is compared against other known metadata systems using a set of functional properties, with very encouraging results.

Book ChapterDOI
TL;DR: Macaroni as discussed by the authors is a metadata search engine with toolkits that support practitioners to obtain and enrich model metadata, which is useful for reporting, auditing, reproducibility, and interpretability.
Abstract: Machine learning (ML) researchers and practitioners are building repositories of pre-trained models, called model zoos. These model zoos contain metadata that detail various properties of the ML models and datasets, which are useful for reporting, auditing, reproducibility, and interpretability. Unfortunately, the existing metadata representations come with limited expressivity and lack of standardization. Meanwhile, an interoperable method to store and query model zoo metadata is missing. These two gaps hinder model search, reuse, comparison, and composition. In this demo paper, we advocate for standardized ML model metadata representation, proposing Macaroni, a metadata search engine with toolkits that support practitioners to obtain and enrich that metadata.
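The kind of metadata search that Macaroni targets can be illustrated with a toy in-memory example; the record fields and the filter interface are hypothetical and do not reflect Macaroni's actual schema or API.

```python
# Toy model-zoo metadata records (fields are illustrative, not Macaroni's schema).
model_zoo = [
    {"name": "resnet50-v1", "task": "image-classification", "dataset": "ImageNet", "accuracy": 0.76},
    {"name": "bert-base",   "task": "text-classification",  "dataset": "GLUE",     "accuracy": 0.82},
    {"name": "resnet18-v1", "task": "image-classification", "dataset": "ImageNet", "accuracy": 0.70},
]

def search(zoo: list[dict], **filters) -> list[dict]:
    """Return models whose metadata matches all given field/value filters."""
    return [m for m in zoo if all(m.get(k) == v for k, v in filters.items())]

for model in search(model_zoo, task="image-classification", dataset="ImageNet"):
    print(model["name"], model["accuracy"])
```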

Journal ArticleDOI
TL;DR: Metadata, as a type of data, describes content, provides context, documents transactions, and situates data. As discussed by the authors, interest in metadata has grown over the last several decades, motivated by the increase in digital information, open access, early data sharing policies, and interoperability goals.
Abstract: Metadata, as a type of data, describes content, provides context, documents transactions, and situates data. Interest in metadata has steadily grown over the last several decades, motivated initially by the increase in digital information, open access, early data sharing policies, and interoperability goals. This foundation has accelerated in more recent times, due to the increase in research data management policies and advances in AI. Specific to research data management, one of the larger factors has been the global adoption of the FAIR (findable, accessible, interoperable, and reusable) data principles [1, 2], which are highly metadata-driven. Additionally, researchers across nearly every domain are interested in leveraging metadata for machine learning and other AI applications. The accelerated interest in metadata expands across other communities as well. For example, industry seeks metadata to meet company goals; and users of information systems and social computing applications wish to know how their metadata is being used and demand greater control of who has access to their data and metadata. All of these developments underscore the fact that metadata is intelligent data, or what Riley has called value added data [3]. Overall, this intense and growing interest in metadata helps to frame the contributions included in this special issue of Data Intelligence.


Journal ArticleDOI
TL;DR: In this paper, the authors argue that disagreements over the boundary between data and metadata are a common source of confusion, propose a new terminology for describing how data are structured, and show how it can be applied to a variety of widely used data formats.
Abstract: Sharing data across scientific domains is often impeded by differences in the language used to describe data and metadata. We argue that disagreements over the boundary between data and metadata are a common source of confusion. Information appearing as data in one domain may be considered metadata in another domain, a process that we call “semantic transposition.” To promote greater understanding, we develop new terminology for describing how data and metadata are structured, and we show how it can be applied to a variety of widely used data formats. Our approach builds upon previous work, such as the Observations and Measurements (ISO 19156) data model. We rely on tools from the Data Documentation Initiative’s Cross Domain Integration (DDI-CDI) to illustrate how the same data can be represented in different ways, and how information considered data in one format can become metadata in another format.

Journal ArticleDOI
TL;DR: In this paper, a book metadata extraction system using image processing technology and OCR is proposed, which achieved an accuracy of 98.78% with an average detection time of 1.49 seconds and succeeded in presenting the extraction results on the website page.
Abstract: Extracting book metadata by retyping the identity of the book, such as the author's name, book title, publisher, and several other identifiers, is a routine that is carried out repeatedly at the Polewali Mandar district library. This activity takes much time, occupies several staff, and has much potential for input errors. Errors in extracting book metadata will result in errors in the book repository system database, making it difficult to find and use books or book data. This problem can be solved by creating a book metadata extraction system using image processing technology and OCR. This study aims to design a scanner technology to extract book metadata. Accuracy assessment is carried out in two stages: first, validation of the image extraction results using the ROC method, and second, validation by directly matching the extracted book metadata with the actual book. The results of this study indicate that the system works with an accuracy of 98.78% and an average detection time of 1.49 seconds, and successfully presents the extraction results on the website page. Thus, the metadata extraction system with the OCR method can be applied in libraries to input book data.
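A minimal sketch of the OCR step, assuming a pytesseract/Pillow toolchain (the paper does not name its OCR engine); the heuristics that pick out the title and ISBN are illustrative only.

```python
import re
from PIL import Image
import pytesseract  # assumes Tesseract is installed; the paper does not name its OCR engine

def extract_book_metadata(image_path: str) -> dict:
    """OCR a title-page image and pull out a few metadata fields with simple patterns."""
    text = pytesseract.image_to_string(Image.open(image_path))
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    isbn = re.search(r"ISBN[-: ]*([\d-]{10,17})", text)
    return {
        "title": lines[0] if lines else None,  # crude heuristic: first non-empty line
        "isbn": isbn.group(1) if isbn else None,
        "raw_text": text,
    }

if __name__ == "__main__":
    print(extract_book_metadata("title_page.jpg"))  # hypothetical scanned image
```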

Proceedings ArticleDOI
03 Feb 2023
TL;DR: In this article, the authors summarize several important links in metadata management technology for data warehouses and large-scale distributed file systems, compare various metadata management architecture technologies, and summarize metadata management characteristics and strategies.
Abstract: Metadata management plays an important role in enterprise information management. A complete metadata management system directly affects the flexibility and scalability of the platform. This paper summarizes several important links in metadata management technology in data warehouses and large-scale distributed file systems. It organizes the metadata management standards, compares various metadata management architecture technologies, and summarizes metadata management characteristics and strategies. At the same time, this paper introduces the applicability of the current mainstream open-source metadata management tools in detail, focuses on research into cognitive metadata directories based on machine learning, and outlines future research directions.

OtherDOI
28 Apr 2023
TL;DR: In this paper, the authors present an example of a Business Data Portfolio, which is meant to represent the reporting and analytic capabilities required or desired by business stakeholders, and which has benefits such as prioritizing data sources, having a clear understanding of current priorities, communicating what data is available for use, and promoting the re-use instead of re-creation of data assets.
Abstract: When organizations begin talking about Data Governance, it can quickly become a catch-all for any data-related issues. This chapter presents an example of a Business Data Portfolio. It is meant to represent the reporting and analytic capabilities required or desired by business stakeholders. The Portfolio has many benefits beyond Data Governance such as prioritizing data sources, having a clear understanding of current priorities, communicating what data is available for use, and promoting the re-use instead of re-creation of data assets. Metadata is information used to navigate through and understand the data landscape in an organization. Business users will find value in technical and operational metadata just as technical users find value in business metadata. Data quality can be broken down into four key categories: business definition, data element, data record, and data movement. Data profiling is important for every organization.

Journal ArticleDOI
TL;DR: In this paper, the authors examine the development of Dataverse, a global research data management consortium, focusing on data discoverability and current metadata implementation on the Dataverse portals established by 27 university libraries worldwide.

Posted ContentDOI
21 Feb 2023-bioRxiv
TL;DR: LISTER as mentioned in this paper is a methodological and algorithmic solution to disentangle the creation of metadata from ontology alignment and extract metadata from annotated template-based experiment documentation using minimum effort.
Abstract: The availability of scientific methods, code, and data is key for reproducing an experiment. Research data should be made available following the FAIR principle (findable, accessible, interoperable, and reusable). For that, the annotation of research data with metadata is central. However, existing research data management workflows often require that metadata should be created by the corresponding researchers, which takes effort and time. Here, we developed LISTER as a methodological and algorithmic solution to disentangle the creation of metadata from ontology alignment and extract metadata from annotated template-based experiment documentation using minimum effort. We focused on tailoring the integration between existing platforms by using eLabFTW as the electronic lab notebook and adopting the ISA (investigation, study, assay) model as the abstract data model framework; DSpace is used as a data cataloging platform. LISTER consists of three components: customized eLabFTW entries using specific hierarchies, templates, and tags; a ‘container’ concept in eLabFTW, making metadata of a particular container content extractable along with its underlying, related containers; a Python-based app to enable easy-to-use, semi-automated metadata extraction from eLabFTW entries. LISTER outputs metadata as machine-readable .json and human-readable .csv formats, and MM descriptions in .docx format that could be used in a thesis or manuscript. The metadata can be used as a basis to create or extend ontologies, which, when applied to the published research data, will significantly enhance its value due to a more complete and holistic understanding of the data, but might also enable scientists to identify new connections and insights in their field. We applied LISTER to the fields of computational biophysical chemistry as well as protein biochemistry and molecular biology, and our concept should be extendable to other life science areas.
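A rough sketch of the container idea: metadata of a parent entry is extracted together with that of its nested containers and written to both machine-readable JSON and human-readable CSV. The hierarchy and key names are invented; LISTER's actual eLabFTW integration and parsing rules are more involved.

```python
import csv
import json

# Invented container hierarchy standing in for annotated eLabFTW entries.
experiment = {
    "id": "exp-001",
    "metadata": {"project": "membrane simulation", "method": "MD"},
    "containers": [
        {"id": "exp-001/system", "metadata": {"force_field": "CHARMM36", "temperature_K": 310}},
        {"id": "exp-001/analysis", "metadata": {"tool": "MDAnalysis", "observable": "RMSD"}},
    ],
}

def flatten(entry: dict) -> list[dict]:
    """Collect metadata of an entry and all of its nested containers as flat records."""
    records = [{"id": entry["id"], **entry["metadata"]}]
    for child in entry.get("containers", []):
        records.extend(flatten(child))
    return records

records = flatten(experiment)
with open("metadata.json", "w") as f:
    json.dump(records, f, indent=2)              # machine-readable output
with open("metadata.csv", "w", newline="") as f:
    fields = sorted({k for r in records for k in r})
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)                    # human-readable output
```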

Book ChapterDOI
01 Jan 2023
TL;DR: In this chapter, the reader learns to perform several tasks related to metadata in video and audio data, covering information such as title, artist, album, subject, genre, year, copyright, producer, software creator, comments, lyrics, and even album art images.
Abstract: In this chapter, you will learn to perform several tasks related to metadata. Metadata means data about data. Multimedia metadata refers to information such as title, artist, album, subject, genre, year, copyright, producer, software creator, comments, lyrics, and even album art images that are used to describe the video and/or audio content.


Journal ArticleDOI
TL;DR: AdaM as discussed by the authors is an adaptive fine-grained metadata management scheme that trains an actor-critic network to migrate hot metadata nodes to different MDSs based on its observations of the current states.
Abstract: A major challenge confronting today’s distributed metadata management schemes is how to meet the dynamic requirements of various applications through effectively mapping and migrating metadata nodes to different metadata servers (MDS’s). Most of the existing works dynamically reallocate nodes to different servers adopting history-based coarse-grained methods, failing to make a timely and efficient update on the distribution of nodes. In this paper, we present the first deep reinforcement learning-leveraged distributed metadata management scheme, AdaM, to address the aforementioned dilemma. AdaM is an adaptive fine-grained metadata management scheme that trains an actor-critic network to migrate “hot” metadata nodes to different MDS’s based on its observations of the current “states” (i.e., access pattern, the structure of namespace tree and current distribution of nodes on MDS’s). Adaptive to varying access patterns, AdaM can automatically migrate hot metadata nodes among servers to keep load balancing while maintaining metadata locality. Besides, we propose a self-adaptive metadata cache policy, which dynamically combines the two strategies of managing caches on the server side and the client side to gain better query performance. Last but not least, we design a distributed metadata processing 2PC Protocol called MST-based 2PC to ensure data consistency. Experiments on a real-world dataset demonstrate the superiority of our proposed method over other schemes.
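AdaM's actor-critic policy is beyond a short sketch, so the example below substitutes a plain greedy heuristic that moves the hottest metadata node from the most loaded MDS to the least loaded one. It only illustrates the load-balancing objective, not the paper's reinforcement-learning method; the namespace and access counts are invented.

```python
# Access counts per metadata node on each metadata server (MDS); values are invented.
placement = {
    "mds0": {"/home": 900, "/home/alice": 850, "/tmp": 40},
    "mds1": {"/var": 120, "/var/log": 90},
}

def migrate_hottest(placement: dict) -> None:
    """Greedy rebalancing step: move the hottest node from the busiest MDS to the idlest."""
    load = {mds: sum(nodes.values()) for mds, nodes in placement.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    if src == dst:
        return
    hottest = max(placement[src], key=placement[src].get)
    placement[dst][hottest] = placement[src].pop(hottest)
    print(f"migrated {hottest} from {src} to {dst}")

migrate_hottest(placement)
print(placement)
```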

Journal ArticleDOI
TL;DR: The Experimental Data Connector (XDC) as discussed by the authors uses a single template Excel Workbook, which can be integrated into existing experimental workflow automation processes and semiautomated capture of results.
Abstract: Accelerating the development of synthetic biology applications requires reproducible experimental findings. Different standards and repositories exist to exchange experimental data and metadata. However, the associated software tools often do not support a uniform data capture, encoding, and exchange of information. A connection between digital repositories is required to prevent siloing and loss of information. To this end, we developed the Experimental Data Connector (XDC). It captures experimental data and related metadata by encoding it in standard formats and storing the converted data in digital repositories. Experimental data is then uploaded to Flapjack and the metadata to SynBioHub in a consistent manner linking these repositories. This produces complete connected experimental data sets that are exchangeable. The information is captured using a single template Excel Workbook, which can be integrated into existing experimental workflow automation processes and semiautomated capture of results.
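The single-template idea can be sketched as splitting one captured workbook into measurement rows (destined for a data repository such as Flapjack) and a linked metadata record (destined for a metadata repository such as SynBioHub). The sheet layout and field names are invented, and the actual uploads use those platforms' APIs, which are not mocked here.

```python
# Invented rows standing in for a filled-out template workbook.
workbook = {
    "metadata": {"experiment": "promoter screen", "strain": "E. coli DH5a", "instrument": "plate reader"},
    "measurements": [
        {"sample": "pJ23101", "time_h": 1.0, "fluorescence": 1520},
        {"sample": "pJ23101", "time_h": 2.0, "fluorescence": 2980},
    ],
}

def split_capture(wb: dict) -> tuple[dict, list[dict]]:
    """Separate the descriptive metadata from the measurement rows, keeping a shared key
    so the records in the two repositories stay linked."""
    meta = {**wb["metadata"], "record_id": "xdc-demo-001"}
    data = [{**row, "record_id": meta["record_id"]} for row in wb["measurements"]]
    return meta, data

metadata_record, data_rows = split_capture(workbook)
print(metadata_record)
print(data_rows)
```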

Journal ArticleDOI
TL;DR: In this paper, the authors design a data-catalog-based management system for the database of a dynamic thematic map system, which integrates structured and unstructured data stored in MongoDB and Postgres and organizes business, technical, and operational metadata.
Abstract: A dynamic thematic map system collects and stores a variety of extensive information from various types of sensors through fixed/mobile platforms. The main data consist of collection sources and object extractions, dynamic information by subject, movement context information, and a dynamic thematic map dataset. Since it contains various data, such as image information, sensing information, and analysis information, and requires various reference relationships, a management system that can integrate information while ensuring accessibility to the data is required. To do this, we identified data flows, owners, and key processes, and identified details of the data. Structured and unstructured data were included, and MongoDB and Postgres were used. We reviewed data catalogs as a way to manage this information. Since data catalogs are mainly applied to data lakes, they can accommodate both structured and unstructured data, and can even include business metadata, so they are very suitable for integrated management of the dynamic thematic map system's DB. The data catalog consists of business, technical, and operational metadata. The business domain supports the use of the same nomenclature and concepts from a business perspective by all users. Similar to existing metadata, the technical area provides information such as tables and columns. The operational area provides data utilization information such as reports and dashboards, and may include data lineage. In this study, the functions of the dynamic thematic map system's DB management are configured based on the data catalog. Users can search catalogs and view real data. A curation function and a DB standardization management function are also provided in an integrated way. The administrator monitors users' usage and manages the system. In this way, by using a data catalog, various types of data and information for several classes of users can be effectively managed. If the dynamic thematic map system adopts data catalog technology, it is expected to provide both efficient information management and user convenience.

Journal ArticleDOI
TL;DR: In this article, the authors consider potentially problematic metadata and how it affects the accessibility of digital visual archives, and explore the practical applications of AI-reliant tools to analyse a large corpus of photographs and create or enrich metadata.
Abstract: Discussing the current AHRC/LABEX-funded EyCon (Early Conflict Photography 1890-1918 and Visual AI) project, this article considers potentially problematic metadata and how it affects the accessibility of digital visual archives. The authors deliberate how metadata creation and enrichment could be improved through Artificial Intelligence (AI) tools and explore the practical applications of AI-reliant tools to analyse a large corpus of photographs and create or enrich metadata. The amount of visual data created by digitisation efforts is not always followed by the creation of contextual metadata, which is a major problem for archival institutions and their users, as metadata directly affects the accessibility of digitised records. Moreover, the scale of digitisation efforts means it is often beyond the scope of archivists and other record managers to individually assess problematic or sensitive images and their metadata. Additionally, existing metadata for photographic and visual records present issues in terms of outdated descriptions or inconsistent contextual information. As more attention is given to the creation of accessible digital content within archival institutions, we argue that too little is being given to the enrichment of record data. In this article, the authors ask how new tools can address incomplete or inaccurate metadata and improve the transparency and accessibility of digital visual records.

Proceedings ArticleDOI
01 Feb 2023
TL;DR: In this article, the authors identify a set of metadata fields/model that can be used to annotate declarative semantic and data mappings to facilitate their discovery, reusability, and quality assessment.
Abstract: A huge amount of data is produced each day through social networks like Facebook, Twitter, etc. Through semantic web technologies, data produced from these social networks can be shared, integrated, and used to gain knowledge. As social networks grow more heterogeneous, declarative data mapping can help address this issue. The use of declarative data mappings is a valuable method of describing the relationship between two different datasets. These mappings are created within the community, but without standard metadata, so that quality mappings cannot be shared, discovered, and reused properly. The ultimate purpose of our research is to identify a set of metadata fields/model that can be used to annotate declarative semantic and data mappings to facilitate their discovery, reusability, and quality assessment. As an early step in our research, 18 participants completed an online questionnaire between June and September 2022. The questionnaire targeted participants with varying levels of experience with semantic and data mappings. The survey results indicate that the model we have proposed for mapping metadata has received a positive response from the expert and intermediate-level participants. Expert participants have also suggested adding extra metadata fields, which are being considered for the second version of the model under development.

Journal ArticleDOI
TL;DR: In this article, the authors describe the implementation of discipline-specific metadata into a data repository to provide more contextual information about data and propose a workflow with standardised data templates for automated metadata extraction during the ingest process.
Abstract: Complex research problems are increasingly addressed by interdisciplinary, collaborative research projects generating large amounts of heterogeneous data. The overarching processing, analysis and availability of data are critical success factors for these research efforts. Data repositories enable long-term availability of such data for the scientific community. The findability and therefore reusability of datasets strongly build on comprehensive annotations of the data stored in repositories. Often generic metadata schemas are used to annotate data. In this publication we describe the implementation of discipline-specific metadata into a data repository to provide more contextual information about data. To avoid extra workload for researchers in providing such metadata, a workflow with standardised data templates for automated metadata extraction during the ingest process has been developed. The enriched metadata are subsequently used in the development of two repository plugins for data comparison and data visualisation. The added value of discipline-specific annotations and derived search features to support matching and reusable data is then demonstrated by use cases of two Collaborative Research Centres (CRC 1368 and CRC 1153).

Journal ArticleDOI
TL;DR: In this article, a general concept for extracting metadata and utilizing it in data analytics applications is proposed, which may help with system design in the future; it is prototypically implemented for the structural metadata of tabular data.
Abstract: Providing data for data analysis projects is one core task of automation technology; however, it still requires a lot of manual effort. One challenge is to keep the meaning of data interpretable within or across multiple software environments, so that the provider and the user of data share a common understanding of the transferred data. It is acknowledged that machine-interpretable metadata is one crucial building block for reaching this goal. However, in industrial automation and information systems today, exporting and utilizing data coupled with metadata is still not a common practice. Therefore, we propose a general concept for extracting metadata and utilizing it in data analytics applications, which may help with system design in the future. The concept is prototypically implemented for the structural metadata of tabular data.
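A minimal sketch of structural metadata for tabular data: column names and inferred value types extracted from a CSV so that a downstream analytics application can interpret the table without manual annotation. The sample table and the inference rules are illustrative assumptions, not the paper's prototype.

```python
import csv
import io

# Stand-in for an exported process-data table.
csv_text = """timestamp,temperature_C,valve_open
2023-05-01T10:00:00,71.3,true
2023-05-01T10:01:00,72.1,false
"""

def infer_type(values: list[str]) -> str:
    """Crude type inference over the observed column values."""
    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    try:
        [float(v) for v in values]
        return "number"
    except ValueError:
        return "string"

rows = list(csv.DictReader(io.StringIO(csv_text)))
structural_metadata = {col: infer_type([row[col] for row in rows]) for col in rows[0]}
print(structural_metadata)  # e.g. {'timestamp': 'string', 'temperature_C': 'number', 'valve_open': 'boolean'}
```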

Posted ContentDOI
15 May 2023
TL;DR: The HCDC datasearch portal as mentioned in this paper is an open-source software solution that combines data from a legacy database, file storage systems, OGC-conformant web services and a World Data Center.
Abstract: In Earth System Sciences, new data portals are currently being developed by what seems to be each new project and research initiative. But what happens to already existing solutions that are in dire need of a software update? We will introduce the HCDC datasearch portal (https://hcdc.hereon.de/datasearch/), an open-source software solution that combines data from a legacy database, file storage systems, OGC-conformant web services and a World Data Center. Our portal provides a common interface for all our heterogeneous data sources to select and download the data products based on filters for metadata and spatio-temporal information. Three legacy portal solutions at Helmholtz-Zentrum Hereon are replaced by a scalable and easily extendable new portal based on an Elasticsearch cluster in the back-end and a user-friendly web interface as well as a machine-readable API in the front-end. To ensure software that fits the users' workflows, a stakeholder group was involved from the early stages of planning up until the release of the final product. Extensibility of the portal is ensured by only storing metadata within the portal. Data access and download are configured based on each decentralized storage solution, e.g. a local database or a World Data Center. Harmonization of metadata is crucial for the user experience of the portal. We limited the searchable metadata to 14 fields in addition to geospatial and temporal metadata, including information such as the platform from which the data originates and the parameter that was measured. Whenever possible, controlled vocabularies were used. Due to the heterogeneity of the data, including climate model results as well as long-tail biogeochemical campaign data, this is an ongoing process. The HCDC datasearch portal provides an example of the challenges and opportunities of combining data from distributed data sources through a single entry point based on state-of-the-art web technologies. It can be used to discuss the challenges of re-using legacy solutions in a continually progressing research data infrastructure world.
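A sketch of how a small, harmonized set of searchable fields plus geospatial and temporal metadata might be expressed as an Elasticsearch index mapping; the field names are illustrative and are not the portal's actual 14 fields.

```python
import json

# Illustrative index mapping: a few harmonized text/keyword fields plus
# geospatial and temporal metadata (not the portal's actual field list).
mapping = {
    "mappings": {
        "properties": {
            "title":       {"type": "text"},
            "platform":    {"type": "keyword"},
            "parameter":   {"type": "keyword"},
            "institution": {"type": "keyword"},
            "location":    {"type": "geo_shape"},
            "time_start":  {"type": "date"},
            "time_end":    {"type": "date"},
        }
    }
}

# This document would be passed to the cluster's index-creation API
# (e.g. a PUT request creating the index with this body).
print(json.dumps(mapping, indent=2))
```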

Journal ArticleDOI
Ana Trisovic
TL;DR: This paper conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together and found that the majority of the clusters are formed by single type datasets, while in the rest of the sample, no meaningful clusters can be identified.
Abstract: Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample, no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting a scholarly record, and map existing resource types to my results. About 65% of the sample can be described with a single-type metadata (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as a Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described as a Replication resource metadata type. Such resource type would be particularly useful in facilitating research reproducibility.
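The clustering idea can be sketched by representing each dataset as a vector of file-type proportions and running k-means over those vectors; the file lists, the extension-based features, and the choice of k are invented for illustration and do not reproduce the study's method.

```python
from collections import Counter
from sklearn.cluster import KMeans

# Each dataset is described by the extensions of its deposited files (invented examples).
datasets = [
    ["data.csv", "codebook.pdf"],
    ["analysis.R", "results.csv", "paper.pdf"],
    ["survey.dta", "survey.csv"],
    ["model.py", "train.csv", "readme.md"],
]
extensions = sorted({f.rsplit(".", 1)[-1] for files in datasets for f in files})

def to_vector(files: list[str]) -> list[float]:
    """Proportion of each file extension within one dataset."""
    counts = Counter(f.rsplit(".", 1)[-1] for f in files)
    total = sum(counts.values())
    return [counts.get(ext, 0) / total for ext in extensions]

X = [to_vector(files) for files in datasets]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for files, label in zip(datasets, labels):
    print(label, files)
```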