
Showing papers on "Metadata repository" published in 2017


Journal ArticleDOI
01 Jun 2017
TL;DR: In this article, the authors provide information about the logistics of this network, including real-time applications of the collected data as well as information on the quality control protocols, the construction of the station data and metadata repository and the means through which the data are made available to users.
Abstract: During the last 10 years, the Institute for Environmental Research and Sustainable Development of the National Observatory of Athens has developed and operates a network of automated weather stations across Greece. The motivation behind the network development is the monitoring of weather conditions in Greece, with the aim of supporting not only research needs (weather monitoring and analysis, weather forecast skill evaluation) but also the needs of various communities in the production sector (agriculture, construction, leisure and tourism, etc.). By the end of 2016, 335 weather stations were in operation, providing real-time data at 10-min intervals. This paper provides information about the logistics of this network, including real-time applications of the collected data as well as information on the quality control protocols, the construction of the station data and metadata repository and the means through which the data are made available to users.

135 citations


Journal ArticleDOI
06 Nov 2017
TL;DR: A key element of this work is the definition of hierarchical metadata describing state-of-the-art electronic-structure calculations, which was agreed upon by two teams and is presented in this perspective paper.
Abstract: With big-data-driven materials research, the new paradigm of materials science, sharing and wide accessibility of data are becoming crucial aspects. Obviously, a prerequisite for data exchange and big-data analytics is standardization, which means using consistent and unique conventions for, e.g., units, zero base lines, and file formats. There are two main strategies to achieve this goal. One accepts the heterogeneous nature of the community, which comprises scientists from physics, chemistry, bio-physics, and materials science, by complying with the diverse ecosystem of computer codes, and thus develops “converters” for the input and output files of all important codes. These converters then translate the data of each code into a standardized, code-independent format. The other strategy is to provide standardized open libraries that code developers can adopt for shaping their inputs, outputs, and restart files directly into the same code-independent format. In this perspective paper, we present both strategies and argue that they can and should be regarded as complementary, if not synergistic. The presented format and conventions were agreed upon by two teams, the Electronic Structure Library (ESL) of the European Center for Atomic and Molecular Computations (CECAM) and the NOvel MAterials Discovery (NOMAD) Laboratory, a European Centre of Excellence (CoE). A key element of this work is the definition of hierarchical metadata describing state-of-the-art electronic-structure calculations.

91 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: SoMeta is presented, a scalable and decentralized metadata management approach for object-centric storage in HPC systems that provides a flat namespace that is dynamically partitioned, a tagging approach to manage metadata that can be efficiently searched and updated, and a light-weight and fault tolerant management strategy.
Abstract: Scientific data sets, which are growing rapidly in volume, are often accompanied by plentiful metadata, such as information about the associated experiment or simulation. Without effective metadata management, these data sets become difficult to utilize and their value is lost over time. Ideally, metadata should be managed along with its corresponding data by a single storage system, and should be directly accessible and updatable. However, existing storage systems in high-performance computing (HPC) environments, such as the Lustre parallel file system, still use a static metadata structure composed of a fixed, non-extensible set of attributes. The burden of metadata management therefore falls upon end users and requires ad hoc metadata management software to be developed. With the advent of "object-centric" storage systems, there is an opportunity to solve this issue. In this paper, we present SoMeta, a scalable and decentralized metadata management approach for object-centric storage in HPC systems. It provides a flat namespace that is dynamically partitioned, a tagging approach to manage metadata that can be efficiently searched and updated, and a light-weight and fault-tolerant management strategy. In our experiments, SoMeta achieves up to 3.7X speedup over Lustre in performing common metadata operations, and is up to 16X faster than SciDB and MongoDB for advanced metadata operations, such as adding and searching tags. Additionally, in contrast to existing storage systems, SoMeta offers scalable user-space metadata management by allowing users to specify the number of metadata servers depending on their workload.

35 citations
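To make the tagging approach concrete, the following toy sketch (hypothetical class and method names, not SoMeta's actual API) shows how free-form tags attached to objects in a flat namespace can be searched and updated.

```python
# Hypothetical sketch of tag-style object metadata (not SoMeta's actual API).
from collections import defaultdict

class MetadataStore:
    """Flat namespace of objects, each described by free-form key/value tags."""

    def __init__(self):
        self.tags = {}                    # object id -> {tag: value}
        self.index = defaultdict(set)     # (tag, value) -> object ids, for fast search

    def add_tags(self, obj_id, **tags):
        entry = self.tags.setdefault(obj_id, {})
        for tag, value in tags.items():
            entry[tag] = value
            self.index[(tag, value)].add(obj_id)

    def search(self, **tags):
        """Return object ids carrying all of the given tag/value pairs."""
        sets = [self.index.get(item, set()) for item in tags.items()]
        return set.intersection(*sets) if sets else set()

store = MetadataStore()
store.add_tags("run_042/output.h5", experiment="plasma", resolution="10km")
store.add_tags("run_043/output.h5", experiment="plasma", resolution="5km")
print(store.search(experiment="plasma", resolution="5km"))  # {'run_043/output.h5'}
```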


Book ChapterDOI
21 Oct 2017
TL;DR: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The CEDAR Workbench is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, fill in templates to generate high-quality metadata, and share and manage these resources.
Abstract: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed—the CEDAR Workbench—is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.

30 citations
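As an illustration of what ontology-enriched, JSON-LD-style metadata can look like, here is a minimal hypothetical record; the property IRI and the overall shape are invented for this sketch and are far simpler than CEDAR's actual template model.

```python
# A minimal, hypothetical JSON-LD metadata record annotated with an ontology term.
# Field names and the property IRI are illustrative only.
import json

record = {
    "@context": {
        "tissue": "http://example.org/schema/tissue",   # hypothetical property IRI
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    "tissue": {
        "@id": "http://purl.obolibrary.org/obo/UBERON_0002107",  # UBERON term for liver
        "rdfs:label": "liver",
    },
}

print(json.dumps(record, indent=2))  # ready to submit to a repository's REST API
```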


21 Aug 2017
TL;DR: This paper presents the WASABI project, started in 2017, which aims at the construction of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from the audio analysis.
Abstract: This paper presents the WASABI project, started in 2017, which aims at (1) the construction of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from the audio analysis, and (2) the development of semantic applications with high added value to exploit this semantic database. A preliminary version of the WASABI database is already online and will be enriched throughout the project. The main originality of this project is the collaboration between the algorithms that extract semantic metadata from the Web and from song lyrics and the algorithms that work on the audio. The following WebAudio-enhanced applications are planned as companions for the WASABI database and will be associated with each song: an online mixing table, guitar amp simulations with a virtual pedal-board, audio analysis visualization tools, annotation tools, and a similarity search tool that works by uploading audio extracts or playing a melody on a MIDI device.

24 citations


Proceedings ArticleDOI
01 Jun 2017
TL;DR: A new project-based multi-tenancy model for Hadoop is presented that provides a distributed database backend for the HDFS metadata layer and extends Hadoop's metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience.
Abstract: Hadoop is a popular system for storing, managing, and processing large volumes of data, but it has bare-bones internal support for metadata, as metadata is a bottleneck and less means more scalability. The result is a scalable platform with rudimentary access control that is neither user- nor developer-friendly. Also, metadata services that are built on Hadoop, such as SQL-on-Hadoop, access control, data provenance, and data governance are necessarily implemented as eventually consistent services, resulting in increased development effort and more brittle software. In this paper, we present a new project-based multi-tenancy model for Hadoop, built on a new distribution of Hadoop that provides a distributed database backend for the Hadoop Distributed Filesystem's (HDFS) metadata layer. We extend Hadoop's metadata model to introduce projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience. As our metadata service is backed by a transactional database, developers can easily extend metadata by adding new tables and ensure the strong consistency of extended metadata using both transactions and foreign keys.

24 citations
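The idea of extending metadata with new tables whose consistency is guaranteed by transactions and foreign keys can be pictured with a toy example. The sketch below uses SQLite purely for illustration (HopsFS relies on a distributed database), and the dataset_tags table is hypothetical.

```python
# Toy illustration (SQLite, not HopsFS's distributed database) of extending file
# metadata with a new table kept consistent via foreign keys and transactions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE dataset_tags (                 -- hypothetical extended metadata
        inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,
        tag      TEXT NOT NULL
    );
""")

with conn:  # one transaction: the file and its extended metadata commit together
    cur = conn.execute("INSERT INTO inodes (path) VALUES (?)",
                       ("/projects/demo/data.csv",))
    conn.execute("INSERT INTO dataset_tags (inode_id, tag) VALUES (?, ?)",
                 (cur.lastrowid, "genomics"))

# Deleting the inode removes its extended metadata too, so it never dangles.
conn.execute("DELETE FROM inodes WHERE path = ?", ("/projects/demo/data.csv",))
print(conn.execute("SELECT COUNT(*) FROM dataset_tags").fetchone()[0])  # 0
```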


Journal ArticleDOI
TL;DR: The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data inspection and cleaning, and that duplicates and errors among keys in the same cluster can easily be found.
Abstract: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide submitters regarding the metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. As a result, the algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Most keys in the 18 clusters were correctly assigned, but 13 keys were not related to the cluster they were placed in. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-score (0.63). Our algorithm identified keys that were similar to each other and grouped them together. The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues of data inspection and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.

22 citations
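A rough sketch of the clustering idea is shown below. It uses only a simple token-overlap name similarity and a greedy single-linkage threshold, whereas the paper combines name, core-concept and value similarities inside a scalable agglomerative algorithm.

```python
# Rough sketch: grouping metadata keys with a token-overlap name similarity and a
# greedy single-linkage threshold. The paper additionally uses core-concept and
# value similarities in a scalable agglomerative clustering algorithm.
def name_similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_keys(keys, threshold=0.5):
    clusters = []
    for key in keys:
        for cluster in clusters:
            if any(name_similarity(key, member) >= threshold for member in cluster):
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters

keys = ["age", "age (years)", "patient age", "tissue", "tissue type", "cell line"]
print(cluster_keys(keys))
# [['age', 'age (years)', 'patient age'], ['tissue', 'tissue type'], ['cell line']]
```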


Book
17 Jan 2017
TL;DR: This book presents a step-by-step guide to establishing and running a data curation service, covering receiving, appraising, processing, ingesting, describing, providing access to, preserving, and supporting the reuse of research data.
Abstract: Table of Contents:
Acknowledgments; Foreword
Preliminary Step 0: Establish Your Data Curation Service
Step 1.0: Receive the Data (1.1 Recruit Data for Your Curation Service; 1.2 Negotiate Deposit; 1.3 Transfer Rights (Deposit Agreements); 1.4 Facilitate Data Transfer; 1.5 Obtain Available Metadata and Documentation; 1.6 Receive Notification of Data Arrival)
Step 2.0: Appraisal and Selection Techniques that Mitigate Risks Inherent to Data (2.1 Appraisal; 2.2 Risk Factors for Data Repositories; 2.3 Inventory; 2.4 Selection; 2.5 Assign)
Step 3.0: Processing and Treatment Actions for Data (3.1 Secure the Files; 3.2 Create a Log of Actions Taken; 3.3 Inspect the File Names and Structure; 3.4 Open the Data Files; 3.5 Attempt to Understand and Use the Data; 3.6 Work with Author to Enhance the Submission; 3.7 Consider the File Formats; 3.8 File Arrangement and Description)
Step 4.0: Ingest and Store Data in Your Repository (4.1 Ingest the Files; 4.2 Store the Assets Securely; 4.3 Develop Trust in Your Digital Repository)
Step 5.0: Descriptive Metadata (5.1 Create and Apply Appropriate Metadata; 5.2 Consider Disciplinary Metadata Standards for Data)
Step 6.0: Access (6.1 Determine Appropriate Levels of Access; 6.2 Apply the Terms of Use and Any Relevant Licenses; 6.3 Contextualize the Data; 6.4 Increase Exposure and Discovery; 6.5 Apply Any Necessary Access Controls; 6.6 Ensure Persistent Access and Encourage Appropriate Citation; 6.7 Release Data for Access and Notify Author)
Step 7.0: Preservation of Data for the Long Term (7.1 Preservation Planning for Long-Term Reuse; 7.2 Monitor Preservation Needs and Take Action)
Step 8.0: Reuse (8.1 Monitor Data Reuse; 8.2 Collect Feedback about Data Reuse and Quality Issues; 8.3 Provide Ongoing Support as Long as Necessary; 8.4 Cease Data Curation)
Brief Concluding Remarks and a Call to Action
Bibliography; Biographies

20 citations


Proceedings Article
01 Jan 2017
TL;DR: A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata.
Abstract: In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.

19 citations
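The core idea of value recommendation from previously entered metadata can be sketched very simply: rank candidate values for a target field by how often they co-occur with the field/value pairs already filled in. The snippet below is a toy illustration of that idea, not the paper's actual recommendation framework.

```python
# Toy sketch of value recommendation from previously entered metadata records.
# Candidate values for a target field are ranked by how often they co-occur with
# the fields the user has already filled in. Not the paper's actual algorithm.
from collections import Counter

past_records = [
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "female"},
    {"organism": "Homo sapiens", "tissue": "liver", "sex": "male"},
    {"organism": "Mus musculus", "tissue": "brain", "sex": "male"},
]

def recommend(target_field, filled, records, top_n=3):
    counts = Counter(
        r[target_field]
        for r in records
        if target_field in r and all(r.get(k) == v for k, v in filled.items())
    )
    return counts.most_common(top_n)

print(recommend("tissue", {"organism": "Homo sapiens"}, past_records))
# [('liver', 2)]
```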


Journal ArticleDOI
TL;DR: The design and use of a metadata-driven data repository for research data management is described, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.
Abstract: The design and use of a metadata-driven data repository for research data management is described. Metadata is collected automatically during the submission process whenever possible and is registered with DataCite in accordance with their current metadata schema, in exchange for a persistent digital object identifier. Two examples of data preview are illustrated, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.

15 citations
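To give a sense of what can be assembled automatically at submission time, here is an approximate sketch of a DataCite-style metadata record. The field layout loosely follows the DataCite metadata kernel (creators, titles, publisher, publication year, resource type), and the publisher name is hypothetical; the current schema documentation should be consulted before registering real DOIs.

```python
# Approximate sketch of a DataCite-style metadata record assembled at submission
# time. The field layout loosely follows the DataCite kernel; check the current
# schema before registering a DOI for real.
import json

def build_datacite_metadata(submission):
    return {
        "creators": [{"name": n} for n in submission["authors"]],
        "titles": [{"title": submission["title"]}],
        "publisher": "Example University Research Data Repository",  # hypothetical
        "publicationYear": submission["year"],
        "types": {"resourceTypeGeneral": "Dataset"},
    }

submission = {"authors": ["Doe, Jane"],
              "title": "Raman spectra of test samples",
              "year": 2017}
print(json.dumps(build_datacite_metadata(submission), indent=2))
```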


Journal ArticleDOI
TL;DR: A new organization of NeuroMorpho.Org metadata grounded on a set of interconnected hierarchies focusing on the main dimensions of animal species, anatomical regions, and cell types is presented, explicitly resolving all ambiguities caused by synonymy and homonymy.
Abstract: Neuronal morphology is extremely diverse across and within animal species, developmental stages, brain regions, and cell types. This diversity is functionally important because neuronal structure strongly affects synaptic integration, spiking dynamics, and network connectivity. Digital reconstructions of axonal and dendritic arbors are thus essential to quantify and model information processing in the nervous system. NeuroMorpho.Org is an established repository containing tens of thousands of digitally reconstructed neurons shared by several hundred laboratories worldwide. Each neuron is annotated with specific metadata based on the published references and additional details provided by data owners. The number of represented metadata concepts has grown over the years in parallel with the increase of available data. Until now, however, the lack of standardized terminologies and of an adequately structured metadata schema limited the effectiveness of user searches. Here we present a new organization of NeuroMorpho.Org metadata grounded on a set of interconnected hierarchies focusing on the main dimensions of animal species, anatomical regions, and cell types. We have comprehensively mapped each metadata term in NeuroMorpho.Org to this formal ontology, explicitly resolving all ambiguities caused by synonymy and homonymy. Leveraging this consistent framework, we introduce OntoSearch, a powerful functionality that seamlessly enables retrieval of morphological data based on expert knowledge and logical inferences through an intuitive string-based user interface with auto-complete capability. In addition to returning the data directly matching the search criteria, OntoSearch also identifies a pool of possible hits by taking into consideration incomplete metadata annotation.
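The kind of logical inference OntoSearch performs can be pictured with a tiny hypothetical hierarchy: a query for a broad brain region also returns neurons annotated with its sub-regions, and synonyms resolve to a single canonical term. The terms, relations, and neuron IDs below are illustrative only.

```python
# Tiny hypothetical hierarchy illustrating ontology-aware search: a query for a
# broad region also returns neurons annotated with its sub-regions, and synonyms
# resolve to one canonical term. Terms and relations are illustrative only.
CHILDREN = {"hippocampus": ["CA1", "CA3", "dentate gyrus"],
            "CA1": [], "CA3": [], "dentate gyrus": []}
SYNONYMS = {"ammon's horn field ca1": "CA1"}
ANNOTATIONS = {"neuron_001": "CA1", "neuron_002": "dentate gyrus", "neuron_003": "CA3"}

def canonical(term):
    return SYNONYMS.get(term.lower(), term)

def descendants(term):
    found, stack = set(), [canonical(term)]
    while stack:
        t = stack.pop()
        found.add(t)
        stack.extend(CHILDREN.get(t, []))
    return found

def onto_search(term):
    wanted = descendants(term)
    return [nid for nid, region in ANNOTATIONS.items() if region in wanted]

print(onto_search("hippocampus"))              # all three neurons
print(onto_search("ammon's horn field ca1"))   # ['neuron_001'] via synonym resolution
```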

Journal ArticleDOI
TL;DR: This work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms, which has implications for both prospective and retrospective augmentation of metadata quality, geared towards making data easier to find and reuse.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work proposes a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models, and shows that this approach can lead to substantial performance gains with the simple addition of network and geographic features.
Abstract: Low-dimensional vector representations of social media users can benefit applications like recommendation systems and user attribute inference. Recent work has shown that user embeddings can be improved by combining different types of information, such as text and network data. We propose a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models. Experimenting with the task of friend recommendation on a dataset of 5,019 Twitter users, we show that our approach can lead to substantial performance gains with the simple addition of network and geographic features.
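A rough sketch of the general data-augmentation idea (not the authors' exact procedure) is to discretize non-textual features, such as network neighbours and coarse location, into pseudo-tokens appended to each user's text so that an off-the-shelf text-embedding model can consume them unchanged.

```python
# Rough sketch of the general idea: encode non-textual features (network
# neighbours, coarse location) as pseudo-tokens appended to each user's text so
# an off-the-shelf text-embedding model can use them. Not the authors' exact
# procedure; the token naming and grid size are arbitrary choices here.
def augment_user_document(text_tokens, friend_ids, lat, lon, grid=1.0):
    network_tokens = [f"__friend_{fid}" for fid in friend_ids]
    geo_token = [f"__geo_{int(lat // grid)}_{int(lon // grid)}"]
    return text_tokens + network_tokens + geo_token

doc = augment_user_document(
    ["coffee", "marathon", "training"],
    friend_ids=[17, 204],
    lat=40.7, lon=-74.0,
)
print(doc)
# ['coffee', 'marathon', 'training', '__friend_17', '__friend_204', '__geo_40_-74']
```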

Journal ArticleDOI
TL;DR: This article presents an automatic metadata extraction approach that creates metadata from optical data acquired by various satellite missions of scientific interest, based on an extended model of the ISO 19115 standard.
Abstract: Scientists as well as public institutions dealing with geospatial data often work with large amounts of heterogeneous data deriving from different sources. Without a well-defined, organized structure they face problems in finding and reusing existing data, and as a consequence this may cause data inconsistency and storage problems. A catalog system based on the metadata of spatial data facilitates the management of large amounts of data and offers services to retrieve, discover and exchange geographic data in a quick and easy fashion. Currently, most online catalogs focus on geographic data, and there has been little interest in cataloguing Earth observation data, for which the acquisition information also matters. This article presents an automatic metadata extraction approach that creates metadata from optical data acquired by various satellite missions of scientific interest (i.e. MODIS, LANDSAT, RapidEye, Suomi-NPP VIIRS, Sentinel-1A, Sentinel-2A), based on an extended model of the ISO 19115 standard. The XML schema ISO 19139-2, with the support for gridded and imagery information defined in ISO 19115-2, was examined, and based on the requirements of experts working in the research field of Earth observation the schema was extended. The XML schema ISO 19139-2 and its extension have been deployed as a new schema plugin in the spatial catalog GeoNetwork Open Source in order to store all relevant metadata about satellite data and the appropriate acquisition and processing information in an online catalog. A real-world scenario in productive use at the EURAC Institute for Applied Remote Sensing illustrates a workflow management for Earth observation data including data processing, metadata extraction, generation and distribution.
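A drastically simplified sketch of emitting such metadata is shown below; the element names are inspired by ISO 19115/19139 (gmd:MD_Metadata, gmd:fileIdentifier), but a real record contains many more elements and must validate against the schema, so treat this purely as an illustration.

```python
# Drastically simplified sketch of writing a metadata stub with element names
# inspired by ISO 19115/19139 (gmd:MD_Metadata, gmd:fileIdentifier,
# gco:CharacterString). Real records carry many more elements and must validate
# against the schema; this is illustrative only.
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)

def metadata_stub(scene):
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    ident = ET.SubElement(root, f"{{{GMD}}}fileIdentifier")
    ET.SubElement(ident, f"{{{GCO}}}CharacterString").text = scene["id"]
    date = ET.SubElement(root, f"{{{GMD}}}dateStamp")
    ET.SubElement(date, f"{{{GCO}}}Date").text = scene["acquired"]
    return ET.tostring(root, encoding="unicode")

print(metadata_stub({"id": "S2A_MSIL1C_20170601T101031", "acquired": "2017-06-01"}))
```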

Patent
14 Sep 2017
TL;DR: In this article, a computer-implemented method of managing data in a data repository is disclosed, which comprises maintaining a data repository that stores data imported from one or more data sources.
Abstract: A computer-implemented method of managing data in a data repository is disclosed. The method comprises maintaining a data repository, the data repository storing data imported from one or more data sources. A database entity added to the data repository is identified, and a metadata object for storing metadata relating to the database entity is created and stored in a metadata repository. The metadata object is also added to a documentation queue. Metadata for the metadata object is received from a user via a metadata management user interface, and the received metadata is stored in the metadata repository and associated with the metadata object.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: In two prototype implementations, object labels, gaze data from eye-tracking, and the corresponding video are embedded into a single multimedia container and visualized using a media player, to facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion.
Abstract: There is an ever-increasing number of video data sets that comprise additional metadata, such as object labels, tagged events, or gaze data. Unfortunately, metadata are usually stored in separate files in custom-made data formats, which reduces accessibility even for experts and makes the data inaccessible for non-experts. Consequently, we still lack interfaces for many common use cases, such as visualization, streaming, data analysis, machine learning, high-level understanding and semantic web integration. To bridge this gap, we want to promote the use of existing multimedia container formats to establish a standardized method of incorporating content and metadata. This will facilitate visualization in standard multimedia players, streaming via the Internet, and easy use without conversion, as shown in the attached demonstration video and files. In two prototype implementations, we embed object labels, gaze data from eye-tracking and the corresponding video into a single multimedia container and visualize this data using a media player. Based on this prototype, we discuss the benefit of our approach as a possible standard. Finally, we argue for the inclusion of MPEG-7 in multimedia containers as a further improvement.
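As a simple, related illustration (an assumption for this sketch, not the authors' exact pipeline), gaze samples can be serialized as a WebVTT track, a textual format that common muxing tools can carry alongside the video inside a container and that standard players understand.

```python
# Simple related illustration (not the authors' exact pipeline): serializing gaze
# samples as WebVTT cues, a textual sidecar/stream format that common muxing tools
# can embed alongside video in a multimedia container.
import json

def vtt_timestamp(seconds):
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def gaze_to_webvtt(samples, cue_length=0.1):
    lines = ["WEBVTT", ""]
    for t, x, y in samples:
        lines += [f"{vtt_timestamp(t)} --> {vtt_timestamp(t + cue_length)}",
                  json.dumps({"x": x, "y": y}), ""]
    return "\n".join(lines)

print(gaze_to_webvtt([(0.0, 0.41, 0.63), (0.1, 0.44, 0.60)]))
```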

Journal ArticleDOI
TL;DR: This work designs Dindex, a distributed indexing service for metadata that incorporates a hierarchy of coarse-grained aggregation and horizontal key-coalition and demonstrates that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.
Abstract: In the Big Data era, applications are generating orders of magnitude more data in both volume and quantity. While many systems have emerged to address this data explosion, the fact that the descriptors of these data, i.e., metadata, are also "big" is often overlooked. The conventional approach to the big metadata issue is to disperse metadata across multiple machines. However, it is extremely difficult to preserve both load balance and data locality in this approach. To this end, in this work we propose hierarchical indirection layers for indexing the underlying distributed metadata. By doing this, data locality is achieved efficiently by the indirection while load balance is preserved. Three key challenges exist in this approach, however: first, how to achieve high resilience; second, how to ensure flexible granularity; third, how to restrain performance overhead. To address the above challenges, we design Dindex, a distributed indexing service for metadata. Dindex incorporates a hierarchy of coarse-grained aggregation and horizontal key-coalition. Theoretical analysis shows that the overhead of building Dindex is compensated by only two or three queries. Dindex has been implemented on a lightweight distributed key-value store and integrated into a fully-fledged distributed filesystem. Experiments demonstrated that Dindex accelerated metadata queries by up to 60 percent with a negligible overhead.

Proceedings ArticleDOI
27 Jun 2017
TL;DR: It is shown that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.
Abstract: Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma---an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.


Proceedings ArticleDOI
24 Sep 2017
TL;DR: A systematic mapping study of approaches and tools labeling source code elements with metadata and presenting them to developers in various forms, forming a taxonomy with four dimensions — source, target, presentation and persistence.
Abstract: Source code is a primary artifact where programmers are looking when they try to comprehend a program. However, to improve program comprehension efficiency, tools often associate parts of source code with metadata collected from static and dynamic analysis, communication artifacts and many other sources. In this article, we present a systematic mapping study of approaches and tools labeling source code elements with metadata and presenting them to developers in various forms. We selected 25 from more than 2,000 articles and categorized them. A taxonomy with four dimensions — source, target, presentation and persistence — was formed. Based on the survey results, we also identified interesting future research challenges.

Journal ArticleDOI
TL;DR: With OSSE, the foundation is laid to operate linked patient registries while respecting strong data protection regulations; the feedback given by users will influence further development of OSSE.
Abstract: Meager amounts of data stored locally, a small number of experts, and a broad spectrum of technological solutions incompatible with each other characterize the landscape of registries for rare diseases in Germany. Hence, the free software Open Source Registry for Rare Diseases (OSSE) was created to unify and streamline the process of establishing specific rare disease patient registries. The data to be collected are specified based on metadata descriptions within the registry framework's so-called metadata repository (MDR), which was developed according to the ISO/IEC 11179 standard. The use of a central MDR allows the same data elements to be shared across any number of registries, thus providing a technical prerequisite for making data comparable and mergeable between registries and promoting interoperability. With OSSE, the foundation is laid to operate linked patient registries while respecting strong data protection regulations. Using the federated search feature, data for clinical studies can be identified across registries. Data integrity, however, remains intact, since no actual data leave the premises without the owner's consent. Additionally, registry solutions other than OSSE can participate via the OSSE bridgehead, which acts as a translator between OSSE registry networks and non-OSSE registries. The pseudonymization service Mainzelliste adds further data protection. Currently, more than 10 installations are under construction in clinical environments (including university hospitals in Frankfurt, Hamburg, Freiburg and Münster). The feedback given by the users will influence further development of OSSE. As an example, the installation process of the registry for undiagnosed patients at University Hospital Frankfurt is described in more detail.
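ISO/IEC 11179 describes data elements through designations, definitions, and value domains. The dataclass below is a loose, simplified illustration of the kind of shared element description a central MDR could hold; it is not OSSE's actual data model.

```python
# Loose illustration of an ISO/IEC 11179-style data element description that a
# central metadata repository (MDR) could share across registries. Field names
# are simplified; this is not OSSE's actual data model.
from dataclasses import dataclass, field

@dataclass
class DataElement:
    designation: str            # human-readable name
    definition: str             # precise meaning of the element
    data_type: str              # e.g. "integer", "string", "date"
    permissible_values: list = field(default_factory=list)  # enumerated value domain

age_at_diagnosis = DataElement(
    designation="Age at diagnosis",
    definition="Age of the patient, in completed years, at first diagnosis.",
    data_type="integer",
)

sex = DataElement(
    designation="Administrative sex",
    definition="Sex of the patient as recorded administratively.",
    data_type="string",
    permissible_values=["female", "male", "other", "unknown"],
)

print(sex.permissible_values)
```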

Journal ArticleDOI
TL;DR: This paper presents a metadata reporting framework (FRAMES) that enables the management and synthesis of observational data essential to advancing a predictive understanding of earth systems; it utilizes best practices for data and metadata organization, enabling consistent data reporting and compatibility with a variety of standardized data protocols.

Proceedings ArticleDOI
12 Nov 2017
TL;DR: EMPRESS provides a simple example of the next step in this evolution of application-level metadata management by integrating per-process metadata with the storage system itself, making it more broadly useful than single file or application formats.
Abstract: Significant challenges exist in the efficient retrieval of data from extreme-scale simulations. An important and evolving method of addressing these challenges is application-level metadata management. Historically, HDF5 and NetCDF have eased data retrieval by offering rudimentary attribute capabilities that provide basic metadata. ADIOS simplified data retrieval by utilizing metadata for each process' data. EMPRESS provides a simple example of the next step in this evolution by integrating per-process metadata with the storage system itself, making it more broadly useful than single file or application formats. Additionally, it allows for more robust and customizable metadata.

Journal ArticleDOI
TL;DR: Evaluating OntoSoft for organizing the metadata associated with a data pre-processing software workflow used with the Variable Infiltration Capacity (VIC) hydrologic model suggests that past efforts to document this software captured key model metadata in unstructured files that could be formalized into a machine-readable form using the OntoSoft Ontology.
Abstract: Metadata for hydrologic models is rarely organized in machine-readable forms. This lack of formal metadata is important because it limits the ability to catalog, identify, attribute, and understand unique model software; ultimately, it hinders the ability to reproduce past computational studies. Researchers have recently proposed an ontology for scientific software metadata, called OntoSoft, to address this problem. The objective of this research is to evaluate OntoSoft for organizing the metadata associated with a data pre-processing software workflow used in association with the Variable Infiltration Capacity (VIC) hydrologic model. This is accomplished by exploring what metadata are available from online resources and how this metadata aligns with the OntoSoft Ontology. The results suggest that past efforts to document this software captured key model metadata in unstructured files that could be formalized into a machine-readable form using the OntoSoft Ontology. Highlights: The OntoSoft Ontology and Portal are evaluated for capturing and sharing metadata for hydrologic modeling software. A data pre-processing software workflow for the Variable Infiltration Capacity (VIC) hydrologic model is used as a case study. 90% of the required OntoSoft metadata was obtained for 13 of the 15 software resources. Metadata divided across six sources can now be organized in a consistent, machine-readable form.


Patent
Norie Iwasaki, Matsui Sosuke, Tsuyoshi Miyamura, Terue Watanabe, Yamamoto Noriko
09 Feb 2017
TL;DR: An information processing apparatus, backup method, and program product that enable efficient differential backup of files stored in a storage device are presented, comprising a metadata management unit, a map generation unit, and a backup management unit that scans metadata to detect files created, modified, or deleted since the last backup.
Abstract: An information processing apparatus, backup method, and program product that enable efficient differential backup. In one embodiment, an information processing apparatus for files stored in a storage device includes: a metadata management unit for managing metadata of files stored in the storage device; a map generation unit for generating a map which indicates whether metadata associated with an identification value uniquely identifying a file in the storage device is present or absent; and a backup management unit for scanning the metadata to detect files that have been created, modified, or deleted since the last backup, and storing at least a data block and the metadata for a detected file in a backup storage device as backup information in association with the identification value.
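The general idea of detecting created, modified, or deleted files by comparing current file metadata against a snapshot taken at the previous backup can be sketched as follows; files are keyed by path here for simplicity, whereas the patent keys them by a unique identification value.

```python
# Minimal sketch of detecting created/modified/deleted files by comparing current
# file metadata against a snapshot taken at the previous backup. Files are keyed
# by path here; the patent keys them by a unique identification value.
import os

def scan_metadata(root):
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot

def diff_since_last_backup(previous, current):
    created  = [p for p in current if p not in previous]
    deleted  = [p for p in previous if p not in current]
    modified = [p for p in current if p in previous and current[p] != previous[p]]
    return created, modified, deleted

# previous = load_snapshot("last_backup.json")   # hypothetical: stored with the last backup
# created, modified, deleted = diff_since_last_backup(previous, scan_metadata("/data"))
```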

Journal ArticleDOI
TL;DR: This work extends previous research, in which a publication service has been designed in the framework of the European Directive Infrastructure for Spatial Information in Europe (INSPIRE) as a solution to assist users in automatically publishing geospatial data and metadata in order to improve SDI maintenance and usability.
Abstract: Nowadays, the existence of metadata is one of the most important aspects of effective discovery of geospatial data published in Spatial Data Infrastructures (SDIs). However, due to the lack of efficient mechanisms, integrated in the data workflow, to assist users in metadata generation, a lot of low-quality and outdated metadata are stored in the catalogues. This paper presents a mechanism for generating and publishing metadata through a publication service. This mechanism is provided as a web service implemented with a standard interface, called a web processing service, which improves interoperability with other SDI components. This work extends previous research, in which a publication service was designed in the framework of the European Directive Infrastructure for Spatial Information in Europe (INSPIRE) as a solution to assist users in automatically publishing geospatial data and metadata in order to improve, among other aspects, SDI maintenance and usability. This work also adds extra features in order to support more geospatial formats, such as sensor data.

Journal ArticleDOI
TL;DR: A set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files is reported; it reduces the time needed to capture metadata substantially, is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets.
Abstract: Summary: Submission to the MetaboLights repository for metabolomics data currently places the burden of reporting instrument and acquisition parameters in ISA-Tab format on users, who have to do it manually, a process that is time consuming and prone to user input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA reduces the time needed to capture metadata substantially (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets. Availability and Implementation: mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/. Contact: reza.salek@ebi.ac.uk or isatools@googlegroups.com. Supplementary information: Supplementary data are available at Bioinformatics online.
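For a sense of where this acquisition metadata lives, the sketch below pulls an instrument cvParam out of a heavily trimmed, inline mzML-like snippet; the real formats are far richer, and mzML2ISA/nmrML2ISA are the appropriate tools for producing the actual ISA-Tab output.

```python
# Illustration of where acquisition metadata lives in mzML: instrument settings
# appear as controlled-vocabulary <cvParam> entries. The inline document below is
# heavily trimmed; mzML2ISA/nmrML2ISA handle the real formats and ISA-Tab output.
import xml.etree.ElementTree as ET

MZML_SNIPPET = """\
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <cvParam cvRef="MS" accession="MS:1000031" name="instrument model"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>
"""

NS = {"mz": "http://psi.hupo.org/ms/mzml"}
root = ET.fromstring(MZML_SNIPPET)
for param in root.findall(".//mz:instrumentConfiguration/mz:cvParam", NS):
    print(param.get("accession"), param.get("name"))
```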

Journal ArticleDOI
TL;DR: DiNoDB is proposed, an interactive-speed query engine for ad-hoc queries on temporary data that avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions.
Abstract: As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad-hoc queries compared to alternatives.
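The piggybacking idea can be sketched in a toy form: while a batch job streams through a file once, it also records lightweight positional metadata (here, byte offsets per key) that later ad-hoc queries can use to seek directly instead of rescanning. This is an illustration of the concept, not DiNoDB's actual metadata format.

```python
# Toy sketch of piggybacking metadata creation on a batch pass: while streaming a
# CSV once for the batch job, also record byte offsets per key so a later ad-hoc
# query can seek directly instead of rescanning. Not DiNoDB's actual metadata format.
def batch_pass_with_index(path, key_column=0):
    offsets = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.decode("utf-8").rstrip("\n").split(",")[key_column]
            offsets.setdefault(key, []).append(pos)
            # ... the batch job's own processing of `line` would happen here ...
    return offsets

def interactive_lookup(path, offsets, key):
    rows = []
    with open(path, "rb") as f:
        for pos in offsets.get(key, []):
            f.seek(pos)
            rows.append(f.readline().decode("utf-8").rstrip("\n"))
    return rows

# offsets = batch_pass_with_index("events.csv")     # built during the batch phase
# print(interactive_lookup("events.csv", offsets, "user_42"))
```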

Journal ArticleDOI
TL;DR: The analysis results demonstrate that the model makes online metadata rebalance feasible without obstructing normal operation and increases the chances of maintaining balance in a huge cluster of metadata servers.
Abstract: This paper presents an effective method of metadata rebalance in exascale distributed file systems. Exponential data growth has led to the need for an adaptive and robust distributed file system, whose typical architecture is composed of a large cluster of metadata servers and data servers. Though each metadata server may start with an equally divided subset of the entire metadata set, a global imbalance in the placement of metadata among metadata servers eventually emerges, and this imbalance worsens over time. To ensure that disproportionate metadata placement does not have a negative effect on the intrinsic performance of a metadata server cluster, it is necessary to recover the balanced performance of the cluster periodically. However, this cannot be easily done, because rebalancing seriously hampers the normal operation of a file system. This situation continues to get worse with both an ever-present heavy workload on the file system and frequent failures of server components at exascale. One of the primary reasons for such degraded performance is that file system clients frequently fail to look up metadata from the metadata server cluster during the period of metadata rebalance; thus, metadata operations cannot proceed at their normal speed. We propose a metadata rebalance model that minimizes failures of metadata operations during the metadata rebalance period and validate the proposed model through a cost analysis. The analysis results demonstrate that our model makes online metadata rebalance feasible without obstructing normal operation and increases the chances of maintaining balance in a huge cluster of metadata servers.