
A comparison of research data management platforms: architecture, flexible metadata and interoperability

TL;DR: A synthetic overview of current platforms that can be used for data management purposes and shows that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools.
Abstract: Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented—some designed for institutional repositories and digital libraries—to select a short list of the more promising ones for data management. These platforms are compared considering their architecture, support for metadata, existing programming interfaces, as well as their search mechanisms and community acceptance. In this process, the stakeholders’ requirements are also taken into account. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools. Nevertheless, depending on the context, some platforms can meet all or part of the stakeholders’ requirements.

Summary (2 min read)

1 Introduction

  • The number of published scholarly papers is steadily increasing, and there is a growing awareness of the importance, diversity and complexity of data generated in research contexts [25].
  • Implementation costs, architecture, interoperability, content dissemination capabilities, implemented search features and community acceptance are also taken into consideration.
  • This evaluation considers aspects relevant to the authors’ ongoing work, focused on finding solutions to research data management, and takes into consideration their past experience in this field [33].
  • Moreover, the authors look at staging platforms, which are especially tailored to capture metadata records as they are produced, offering researchers an integrated environment for their management along with the data.
  • As datasets become organised and described, their value and their potential for reuse will prompt further preservation actions.

3 Scope of the analysis

  • The stakeholders in the data management workflow can greatly influence whether research data is reused.
  • The selection of platforms in the analysis acknowledges their role, as well as the importance of the adoption of community standards to help with data description and management in the long run.
  • On the other hand, such solutions are usually harder to install and maintain by institutions in the so-called long tail of science—institutions that create large numbers of small datasets, though do not possess the necessary financial resources and preservation expertise to support a complete preservation workflow [18].
  • The Fedora framework3 is used by some institutions, and is also under active development, with the recent release of Fedora 4.
  • The former includes aspects such as how they are deployed into a production environment, the locations where they keep their data, whether their source code is available, and other aspects that are related to the compliance with preservation best practices.

4 Platform comparison

  • Based on the selection of the evaluation scope, this section addresses the comparison of the platforms according to key features that can help in the selection of a platform for data management.
  • By adopting a dynamic approach to data management, tasks can be made easier for researchers, motivating them to use the data management platform as part of their daily research activities while they work on the data.
  • This platform is flexible, available under an open-source license, and compatible with several metadata representations, while still providing a complete API.
  • While the evaluated platforms have different description requirements upon deposit, most of them lack support for domain-specific metadata schemas.
  • This search feature makes it easier for researchers to find the datasets that are from relevant domains and belong to specific collections or similar dataset categories (the concept varies between platforms as they have different organizational structures).

5 Data staging platforms

  • Most of the analyzed solutions target data repositories, i.e. the end of the research workflow.
  • These requirements have been identified by several research and data management institutions, who have implemented integrated solutions for researchers to manage data not only when it is created, but also throughout the entire research workflow.
  • It provides researchers with 20GB of storage for free, and is integrated with other modules for dataset sharing and staging, including some computational processing on the stored data.
  • Dendro is a single solution targeted at improving the overall availability and quality of research data.
  • Curators can expand the platform’s data model by loading ontologies that specify domain-specific or generic metadata descriptors that can then be used by researchers in their projects.

6 Conclusion

  • The evaluation showed that it can be hard to select a platform without first performing a careful study of the requirements of all stakeholders.
  • Its features and extensive API also make it possible to use this repository to manage research data, using its key-value dictionary to store domain-level descriptors.
  • A very important factor to consider is also the control over where the data is stored.
  • The authors consider that these solutions should be compared to other collaborative solutions such as Dendro, a research data management solution currently under development.
  • This should, of course, be done while taking into consideration available metadata standards that can contribute to overall better conditions for long-term preservation [36].


A comparison of research data management platforms
Architecture, flexible metadata and interoperability

Ricardo Carvalho Amorim, João Aguiar Castro, João Rocha da Silva, Cristina Ribeiro
Abstract Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented—some designed for institutional repositories and digital libraries—to select a short list of the more promising ones for data management. These platforms are compared considering their architecture, support for metadata, existing programming interfaces, as well as their search mechanisms and community acceptance. In this process, the stakeholders’ requirements are also taken into account. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools. Nevertheless, depending on the context, some platforms can meet all or part of the stakeholders’ requirements.

This paper is an extended version of a previously published comparative study. Please refer to the WCIST 2015 conference proceedings (doi: 10.1007/978-3-319-16486-1).

Ricardo Carvalho Amorim, INESC TEC—Faculdade de Engenharia da Universidade do Porto. E-mail: ricardo.amorim3@gmail.com
João Aguiar Castro, INESC TEC—Faculdade de Engenharia da Universidade do Porto. E-mail: joaoaguiarcastro@gmail.com
João Rocha da Silva, INESC TEC—Faculdade de Engenharia da Universidade do Porto. E-mail: joaorosilva@gmail.com
Cristina Ribeiro, INESC TEC—Faculdade de Engenharia da Universidade do Porto. E-mail: mcr@fe.up.pt
This is a post-peer-review, pre-copyedit version of an article published in Universal Access in the Information Society. The final authenticated version is available online at: https://doi.org/10.1007/s10209-016-0475-y

1 Introduction

The number of published scholarly papers is steadily increasing, and there is a growing awareness of the importance, diversity and complexity of data generated in research contexts [25]. The management of these assets is currently a concern for both researchers and institutions, who have to streamline scholarly communication while keeping record of research contributions and ensuring the correct licensing of their contents [23,18]. At the same time, academic institutions have new mandates requiring data management activities to be carried out during the research projects, as a part of research grant contracts [14,26]. These activities are invariably supported by software platforms, increasing the demand for such infrastructures.

This paper presents an overview of several prominent research data management platforms that can be put in place by an institution to support part of its research data management workflow. It starts by identifying a set of well-known repositories that are currently being used for either publications or data management, discussing their use in several research institutions. Then, focus moves to their fitness to handle research data, namely their domain-specific metadata requirements and preservation guidelines. Implementation costs, architecture, interoperability, content dissemination capabilities, implemented search features and community acceptance are also taken into consideration. When faced with the many alternatives currently available, it can be difficult for institutions to choose a suitable platform to meet their specific requirements. Several comparative studies of existing solutions have already been carried out to evaluate different aspects of each implementation, confirming that this is an issue of increasing importance [16,3,6]. This evaluation considers aspects relevant to the authors’ ongoing work, focused on finding solutions for research data management, and takes into consideration their past experience in this field [33]. This experience has provided insights on specific, local needs that can influence the adoption of a platform and therefore the success of its deployment.
It is clear that the effort in creating metadata for research datasets is very different from what is required for research publications. While publications can be accurately described by librarians, good quality metadata for a dataset requires the contribution of the researchers involved in its production. Their knowledge of the domain is required to adequately document the dataset production context so that others can reuse it. Involving the researchers in the deposit stage is a challenge, as the investment in metadata production for data publication and sharing is typically higher than that required for the addition of notes that are only intended for their peers in a research group [7].

Moreover, the authors look at staging platforms, which are especially tailored to capture metadata records as they are produced, offering researchers an integrated environment for their management along with the data. As this is an area with several proposals in active development, two will be contemplated: EUDAT, which includes tools for data staging, and Dendro, a platform proposed for engaging researchers in data description that takes into account the need for data and metadata organisation.

Staging platforms are capable of exporting the enclosed datasets and metadata records to research data repositories. The platforms selected for the analysis in the sequel are considered as research data management repositories for datasets in the long tail of science, as they are designed with sharing and dissemination in mind. Together, staging platforms and research data repositories provide the tools to handle the stages of the research workflow. Long-term preservation imposes further requirements, and other tools may be necessary to satisfy them. However, as datasets become organised and described, their value and their potential for reuse will prompt further preservation actions.
2 From publications to data management

The growth in the number of research publications, combined with a strong drive towards open access policies [8,10], continues to foster the development of open-source platforms for managing bibliographic records. While data citation is not yet a widespread practice, the importance of citable datasets is growing. Until a culture of data citation is widely adopted, however, many research groups are opting to publish so-called “data papers”, which are more easily citable than datasets. Data papers serve not only as a reference to datasets but also document their production context [9].

As data management becomes an increasingly important part of the research workflow [24], solutions designed for managing research data are being actively developed by both open-source communities and data management-related companies. As with institutional repositories, many of their design and development challenges have to do with description and long-term preservation of research data. There are, however, at least two fundamental differences between publications and datasets: the latter are often purely numeric, making it very hard to derive any type of metadata by simply looking at their contents; also, datasets require detailed, domain-specific descriptions to be correctly interpreted. Metadata requirements can also vary greatly from domain to domain, requiring repository data models to be flexible enough to adequately represent these records [35]. The effort invested in adequate dataset description is worthwhile, since it has been shown that research publications that provide access to their base data consistently yield higher citation rates than those that do not [27].

As these repositories deal with a reasonably small set of managed formats for deposit, several reference models, such as the OAIS (Open Archival Information System) [12], are currently in use to ensure preservation and to promote metadata interchange and dissemination. Besides capturing the available metadata during the ingestion process, data repositories often distribute this information to other instances, improving the publications’ visibility through specialised research search engines or repository indexers. While the former focus on querying each repository for exposed contents, the latter help users find data repositories that match their needs—such as repositories from a specific domain or storing data from a specific community. Governmental institutions are also promoting the disclosure of open data to improve citizen commitment and government transparency, and this motivates the use of data management platforms in this context.
2.1 An overview of existing repositories

While depositing and accessing publications from different domains is already possible in most institutions, ensuring the same level of accessibility to data resources is still challenging, and different solutions are being experimented with to expose and share data in some communities. Addressing this issue, we synthesize a preliminary classification of these solutions according to their specific purpose: they either target staging and early research activities, or manage deposited datasets and make them available to the community.

Table 1 identifies features of the selected platforms that may render them convenient for data management. To build the table, the authors resorted to the documentation of the platforms, and to basic experiments with demonstration instances, whenever available. In the first column, under “Registered repositories”, is the number of running instances of each platform, according to the OpenDOAR platform as of mid-October 2015.

In the analysis, five evaluation criteria that can be relevant for an institution to make a coarse-grained assessment of the solutions are considered. Some existing tools were excluded from this first analysis, mainly because some of their characteristics place them outside the scope of this work. This is the case of platforms specifically targeting research publications (and that cannot be easily modified for managing data), and heavy-weight platforms targeted at long-term preservation. Also excluded were those that, from a technical point of view, do not comply with desirable requirements for this domain, such as adopting an open-source approach or providing access to their features via comprehensive APIs.

Comparing the number of existing installations, it is natural to assume that a large number of instances for a platform is a good indication of the existence of support for its implementation. Repositories such as DSpace are widely used among institutions to manage publications. Therefore, institutions using DSpace to manage publications can use their support for the platform to expand or replicate the repository and meet additional requirements.

It is important to mention that some repositories do not implement interfaces with existing repository indexers, and this may cause the OpenDOAR statistics to show a value lower than the actual number of existing installations. Moreover, services provided by EUDAT, Figshare and Zenodo, for instance, consist of a single installation that receives all the deposited data, rather than a distributed array of manageable installations.

Government-supported platforms such as CKAN are currently being used as part of open government initiatives in several countries, allowing the disclosure of data related to sensitive issues such as budget execution, and their aim is to vouch for transparency and credibility towards taxpayers [21,20]. Although not specifically tailored to meet research data management requirements, these data-focused repositories also count an increasing number of instances supporting complex research data management workflows [38], even at universities¹.

Access to the source code can also be a valuable criterion for selecting a platform, primarily to avoid vendor lock-in, which is usually associated with commercial software or other provided services. Vendor lock-in is undesirable from a preservation point of view, as it places the maintenance of the platform (and consequently the data stored inside) in the hands of a single vendor that may not be able to provide support indefinitely. The availability of a platform’s source code also allows additional modifications to be carried out in order to create customized workflows—examples include improved metadata capabilities and data browsing functionalities. Commercial solutions such as ContentDM may incur high subscription fees, which can make them cost-prohibitive for non-profit organizations or small research institutions. In some cases only a small portion of the source code for the entire solution is actually available to the public. This is the case with EUDAT, where only the B2Share module is currently open²—the remaining modules are unavailable to date.

From an integration point of view, the existence of an API can allow for further development and help with repository maintenance as the software ages. Solutions that do not, at least partially, comply with this requirement may hinder integration with external platforms to improve the visibility of existing contents. The lack of an API creates a barrier to the development of tools to support a platform in specific environments, such as laboratories that frequently produce data to be directly deposited and disclosed. Finally, regarding long-term preservation, some platforms fail to provide unique identifiers for the resources upon deposit, making persistent references to data and data citation in publications hard.

¹ http://ckan.org/2013/11/28/ckan4rdm-st-andrews/
² The source code repository for B2Share is hosted on GitHub at https://github.com/EUDAT-B2SHARE/b2share
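As an illustration of the kind of comprehensive API discussed above, CKAN exposes its catalogue through a JSON “action” API over plain HTTP. The sketch below only builds a `package_search` request URL for a hypothetical institutional CKAN instance (the host name is an assumption, and no request is actually sent); the path and parameter names follow CKAN’s documented action API.

```python
from urllib.parse import urlencode

def ckan_search_url(base_url: str, query: str, rows: int = 10) -> str:
    # Build a CKAN action-API URL (API v3) for the package_search action;
    # "q" is the free-text query and "rows" limits the number of results.
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"

# Hypothetical institutional CKAN host; fetching this URL would return a
# JSON document whose "result" object lists the matching datasets.
url = ckan_search_url("https://data.example-university.edu", "hydrology", rows=5)
print(url)
# https://data.example-university.edu/api/3/action/package_search?q=hydrology&rows=5
```

A tool harvesting such an endpoint can page through results with the API’s offset parameters, which is what makes CKAN instances easy to integrate with external indexers.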

Table 1: Limitations of the identified repository solutions (sources: the OpenDOAR platform and each platform’s corresponding website; some entries apply only through additional plug-ins, or only partially). The table flags, for each platform, whether it is closed source, lacks an API, lacks unique identifiers, has a complex installation or setup, or lacks OAI-PMH compliance. Registered repositories per platform (OpenDOAR, mid-October 2015): CKAN 139, ContentDM 53, Dataverse 2, Digital Commons 141, DSpace 1305, ePrints 407, Fedora 41, Greenstone 51, Invenio 20, Omeka 4, SciELO 18, WEKO 40 (no data for some criteria); EUDAT, Figshare and Zenodo are centrally hosted services. [The per-platform limitation marks did not survive text extraction.]
Support for flexible research workflows makes some repository solutions attractive to smaller institutions looking for solutions to implement their data management workflows. Both DSpace and ePrints, for instance, are quite common as institutional repositories to manage publications, as they offer broad compatibility with the harvesting protocol OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) [22] and with preservation guidelines according to the OAIS model. OAIS requires the existence of different packages with specific purposes, namely SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package). The OAIS reference model defines SIP as a representation of packaged items to be deposited in the repository. AIP, on the other hand, represents the packaged digital objects within the OAIS-compliant system, and DIP holds one or several digital artifacts and their representation information, in a format that can be interpreted by potential users.
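The OAI-PMH protocol mentioned above is deliberately simple: every request is an HTTP GET against the repository’s base URL, with the operation selected by a `verb` parameter, and responses are XML, typically carrying Dublin Core (`oai_dc`) records. The sketch below builds a `ListRecords` request for a hypothetical endpoint and parses a heavily abbreviated Dublin Core payload of the kind a harvester would receive; the endpoint URL and the sample record are assumptions for illustration only.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def list_records_url(endpoint: str, metadata_prefix: str = "oai_dc") -> str:
    # An OAI-PMH request is a plain HTTP GET; the verb selects the
    # operation and metadataPrefix selects the metadata format.
    return f"{endpoint}?{urlencode({'verb': 'ListRecords', 'metadataPrefix': metadata_prefix})}"

url = list_records_url("https://repository.example.org/oai")  # hypothetical endpoint

# Abbreviated Dublin Core content, similar to what appears inside each
# <record> element of a real ListRecords response.
sample = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Example dataset</dc:title>'
    '<dc:creator>Doe, J.</dc:creator>'
    '</record>'
)
root = ET.fromstring(sample)
title = root.find("{http://purl.org/dc/elements/1.1/}title").text
print(url)
print(title)  # Example dataset
```

Because the protocol is this uniform, a single harvester can aggregate metadata from any number of compliant repositories, which is what enables the indexing services discussed in the next section.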
2.2 Stakeholders in research data management

Several stakeholders are involved in dataset description throughout the data management workflow, playing an important part in their management and dissemination [24,7]. These stakeholders—researchers, research institutions, curators, harvesters, and developers—play a governing role in defining the main requirements of a data repository for the management of research outputs. As key metadata providers, researchers are responsible for the description of research data. They are not necessarily knowledgeable in data management practices, but can provide domain-specific, more or less formal descriptions to complement generic metadata. This captures the essential data production context, making it possible for other researchers to reuse the data [7]. As data creators, researchers can play a central role in data deposit by selecting appropriate file formats for their datasets, preparing their structure and packaging them appropriately [15]. Institutions are also motivated to have their data recognized and preserved according to the requirements of funding institutions [17,26]. In this regard, institutions value metadata in compliance with standards, which makes data ready for inclusion in networked environments, therefore increasing their visibility. To make sure that this context is correctly passed, along with the data, to the preservation stage, curators are mainly interested in maintaining data quality and integrity over time. Usually, curators are information experts, so it is expected that their close collaboration with researchers can result in both detailed and compliant metadata records.

Considering data dissemination and reuse, harvesters can be either individuals looking for specific data or services which index the content of several repositories. These services can make particularly good use of established protocols, such as the OAI-PMH, to retrieve metadata from different sources and create an interface to expose the indexed resources. Finally, contributing to the improvement and expansion of these repositories over time, developers are concerned with the underlying technologies, and also with having extensive APIs to promote integration with other tools.
3 Scope of the analysis

The stakeholders in the data management workflow can greatly influence whether research data is reused. The selection of platforms in the analysis acknowledges their role, as well as the importance of the adoption of community standards to help with data description and management in the long run.

For this comparison, data management platforms with instances running at both research and government institutions have been considered, namely DSpace, CKAN, Zenodo, Figshare, ePrints, Fedora and EUDAT. If the long-term preservation of research assets is an important requirement of the stakeholders in question, other alternatives such as RODA [30] and Archivematica may also be considered strong candidates, since they implement comprehensive preservation guidelines not only for the digital objects themselves but also for their whole life cycle and associated processes. On one hand, these platforms have a strong concern with long-term preservation, strictly following existing standards such as OAIS, PREMIS or METS, which cover the different stages of a long-term preservation workflow. On the other hand, such solutions are usually harder to install and maintain by institutions in the so-called long tail of science—institutions that create large numbers of small datasets but do not possess the necessary financial resources and preservation expertise to support a complete preservation workflow [18].

The Fedora framework³ is used by some institutions, and is also under active development, with the recent release of Fedora 4. The fact that it is designed as a framework to be fully customized and instantiated, instead of being a “turnkey” solution, places Fedora on a different level that cannot be directly compared with other solutions. Two open-source examples of Fedora implementations are Hydra⁴ and Islandora⁵. Both are open-source, capable of handling research workflows, and use the best-practices approach already implemented in the core Fedora framework. Although these are not present in the comparison table, this section will also consider their strengths when compared to the other platforms.

An overview of the previously identified stakeholders led to the selection of two important dimensions for the assessment of the platform features: their architecture, and their metadata and dissemination capabilities. The former includes aspects such as how they are deployed into a production environment, the locations where they keep their data, whether their source code is available, and other aspects related to compliance with preservation best practices. The latter focuses on how resource-related metadata is handled and the level of compliance of these records with established standards and exchange protocols. Other important aspects are their adoption within the research communities and the availability of support for extensions. Table 2 shows an overview of the results of our evaluation.

³ http://www.fedora-commons.org/
⁴ http://projecthydra.org/
⁵ http://islandora.ca/
4 Platform comparison

Based on the selection of the evaluation scope, this section addresses the comparison of the platforms according to key features that can help in the selection of a platform for data management. Table 2 groups these features in two categories: (i) Architecture, for structure-related characteristics; and (ii) Metadata and dissemination, for those related to flexible description and interoperability. This analysis is guided by the use cases in the research data management environment.

4.1 Architecture

Regarding the architecture of the platforms, several aspects are considered. From the point of view of a research institution, a quick and simple deployment of the selected platform is an important aspect. There are two main scenarios: the institution can either outsource an external service or install and customize its own repository, supporting the infrastructure maintenance costs. Contracting a service provided by a dedicated company such as Figshare or Zenodo delegates platform maintenance for a fee. The service-based approach may not be viable in some scenarios, as some researchers or institutions may be reluctant to deposit their data in a platform outside their control [11]. DSpace, ePrints, CKAN or any Fedora-based solution can be installed and run completely under the control of the research institution, and therefore offer better control over the stored data. As open-source solutions, they also have

Citations
Proceedings ArticleDOI
01 Jan 2020
TL;DR: An approach to organize the research data management process using a research data and knowledge management system and a domain specific vocabulary is being developed.
Abstract: Regarding the development of complex new technological processes, research data management systems are of major importance to support scientist in large collaborative projects and allow project teams to take advantage of a research organized according to the FAIR data principles. At the example of the CRC 1153, where an interdisciplinary team researches novel process chains for the manufacturing of hybrid components, an approach to organize the research data management process using a research data and knowledge management system and a domain specific vocabulary is being developed.

6 citations

Journal ArticleDOI
TL;DR: The model proposed here for integration is a hybrid model which can translate metadata standards and use the Z39.50 and OAI protocols to transfer data.
Abstract: This paper aims to propose an integrating model for creating virtual libraries in Iranian universities of medical sciences.,This study was conducted with an analytic survey method. The statistical population comprised 66 Iranian universities of medical sciences, of which 59 libraries participated in the study. A researcher-made checklist was used for data collection. To ensure the accuracy of data, interviews and, in some cases, observations were also performed. Statistical estimates, including frequency, percentage, cumulative frequency and diagrams, were used for data analysis, and the system analysis method was used for modeling.,Results demonstrated that the library software programs of the studied universities of medical sciences do not have desirable interoperability capabilities. Only Azarsa program can exchange information with other systems. In terms of metadata and its standards, the studied libraries use programs with various standards, with MARC and Dublin Core standards being the most frequently used ones in the studied sample.,The model proposed here for integration is a hybrid model which can translate metadata standards and use the Z39.50 and OAI protocols to transfer data.

5 citations

Journal ArticleDOI
TL;DR: This paper presents organizational measures, data and metadata management concepts, and technical solutions to form a flexible research data management framework that allows for efficiently sharing the full range of data and metadata among all researchers of the project, and smooth publishing of selected data and data streams to publicly accessible sites.
Abstract: The consistent management of research data is crucial for the success of long-term and large-scale collaborative research. Research data management is the basis for efficiency, continuity, and quality of the research, as well as for maximum impact and outreach, including the long-term publication of data and their accessibility. Both funding agencies and publishers increasingly require this long term and open access to research data. Joint environmental studies typically take place in a fragmented research landscape of diverse disciplines; researchers involved typically show a variety of attitudes towards and previous experiences with common data policies, and the extensive variety of data types in interdisciplinary research poses particular challenges for collaborative data management. In this paper, we present organizational measures, data and metadata management concepts, and technical solutions to form a flexible research data management framework that allows for efficiently sharing the full range of data and metadata among all researchers of the project, and smooth publishing of selected data and data streams to publicly accessible sites. The concept is built upon data type-specific and hierarchical metadata using a common taxonomy agreed upon by all researchers of the project. The framework’s concept has been developed along the needs and demands of the scientists involved, and aims to minimize their effort in data management, which we illustrate from the researchers’ perspective describing their typical workflow from the generation and preparation of data and metadata to the long-term preservation of data including their metadata.

5 citations


Cites background from "A comparison of research data manag..."

  • ...These research-data-management solutions differ with respect to technical architecture, metadata, user and programming interfaces, scope, coverage, and costs (e.g., Glatard et al. 2017; Amorim et al. 2017)....

    [...]

DOI
07 Aug 2019
TL;DR: This study aims to adapt existing theories of repository and digital library architecture to current knowledge and technology, using a Big Data model and the Open Archives Initiative (OAI), and to design a new repository.
Abstract: Many repository development efforts have been carried out so far, but they still rely on repository analyses built by other parties (other countries); in addition, existing repository software is complex to use and requires an expensive architecture. This research is needed so that its results can yield a repository architecture that is better, more efficient, and usable by universities in Indonesia, particularly AMIK Indonesia. The study aims to adapt existing theories of repository and digital library architecture to current knowledge and technology, using a Big Data model and the Open Archives Initiative (OAI), and to design a new repository that makes online publication easier for universities, reduces user limitations in publication management, and is more flexible than existing repositories and digital libraries. Broadly speaking, the research is divided into three stages: pre-development data collection, development and implementation, and post-development data collection. Pre-development data collection provides a preliminary study of the core problem at hand, while the development and implementation phase focuses on modeling the software design as diagrams and writing the programming code that implements that design. The post-development data collection stage serves to refine the resulting application, draw conclusions, and suggest topics for subsequent research. The proposed system allows users to see publication information, and the repository is also equipped with an Open Archives Initiative (OAI) module so that it can be crawled by indexing engines.
From the results of the study, it can be concluded that pre-development data collection, development and implementation, and post-development data collection have been carried out. The repository brings benefits including the ability to expose data for indexing on various indexing websites, and the application can be installed on web hosting. It is built with the CodeIgniter framework and Node.js, and uses supporting languages and technologies such as HTML, CSS, jQuery, JavaScript, JSON, AJAX, and Bootstrap for designing the interface. The repository application is named T-REPOSITORY.
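An indexing service harvests such an OAI-enabled repository by issuing OAI-PMH requests and parsing the XML responses. The sketch below builds a `ListRecords` request URL and extracts Dublin Core titles from a response fragment; the endpoint URL and the sample response are hypothetical, but the verb, parameters, and namespaces follow the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.edu/oai"  # hypothetical endpoint

def list_records_url(metadata_prefix="oai_dc", **kwargs):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, **kwargs}
    return BASE_URL + "?" + urlencode(params)

# A made-up (but schema-shaped) fragment of a ListRecords response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def extract_titles(xml_text):
    """Pull all dc:title values out of an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall(".//dc:title", NS)]

print(list_records_url())
print(extract_titles(SAMPLE_RESPONSE))
```

A production harvester would additionally follow `resumptionToken` elements to page through large result sets.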

5 citations

29 Dec 2016
TL;DR: The TAIL project, as described in this paper, will build a portfolio of data management examples across several domains that researchers can use to assess the effort required and the rewards to be gained from this activity.
Abstract: Research data management currently concerns researchers as much as science managers and funding agencies. Researchers are aware of the value of research data, while funding agencies establish mandates for data curation and sharing plans as part of their regulations. Science policy makers also want to ensure that the results obtained from their investment plans have the greatest possible impact. The TAIL project, running from 2016 to 2019, will build a portfolio of data management examples in several domains that researchers can use to assess the effort required and the rewards to be gained from this activity. The project builds on work carried out in the study of researchers' workflows using the Dendro platform and on mobile interfaces for the collection of data and metadata, of which the LabTablet is an example. These preliminary results inform the processes to be used in publishing existing datasets in national and international repositories, in the design of metadata models for detailed domain description, and in the alignment with European and national infrastructures.

5 citations

References
More filters
Journal ArticleDOI
TL;DR: The thinking about digital preservation over the past five years has advanced to the point where the needs are widely recognized and well defined, the technical approaches at least superficially mapped out, and the need for action is now clear.
Abstract: In the fall of 2002, something extraordinary occurred in the continuing networked information revolution, shifting the dynamic among individually driven innovation, institutional progress, and the evolution of disciplinary scholarly practices. The development of institutional repositories emerged as a new strategy that allows universities to apply serious, systematic leverage to accelerate changes taking place in scholarship and scholarly communication, both moving beyond their historic relatively passive role of supporting established publishers in modernizing scholarly publishing through the licensing of digital content, and also scaling up beyond ad-hoc alliances, partnerships, and support arrangements with a few select faculty pioneers exploring more transformative new uses of the digital medium. Many technology trends and development efforts came together to make this strategy possible. Online storage costs have dropped significantly; repositories are now affordable. Standards like the open archives metadata harvesting protocol are now in place; some progress has also been made on the standards for the underlying metadata itself. The thinking about digital preservation over the past five years has advanced to the point where the needs are widely recognized and well defined, the technical approaches at least superficially mapped out, and the need for action is now clear. The development of free, publicly accessible journal article collections in disciplines such as high-energy physics has demonstrated ways in which the network can change scholarly communication by altering dissemination and access patterns; separately, the development of a series of extraordinary digital works had at least suggested the potential of creative authorship specifically for the digital medium to transform the presentation and transmission of scholarship. 
The leadership of the Massachusetts Institute of Technology (MIT) in the development and deployment of the DSpace institutional repository system, created in collaboration with the Hewlett Packard Corporation,

938 citations


"A comparison of research data manag..." refers background in this paper

  • ...keeping record of research contributions and ensuring the correct licensing of their contents [17, 22]....

    [...]

01 Jan 2013
TL;DR: Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: to reproduce or to verify research, to make results of publicly funded research available to the public, to enable others to ask new questions of extant data, and to advance the state of research and innovation.
Abstract: We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much-improved data curation. (Hanson, Sugden, & Alberts) Researchers are producing an unprecedented deluge of data by using new methods and instrumentation. Others may wish to mine these data for new discoveries and innovations. However, research data are not readily available as sharing is common in only a few fields such as astronomy and genomics. Data sharing practices in other fields vary widely. Moreover, research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context. Data sharing is thus a conundrum. Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: (1) to reproduce or to verify research, (2) to make results of publicly funded research available to the public, (3) to enable others to ask new questions of extant data, and (4) to advance the state of research and innovation. These rationales differ by the arguments for sharing, by beneficiaries, and by the motivations and incentives of the many stakeholders involved. The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice. © 2012 Wiley Periodicals, Inc.

634 citations


"A comparison of research data manag..." refers background in this paper

  • ...Involving the researchers in the deposit stage is a challenge, as the investment in metadata production for data publication and sharing is typically higher than that required for the addition of notes that are only intended for their peers in a research group [7]....

    [...]

  • ...initial data production context, making it possible for other researchers to reuse the data [7]....

    [...]

  • ...Several stakeholders are involved in dataset description throughout the data management workflow, playing an important part in their management and dissemination [7, 23]....

    [...]

Journal ArticleDOI
01 Oct 2013-PeerJ
TL;DR: There is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data, and a robust citation benefit from open data is found, although a smaller one than previously reported.
Abstract: Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

423 citations


"A comparison of research data manag..." refers background in this paper

  • ...The effort invested in adequate dataset description is worthwhile, since it has been shown that research publications that provide access to their base data consistently yield higher citation rates than those that do not [26]....

    [...]

Book
01 Jun 2012
TL;DR: This document is a technical Recommended Practice for use in developing a broader consensus on what is required for an archive to provide permanent, or indefinite Long Term, preservation of digital information.
Abstract: This document is a technical Recommended Practice for use in developing a broader consensus on what is required for an archive to provide permanent, or indefinite Long Term, preservation of digital information. This Recommended Practice establishes a common framework of terms and concepts which make up an Open Archival Information System (OAIS). It allows existing and future archives to be more meaningfully compared and contrasted. It provides a basis for further standardization within an archival context and it should promote greater vendor awareness of, and support of, archival requirements. CCSDS has changed the classification of Reference Models from Blue (Recommended Standard) to Magenta (Recommended Practice). Through the process of normal evolution, it is expected that expansion, deletion, or modification of this document may occur. This Recommended Practice is therefore subject to CCSDS document management and change control procedures, which are defined in the Procedures Manual for the Consultative Committee for Space Data Systems. Current issue updates document based on input from user community (note). Current versions of CCSDS documents are maintained at the CCSDS Web site: http://www.ccsds.org/

419 citations


Additional excerpts

  • ...As these repositories deal with a reasonably small set of managed formats for deposit, several reference models, such as the Open Archival Information System (OAIS) [12], are currently in use to ensure preservation and to promote metadata interchange and dissemination....

    [...]

  • ...As these repositories deal with a reasonably small set of managed formats for deposit, several reference models, such as the OAIS (Open Archival Information System) [12] are currently in use to ensure preservation and to promote metadata interchange and dissemination....

    [...]
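The OAIS reference model centers on the notion of an information package: content bundled with the preservation metadata needed to keep it interpretable over the long term. The sketch below is a loose illustration of that idea; the class and field names are our own simplification, not OAIS terminology in full.

```python
from dataclasses import dataclass, field
import hashlib

@dataclass
class InformationPackage:
    identifier: str
    content: bytes
    # Representation information: what a consumer needs to interpret the bits.
    representation_info: dict = field(default_factory=dict)
    # Fixity information: evidence the content has not been altered.
    fixity: str = ""

def ingest(identifier: str, content: bytes, rep_info: dict) -> InformationPackage:
    """Turn a submission (SIP-like) into an archival package (AIP-like),
    recording a checksum as fixity information at ingest time."""
    return InformationPackage(
        identifier=identifier,
        content=content,
        representation_info=rep_info,
        fixity=hashlib.sha256(content).hexdigest(),
    )

aip = ingest("doi:10.1234/example", b"raw data", {"format": "text/csv"})
print(aip.identifier, aip.fixity[:8])
```

In OAIS terms, the model distinguishes submission, archival, and dissemination packages (SIP, AIP, DIP); the common thread, as above, is that content never travels without the metadata that makes it preservable and verifiable.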

Related Papers (5)
Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented—some designed for institutional repositories and digital libraries—to select a short list of the more promising ones for data management. This paper is an extended version of a previously published comparative study. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools.