What are the contributions in this paper?

This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented (some designed for institutional repositories and digital libraries) to select a short list of the more promising ones for data management. This paper is an extended version of a previously published comparative study. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools.

(Open Access) A comparison of research data management platforms: architecture, flexible metadata and interoperability (2017) | Ricardo Carvalho Amorim

A comparison of research data management platforms

Architecture, flexible metadata and interoperability

Ricardo Carvalho Amorim, João Aguiar Castro, João Rocha da Silva,

Cristina Ribeiro

Abstract Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented (some designed for institutional repositories and digital libraries) to select a short list of the more promising ones for data management. These platforms are compared considering their architecture, support for metadata, existing programming interfaces, as well as their search mechanisms and community acceptance. In this process, the stakeholders’ requirements are also taken into account. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools. Nevertheless, depending on the context, some platforms can meet all or part of the stakeholders’ requirements.

This paper is an extended version of a previously published comparative study. Please refer to the WCIST 2015 conference proceedings (doi: 10.1007/978-3-319-16486-1)

Ricardo Carvalho Amorim
INESC TEC, Faculdade de Engenharia da Universidade do Porto
E-mail: ricardo.amorim3@gmail.com

João Aguiar Castro
INESC TEC, Faculdade de Engenharia da Universidade do Porto
E-mail: joaoaguiarcastro@gmail.com

João Rocha da Silva
INESC TEC, Faculdade de Engenharia da Universidade do Porto
E-mail: joaorosilva@gmail.com

Cristina Ribeiro
INESC TEC, Faculdade de Engenharia da Universidade do Porto
E-mail: mcr@fe.up.pt

1 Introduction

The number of published scholarly papers is steadily increasing, and there is a growing awareness of the importance, diversity and complexity of data generated in research contexts [25]. The management of these assets is currently a concern for both researchers and institutions who have to streamline scholarly communication, while keeping record of research contributions and ensuring the correct licensing of their contents [23,18]. At the same time, academic institutions have new mandates, requiring data management activities to be carried out during the research projects, as a part of research grant contracts [14,26]. These activities are invariably supported by software platforms, increasing the demand for such infrastructures.

This is a post-peer-review, pre-copyedit version of an article published in Universal Access in the Information Society. The final authenticated version is available online at: https://doi.org/10.1007/s10209-016-0475-y

This paper presents an overview of several prominent research data management platforms that can be put in place by an institution to support part of its research data management workflow. It starts by identifying a set of well known repositories that are currently being used for either publications or data management, discussing their use in several research institutions. Then, focus moves to their fitness to handle research data, namely their domain-specific metadata requirements and preservation guidelines. Implementation costs, architecture, interoperability, content dissemination capabilities, implemented search features and community acceptance are also taken into consideration. When faced with the many alternatives currently available, it can be difficult for institutions to choose a suitable platform to meet their specific requirements. Several comparative studies between existing solutions were already carried out in order to evaluate different aspects of each implementation, confirming that this is an issue with increasing importance [16,3,6]. This evaluation considers aspects relevant to the authors’ ongoing work, focused on finding solutions to research data management, and takes into consideration their past experience in this field [33]. This experience has provided insights on specific, local needs that can influence the adoption of a platform and therefore the success in its deployment.
It is clear that the effort in creating metadata for research datasets is very different from what is required for research publications. While publications can be accurately described by librarians, good quality metadata for a dataset requires the contribution of the researchers involved in its production. Their knowledge of the domain is required to adequately document the dataset production context so that others can reuse it. Involving the researchers in the deposit stage is a challenge, as the investment in metadata production for data publication and sharing is typically higher than that required for the addition of notes that are only intended for their peers in a research group [7].
Moreover, the authors look at staging platforms, which are especially tailored to capture metadata records as they are produced, offering researchers an integrated environment for their management along with the data. As this is an area with several proposals in active development, two of them will be considered: EUDAT, which includes tools for data staging, and Dendro, a platform proposed for engaging researchers in data description that takes into account the need for data and metadata organisation.
Staging platforms are capable of exporting the enclosed datasets and metadata records to research data repositories. The platforms selected as candidates in the analysis that follows are regarded as research data management repositories for datasets in the long tail of science, as they are designed with sharing and dissemination in mind. Together, staging platforms and research data repositories provide the tools to handle the stages of the research workflow. Long-term preservation imposes further requirements, and other tools may be necessary to satisfy them. However, as datasets become organised and described, their value and their potential for reuse will prompt further preservation actions.
2 From publications to data management
The growth in the number of research publications, combined with a strong drive towards open access policies [8,10], continues to foster the development of open-source platforms for managing bibliographic records. While data citation is not yet a widespread practice, the importance of citable datasets is growing. Until a culture of data citation is widely adopted, however, many research groups are opting to publish so-called “data papers”, which are more easily citable than datasets. Data papers serve not only as a reference to datasets but also document their production context [9].
As data management becomes an increasingly important part of the research workflow [24], solutions designed for managing research data are being actively developed by both open-source communities and data management-related companies. As with institutional repositories, many of their design and development challenges have to do with description and long-term preservation of research data. There are, however, at least two fundamental differences between publications and datasets: the latter are often purely numeric, making it very hard to derive any type of metadata by simply looking at their contents; also, datasets require detailed, domain-specific descriptions to be correctly interpreted. Metadata requirements can also vary greatly from domain to domain, requiring repository data models to be flexible enough to adequately represent these records [35]. The effort invested in adequate dataset description is worthwhile, since it has been shown that research publications that provide access to their base data consistently yield higher citation rates than those that do not [27].
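The need for flexible data models can be illustrated with a minimal sketch in Python. It is not taken from any of the platforms discussed here: the class name DatasetRecord, its fields and the sample descriptors are all hypothetical, chosen only to show a fixed generic core combined with open-ended, domain-specific descriptors.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Generic descriptors shared by every dataset, plus open-ended domain ones."""
    title: str
    creator: str
    date: str
    # Domain-specific descriptors are kept as free-form key/value pairs so that
    # each research domain can add its own elements without a schema change.
    domain_descriptors: dict[str, str] = field(default_factory=dict)

    def describe(self, element: str, value: str) -> None:
        """Attach one domain-specific descriptor to the record."""
        self.domain_descriptors[element] = value

    def flatten(self) -> dict[str, str]:
        """Merge generic and domain-specific descriptors into one mapping."""
        generic = {"title": self.title, "creator": self.creator, "date": self.date}
        return {**generic, **self.domain_descriptors}


# Two hypothetical datasets from different domains share the same generic core.
sea_temperatures = DatasetRecord("Sea surface temperatures", "A. Researcher", "2015-10-01")
sea_temperatures.describe("instrument", "thermistor chain")
sea_temperatures.describe("samplingDepthMetres", "5")

interviews = DatasetRecord("Interview transcripts", "B. Researcher", "2015-10-02")
interviews.describe("language", "pt")
```

Under these assumptions, a repository can index the generic fields uniformly while still storing whatever each domain considers essential to interpret the data.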
As these repositories deal with a reasonably small set of managed formats for deposit, several reference models, such as the OAIS (Open Archival Information System) [12], are currently in use to ensure preservation and to promote metadata interchange and dissemination. Besides capturing the available metadata during the ingestion process, data repositories often distribute this information to other instances, improving the publications’ visibility through specialised research search engines or repository indexers. While the former focus on querying each repository for exposed contents, the latter help users find data repositories that match their needs, such as repositories from a specific domain or storing data from a specific community. Governmental institutions are also promoting the disclosure of open data to improve citizen commitment and government transparency, and this motivates the use of data management platforms in this context.

2.1 An overview on existing repositories

While depositing and accessing publications from different domains is already possible in most institutions, ensuring the same level of accessibility to data resources is still challenging, and different solutions are being tried out to expose and share data in some communities. Addressing this issue, we synthesize a preliminary classification of these solutions according to their specific purpose: they are targeted either at staging and early research activities or at managing deposited datasets and making them available to the community.

Table 1 identifies features of the selected platforms that may render them convenient for data management. To build the table, the authors resorted to the documentation of the platforms, and to basic experiments with demonstration instances, whenever available. In the first column, under “Registered repositories”, is the number of running instances of each platform, according to the OpenDOAR platform as of mid-October 2015.

In the analysis, five evaluation criteria that can be relevant for an institution to make a coarse-grained assessment of the solutions are considered. Some existing tools were excluded from this first analysis, mainly because some of their characteristics place them outside of the scope of this work. This is the case of platforms specifically targeting research publications (and that cannot be easily modified for managing data), and heavy-weight platforms targeted at long-term preservation. Also excluded were those that, from a technical point of view, do not comply with desirable requirements for this domain such as adopting an open-source approach, or providing access to their features via comprehensive APIs.

By comparing the number of existing installations, it is natural to assume that a large number of instances for a platform is a good indication of the existence of support for its implementation. Repositories such as DSpace are widely used among institutions to manage publications. Therefore, institutions using DSpace to manage publications can use their support for the platform to expand or replicate the repository and meet additional requirements.

It is important to mention that some repositories do not implement interfaces with existing repository indexers, and this may cause the OpenDOAR statistics to show a value lower than the actual number of existing installations. Moreover, services provided by EUDAT, Figshare and Zenodo, for instance, consist of a single installation that receives all the deposited data, rather than a distributed array of manageable installations.

Government-supported platforms such as CKAN are currently being used as part of the open government initiatives in several countries, allowing the disclosure of data related to sensitive issues such as budget execution, and their aim is to vouch for transparency and credibility towards tax payers [21,20]. Although not specifically tailored to meet research data management requirements, these data-focused repositories also have an increasing number of instances supporting complex research data management workflows [38], even at universities¹.

Access to the source code can also be a valuable criterion for selecting a platform, primarily to avoid vendor lock-in, which is usually associated with commercial software or other provided services. Vendor lock-in is undesirable from a preservation point of view as it places the maintenance of the platform (and consequently the data stored inside) in the hands of a single vendor, which may not be able to provide support indefinitely. The availability of a platform’s source code also allows additional modifications to be carried out in order to create customized workflows; examples include improved metadata capabilities and data browsing functionalities. Commercial solutions such as ContentDM may incur high costs for the subscription fees, which can make them cost-prohibitive for non-profit organizations or small research institutions. In some cases only a small portion of the source code for the entire solution is actually available to the public. This is the case with EUDAT, where only the B2Share module is currently open²; the remaining modules are unavailable to date.

From an integration point of view, the existence of an API can allow for further development and help with the repository maintenance, as the software ages. Solutions that do not, at least partially, comply with this requirement, may hinder the integration with external platforms to improve the visibility of existing contents. The lack of an API creates a barrier to the development of tools to support a platform in specific environments, such as laboratories that frequently produce data to be directly deposited and disclosed. Finally, regarding long-term preservation, some platforms fail to provide unique identifiers for the resources upon deposit, making persistent references to data and data citation in publications hard.
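As an illustration of the kind of tool an API enables, the following Python sketch prepares a deposit request that a laboratory script could issue after each acquisition. The endpoint URL, the JSON field names and the authorization scheme are assumptions made for the example; each platform defines its own API, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; a real platform documents its own URL scheme.
DEPOSIT_URL = "https://repository.example.org/api/datasets"


def build_deposit_request(title: str, creator: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST carrying a dataset description."""
    payload = json.dumps({"title": title, "creator": creator}).encode("utf-8")
    return urllib.request.Request(
        DEPOSIT_URL,
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


request = build_deposit_request("Spectrometer run 42", "Lab instrument", "hypothetical-key")
# urllib.request.urlopen(request) would submit it to a live service.
```

Without a programmatic interface of this kind, the same deposit would have to be repeated by hand through the web interface of the platform.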

¹ http://ckan.org/2013/11/28/ckan4rdm-st-andrews/

² Source code repository for B2Share is hosted via GitHub at https://github.com/EUDAT-B2SHARE/b2share

Table 1: Limitations of the identified repository solutions. Source: OpenDOAR platform; corresponding website.
† Only available through additional plug-ins.
* Only partially.

Columns: Platform | Registered repositories | Closed source | API | No unique identifiers | Complex installation or setup | No OAI-PMH compliance

CKAN | 139 | † *
ContentDM | 53 | ✕
Dataverse | 2
Digital Commons | 141 | ✕✕
DSpace | 1305
ePrints | 407 | ✕ †
EUDAT | - | ✕ *
Fedora | 41 | ✕
Figshare | - | ✕
Greenstone | 51 | ✕✕ ✕
Invenio | 20
Omeka | 4 | ✕✕ †
SciELO | 18 | ✕
WEKO | 40 | No data
Zenodo | -

Support for flexible research workflows makes some repository solutions attractive to smaller institutions looking for solutions to implement their data management workflows. Both DSpace and ePrints, for instance, are quite common as institutional repositories to manage publications, as they offer broad compatibility with the harvesting protocol OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) [22] and with preservation guidelines according to the OAIS model. OAIS requires the existence of different packages with specific purposes, namely SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package). The OAIS reference model defines SIP as a representation of packaged items to be deposited in the repository. AIP, on the other hand, represents the packaged digital objects within the OAIS-compliant system, and DIP holds one or several digital artifacts and their representation information, in such a format that can be interpreted by potential users.
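The relation between the three packages can be sketched as follows. This is a simplified illustration rather than a rendering of the OAIS information model: the class fields and the ingest and disseminate helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class SIP:
    """Submission Information Package: what the depositor hands over."""
    files: tuple[str, ...]
    metadata: dict


@dataclass
class AIP:
    """Archival Information Package: what the repository keeps internally."""
    identifier: str
    files: tuple[str, ...]
    metadata: dict


@dataclass
class DIP:
    """Dissemination Information Package: what a user receives on request."""
    identifier: str
    files: tuple[str, ...]
    representation_info: dict


def ingest(sip: SIP, identifier: str) -> AIP:
    # Ingest assigns an internal identifier and stores files with their metadata.
    return AIP(identifier, sip.files, dict(sip.metadata))


def disseminate(aip: AIP, wanted: set[str]) -> DIP:
    # A DIP may expose only some of the archived files, together with the
    # information needed to interpret them.
    selected = tuple(name for name in aip.files if name in wanted)
    return DIP(aip.identifier, selected, dict(aip.metadata))


sip = SIP(("readings.csv", "notes.txt"), {"title": "Sensor readings", "format": "text/csv"})
aip = ingest(sip, "dataset-0001")
dip = disseminate(aip, {"readings.csv"})
```

The sketch leaves out everything a real archive adds during ingest, such as fixity checks and preservation metadata; it only shows that each package serves a different stage of the deposit, storage and access cycle.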

2.2 Stakeholders in research data management

Several stakeholders are involved in dataset description throughout the data management workflow, playing an important part in their management and dissemination [24,7]. These stakeholders (researchers, research institutions, curators, harvesters, and developers) play a governing role in defining the main requirements of a data repository for the management of research outputs. As key metadata providers, researchers are responsible for the description of research data. They are not necessarily knowledgeable in data management practices, but can provide domain-specific, more or less formal descriptions to complement generic metadata. This captures the essential data production context, making it possible for other researchers to reuse the data [7]. As data creators, researchers can play a central role in data deposit by selecting appropriate file formats for their datasets, preparing their structure and packaging them appropriately [15]. Institutions are also motivated to have their data recognized and preserved according to the requirements of funding institutions [17,26]. In this regard, institutions value metadata in compliance with standards, which make data ready for inclusion in networked environments, therefore increasing their visibility. To make sure that this context is correctly passed, along with the data, to the preservation stage, curators are mainly interested in maintaining data quality and integrity over time. Usually, curators are information experts, so it is expected that their close collaboration with researchers can result in both detailed and compliant metadata records.

Considering data dissemination and reuse, harvesters can be either individuals looking for specific data or services which index the content of several repositories. These services can make particularly good use of established protocols, such as the OAI-PMH, to retrieve metadata from different sources and create an interface to expose the indexed resources. Finally, contributing to the improvement and expansion of these repositories over time, developers are concerned with the underlying technologies, and also with having extensive APIs to promote integration with other tools.
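The harvesting step described above can be sketched in Python. The verb ListRecords, the metadataPrefix argument and the oai_dc format come from the OAI-PMH specification; the base URL, the helper names and the sample response are assumptions made for the example, and no network request is issued.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of Dublin Core elements, as used inside oai_dc records.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request for a repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def extract_titles(response_xml: str) -> list[str]:
    """Collect every dc:title found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [title.text for title in root.iterfind(".//dc:title", NS)]


# Hand-written sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sea surface temperatures</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
titles = extract_titles(SAMPLE)
```

A harvesting service would repeat this request against each registered repository and merge the extracted metadata into its own index.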
3 Scope of the analysis
The stakeholders in the data management workflow can greatly influence whether research data is reused. The selection of platforms in the analysis acknowledges their role, as well as the importance of the adoption of community standards to help with data description and management in the long run.
For this comparison, data management platforms with instances running at both research and government institutions have been considered, namely DSpace, CKAN, Zenodo, Figshare, ePrints, Fedora and EUDAT. If the long-term preservation of research assets is an important requirement of the stakeholders in question, other alternatives such as RODA [30] and Archivematica may also be considered strong candidates, since they implement comprehensive preservation guidelines not only for the digital objects themselves but also for their whole life cycle and associated processes. On one hand, these platforms have a strong concern with long-term preservation by strictly following existing standards such as OAIS, PREMIS or METS, which cover the different stages of a long-term preservation workflow. On the other hand, such solutions are usually harder to install and maintain by institutions in the so-called long tail of science: institutions that create large numbers of small datasets, but do not possess the necessary financial resources and preservation expertise to support a complete preservation workflow [18].
The Fedora framework³ is used by some institutions, and is also under active development, with the recent release of Fedora 4. The fact that it is designed as a framework to be fully customized and instantiated, instead of being a “turnkey” solution, places Fedora on a different level, one that cannot be directly compared with other solutions. Two examples of Fedora’s implementations are Hydra⁴ and Islandora⁵. Both are open-source, capable of handling research workflows, and use the best-practices approach already implemented in the core Fedora framework. Although these are not present in the comparison table, this section will also consider their strengths, when compared to the other platforms.

³ http://www.fedora-commons.org/
⁴ http://projecthydra.org/
⁵ http://islandora.ca/

An overview of the previously identified stakeholders led to the selection of two important dimensions for the assessment of the platform features: their architecture and their metadata and dissemination capabilities. The former includes aspects such as how they are deployed into a production environment, the locations where they keep their data, whether their source code is available, and other aspects that are related to the compliance with preservation best practices. The latter focuses on how resource-related metadata is handled and the level of compliance of these records with established standards and exchange protocols. Other important aspects are their adoption within the research communities and the availability of support for extensions. Table 2 shows an overview of the results of our evaluation.
4 Platform comparison
Based on the selection of the evaluation scope, this section addresses the comparison of the platforms according to key features that can help in the selection of a platform for data management. Table 2 groups these features in two categories: (i) Architecture, for structural-related characteristics; and (ii) Metadata and dissemination, for those related to flexible description and interoperability. This analysis is guided by the use cases in the research data management environment.
4.1 Architecture
Regarding the architecture of the platforms, several aspects are considered. From the point of view of a research institution, a quick and simple deployment of the selected platform is an important aspect. There are two main scenarios: the institution can either outsource to an external service or install and customize its own repository, supporting the infrastructure maintenance costs. Contracting a service provided by a dedicated company such as Figshare or Zenodo delegates platform maintenance for a fee. The service-based approach may not be viable in some scenarios, as some researchers or institutions may be reluctant to deposit their data in a platform outside their control [11]. DSpace, ePrints, CKAN or any Fedora-based solution can be installed and run completely under the control of the research institution and therefore offer better control over the stored data. As open-source solutions, they also have