e-Science and its implications for the library community
Tony Hey
Microsoft Corporation, Redmond, USA
Jessie Hey
School of Electronics and Computer Science and University of Southampton
Libraries, University of Southampton, Southampton, UK
Abstract
Purpose:
To explain the nature of the ‘e-Science’ revolution in twenty-first century scientific research and its
consequences for the library community.
Design/methodology/approach:
The concepts of e-Science are illustrated by a discussion of the
CombeChem, eBank and SmartTea projects. The issue of open access is then discussed with
reference to arXiv, PubMed Central and EPrints. The challenges these trends present to the library
community are discussed in the context of the TARDis project and the University of Southampton
Research Repository.
Findings:
Increasingly academics will need to collaborate in multidisciplinary teams distributed across
several sites in order to address the next generation of scientific problems. In addition, new
high-throughput devices, high-resolution surveys and sensor networks will result in an increase in scientific
data collected by several orders of magnitude. To analyze, federate and mine this data will require
collaboration between scientists and computer scientists; to organize, curate and preserve this data will
require collaboration between scientists and librarians. A vital part of the developing research
infrastructure will be digital repositories containing both publications and data.
Originality/value:
The paper provides a synthesis of e-Science concepts, the question of open
access to the results of scientific research, and a changing attitude towards academic publishing and
communication. The paper offers a new perspective on the coming demands on libraries and is of
special interest to librarians with strategic responsibilities.
Keywords:
Digital libraries, Digital storage
Paper type:
Research paper
Introduction
As Thomas Friedman (2005) eloquently explains in his book ‘The World is Flat’, the
convergence of communication and computing technologies is changing the world of
both business and leisure. It would be naïve to think that the academic research
community will be immune from these changes. The methodology of research in
many fields is changing and we are on the threshold of a new era of data-driven
science. In the last few decades computational science has emerged as a new
methodology for scientific research on an equal footing with the traditional
experimental and theoretical methodologies. Simulation is now used as a standard
weapon in the armoury of the scientist to explore domains otherwise inaccessible to
the traditional research methodologies - such as the evolution of the early universe,
the design of new materials, the exploration of climatology over geological timescales
and, of course, the weather forecasts we now take for granted. Its use in industry is
becoming even more widespread with computational fluid dynamics and finite
element simulations now an essential part of the design process, complementing
traditional experimental wind tunnel and safety testing in the aero and auto
manufacturing industries, with simulations of oil fields and analysis of seismic data
now playing a key role in the oil and gas industry, and with simulation playing an
increasingly important role in the drug design life cycle in the pharmaceutical industry.
The next decade will see the emergence of a new, fourth research
methodology, namely ‘e-Science’ or networked, data-driven science. Many areas of
science are about to be transformed by the availability of vast amounts of new
scientific data that can potentially provide insights at a level of detail never before
envisaged. However, this new data-dominant era brings new challenges for
scientists, who will need the skills and technologies both of computer scientists
and of the library community to manage, search and curate these new data
resources. Libraries will not be immune from change in this new world of research.
The advent of the Web is changing the face of scholarly publishing and the role of
publishers and libraries. The National Science Foundation Blue Ribbon Report on
Cyberinfrastructure lays out a vision of this new world. On publishing, the report
states:
The primary access to the latest findings in a growing number of fields is through the Web,
then through classic preprints and conferences, and lastly through refereed archival papers.
(Atkins et al., 2003, p. 9)
And on scientific data the report states:
Archives containing hundreds or thousands of terabytes of data will be affordable and
necessary for archiving scientific and engineering information. (Atkins et al., 2003, p. 11)
This paper explores some of the challenges facing both the scientific and library
communities in this new emerging world of research and delineates the key role that
can be played by computer science and by IT companies such as Microsoft in
assisting the research community.
e-Science and Licklider’s vision
It is no coincidence that it was at CERN, the particle physics accelerator laboratory in
Geneva, that Tim Berners-Lee invented the World Wide Web. Given the distributed
nature of the multi-institute collaborations required for modern particle physics
experiments, the particle physics community urgently needed a tool for exchanging
information. It was their community who first enthusiastically embraced the Web as a
mechanism for information exchange within their experimental collaborations and it
was no accident that the first Web site in the USA was at the Stanford Linear
Accelerator Center Library. As we all now know, since its beginnings in the early
1990’s, the Web has not only taken the entire scientific world by storm but also the
worlds of business and leisure. Now, just a decade or so later, scientists need to
develop capabilities for collaboration that go far beyond those of the original World
Wide Web. In addition to being able just to access information from different sites,
scientists now want to be able to use remote computing resources, to integrate,
federate and analyze information from many disparate and distributed data resources,
and to access and control remote experimental equipment. The ability to access,
move, manipulate and mine data is the central requirement of these new collaborative
science applications, whether the data is held in flat files or databases, or is data
generated by accelerators or telescopes, or data gathered in real time from potentially
mobile sensor networks.
In the United Kingdom, at the end of the 1990's, John Taylor became Director
General of Research Councils at the Office of Science and Technology (OST), a post
roughly equivalent to Director of the National Science Foundation (NSF) in the
USA. Taylor had been Director of Hewlett-Packard (HP) Laboratories in Europe, and
HP’s vision for the future of computing has long been that IT resources will become a
new ‘utility’. Rather than purchase IT infrastructure, users will pay for IT services as
they consume them, in the same way as the conventional utilities such as electricity,
gas and water – and now mobile telephones. In his role at the OST, overseeing the
funding of UK scientific research, Taylor realized that many areas of science could
benefit from a common IT infrastructure to support multi-disciplinary and distributed
collaborations. He articulated a vision for this type of distributed, collaborative science
and introduced the term ‘e-Science’:
e-Science is about global collaboration in key areas of science, and the next generation of
infrastructure that will enable it. (Taylor, 2001)
It is important to emphasize that e-Science is not a new scientific discipline in its own
right: e-Science is shorthand for the set of tools and technologies required to support
collaborative, networked science. The entire e-Science infrastructure is intended to
empower scientists to do their research in faster, better and different ways.
Of course, these problems are not new – the computer science community has
been grappling with the challenges of distributed computing for decades. Indeed,
such an e-Science infrastructure was very close to the vision that J.C.R. Licklider
(‘Lick’) took with him to ARPA (Advanced Research Projects Agency) when he
initiated the core set of research projects that led to the creation of the ARPANET.
Larry Roberts, one of his successors at ARPA and principal architect of the
ARPANET, described this vision as follows:
Lick had this concept of the intergalactic network which he believed was everybody could use
computers anywhere and get at data anywhere in the world. He didn’t envision the number of
computers we have today by any means, but he had the same concept all of the stuff linked
together throughout the world, that you can use a remote computer, get data from a remote
computer, or use lots of computers in your job. The vision was really Lick’s originally. (Segaller,
1998, p. 40)
The ARPANET of course led to the present-day Internet – but the killer applications
have so far been email and the Web rather than the distributed computing vision
originally described by Licklider. Of course, in the early 1960’s, Licklider was only
envisaging connecting a small number of rather scarce and expensive computers,
and at relatively few sites. However, over the past thirty years, Moore’s Law – Gordon
Moore’s prediction that the number of transistors on a chip would double about every
18 months, so that the price-performance is halved at the same time – has led to an
explosion in the number of supercomputers, mainframes, workstations, personal
computers and PDAs that are now connected to the Internet. Already we are
beginning to see programmable sensors and RFIDs – intelligent tagging devices –
being connected to the network.
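
The figures quoted above compound quickly: a doubling every 18 months amounts to twenty doublings over thirty years, or roughly a million-fold increase. The short calculation below is only an illustrative sketch of that arithmetic, not part of the original argument.

# Illustrative arithmetic for the Moore's Law figures quoted above.
years = 30
doubling_period_years = 1.5                    # one doubling roughly every 18 months

doublings = years / doubling_period_years      # 20 doublings in 30 years
growth_factor = 2 ** doublings                 # about 1.05 million-fold

print(f"{doublings:.0f} doublings -> ~{growth_factor:,.0f}x more transistors per chip,")
print("with price-performance improving by roughly the same factor.")
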

An example of e-Science: The CombeChem, eBank and SmartTea projects
The CombeChem project [1] was funded by the Engineering and Physical Sciences
Research Council in the UK and its goals were to enhance the correlation and
prediction of chemical structures and properties by using technologies for automation,
semantics and Grid computing (see Frey et al., 2003; Hughes et al., 2004). A key
driver for the project was the fact that large volumes of new chemical data are being
created by new high-throughput technologies. One example uses the technologies of
combinatorial chemistry in which large numbers of new chemical compounds are
synthesized simultaneously. The volume of data and the speed with which it can be
produced highlight the need for assistance in organizing, annotating and searching
this data. The CombeChem team consisted of a collection of scientists from several
disciplines – chemistry, computer science and mathematics – who developed a
prototype test-bed that integrated chemical structure-property data resources with a
‘Grid’ style distributed computing environment. The project explored automated
procedures for finding similarities in solid-state crystal structures across families of
compounds and evaluated new statistical design concepts in order to improve the
efficiency of combinatorial experiments in the search for new enzymes and
pharmaceutical salts for improved drug delivery.
The CombeChem project also explored some other important e-Science
themes. One theme concerned the use of a remote X-ray crystallography service for
determining the structure of new compounds. This service can be combined in
workflows with services for computer simulations on clusters or searches through
existing chemical databases. Another important e-Science theme was the exploration
of new forms of electronic publication of both the data and the research papers. This
e-Publication theme was examined in the eBank project [2], funded by the Joint
Information Systems Committee (JISC). One of the key concepts of the CombeChem
project was that of ‘Publication@Source’ which establishes a complete end-to-end
connection between the results obtained at the laboratory bench and the final
published analyses (Frey et al., 2002). This theme is linked to yet another of the
e-Science themes explored in the CombeChem project, one concerned with
human-computer interfaces and the digital capture of information. In the associated
SmartTea project [3], computer scientists studied the way chemists within the
laboratory used their lab notebooks and developed acceptable interfaces to handheld
tablet technology (see Schraefel et al., 2004a; Schraefel et al., 2004b). This is
important since it facilitates information capture at the very earliest stage of the
experiment. Using tablet PCs, the SmartTea system has been successfully trialed in a
synthetic organic chemistry laboratory and linked to a flexible back-end storage
system. A key usability finding was, not surprisingly, that users needed to feel in
control of the technology and that a successful interface must be adapted to their
preferred way of working. This necessitated a high degree of flexibility in the design of
the lab book user interface. The computer scientists on the team also investigated
the representation and storage of human-scale experiment metadata and introduced
an ontology to describe the record of an experiment.
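
The paper does not reproduce the experiment ontology itself, but the sketch below illustrates the general idea of recording an experiment as machine-processable metadata. It uses the Python rdflib library with an invented namespace (http://example.org/smarttea#) and hypothetical terms such as Experiment, performedBy and observation; the actual CombeChem/SmartTea vocabulary will differ.

# Hedged sketch only: a made-up vocabulary standing in for the experiment
# ontology described in the text; the real SmartTea terms are not shown here.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/smarttea#")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)

experiment = URIRef("http://example.org/experiments/synthesis-42")
g.add((experiment, RDF.type, EX.Experiment))
g.add((experiment, EX.performedBy, Literal("A. Chemist")))
g.add((experiment, EX.usesSample, Literal("2-acetoxybenzoic acid, 5 g")))
g.add((experiment, EX.recordedAt, Literal("2005-11-03T14:20:00", datatype=XSD.dateTime)))
g.add((experiment, EX.observation, Literal("solution turned pale yellow on heating")))

# Serialize the record; later annotations can be attached to the same URI
# without altering a fixed schema.
print(g.serialize(format="turtle"))

Because the record is just a set of triples, subsequent annotations – sensor readings, usage records, corrections – can be attached to the same experiment identifier as they arise, which is the kind of flexibility the back-end storage described next was designed for.
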
A novel storage system for the data from the electronic lab book was also
developed in the project. In the same way that the interfaces needed to be flexible to
cope with whatever chemists wished to record, the back-end solutions also needed to
be similarly flexible to store any metadata that might be created. This electronic lab
book data feeds directly into the scientific data processing. All usage of the data
through the chain of processing is now effectively an annotation upon it, and the data
provenance is explicit. The creation of original data is accompanied by information
about the experimental conditions in which it is created. There then follows a chain of
processing such as aggregation of experimental data, selection of a particular data
subset, statistical analysis and modeling and simulation. The handling of this
information may include explicit annotation of a diagram or editing of a digital image.
All of this generates secondary data, accompanied by the information that describes
the process that produced it. This digital record is therefore enriched and interlinked
by a variety of annotations such as relevant sensor data, usage records or explicit
interactions. By making these annotations machine processable, they can be used
both for their anticipated purpose and for subsequent unanticipated reuse. In the
CombeChem project this was achieved by deployment of Web Services and
Semantic Web technologies (Berners-Lee et al., 2001). RDF (Resource Description
Framework) was used throughout the system: at present there are over 70 million
RDF triples in the CombeChem triplestore. This system was found to give a much
higher degree of flexibility in the types of metadata that can be stored than
traditional relational databases.
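
As a hedged illustration of why triples suit this kind of open-ended provenance annotation, the sketch below records a small processing chain (raw data, a selected subset, a statistical analysis) and then walks back along an invented derivedFrom predicate with a SPARQL property-path query; neither the predicates nor the query reflect CombeChem's actual schema.

# Sketch of provenance as RDF triples; ex:derivedFrom and ex:producedBy are
# invented predicates for illustration, not the CombeChem vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/provenance#")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

raw      = URIRef("http://example.org/data/raw-spectrum-7")
subset   = URIRef("http://example.org/data/peaks-of-interest-7")
analysis = URIRef("http://example.org/data/fit-results-7")

g.add((subset,   EX.derivedFrom, raw))
g.add((subset,   EX.producedBy,  Literal("peak selection, 3-sigma threshold")))
g.add((analysis, EX.derivedFrom, subset))
g.add((analysis, EX.producedBy,  Literal("least-squares fit, model v2")))

# Walk the provenance chain of the final analysis back to the raw data.
query = """
PREFIX ex: <http://example.org/provenance#>
SELECT ?ancestor ?step WHERE {
    <http://example.org/data/fit-results-7> ex:derivedFrom+ ?ancestor .
    OPTIONAL { ?ancestor ex:producedBy ?step }
}
"""
for row in g.query(query):
    print(row.ancestor, "|", row.step)
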
In the sister eBank project, raw crystallographic data was annotated with
metadata and ‘published’ by being archived in the UK National Data Store as a
‘Crystallographic e-Print’ [2]. Publications can then be linked back directly to the raw
data for other researchers to access and analyze or verify. Another noteworthy
feature of the project was that pervasive computing devices were used to capture
laboratory conditions so that chemists could be notified in real time about the
progress of their experiments using hand-held PDAs.
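
A minimal sketch of the kind of record that links a publication back to its archived raw data is given below; the class, field names and identifiers are invented for illustration and do not reproduce the eBank metadata schema.

# Hypothetical sketch of a 'data e-Print' record; field names and identifiers
# are illustrative only and are not the eBank schema.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class DataEPrint:
    dataset_id: str                        # identifier of the archived raw data
    title: str
    creators: list[str]
    created: str                           # ISO date of data capture
    data_files: list[str]                  # raw and derived files in the archive
    linked_publication: str | None = None  # e.g. the DOI of the journal article

record = DataEPrint(
    dataset_id="ebank:crystal-0001",       # made-up identifier
    title="Crystal structure determination of compound X",
    creators=["A. Chemist", "B. Crystallographer"],
    created="2005-06-14",
    data_files=["raw/frames-0001.tar", "derived/structure-0001.cif"],
    linked_publication="doi:10.0000/example.0001",   # placeholder DOI
)

print(record.linked_publication, "->", record.data_files)
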
The imminent data deluge: A key driver for e-Science
One of the key drivers underpinning the e-Science movement is the imminent
availability of large amounts of data arising from the new generations of scientific
experiments and surveys (Hey and Trefethen, 2003). New high-throughput
experimental devices are now being deployed in many fields of science - from
astronomy to biology - and this will lead to a veritable deluge of scientific data over
the next 5 years or so. In order to exploit and explore the many Petabytes of
scientific data that will arise from such next-generation scientific experiments, from
supercomputer simulations, from sensor networks and from satellite surveys,
scientists will need the assistance of specialized search engines and powerful data
mining tools. To create such tools, the primary data will need to be annotated with
relevant metadata giving information such as the provenance, content and the
conditions that produced the data. Over the course of the next few years, scientists
will create vast distributed digital repositories of scientific data that will require
management services similar to those of more conventional digital libraries as well as
other data-specific services. As we have stressed, the ability to search, access,
move, manipulate and mine such data will be a central requirement – or a competitive
advantage - for this new generation of collaborative data-centric e-Science
applications.
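
To make the role of such metadata concrete, the sketch below filters a toy catalogue of annotated datasets by instrument and observation date; it illustrates metadata-driven search in general and is not a description of any particular e-Science search engine. All names and values are invented.

# Toy illustration of metadata-driven search over annotated datasets;
# the catalogue, fields and values are all invented for this example.
from datetime import date

catalogue = [
    {"id": "survey-001", "instrument": "telescope-A", "observed": date(2006, 1, 12),
     "provenance": "raw CCD frames, pipeline v1.3", "size_gb": 420},
    {"id": "survey-002", "instrument": "telescope-B", "observed": date(2006, 3, 2),
     "provenance": "raw CCD frames, pipeline v1.4", "size_gb": 510},
    {"id": "sensor-017", "instrument": "sensor-net-1", "observed": date(2006, 2, 20),
     "provenance": "aggregated hourly readings", "size_gb": 3},
]

def search(instrument_prefix: str, after: date) -> list[dict]:
    """Return datasets whose metadata matches the instrument and date criteria."""
    return [d for d in catalogue
            if d["instrument"].startswith(instrument_prefix) and d["observed"] >= after]

for hit in search("telescope", date(2006, 2, 1)):
    print(hit["id"], "|", hit["provenance"], "|", hit["size_gb"], "GB")
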

References
Friedman, T.L. (2005), The World Is Flat: A Brief History of the Twenty-first Century, Farrar, Straus and Giroux, New York, NY.
Lynch, C.A. (2003), "Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age", ARL Bimonthly Report, No. 226.
Leiner, B.M., Cerf, V.G., Clark, D.D., Kahn, R.E., Kleinrock, L., Lynch, D.C., Postel, J., Roberts, L.G. and Wolff, S., "A Brief History of the Internet", ACM SIGCOMM Computer Communication Review.