
A Survey of Scholarly Data: From Big Data Perspective

TL;DR
This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.

Abstract
Highlights:
- Surveys big scholarly data with respect to the different phases of the big data lifecycle.
- Identifies the different big data tools and technologies that can be used for the development of scholarly applications.
- Investigates research challenges and limitations specific to big scholarly data and its applications.
- Provides research directions and paves the way towards the development of a generic and comprehensive big scholarly data platform.

Recently, there has been a shifting focus of organizations and governments towards digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of this generated data satisfy the big data definition, as a result of which this scholarly reserve is popularly referred to as big scholarly data. In order to facilitate data analytics for big scholarly data, architectures and services for the same need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications like collaborator discovery, expert finding and research recommendation systems, in addition to several others. This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.



Khan, Samiya; Liu, Xiufeng; Shakil, Kashish A.; Alam, Mansaf
Published in: Information Processing & Management, 53(4), 923-944 (2017)
DOI: 10.1016/j.ipm.2017.03.006
Document version: Peer reviewed version
Citation (APA): Khan, S., Liu, X., Shakil, K. A., & Alam, M. (2017). A Survey of Scholarly Data: From Big Data Perspective. Information Processing & Management, 53(4), 923-944. https://doi.org/10.1016/j.ipm.2017.03.006

Cloud-Based Big Data Management and Analytics for Scholarly Resources: Current Trends, Challenges and Scope for Future Research
Samiya Khan, Kashish A. Shakil, and Mansaf Alam
Abstract—With the shifting focus of organizations and governments towards digitization of academic and technical documents, there has been an increasing need to use this reserve of scholarly documents for developing applications that can facilitate and aid in better management of research. In addition to this, the evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing need for scholarly applications like collaborator discovery, expert finding and research recommendation systems. This research paper reviews the current trends and identifies the challenges existing in the architecture, services and applications of the big scholarly data platform, with a specific focus on directions for future research.
Index Terms—Cloud-based Big Data Analytics, Scholarly Resources, Big Scholarly Data, Big Scholarly Data Platform, Cloud-based Big Data Management, Big Data Analytics
1 INTRODUCTION
The digital world is facing the aftermath of a data explosion, which has led to the coining of terms like data deluge. In simple terms, data deluge is a phrase used to describe the excessively huge volume of data generated in the world at an ever-increasing rate. Organizations are overwhelmed by the processing and storage requirements of such large volumes of data. With that said, another implication of the data deluge is that it has rendered the traditional scientific method obsolete.
Traditionally, the scientific method for solving a problem requires definition of the problem, proposal of a solution and collection of data that can solve or support a solution to the problem. However, abundant, easily accessible data is present today. In order to make use of this reservoir of data, researchers need to ask the right questions that this data can answer for them. Therefore, the approach has changed from 'ask the question; collect data' to 'frame a question that the available data can answer'. In order to support this new approach, particularly for scholarly resources, big scholarly data analytics has come into existence.
Scholarly documents are generated on a daily basis in the form of research documents, project proposals, technical reports and academic papers, in addition to several other types of documents, by researchers and students from all over the world. Moreover, there have been several initiatives by governments and organizations to digitize existing academic resources [7][8][9]. It is this huge reservoir of academic data that is popularly referred to as 'scholarly data'. However, it is important to note that this is a generalized description and the definition may vary from one scholarly community to another. For instance, Google Scholar does not count patents as a scholarly resource.
With that said, the abundance of data sources makes large-scale analysis of scholarly data possible and feasible. However, commercially available solutions in this area are rather limited. There have been several research efforts in the field of academic search engines. Some of the popular search engines include CiteSeerX [1] and Google Scholar [2]. In addition, assessment and benchmarking tools like Microsoft Academic Search [3] and AMiner [4] also exist. While these are primary sources of scholarly data, BASE [5] and Q-Sensei Scholar [6] are services that depend on secondary sources of preprocessed data.
Big scholarly data analytics has far-reaching implications for the ease with which research is performed. Primarily, analytics for big scholarly data can be divided into four categories, namely research management, collaborator discovery, expert finder systems and recommender systems. Such analytics have gained immense importance and relevance lately, particularly with the advent of multi-disciplinary research projects.
Such projects have increased the scale and complexity of research problems manifold and emphasize the pressing need for collaboration among researchers as well as institutes and organizations. Research collaboration is not a new concept. However, there has been a recent shift in the manner in which collaborations are initiated. Traditionally, researchers and scholars used to meet periodically at conferences and symposiums to explore new research domains and possibilities for collaboration.
————————————————
Samiya Khan is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: samiyashaukat@yahoo.com.
Kashish A. Shakil is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: shakilkashish@yahoo.co.in.
Mansaf Alam is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: malam2@jmi.ac.in.

With the increasing popularity of the Internet, these platforms have been complemented by academic search-oriented web engines like Google Scholar and academic social networking portals like ResearchGate [35] and Academia [36]. While these platforms allow researchers to follow each other's research activities and interests, they have also created a sense of realization in the research community that the final published article is merely a milestone in research.
Other aspects of research, like the dataset used and the supporting material considered for the research, are equally important. This is one of the reasons for the staggering rise of interest in research data management. Although research management, collaborator discovery and expert finding remain popular analytics applications, several other useful applications can be implemented to make optimal use of the heaps of scholarly data available, providing personal, local and global insights into the research work performed in this area.
This research paper aims to study the current trends in cloud-based data management and analytics of big scholarly data and identify the challenges that continue to exist in the different phases of the system. Besides this, it shall also give an analysis of the scope for future research in this field. The rest of the paper has been organized in the following manner: Section 2 gives an introduction to cloud-based big data analytics and reviews an existing platform for big scholarly data, which also serves as the base for future research work in big scholarly data analytics.
The trends, challenges and research directions have been classified under three main categories, namely data management, analytics and visualization. Section 3, Section 4 and Section 5 cover these three categories in detail. The challenges discussed in these three sections constitute only technical challenges. This field of study also suffers from some non-technical challenges, which are described in Section 6. The paper concludes with a remark on the scope of research in this area and future research directions.
2 BACKGROUND AND METHODOLOGY
Big data analytics is a vast field that has found applications in diverse domains and studies. Some of the most impactful research efforts that have merged big data analytics with other fields of study include business analytics, multi-scale climate data analytics [11], banking customer analytics [14], smart cities [16], recommender systems for ecommerce [13], social media analytics [12], healthcare data analytics [15], intelligent transport management systems [18] and railway asset management systems [17].
Evidently, the type of data analytics required for fulfillment of the needs of specific fields differs. Chen and Zhang [19] provided an extensive survey of the tools, techniques and technologies used for big data analytics. The most common mathematical tools used for analysis of data include fundamental mathematical concepts, statistical tools and methods for solving optimization problems. On the other hand, analytical techniques required for making big data analytics feasible and usable for end users include machine learning, data mining, signal processing, neural networks and visualization methods.
In order to implement the techniques mentioned above, MapReduce and Hadoop [20] have been identified as the most effective and efficient framework. Hadoop is an open-source implementation of the MapReduce programming model that allows distributed processing of huge volumes of heterogeneous data using commodity machines. Although the research work paid little heed to deploying Hadoop on the cloud, it indicated that cloud computing is one of the proposed technologies for backing big data analytics applications.
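As a concrete illustration of the programming model, the following is a minimal Hadoop Streaming sketch in Python that counts publications per author. The input format (one tab-separated author/title record per line) and the script name are illustrative assumptions, not part of any surveyed platform.

```python
#!/usr/bin/env python3
"""Minimal MapReduce sketch for Hadoop Streaming: publications per author.

Assumed (hypothetical) input: one record per line, "author<TAB>title".
Local test: cat metadata.tsv | python3 mr_pubs.py map | sort | python3 mr_pubs.py reduce
On a cluster, pass the same script as -mapper "mr_pubs.py map" and
-reducer "mr_pubs.py reduce" to the Hadoop Streaming jar.
"""
import sys
from itertools import groupby

def mapper():
    # Emit (author, 1) for every well-formed document record on stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key; sum the counts per author.
    for author, lines in groupby(sys.stdin, key=lambda l: l.split("\t", 1)[0]):
        total = sum(int(l.rstrip("\n").split("\t")[1]) for l in lines)
        print(f"{author}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```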
Cloud computing promises to be a good solution to the big data problem considering the scalability and elasticity that it offers [25]. However, the viability of this synergistic model is yet to be explored and tested. Big data computing, particularly in the cloud environment, itself suffers from some inherent challenges [24][26].
Assuncao et al. [21] presented the technical and non-technical challenges associated with cloud-based big data analytics, with specific emphasis on the relevant work that has been performed in each sub-area. While the latter deal with issues concerning the management and adoption of these solutions, the former have been further classified into three categories, namely data management; model building and scoring; and visualization and user interaction. A typical workflow for big data analytics given by [21] is illustrated in Fig. 1.
One of the pioneering research projects in the field of big scholarly data is CiteSeerX. Wu et al. [10] presented a platform for big scholarly data, which proposes to move the then-existing CiteSeer system to a private cloud. Teregowda and Giles [160] elaborated on this in a detailed report on scaling SeerSuite in the cloud environment. The platform is divided into three components, namely architecture, services and applications. The system makes use of a Crawl Cluster, HDFS, NoSQL and MapReduce for implementation.
Fig. 1. Workflow for Big Data Analytics

The proposed system can broadly be divided, on the basis of user interaction, into two sections: frontend and backend. The frontend includes load balancers and web servers. This interface allows users to interact with the system, takes their requests and communicates the results back to them. On the other hand, the backend performs crawling of web sources for relevant data, extraction of information from raw data and ingestion of information into the system to support applications like research management, collaborator discovery and expert finding, in addition to several others. An illustration of the big scholarly data platform proposed by Wu et al. [10] is presented in Fig. 2.
On the basis of the architecture, challenges and research directions proposed by Wu et al. [10], this research divides the challenges presented by cloud-based analytics of big scholarly data into technical and non-technical challenges. Research papers under each category have been analyzed using a qualitative research methodology to provide an extensive survey of the cloud-based big scholarly data platform. The technical challenges are further divided on the basis of the functionality to which they belong. The three categories, namely data management, analytics and visualization, are covered in the sections that follow.
3 SCHOLARLY DATA MANAGEMENT
Data is generated in many diverse forms in any scholarly platform. One of the primary sources of data is the huge reservoir of existing scholarly documents on the Internet. In addition to this, there are author webpages, academic social networks and secondary sources of scholarly information like institution and organization webpages, which also render significant data for a comprehensive analysis of the scholarly community. Evidently, there are several sources of data, providing different types of information. Moreover, this data is continuously updated, appended and removed. Challenges in data management can be further divided into five sub-categories: (i) big data characteristics, (ii) data acquisition and integration, (iii) information extraction, (iv) data preprocessing and (v) data processing and resource management. The different facets of data management of big scholarly data are discussed below.
3.1 Big Scholarly Data Characteristics
Big data is traditionally characterized by three main features, namely volume, variety and velocity. It can be derived from the meaning of these words that volume characterizes the size of data, variety symbolizes the types of data included and velocity indicates the rate of data generation.
The volume of data can be assessed by evaluating the size of scholarly documents available on the web as raw data. Khabsa and Giles [23] estimated that the number of English scholarly documents available on the Internet is approximately 114 million, and that this count grows at a daily rate of tens of thousands. It is crucial to mention here that this is a lower-bound value. It has also been stated that Google Scholar accommodates 87% of the total [23]. Therefore, the number of English scholarly documents on Google Scholar is around 100 million [23].
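The arithmetic behind these estimates is easy to reproduce as a quick sanity check:

```python
# Back-of-the-envelope check of the estimates reported in [23].
total_docs = 114_000_000   # lower-bound count of English scholarly documents on the web
scholar_share = 0.87       # fraction reportedly accessible via Google Scholar
print(round(total_docs * scholar_share / 1_000_000))  # -> 99, i.e. roughly 100 million
```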
It is important to understand that the big scholarly dataset is not just limited to scholarly documents. Information extracted from raw data and linked to create citation and knowledge graphs is also a significant contributor to the size, variety and volume of big scholarly data. Caragea et al. [32] gave an estimate of the big scholarly dataset maintained at CiteSeerX until May 2013. The total number of documents in the aforementioned system was approximately 2.35 million. However, this count includes duplicates and, upon removal of the same, the approximate count is reduced to 1.9 million documents. In addition to this, the number of unique authors in the database is 2.4 million, while the number of citations, which includes repetitions, is about 52 million.
From a data size perspective, Caragea et al. [32] estimated the size of CiteSeerX to be 6TB, growing at a daily rate of 10-20GB. From the numbers stated above, it can be inferred that the scholarly dataset is indeed 'big'. Specifically, there are three main reasons why scholarly data is called big scholarly data, which are as follows:
1. Firstly, the storage and computing resource requirements of this data are too high to be provisioned by traditional architectures. For instance, common scholarly applications like collaborator discovery require services like author profiling and disambiguation. This is a compute-intensive task, which requires the system to work on 'big' data. Moreover, one of the fundamental requirements of this system is smart resource allocation and scheduling.
2. Secondly, the data throughput requirements of the system need a better data processing framework and tools. The single-pipeline system is a bottleneck, particularly in the case of data ingestion.
Fig. 2. Big Scholarly Data Platform

3. Lastly, static crawling techniques do not provide the coverage and data filtering accuracy that such systems and applications require. Besides this, existing document classifier systems perform basic classification, separating academic documents from non-academic documents. For advanced applications, more sophisticated classification, on the basis of document type and subject, is required.
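The more sophisticated classification called for in the last point can be illustrated with a short sketch. The tiny inline training set, the label names and the bag-of-words model below are assumptions for illustration, not the classifier used by any of the surveyed systems:

```python
# Sketch of document-type classification beyond the academic / non-academic
# split, using tf-idf features and a linear model (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets; real systems would train on full documents.
train_texts = [
    "abstract introduction methodology experiments references",
    "abstract related work evaluation references bibliography",
    "project summary budget milestones deliverables timeline",
    "statement of work objectives funding schedule deliverables",
    "lecture notes chapter exercises solutions syllabus",
    "course outline homework reading list syllabus",
]
train_labels = ["paper", "paper", "proposal", "proposal", "notes", "notes"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["introduction experiments and references"]))  # expected: ['paper']
```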
In addition to the standard 3V characteristics, Wu et al. [22] gave many new attributes, transforming the 3V model into a multi-V model. Additional characteristics include veracity, value, variability, validity, visibility and verdict. A Venn diagram for the multi-V model is shown in Fig. 3. The three Vs value, visibility and verdict constitute the business intelligence (BI) aspects of the data concerned.
The visibility characteristic provides the foresight, hindsight and insight of the data, as opposed to the traditional 3Vs that only focus on insight. From the BI perspective, it is important to know if the data is capable of contributing anything substantial, which defines the 'value' of data. On the basis of analysis of the problem and its proposed solution, it is the decision makers' job to give a 'verdict'.
The statistical perspective on data is given by veracity, validity and variability. Veracity defines the trustworthiness of data, while validity determines if the data has been acquired ethically and without any bias. When data complexity and variety are analyzed, the implied characteristic that comes into being is 'variability'.
It is important to note that limited research has been performed on data veracity. Data quality has a direct impact on the quality of analytics produced, which makes veracity a significant big data characteristic, particularly for critical applications [41][43]. In addition, the privacy and security aspects of cloud-based big data solutions are also yet to be explored in full; these facets are remarkably significant in view of the fact that they are important user concerns [46] when working in the cloud environment.
Although validity is a largely conceptual characteristic and holds little significance in the present context, variability is particularly relevant to big scholarly data. The 3Vs associated with the business intelligence perspective solely depend on the ability of an organization to make use of the available data with the deployed solution. Moreover, there is no existing literature that discusses big scholarly data with respect to the statistical and business intelligence perspectives.
3.2 Data Acquisition and Integration
The first step of the data analytics process is data acquisition, as part of which data is collected from a single source or multiple sources and integrated to form the dataset that serves as input to the analytics engine. A big scholarly dataset is an integration of many types of documents, as illustrated in Fig. 4. These documents are retrieved from their respective sources. The primary source of data is the web, along with specialized databases like DBLP [37]. Moreover, portals like arXiv [38] and publishing houses like Elsevier [39] also provide APIs, which can be used to extract data.
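As an illustration of API-based acquisition, the following is a minimal sketch against arXiv's public Atom query endpoint (http://export.arxiv.org/api/query). The search term, result limit and title-only extraction are illustrative assumptions; a production harvester would add paging, rate limiting and error handling.

```python
"""Minimal sketch of scholarly data acquisition via the arXiv Atom API."""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed

def fetch_arxiv_titles(query: str, max_results: int = 5) -> list:
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    # Each <entry> element in the Atom feed describes one scholarly document.
    return [entry.findtext(f"{ATOM}title", "").strip()
            for entry in feed.iter(f"{ATOM}entry")]

if __name__ == "__main__":
    for title in fetch_arxiv_titles("big scholarly data"):
        print(title)
```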
In order to extract data from the web, two tools can be used, namely crawling and REST APIs. CiteSeerX uses focused crawling [161], as it only requires academic documents [40]. Two crawlers are employed: one performs scheduled crawling, while the other crawls URLs submitted by users, which are a source of rich and dependable data; both extract only PDFs. Moreover, the former satisfies the data freshness requirement of data acquisition by keeping the database updated with the latest publications.
The crawling process yields PDFs. However, the classification of documents as academic or non-academic is done as part of the document filtering process. The text of each PDF is extracted and, on the basis of the presence or absence of a Bibliography or References section at the end of the text, the document is classified as academic or non-academic. Only academic documents are kept and the rest are discarded.
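This filtering heuristic can be sketched in a few lines. The sketch below operates on already-extracted text, and the tail-fraction threshold is an illustrative assumption rather than the actual CiteSeerX rule:

```python
import re

# Match a "References" or "Bibliography" heading on a line of its own.
_HEADING = re.compile(r"^\s*(references|bibliography)\s*$",
                      re.IGNORECASE | re.MULTILINE)

def is_academic(text: str, tail_fraction: float = 0.4) -> bool:
    """Classify extracted PDF text as academic if a references/bibliography
    heading appears within the last `tail_fraction` of the document."""
    match = _HEADING.search(text)
    return bool(match) and match.start() >= len(text) * (1 - tail_fraction)

# A paper ending with a reference list passes; a brochure does not.
assert is_academic("Introduction...\n" * 50 + "References\n[1] J. Doe, 2015.")
assert not is_academic("Product brochure with no sources cited.")
```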
One of the most important facets of data acquisition is to determine if a single source is enough to get all the data required for providing accurate analysis. In order to address this concern, there have been several efforts to estimate the total size of big scholarly data, the value of which is then compared with the statistics published by individual sources.

Fig. 3. Venn diagram of the 3²V Model

Fig. 4. Big Scholarly Dataset Composition

Citations
Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases
TL;DR: A comparative picture of 12 of the most commonly used ASEBDs is provided by counting query hit data as an indicator of the number of accessible records; the results indicate that Google Scholar's size might have been underestimated so far by more than 50%.

A survey towards an integration of big data analytics to big insights for value-creation
TL;DR: This article presents a comprehensive, well-informed examination and realistic analysis of deploying big data analytics successfully in companies, and presents a methodical analysis of the usage of big data analytics in applications such as agriculture, healthcare, cyber security and smart cities.

Academic social networks: Modeling, analysis, mining and applications
TL;DR: This study investigates the background, current status and trends of academic social networks, and systematically reviews representative research tasks in this domain at three levels: actor, relationship and network.

Big data adoption: State of the art and research challenges
TL;DR: According to the findings, Technology-Organization-Environment and Diffusion of Innovations are the most popular theoretical models used for big data adoption in various domains, and forty-two factors in technology, organization, environment and innovation that have a significant influence on big data adoption are revealed.

Exploring the Online Doctor-Patient Interaction on Patient Satisfaction Based on Text Mining and Empirical Analysis
TL;DR: The results indicate that the patient's activeness has a positive effect on a doctor's informational and emotional support, and that the effect of emotional support on patient satisfaction is more significant than that of informational support.
References
An index to quantify an individual's scientific research output
TL;DR: The index h, defined as the number of papers with citation number ≥ h, is proposed as a useful index to characterize the scientific output of a researcher.

Co-citation in the scientific literature: A new measure of the relationship between two documents
TL;DR: A new form of document coupling called co-citation is defined as the frequency with which two documents are cited together, and clusters of co-cited papers provide a new way to study the specialty structure of science.

Data-intensive applications, challenges, techniques and technologies: A survey on Big Data
TL;DR: This paper aims to demonstrate a close-up view of Big Data, including Big Data applications, opportunities and challenges, as well as the state-of-the-art techniques and technologies currently adopted to deal with Big Data problems.

Recommender Systems Handbook
TL;DR: This handbook illustrates how recommender systems can support the user in decision-making, planning and purchasing processes, and how they work for well-known corporations such as Amazon, Google, Microsoft and AT&T.

Big data analytics in healthcare: promise and potential
TL;DR: Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs; its potential is great, however challenges remain to be overcome.
Frequently Asked Questions
Q1. What are the contributions mentioned in the paper "Cloud-Based Big Data Management and Analytics for Scholarly Resources: Current Trends, Challenges and Scope for Future Research"?

With the shifting focus of organizations and governments towards digitization of academic and technical documents, there has been an increasing need to use this reserve of scholarly documents for developing applications that can facilitate and aid in better management of research. This research paper reviews the current trends and identifies the challenges existing in the architecture, services and applications of the big scholarly data platform with a specific focus on directions for future research.

This survey includes a detailed study of the current trends and existing challenges in the different subsystems of the big scholarly data platform, with specific focus on directions for future research in this area. Suggested future work in the area includes the development of solutions and APIs. Most of the future work in this direction includes the creation of expressive languages that shall enable users to define their problem to the system, keeping in view that the operational efficiency of the system must improve as data grows. While CiteSeerX exists as one of the most popular scholarly platforms, the services provided are rather limited in their functionality and can be further enhanced to include many scholarly applications like research management, and optimized to provide added functionality like algorithm linking, time-evolution of research and recommendations.

Analytics for big scholarly data can be divided into four categories, namely research management, collaborator discovery, expert finder systems and recommender systems.

Many other types of information like venues where the author has published or presented work and detailed author information derived from the professional author webpage can be used to form a comprehensive author profile, which can be useful for advanced scholarly analytics like collaborator discovery and expert finding [86][87][88]. 

Citation linking and matching are important steps in the process, in view of the fact that some fields of metadata that may have been incomplete or extracted incorrectly can be corrected and completed from the data provided by the linkage.

In order to implement the techniques mentioned above, MapReduce and Hadoop [20] have been identified as the most effective and efficient framework.

Storing and processing unstructured data, and performing these activities such that aggregating and correlating data from different sources becomes simpler, also require research attention.

Computer Science research documents contain specific sections like pseudocodes and algorithms, which play an instrumental role in mapping research growth and evolution. 

The proposed approach captures the global coherence and local relatedness in the book by extracting concepts in each chapter and constructing a concept hierarchy.

Considering that data will be collected from heterogeneous sources and may exist in different formats, the concept of data linking can be used. 
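As a toy illustration of this linking idea, the sketch below joins records from two hypothetical metadata sources on a crudely normalized title key; the record structure and normalization rule are illustrative assumptions only:

```python
import re

def normalize_title(title):
    """Crude key for linking records about the same paper across sources:
    lowercase and strip punctuation and whitespace. Illustrative only."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def link_records(source_a, source_b):
    """Join two hypothetical metadata lists on the normalized title."""
    index = {normalize_title(r["title"]): r for r in source_a}
    return [(index[k], r) for r in source_b
            if (k := normalize_title(r["title"])) in index]

# Two differently formatted records for the same paper are linked.
dblp = [{"title": "A Survey of Scholarly Data: From Big Data Perspective"}]
crawl = [{"title": "A survey of scholarly data — from big data perspective."}]
print(len(link_records(dblp, crawl)))  # -> 1
```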

Gao et al. [110] reviewed structure extraction in books and proposed that extraction of the ToC and metadata from books can be seen as a matching problem on a bipartite graph.

Tuarob et al. [54] also proposed the use of an algorithm co-citation network to detect algorithmic-level similarity, which can further be extended to implement algorithm recommendation engines.

Challenges in data management can be further divided into five sub-categories: (i) big data characteristics, (ii) data acquisition and integration, (iii) information extraction, (iv) data preprocessing and (v) data processing and resource management.

Tuarob et al. [97] proposed a hybrid algorithm that can identify section boundaries, detect section headers and recognize the hierarchy of sections with good accuracy. 

Kim et al. [59] disambiguated the DBLP dataset using these three methods and compared their impact, concluding that author disambiguation can have a substantial influence on data quality and quality of service and analytics performed using the data.
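As a toy illustration of the disambiguation problem itself, the sketch below merges papers published under one ambiguous name only when they share a coauthor. This naive heuristic is an assumption for illustration and is far simpler than the methods compared by Kim et al. [59]:

```python
from itertools import combinations

def disambiguate(papers):
    """papers: [{'id': int, 'coauthors': set}, ...], all bearing the same
    ambiguous author name. Returns clusters of paper ids, one per inferred
    real-world author (union-find over shared-coauthor links)."""
    parent = {p["id"]: p["id"] for p in papers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union two papers whenever their coauthor sets overlap.
    for a, b in combinations(papers, 2):
        if a["coauthors"] & b["coauthors"]:
            parent[find(a["id"])] = find(b["id"])

    clusters = {}
    for p in papers:
        clusters.setdefault(find(p["id"]), set()).add(p["id"])
    return list(clusters.values())

# Papers 1 and 2 share a coauthor and merge; paper 3 stays separate.
print(disambiguate([
    {"id": 1, "coauthors": {"A. Kumar"}},
    {"id": 2, "coauthors": {"A. Kumar", "B. Lee"}},
    {"id": 3, "coauthors": {"C. Park"}},
]))  # -> [{1, 2}, {3}]
```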