
A Survey of Scholarly Data: From Big Data Perspective

TL;DR
This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.

Abstract
Highlights:
- Surveys big scholarly data with respect to the different phases of the big data lifecycle.
- Identifies the different big data tools and technologies that can be used for the development of scholarly applications.
- Investigates research challenges and limitations specific to big scholarly data and its applications.
- Provides research directions and paves the way towards the development of a generic and comprehensive big scholarly data platform.

Recently, there has been a shifting focus of organizations and governments towards digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of this generated data satisfy the big data definition, as a result of which this scholarly reserve is popularly referred to as big scholarly data. In order to facilitate data analytics for big scholarly data, architectures and services for the same need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications like collaborator discovery, expert finding and research recommendation systems, in addition to several others. This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.



Khan, Samiya; Liu, Xiufeng; Shakil, Kashish A.; Alam, Mansaf
Published in: Information Processing & Management, 53(4), 923-944 (2017)
DOI: 10.1016/j.ipm.2017.03.006
Document version: Peer reviewed version
Citation (APA): Khan, S., Liu, X., Shakil, K. A., & Alam, M. (2017). A Survey of Scholarly Data: From Big Data Perspective. Information Processing & Management, 53(4), 923-944. https://doi.org/10.1016/j.ipm.2017.03.006

Cloud-Based Big Data Management and Analytics for Scholarly Resources: Current Trends, Challenges and Scope for Future Research
Samiya Khan, Kashish A. Shakil, and Mansaf Alam
Abstract—With the shifting focus of organizations and governments towards digitization of academic and technical documents, there has been an increasing need to use this reserve of scholarly documents for developing applications that can facilitate and aid in better management of research. In addition to this, the evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing need for scholarly applications like collaborator discovery, expert finding and research recommendation systems. This research paper reviews the current trends and identifies the challenges existing in the architecture, services and applications of the big scholarly data platform, with a specific focus on directions for future research.
Index Terms—Cloud-based Big Data Analytics, Scholarly Resources, Big Scholarly Data, Big Scholarly Data Platform, Cloud-based Big Data Management, Big Data Analytics
1 INTRODUCTION
The digital world is facing the aftermath of a data explosion, which has led to the coining of terms like data deluge. In simple terms, data deluge is a phrase used to describe the excessively huge volume of data generated in the world at an ever-increasing rate. Organizations are overwhelmed by the processing and storage requirements of such large volumes of data. With that said, another implication of the data deluge is that it has rendered the traditional scientific method obsolete.
Traditionally, the scientific method for solving a problem requires definition of the problem, proposal of a solution and collection of data that can solve or support a solution to the problem. However, abundant, easily accessible data is present today. In order to make use of this reservoir of data, researchers need to ask the right questions that this data can answer for them. Therefore, the approach has changed from 'ask the question; collect data' to 'frame a question that the available data can answer'. In order to support this new approach, particularly for scholarly resources, big scholarly data analytics has come into existence.
Scholarly documents are generated on a daily basis in the form of research documents, project proposals, technical reports and academic papers, in addition to several other types of documents, by researchers and students from all over the world. Moreover, there have been several initiatives by governments and organizations to digitize existing academic resources [7][8][9]. It is this huge reservoir of academic data that is popularly referred to as 'scholarly data'. However, it is important to note that this is a generalized description and the definition may vary from one scholarly community to another. For instance, Google Scholar does not count patents as a scholarly resource.
With that said, the abundance of data sources makes large-scale analysis of scholarly data possible and feasible. However, commercially available solutions in this area are rather limited. There have been several research efforts in the field of academic search engines. Some of the popular search engines include CiteSeerX [1] and Google Scholar [2]. In addition, assessment and benchmarking tools like Microsoft Academic Search [3] and AMiner [4] also exist. While these are primary sources of scholarly data, BASE [5] and Q-Sensei Scholar [6] are services that depend on secondary sources of preprocessed data.
Big scholarly data analytics has far-reaching implications for the ease with which research is performed. Primarily, analytics for big scholarly data can be divided into four categories, namely research management, collaborator discovery, expert finder systems and recommender systems. Such analytics have gained immense importance and relevance lately, particularly with the advent of multi-disciplinary research projects.
Such projects have increased the scale and complexity of research problems manifold and emphasize the pressing need for collaboration among researchers as well as institutes and organizations. Research collaboration is not a new concept. However, there has been a recent shift in the manner in which collaborations are initiated. Traditionally, researchers and scholars used to meet periodically at conferences and symposiums to explore new research domains and possibilities for collaboration.
————————————————
Samiya Khan is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: samiyashaukat@yahoo.com.
Kashish A. Shakil is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: shakilkashish@yahoo.co.in.
Mansaf Alam is with the Department of Computer Science, Jamia Millia Islamia, New Delhi, India. E-mail: malam2@jmi.ac.in.

With the increasing popularity of the Internet, these platforms have been complemented by academic search-oriented web engines like Google Scholar and academic social networking portals like ResearchGate [35] and Academia [36]. While these platforms allow researchers to follow each other's research activities and interests, they have also created a sense of realization in the research community that the final published article is merely a milestone in research.
Other aspects of research, like the dataset used and the supporting material considered for the research, are equally important. This is one of the reasons for the staggering rise of interest in research data management. Although research management, collaborator discovery and expert finding remain popular analytics applications, several other useful applications can be implemented to make optimal use of the heaps of scholarly data available, providing personal, local and global insights into the research work performed in this area.
This research paper aims to study the current trends in cloud-based data management and analytics of big scholarly data and identify the challenges that continue to exist in the different phases of the system. Besides this, it shall also give an analysis of the scope for future research in this field. The rest of the paper has been organized in the following manner: Section 2 gives an introduction to cloud-based big data analytics and reviews an existing platform for big scholarly data, which also serves as the base for future research work in big scholarly data analytics.
The trends, challenges and research directions have been classified under three main categories, namely data management, analytics and visualization. Section 3, Section 4 and Section 5 cover these three categories in detail. The challenges discussed in these three sections constitute only technical challenges. This field of study also suffers from some non-technical challenges, which are described in Section 6. The paper concludes with a remark on the scope of research in this area and future research directions.
2 BACKGROUND AND METHODOLOGY
Big data analytics is a vast field that has found applications in diverse domains and studies. Some of the most impactful research efforts that have merged big data analytics with other fields of study include business analytics, multi-scale climate data analytics [11], banking customer analytics [14], smart cities [16], recommender systems for ecommerce [13], social media analytics [12], healthcare data analytics [15], intelligent transport management systems [18] and railway asset management systems [17].
Evidently, the type of data analytics required for fulfillment of the needs of specific fields differs. Chen and Zhang [19] provided an extensive survey of the tools, techniques and technologies used for big data analytics. The most common mathematical tools used for analysis of data include fundamental mathematical concepts, statistical tools and methods for solving optimization problems. On the other hand, analytical techniques required for making big data analytics feasible and usable for end users include machine learning, data mining, signal processing, neural networks and visualization methods.
In order to implement the techniques mentioned above, MapReduce and Hadoop [20] have been identified as the most effective and efficient framework. Hadoop is an open-source implementation of the MapReduce programming model that allows distributed processing of huge volumes of heterogeneous data using commodity machines. Although the research work paid little heed to deploying Hadoop on the cloud, it indicated that cloud computing is one of the proposed technologies for backing big data analytics applications.
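As a concrete illustration of the programming model, the following is a minimal Hadoop Streaming sketch in Python that counts publications per author. The input format (one tab-separated author/title record per line) and the script name are illustrative assumptions, not part of any surveyed platform.

```python
#!/usr/bin/env python3
"""Minimal MapReduce sketch for Hadoop Streaming: publications per author.

Assumed (hypothetical) input: one record per line, "author<TAB>title".
Local test: cat metadata.tsv | python3 mr_pubs.py map | sort | python3 mr_pubs.py reduce
On a cluster, pass the same script as -mapper "mr_pubs.py map" and
-reducer "mr_pubs.py reduce" to the Hadoop Streaming jar.
"""
import sys
from itertools import groupby

def mapper():
    # Emit (author, 1) for every well-formed document record on stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key; sum the counts per author.
    for author, lines in groupby(sys.stdin, key=lambda l: l.split("\t", 1)[0]):
        total = sum(int(l.rstrip("\n").split("\t")[1]) for l in lines)
        print(f"{author}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```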
Cloud computing promises to be a good solution to the big data problem considering the scalability and elasticity that it offers [25]. However, the viability of this synergistic model is yet to be explored and tested. Big data computing, particularly in the cloud environment, itself suffers from some inherent challenges [24][26].
Assuncao et al. [21] presented the technical and non-technical challenges associated with cloud-based big data analytics, with specific emphasis on the relevant work that has been performed in each sub-area. While the latter deal with issues concerning the management and adoption of these solutions, the former have been further classified into three categories, namely data management; model building and scoring; and visualization and user interaction. A typical workflow for big data analytics given by [21] is illustrated in Fig. 1.
One of the pioneering research projects in the field of big scholarly data is CiteSeerX. Wu et al. [10] presented a platform for big scholarly data, which proposes to move the then-existing CiteSeer system to a private cloud. Teregowda and Giles [160] elaborated on this in a detailed report on scaling SeerSuite in the cloud environment. The platform is divided into three components, namely architecture, services and applications. The system makes use of a Crawl Cluster, HDFS, NoSQL and MapReduce for implementation.
Fig. 1. Workflow for Big Data Analytics

The proposed system can broadly be divided, on the basis of user interaction, into two sections: frontend and backend. The frontend includes load balancers and web servers. This interface allows users to interact with the system, takes their requests and communicates the results back to them. On the other hand, the backend performs crawling of web sources for relevant data, extraction of information from raw data and ingestion of information into the system to support applications like research management, collaborator discovery and expert finding, in addition to several others. An illustration of the big scholarly data platform proposed by Wu et al. [10] is presented in Fig. 2.
On the basis of the architecture, challenges and research directions proposed by Wu et al. [10], this research divides the challenges presented by cloud-based analytics of big scholarly data into technical and non-technical challenges. Research papers under each category have been analyzed using a qualitative research methodology to provide an extensive survey of the cloud-based big scholarly data platform. The technical challenges are further divided on the basis of the functionality to which they belong. The three categories, namely data management, analytics and visualization, are covered in the sections that follow.
3 SCHOLARLY DATA MANAGEMENT
Data is generated in many diverse forms in any scholarly platform. One of the primary sources of data is the huge reservoir of existing scholarly documents on the Internet. In addition to this, there are author webpages, academic social networks and secondary sources of scholarly information like institution and organization webpages, which also render significant data for a comprehensive analysis of the scholarly community. Evidently, there are several sources of data, providing different types of information. Moreover, this data is continuously updated, appended and removed. Challenges in data management can be further divided into five sub-categories: (i) big data characteristics, (ii) data acquisition and integration, (iii) information extraction, (iv) data preprocessing and (v) data processing and resource management. The different facets of data management of big scholarly data are discussed below.
3.1 Big Scholarly Data Characteristics
Big data is traditionally characterized by three main features, namely volume, variety and velocity. It can be derived from the meaning of these words that volume characterizes the size of data, variety symbolizes the types of data included and velocity indicates the rate of data generation.
The volume of data can be assessed by evaluating the size of scholarly documents available on the web as raw data. Khabsa and Giles [23] estimated that the number of English scholarly documents available on the Internet is approximately 114 million, and that this count grows at a daily rate of tens of thousands. It is crucial to mention here that this is a lower-bound value. It has also been stated that Google Scholar accommodates 87% of the total [23]. Therefore, the number of English scholarly documents on Google Scholar is around 100 million [23].
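The arithmetic behind these estimates is easy to reproduce as a quick sanity check:

```python
# Back-of-the-envelope check of the estimates reported in [23].
total_docs = 114_000_000   # lower-bound count of English scholarly documents on the web
scholar_share = 0.87       # fraction reportedly accessible via Google Scholar
print(round(total_docs * scholar_share / 1_000_000))  # -> 99, i.e. roughly 100 million
```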
It is important to understand that the big scholarly dataset is not just limited to scholarly documents. Information extracted from raw data and linked to create citation and knowledge graphs is also a significant contributor to the size, variety and volume of big scholarly data. Caragea et al. [32] gave an estimate of the big scholarly dataset maintained at CiteSeerX until May 2013. The total number of documents in the aforementioned system was approximately 2.35 million. However, this count includes duplicates and, upon removal of the same, the approximate count is reduced to 1.9 million documents. In addition to this, the number of unique authors in the database is 2.4 million, while the number of citations, which includes repetitions, is about 52 million.
From a data size perspective, Caragea et al. [32] estimated the size of CiteSeerX to be 6TB, growing at a daily rate of 10-20GB. From the numbers stated above, it can be inferred that the scholarly dataset is indeed 'big'. Specifically, there are three main reasons why scholarly data is called big scholarly data, which are as follows:
1. Firstly, the storage and computing resource requirements of this data are too high to be provisioned by traditional architectures. For instance, common scholarly applications like collaborator discovery require services like author profiling and disambiguation. This is a compute-intensive task, which requires the system to work on 'big' data. Moreover, one of the fundamental requirements of this system is smart resource allocation and scheduling.
2. Secondly, the data throughput requirements of the system need a better data processing framework and tools. The single-pipeline system is a bottleneck, particularly in the case of data ingestion.
Fig. 2. Big Scholarly Data Platform

3. Lastly, static crawling techniques do not provide the coverage and data filtering accuracy that such systems and applications require. Besides this, existing document classifier systems perform basic classification, separating academic documents from non-academic documents. For advanced applications, more sophisticated classification, on the basis of document type and subject, is required.
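The more sophisticated classification called for in the last point can be illustrated with a short sketch. The tiny inline training set, the label names and the bag-of-words model below are assumptions for illustration, not the classifier used by any of the surveyed systems:

```python
# Sketch of document-type classification beyond the academic / non-academic
# split, using tf-idf features and a linear model (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets; real systems would train on full documents.
train_texts = [
    "abstract introduction methodology experiments references",
    "abstract related work evaluation references bibliography",
    "project summary budget milestones deliverables timeline",
    "statement of work objectives funding schedule deliverables",
    "lecture notes chapter exercises solutions syllabus",
    "course outline homework reading list syllabus",
]
train_labels = ["paper", "paper", "proposal", "proposal", "notes", "notes"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["introduction experiments and references"]))  # expected: ['paper']
```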
In addition to the standard 3V characteristics, Wu et al. [22] gave many new attributes, transforming the 3V model into a multi-V model. Additional characteristics include veracity, value, variability, validity, visibility and verdict. A Venn diagram for the multi-V model is shown in Fig. 3. The three Vs value, visibility and verdict constitute the business intelligence (BI) aspects of the data concerned.
The visibility characteristic provides the foresight, hindsight and insight of the data, as opposed to the traditional 3Vs that only focus on insight. From the BI perspective, it is important to know if the data is capable of contributing anything substantial, which defines the 'value' of data. On the basis of analysis of the problem and its proposed solution, it is the decision makers' job to give a 'verdict'.
The statistical perspective on data is given by veracity, validity and variability. Veracity defines the trustworthiness of data, while validity determines if the data has been acquired ethically and without any bias. When data complexity and variety are analyzed, the implied characteristic that comes into being is 'variability'.
It is important to note that limited research has been performed on data veracity. Data quality has a direct impact on the quality of analytics produced, which makes veracity a significant big data characteristic, particularly for critical applications [41][43]. In addition, the privacy and security aspects of cloud-based big data solutions are also yet to be explored in full; these facets are remarkably significant in view of the fact that they are important user concerns [46] when working in the cloud environment.
Although validity is a largely conceptual characteristic and holds little significance in the present context, variability is particularly relevant to big scholarly data. The 3Vs associated with the business intelligence perspective solely depend on the ability of an organization to make use of the available data with the deployed solution. Moreover, there is no existing literature that discusses big scholarly data with respect to the statistical and business intelligence perspectives.
3.2 Data Acquisition and Integration
The first step of the data analytics process is data acquisition, as part of which data is collected from a single source or multiple sources and integrated to form the dataset that serves as input to the analytics engine. A big scholarly dataset is an integration of many types of documents, as illustrated in Fig. 4. These documents are retrieved from their respective sources. The primary source of data is the web, along with specialized databases like DBLP [37]. Moreover, portals like arXiv [38] and publishing houses like Elsevier [39] also provide APIs, which can be used to extract data.
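As an illustration of API-based acquisition, the following is a minimal sketch against arXiv's public Atom query endpoint (http://export.arxiv.org/api/query). The search term, result limit and title-only extraction are illustrative assumptions; a production harvester would add paging, rate limiting and error handling.

```python
"""Minimal sketch of scholarly data acquisition via the arXiv Atom API."""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed

def fetch_arxiv_titles(query: str, max_results: int = 5) -> list:
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    # Each <entry> element in the Atom feed describes one scholarly document.
    return [entry.findtext(f"{ATOM}title", "").strip()
            for entry in feed.iter(f"{ATOM}entry")]

if __name__ == "__main__":
    for title in fetch_arxiv_titles("big scholarly data"):
        print(title)
```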
In order to extract data from the web, two tools can be used, namely crawling and REST APIs. CiteSeerX uses focused crawling [161], as it only requires academic documents [40]. Two crawlers are employed: one performs scheduled crawling, while the other crawls URLs submitted by users, which are a source of rich and dependable data; both extract only PDFs. Moreover, the former satisfies the data freshness requirement of data acquisition by keeping the database updated with the latest publications.
The crawling process yields PDFs. However, the classification of documents as academic or non-academic is done as part of the document filtering process. The text of each PDF is extracted and, on the basis of the presence or absence of a Bibliography or References section at the end of the text, the document is classified as academic or non-academic. Only academic documents are kept and the rest are discarded.
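This filtering heuristic can be sketched in a few lines. The sketch below operates on already-extracted text, and the tail-fraction threshold is an illustrative assumption rather than the actual CiteSeerX rule:

```python
import re

# Match a "References" or "Bibliography" heading on a line of its own.
_HEADING = re.compile(r"^\s*(references|bibliography)\s*$",
                      re.IGNORECASE | re.MULTILINE)

def is_academic(text: str, tail_fraction: float = 0.4) -> bool:
    """Classify extracted PDF text as academic if a references/bibliography
    heading appears within the last `tail_fraction` of the document."""
    match = _HEADING.search(text)
    return bool(match) and match.start() >= len(text) * (1 - tail_fraction)

# A paper ending with a reference list passes; a brochure does not.
assert is_academic("Introduction...\n" * 50 + "References\n[1] J. Doe, 2015.")
assert not is_academic("Product brochure with no sources cited.")
```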
One of the most important facets of data acquisition is to determine if a single source is enough to get all the data required for providing accurate analysis. In order to address this concern, there have been several efforts to estimate the total size of big scholarly data, the value of which is then compared with the statistics published by individual sources.

Fig. 3. Venn diagram of the 3²V Model

Fig. 4. Big Scholarly Dataset Composition

Citations
Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases
TL;DR: A comparative picture of 12 of the most commonly used ASEBDs is provided by counting query hit data as an indicator of the number of accessible records; the results indicate that Google Scholar's size might have been underestimated so far by more than 50%.

A survey towards an integration of big data analytics to big insights for value-creation
TL;DR: This article presents a comprehensive, well-informed examination and realistic analysis of deploying big data analytics successfully in companies, and presents a methodical analysis of the usage of big data analytics in applications such as agriculture, healthcare, cyber security and smart cities.

Academic social networks: Modeling, analysis, mining and applications
TL;DR: This study investigates the background, current status and trends of academic social networks, and systematically reviews representative research tasks in this domain at three levels: actor, relationship and network.

Big data adoption: State of the art and research challenges
TL;DR: According to the findings, Technology-Organization-Environment and Diffusion of Innovations are the most popular theoretical models used for big data adoption in various domains, and forty-two factors in technology, organization, environment and innovation that have a significant influence on big data adoption are revealed.

Exploring the Online Doctor-Patient Interaction on Patient Satisfaction Based on Text Mining and Empirical Analysis
TL;DR: The results indicate that the patient's activeness has a positive effect on a doctor's informational and emotional support, and that the effect of emotional support on patient satisfaction is more significant than that of informational support.
References
An index to quantify an individual's scientific research output
TL;DR: The index h, defined as the number of papers with citation number ≥ h, is proposed as a useful index to characterize the scientific output of a researcher.

Co-citation in the scientific literature: A new measure of the relationship between two documents
TL;DR: A new form of document coupling called co-citation is defined as the frequency with which two documents are cited together, and clusters of co-cited papers provide a new way to study the specialty structure of science.

Data-intensive applications, challenges, techniques and technologies: A survey on Big Data
TL;DR: This paper aims to demonstrate a close-up view of Big Data, including Big Data applications, opportunities and challenges, as well as the state-of-the-art techniques and technologies currently adopted to deal with Big Data problems.

Recommender Systems Handbook
TL;DR: This handbook illustrates how recommender systems can support the user in decision-making, planning and purchasing processes, and how they work for well-known corporations such as Amazon, Google, Microsoft and AT&T.

Big data analytics in healthcare: promise and potential
TL;DR: Big data analytics in healthcare is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs; its potential is great, however challenges remain to be overcome.
Frequently Asked Questions
Q1. What are the contributions mentioned in the paper "Cloud-Based Big Data Management and Analytics for Scholarly Resources: Current Trends, Challenges and Scope for Future Research"?

With the shifting focus of organizations and governments towards digitization of academic and technical documents, there has been an increasing need to use this reserve of scholarly documents for developing applications that can facilitate and aid in better management of research. This research paper reviews the current trends and identifies the challenges existing in the architecture, services and applications of the big scholarly data platform with a specific focus on directions for future research.

This survey includes a detailed study of the current trends and existing challenges in the different subsystems of the big scholarly data platform, with specific focus on directions for future research in this area. Suggested future work in the area includes the development of solutions and APIs. Most of the future work in this direction includes the creation of expressive languages that shall enable users to define their problem to the system, keeping in view that the operational efficiency of the system must improve as data grows. While CiteSeerX exists as one of the most popular scholarly platforms, the services provided are rather limited in their functionality and can be further enhanced to include many scholarly applications like research management, and optimized to provide added functionality like algorithm linking, time-evolution of research and recommendations.

Analytics for big scholarly data can be divided into four categories, namely research management, collaborator discovery, expert finder systems and recommender systems.

Many other types of information like venues where the author has published or presented work and detailed author information derived from the professional author webpage can be used to form a comprehensive author profile, which can be useful for advanced scholarly analytics like collaborator discovery and expert finding [86][87][88]. 

Citation linking and matching are important steps in the process, in view of the fact that some fields of metadata that may have been incomplete or extracted incorrectly can be corrected and completed from the data provided by the linkage.

In order to implement the techniques mentioned above, MapReduce and Hadoop [20] have been identified as the most effective and efficient framework.

Storing and processing unstructured data, and performing these activities such that aggregating and correlating data from different sources becomes simpler, also require research attention.

Computer Science research documents contain specific sections like pseudocodes and algorithms, which play an instrumental role in mapping research growth and evolution. 

The proposed approach captures the global coherence and local relatedness in the book by extracting concepts in each chapter and constructing a concept hierarchy.

Considering that data will be collected from heterogeneous sources and may exist in different formats, the concept of data linking can be used. 
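As a toy illustration of this linking idea, the sketch below joins records from two hypothetical metadata sources on a crudely normalized title key; the record structure and normalization rule are illustrative assumptions only:

```python
import re

def normalize_title(title):
    """Crude key for linking records about the same paper across sources:
    lowercase and strip punctuation and whitespace. Illustrative only."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def link_records(source_a, source_b):
    """Join two hypothetical metadata lists on the normalized title."""
    index = {normalize_title(r["title"]): r for r in source_a}
    return [(index[k], r) for r in source_b
            if (k := normalize_title(r["title"])) in index]

# Two differently formatted records for the same paper are linked.
dblp = [{"title": "A Survey of Scholarly Data: From Big Data Perspective"}]
crawl = [{"title": "A survey of scholarly data — from big data perspective."}]
print(len(link_records(dblp, crawl)))  # -> 1
```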

Gao et al. [110] reviewed structure extraction in books and proposed that extraction of the ToC and metadata from books can be seen as a matching problem on a bipartite graph.

Tuarob et al. [54] also proposed the use of an algorithm co-citation network to detect algorithmic-level similarity, which can further be extended to implement algorithm recommendation engines.

Challenges in data management can be further divided into five sub-categories: (i) big data characteristics, (ii) data acquisition and integration, (iii) information extraction, (iv) data preprocessing and (v) data processing and resource management.

Tuarob et al. [97] proposed a hybrid algorithm that can identify section boundaries, detect section headers and recognize the hierarchy of sections with good accuracy. 

Kim et al. [59] disambiguated the DBLP dataset using these three methods and compared their impact, concluding that author disambiguation can have a substantial influence on data quality and quality of service and analytics performed using the data.
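As a toy illustration of the disambiguation problem itself, the sketch below merges papers published under one ambiguous name only when they share a coauthor. This naive heuristic is an assumption for illustration and is far simpler than the methods compared by Kim et al. [59]:

```python
from itertools import combinations

def disambiguate(papers):
    """papers: [{'id': int, 'coauthors': set}, ...], all bearing the same
    ambiguous author name. Returns clusters of paper ids, one per inferred
    real-world author (union-find over shared-coauthor links)."""
    parent = {p["id"]: p["id"] for p in papers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union two papers whenever their coauthor sets overlap.
    for a, b in combinations(papers, 2):
        if a["coauthors"] & b["coauthors"]:
            parent[find(a["id"])] = find(b["id"])

    clusters = {}
    for p in papers:
        clusters.setdefault(find(p["id"]), set()).add(p["id"])
    return list(clusters.values())

# Papers 1 and 2 share a coauthor and merge; paper 3 stays separate.
print(disambiguate([
    {"id": 1, "coauthors": {"A. Kumar"}},
    {"id": 2, "coauthors": {"A. Kumar", "B. Lee"}},
    {"id": 3, "coauthors": {"C. Park"}},
]))  # -> [{1, 2}, {3}]
```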