Journal ArticleDOI

The anatomy of big data computing

01 Jan 2016 - Software: Practice and Experience (John Wiley & Sons, Ltd) - Vol. 46, Iss. 1, pp. 79-105
TL;DR: In this paper, the authors discuss the evolution of big data computing, the differences between traditional data warehousing and big data, a taxonomy of big data computing and its underpinning technologies, the integrated platform of big data and clouds known as big data clouds, the layered architecture and components of a big data cloud, and finally open technical challenges and future directions.
Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical, and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision-making from such rapidly growing voluminous data are challenging tasks in terms of data organization and processing, an emerging trend known as big data computing: a new paradigm that combines large-scale compute, new data-intensive techniques, and mathematical models to build data analytics. Big data computing demands huge storage and compute capacity for data curation and processing, which can be delivered from on-premises or cloud infrastructures. This paper discusses the evolution of big data computing, the differences between traditional data warehousing and big data, a taxonomy of big data computing and its underpinning technologies, the integrated platform of big data and clouds known as big data clouds, the layered architecture and components of a big data cloud, and finally open technical challenges and future directions. Copyright © 2015 John Wiley & Sons, Ltd.
Citations
Journal ArticleDOI
TL;DR: In this article, the authors present a state-of-the-art review offering a holistic view of the BD challenges and the BDA methods theorized/proposed/employed by organizations, with the objective of helping others understand this landscape and make robust investment decisions.

1,267 citations

Journal ArticleDOI
TL;DR: This paper is a review that surveys recent technologies developed for Big Data, providing not only a global view of the main Big Data technologies but also comparisons across system layers such as the Data Storage Layer, Data Processing Layer, Data Querying Layer, Data Access Layer, and Management Layer.

600 citations

Journal ArticleDOI
TL;DR: This paper compiles, summarizes, and organizes machine learning challenges with Big Data, highlighting the cause–effect relationship by organizing challenges according to the Big Data Vs, or dimensions, that instigated the issue: volume, velocity, variety, or veracity.
Abstract: The Big Data revolution promises to transform how we live, work, and think by enabling process optimization, empowering insight discovery, and improving decision making. The realization of this grand potential relies on the ability to extract value from such massive data through data analytics; machine learning is at its core because of its ability to learn from data and provide data-driven insights, decisions, and predictions. However, traditional machine learning approaches were developed in a different era and are therefore based upon multiple assumptions, such as the data set fitting entirely into memory, which unfortunately no longer holds true in this new context. These broken assumptions, together with the Big Data characteristics, are creating obstacles for the traditional techniques. Consequently, this paper compiles, summarizes, and organizes machine learning challenges with Big Data. In contrast to other research that discusses challenges, this work highlights the cause–effect relationship by organizing challenges according to the Big Data Vs, or dimensions, that instigated the issue: volume, velocity, variety, or veracity. Moreover, emerging machine learning approaches and techniques are discussed in terms of how they are capable of handling the various challenges, with the ultimate objective of helping practitioners select appropriate solutions for their use cases. Finally, a matrix relating the challenges and approaches is presented. Through this process, this paper provides a perspective on the domain, identifies research gaps and opportunities, and provides a strong foundation and encouragement for further research in the field of machine learning with Big Data.
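
One concrete instance of the broken in-memory assumption discussed above is worth illustrating: a common volume-oriented workaround is incremental (out-of-core) learning, where the model is updated one chunk of data at a time. The sketch below is an editorial example, not drawn from the cited paper; it shows the pattern with scikit-learn's partial_fit on a synthetic stream, and the stream generator and all parameters are assumptions.

```python
# Incremental (out-of-core) learning sketch: the data never has to fit in memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

def chunk_stream(n_chunks=21, chunk_size=1_000, n_features=10, seed=0):
    """Simulate a data source too large to hold in memory at once."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=n_features)        # fixed "ground truth" weights
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X @ true_w > 0).astype(int)
        yield X, y

chunks = chunk_stream()
X_test, y_test = next(chunks)                   # hold the first chunk out for evaluation

model = SGDClassifier()                         # linear classifier trained with SGD
classes = np.array([0, 1])
for X, y in chunks:
    # partial_fit updates the model one chunk at a time instead of requiring all data at once.
    model.partial_fit(X, y, classes=classes)

print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```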

592 citations

Journal ArticleDOI
TL;DR: This study describes the value proposition of BDA by delineating its components, then illustrates the framework through BDA applications in practice, and presents a problem-oriented view of the framework—where problems in BDA components can give rise to targeted research questions and areas for future study.
Abstract: Despite the publicity regarding big data and analytics (BDA), the success rate of these projects and strategic value created from them are unclear. Most literature on BDA focuses on how it can be u...

449 citations

Journal ArticleDOI
TL;DR: The proposed manifesto identifies the major open challenges in Cloud computing, emerging trends, and impact areas, and offers research directions for the next decade, thus helping in the realisation of Future Generation Cloud Computing.
Abstract: The Cloud computing paradigm has revolutionised the computer science horizon during the past decade and has enabled the emergence of computing as the fifth utility. It has captured significant attention of academia, industries, and government bodies. Now, it has emerged as the backbone of modern economy by offering subscription-based services anytime, anywhere following a pay-as-you-go model. This has instigated (1) shorter establishment times for start-ups, (2) creation of scalable global enterprise applications, (3) better cost-to-value associativity for scientific and high-performance computing applications, and (4) different invocation/execution models for pervasive and ubiquitous applications. The recent technological developments and paradigms such as serverless computing, software-defined networking, Internet of Things, and processing at network edge are creating new opportunities for Cloud computing. However, they are also posing several new challenges and creating the need for new approaches and research strategies, as well as the re-evaluation of the models that were developed to address issues such as scalability, elasticity, reliability, security, sustainability, and application models. The proposed manifesto addresses them by identifying the major open challenges in Cloud computing, emerging trends, and impact areas. It then offers research directions for the next decade, thus helping in the realisation of Future Generation Cloud Computing.

212 citations

References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This paper explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
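
To make the map/reduce contract described above concrete, here is a minimal, sequential word-count sketch in Python. It is an editorial illustration of the programming model only; the toy driver stands in for the parallel, fault-tolerant runtime the paper describes, and the function names are assumptions.

```python
# Word count expressed as user-supplied map and reduce functions.
from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in a document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts emitted for a word."""
    return word, sum(counts)

def run_mapreduce(documents):
    """Toy sequential driver; the real runtime parallelizes these phases across a cluster."""
    groups = defaultdict(list)
    for doc in documents:                 # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)     # shuffle: group intermediate values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

if __name__ == "__main__":
    docs = ["big data computing", "big data clouds"]
    print(run_mapreduce(docs))  # {'big': 2, 'data': 2, 'computing': 1, 'clouds': 1}
```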

17,663 citations

Journal ArticleDOI
TL;DR: This paper defines Cloud computing and provides an architecture for creating Clouds with market-oriented resource allocation by leveraging technologies such as Virtual Machines (VMs), and offers insights into market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain Service Level Agreement (SLA)-oriented resource allocation.
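
As an editorial sketch of what SLA-oriented, market-based resource allocation can look like in code (the admission policy, names, and numbers below are assumptions, not the cited paper's model), a provider might accept a request only when its expected revenue outweighs the expected SLA penalty at the projected utilisation:

```python
# SLA-aware admission control sketch: weigh price against penalty risk before accepting work.
from dataclasses import dataclass

@dataclass
class Request:
    cpu_hours: float   # resources the customer asks for
    price: float       # revenue if the SLA is met
    penalty: float     # refund owed if the SLA is violated

class Provider:
    def __init__(self, capacity_cpu_hours):
        self.capacity = capacity_cpu_hours
        self.committed = 0.0

    def violation_risk(self, extra):
        """Crude risk model: risk rises linearly from 0 at 80% load to 1 at full capacity."""
        util = (self.committed + extra) / self.capacity
        return max(0.0, min(1.0, (util - 0.8) / 0.2))

    def admit(self, req):
        risk = self.violation_risk(req.cpu_hours)
        expected_value = (1 - risk) * req.price - risk * req.penalty
        if expected_value > 0 and self.committed + req.cpu_hours <= self.capacity:
            self.committed += req.cpu_hours
            return True
        return False

p = Provider(capacity_cpu_hours=100)
print(p.admit(Request(cpu_hours=70, price=50, penalty=40)))   # True: low utilisation, low risk
print(p.admit(Request(cpu_hours=25, price=10, penalty=200)))  # False: penalty-dominated at high load
```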

5,850 citations

Proceedings ArticleDOI
14 Oct 2007
TL;DR: This paper presents Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience; it makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
Abstract: Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
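
As an editorial illustration of the object-versioning and application-assisted reconciliation idea described above (not Amazon's implementation or API; the class and method names are assumptions), the Python sketch below keeps concurrent siblings under vector clocks and leaves merging to the caller:

```python
# Dynamo-style versioning sketch: concurrent writes are kept as siblings, never silently dropped.
def descends(a, b):
    """True if vector clock `a` has seen every event in `b` (a is newer or equal)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> list of (vector_clock, value) siblings

    def put(self, key, value, node, context=None):
        """Write a value; `context` is the clock returned by an earlier get()."""
        clock = dict(context or {})
        clock[node] = clock.get(node, 0) + 1
        # Drop only siblings that the new version causally supersedes.
        siblings = [(c, v) for c, v in self._data.get(key, []) if not descends(clock, c)]
        siblings.append((clock, value))
        self._data[key] = siblings

    def get(self, key):
        """Return all concurrent siblings; the application reconciles them."""
        return self._data.get(key, [])

store = VersionedStore()
store.put("cart", {"book"}, node="A")
ctx, _ = store.get("cart")[0]
store.put("cart", {"book", "pen"}, node="B", context=ctx)  # causally newer: replaces the original
store.put("cart", {"book", "mug"}, node="C", context=ctx)  # concurrent: kept as a sibling
# Application-assisted reconciliation, e.g. merging divergent shopping carts:
merged = set().union(*(value for _, value in store.get("cart")))
print(merged)  # {'book', 'pen', 'mug'} (set order may vary)
```

The design choice mirrored here is that the store never silently discards a concurrent write; divergent versions are returned to the application, which knows how to merge them.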

4,349 citations

Journal ArticleDOI
19 Feb 2009 - Nature
TL;DR: This paper presents a method of analysing large numbers of Google search queries to track influenza-like illness in a population and accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day.
Abstract: This paper, first published online in November 2008, draws on data from an early version of the Google Flu Trends search engine to estimate levels of flu in a population. It introduces a computational model that converts raw internet search query data into a region-by-region, real-time surveillance system for influenza-like illness (ILI), one that reproduces the patterns observed in ILI data from the Centers for Disease Control and Prevention (CDC) while estimating influenza activity with a lag of about one day, one to two weeks faster than conventional CDC reports. Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year [1]. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities [2]. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza [3,4]. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.
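
The estimator behind this approach is, in essence, a linear model on the log-odds scale relating the fraction of ILI-related search queries to the fraction of ILI-related physician visits. The sketch below fits such a model on synthetic data; the generated values and fitted coefficients are editorial assumptions for illustration, not the paper's data.

```python
# Fit logit(ILI visit fraction) ~ logit(ILI query fraction) and use it to "nowcast" activity.
import numpy as np

def logit(p):
    """Log-odds transform used to linearise proportions."""
    return np.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic weekly data: fraction of searches that are ILI-related (q) and the
# corresponding fraction of physician visits with ILI symptoms (p).
rng = np.random.default_rng(0)
q = rng.uniform(0.001, 0.02, size=52)                            # assumed query fractions
p = inv_logit(1.2 * logit(q) + 0.5 + rng.normal(0, 0.05, 52))    # illustrative ground truth

# Least-squares fit of logit(p) = beta1 * logit(q) + beta0.
beta1, beta0 = np.polyfit(logit(q), logit(p), deg=1)

# Nowcast: estimate this week's ILI activity from this week's query fraction.
q_now = 0.015
p_hat = inv_logit(beta1 * logit(q_now) + beta0)
print(f"estimated ILI visit fraction this week: {p_hat:.3%}")
```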

3,984 citations

Journal ArticleDOI
TL;DR: This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, together with the design and implementation of Bigtable.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
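
As an editorial sketch of the data model described above (not the Bigtable API; the class and method names are assumptions), a Bigtable can be viewed as a sparse, sorted, multidimensional map from (row key, column family:qualifier, timestamp) to value, with row keys kept in lexicographic order so range scans over related rows stay cheap:

```python
# Tiny in-memory mimic of Bigtable's (row, column, timestamp) -> value data model.
import time
from collections import defaultdict

class TinyTable:
    def __init__(self):
        # row key -> column ("family:qualifier") -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        cells = self._rows[row][column]
        cells.append((ts if ts is not None else time.time(), value))
        cells.sort(reverse=True)  # keep the newest version first

    def get(self, row, column, n_versions=1):
        """Return up to n_versions of the most recent (timestamp, value) cells."""
        return self._rows[row][column][:n_versions]

    def scan(self, start_row, end_row):
        """Row keys are kept sorted, so a range scan just walks the ordered keys."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, {col: cells[0][1] for col, cells in self._rows[row].items()}

# Usage loosely modelled on the paper's webtable example (reversed-domain row keys).
t = TinyTable()
t.put("com.cnn.www", "contents:", "<html>...</html>", ts=3)
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=9)
print(t.get("com.cnn.www", "contents:"))     # [(3, '<html>...</html>')]
print(list(t.scan("com.", "com.d")))         # rows in the key range, newest cell per column
```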

3,259 citations