
Showing papers on "Data-intensive computing published in 2016"


Journal ArticleDOI
TL;DR: The results show that the classification performance of the MapReduce-based procedure is very stable regardless of how many compute nodes are used, and better than the baseline single-machine and distributed procedures except on the class-imbalanced dataset.

69 citations


Journal ArticleDOI
TL;DR: This review assesses MapReduce to help researchers better understand the novel optimizations that have been proposed to address its limitations, and suggests several research directions that should be pursued in the future.
Abstract: With the development of information technologies, we have entered the era of Big Data. Google's MapReduce programming model and its open-source implementation in Apache Hadoop have become the dominant model for data-intensive processing because of their simplicity, scalability, and fault tolerance. However, several inherent limitations, such as the lack of efficient scheduling and iterative computing mechanisms, seriously affect the efficiency and flexibility of MapReduce. To date, various approaches have been proposed to extend the MapReduce model and improve runtime efficiency for different scenarios. In this review, we assess MapReduce to help researchers better understand the novel optimizations that have been proposed to address its limitations. We first present the basic idea underlying the MapReduce paradigm and describe several widely used open-source runtime systems. We then discuss the main shortcomings of the original MapReduce. We also review the MapReduce optimization approaches that have recently been put forward, and categorize them according to their characteristics and capabilities. Finally, we conclude the paper and suggest several research directions for future work.
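
To make the basic idea of the paradigm concrete, here is a minimal, self-contained sketch of the map/shuffle/reduce flow on a word-count example; it runs as plain Python on a single machine and illustrates only the programming model, not the Hadoop API.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map: emit (key, value) pairs -- here, (word, 1) for every word.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group all values by key, as the framework does between
        # the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate the values of one key -- here, sum the counts.
        return key, sum(values)

    documents = ["big data needs parallel processing",
                 "mapreduce makes parallel processing simple"]
    grouped = shuffle(chain.from_iterable(map_phase(d) for d in documents))
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)   # e.g. {'parallel': 2, 'processing': 2, ...}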

64 citations


Journal ArticleDOI
01 Jul 2016
TL;DR: This paper develops a model-driven optimization that serves as an oracle, providing high-level insights and applies these insights to design cross-phase optimization techniques that are implemented and demonstrated in a real-world MapReduce implementation.
Abstract: MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce across geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach: We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the potential of these techniques as performance is improved by 7-18 percent depending on the execution environment and application.

59 citations


Journal ArticleDOI
TL;DR: Pegasus workflows are portable across different infrastructures, optimizable for performance and efficiency, and automatically map to many different storage systems and data flows, making Pegasus a powerful solution for executing scientific workflows in the cloud.
Abstract: The Pegasus Workflow Management System maps abstract, resource-independent workflow descriptions onto distributed computing resources. As a result of this planning process, Pegasus workflows are portable across different infrastructures, optimizable for performance and efficiency, and automatically map to many different storage systems and data flows. This approach makes Pegasus a powerful solution for executing scientific workflows in the cloud.
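
As a very rough illustration of what "mapping an abstract, resource-independent workflow onto concrete resources" can look like in the simplest case, the sketch below plans a tiny three-task DAG onto hypothetical execution sites. The data structures and the round-robin site assignment are illustrative assumptions only, not the Pegasus API or its planning algorithm.

    from graphlib import TopologicalSorter   # Python 3.9+

    # Abstract workflow: task -> set of tasks it depends on (a DAG).
    abstract_workflow = {
        "extract":   set(),
        "transform": {"extract"},
        "load":      {"transform"},
    }

    # Hypothetical execution sites discovered at planning time.
    sites = ["cluster_a", "cloud_b"]

    def plan(workflow, sites):
        # Turn the abstract DAG into an ordered, concrete plan by assigning
        # each task to a site (simple round-robin, purely illustrative).
        order = list(TopologicalSorter(workflow).static_order())
        return [(task, sites[i % len(sites)]) for i, task in enumerate(order)]

    for task, site in plan(abstract_workflow, sites):
        print(f"run {task!r} on {site}")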

56 citations


Journal ArticleDOI
TL;DR: This paper surveys data management and replication approaches developed by both industrial and research communities from 2007 to 2011, discussing and characterizing the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency.
Abstract: As we delve deeper into the 'Digital Age', we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created on a daily basis, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, software logs, etc. Such 'data explosions' have led to one of the most challenging research issues of the current Information and Communication Technology era: how to optimally manage (e.g., store, replicate, filter, and the like) such large amounts of data and identify new ways to analyze them for unlocking information. It is clear that such large data streams cannot be managed by setting up on-premises enterprise database systems, as this leads to a large up-front cost in buying and administering the hardware and software systems. Therefore, next-generation data management systems must be deployed on the cloud. The cloud computing paradigm provides scalable and elastic resources, such as data and services, accessible over the Internet. Every cloud service provider must ensure that data is efficiently processed and distributed in a way that does not compromise end-users' Quality of Service (QoS) in terms of data availability, data search delay, data analysis delay, and the like. In this perspective, data replication is used in the cloud to improve the performance (e.g., read and write delay) of applications that access data. Through replication, a data-intensive application or system can achieve high availability, better fault tolerance, and data recovery. In this paper, we survey data management and replication approaches (from 2007 to 2011) that were developed by both industrial and research communities. The focus of the survey is to discuss and characterize the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency. Moreover, the breakdown of both influential expressions (data replication and management) to provide different QoS attributes is deliberated. Furthermore, the performance advantages and disadvantages of data replication and management approaches in cloud computing environments are analyzed. Open issues and future challenges related to data consistency, scalability, load balancing, processing and placement are also reported.
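
As a back-of-the-envelope illustration of why replication improves availability (one of the QoS attributes discussed above), the snippet below computes the probability that at least one replica of an object is reachable, assuming independent node failures; the failure probability and replication factors are made-up numbers, not values from the survey.

    def availability(node_failure_prob, replicas):
        # An object is unreachable only if every replica is down
        # (assuming independent failures), hence 1 - p**r.
        return 1 - node_failure_prob ** replicas

    p = 0.05  # assumed probability that a single node is unavailable
    for r in (1, 2, 3):
        print(f"replication factor {r}: availability = {availability(p, r):.6f}")
    # e.g. r=3 gives 1 - 0.05**3 = 0.999875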

44 citations


Proceedings ArticleDOI
18 Mar 2016
TL;DR: This paper describes how Hadoop MapReduce works and the Spark architecture, together with the kinds of operations each supports, and shows the differences between Hadoop MapReduce and Spark in the Map and Reduce phases individually.
Abstract: Hadoop MapReduce is used to analyze large volumes of data across multiple nodes in parallel. MapReduce consists of two functions, Map and Reduce, and the data is stored in HDFS. MapReduce lacks certain facilities, so Spark was designed to handle real-time stream data and to answer queries quickly. Spark jobs operate on Resilient Distributed Datasets (RDDs) and a directed acyclic graph execution engine. In this paper, we describe how Hadoop MapReduce works and the Spark architecture, together with the kinds of operations each supports. We also show the differences between Hadoop MapReduce and Spark in the Map and Reduce phases individually.
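
As a point of comparison between the two systems described above, here is a word count expressed against Spark's RDD API: in Hadoop MapReduce the Map and Reduce phases are separate functions of a job, whereas in Spark they are chained transformations on an RDD that the DAG engine executes lazily. The input path is a placeholder and a running Spark installation is assumed.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    counts = (sc.textFile("hdfs:///path/to/input")      # placeholder path
                .flatMap(lambda line: line.split())     # map side: split into words
                .map(lambda word: (word, 1))            # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))       # reduce side: sum the counts

    for word, count in counts.take(10):                 # fetch a few results
        print(word, count)

    sc.stop()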

42 citations


Journal ArticleDOI
TL;DR: The experiments suggest that such ISC-augmented systems can provide a very promising computing model in terms of system scalability, along with a remarkable performance gain compared to a typical Hadoop MapReduce system.
Abstract: Solid State Drives (SSDs) were initially developed as faster storage devices intended to replace conventional magnetic Hard Disk Drives (HDDs). However, high computational capabilities enable SSDs to be computing nodes, not just faster storage devices. Such capability is generally called "In-Storage Computing" (ISC). Today's Hadoop MapReduce framework has become a de facto standard for big data processing. This paper explores In-Storage Computing challenges and opportunities for the Hadoop MapReduce framework. For this, we integrate a Hadoop MapReduce system with ISC SSD devices that implement the Hadoop Mapper inside real SSD firmware. This offloads Map tasks from the host MapReduce system to the ISC SSDs. We additionally optimize the host Hadoop system to make the best use of our proposed ISC Hadoop system. Experimental results demonstrate that our ISC Hadoop MapReduce system achieves a remarkable performance gain (2.3× faster) as well as significant energy savings (11.5× lower) compared to a typical Hadoop MapReduce system. Further, the experiments suggest that such ISC-augmented systems can provide a very promising computing model in terms of system scalability.

33 citations


Journal ArticleDOI
TL;DR: This work compares Hadoop Streaming alongside its own streaming framework, MARISSA, to show performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system based data stores.
Abstract: The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development and research interest in the NoSQL model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and costly. As the quantity and quality of unstructured data grows, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a “Big Data” processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java, while the Hadoop Streaming module allows users to define non-Java executables as map and reduce operations. When confronted with legacy C/C++ applications and other non-Java executables, there is a further need to allow NoSQL data stores access to the features of Hadoop Streaming. We present approaches to solving the challenge of integrating NoSQL data stores with MapReduce under non-Java application scenarios, along with the advantages and disadvantages of each approach. We compare Hadoop Streaming alongside our own streaming framework, MARISSA, to show the performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system-based data stores. Our experiments also include Hadoop-C*, a setup in which a Hadoop cluster is co-located with a Cassandra cluster in order to process data using Hadoop with non-Java executables.
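
To illustrate the Hadoop Streaming contract mentioned above, where any executable that reads lines from standard input and writes key/value lines to standard output can act as a map or reduce operation, here is a minimal pair of word-count functions in Python; this is a generic sketch of the streaming protocol, not code from MARISSA or the paper.

    # run as: script.py map    or    script.py reduce
    import sys

    def mapper():
        # Streaming map: read raw lines from stdin, emit "word\t1" lines.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Streaming reduce: input arrives sorted by key, so counts for the
        # same word are adjacent and can be summed with a running total.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()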

33 citations


Journal ArticleDOI
TL;DR: The comparative studies and empirical evaluations performed in this paper reveal Hama’s potential and efficacy in big data applications and show that the performance of Hama is better than Giraph in terms of scalability and computational speed.
Abstract: In today's highly intertwined network society, the demand for big data processing frameworks is continuously growing. The widely adopted model to process big data is parallel and distributed computing. This paper documents the significant progress achieved in the field of distributed computing frameworks, particularly Apache Hama, a top-level project under the Apache Software Foundation, based on bulk synchronous parallel processing. The comparative studies and empirical evaluations performed in this paper reveal Hama's potential and efficacy in big data applications. In particular, we present a benchmark evaluation of Hama's graph package and Apache Giraph using the PageRank algorithm. The results show that the performance of Hama is better than that of Giraph in terms of scalability and computational speed. However, despite great progress, a number of challenging issues continue to prevent Hama from reaching its full potential at large scale. This paper also describes these challenges, analyzes solutions proposed to overcome them, and highlights research opportunities.
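
As a rough illustration of the bulk synchronous parallel (BSP) style that Hama and Giraph build on, the sketch below runs PageRank as a sequence of supersteps in which every vertex sends its rank share to its neighbours and a global barrier separates the steps; it is a single-process simulation of the model, not Hama's or Giraph's API.

    # Tiny directed graph: vertex -> list of out-neighbours.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

    DAMPING, SUPERSTEPS = 0.85, 20
    rank = {v: 1.0 / len(graph) for v in graph}

    for _ in range(SUPERSTEPS):               # each iteration = one superstep
        messages = {v: 0.0 for v in graph}
        for v, neighbours in graph.items():   # "compute" phase: send rank shares
            share = rank[v] / len(neighbours)
            for n in neighbours:
                messages[n] += share
        # Implicit global barrier: all messages delivered before the update.
        rank = {v: (1 - DAMPING) / len(graph) + DAMPING * messages[v]
                for v in graph}

    print({v: round(r, 4) for v, r in sorted(rank.items())})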

33 citations


Journal ArticleDOI
TL;DR: An analytical suboptimal upper bound is devised for the proposed data-aware work-stealing technique, which optimizes both load balancing and data locality; results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution.
Abstract: Data-driven programming models such as many-task computing (MTC) have been prevalent for running data-intensive scientific applications. MTC applies over-decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data-intensive applications. Our previous research proposed a data-aware work-stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks are organized in queues based on the input data size and location. A distributed key-value store is used to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.
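
The sketch below illustrates the general shape of the queue organization described above: each scheduler keeps a dedicated queue for tasks whose data is local and a shared queue from which idle peers may steal. The size threshold and the random victim selection are illustrative assumptions, not the MATRIX implementation.

    import random
    from collections import deque

    class Scheduler:
        def __init__(self, node_id):
            self.node_id = node_id
            self.dedicated = deque()   # tasks whose input data lives on this node
            self.shared = deque()      # tasks other schedulers are allowed to steal

        def submit(self, task, data_location, data_size, small_threshold=64):
            # Small tasks are cheap to move, so expose them for stealing;
            # large, local tasks stay in the dedicated queue for locality.
            if data_location == self.node_id and data_size > small_threshold:
                self.dedicated.append(task)
            else:
                self.shared.append(task)

        def next_task(self, peers):
            if self.dedicated:
                return self.dedicated.popleft()
            if self.shared:
                return self.shared.popleft()
            # Idle: try to steal from a randomly chosen peer's shared queue.
            victim = random.choice(peers)
            return victim.shared.popleft() if victim.shared else None

    # Usage sketch: two schedulers, one loaded with small (stealable) tasks.
    s0, s1 = Scheduler(0), Scheduler(1)
    for i in range(4):
        s0.submit(f"task{i}", data_location=0, data_size=8)
    print(s1.next_task(peers=[s0]))   # the idle scheduler steals task0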

31 citations


Journal ArticleDOI
22 Jun 2016
TL;DR: The definition, characteristics, service models and deployment models of cloud computing, a computing paradigm suitable for handling large-scale data, are introduced and some security challenges that cloud computing faces are presented.
Abstract: In the 21st century, the era of big data has arrived as the world enters an era of data explosion. Dealing with such varied and massive data quickly and efficiently requires a large-scale parallel processing system, such as cloud computing. Cloud computing is a computing paradigm that is not only a combination of distributed computing, parallel computing and grid computing, but is also suitable for handling large-scale data. This paper mainly introduces the definition, characteristics, service models and deployment models of cloud computing. Afterwards, some security challenges that cloud computing faces are presented. Finally, a brief summary and outlook of cloud computing are given.


Journal ArticleDOI
TL;DR: FP-Hadoop is presented, a Hadoop-based system that renders the reduce side of MapReduce more parallel by efficiently tackling the problem of reduce data skew, and introduces a new phase, denoted intermediate reduce (IR), where blocks of intermediate values are processed by intermediate reduce workers in parallel.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: Experimental results show that, in spite of a predictable performance loss in virtualized environments with respect to the native one, it is still worthwhile to execute Hadoop in a small cloud.
Abstract: Cloud computing is a convenient model for easily accessing large amounts of computing resources in order to implement platforms for data-intensive applications. These platforms, such as Hadoop, are designed to run on large clusters. When the amount of computing and networking resources is limited, as in the emergent paradigm of edge computing, maximizing their utilization is of paramount importance. In this paper we investigate the performance of a benchmark suite for Hadoop, running on both physical and virtual infrastructure in a testbed representative of an edge computing deployment. Experimental results show that, in spite of a predictable performance loss in virtualized environments with respect to the native one, it is still worthwhile to execute Hadoop in a small cloud. This could be useful for pre-processing data coming from sensors and/or mobile devices before sending it to a central cloud for further analysis.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: The MaPU architecture is presented, a novel architecture suitable for data-intensive computing with great power efficiency and sustained computation throughput, which improves actual power efficiency by an order of magnitude compared with traditional CPUs and GPGPUs.
Abstract: As the feature size of the semiconductor process scales down to 10nm and below, it is possible to assemble systems with high-performance processors that can theoretically provide computational power of up to tens of PFLOPS. However, the power consumption of these systems is also rocketing up to tens of millions of watts, and the actual performance is only around 60% of the theoretical performance. Today, power efficiency and sustained performance have become the main foci of processor designers. Traditional computing architectures such as superscalar and GPGPU have proven to be power-inefficient, and there is a big gap between the actual and peak performance. In this paper, we present the MaPU architecture, a novel architecture which is suitable for data-intensive computing with great power efficiency and sustained computation throughput. To achieve this goal, MaPU attempts to optimize the application from a system perspective, including the hardware, algorithm and corresponding programming model. It uses an innovative multi-granularity parallel memory system with intrinsic shuffle ability, cascading pipelines with wide SIMD data paths and a state-machine-based programming model. When executing typical signal processing algorithms, a single MaPU core implemented with a 40nm process exhibits a sustained performance of 134 GFLOPS while consuming only 2.8 W of power, which improves actual power efficiency by an order of magnitude compared with traditional CPUs and GPGPUs.

Journal ArticleDOI
TL;DR: A novel recommendation system using a collaborative filtering algorithm is implemented in Apache Hadoop, leveraging the MapReduce paradigm for big data and resulting in a significant improvement in performance compared to conventional tools.
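
The TL;DR above does not detail the algorithm, so purely as an illustration, here is an item co-occurrence step (a common building block of collaborative filtering) written in map/reduce style in plain Python; the user/item data is made up and this is not the system described in the paper.

    from collections import defaultdict
    from itertools import combinations, chain

    # Hypothetical (user, item) interactions.
    ratings = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C")]

    def map_user(user, items):
        # Map: for each user, emit every pair of items they both interacted with.
        for a, b in combinations(sorted(items), 2):
            yield ((a, b), 1)

    def reduce_pair(pair, counts):
        # Reduce: the co-occurrence count is a simple sum per item pair.
        return pair, sum(counts)

    by_user = defaultdict(list)
    for user, item in ratings:
        by_user[user].append(item)

    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_user(u, i) for u, i in by_user.items()):
        grouped[key].append(value)

    cooccurrence = dict(reduce_pair(k, v) for k, v in grouped.items())
    print(cooccurrence)   # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}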

Proceedings ArticleDOI
K. R. Krish1, Bharti Wadhwa1, M. Safdar Iqbal1, M. Mustafa Rafique2, Ali R. Butt1 
16 May 2016
TL;DR: DUX is presented, an application-attuned dynamic data management system for data processing frameworks, which aims to improve overall application I/O throughput by efficiently using SSDs only for workloads that are expected to benefit from them rather than the extant approach of storing a fraction of the overall workloads in SSDs.
Abstract: A promising trend in storage management for big data frameworks, such as Hadoop and Spark, is the emergence of heterogeneous and hybrid storage systems that employ different types of storage devices, e.g. SSDs, RAMDisks, etc., alongside traditional HDDs. However, scheduling data accesses or requests to an appropriate storage device is non-trivial and depends on several factors such as data locality, device performance, and application compute and storage resources utilization. To this end, we present Dux, an application-attuned dynamic data management system for data processing frameworks, which aims to improve overall application I/O throughput by efficiently using SSDs only for workloads that are expected to benefit from them rather than the extant approach of storing a fraction of the overall workloads in SSDs. The novelty of Dux lies in profiling application performance on SSDs and HDDs, analyzing the resulting I/O behavior, and considering the available SSDs at runtime to dynamically place data in an appropriate storage tier. Evaluation of Dux with trace-driven simulations using synthetic Facebook workloads shows that even when using 5.5× fewer SSDs compared to a SSD-only solution, Dux incurs only a small (5%) performance overhead, and thus offers an affordable and efficient storage tier management.
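
The sketch below captures the general idea of profile-driven tier placement: given measured throughput of a workload on SSD and HDD and the SSD capacity currently available, place data on the SSD tier only when the profiled speedup justifies it. The profiling numbers, threshold, and greedy ranking are illustrative assumptions, not Dux's actual policy.

    # Hypothetical profiles: workload -> (throughput on SSD, on HDD), in MB/s.
    profiles = {
        "sort":      (450.0, 120.0),   # I/O-bound: big benefit from SSD
        "wordcount": (140.0, 115.0),   # CPU-bound: little benefit
        "join":      (380.0, 100.0),
    }
    dataset_size_gb = {"sort": 200, "wordcount": 50, "join": 300}

    def place(profiles, sizes, ssd_capacity_gb, min_speedup=2.0):
        # Greedily place the workloads that profit most from SSDs while
        # capacity remains; everything else stays on HDD.
        placement, free = {}, ssd_capacity_gb
        ranked = sorted(profiles, key=lambda w: profiles[w][0] / profiles[w][1],
                        reverse=True)
        for w in ranked:
            speedup = profiles[w][0] / profiles[w][1]
            if speedup >= min_speedup and sizes[w] <= free:
                placement[w], free = "ssd", free - sizes[w]
            else:
                placement[w] = "hdd"
        return placement

    print(place(profiles, dataset_size_gb, ssd_capacity_gb=250))
    # {'join': 'hdd', 'sort': 'ssd', 'wordcount': 'hdd'}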

Journal ArticleDOI
TL;DR: This work proposes ActiveSort, a novel mechanism to improve the external sorting algorithm using the concept of active SSDs, which reduces the amount of I/O transfer and improves the performance of external sorting in Hadoop.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper presents cloud computing, cloud computing architecture, virtualization, load balancing, challenges and various currently available load balancing algorithms for cloud computing.
Abstract: Cloud computing is the latest emerging technology for large-scale distributed and parallel computing. It provides a large pool of shared resources, software packages, information, storage and many different applications on demand, at any instant of time. Cloud computing is evolving quickly, and a large number of users are attracted to cloud services. Balancing the load has become an interesting research area in this field. A better load-balancing algorithm in a cloud system increases performance and resource utilization by dynamically distributing the workload among the various nodes in the system. This paper presents cloud computing, cloud computing architecture, virtualization, load balancing, its challenges and various currently available load-balancing algorithms.

Proceedings ArticleDOI
20 Jul 2016
TL;DR: The design and the implementation details of the framework that supports three different models for parallel GAs, namely the global model, the grid model and the island model are described and a complete example of use is provided.
Abstract: elephant56 is an open source framework for the development and execution of single and parallel Genetic Algorithms (GAs). It provides high level functionalities that can be reused by developers, who no longer need to worry about complex internal structures. In particular, it offers the possibility of distributing the GAs computation over a Hadoop MapReduce cluster of multiple computers. In this paper we describe the design and the implementation details of the framework that supports three different models for parallel GAs, namely the global model, the grid model and the island model. Moreover, we provide a complete example of use.
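
To give a flavour of one of the three models named above, the sketch below runs an island-model GA on a toy problem (maximising the number of ones in a bit string): several sub-populations evolve independently and periodically exchange their best individuals. It is a plain-Python illustration of the model, not elephant56's API or its MapReduce mapping.

    import random

    def fitness(ind):                 # toy objective: count of 1-bits
        return sum(ind)

    def evolve(pop, generations=10, mutation=0.05):
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: len(pop) // 2]
            children = []
            while len(children) < len(pop):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(a))          # one-point crossover
                child = a[:cut] + b[cut:]
                child = [bit ^ (random.random() < mutation) for bit in child]
                children.append(child)
            pop = children
        return pop

    def island_model(islands=4, pop_size=20, length=32, epochs=5):
        pops = [[[random.randint(0, 1) for _ in range(length)]
                 for _ in range(pop_size)] for _ in range(islands)]
        for _ in range(epochs):
            pops = [evolve(p) for p in pops]               # independent evolution
            # Migration: each island sends its best individual to the next one,
            # replacing that island's worst individual.
            best = [max(p, key=fitness) for p in pops]
            for i, p in enumerate(pops):
                p[p.index(min(p, key=fitness))] = best[(i - 1) % islands]
        return max((max(p, key=fitness) for p in pops), key=fitness)

    random.seed(0)
    print(fitness(island_model()))    # close to 32 after a few epochs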

Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper suggests a Big Data representation for grade analytics in an educational context using Hadoop MapReduce, a software framework for computing over large amounts of data.
Abstract: Big Data is a large dataset displaying the features of volume, velocity and variety in an OR relationship. Big Data as a large dataset is of no significance if it cannot be subjected to strategic analysis and utilization. There are many software and hardware solutions available in the technological landscape that enable capturing, storing and subsequently analyzing Big Data; Hadoop and its associated technologies are one of them. Hadoop is a software framework for computing over large amounts of data. It is made up of four main modules: Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Hadoop MapReduce divides a large problem into smaller sub-problems under the control of the JobTracker. This paper suggests a Big Data representation for grade analytics in an educational context. The study and the experiments can be implemented in R or on AWS, the cloud infrastructure provided by Amazon.
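
As a concrete illustration of what grade analytics looks like in the MapReduce style described above, the sketch below computes the average grade per course from (student, course, grade) records in plain Python; the records are made up and this shows only the programming model, not the paper's implementation.

    from collections import defaultdict

    # Hypothetical (student, course, grade) records.
    records = [("s1", "math", 78), ("s2", "math", 91),
               ("s1", "physics", 65), ("s3", "math", 84)]

    def map_record(student, course, grade):
        # Map: key by course, emit the grade as the value.
        yield (course, grade)

    def reduce_course(course, grades):
        # Reduce: aggregate all grades of one course into an average.
        return course, sum(grades) / len(grades)

    grouped = defaultdict(list)
    for record in records:
        for course, grade in map_record(*record):
            grouped[course].append(grade)

    averages = dict(reduce_course(c, g) for c, g in grouped.items())
    print(averages)   # {'math': 84.33..., 'physics': 65.0}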

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes near-join, a network-aware redistribution approach that aims to efficiently reduce both the network traffic and the communication time of join executions; it is lightweight and adaptable to processing large datasets over large systems.
Abstract: The performance of parallel data analytics systems becomes increasingly important with the rise of Big Data. An essential operation in such environments is the parallel join, which always incurs significant network communication cost. State-of-the-art approaches have achieved performance improvements over conventional implementations by minimizing network traffic or communication time. However, these approaches still face performance issues in the presence of big data and/or large-scale systems, due to their heavy overhead of data redistribution scheduling. In this paper, we propose near-join, a network-aware redistribution approach that aims to efficiently reduce both the network traffic and the communication time of join executions. In particular, near-join is lightweight and adaptable to processing large datasets over large systems. We present the details of our algorithm and its implementation. Experiments performed on a cluster of up to 400 nodes and datasets of about 100GB demonstrate that our scheduling algorithm is much faster than state-of-the-art methods. Moreover, our join implementation can also achieve speedups over the conventional approaches.
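
For context, the baseline that such redistribution schemes improve on is the conventional hash-repartition join, sketched below: both relations are partitioned by the hash of the join key so that matching tuples land on the same worker, which then performs a local join. This is the textbook baseline, not the near-join algorithm itself.

    from collections import defaultdict

    WORKERS = 3

    def repartition(relation, key_index):
        # Redistribute tuples so that equal join keys map to the same worker.
        parts = defaultdict(list)
        for row in relation:
            parts[hash(row[key_index]) % WORKERS].append(row)
        return parts

    def local_join(r_part, s_part):
        # Classic hash join executed independently on each worker.
        index = defaultdict(list)
        for r in r_part:
            index[r[0]].append(r)
        return [r + s[1:] for s in s_part for r in index.get(s[0], [])]

    R = [(1, "alice"), (2, "bob"), (3, "carol")]          # (key, payload)
    S = [(1, "eng"), (3, "math"), (3, "cs"), (4, "bio")]  # (key, payload)

    r_parts, s_parts = repartition(R, 0), repartition(S, 0)
    result = [row for w in range(WORKERS)
              for row in local_join(r_parts[w], s_parts[w])]
    print(sorted(result))
    # [(1, 'alice', 'eng'), (3, 'carol', 'cs'), (3, 'carol', 'math')]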

Journal ArticleDOI
TL;DR: A new scheme is introduced to aid the scheduler in identifying the nodes on which stragglers can be executed; it makes use of the resource utilization and network information of cluster nodes to find the most suitable node for scheduling the speculative copy of a slow task.

Journal ArticleDOI
TL;DR: The authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop, and illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure.
Abstract: The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions, and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and the possible issues for improvement, which could be challenging research opportunities for academic researchers.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: The authors draw a vision of a comprehensive distributed computing system, show where existing frameworks fall short in dealing with the heterogeneity of distributed computing, and present the Tasklet system, an approach to a distributed computing framework that tackles the different dimensions of heterogeneity.
Abstract: Distributed computing is a good alternative to expensive supercomputers. There are plenty of frameworks that enable programmers to harvest remote computing power. However, until today, much computation power at the edges of the Internet remains unused. While idle devices could contribute to a distributed environment as generic computation resources, computation-intensive applications could use this pool of resources to enhance their execution quality. In this paper, we identify heterogeneity as a major burden for distributed and edge computing. Heterogeneity is present in multiple forms. We draw our vision of a comprehensive distributed computing system and show where existing frameworks fall short in dealing with the heterogeneity of distributed computing. Afterwards, we present the Tasklet system, our approach to a distributed computing framework. Tasklets are fine-grained computation units that can be issued for remote and local execution. We tackle the different dimensions of heterogeneity and show how to make use of available computation power in edge resources. In our prototype, we use middleware and virtualization technologies as well as a host language concept.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper presents a survey of previous work on Hadoop MapReduce scheduling and gives some suggestions for its improvement.
Abstract: The big data computing era has become a fact of daily life. As data-intensive computing becomes a reality in many scientific fields, finding an efficient strategy for massive data computing systems has become a multi-objective improvement problem. Processing such huge data on distributed hardware clusters such as clouds requires a powerful computation model like Hadoop MapReduce. In this paper, we study various schedulers developed for Hadoop in cloud environments, along with their features and issues. Most existing studies consider performance improvement from a single point of view (scheduling, data locality, correctness of the data, etc.), but very little of the literature involves multi-objective improvements (quality requirements, scheduling entities, and dynamic environment adaptation), especially in heterogeneous parallel and distributed systems. Hadoop and MapReduce are two important aspects of big data for handling structured and unstructured data. The creation of an algorithm for node selection is essential to improve and optimize the performance of MapReduce. This paper presents a survey of previous work on Hadoop MapReduce scheduling and gives some suggestions for its improvement.

Proceedings ArticleDOI
06 Dec 2016
TL;DR: This work developed a Hierarchical Hadoop Framework (H2F) specifically designed to work on geo-distributed data and compares the performance of H2F with that of a plain Hadoop implementation.
Abstract: Big data analysis requires adequate infrastructure and programming paradigms capable of processing large amounts of data. Hadoop, the best-known open-source implementation of the MapReduce paradigm, is widely employed in big data analysis frameworks. However, in many recent application scenarios data are natively distributed over different geographic regions, in data centers that are inter-connected through network links with much lower bandwidth than those of the computing environments where Hadoop deployments are traditionally supposed to work. In such a context, Hadoop applications perform very poorly. To cope with these issues, we developed a Hierarchical Hadoop Framework (H2F) specifically designed to work on geo-distributed data. In this work, we compare the performance of H2F with that of a plain Hadoop implementation. First results show that, for very large amounts of data, the H2F solution performs better than plain Hadoop.
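
A hierarchical arrangement of MapReduce can be pictured as the two-level aggregation below: each site reduces its local data first, and only the small per-site partial results cross the slow inter-site links to a global reduce. This plain-Python sketch illustrates that idea only and is not the H2F implementation.

    from collections import Counter

    # Hypothetical data natively stored at three geographic sites.
    sites = {
        "eu":   ["error warn info info", "warn warn"],
        "us":   ["info error", "error error info"],
        "asia": ["warn info"],
    }

    def site_job(lines):
        # Top-level "map": run a complete local word count inside one site,
        # so only the compact partial result travels over the WAN.
        local = Counter()
        for line in lines:
            local.update(line.split())
        return local

    def global_reduce(partials):
        # Top-level "reduce": merge the per-site partial counts.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    partials = [site_job(lines) for lines in sites.values()]
    print(dict(global_reduce(partials)))
    # {'error': 4, 'warn': 4, 'info': 5}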

Journal ArticleDOI
TL;DR: Numerical experiments show that, compared to before optimization, the optimization algorithm can resolve the load imbalance of METGRID, and the computation speed of the METGRID and REAL modules after optimization on 64 CPU cores is about 7.2 times faster than before.
Abstract: Load imbalance is a common problem that needs to be tackled urgently in large-scale data-driven simulation systems and data-intensive computing. Through the coupler, the Chinese Academy of Sciences Earth System Model (CAS-ESM) implements one-way nesting of the Institute of Atmospheric Physics of the Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 (IAP AGCM4.0) and the Weather Research and Forecasting (WRF) model. The METGRID (meteorological grid) and REAL program modules in WRF are used to process meteorological data. In the CAS-ESM, the load of the METGRID module is seriously unbalanced across many CPU cores. The load imbalance has a serious impact on the processing speed of meteorological data, so this study designs an optimization algorithm to solve the problem. Numerical experiments show that, compared to before optimization, the optimization algorithm can resolve the load imbalance of METGRID, and the computation speed of the METGRID and REAL modules after optimization on 64 CPU cores is about 7.2 times faster than before. Meanwhile, the overall computation speed of the CAS-ESM improves by 217.53%. In addition, results indicate that a similar speedup can be reached on different numbers of CPU cores. Copyright © 2016 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A software framework for individual virtual machines to execute a MapReduce application in a parallel/collaborative way without the necessity of installing a middleware or specific software package for system management is developed.
Abstract: Cloud Computing introduces a novel computing paradigm that allows users to run their applications in a customized environment using on-demand resources. This novel computing concept is enabled by several technologies including the Web, virtualization, distributed file systems and parallel programming models. For parallel computing on the Cloud, MapReduce is currently the first choice for Cloud providers to deliver data analysis services, because this model is specially designed for data-intensive applications, while a Cloud centre is actually also a data centre hosting a huge amount of data, usually at petascale. The current deployment of MapReduce on the Cloud, however, follows the traditional execution model of MapReduce, which needs the support of a cluster manager. This means that the individual virtual machines created on the Cloud have to be organized into a cluster in order to be capable of running a MapReduce application. This is not only a burden for system management but also prohibits inter-Cloud computing, which can involve the resources of different Clouds to solve large problems with big or distributed data. We developed a software framework that enables individual virtual machines to execute a MapReduce application in a parallel/collaborative way without the need to install middleware or a specific software package for system management. A focus of this research work is a Single-Sign-On (SSON) mechanism that enables remote access to the individual machines. We validated the SSON mechanism together with the entire MapReduce framework using a private Cloud. Experimental results show both the functionality and the feasibility of our approach.

Proceedings Article
16 Mar 2016
TL;DR: An integrated approach is introduced to encrypt and decrypt data before sending it to the cloud; to achieve better performance and security, performance analysis of different techniques can be carried out based on different parameters.
Abstract: Big data refers to data that is too large and complex to be processed with standard tools. Big data handles voluminous amounts of structured, semi-structured and unstructured data, and is characterized by the volume, velocity and variety of the data. It combines historic data with present data to predict outcomes. In this regard, providing security for these data is a challenging task. Apache Hadoop is one of the tools designed to handle big data; Apache Hadoop, along with other software products, is used to process and interpret big data. Hadoop includes main components such as MapReduce and HDFS for handling big data. Cloud computing is the technology that provides online data storage, but here providing security is the key issue. In this paper, an integrated approach is introduced to encrypt and decrypt data before sending it to the cloud. To achieve better performance and security, performance analysis of different techniques can be carried out based on different parameters.
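
The abstract does not specify the cipher, so as a generic illustration of "encrypt before sending to the cloud", here is a sketch using symmetric (Fernet, AES-based) encryption from the widely used Python cryptography package; the sample data is a placeholder and this is not the integrated approach proposed in the paper.

    from cryptography.fernet import Fernet   # pip install cryptography

    # The key must stay with the data owner; only ciphertext goes to the cloud.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    plaintext = b"student records destined for cloud storage"  # placeholder data

    token = cipher.encrypt(plaintext)       # upload `token` to the cloud store
    # ... later, after downloading the ciphertext back ...
    restored = cipher.decrypt(token)

    assert restored == plaintext
    print(len(plaintext), "bytes in,", len(token), "bytes of ciphertext out")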