
Showing papers on "Data-intensive computing published in 2016"


Journal ArticleDOI
TL;DR: The results show that the classification performance of the MapReduce-based procedure is very stable regardless of how many compute nodes are used, and better than the baseline single-machine and distributed procedures except on the class-imbalanced dataset.

69 citations


Journal ArticleDOI
TL;DR: This review assesses MapReduce to help researchers better understand the novel optimizations that have been proposed to address its limitations, and suggests several research directions that should be pursued in the future.
Abstract: With the development of information technologies, we have entered the era of Big Data. Google's MapReduce programming model and its open-source implementation in Apache Hadoop have become the dominant model for data-intensive processing because of their simplicity, scalability, and fault tolerance. However, several inherent limitations, such as the lack of efficient scheduling and iterative computing mechanisms, seriously affect the efficiency and flexibility of MapReduce. To date, various approaches have been proposed to extend the MapReduce model and improve runtime efficiency for different scenarios. In this review, we assess MapReduce to help researchers better understand the novel optimizations that have been proposed to address its limitations. We first present the basic idea underlying the MapReduce paradigm and describe several widely used open-source runtime systems. We then discuss the main shortcomings of the original MapReduce. We also review the MapReduce optimization approaches that have recently been put forward, and categorize them according to their characteristics and capabilities. Finally, we conclude the paper and suggest several research directions for future work.
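
To make the basic idea of the paradigm concrete, here is a minimal, self-contained sketch of the map/shuffle/reduce flow on a word-count example; it runs as plain Python on a single machine and illustrates only the programming model, not the Hadoop API.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map: emit (key, value) pairs -- here, (word, 1) for every word.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group all values by key, as the framework does between
        # the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate the values of one key -- here, sum the counts.
        return key, sum(values)

    documents = ["big data needs parallel processing",
                 "mapreduce makes parallel processing simple"]
    grouped = shuffle(chain.from_iterable(map_phase(d) for d in documents))
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)   # e.g. {'parallel': 2, 'processing': 2, ...}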

64 citations


Journal ArticleDOI
01 Jul 2016
TL;DR: This paper develops a model-driven optimization that serves as an oracle, providing high-level insights and applies these insights to design cross-phase optimization techniques that are implemented and demonstrated in a real-world MapReduce implementation.
Abstract: MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce across geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach: We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the potential of these techniques as performance is improved by 7-18 percent depending on the execution environment and application.

59 citations


Journal ArticleDOI
TL;DR: Pegasus workflows are portable across different infrastructures, optimizable for performance and efficiency, and automatically map to many different storage systems and data flows, making Pegasus a powerful solution for executing scientific workflows in the cloud.
Abstract: The Pegasus Workflow Management System maps abstract, resource-independent workflow descriptions onto distributed computing resources. As a result of this planning process, Pegasus workflows are portable across different infrastructures, optimizable for performance and efficiency, and automatically map to many different storage systems and data flows. This approach makes Pegasus a powerful solution for executing scientific workflows in the cloud.
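
As a very rough illustration of what "mapping an abstract, resource-independent workflow onto concrete resources" can look like in the simplest case, the sketch below plans a tiny three-task DAG onto hypothetical execution sites. The data structures and the round-robin site assignment are illustrative assumptions only, not the Pegasus API or its planning algorithm.

    from graphlib import TopologicalSorter   # Python 3.9+

    # Abstract workflow: task -> set of tasks it depends on (a DAG).
    abstract_workflow = {
        "extract":   set(),
        "transform": {"extract"},
        "load":      {"transform"},
    }

    # Hypothetical execution sites discovered at planning time.
    sites = ["cluster_a", "cloud_b"]

    def plan(workflow, sites):
        # Turn the abstract DAG into an ordered, concrete plan by assigning
        # each task to a site (simple round-robin, purely illustrative).
        order = list(TopologicalSorter(workflow).static_order())
        return [(task, sites[i % len(sites)]) for i, task in enumerate(order)]

    for task, site in plan(abstract_workflow, sites):
        print(f"run {task!r} on {site}")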

56 citations


Journal ArticleDOI
TL;DR: This paper surveys data management and replication approaches developed by both industrial and research communities from 2007 to 2011, discussing and characterizing the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency.
Abstract: As we delve deeper into the 'Digital Age', we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created on a daily basis, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, software logs, etc. Such 'data explosions' have led to one of the most challenging research issues of the current Information and Communication Technology era: how to optimally manage (e.g., store, replicate, filter, and the like) such large amounts of data and identify new ways to analyze them for unlocking information. It is clear that such large data streams cannot be managed by setting up on-premises enterprise database systems, as this leads to a large up-front cost in buying and administering the hardware and software systems. Therefore, next-generation data management systems must be deployed on the cloud. The cloud computing paradigm provides scalable and elastic resources, such as data and services, accessible over the Internet. Every cloud service provider must ensure that data is efficiently processed and distributed in a way that does not compromise end-users' Quality of Service (QoS) in terms of data availability, data search delay, data analysis delay, and the like. In this perspective, data replication is used in the cloud to improve the performance (e.g., read and write delay) of applications that access data. Through replication, a data-intensive application or system can achieve high availability, better fault tolerance, and data recovery. In this paper, we survey data management and replication approaches (from 2007 to 2011) that were developed by both industrial and research communities. The focus of the survey is to discuss and characterize the existing approaches to data replication and management that tackle resource usage and QoS provisioning with different levels of efficiency. Moreover, the breakdown of both influential expressions (data replication and management) to provide different QoS attributes is deliberated. Furthermore, the performance advantages and disadvantages of data replication and management approaches in cloud computing environments are analyzed. Open issues and future challenges related to data consistency, scalability, load balancing, processing and placement are also reported.
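
As a back-of-the-envelope illustration of why replication improves availability (one of the QoS attributes discussed above), the snippet below computes the probability that at least one replica of an object is reachable, assuming independent node failures; the failure probability and replication factors are made-up numbers, not values from the survey.

    def availability(node_failure_prob, replicas):
        # An object is unreachable only if every replica is down
        # (assuming independent failures), hence 1 - p**r.
        return 1 - node_failure_prob ** replicas

    p = 0.05  # assumed probability that a single node is unavailable
    for r in (1, 2, 3):
        print(f"replication factor {r}: availability = {availability(p, r):.6f}")
    # e.g. r=3 gives 1 - 0.05**3 = 0.999875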

44 citations


Proceedings ArticleDOI
18 Mar 2016
TL;DR: This paper describes how Hadoop MapReduce works and the Spark architecture, together with the kinds of operations each supports, and shows the differences between Hadoop MapReduce and Spark in the Map and Reduce phases individually.
Abstract: Hadoop MapReduce is used to analyze large volumes of data across multiple nodes in parallel. MapReduce consists of two functions, Map and Reduce, and the data is stored in HDFS. MapReduce lacks certain facilities, so Spark was designed to handle real-time stream data and to answer queries quickly. Spark jobs operate on Resilient Distributed Datasets (RDDs) and a directed acyclic graph execution engine. In this paper, we describe how Hadoop MapReduce works and the Spark architecture, together with the kinds of operations each supports. We also show the differences between Hadoop MapReduce and Spark in the Map and Reduce phases individually.
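
As a point of comparison between the two systems described above, here is a word count expressed against Spark's RDD API: in Hadoop MapReduce the Map and Reduce phases are separate functions of a job, whereas in Spark they are chained transformations on an RDD that the DAG engine executes lazily. The input path is a placeholder and a running Spark installation is assumed.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    counts = (sc.textFile("hdfs:///path/to/input")      # placeholder path
                .flatMap(lambda line: line.split())     # map side: split into words
                .map(lambda word: (word, 1))            # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))       # reduce side: sum the counts

    for word, count in counts.take(10):                 # fetch a few results
        print(word, count)

    sc.stop()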

42 citations


Journal ArticleDOI
TL;DR: The experiments suggest that such ISC-augmented systems can provide a very promising computing model in terms of system scalability, along with a remarkable performance gain compared to a typical Hadoop MapReduce system.
Abstract: Solid State Drives (SSDs) were initially developed as faster storage devices intended to replace conventional magnetic Hard Disk Drives (HDDs). However, high computational capabilities enable SSDs to be computing nodes, not just faster storage devices. Such capability is generally called "In-Storage Computing" (ISC). Today's Hadoop MapReduce framework has become a de facto standard for big data processing. This paper explores In-Storage Computing challenges and opportunities for the Hadoop MapReduce framework. For this, we integrate a Hadoop MapReduce system with ISC SSD devices that implement the Hadoop Mapper inside real SSD firmware. This offloads Map tasks from the host MapReduce system to the ISC SSDs. We additionally optimize the host Hadoop system to make the best use of our proposed ISC Hadoop system. Experimental results demonstrate that our ISC Hadoop MapReduce system achieves a remarkable performance gain (2.3× faster) as well as significant energy savings (11.5× lower) compared to a typical Hadoop MapReduce system. Further, the experiments suggest that such ISC-augmented systems can provide a very promising computing model in terms of system scalability.

33 citations


Journal ArticleDOI
TL;DR: This work compares Hadoop Streaming alongside its own streaming framework, MARISSA, to show performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system based data stores.
Abstract: The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development and research interest in the NoSQL model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and costly. As the quantity and quality of unstructured data grows, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a “Big Data” processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java, while the Hadoop Streaming module allows users to define non-Java executables as map and reduce operations. When confronted with legacy C/C++ applications and other non-Java executables, there is a further need to allow NoSQL data stores access to the features of Hadoop Streaming. We present approaches to solving the challenge of integrating NoSQL data stores with MapReduce under non-Java application scenarios, along with the advantages and disadvantages of each approach. We compare Hadoop Streaming alongside our own streaming framework, MARISSA, to show the performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system-based data stores. Our experiments also include Hadoop-C*, a setup in which a Hadoop cluster is co-located with a Cassandra cluster in order to process data using Hadoop with non-Java executables.
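
To illustrate the Hadoop Streaming contract mentioned above, where any executable that reads lines from standard input and writes key/value lines to standard output can act as a map or reduce operation, here is a minimal pair of word-count functions in Python; this is a generic sketch of the streaming protocol, not code from MARISSA or the paper.

    # run as: script.py map    or    script.py reduce
    import sys

    def mapper():
        # Streaming map: read raw lines from stdin, emit "word\t1" lines.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Streaming reduce: input arrives sorted by key, so counts for the
        # same word are adjacent and can be summed with a running total.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()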

33 citations


Journal ArticleDOI
TL;DR: The comparative studies and empirical evaluations performed in this paper reveal Hama’s potential and efficacy in big data applications and show that the performance of Hama is better than Giraph in terms of scalability and computational speed.
Abstract: In today's highly intertwined network society, the demand for big data processing frameworks is continuously growing. The widely adopted model to process big data is parallel and distributed computing. This paper documents the significant progress achieved in the field of distributed computing frameworks, particularly Apache Hama, a top-level project under the Apache Software Foundation, based on bulk synchronous parallel processing. The comparative studies and empirical evaluations performed in this paper reveal Hama's potential and efficacy in big data applications. In particular, we present a benchmark evaluation of Hama's graph package and Apache Giraph using the PageRank algorithm. The results show that the performance of Hama is better than that of Giraph in terms of scalability and computational speed. However, despite great progress, a number of challenging issues continue to prevent Hama from reaching its full potential at large scale. This paper also describes these challenges, analyzes solutions proposed to overcome them, and highlights research opportunities.
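
As a rough illustration of the bulk synchronous parallel (BSP) style that Hama and Giraph build on, the sketch below runs PageRank as a sequence of supersteps in which every vertex sends its rank share to its neighbours and a global barrier separates the steps; it is a single-process simulation of the model, not Hama's or Giraph's API.

    # Tiny directed graph: vertex -> list of out-neighbours.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

    DAMPING, SUPERSTEPS = 0.85, 20
    rank = {v: 1.0 / len(graph) for v in graph}

    for _ in range(SUPERSTEPS):               # each iteration = one superstep
        messages = {v: 0.0 for v in graph}
        for v, neighbours in graph.items():   # "compute" phase: send rank shares
            share = rank[v] / len(neighbours)
            for n in neighbours:
                messages[n] += share
        # Implicit global barrier: all messages delivered before the update.
        rank = {v: (1 - DAMPING) / len(graph) + DAMPING * messages[v]
                for v in graph}

    print({v: round(r, 4) for v, r in sorted(rank.items())})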

33 citations


Journal ArticleDOI
TL;DR: An analytical suboptimal upper bound is devised for the proposed data-aware work-stealing technique, which optimizes both load balancing and data locality; results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution.
Abstract: Data-driven programming models such as many-task computing (MTC) have been prevalent for running data-intensive scientific applications. MTC applies over-decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data-intensive applications. Our previous research proposed a data-aware work-stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks are organized in queues based on the input data size and location. A distributed key-value store is used to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.
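
The sketch below illustrates the general shape of the queue organization described above: each scheduler keeps a dedicated queue for tasks whose data is local and a shared queue from which idle peers may steal. The size threshold and the random victim selection are illustrative assumptions, not the MATRIX implementation.

    import random
    from collections import deque

    class Scheduler:
        def __init__(self, node_id):
            self.node_id = node_id
            self.dedicated = deque()   # tasks whose input data lives on this node
            self.shared = deque()      # tasks other schedulers are allowed to steal

        def submit(self, task, data_location, data_size, small_threshold=64):
            # Small tasks are cheap to move, so expose them for stealing;
            # large, local tasks stay in the dedicated queue for locality.
            if data_location == self.node_id and data_size > small_threshold:
                self.dedicated.append(task)
            else:
                self.shared.append(task)

        def next_task(self, peers):
            if self.dedicated:
                return self.dedicated.popleft()
            if self.shared:
                return self.shared.popleft()
            # Idle: try to steal from a randomly chosen peer's shared queue.
            victim = random.choice(peers)
            return victim.shared.popleft() if victim.shared else None

    # Usage sketch: two schedulers, one loaded with small (stealable) tasks.
    s0, s1 = Scheduler(0), Scheduler(1)
    for i in range(4):
        s0.submit(f"task{i}", data_location=0, data_size=8)
    print(s1.next_task(peers=[s0]))   # the idle scheduler steals task0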

31 citations


Journal ArticleDOI
22 Jun 2016
TL;DR: The definition, characteristics, service models and deployment models of cloud computing, a computing paradigm suitable for handling large-scale data, are introduced and some security challenges that cloud computing faces are presented.
Abstract: In the 21st century, the era of big data has arrived as the world enters an era of data explosion. Dealing with such varied and massive data quickly and efficiently requires a large-scale parallel processing system, such as cloud computing. Cloud computing is a computing paradigm that is not only a combination of distributed computing, parallel computing and grid computing, but is also suitable for handling large-scale data. This paper mainly introduces the definition, characteristics, service models and deployment models of cloud computing. Afterwards, some security challenges that cloud computing faces are presented. Finally, a brief summary and outlook of cloud computing are given.


Journal ArticleDOI
TL;DR: FP-Hadoop is presented, a Hadoop-based system that renders the reduce side of MapReduce more parallel by efficiently tackling the problem of reduce data skew, and introduces a new phase, denoted intermediate reduce (IR), where blocks of intermediate values are processed by intermediate reduce workers in parallel.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: Experimental results show that, in spite of a predictable performance loss in virtualized environments with respect to the native one, it is still worthwhile to execute Hadoop in a small cloud.
Abstract: Cloud computing is a convenient model for easily accessing large amounts of computing resources in order to implement platforms for data-intensive applications. These platforms, such as Hadoop, are designed to run on large clusters. When the amount of computing and networking resources is limited, as in the emergent paradigm of edge computing, maximizing their utilization is of paramount importance. In this paper we investigate the performance of a benchmark suite for Hadoop, running on both physical and virtual infrastructure in a testbed representative of an edge computing deployment. Experimental results show that, in spite of a predictable performance loss in virtualized environments with respect to the native one, it is still worthwhile to execute Hadoop in a small cloud. This could be useful for pre-processing data coming from sensors and/or mobile devices before sending it to a central cloud for further analysis.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: The MaPU architecture is presented, a novel architecture suitable for data-intensive computing with great power efficiency and sustained computation throughput, which improves actual power efficiency by an order of magnitude compared with traditional CPUs and GPGPUs.
Abstract: As the feature size of the semiconductor process scales down to 10nm and below, it is possible to assemble systems with high-performance processors that can theoretically provide computational power of up to tens of PFLOPS. However, the power consumption of these systems is also rocketing up to tens of millions of watts, and the actual performance is only around 60% of the theoretical performance. Today, power efficiency and sustained performance have become the main foci of processor designers. Traditional computing architectures such as superscalar and GPGPU have proven to be power-inefficient, and there is a big gap between the actual and peak performance. In this paper, we present the MaPU architecture, a novel architecture which is suitable for data-intensive computing with great power efficiency and sustained computation throughput. To achieve this goal, MaPU attempts to optimize the application from a system perspective, including the hardware, algorithm and corresponding programming model. It uses an innovative multi-granularity parallel memory system with intrinsic shuffle ability, cascading pipelines with wide SIMD data paths and a state-machine-based programming model. When executing typical signal processing algorithms, a single MaPU core implemented with a 40nm process exhibits a sustained performance of 134 GFLOPS while consuming only 2.8 W of power, which improves actual power efficiency by an order of magnitude compared with traditional CPUs and GPGPUs.

Journal ArticleDOI
TL;DR: A novel recommendation system using a collaborative filtering algorithm is implemented in Apache Hadoop, leveraging the MapReduce paradigm for big data and resulting in a significant improvement in performance compared to conventional tools.
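
The TL;DR above does not detail the algorithm, so purely as an illustration, here is an item co-occurrence step (a common building block of collaborative filtering) written in map/reduce style in plain Python; the user/item data is made up and this is not the system described in the paper.

    from collections import defaultdict
    from itertools import combinations, chain

    # Hypothetical (user, item) interactions.
    ratings = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C")]

    def map_user(user, items):
        # Map: for each user, emit every pair of items they both interacted with.
        for a, b in combinations(sorted(items), 2):
            yield ((a, b), 1)

    def reduce_pair(pair, counts):
        # Reduce: the co-occurrence count is a simple sum per item pair.
        return pair, sum(counts)

    by_user = defaultdict(list)
    for user, item in ratings:
        by_user[user].append(item)

    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_user(u, i) for u, i in by_user.items()):
        grouped[key].append(value)

    cooccurrence = dict(reduce_pair(k, v) for k, v in grouped.items())
    print(cooccurrence)   # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}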

Proceedings ArticleDOI
K. R. Krish1, Bharti Wadhwa1, M. Safdar Iqbal1, M. Mustafa Rafique2, Ali R. Butt1 
16 May 2016
TL;DR: DUX is presented, an application-attuned dynamic data management system for data processing frameworks, which aims to improve overall application I/O throughput by efficiently using SSDs only for workloads that are expected to benefit from them rather than the extant approach of storing a fraction of the overall workloads in SSDs.
Abstract: A promising trend in storage management for big data frameworks, such as Hadoop and Spark, is the emergence of heterogeneous and hybrid storage systems that employ different types of storage devices, e.g. SSDs, RAMDisks, etc., alongside traditional HDDs. However, scheduling data accesses or requests to an appropriate storage device is non-trivial and depends on several factors such as data locality, device performance, and application compute and storage resources utilization. To this end, we present Dux, an application-attuned dynamic data management system for data processing frameworks, which aims to improve overall application I/O throughput by efficiently using SSDs only for workloads that are expected to benefit from them rather than the extant approach of storing a fraction of the overall workloads in SSDs. The novelty of Dux lies in profiling application performance on SSDs and HDDs, analyzing the resulting I/O behavior, and considering the available SSDs at runtime to dynamically place data in an appropriate storage tier. Evaluation of Dux with trace-driven simulations using synthetic Facebook workloads shows that even when using 5.5× fewer SSDs compared to a SSD-only solution, Dux incurs only a small (5%) performance overhead, and thus offers an affordable and efficient storage tier management.
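
The sketch below captures the general idea of profile-driven tier placement: given measured throughput of a workload on SSD and HDD and the SSD capacity currently available, place data on the SSD tier only when the profiled speedup justifies it. The profiling numbers, threshold, and greedy ranking are illustrative assumptions, not Dux's actual policy.

    # Hypothetical profiles: workload -> (throughput on SSD, on HDD), in MB/s.
    profiles = {
        "sort":      (450.0, 120.0),   # I/O-bound: big benefit from SSD
        "wordcount": (140.0, 115.0),   # CPU-bound: little benefit
        "join":      (380.0, 100.0),
    }
    dataset_size_gb = {"sort": 200, "wordcount": 50, "join": 300}

    def place(profiles, sizes, ssd_capacity_gb, min_speedup=2.0):
        # Greedily place the workloads that profit most from SSDs while
        # capacity remains; everything else stays on HDD.
        placement, free = {}, ssd_capacity_gb
        ranked = sorted(profiles, key=lambda w: profiles[w][0] / profiles[w][1],
                        reverse=True)
        for w in ranked:
            speedup = profiles[w][0] / profiles[w][1]
            if speedup >= min_speedup and sizes[w] <= free:
                placement[w], free = "ssd", free - sizes[w]
            else:
                placement[w] = "hdd"
        return placement

    print(place(profiles, dataset_size_gb, ssd_capacity_gb=250))
    # {'join': 'hdd', 'sort': 'ssd', 'wordcount': 'hdd'}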

Journal ArticleDOI
TL;DR: This work proposes ActiveSort, a novel mechanism to improve the external sorting algorithm using the concept of active SSDs, which reduces the amount of I/O transfer and improves the performance of external sorting in Hadoop.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper presents cloud computing, cloud computing architecture, virtualization, load balancing, challenges and various currently available load balancing algorithms for cloud computing.
Abstract: Cloud computing is the latest emerging technology for large-scale distributed and parallel computing. It provides a large pool of shared resources, software packages, information, storage and many different applications on demand, at any instant of time. Cloud computing is evolving quickly, and a large number of users are attracted to cloud services. Balancing the load has become an interesting research area in this field. A better load-balancing algorithm in a cloud system increases performance and resource utilization by dynamically distributing the workload among the various nodes in the system. This paper presents cloud computing, cloud computing architecture, virtualization, load balancing, its challenges and various currently available load-balancing algorithms.

Proceedings ArticleDOI
20 Jul 2016
TL;DR: The design and the implementation details of the framework that supports three different models for parallel GAs, namely the global model, the grid model and the island model are described and a complete example of use is provided.
Abstract: elephant56 is an open source framework for the development and execution of single and parallel Genetic Algorithms (GAs). It provides high level functionalities that can be reused by developers, who no longer need to worry about complex internal structures. In particular, it offers the possibility of distributing the GAs computation over a Hadoop MapReduce cluster of multiple computers. In this paper we describe the design and the implementation details of the framework that supports three different models for parallel GAs, namely the global model, the grid model and the island model. Moreover, we provide a complete example of use.
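
To give a flavour of one of the three models named above, the sketch below runs an island-model GA on a toy problem (maximising the number of ones in a bit string): several sub-populations evolve independently and periodically exchange their best individuals. It is a plain-Python illustration of the model, not elephant56's API or its MapReduce mapping.

    import random

    def fitness(ind):                 # toy objective: count of 1-bits
        return sum(ind)

    def evolve(pop, generations=10, mutation=0.05):
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: len(pop) // 2]
            children = []
            while len(children) < len(pop):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(a))          # one-point crossover
                child = a[:cut] + b[cut:]
                child = [bit ^ (random.random() < mutation) for bit in child]
                children.append(child)
            pop = children
        return pop

    def island_model(islands=4, pop_size=20, length=32, epochs=5):
        pops = [[[random.randint(0, 1) for _ in range(length)]
                 for _ in range(pop_size)] for _ in range(islands)]
        for _ in range(epochs):
            pops = [evolve(p) for p in pops]               # independent evolution
            # Migration: each island sends its best individual to the next one,
            # replacing that island's worst individual.
            best = [max(p, key=fitness) for p in pops]
            for i, p in enumerate(pops):
                p[p.index(min(p, key=fitness))] = best[(i - 1) % islands]
        return max((max(p, key=fitness) for p in pops), key=fitness)

    random.seed(0)
    print(fitness(island_model()))    # close to 32 after a few epochs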

Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper suggests a Big Data representation for grade analytics in an educational context using Hadoop MapReduce, a software framework for computing over large amounts of data.
Abstract: Big Data is a large dataset displaying the features of volume, velocity and variety in an OR relationship. Big Data as a large dataset is of no significance if it cannot be subjected to strategic analysis and utilization. There are many software and hardware solutions available in the technological landscape that enable capturing, storing and subsequently analyzing Big Data; Hadoop and its associated technologies are one of them. Hadoop is a software framework for computing over large amounts of data. It is made up of four main modules: Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Hadoop MapReduce divides a large problem into smaller sub-problems under the control of the JobTracker. This paper suggests a Big Data representation for grade analytics in an educational context. The study and the experiments can be implemented in R or on AWS, the cloud infrastructure provided by Amazon.
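
As a concrete illustration of what grade analytics looks like in the MapReduce style described above, the sketch below computes the average grade per course from (student, course, grade) records in plain Python; the records are made up and this shows only the programming model, not the paper's implementation.

    from collections import defaultdict

    # Hypothetical (student, course, grade) records.
    records = [("s1", "math", 78), ("s2", "math", 91),
               ("s1", "physics", 65), ("s3", "math", 84)]

    def map_record(student, course, grade):
        # Map: key by course, emit the grade as the value.
        yield (course, grade)

    def reduce_course(course, grades):
        # Reduce: aggregate all grades of one course into an average.
        return course, sum(grades) / len(grades)

    grouped = defaultdict(list)
    for record in records:
        for course, grade in map_record(*record):
            grouped[course].append(grade)

    averages = dict(reduce_course(c, g) for c, g in grouped.items())
    print(averages)   # {'math': 84.33..., 'physics': 65.0}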

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper proposes near-join, a network-aware redistribution approach that aims to efficiently reduce both the network traffic and the communication time of join executions; it is lightweight and adaptable to processing large datasets over large systems.
Abstract: The performance of parallel data analytics systems becomes increasingly important with the rise of Big Data. An essential operation in such environments is the parallel join, which always incurs significant network communication cost. State-of-the-art approaches have achieved performance improvements over conventional implementations by minimizing network traffic or communication time. However, these approaches still face performance issues in the presence of big data and/or large-scale systems, due to their heavy overhead of data redistribution scheduling. In this paper, we propose near-join, a network-aware redistribution approach that aims to efficiently reduce both the network traffic and the communication time of join executions. In particular, near-join is lightweight and adaptable to processing large datasets over large systems. We present the details of our algorithm and its implementation. Experiments performed on a cluster of up to 400 nodes and datasets of about 100GB demonstrate that our scheduling algorithm is much faster than state-of-the-art methods. Moreover, our join implementation can also achieve speedups over the conventional approaches.
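
For context, the baseline that such redistribution schemes improve on is the conventional hash-repartition join, sketched below: both relations are partitioned by the hash of the join key so that matching tuples land on the same worker, which then performs a local join. This is the textbook baseline, not the near-join algorithm itself.

    from collections import defaultdict

    WORKERS = 3

    def repartition(relation, key_index):
        # Redistribute tuples so that equal join keys map to the same worker.
        parts = defaultdict(list)
        for row in relation:
            parts[hash(row[key_index]) % WORKERS].append(row)
        return parts

    def local_join(r_part, s_part):
        # Classic hash join executed independently on each worker.
        index = defaultdict(list)
        for r in r_part:
            index[r[0]].append(r)
        return [r + s[1:] for s in s_part for r in index.get(s[0], [])]

    R = [(1, "alice"), (2, "bob"), (3, "carol")]          # (key, payload)
    S = [(1, "eng"), (3, "math"), (3, "cs"), (4, "bio")]  # (key, payload)

    r_parts, s_parts = repartition(R, 0), repartition(S, 0)
    result = [row for w in range(WORKERS)
              for row in local_join(r_parts[w], s_parts[w])]
    print(sorted(result))
    # [(1, 'alice', 'eng'), (3, 'carol', 'cs'), (3, 'carol', 'math')]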

Journal ArticleDOI
TL;DR: A new scheme is introduced to aid the scheduler in identifying the nodes on which stragglers can be executed; it makes use of the resource utilization and network information of cluster nodes to find the most suitable node for scheduling the speculative copy of a slow task.

Journal ArticleDOI
TL;DR: The authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop, and illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure.
Abstract: The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions, and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and the possible issues for improvement, which could be challenging research opportunities for academic researchers.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: The authors draw a vision of a comprehensive distributed computing system, show where existing frameworks fall short in dealing with the heterogeneity of distributed computing, and present the Tasklet system, an approach to a distributed computing framework that tackles the different dimensions of heterogeneity.
Abstract: Distributed computing is a good alternative to expensive supercomputers. There are plenty of frameworks that enable programmers to harvest remote computing power. However, until today, much computation power at the edges of the Internet remains unused. While idle devices could contribute to a distributed environment as generic computation resources, computation-intensive applications could use this pool of resources to enhance their execution quality. In this paper, we identify heterogeneity as a major burden for distributed and edge computing. Heterogeneity is present in multiple forms. We draw our vision of a comprehensive distributed computing system and show where existing frameworks fall short in dealing with the heterogeneity of distributed computing. Afterwards, we present the Tasklet system, our approach to a distributed computing framework. Tasklets are fine-grained computation units that can be issued for remote and local execution. We tackle the different dimensions of heterogeneity and show how to make use of available computation power in edge resources. In our prototype, we use middleware and virtualization technologies as well as a host language concept.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper presents a survey of previous work on Hadoop MapReduce scheduling and gives some suggestions for its improvement.
Abstract: The big data computing era has become a fact of daily life. As data-intensive computing becomes a reality in many scientific fields, finding an efficient strategy for massive data computing systems has become a multi-objective improvement problem. Processing such huge data on distributed hardware clusters such as clouds requires a powerful computation model like Hadoop MapReduce. In this paper, we study various schedulers developed for Hadoop in cloud environments, along with their features and issues. Most existing studies consider performance improvement from a single point of view (scheduling, data locality, correctness of the data, etc.), but very little of the literature involves multi-objective improvements (quality requirements, scheduling entities, and dynamic environment adaptation), especially in heterogeneous parallel and distributed systems. Hadoop and MapReduce are two important aspects of big data for handling structured and unstructured data. The creation of an algorithm for node selection is essential to improve and optimize the performance of MapReduce. This paper presents a survey of previous work on Hadoop MapReduce scheduling and gives some suggestions for its improvement.

Proceedings ArticleDOI
06 Dec 2016
TL;DR: This work developed a Hierarchical Hadoop Framework (H2F) specifically designed to work on geo-distributed data and compares the performance of H2F with that of a plain Hadoop implementation.
Abstract: Big data analysis requires adequate infrastructure and programming paradigms capable of processing large amounts of data. Hadoop, the best-known open-source implementation of the MapReduce paradigm, is widely employed in big data analysis frameworks. However, in many recent application scenarios data are natively distributed over different geographic regions, in data centers that are inter-connected through network links with much lower bandwidth than those of the computing environments where Hadoop deployments are traditionally supposed to work. In such a context, Hadoop applications perform very poorly. To cope with these issues, we developed a Hierarchical Hadoop Framework (H2F) specifically designed to work on geo-distributed data. In this work, we compare the performance of H2F with that of a plain Hadoop implementation. First results show that, for very large amounts of data, the H2F solution performs better than plain Hadoop.
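
A hierarchical arrangement of MapReduce can be pictured as the two-level aggregation below: each site reduces its local data first, and only the small per-site partial results cross the slow inter-site links to a global reduce. This plain-Python sketch illustrates that idea only and is not the H2F implementation.

    from collections import Counter

    # Hypothetical data natively stored at three geographic sites.
    sites = {
        "eu":   ["error warn info info", "warn warn"],
        "us":   ["info error", "error error info"],
        "asia": ["warn info"],
    }

    def site_job(lines):
        # Top-level "map": run a complete local word count inside one site,
        # so only the compact partial result travels over the WAN.
        local = Counter()
        for line in lines:
            local.update(line.split())
        return local

    def global_reduce(partials):
        # Top-level "reduce": merge the per-site partial counts.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    partials = [site_job(lines) for lines in sites.values()]
    print(dict(global_reduce(partials)))
    # {'error': 4, 'warn': 4, 'info': 5}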

Journal ArticleDOI
TL;DR: Numerical experiments show that, compared to before optimization, the optimization algorithm can resolve the load imbalance of METGRID, and the computation speed of the METGRID and REAL modules after optimization on 64 CPU cores is about 7.2 times faster than before.
Abstract: Load imbalance is a common problem that needs to be tackled urgently in large-scale data-driven simulation systems and data-intensive computing. Through the coupler, the Chinese Academy of Sciences Earth System Model (CAS-ESM) implements one-way nesting of the Institute of Atmospheric Physics of the Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 (IAP AGCM4.0) and the Weather Research and Forecasting (WRF) model. The METGRID (meteorological grid) and REAL program modules in WRF are used to process meteorological data. In the CAS-ESM, the load of the METGRID module is seriously unbalanced across many CPU cores. The load imbalance has a serious impact on the processing speed of meteorological data, so this study designs an optimization algorithm to solve the problem. Numerical experiments show that, compared to before optimization, the optimization algorithm can resolve the load imbalance of METGRID, and the computation speed of the METGRID and REAL modules after optimization on 64 CPU cores is about 7.2 times faster than before. Meanwhile, the overall computation speed of the CAS-ESM improves by 217.53%. In addition, results indicate that a similar speedup can be reached on different numbers of CPU cores. Copyright © 2016 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A software framework for individual virtual machines to execute a MapReduce application in a parallel/collaborative way without the necessity of installing a middleware or specific software package for system management is developed.
Abstract: Cloud Computing introduces a novel computing paradigm that allows users to run their applications in a customized environment using on-demand resources. This novel computing concept is enabled by several technologies including the Web, virtualization, distributed file systems and parallel programming models. For parallel computing on the Cloud, MapReduce is currently the first choice for Cloud providers to deliver data analysis services, because this model is specially designed for data-intensive applications, while a Cloud centre is actually also a data centre hosting a huge amount of data, usually at petascale. The current deployment of MapReduce on the Cloud, however, follows the traditional execution model of MapReduce, which needs the support of a cluster manager. This means that the individual virtual machines created on the Cloud have to be organized into a cluster in order to be capable of running a MapReduce application. This is not only a burden for system management but also prohibits inter-Cloud computing, which can involve the resources of different Clouds to solve large problems with big or distributed data. We developed a software framework that enables individual virtual machines to execute a MapReduce application in a parallel/collaborative way without the need to install middleware or a specific software package for system management. A focus of this research work is a Single-Sign-On (SSON) mechanism that enables remote access to the individual machines. We validated the SSON mechanism together with the entire MapReduce framework using a private Cloud. Experimental results show both the functionality and the feasibility of our approach.

Proceedings Article
16 Mar 2016
TL;DR: An integrated approach is introduced to encrypt and decrypt data before sending it to the cloud; to achieve better performance and security, performance analysis of different techniques can be carried out based on different parameters.
Abstract: Big data refers to data that is too large and complex to be processed with standard tools. Big data handles voluminous amounts of structured, semi-structured and unstructured data, and is characterized by the volume, velocity and variety of the data. It combines historic data with present data to predict outcomes. In this regard, providing security for these data is a challenging task. Apache Hadoop is one of the tools designed to handle big data; Apache Hadoop, along with other software products, is used to process and interpret big data. Hadoop includes main components such as MapReduce and HDFS for handling big data. Cloud computing is the technology that provides online data storage, but here providing security is the key issue. In this paper, an integrated approach is introduced to encrypt and decrypt data before sending it to the cloud. To achieve better performance and security, performance analysis of different techniques can be carried out based on different parameters.
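
The abstract does not specify the cipher, so as a generic illustration of "encrypt before sending to the cloud", here is a sketch using symmetric (Fernet, AES-based) encryption from the widely used Python cryptography package; the sample data is a placeholder and this is not the integrated approach proposed in the paper.

    from cryptography.fernet import Fernet   # pip install cryptography

    # The key must stay with the data owner; only ciphertext goes to the cloud.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    plaintext = b"student records destined for cloud storage"  # placeholder data

    token = cipher.encrypt(plaintext)       # upload `token` to the cloud store
    # ... later, after downloading the ciphertext back ...
    restored = cipher.decrypt(token)

    assert restored == plaintext
    print(len(plaintext), "bytes in,", len(token), "bytes of ciphertext out")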