Journal ArticleDOI

G-Hadoop: MapReduce across distributed data centers for data-intensive computing

TL;DR: The design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters is presented.
About: This article is published in Future Generation Computer Systems. The article was published on 2013-03-01 and has received 319 citations to date. The article focuses on the topics: Data-intensive computing & Distributed design patterns.
Citations
Journal ArticleDOI
TL;DR: To detect and describe real-time urban emergency events, the 5W (What, Where, When, Who, and Why) model is proposed; results show the accuracy and efficiency of the proposed method.
Abstract: Crowdsourcing is a process of acquisition, integration, and analysis of big and heterogeneous data generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans. Nowadays, no country, community, or person is immune to urban emergency events. Detecting urban emergency events, e.g., fires, storms, and traffic jams, is of great importance for protecting human safety. Recently, social media feeds have rapidly emerged as a novel platform for providing and disseminating information that is often geographic. Content from social media usually includes references to urban emergency events occurring at, or affecting, specific locations. In this paper, the 5W (What, Where, When, Who, and Why) model is proposed to detect and describe real-time urban emergency events. Firstly, users of social media are set as the target of crowdsourcing. Secondly, spatial and temporal information from social media is extracted to detect events in real time. Thirdly, a GIS-based annotation of the detected urban emergency event is produced. The proposed method is evaluated with extensive case studies based on real urban emergency events. The results show the accuracy and efficiency of the proposed method.

206 citations

Journal ArticleDOI
TL;DR: A complete model for generating association relations between multimedia resources using the semantic link network model is proposed; evaluations show the proposed method can measure the semantic relatedness between Flickr images accurately and robustly.
Abstract: Recent research shows that multimedia resources in the wild are growing at a staggering rate. The rapidly increasing number of multimedia resources has created an urgent need to develop intelligent methods to organize and process them. In this paper, the semantic link network model is used for organizing multimedia resources. A complete model for generating association relations between multimedia resources using the semantic link network model is proposed. The definitions, modules, and mechanisms of the semantic link network are used in the proposed method. The integration of the semantic link network with multimedia resources provides a new prospect for organizing them by their semantics. The tags and the surrounding texts of multimedia resources are used to measure their semantic association. The hierarchical semantics of multimedia resources are defined by their annotated tags and surrounding texts, and the semantics of tags and surrounding texts are treated differently in the proposed framework. The modules of the semantic link network model are implemented to measure association relations. A real data set of 100 thousand images with social tags from Flickr is used in our experiments. Two evaluations, clustering and retrieval, are performed, and they show that the proposed method can measure the semantic relatedness between Flickr images accurately and robustly.

147 citations

Journal ArticleDOI
TL;DR: This paper integrates statistics, text mining, complex networks, and visualization to analyze all of the academic articles on one given theme, complex network(s), and provides a useful tool and process for achieving in-depth analysis and rapid understanding of the trends and relationships of articles from a holistic perspective.
Abstract: Keeping abreast of trends in articles and rapidly grasping a body of articles' key points and relationships from a holistic perspective is a new challenge in both literature research and text mining. As an important component, keywords present the core idea of an academic article. Usually, articles on a single theme or area share one or more keywords, and we can analyze the topological features and evolution of the article co-keyword network and the keyword co-occurrence network to realize an in-depth analysis of the articles. This paper seeks to integrate statistics, text mining, complex networks, and visualization to analyze all of the academic articles on one given theme, complex network(s). All 5944 “complex networks” articles that were published between 1990 and 2013 and are available on the Web of Science are extracted. Based on two-mode affiliation network theory, a new frontier of complex networks, we constructed two different networks: one takes the articles as nodes, the co-keyword relationships as edges, and the number of shared keywords as the edge weight (the article co-keyword network); the other takes the articles' keywords as nodes, the co-occurrence relationships as edges, and the number of simultaneous co-occurrences as the edge weight (the keyword co-occurrence network). An integrated method for analyzing the topological features and evolution of the article co-keyword network and the keyword co-occurrence network is proposed, and we also define a new function to measure the innovation coefficient of the articles at the annual level. This paper provides a useful tool and process for achieving in-depth analysis and rapid understanding of the trends and relationships of articles from a holistic perspective.
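
The keyword co-occurrence network described above is simple to reconstruct: every pair of keywords attached to the same article gets an edge, weighted by how many articles share that pair. The following sketch is an illustrative reconstruction in Python, not the authors' code, and the toy keyword lists are invented.

```python
from collections import Counter
from itertools import combinations

# Toy input: each article is represented by its list of keywords (invented data).
articles = {
    "A1": ["complex networks", "small world", "clustering"],
    "A2": ["complex networks", "clustering", "community detection"],
    "A3": ["small world", "complex networks"],
}

# Keyword co-occurrence network: nodes are keywords, and an edge (k1, k2) is
# weighted by the number of articles in which the two keywords appear together.
cooccurrence = Counter()
for keywords in articles.values():
    for k1, k2 in combinations(sorted(set(keywords)), 2):
        cooccurrence[(k1, k2)] += 1

for (k1, k2), weight in cooccurrence.most_common():
    print(f"{k1} -- {k2}: weight {weight}")
# e.g. "clustering -- complex networks: weight 2"
```

The article co-keyword network can be built symmetrically by taking article pairs as edges and counting how many keywords each pair of articles shares.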

147 citations

Journal ArticleDOI
TL;DR: It is demonstrated that Hadoop has evolved into a solid platform for processing large datasets, but the systematic review was still able to spot promising areas and suggest topics for future research within the framework.

141 citations


Cites methods from "G-Hadoop: MapReduce across distribu..."

  • ...Using the previously presented GFarm file system, Wang et al. (2013) have proposed G-Hadoop, a MapReduce framework that enables large-scale distributed computing across multiple clusters....

    [...]

  • ...…al. (2011), Ko et al. (2010), Mao et al. (2012), Park et al. (2012), Tang et al. (2012), He et al. (2012), Lama et al. (2012), Zhou et al. (2013), Wang et al. (2013), Ahmad et al. (2013) Zhang et al. (2011a, 2011b, 2012b), Hammoud et al. (2012), Khan and Hamlen (2012), Nguyen and Shi (2010),…...

    [...]

  • ...…Tan et al. (2012b), Tang et al. (2012), Tao et al. (2011), You et al. (2011), Costa et al. (2013), Kondikoppa et al. (2012), Lama et al. (2012), Wang et al. (2013) Storage & replication Wei et al. (2010), Eltabakh et al. (2011), Zhou et al. (2012a), Bajda-Pawlikowski et al. (2011), Goiri et…...

    [...]

  • ...…Tan et al. (2012b), Tang et al. (2012), Tao et al. (2011), You et al. (2011), Costa et al. (2013), Kondikoppa et al. (2012), Lama et al. (2012), Wang et al. (2013) Tian et al. (2009), Zhu and Chen (2011), Zhang et al. (2011b, 2011c, 2011d, 2012b), Verma et al. (2011), Guo et al. (2011), Kumar…...

    [...]

  • ...…al. (2011), Ko et al. (2010), Mao et al. (2012), Park et al. (2012), Tang et al. (2012), He et al. (2012), Lama et al. (2012), Zhou et al. (2013), Wang et al. (2013), Ahmad et al. (2013) Indexing Dong et al. (2010), Dittrich et al. (2010), An et al. (2010), Liao et al. (2010), Dittrich et al.…...

    [...]

Journal ArticleDOI
TL;DR: This paper develops realistic log-file analysis applications in both frameworks, performs SQL-type queries on real Apache Web Server log files, and proposes a power consumption model and a utilization-based cost estimation.

134 citations

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
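
The map/reduce contract described above is easiest to see with word count, the paper's canonical example. The sketch below is a minimal, single-process Python rendering of the model; the function names and the toy sequential driver are illustrative stand-ins for the distributed runtime, not Google's API.

```python
from collections import defaultdict

# User-supplied map function: one input record -> intermediate (key, value) pairs.
def map_fn(_doc_id, line):
    for word in line.split():
        yield word, 1

# User-supplied reduce function: one intermediate key plus all its values -> output pairs.
def reduce_fn(word, counts):
    yield word, sum(counts)

# Toy sequential driver standing in for the distributed runtime: it runs on one
# machine, but performs the same map -> group-by-key (shuffle) -> reduce
# sequence the paper describes.
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In the real system the user writes only the two functions; partitioning the input, the shuffle, scheduling, and failure handling are the runtime's job.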

20,309 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: The results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities are presented.
Abstract: This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
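
The restructuring mentioned above improves performance by clustering several short tasks into a single schedulable entity, so per-task scheduling and queueing overhead is paid once per cluster rather than once per task. The sketch below shows one simple level-based clustering policy as a hedged illustration of the idea; it is not Pegasus' implementation, and the toy task graph is invented.

```python
from collections import defaultdict
from functools import lru_cache

# Invented toy workflow: task -> list of tasks it depends on.
deps = {
    "extract": [],
    "t1": ["extract"], "t2": ["extract"], "t3": ["extract"], "t4": ["extract"],
    "merge": ["t1", "t2", "t3", "t4"],
}

@lru_cache(maxsize=None)
def level(task):
    # A task's level is one more than the deepest level among its parents.
    return 0 if not deps[task] else 1 + max(level(p) for p in deps[task])

def cluster_by_level(tasks, cluster_size):
    # Group tasks of the same level into clusters of at most `cluster_size`
    # tasks; each cluster would then be submitted as a single job, so the
    # per-task scheduling overhead is paid once per cluster.
    by_level = defaultdict(list)
    for task in tasks:
        by_level[level(task)].append(task)
    clusters = []
    for lvl in sorted(by_level):
        group = by_level[lvl]
        clusters += [group[i:i + cluster_size] for i in range(0, len(group), cluster_size)]
    return clusters

print(cluster_by_level(deps, cluster_size=2))
# [['extract'], ['t1', 't2'], ['t3', 't4'], ['merge']]
```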

1,324 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...Currently data-intensive workflow systems, such as DAGMan [6], Pegasus [7], Swift [8], Kepler [9], Virtual Workflow [10,11], Virtual Data System [12] and Taverna [13], are used for distributed data processing across multiple data centers....

    [...]

Proceedings ArticleDOI
C. Ranger, R. Raghuraman, A. Penmetsa, Gary Bradski, Christos Kozyrakis
10 Feb 2007
TL;DR: It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
Abstract: This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
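
On a shared-memory machine the same map/reduce contract can be driven by a local worker pool instead of a cluster. The sketch below uses Python's multiprocessing pool as a stand-in for the kind of runtime described above; it only illustrates the idea and is not Phoenix's C API, which manages thread creation, task scheduling, and fault handling itself.

```python
from collections import Counter
from multiprocessing import Pool

def map_fn(line):
    # Map phase: count the words in one input record.
    return Counter(line.split())

def reduce_fn(partials):
    # Reduce phase: merge the per-record partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the end"]
    # The pool plays the role a shared-memory MapReduce runtime plays: it
    # partitions the input across local workers and runs map tasks in parallel.
    with Pool(processes=4) as pool:
        partials = pool.map(map_fn, lines)
    print(reduce_fn(partials))  # Counter({'the': 3, 'quick': 1, ...})
```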

1,058 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...Other MapReduce implementations are available for various architectures, such as for CUDA [26], in a multicore architecture [27], in FPGA platforms [28], for a multiprocessor architecture [29], in a large-scale shared-memory system [30], in a large-scale cluster [31], in multiple virtual machines [32], in a....

    [...]

Proceedings ArticleDOI
08 Dec 2008
TL;DR: It is shown that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as the authors vary the number of computers used for a job.
Abstract: DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan. We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.
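
The central idea above is a sequential program built from side-effect-free dataset transformations that the system is free to re-plan and distribute. The sketch below is only a loose Python analogue of such a LINQ-style query (invented log records, local eager execution); DryadLINQ itself operates on strongly typed .NET datasets and compiles the query into a Dryad execution graph.

```python
from collections import Counter

# Toy log records (invented); in DryadLINQ these would be a distributed,
# strongly typed .NET dataset rather than a small in-memory list.
requests = [
    {"url": "/home",   "status": 200},
    {"url": "/search", "status": 500},
    {"url": "/home",   "status": 200},
    {"url": "/search", "status": 200},
]

# A LINQ-style query written as one side-effect-free expression:
# filter (where), project (select), group-and-count, then order by count.
# DryadLINQ would turn the equivalent expression into a distributed execution
# plan for Dryad; here it simply runs locally.
top_urls = Counter(
    r["url"] for r in requests if r["status"] == 200
).most_common()

print(top_urls)  # [('/home', 2), ('/search', 1)]
```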

873 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...There have been some successful paradigms and models for data intensive computing, for example, All-Pairs [20], Sector/Sphere [21], DryadLINQ [22], and Mortar [23]....

    [...]