Journal ArticleDOI

G-Hadoop: MapReduce across distributed data centers for data-intensive computing

TL;DR: The design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters is presented.
About: This article is published in Future Generation Computer Systems. The article was published on 2013-03-01 and has received 319 citations to date. The article focuses on the topics: Data-intensive computing & Distributed design patterns.
Citations
Journal ArticleDOI
TL;DR: To detect and describe real-time urban emergency events, the 5W (What, Where, When, Who, and Why) model is proposed; results show the accuracy and efficiency of the proposed method.
Abstract: Crowdsourcing is a process of acquisition, integration, and analysis of big and heterogeneous data generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans. Nowadays, no country, community, or person is immune to urban emergency events. Detecting urban emergency events, e.g., fires, storms, and traffic jams, is of great importance for protecting human safety. Recently, social media feeds have rapidly emerged as a novel platform for providing and disseminating information that is often geographic. Content from social media usually includes references to urban emergency events occurring at, or affecting, specific locations. In this paper, the 5W (What, Where, When, Who, and Why) model is proposed to detect and describe real-time urban emergency events. Firstly, users of social media are set as the target of crowdsourcing. Secondly, spatial and temporal information from social media is extracted to detect events in real time. Thirdly, a GIS-based annotation of the detected urban emergency event is produced. The proposed method is evaluated with extensive case studies based on real urban emergency events. The results show the accuracy and efficiency of the proposed method.

206 citations

Journal ArticleDOI
TL;DR: A complete model for generating association relations between multimedia resources using the semantic link network model is proposed; evaluations show the proposed method can measure the semantic relatedness between Flickr images accurately and robustly.
Abstract: Recent research shows that multimedia resources in the wild are growing at a staggering rate. The rapidly increasing number of multimedia resources has created an urgent need to develop intelligent methods to organize and process them. In this paper, the semantic link network model is used for organizing multimedia resources. A complete model for generating association relations between multimedia resources using the semantic link network model is proposed. The definitions, modules, and mechanisms of the semantic link network are used in the proposed method. The integration of the semantic link network with multimedia resources provides a new prospect for organizing them by their semantics. The tags and the surrounding texts of multimedia resources are used to measure their semantic association. The hierarchical semantics of multimedia resources are defined by their annotated tags and surrounding texts, and the semantics of tags and surrounding texts are treated differently in the proposed framework. The modules of the semantic link network model are implemented to measure association relations. A real data set of 100 thousand images with social tags from Flickr is used in our experiments. Two evaluations, clustering and retrieval, are performed, and they show that the proposed method can measure the semantic relatedness between Flickr images accurately and robustly.

147 citations

Journal ArticleDOI
TL;DR: This paper integrates statistics, text mining, complex networks, and visualization to analyze all of the academic articles on one given theme, complex network(s), and provides a useful tool and process for achieving in-depth analysis and rapid understanding of the trends and relationships of articles from a holistic perspective.
Abstract: Keeping abreast of trends in articles and rapidly grasping a body of articles' key points and relationships from a holistic perspective is a new challenge in both literature research and text mining. As an important component, keywords present the core idea of an academic article. Usually, articles on a single theme or area share one or more keywords, and we can analyze the topological features and evolution of the article co-keyword network and the keyword co-occurrence network to realize an in-depth analysis of the articles. This paper seeks to integrate statistics, text mining, complex networks, and visualization to analyze all of the academic articles on one given theme, complex network(s). All 5944 “complex networks” articles that were published between 1990 and 2013 and are available on the Web of Science are extracted. Based on two-mode affiliation network theory, a new frontier of complex networks, we constructed two different networks: one takes the articles as nodes, the co-keyword relationships as edges, and the number of shared keywords as the edge weight (the article co-keyword network); the other takes the articles' keywords as nodes, the co-occurrence relationships as edges, and the number of simultaneous co-occurrences as the edge weight (the keyword co-occurrence network). An integrated method for analyzing the topological features and evolution of the article co-keyword network and the keyword co-occurrence network is proposed, and we also define a new function to measure the innovation coefficient of the articles at the annual level. This paper provides a useful tool and process for achieving in-depth analysis and rapid understanding of the trends and relationships of articles from a holistic perspective.
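
The keyword co-occurrence network described above is simple to reconstruct: every pair of keywords attached to the same article gets an edge, weighted by how many articles share that pair. The following sketch is an illustrative reconstruction in Python, not the authors' code, and the toy keyword lists are invented.

```python
from collections import Counter
from itertools import combinations

# Toy input: each article is represented by its list of keywords (invented data).
articles = {
    "A1": ["complex networks", "small world", "clustering"],
    "A2": ["complex networks", "clustering", "community detection"],
    "A3": ["small world", "complex networks"],
}

# Keyword co-occurrence network: nodes are keywords, and an edge (k1, k2) is
# weighted by the number of articles in which the two keywords appear together.
cooccurrence = Counter()
for keywords in articles.values():
    for k1, k2 in combinations(sorted(set(keywords)), 2):
        cooccurrence[(k1, k2)] += 1

for (k1, k2), weight in cooccurrence.most_common():
    print(f"{k1} -- {k2}: weight {weight}")
# e.g. "clustering -- complex networks: weight 2"
```

The article co-keyword network can be built symmetrically by taking article pairs as edges and counting how many keywords each pair of articles shares.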

147 citations

Journal ArticleDOI
TL;DR: It is demonstrated that Hadoop has evolved into a solid platform for processing large datasets, but the systematic review was still able to spot promising areas and suggest topics for future research within the framework.

141 citations


Cites methods from "G-Hadoop: MapReduce across distribu..."

  • ...Using the previously presented GFarm file system, Wang et al. (2013) have proposed G-Hadoop, a MapReduce framework that enables large-scale distributed computing across multiple clusters....

    [...]

  • ...…al. (2011), Ko et al. (2010), Mao et al. (2012), Park et al. (2012), Tang et al. (2012), He et al. (2012), Lama et al. (2012), Zhou et al. (2013), Wang et al. (2013), Ahmad et al. (2013) Zhang et al. (2011a, 2011b, 2012b), Hammoud et al. (2012), Khan and Hamlen (2012), Nguyen and Shi (2010),…...

    [...]

  • ...…Tan et al. (2012b), Tang et al. (2012), Tao et al. (2011), You et al. (2011), Costa et al. (2013), Kondikoppa et al. (2012), Lama et al. (2012), Wang et al. (2013) Storage & replication Wei et al. (2010), Eltabakh et al. (2011), Zhou et al. (2012a), Bajda-Pawlikowski et al. (2011), Goiri et…...

    [...]

  • ...…Tan et al. (2012b), Tang et al. (2012), Tao et al. (2011), You et al. (2011), Costa et al. (2013), Kondikoppa et al. (2012), Lama et al. (2012), Wang et al. (2013) Tian et al. (2009), Zhu and Chen (2011), Zhang et al. (2011b, 2011c, 2011d, 2012b), Verma et al. (2011), Guo et al. (2011), Kumar…...

    [...]

  • ...…al. (2011), Ko et al. (2010), Mao et al. (2012), Park et al. (2012), Tang et al. (2012), He et al. (2012), Lama et al. (2012), Zhou et al. (2013), Wang et al. (2013), Ahmad et al. (2013) Indexing Dong et al. (2010), Dittrich et al. (2010), An et al. (2010), Liao et al. (2010), Dittrich et al.…...

    [...]

Journal ArticleDOI
TL;DR: This paper develops realistic log-file analysis applications in both frameworks, performs SQL-type queries on real Apache Web Server log files, and proposes a power consumption model and a utilization-based cost estimation.

134 citations

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
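
The map/reduce contract described above is easiest to see with word count, the paper's canonical example. The sketch below is a minimal, single-process Python rendering of the model; the function names and the toy sequential driver are illustrative stand-ins for the distributed runtime, not Google's API.

```python
from collections import defaultdict

# User-supplied map function: one input record -> intermediate (key, value) pairs.
def map_fn(_doc_id, line):
    for word in line.split():
        yield word, 1

# User-supplied reduce function: one intermediate key plus all its values -> output pairs.
def reduce_fn(word, counts):
    yield word, sum(counts)

# Toy sequential driver standing in for the distributed runtime: it runs on one
# machine, but performs the same map -> group-by-key (shuffle) -> reduce
# sequence the paper describes.
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In the real system the user writes only the two functions; partitioning the input, the shuffle, scheduling, and failure handling are the runtime's job.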

20,309 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: The results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities are presented.
Abstract: This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
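
The restructuring mentioned above improves performance by clustering several short tasks into a single schedulable entity, so per-task scheduling and queueing overhead is paid once per cluster rather than once per task. The sketch below shows one simple level-based clustering policy as a hedged illustration of the idea; it is not Pegasus' implementation, and the toy task graph is invented.

```python
from collections import defaultdict
from functools import lru_cache

# Invented toy workflow: task -> list of tasks it depends on.
deps = {
    "extract": [],
    "t1": ["extract"], "t2": ["extract"], "t3": ["extract"], "t4": ["extract"],
    "merge": ["t1", "t2", "t3", "t4"],
}

@lru_cache(maxsize=None)
def level(task):
    # A task's level is one more than the deepest level among its parents.
    return 0 if not deps[task] else 1 + max(level(p) for p in deps[task])

def cluster_by_level(tasks, cluster_size):
    # Group tasks of the same level into clusters of at most `cluster_size`
    # tasks; each cluster would then be submitted as a single job, so the
    # per-task scheduling overhead is paid once per cluster.
    by_level = defaultdict(list)
    for task in tasks:
        by_level[level(task)].append(task)
    clusters = []
    for lvl in sorted(by_level):
        group = by_level[lvl]
        clusters += [group[i:i + cluster_size] for i in range(0, len(group), cluster_size)]
    return clusters

print(cluster_by_level(deps, cluster_size=2))
# [['extract'], ['t1', 't2'], ['t3', 't4'], ['merge']]
```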

1,324 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...Currently data-intensive workflow systems, such as DAGMan [6], Pegasus [7], Swift [8], Kepler [9], Virtual Workflow [10,11], Virtual Data System [12] and Taverna [13], are used for distributed data processing across multiple data centers....

    [...]

Proceedings ArticleDOI
C. Ranger, R. Raghuraman, A. Penmetsa, Gary Bradski, Christos Kozyrakis
10 Feb 2007
TL;DR: It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
Abstract: This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
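
On a shared-memory machine the same map/reduce contract can be driven by a local worker pool instead of a cluster. The sketch below uses Python's multiprocessing pool as a stand-in for the kind of runtime described above; it only illustrates the idea and is not Phoenix's C API, which manages thread creation, task scheduling, and fault handling itself.

```python
from collections import Counter
from multiprocessing import Pool

def map_fn(line):
    # Map phase: count the words in one input record.
    return Counter(line.split())

def reduce_fn(partials):
    # Reduce phase: merge the per-record partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the end"]
    # The pool plays the role a shared-memory MapReduce runtime plays: it
    # partitions the input across local workers and runs map tasks in parallel.
    with Pool(processes=4) as pool:
        partials = pool.map(map_fn, lines)
    print(reduce_fn(partials))  # Counter({'the': 3, 'quick': 1, ...})
```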

1,058 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...Other MapReduce implementations are available for various architectures, such as for CUDA [26], in a multicore architecture [27], in FPGA platforms [28], for a multiprocessor architecture [29], in a large-scale shared-memory system [30], in a large-scale cluster [31], in multiple virtual machines [32], in a....

    [...]

Proceedings ArticleDOI
08 Dec 2008
TL;DR: It is shown that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as the authors vary the number of computers used for a job.
Abstract: DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan. We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.
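
The central idea above is a sequential program built from side-effect-free dataset transformations that the system is free to re-plan and distribute. The sketch below is only a loose Python analogue of such a LINQ-style query (invented log records, local eager execution); DryadLINQ itself operates on strongly typed .NET datasets and compiles the query into a Dryad execution graph.

```python
from collections import Counter

# Toy log records (invented); in DryadLINQ these would be a distributed,
# strongly typed .NET dataset rather than a small in-memory list.
requests = [
    {"url": "/home",   "status": 200},
    {"url": "/search", "status": 500},
    {"url": "/home",   "status": 200},
    {"url": "/search", "status": 200},
]

# A LINQ-style query written as one side-effect-free expression:
# filter (where), project (select), group-and-count, then order by count.
# DryadLINQ would turn the equivalent expression into a distributed execution
# plan for Dryad; here it simply runs locally.
top_urls = Counter(
    r["url"] for r in requests if r["status"] == 200
).most_common()

print(top_urls)  # [('/home', 2), ('/search', 1)]
```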

873 citations


"G-Hadoop: MapReduce across distribu..." refers methods in this paper

  • ...There have been some successful paradigms and models for data intensive computing, for example, All-Pairs [20], Sector/Sphere [21], DryadLINQ [22], and Mortar [23]....

    [...]