Journal ArticleDOI

A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data

01 Aug 2013 - Vol. 6, Iss. 12, pp. 1230-1233
TL;DR: This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data and demonstrates a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap.
Abstract: This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that pushes spatial data inside the core functionality of Hadoop. SpatialHadoop runs existing Hadoop programs as is, yet, it achieves order(s) of magnitude better performance than Hadoop when dealing with spatial data. SpatialHadoop employs a simple spatial high level language, a two-level spatial index structure, basic spatial components built inside the MapReduce layer, and three basic spatial operations: range queries, k-NN queries, and spatial join. Other spatial operations can be similarly deployed in SpatialHadoop. We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively.
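The excerpts quoted later on this page note that the language reroutes the FILTER keyword, when used with the Overlaps predicate, to the range query operation. Patterned after the kNN statement quoted below, a hypothetical range query in that language might read as follows (the relation and attribute names are illustrative, and the exact FILTER syntax is an assumption based on the Pig Latin style the paper extends):

```
houses = LOAD 'houses' AS (id:int, loc:point);
in_range = FILTER houses BY Overlaps(loc, query_range);
```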


Citations
Journal ArticleDOI
TL;DR: This review introduces future innovations and a research agenda for cloud computing, supporting the transformation of the volume, velocity, variety and veracity of Big Data into value for local-to-global digital earth science and applications.
Abstract: Big Data has emerged in the past few years as a new paradigm providing abundant data and opportunities to improve and/or enable research and decision-support applications with unprecedented value for digital earth applications including business, sciences and engineering. At the same time, Big Data presents challenges for digital earth to store, transport, process, mine and serve the data. Cloud computing provides fundamental support to address the challenges with shared computing resources including computing, storage, networking and analytical software; the application of these resources has fostered impressive Big Data advancements. This paper surveys the two frontiers – Big Data and cloud computing – and reviews the advantages and consequences of utilizing cloud computing to tackling Big Data in the digital earth and relevant science domains. From the aspects of a general introduction, sources, challenges, technology status and research opportunities, the following observations are offered: (i...

545 citations

Journal ArticleDOI
Jae-Gil Lee, Minseo Kang
TL;DR: Several case studies are introduced to show the importance and benefits of geospatial big data analytics, including fuel and time savings, revenue increase, urban planning, and health care, along with new emerging platforms for sharing the collected data and for tracking human mobility via mobile devices.

339 citations

Proceedings ArticleDOI
03 Nov 2015
TL;DR: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data that achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
Abstract: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading/storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).

332 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: The designs and implementations of two prototype systems that are ready for Cloud deployments are reported: SpatialSpark based on Apache Spark and ISP-MC based on Cloudera Impala, which support indexed spatial joins based on point-in-polygon test and point-to-polyline distance computation.
Abstract: The rapidly increasing amount of location data available in many applications has made it desirable to process their large-scale spatial queries in Cloud for performance and scalability. We report our designs and implementations of two prototype systems that are ready for Cloud deployments: SpatialSpark based on Apache Spark and ISP-MC based on Cloudera Impala. Both systems support indexed spatial joins based on point-in-polygon test and point-to-polyline distance computation. Experiments on the pickup locations of ∼170 million taxi trips in New York City and ∼10 million global species occurrences records have demonstrated both efficiency and scalability using Amazon EC2 clusters.
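Both prototypes hinge on a point-in-polygon test for their indexed spatial joins. A minimal ray-casting sketch in Python (not the systems' actual code; it ignores points lying exactly on an edge or vertex) looks like this:

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: count crossings of a rightward horizontal ray
    from pt with the polygon's edges; an odd count means inside."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Unit square as a list of vertices
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

In a distributed join, this predicate would be evaluated for each point against the candidate polygons its partition's index retrieves.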

180 citations


Cites background or methods from "A demonstration of SpatialHadoop: a..."

  • ...By providing a custom FileInputFormat based on the spatial layouts of the partitions of both sides, SpatialHadoop pairs up spatially overlapping partitions, which are subsequently assigned to map tasks for parallel and distributed execution....

  • ...Clearly, the implementation is much more concise than the equivalent implementations in both SpatialHadoop and HadoopGIS....

  • ...SpatialHadoop [4], HadoopGIS [5] and ESRI Spatial Framework for Hadoop11 are three open source systems that are designed to process largescale spatial data on Hadoop....

  • ...In contrast, alternative techniques, such as SpatialHadoop [4] and HadoopGIS [5], aim at utilizing existing mature Cloud computing techniques and tools (Hadoop/MapReduce in particular) and adapt traditional serial designs and implementations for easy parallelization and Cloud deployment....

  • ...In SpatialHadoop, both sides in a spatial join are partitioned and spatial join is implemented as a map-only job....

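The pairing of spatially overlapping partitions described in these excerpts can be sketched in Python. This toy version does a quadratic all-pairs MBR check, whereas SpatialHadoop's custom FileInputFormat exploits the partitions' spatial layout; it is illustrative only:

```python
def mbr_overlaps(a, b):
    """Axis-aligned MBRs given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def pair_partitions(left_mbrs, right_mbrs):
    """Return index pairs of partitions whose MBRs overlap; in the
    map-only spatial join, each such pair would feed one map task."""
    return [(i, j)
            for i, a in enumerate(left_mbrs)
            for j, b in enumerate(right_mbrs)
            if mbr_overlaps(a, b)]
```

Only the surviving pairs are shipped to map tasks; partitions with disjoint MBRs can be skipped entirely, which is where the partitioned layout pays off.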
Journal ArticleDOI
01 Aug 2015
TL;DR: This study describes seven alternative partitioning techniques and experimentally studies their effect on the quality of the generated index and the performance of range and spatial join queries to assist researchers in choosing a good spatial partitioning technique in distributed environments.
Abstract: SpatialHadoop is an extended MapReduce framework that supports global indexing that spatially partitions the data across machines, providing orders of magnitude speedup compared to traditional Hadoop. In this paper, we describe seven alternative partitioning techniques and experimentally study their effect on the quality of the generated index and the performance of range and spatial join queries. We found that using a 1% sample is enough to produce high quality partitions. Also, we found that the total area of partitions is a reasonable measure of the quality of indexes when running spatial join. This study will assist researchers in choosing a good spatial partitioning technique in distributed environments.

114 citations


Cites methods from "A demonstration of SpatialHadoop: a..."

  • ...SpatialHadoop [2, 3] provides a generic indexing algorithm which was used to implement grid, R-tree, and R+-tree based partitioning....

References
Proceedings ArticleDOI
01 Jun 1984
TL;DR: A dynamic index structure called an R-tree is described that meets the need for retrieving data items by spatial location; algorithms for searching and updating it are given, and it is concluded that the structure is useful for current database systems in spatial applications.
Abstract: In order to handle spatial data efficiently, as required in computer aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.
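The R-tree range search the abstract describes can be sketched in a few lines of Python, assuming axis-aligned (xmin, ymin, xmax, ymax) rectangles and a dict-based node layout invented here for illustration:

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, query):
    """Range search: descend only into entries whose bounding rectangle
    overlaps the query window; at a leaf, matching entries are answers."""
    hits = []
    for mbr, child in node["entries"]:
        if overlaps(mbr, query):
            if node["leaf"]:
                hits.append(child)  # child holds the data object id
            else:
                hits.extend(rtree_search(child, query))
    return hits

# A two-level toy tree: one root over two leaves
leaf1 = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((5, 5, 6, 6), "c")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf1), ((5, 5, 6, 6), leaf2)]}
```

The pruning step (skipping subtrees whose MBR misses the query) is what makes the structure pay off on large files; the insertion and node-splitting algorithms of the paper are omitted here.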

7,336 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...Following is an example that calculates the 100 nearest houses to the query point query_loc:
    houses = LOAD ’houses’ AS (id:int, loc:point);
    nearest_houses = KNN houses WITH_K=100 USING Distance(loc, query_loc);...

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

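The quoted KNN statement can be mimicked in plain Python to make its semantics concrete. This naive scan sorts every record by distance, whereas SpatialHadoop's operator prunes via its spatial index; the names mirror the excerpt and the sample data is made up:

```python
import math

houses = [(1, (0.0, 0.0)), (2, (3.0, 4.0)), (3, (1.0, 1.0))]  # (id, loc)
query_loc = (0.0, 0.0)

def knn(records, k, query):
    """Naive k-nearest-neighbors: order all records by Euclidean
    distance to the query point and keep the first k."""
    return sorted(records, key=lambda r: math.dist(r[1], query))[:k]

nearest_houses = knn(houses, 2, query_loc)
```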
Proceedings ArticleDOI
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
09 Jun 2008
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Abstract: There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse.We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

2,058 citations


"A demonstration of SpatialHadoop: a..." refers background in this paper

  • ...The language layer provides a simple high level SQL-like language that supports spatial data types and operations....

Proceedings ArticleDOI
01 Sep 1987
TL;DR: A variation to Guttman’s R-trees (R+-trees) that avoids overlapping rectangles in intermediate nodes of the tree is introduced, and analytical results indicate that R+-trees achieve up to 50% savings in disk accesses compared to an R-tree when searching files of thousands of rectangles.
Abstract: The problem of indexing multidimensional objects is considered. First, a classification of existing methods is given along with a discussion of the major issues involved in multidimensional data indexing. Second, a variation to Guttman’s R-trees (R+-trees) that avoids overlapping rectangles in intermediate nodes of the tree is introduced. Algorithms for searching, updating, initial packing and reorganization of the structure are discussed in detail. Finally, we provide analytical results indicating that R+-trees achieve up to 50% savings in disk accesses compared to an R-tree when searching files of thousands of rectangles.

1,481 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

  • ...In particular, SpatialHadoop language overrides the keywords FILTER and JOIN, when their parameters have spatial predicate(s), to perform range query and spatial join, respectively....

Journal ArticleDOI
TL;DR: This work discusses in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
Abstract: Traditional file structures that provide multikey access to records, for example, inverted files, are extensions of file structures originally designed for single-key access. They manifest various deficiencies in particular for multikey access to highly dynamic files. We study the dynamic aspects of file structures that treat all keys symmetrically, that is, file structures which avoid the distinction between primary and secondary keys. We start from a bitmap approach and treat the problem of file design as one of data compression of a large sparse matrix. This leads to the notions of a grid partition of the search space and of a grid directory, which are the keys to a dynamic file structure called the grid file. This file system adapts gracefully to its contents under insertions and deletions, and thus achieves an upper bound of two disk accesses for single record retrieval; it also handles range queries and partially specified queries efficiently. We discuss in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
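The grid-file lookup path the abstract describes (linear scales partition the search space into cells, a grid directory maps each cell to a bucket) can be caricatured in a few lines of Python. The real structure also handles bucket splits and cells sharing buckets, which this sketch omits; the dicts here stand in for the directory and the disk pages:

```python
import bisect

# Linear scales: sorted split points along each of the two keys
x_scale = [10, 20]   # columns: x < 10, 10 <= x < 20, x >= 20
y_scale = [100]      # rows:    y < 100, y >= 100

directory = {}  # grid directory: (column, row) cell -> bucket id
buckets = {}    # bucket id -> list of records (stands in for disk pages)

def cell(x, y):
    """Locate the grid cell of (x, y) via the linear scales."""
    return (bisect.bisect_right(x_scale, x), bisect.bisect_right(y_scale, y))

def insert(x, y, record):
    b = directory.setdefault(cell(x, y), len(buckets))
    buckets.setdefault(b, []).append(record)

def lookup(x, y):
    """One directory access plus one bucket access: the two 'disk
    accesses' the abstract bounds single-record retrieval by."""
    b = directory.get(cell(x, y))
    return buckets.get(b, [])
```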

1,222 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

Proceedings ArticleDOI
11 Apr 2011
TL;DR: This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
Abstract: MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs including linear algebra primitives that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation on three ML algorithms on varying data and cluster sizes.

342 citations


Additional excerpts

  • ..., machine learning [3], tera-byte sorting [9], and graph processing [1]....
