Journal ArticleDOI

A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data

01 Aug 2013 - Vol. 6, Iss. 12, pp. 1230-1233
TL;DR: This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data and demonstrates a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap.
Abstract: This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that pushes spatial data inside the core functionality of Hadoop. SpatialHadoop runs existing Hadoop programs as is, yet, it achieves order(s) of magnitude better performance than Hadoop when dealing with spatial data. SpatialHadoop employs a simple spatial high level language, a two-level spatial index structure, basic spatial components built inside the MapReduce layer, and three basic spatial operations: range queries, k-NN queries, and spatial join. Other spatial operations can be similarly deployed in SpatialHadoop. We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively.
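The excerpts quoted later on this page note that the language reroutes the FILTER keyword, when used with the Overlaps predicate, to the range query operation. Patterned after the kNN statement quoted below, a hypothetical range query in that language might read as follows (the relation and attribute names are illustrative, and the exact FILTER syntax is an assumption based on the Pig Latin style the paper extends):

```
houses = LOAD 'houses' AS (id:int, loc:point);
in_range = FILTER houses BY Overlaps(loc, query_range);
```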


Citations
Journal ArticleDOI
TL;DR: This review introduces future innovations and a research agenda for cloud computing, supporting the transformation of the volume, velocity, variety and veracity of Big Data into value for local-to-global digital earth science and applications.
Abstract: Big Data has emerged in the past few years as a new paradigm providing abundant data and opportunities to improve and/or enable research and decision-support applications with unprecedented value for digital earth applications including business, sciences and engineering. At the same time, Big Data presents challenges for digital earth to store, transport, process, mine and serve the data. Cloud computing provides fundamental support to address the challenges with shared computing resources including computing, storage, networking and analytical software; the application of these resources has fostered impressive Big Data advancements. This paper surveys the two frontiers – Big Data and cloud computing – and reviews the advantages and consequences of utilizing cloud computing to tackling Big Data in the digital earth and relevant science domains. From the aspects of a general introduction, sources, challenges, technology status and research opportunities, the following observations are offered: (i...

545 citations

Journal ArticleDOI
Jae-Gil Lee, Minseo Kang
TL;DR: Several case studies are introduced to show the importance and benefits of geospatial big data analytics, including fuel and time savings, revenue increase, urban planning, and health care, along with new emerging platforms for sharing the collected data and for tracking human mobility via mobile devices.

339 citations

Proceedings ArticleDOI
03 Nov 2015
TL;DR: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data that achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
Abstract: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading/storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).

332 citations

Proceedings ArticleDOI
13 Apr 2015
TL;DR: The designs and implementations of two prototype systems that are ready for Cloud deployments are reported: SpatialSpark based on Apache Spark and ISP-MC based on Cloudera Impala, which support indexed spatial joins based on point-in-polygon test and point-to-polyline distance computation.
Abstract: The rapidly increasing amount of location data available in many applications has made it desirable to process their large-scale spatial queries in Cloud for performance and scalability. We report our designs and implementations of two prototype systems that are ready for Cloud deployments: SpatialSpark based on Apache Spark and ISP-MC based on Cloudera Impala. Both systems support indexed spatial joins based on point-in-polygon test and point-to-polyline distance computation. Experiments on the pickup locations of ∼170 million taxi trips in New York City and ∼10 million global species occurrences records have demonstrated both efficiency and scalability using Amazon EC2 clusters.
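Both prototypes hinge on a point-in-polygon test for their indexed spatial joins. A minimal ray-casting sketch in Python (not the systems' actual code; it ignores points lying exactly on an edge or vertex) looks like this:

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: count crossings of a rightward horizontal ray
    from pt with the polygon's edges; an odd count means inside."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Unit square as a list of vertices
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

In a distributed join, this predicate would be evaluated for each point against the candidate polygons its partition's index retrieves.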

180 citations


Cites background or methods from "A demonstration of SpatialHadoop: a..."

  • ...By providing a custom FileInputFormat based on the spatial layouts of the partitions of both sides, SpatialHadoop pairs up spatially overlapping partitions, which are subsequently assigned to map tasks for parallel and distributed execution....

  • ...Clearly, the implementation is much more concise than the equivalent implementations in both SpatialHadoop and HadoopGIS....

  • ...SpatialHadoop [4], HadoopGIS [5] and ESRI Spatial Framework for Hadoop11 are three open source systems that are designed to process largescale spatial data on Hadoop....

  • ...In contrast, alternative techniques, such as SpatialHadoop [4] and HadoopGIS [5], aim at utilizing existing mature Cloud computing techniques and tools (Hadoop/MapReduce in particular) and adapt traditional serial designs and implementations for easy parallelization and Cloud deployment....

  • ...In SpatialHadoop, both sides in a spatial join are partitioned and spatial join is implemented as a map-only job....

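The pairing of spatially overlapping partitions described in these excerpts can be sketched in Python. This toy version does a quadratic all-pairs MBR check, whereas SpatialHadoop's custom FileInputFormat exploits the partitions' spatial layout; it is illustrative only:

```python
def mbr_overlaps(a, b):
    """Axis-aligned MBRs given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def pair_partitions(left_mbrs, right_mbrs):
    """Return index pairs of partitions whose MBRs overlap; in the
    map-only spatial join, each such pair would feed one map task."""
    return [(i, j)
            for i, a in enumerate(left_mbrs)
            for j, b in enumerate(right_mbrs)
            if mbr_overlaps(a, b)]
```

Only the surviving pairs are shipped to map tasks; partitions with disjoint MBRs can be skipped entirely, which is where the partitioned layout pays off.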
Journal ArticleDOI
01 Aug 2015
TL;DR: This study describes seven alternative partitioning techniques and experimentally studies their effect on the quality of the generated index and the performance of range and spatial join queries to assist researchers in choosing a good spatial partitioning technique in distributed environments.
Abstract: SpatialHadoop is an extended MapReduce framework that supports global indexing that spatially partitions the data across machines, providing orders of magnitude speedup compared to traditional Hadoop. In this paper, we describe seven alternative partitioning techniques and experimentally study their effect on the quality of the generated index and the performance of range and spatial join queries. We found that using a 1% sample is enough to produce high quality partitions. Also, we found that the total area of partitions is a reasonable measure of the quality of indexes when running spatial join. This study will assist researchers in choosing a good spatial partitioning technique in distributed environments.

114 citations


Cites methods from "A demonstration of SpatialHadoop: a..."

  • ...SpatialHadoop [2, 3] provides a generic indexing algorithm which was used to implement grid, R-tree, and R+-tree based partitioning....

References
Proceedings ArticleDOI
01 Jun 1984
TL;DR: A dynamic index structure called an R-tree is described that meets the need for retrieving data items by spatial location; algorithms for searching and updating it are given, and it is concluded that the structure is useful for current database systems in spatial applications.
Abstract: In order to handle spatial data efficiently, as required in computer aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.
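The R-tree range search the abstract describes can be sketched in a few lines of Python, assuming axis-aligned (xmin, ymin, xmax, ymax) rectangles and a dict-based node layout invented here for illustration:

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, query):
    """Range search: descend only into entries whose bounding rectangle
    overlaps the query window; at a leaf, matching entries are answers."""
    hits = []
    for mbr, child in node["entries"]:
        if overlaps(mbr, query):
            if node["leaf"]:
                hits.append(child)  # child holds the data object id
            else:
                hits.extend(rtree_search(child, query))
    return hits

# A two-level toy tree: one root over two leaves
leaf1 = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((5, 5, 6, 6), "c")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf1), ((5, 5, 6, 6), leaf2)]}
```

The pruning step (skipping subtrees whose MBR misses the query) is what makes the structure pay off on large files; the insertion and node-splitting algorithms of the paper are omitted here.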

7,336 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...Following is an example that calculates the 100 nearest houses to the query point query_loc:
    houses = LOAD ’houses’ AS (id:int, loc:point);
    nearest_houses = KNN houses WITH_K=100 USING Distance(loc, query_loc);...

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

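The quoted KNN statement can be mimicked in plain Python to make its semantics concrete. This naive scan sorts every record by distance, whereas SpatialHadoop's operator prunes via its spatial index; the names mirror the excerpt and the sample data is made up:

```python
import math

houses = [(1, (0.0, 0.0)), (2, (3.0, 4.0)), (3, (1.0, 1.0))]  # (id, loc)
query_loc = (0.0, 0.0)

def knn(records, k, query):
    """Naive k-nearest-neighbors: order all records by Euclidean
    distance to the query point and keep the first k."""
    return sorted(records, key=lambda r: math.dist(r[1], query))[:k]

nearest_houses = knn(houses, 2, query_loc)
```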
Proceedings ArticleDOI
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
09 Jun 2008
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Abstract: There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse.We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

2,058 citations


"A demonstration of SpatialHadoop: a..." refers background in this paper

  • ...The language layer provides a simple high level SQL-like language that supports spatial data types and operations....

Proceedings ArticleDOI
01 Sep 1987
TL;DR: A variation to Guttman’s R-trees (R+-trees) that avoids overlapping rectangles in intermediate nodes of the tree is introduced, and analytical results indicate that R+-trees achieve up to 50% savings in disk accesses compared to an R-tree when searching files of thousands of rectangles.
Abstract: The problem of indexing multidimensional objects is considered. First, a classification of existing methods is given along with a discussion of the major issues involved in multidimensional data indexing. Second, a variation to Guttman’s R-trees (R+-trees) that avoids overlapping rectangles in intermediate nodes of the tree is introduced. Algorithms for searching, updating, initial packing and reorganization of the structure are discussed in detail. Finally, we provide analytical results indicating that R+-trees achieve up to 50% savings in disk accesses compared to an R-tree when searching files of thousands of rectangles.

1,481 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

  • ...In particular, SpatialHadoop language overrides the keywords FILTER and JOIN, when their parameters have spatial predicate(s), to perform range query and spatial join, respectively....

Journal ArticleDOI
TL;DR: This work discusses in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
Abstract: Traditional file structures that provide multikey access to records, for example, inverted files, are extensions of file structures originally designed for single-key access. They manifest various deficiencies in particular for multikey access to highly dynamic files. We study the dynamic aspects of file structures that treat all keys symmetrically, that is, file structures which avoid the distinction between primary and secondary keys. We start from a bitmap approach and treat the problem of file design as one of data compression of a large sparse matrix. This leads to the notions of a grid partition of the search space and of a grid directory, which are the keys to a dynamic file structure called the grid file. This file system adapts gracefully to its contents under insertions and deletions, and thus achieves an upper bound of two disk accesses for single record retrieval; it also handles range queries and partially specified queries efficiently. We discuss in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
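The grid-file lookup path the abstract describes (linear scales partition the search space into cells, a grid directory maps each cell to a bucket) can be caricatured in a few lines of Python. The real structure also handles bucket splits and cells sharing buckets, which this sketch omits; the dicts here stand in for the directory and the disk pages:

```python
import bisect

# Linear scales: sorted split points along each of the two keys
x_scale = [10, 20]   # columns: x < 10, 10 <= x < 20, x >= 20
y_scale = [100]      # rows:    y < 100, y >= 100

directory = {}  # grid directory: (column, row) cell -> bucket id
buckets = {}    # bucket id -> list of records (stands in for disk pages)

def cell(x, y):
    """Locate the grid cell of (x, y) via the linear scales."""
    return (bisect.bisect_right(x_scale, x), bisect.bisect_right(y_scale, y))

def insert(x, y, record):
    b = directory.setdefault(cell(x, y), len(buckets))
    buckets.setdefault(b, []).append(record)

def lookup(x, y):
    """One directory access plus one bucket access: the two 'disk
    accesses' the abstract bounds single-record retrieval by."""
    b = directory.get(cell(x, y))
    return buckets.get(b, [])
```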

1,222 citations


"A demonstration of SpatialHadoop: a..." refers methods in this paper

  • ...For example, when the FILTER keyword is used with the Overlaps predicate, SpatialHadoop reroutes its processing to the range query operation....

  • ...We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively....

Proceedings ArticleDOI
11 Apr 2011
TL;DR: This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
Abstract: MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs including linear algebra primitives that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation on three ML algorithms on varying data and cluster sizes.

342 citations


Additional excerpts

  • ..., machine learning [3], tera-byte sorting [9], and graph processing [1]....
