Journal Article

Hadoop GIS: a high performance spatial data warehousing system over MapReduce

01 Aug 2013, Vol. 6, Iss. 11, pp. 1009-1020
TL;DR: This paper presents Hadoop-GIS, a scalable and high performance spatial data warehousing system that runs large scale spatial queries on Hadoop and is integrated into Hive to support declarative spatial queries.
Abstract: Support of high performance queries on large volumes of spatial data is becoming increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS, a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, a customizable spatial query engine (RESQUE), implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have shown that the performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of libraries for processing spatial queries, and as an integrated software package in Hive.
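
The abstract names spatial partitioning and boundary-object handling without giving details here. As a rough illustration only, the Python sketch below assumes a fixed grid, assigns each object's bounding rectangle to every tile it touches, joins tile by tile, and removes the duplicate pairs that boundary objects produce. The function names, the tile size, and the multiple-assignment strategy are assumptions made for this example, not the Hadoop-GIS or RESQUE API.

```python
from collections import defaultdict
from itertools import product


def tiles_for(mbr, tile_size):
    """Return the (col, row) ids of every grid tile that a rectangle touches.

    mbr is (xmin, ymin, xmax, ymax). An object that crosses a tile boundary
    is returned for several tiles (assumed "multiple assignment" strategy).
    """
    xmin, ymin, xmax, ymax = mbr
    cols = range(int(xmin // tile_size), int(xmax // tile_size) + 1)
    rows = range(int(ymin // tile_size), int(ymax // tile_size) + 1)
    return list(product(cols, rows))


def rectangles_intersect(a, b):
    """True if two (xmin, ymin, xmax, ymax) rectangles share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition(objects, tile_size):
    """Group object ids by tile; objects is {id: mbr}."""
    tiles = defaultdict(list)
    for oid, mbr in objects.items():
        for tile in tiles_for(mbr, tile_size):
            tiles[tile].append(oid)
    return tiles


def partitioned_join(left, right, tile_size):
    """Join two datasets tile by tile, as independent tasks could.

    Boundary objects appear in several tiles, so the same pair can be found
    more than once; collecting pairs in a set removes those duplicates.
    """
    left_tiles = partition(left, tile_size)
    right_tiles = partition(right, tile_size)
    results = set()
    for tile in left_tiles.keys() & right_tiles.keys():
        for l_id in left_tiles[tile]:
            for r_id in right_tiles[tile]:
                if rectangles_intersect(left[l_id], right[r_id]):
                    results.add((l_id, r_id))
    return results
```

In this sketch each tile could be handed to a separate worker, and the final set plays the role of the result-amending step for objects that straddle tile boundaries.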


Citations
Journal Article
TL;DR: This review introduces future innovations and a research agenda for cloud computing supporting the transformation of the volume, velocity, variety and veracity into values of Big Data for local to global digital earth science and applications.
Abstract: Big Data has emerged in the past few years as a new paradigm providing abundant data and opportunities to improve and/or enable research and decision-support applications with unprecedented value for digital earth applications including business, sciences and engineering. At the same time, Big Data presents challenges for digital earth to store, transport, process, mine and serve the data. Cloud computing provides fundamental support to address the challenges with shared computing resources including computing, storage, networking and analytical software; the application of these resources has fostered impressive Big Data advancements. This paper surveys the two frontiers – Big Data and cloud computing – and reviews the advantages and consequences of utilizing cloud computing to tackle Big Data in the digital earth and relevant science domains. From the aspects of a general introduction, sources, challenges, technology status and research opportunities, the following observations are offered: (i...

545 citations


Cites methods from "Hadoop GIS: a high performance spat..."

  • ...Search, query, indexing and data model design: Performance is critical in the Big Data era, and accurately and quickly locating data requires a new generation of search engines and query systems (Miyano and Uehara 2012; Aji et al. 2013)....

    [...]

Proceedings Article
13 Apr 2015
TL;DR: SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness in each Hadoop layer, namely, the language, storage, MapReduce, and operations layers, with orders of magnitude better performance than Hadoop for spatial data processing.
Abstract: This paper describes SpatialHadoop, a full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness in each Hadoop layer, namely, the language, storage, MapReduce, and operations layers. In the language layer, SpatialHadoop adds a simple and expressive high level language for spatial data types and operations. In the storage layer, SpatialHadoop adapts traditional spatial index structures, Grid, R-tree and R+-tree, to form a two-level spatial index. SpatialHadoop enriches the MapReduce layer by two new components, SpatialFileSplitter and SpatialRecordReader, for efficient and scalable spatial data processing. In the operations layer, SpatialHadoop is already equipped with a dozen operations, including range query, kNN, and spatial join. Other spatial operations are also implemented following a similar approach. Extensive experiments on a real system prototype and real datasets show that SpatialHadoop achieves orders of magnitude better performance than Hadoop for spatial data processing.

475 citations


Cites result from "Hadoop GIS: a high performance spat..."

  • ...Similar to Hadoop, a SpatialHadoop cluster contains one master node that breaks a map-reduce job into smaller tasks, carried out by slave nodes....

    [...]

Journal Article
TL;DR: A brief overview of Big Data and data-intensive problems is given, including the analysis of RS Big Data, Big Data challenges, and current techniques and works for processing RS Big Data.

460 citations


Cites methods from "Hadoop GIS: a high performance spat..."

  • ...In addition, the Hadoop–GIS [57] system for large-scale spatial data processing, search and accessing is also built upon the Hadoop system....

    [...]

Proceedings Article
03 Nov 2015
TL;DR: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data that achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).
Abstract: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).

332 citations

Proceedings Article
14 Jun 2016
TL;DR: Simba is a scalable and efficient in-memory spatial query processing and analytics system for big spatial data that extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API.
Abstract: Large spatial data has become ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS). Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba's superior performance compared against other spatial analytics systems.

228 citations


Cites background from "Hadoop GIS: a high performance spat..."

  • ...Hadoop GIS [11] is a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop....

    [...]

  • ...For join operations (using 3 million records in each table), as shown in Figure 11, Simba runs distance join 1.5x faster than SpatialSpark, 25x faster than Hadoop GIS, and 26x faster than DBMS X. Note that distance join over point objects is not natively supported in SpatialHadoop....

    [...]

  • ...To make the matter worse, if we want to retrieve (or do analyses over) the intersection of results from multiple kNN queries, more complex expressions such as nested sub-queries will be involved....

    [...]

  • ...Note that other spatial analytics systems (GeoSpark, SpatialSpark, SpatialHadoop, and Hadoop GIS) do not support more than two dimensions....

    [...]

  • ...For example, Simba builds its index (which uses R-tree for both local indexes and the global index) over 1 billion records (60GB in file size) in around 25 minutes, which is 2.5x faster than SpatialHadoop, 3x faster than SpatialSpark, 12x faster than Hadoop GIS, and 15x faster than Geomesa....

    [...]

References
Proceedings Article
01 May 1990
TL;DR: The R*-tree is designed to incorporate a combined optimization of area, margin and overlap of each enclosing rectangle in the directory, and it clearly outperforms the existing R-tree variants.
Abstract: The R-tree, one of the most popular access methods for rectangles, is based on the heuristic optimization of the area of the enclosing rectangle in each inner node. By running numerous experiments in a standardized testbed under highly varying data, queries and operations, we were able to design the R*-tree which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory. Using our standardized testbed in an exhaustive performance comparison, it turned out that the R*-tree clearly outperforms the existing R-tree variants: Guttman's linear and quadratic R-trees and Greene's variant of the R-tree. This superiority of the R*-tree holds for different types of queries and operations, such as map overlay, for both rectangles and multidimensional points in all experiments. From a practical point of view the R*-tree is very attractive because of the following two reasons: (1) it efficiently supports point and spatial data at the same time, and (2) its implementation cost is only slightly higher than that of other R-trees.

4,686 citations


"Hadoop GIS: a high performance spat..." refers methods in this paper

  • ...The spatial filtering component performs MBR based spatial join filtering with the two R*-Trees, and refinement on the spatial join condition is further performed on the polygon pairs through geometric computations....

    [...]

  • ...Bulk spatial index building is performed on each dataset to generate index files – here we use R*-Trees [12]....

    [...]
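
The filter-and-refine pattern described in the excerpts above can be sketched in Python as follows. This is a minimal illustration, not the paper's code: a nested loop over bounding rectangles stands in for the R*-tree traversal, the refinement is a plain segment-crossing plus containment test on simple polygons, and every function name is hypothetical.

```python
def mbr(poly):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _in_box(p, q, r):
    """True if r lies inside the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(p1, p2, p3, p4):
    o1, o2 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    o3, o4 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint lies on the other segment.
    return ((o1 == 0 and _in_box(p1, p2, p3)) or (o2 == 0 and _in_box(p1, p2, p4))
            or (o3 == 0 and _in_box(p3, p4, p1)) or (o4 == 0 and _in_box(p3, p4, p2)))


def point_in_polygon(pt, poly):
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def edges(poly):
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]


def polygons_intersect(a, b):
    """Exact test: some edges cross, or one polygon contains the other."""
    if any(segments_intersect(p, q, r, s) for p, q in edges(a) for r, s in edges(b)):
        return True
    return point_in_polygon(a[0], b) or point_in_polygon(b[0], a)


def mbr_filter(left, right):
    """Filter step: candidate pairs whose bounding rectangles overlap."""
    right_mbrs = {rid: mbr(poly) for rid, poly in right.items()}
    return [(lid, rid) for lid, lpoly in left.items() for rid in right
            if mbrs_overlap(mbr(lpoly), right_mbrs[rid])]


def refine(candidates, left, right):
    """Refinement step: keep candidates whose polygons really intersect."""
    return [(l, r) for l, r in candidates if polygons_intersect(left[l], right[r])]
```

The cheap rectangle test discards most pairs, so the costlier geometric computation runs only on the surviving candidates.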

Proceedings Article
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
09 Jun 2008
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; the accompanying system, Pig, is an open-source, Apache-incubator project available for general use.
Abstract: There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

2,058 citations


"Hadoop GIS: a high performance spat..." refers background in this paper

  • ...MapReduce systems with high-level declarative languages include Pig Latin/Pig [25, 19], SCOPE [17], and HiveQL/Hive [29]....

    [...]

Journal Article
01 Aug 2009
TL;DR: Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
Abstract: The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

1,785 citations


"Hadoop GIS: a high performance spat..." refers background in this paper

  • ...MapReduce systems with high-level declarative languages include Pig Latin/Pig [25, 19], SCOPE [17], and HiveQL/Hive [29]....

    [...]

  • ...Hive [29] is an open source MapReduce based query system that...

    [...]

  • ...Declarative query interfaces such as Hive [29], Pig [19], and Scope [17] have brought the large scale data analysis one step closer to the common users by providing high level, easy to use programming abstractions to MapReduce....

    [...]

Journal Article
Jeffrey Dean, Sanjay Ghemawat
TL;DR: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.
Abstract: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

1,293 citations


"Hadoop GIS: a high performance spat..." refers background in this paper

  • ...Comparisons of MapReduce and parallel databases for structured data are discussed in [29, 20, 30]....

    [...]

Proceedings Article
29 Jun 2009
TL;DR: A benchmark consisting of a collection of tasks that are run on an open source version of MR as well as on two parallel DBMSs shows a dramatic performance difference between the two paradigms.
Abstract: There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

1,188 citations


"Hadoop GIS: a high performance spat..." refers background or methods in this paper

  • ...Data loading speed is a major bottleneck for SDBMS based solutions [26], especially for...

    [...]

  • ...The high data loading overhead is another major bottleneck for SDBMS based solutions [26]....

    [...]

  • ...However, this approach is highly expensive on software licensing and dedicated hardware, and requires sophisticated tuning and maintenance efforts [26]....

    [...]

  • ...We have previously developed a parallel SDBMS based approach PAIS [30, 31, 7] based on DB2 DPF with reasonable scalability, but the approach is highly expensive on software license and hardware requirement [26], and requires sophisticated tuning and maintenance....

    [...]