Journal ArticleDOI

ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]

TLDR
ScalaGiST, a scalable generalized search tree that can be seamlessly integrated with Hadoop, is presented together with a cost-based data access optimizer for efficient query processing at run-time.
Abstract
MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it was initially designed to operate on raw data without utilizing any type of index. To alleviate the problem, we present ScalaGiST - a scalable generalized search tree that can be seamlessly integrated with Hadoop, together with a cost-based data access optimizer for efficient query processing at run-time. ScalaGiST provides extensibility in terms of data and query types, hence is able to support unconventional queries (e.g., multi-dimensional range and k-NN queries) in MapReduce systems, and can be dynamically deployed in large cluster environments for handling large numbers of users and large volumes of data. We have built ScalaGiST and demonstrated that it can be easily instantiated as common B+-tree and R-tree indexes, yet for dynamic distributed environments. Our extensive performance study shows that ScalaGiST provides efficient write and read performance, elastic scaling, and effective support for MapReduce execution of ad-hoc analytic queries. Performance comparisons with recent proposals of specialized distributed index structures, such as SpatialHadoop, Data Mapping, and RT-CAN, further confirm its efficiency.
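The abstract describes ScalaGiST as a GiST-style framework: the same index machinery supports both one-dimensional (B+-tree-like) and multi-dimensional (R-tree-like) predicates, and a cost-based optimizer decides at run-time whether a MapReduce job should use the index. As a rough sketch of that extensibility idea only, the Java fragment below shows a minimal GiST-style node whose search is driven by a user-supplied predicate type; all names (IndexPredicate, GistNode, KeyRange, BoundingBox) are hypothetical and are not the actual ScalaGiST API.

```java
// Minimal, hypothetical sketch of a GiST-style extensible index,
// illustrating the extensibility idea behind ScalaGiST (not its real API).
import java.util.ArrayList;
import java.util.List;

/** A predicate stored in an index node; concrete types decide what "consistent" means. */
interface IndexPredicate<Q> {
    boolean consistent(Q query);   // may the subtree (or record) match the query?
}

/** A 1-D key range: instantiating the framework as a B+-tree-like index. */
class KeyRange implements IndexPredicate<Long> {
    final long low, high;
    KeyRange(long low, long high) { this.low = low; this.high = high; }
    public boolean consistent(Long point) { return point >= low && point <= high; }
}

/** A 2-D bounding box: instantiating the framework as an R-tree-like index. */
class BoundingBox implements IndexPredicate<double[]> {
    final double minX, minY, maxX, maxY;
    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }
    public boolean consistent(double[] p) {
        return p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY;
    }
}

/** A node holds (predicate, child-or-record) entries; search is predicate-driven and type-agnostic. */
class GistNode<Q, V> {
    static class Entry<Q, V> {
        final IndexPredicate<Q> predicate;
        final GistNode<Q, V> child;   // null for leaf entries
        final V record;               // null for internal entries
        Entry(IndexPredicate<Q> p, GistNode<Q, V> c, V r) { predicate = p; child = c; record = r; }
    }

    final List<Entry<Q, V>> entries = new ArrayList<>();

    /** Collects all records whose predicates are consistent with the query. */
    void search(Q query, List<V> results) {
        for (Entry<Q, V> e : entries) {
            if (!e.predicate.consistent(query)) continue;
            if (e.child != null) e.child.search(query, results);
            else results.add(e.record);
        }
    }
}
```

Under this reading, instantiating the framework as a B+-tree or an R-tree amounts to supplying a different predicate implementation, which is what the abstract's extensibility claim suggests; how ScalaGiST distributes such trees over a cluster is not captured by this sketch.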



Citations
Journal ArticleDOI

ST-Hadoop: a MapReduce framework for spatio-temporal data

TL;DR: The key idea behind the performance gained in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
Book

The Era of Big Spatial Data: A Survey

TL;DR: This survey summarizes the state-of-the-art work in the area of big spatial data according to approach, architecture, language, indexing, querying, and visualization, and gives case studies of real application systems that make use of these systems to provide services for end users.
Journal ArticleDOI

Big spatial vector data management: a review

TL;DR: A review that surveys recent studies and research work in the data management field for big spatial vector data (BSVD), covering systematically not only the most recent published literature but also a global view of the main spatial technologies for BSVD, including data storage and organization, spatial indexing, processing methods, and spatial analysis.
Journal ArticleDOI

The era of big spatial data

TL;DR: This paper discusses the main features and components that need to be supported in a system to handle big spatial data efficiently, namely, language, indexing, query processing, and visualization, and reviews the recent work according to these four components.
Proceedings ArticleDOI

The era of big spatial data

TL;DR: This tutorial goes beyond the use of existing systems as-is and digs deep into their core components to describe how they are designed to handle big spatial data.
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on large clusters of commodity machines and is highly scalable.
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
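Both MapReduce entries above describe the programming model only at a high level. For readers unfamiliar with it, the canonical word-count example below sketches how a job is expressed as a map function that emits (word, 1) pairs and a reduce function that sums them; this is a generic Hadoop-style illustration, not code taken from the cited papers.

```java
// Canonical word-count sketch of the MapReduce programming model
// (generic illustration), using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map phase: for every word in a line of input, emit the pair (word, 1). */
class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

/** Reduce phase: the framework groups pairs by word; sum the counts for each word. */
class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```

The runtime described in the TL;DR above then parallelizes the map tasks over input splits, shuffles and groups the intermediate pairs by key, and feeds each group to a reduce task, handling machine failures along the way.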
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article

Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).

TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers, serving applications such as web indexing, Google Earth, and Google Finance.
Proceedings ArticleDOI

Dynamo: amazon's highly available key-value store

TL;DR: Dynamo is presented, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience and that makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.