Journal ArticleDOI

ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]

TLDR
ScalaGiST, a scalable generalized search tree that can be seamlessly integrated with Hadoop, is presented together with a cost-based data access optimizer for efficient query processing at run-time.
Abstract
MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it was initially designed to operate on raw data without utilizing any type of index. To alleviate the problem, we present ScalaGiST - a scalable generalized search tree that can be seamlessly integrated with Hadoop, together with a cost-based data access optimizer for efficient query processing at run-time. ScalaGiST provides extensibility in terms of data and query types, hence is able to support unconventional queries (e.g., multi-dimensional range and k-NN queries) in MapReduce systems, and can be dynamically deployed in large cluster environments for handling large numbers of users and large volumes of data. We have built ScalaGiST and demonstrated that it can be easily instantiated as common B+-tree and R-tree indexes, yet for dynamic distributed environments. Our extensive performance study shows that ScalaGiST provides efficient write and read performance, elastic scaling, and effective support for MapReduce execution of ad-hoc analytic queries. Performance comparisons with recent proposals of specialized distributed index structures, such as SpatialHadoop, Data Mapping, and RT-CAN, further confirm its efficiency.
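The abstract describes ScalaGiST as a GiST-style framework: the same index machinery supports both one-dimensional (B+-tree-like) and multi-dimensional (R-tree-like) predicates, and a cost-based optimizer decides at run-time whether a MapReduce job should use the index. As a rough sketch of that extensibility idea only, the Java fragment below shows a minimal GiST-style node whose search is driven by a user-supplied predicate type; all names (IndexPredicate, GistNode, KeyRange, BoundingBox) are hypothetical and are not the actual ScalaGiST API.

```java
// Minimal, hypothetical sketch of a GiST-style extensible index,
// illustrating the extensibility idea behind ScalaGiST (not its real API).
import java.util.ArrayList;
import java.util.List;

/** A predicate stored in an index node; concrete types decide what "consistent" means. */
interface IndexPredicate<Q> {
    boolean consistent(Q query);   // may the subtree (or record) match the query?
}

/** A 1-D key range: instantiating the framework as a B+-tree-like index. */
class KeyRange implements IndexPredicate<Long> {
    final long low, high;
    KeyRange(long low, long high) { this.low = low; this.high = high; }
    public boolean consistent(Long point) { return point >= low && point <= high; }
}

/** A 2-D bounding box: instantiating the framework as an R-tree-like index. */
class BoundingBox implements IndexPredicate<double[]> {
    final double minX, minY, maxX, maxY;
    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }
    public boolean consistent(double[] p) {
        return p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY;
    }
}

/** A node holds (predicate, child-or-record) entries; search is predicate-driven and type-agnostic. */
class GistNode<Q, V> {
    static class Entry<Q, V> {
        final IndexPredicate<Q> predicate;
        final GistNode<Q, V> child;   // null for leaf entries
        final V record;               // null for internal entries
        Entry(IndexPredicate<Q> p, GistNode<Q, V> c, V r) { predicate = p; child = c; record = r; }
    }

    final List<Entry<Q, V>> entries = new ArrayList<>();

    /** Collects all records whose predicates are consistent with the query. */
    void search(Q query, List<V> results) {
        for (Entry<Q, V> e : entries) {
            if (!e.predicate.consistent(query)) continue;
            if (e.child != null) e.child.search(query, results);
            else results.add(e.record);
        }
    }
}
```

Under this reading, instantiating the framework as a B+-tree or an R-tree amounts to supplying a different predicate implementation, which is what the abstract's extensibility claim suggests; how ScalaGiST distributes such trees over a cluster is not captured by this sketch.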



Citations
Journal ArticleDOI

ST-Hadoop: a MapReduce framework for spatio-temporal data

TL;DR: The key idea behind the performance gained in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
Book

The Era of Big Spatial Data: A Survey

TL;DR: This survey summarizes the state-of-the-art work in the area of big spatial data according to approach, architecture, language, indexing, querying, and visualization, and gives case studies of real application systems that make use of these systems to provide services for end users.
Journal ArticleDOI

Big spatial vector data management: a review

TL;DR: A review that surveys recent studies and research work in the data management field for big spatial vector data (BSVD), covering systematically not only the most recent published literature but also a global view of the main spatial technologies for BSVD, including data storage and organization, spatial indexing, processing methods, and spatial analysis.
Journal ArticleDOI

The era of big spatial data

TL;DR: This paper discusses the main features and components that need to be supported in a system to handle big spatial data efficiently, namely, language, indexing, query processing, and visualization, and reviews the recent work according to these four components.
Proceedings ArticleDOI

The era of big spatial data

TL;DR: This tutorial goes beyond the use of existing systems as-is and digs deep into their core components to describe how they are designed to handle big spatial data.
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on large clusters of commodity machines and is highly scalable.
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
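Both MapReduce entries above describe the programming model only at a high level. For readers unfamiliar with it, the canonical word-count example below sketches how a job is expressed as a map function that emits (word, 1) pairs and a reduce function that sums them; this is a generic Hadoop-style illustration, not code taken from the cited papers.

```java
// Canonical word-count sketch of the MapReduce programming model
// (generic illustration), using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map phase: for every word in a line of input, emit the pair (word, 1). */
class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

/** Reduce phase: the framework groups pairs by word; sum the counts for each word. */
class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```

The runtime described in the TL;DR above then parallelizes the map tasks over input splits, shuffles and groups the intermediate pairs by key, and feeds each group to a reduce task, handling machine failures along the way.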
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article

Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).

TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers, serving applications such as web indexing, Google Earth, and Google Finance.
Proceedings ArticleDOI

Dynamo: amazon's highly available key-value store

TL;DR: Dynamo is presented, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience and that makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.