Journal ArticleDOI
ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]
Peng Lu, Gang Chen, Beng Chin Ooi, Hoang Tam Vo, Sai Wu
- Vol. 7, Iss: 14, pp 1797-1808
TLDR
ScalaGiST, a scalable generalized search tree that can be seamlessly integrated with Hadoop, is presented together with a cost-based data access optimizer for efficient query processing at run-time.
Abstract:
MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it was initially designed to operate on raw data without utilizing any type of indexes. To alleviate the problem, we present ScalaGiST, a scalable generalized search tree that can be seamlessly integrated with Hadoop, together with a cost-based data access optimizer for efficient query processing at run-time. ScalaGiST provides extensibility in terms of data and query types, hence is able to support unconventional queries (e.g., multi-dimensional range and k-NN queries) in MapReduce systems, and can be dynamically deployed in large cluster environments for handling big users and data. We have built ScalaGiST and demonstrated that it can be easily instantiated as common B+-tree and R-tree indexes, yet for dynamic distributed environments. Our extensive performance study shows that ScalaGiST can provide efficient write and read performance, elastic scaling property, as well as effective support for MapReduce execution of ad-hoc analytic queries. Performance comparisons with recent proposals of specialized distributed index structures, such as SpatialHadoop, Data Mapping, and RT-CAN, further confirm its efficiency.
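The extensibility claim in the abstract rests on the general GiST idea: one tree-search routine works for any key type that supplies a small set of methods. The sketch below illustrates that idea only; it is not ScalaGiST's API. The names `Interval`, `Rect`, `Node`, `leaf`, `parent`, and `search` are hypothetical, and only two of the classic GiST methods (`consistent` and `union`) are shown, with no insertion, splitting, or distribution.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interval:
    """1-D key, giving a B+-tree-like instantiation (hypothetical name)."""
    lo: float
    hi: float

    def consistent(self, q: Interval) -> bool:
        # True if this key's range may overlap the range query q.
        return self.lo <= q.hi and q.lo <= self.hi

    def union(self, other: Interval) -> Interval:
        # Smallest interval covering both keys.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))


@dataclass(frozen=True)
class Rect:
    """2-D key, giving an R-tree-like instantiation (hypothetical name)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def consistent(self, q: Rect) -> bool:
        # True if the two rectangles intersect.
        return (self.x1 <= q.x2 and q.x1 <= self.x2
                and self.y1 <= q.y2 and q.y1 <= self.y2)

    def union(self, other: Rect) -> Rect:
        # Minimum bounding rectangle of both keys.
        return Rect(min(self.x1, other.x1), min(self.y1, other.y1),
                    max(self.x2, other.x2), max(self.y2, other.y2))


@dataclass
class Node:
    key: object                      # predicate covering everything below
    children: list = field(default_factory=list)  # empty for leaf entries
    value: object = None             # payload, set only on leaf entries


def leaf(key, value) -> Node:
    return Node(key, [], value)


def parent(children) -> Node:
    # An inner node's key is the union of its children's keys.
    key = children[0].key
    for child in children[1:]:
        key = key.union(child.key)
    return Node(key, list(children))


def search(node: Node, query) -> list:
    # Generic search: prune any subtree whose key is not consistent with
    # the query. The routine never inspects the key type itself.
    if not node.key.consistent(query):
        return []
    if not node.children:
        return [node.value]
    hits = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits
```

The same `search` function serves both key types: a tree built from `Interval` keys answers 1-D range queries, and a tree built from `Rect` keys answers 2-D window queries. A full GiST would also need `penalty` and `pick_split` style methods to guide insertion and node splitting.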
Citations
Journal ArticleDOI
ST-Hadoop: a MapReduce framework for spatio-temporal data
TL;DR: The key idea behind the performance gains in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
Book
The Era of Big Spatial Data: A Survey
Ahmed Eldawy, Mohamed F. Mokbel
TL;DR: This survey summarizes the state-of-the-art work in the area of big spatial data according to approach, architecture, language, indexing, querying, and visualization, and gives case studies of real application systems that make use of these systems to provide services for end users.
Journal ArticleDOI
Big spatial vector data management: a review
Xiaochuang Yao, Guoqing Li
TL;DR: A review that surveys recent research on data management for big spatial vector data (BSVD), systematically summarizing both the most recently published literature and a global view of the main spatial technologies of BSVD, including data storage and organization, spatial index, processing methods, and spatial analysis.
Journal ArticleDOI
The era of big spatial data
Ahmed Eldawy, Mohamed F. Mokbel
TL;DR: This paper discusses the main features and components that need to be supported in a system to handle big spatial data efficiently, namely, language, indexing, query processing, and visualization, and reviews the recent work according to these four components.
Proceedings ArticleDOI
The era of big spatial data
Ahmed Eldawy, Mohamed F. Mokbel
TL;DR: This tutorial goes beyond the use of existing systems as-is, and digs deep into the core components of big systems to describe how they are designed to handle big spatial data.
References
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal ArticleDOI
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article
Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!).
Fay W. Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Deepak Chandra, Andrew Fikes, Robert Gruber
TL;DR: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. It is used by Google projects including web indexing, Google Earth, and Google Finance.
Proceedings ArticleDOI
Dynamo: amazon's highly available key-value store
Giuseppe DeCandia, Deniz Hastorun, Madan Mohan Rao Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Sven Vosshall, Werner Vogels
TL;DR: Dynamo is presented, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience and that makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.