Home
/
Authors
/
Alan Gates

Author

Alan Gates

Bio: Alan Gates is an academic researcher from Yahoo!. The author has contributed to research in topics: Big data & Data warehouse. The author has an hindex of 6, co-authored 6 publications receiving 628 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

[...]

Alan Gates¹, Olga Natkovich¹, Shubham Chopra¹, Pradeep Kamath¹, Shravan Narayanamurthy¹, Christopher Olston¹, Benjamin Reed¹, Santhosh Srinivasan¹, Utkarsh Srivastava¹ - Show less +5 more•Institutions (1)

Yahoo!¹

01 Aug 2009

TL;DR: Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce, and performance comparisons between Pig execution and raw Map- Reduce execution are reported.

...read moreread less

Abstract: Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.

...read moreread less

452 citations

Proceedings Article•DOI•

Major technical advancements in apache hive

[...]

Yin Huai¹, Ashutosh Chauhan, Alan Gates, Günther Hagleitner, Eric N. Hanson², Owen O'Malley, Jitendra Pandey, Yuan Yuan¹, Rubao Lee¹, Xiaodong Zhang¹ - Show less +6 more•Institutions (2)

Ohio State University¹, Microsoft²

18 Jun 2014

TL;DR: A community-based effort on technical advancements in Hive provides significant improvements on storage efficiency and query execution performance and shows how academic research lays a foundation for Hive to improve its daily operations.

...read moreread less

Abstract: Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Closely working with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. In order to make Hive continuously satisfy the requests and requirements of processing increasingly high volumes data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts on advancing Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements on storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

...read moreread less

131 citations

Patent•

Clustered query support for a database query engine

[...]

Olga Natkovich¹, Jonathan Cao¹, Alan Gates¹•Institutions (1)

Yahoo!¹

22 Dec 2006

TL;DR: In this article, the plurality of queries is transformed into a plurality of parse trees, and a determination is made whether the plurality operates on at least the same portion of the same table.

...read moreread less

Abstract: A device, system, and method are directed towards combining a plurality of queries to a database into a combined execution plan. The plurality of queries is received. The queries may be Structured Query Language (SQL) statements. The database may be a relational database. The plurality of queries is transformed into a plurality of parse trees. A determination is made whether the plurality of queries operates on at least the same portion of the same table. If so, then the plurality of query trees is query-optimized. The plurality of query trees are combined into a master query tree based on similar nodes in the plurality of query trees. A split node in the master query tree represents non-similarities between the plurality of query trees. The master query tree is transformed into an execution plan. The execution plan is applied to a database to return at least one result.

...read moreread less

35 citations

Proceedings Article•DOI•

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

[...]

Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, Günther Hagleitner - Show less +17 more

25 Jun 2019

TL;DR: Apache Hive as mentioned in this paper is an open-source relational database system for analytic big-data workloads that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications.

...read moreread less

Abstract: Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axis: Transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

...read moreread less

34 citations

Book•

Programming Pig: Dataflow Scripting with Hadoop

[...]

Alan Gates, Daniel Dai

09 Nov 2016

TL;DR: This second edition of the Apache Pig scripting platform guide is the ideal learning tool for new and experienced users alike, with comprehensive coverage on key features such as the Pig Latin scripting language and the Grunt shell.

...read moreread less

Abstract: For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets. Updated with use cases and programming examples, this second edition is the ideal learning tool for new and experienced users alike. Youll find comprehensive coverage on key features such as the Pig Latin scripting language and the Grunt shell. When you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig. Delve into Pigs data model, including scalar and complex data types Write Pig Latin scripts to sort, group, join, project, and filter your data Use Grunt to work with the Hadoop Distributed File System (HDFS) Build complex data processing pipelines with Pigs macros and modularity features Embed Pig Latin in Python for iterative processing and other advanced tasks Use Pig with Apache Tez to build high-performance batch and interactive data processing applications Create your own load and store functions to handle data formats and storage mechanisms

...read moreread less

10 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

The Hadoop Distributed File System

[...]

Konstantin Shvachko¹, Hairong Kuang¹, Sanjay Radia¹, Robert J. Chansler¹•Institutions (1)

Yahoo!¹

03 May 2010

TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.

...read moreread less

Abstract: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

...read moreread less

5,005 citations

Journal Article•DOI•

Big Data: A Survey

[...]

Min Chen¹, Shiwen Mao², Yunhao Liu³•Institutions (3)

Huazhong University of Science and Technology¹, Auburn University², Tsinghua University³

01 Apr 2014-Mobile Networks and Applications

TL;DR: The background and state-of-the-art of big data are reviewed, including enterprise management, Internet of Things, online social networks, medial applications, collective intelligence, and smart grid, as well as related technologies.

...read moreread less

Abstract: In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as could computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine the several representative applications of big data, including enterprise management, Internet of Things, online social networks, medial applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area. This survey is concluded with a discussion of open problems and future directions.

...read moreread less

2,303 citations

Journal Article•DOI•

Toward Scalable Systems for Big Data Analytics: A Technology Tutorial

[...]

Han Hu¹, Yonggang Wen², Tat-Seng Chua¹, Xuelong Li³•Institutions (3)

National University of Singapore¹, Nanyang Technological University², Chinese Academy of Sciences³

24 Jun 2014-IEEE Access

TL;DR: This paper presents a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and presents the prevalent Hadoop framework for addressing big data challenges.

...read moreread less

Abstract: Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics as compared with traditional data. For instance, big data is commonly unstructured and require more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and instill a do-it-yourself spirit for advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.

...read moreread less

1,002 citations

Journal Article•DOI•

Parallel data processing with MapReduce: a survey

[...]

Kyong-Ha Lee¹, Yoon-Joon Lee¹, Hyunsik Choi², Yon Dohn Chung², Bongki Moon³ - Show less +1 more•Institutions (3)

KAIST¹, Korea University², University of Arizona³

11 Jan 2012

TL;DR: In this survey, the MapReduce framework is characterized and its inherent pros and cons are discussed, and its optimization strategies reported in the recent literature are introduced.

...read moreread less

Abstract: A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.

...read moreread less

663 citations

Journal Article•DOI•

Hadoop GIS: a high performance spatial data warehousing system over mapreduce

[...]

Ablimit Aji¹, Fusheng Wang¹, Hoang Vo¹, Rubao Lee², Qiaoling Liu¹, Xiaodong Zhang², Joel H. Saltz¹ - Show less +3 more•Institutions (2)

Emory University¹, Ohio State University²

01 Aug 2013

TL;DR: Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop and integrated into Hive to support declarative spatial queries with an integrated architecture is presented.

...read moreread less

Abstract: Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS - a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have showed that performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of library for processing spatial queries, and as an integrated software package in Hive.

...read moreread less

571 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130

Collapse