Institution

Cloudera

About: Cloudera is known for research contributions in the topics of big data and SQL. The organization has 88 authors who have published 76 publications receiving 3500 citations. The organization is also known as Cloudera, Inc.

Topics: Big data, SQL, Distributed database, Scheduling (computing), Scalability

Papers

Journal Article

MLlib: machine learning in apache spark

Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks¹, Shivaram Venkataraman¹, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen², Doris Xin³, Reynold Xin, Michael J. Franklin¹, Reza Bosagh Zadeh⁴, Matei Zaharia⁵, Ameet Talwalkar⁶

University of California, Berkeley¹, Cloudera², University of Illinois at Urbana-Champaign³, Stanford University⁴, Massachusetts Institute of Technology⁵, University of California, Los Angeles⁶

01 Jan 2016-Journal of Machine Learning Research

TL;DR: MLlib is an open-source distributed machine learning library for Apache Spark that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

1,551 citations

Proceedings Article

Impala: A Modern, Open-Source SQL Engine for Hadoop.

Marcel Kornacker¹, Alexander Behm¹, Victor Bittorf², Taras Bobrovytsky¹, Casey Ching¹, Alan Choi¹, Justin Erickson¹, Martin Grund³, Daniel Hecht¹, Matthew Jacobs⁴, Ishaan Joshi¹, Lenni Kuff¹, Dileep Kumar¹, Alex Leblang¹, Nong Li¹, Ippokratis Pandis⁵, Henry Noel Robinson¹, David Rorke¹, Silvius Rus⁶, John Russell, Dimitris Tsirogiannis¹, Skye Wanderman-Milne¹, Michael Yoder¹

Cloudera¹, University of Wisconsin-Madison², University of Fribourg³, Microsoft⁴, IBM⁵, Google⁶

01 Jan 2015

TL;DR: This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.

Abstract: Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.

350 citations

Journal Article

AsterixDB: a scalable, open source BDMS

Sattam Alsubaiee¹, Yasser Altowim¹, Hotham Altwaijry¹, Alexander Behm², Vinayak Borkar¹, Yingyi Bu¹, Michael J. Carey¹, Inci Cetindil¹, Madhusudan Cheelangi³, Khurram Faraaz⁴, Eugenia Gabrielova¹, Raman Grover¹, Zachary Heilbron¹, Young-Seok Kim¹, Chen Li¹, Guangqiang Li, Ji Mahn Ok¹, Nicola Onose, Pouria Pirzadeh¹, Vassilis J. Tsotras⁵, Rares Vernica⁶, Jian Wen⁷, Till Westmann⁷

University of California, Irvine¹, Cloudera², Google³, IBM⁴, University of California, Riverside⁵, Hewlett-Packard⁶, Oracle Corporation⁷

01 Oct 2014

TL;DR: AsterixDB is a full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem.

Abstract: AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for things that both technologies can do. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

185 citations

Posted Content

AsterixDB: A Scalable, Open Source BDMS

University of California, Irvine¹, Cloudera², Google³, IBM⁴, University of California, Riverside⁵, Hewlett-Packard⁶, Oracle Corporation⁷

02 Jul 2014-arXiv: Databases

TL;DR: This paper is the first complete description of the resulting open source AsterixDB system, covering the system's data model, its query language, and its software architecture.

168 citations

Journal Article

Big Data’s Role in Precision Public Health

Shawn Dolley¹

Cloudera¹

07 Mar 2018-Frontiers in Public Health

TL;DR: This review article aims to identify the precision public health use cases where big data has added value, identify classes of value that big data may bring, and outline the risks inherent in using big data in precision public health efforts.

Abstract: Precision public health is an emerging practice to more granularly predict and understand public health risks and customize treatments for more specific and homogeneous sub-populations, often using new data, technologies, and methods. Big data is one element that has consistently helped to achieve these goals, through its ability to deliver to practitioners a volume and variety of structured or unstructured data not previously possible. Big data has enabled more widespread and specific research and trials of stratifying and segmenting populations at risk for a variety of health problems. Examples of success using big data are surveyed in surveillance and signal detection, predicting future risk, targeted interventions, and understanding disease. Using novel big data or big data approaches has risks that remain to be resolved. The continued growth in volume and variety of available data, decreased costs of data capture, and emerging computational methods mean big data success will likely be a required pillar of precision public health into the future. This review article aims to identify the precision public health use cases where big data has added value, identify classes of value that big data may bring, and outline the risks inherent in using big data in precision public health efforts.

141 citations

Authors

Showing all 88 results

Name	H-index	Papers	Citations
Ippokratis Pandis	22	75	2695
Uri Laserson	22	35	3034
Nishkam Ravi	19	40	2474
Dimitris Tsirogiannis	13	17	1469
Ariel Rabkin	12	21	16215
Victor Bittorf	11	14	1456
Marcel Kornacker	11	16	828
Lei Lu	10	19	320
Eduardo A. Garcia	10	21	354
Alexander Behm	9	13	712
Henry Noel Robinson	8	9	614
Lei Xu	8	10	225
Victor Dibia	7	20	204
Karthik Kambatla	6	10	978
Patrick David Hunt	6	6	1943