Author

Shivaram Venkataraman

Bio: Shivaram Venkataraman is an academic researcher from the University of Wisconsin-Madison. The author has contributed to research in topics: Scheduling (computing) & Computer science. The author has an h-index of 31 and has co-authored 86 publications receiving 6,423 citations. Previous affiliations of Shivaram Venkataraman include Microsoft & the University of Illinois at Urbana–Champaign.


Papers
Journal ArticleDOI
TL;DR: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Abstract: This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

1,776 citations
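
As an illustration of the unified model this framework offers, the same DataFrame transformation can serve both a batch job and a streaming job. The PySpark sketch below is a minimal, hypothetical example; the input paths, schema, and data are invented.

```python
# Minimal PySpark sketch: one API for batch and streaming workloads.
# The file paths and data here are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Batch: read a static dataset and aggregate it.
batch_df = spark.read.json("events/")              # hypothetical path
batch_df.groupBy("user").count().show()

# Streaming: the same transformation over a live directory source.
stream_df = spark.readStream.schema(batch_df.schema).json("incoming/")
query = (stream_df.groupBy("user").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()  # runs until interrupted
```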

Journal Article
TL;DR: This paper presents MLlib, Spark's open-source distributed machine learning library, which provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
Abstract: Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

1,551 citations
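
To make the pipeline API concrete, here is a minimal sketch using Spark's Python ML package; the data, column names, and parameters are invented for illustration.

```python
# Minimal sketch of an end-to-end pipeline with Spark's ML library.
# Data, column names, and hyperparameters are invented placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, -0.3, 2.1), (0.0, 0.8, -1.4)],
    ["label", "f1", "f2"],
)

# Assemble raw columns into a feature vector, then fit a classifier;
# Pipeline chains the stages so the whole flow trains in one call.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(df)

model.transform(df).select("label", "prediction").show()
```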

Proceedings ArticleDOI
15 Feb 2011
TL;DR: This paper presents Consistent and Durable Data Structures (CDDSs), building blocks for single-level data stores that, on current hardware, allow programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies.
Abstract: The predicted shift to non-volatile, byte-addressable memory (e.g., Phase Change Memory and Memristor), the growth of "big data", and the subsequent emergence of frameworks such as memcached and NoSQL systems require us to rethink the design of data stores. To derive the maximum performance from these new memory technologies, this paper proposes the use of single-level data stores. For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allow programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging. The same versioning scheme also enables rollback for failure recovery. When compared to a memory-backed Berkeley DB B-Tree, our prototype-based results show that a CDDS B-Tree can increase put and get throughput by 74% and 138%, respectively. When compared to Cassandra, a two-level data store, Tembo, a CDDS B-Tree-enabled distributed key-value system, increases throughput by up to 250%-286%.

403 citations
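
The versioning scheme at the heart of CDDSs can be sketched abstractly: each update writes a new version instead of mutating data in place, so updates are atomic without a separate log, and recovery simply discards versions newer than the last committed one. The Python below is a toy model of that idea, not the paper's NVM-based implementation.

```python
# Toy model of CDDS-style versioning: updates create new versions
# rather than overwriting, so no write-ahead log is needed and
# recovery can roll back by dropping uncommitted versions.
class VersionedStore:
    def __init__(self):
        self.entries = {}    # key -> list of (version, value)
        self.committed = 0   # highest durable version number

    def put(self, key, value):
        new_version = self.committed + 1
        self.entries.setdefault(key, []).append((new_version, value))
        # On real NVM, the new version would be flushed to persistent
        # memory before the version counter is bumped atomically.
        self.committed = new_version

    def get(self, key):
        # Read the newest version at or below the committed counter.
        live = [(v, val) for v, val in self.entries.get(key, [])
                if v <= self.committed]
        return live[-1][1] if live else None

    def recover(self):
        # Failure recovery: discard any version newer than `committed`.
        for key, versions in self.entries.items():
            self.entries[key] = [(v, val) for v, val in versions
                                 if v <= self.committed]

store = VersionedStore()
store.put("x", 1)
store.put("x", 2)
print(store.get("x"))  # 2
```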

Proceedings Article
16 Mar 2016
TL;DR: This paper presents Ernest, a performance prediction framework for large-scale analytics; an evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
Abstract: Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads on cloud computing infrastructure. However, efficiently running these applications on shared infrastructure is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal configuration. Our insight is that a number of jobs have predictable structure in terms of computation and communication. Thus we can build performance models based on the behavior of the job on small samples of data and then predict its performance on larger datasets and cluster sizes. To minimize the time and resources spent in building a model, we use optimal experiment design, a statistical technique that allows us to collect as few training points as required. We have built Ernest, a performance prediction framework for large-scale analytics, and our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while the training overhead is less than 5% for long-running jobs.

401 citations
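
To make the modeling approach concrete, the sketch below fits a small performance model of the kind the paper describes (a serial term, a computation term scaling with input per machine, and communication terms) using non-negative least squares. The exact feature set and all training data here are assumptions for illustration.

```python
# Sketch of an Ernest-style performance model: fit job runtime as a
# function of input scale and cluster size from a few small training
# runs, then extrapolate to larger configurations. Feature terms and
# data are illustrative assumptions, not the paper's exact setup.
import numpy as np
from scipy.optimize import nnls

def features(scale, machines):
    # [serial cost, parallel computation, tree aggregation, per-machine overhead]
    return [1.0, scale / machines, np.log2(machines), machines]

# Synthetic training runs: (input scale, machines, measured seconds).
runs = [(0.125, 2, 110.0), (0.25, 4, 120.0),
        (0.25, 8, 95.0),   (0.5, 16, 105.0)]

X = np.array([features(s, m) for s, m, _ in runs])
y = np.array([t for _, _, t in runs])

theta, _ = nnls(X, y)  # constrain coefficients to be non-negative

# Predict runtime of the full dataset (scale=1.0) on 64 machines.
pred = np.dot(features(1.0, 64), theta)
print(f"predicted runtime: {pred:.1f}s")
```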

Proceedings ArticleDOI
24 Sep 2017
TL;DR: This paper argues that stateless functions are a natural fit for data processing in future computing environments: given recent trends in network bandwidth and the advent of disaggregated storage, stateless functions offer a viable platform that eliminates cluster management overhead and fulfills the promise of elasticity.
Abstract: Distributed computing remains inaccessible to a large number of users, in spite of many open source platforms and extensive commercial offerings. While distributed computation frameworks have moved beyond a simple map-reduce model, many users are still left to struggle with complex cluster management and configuration tools, even for running simple embarrassingly parallel jobs. We argue that stateless functions represent a viable platform for these users, eliminating cluster management overhead and fulfilling the promise of elasticity. Furthermore, using our prototype implementation, PyWren, we show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Extrapolating from recent trends in network bandwidth and the advent of disaggregated storage, we suggest that stateless functions are a natural fit for data processing in future computing environments.

369 citations
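
The programming model the paper proposes is essentially a serverless map: serialize a plain Python function, ship it to stateless cloud functions, and collect results through shared storage. The sketch below follows the PyWren-style API; exact names and signatures may differ across versions, so treat it as illustrative.

```python
# Sketch of the PyWren-style programming model: an embarrassingly
# parallel map over stateless cloud functions. API names follow the
# project but may vary across versions; this is illustrative only.
import pywren

def simulate(seed):
    # Any self-contained, stateless function can be shipped; inputs
    # and outputs travel through disaggregated (object) storage.
    import random
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(100_000))
    return hits / 100_000 * 4  # Monte Carlo estimate of pi

pwex = pywren.default_executor()
futures = pwex.map(simulate, range(32))   # one invocation per seed
estimates = [f.result() for f in futures]
print(sum(estimates) / len(estimates))
```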


Cited by
Proceedings ArticleDOI
13 Aug 2016
TL;DR: This paper presents XGBoost, a scalable end-to-end tree boosting system; it proposes a sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning, achieving state-of-the-art results on many machine learning challenges.
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

14,872 citations
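
For readers unfamiliar with the system, training a model with the xgboost Python package looks roughly like the sketch below; the synthetic data and hyperparameter values are placeholders.

```python
# Minimal sketch of training a gradient-boosted tree model with the
# xgboost Python package. Data and hyperparameters are placeholders.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dtest = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.3}
model = xgb.train(params, dtrain, num_boost_round=50,
                  evals=[(dtest, "test")])

preds = model.predict(dtest)          # predicted probabilities
print("accuracy:", ((preds > 0.5) == y[800:]).mean())
```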

Proceedings ArticleDOI
24 Oct 2016
TL;DR: In this paper, the authors develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy, and demonstrate that they can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
Abstract: Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.

2,944 citations
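
The mechanism at the core of this work is differentially private SGD: clip each example's gradient to a fixed L2 norm, add Gaussian noise calibrated to that clipping bound, and only then update the parameters. The NumPy sketch below shows one such step on a toy linear model; the model and all constants are illustrative, and no privacy accounting is performed.

```python
# Minimal NumPy sketch of one DP-SGD step: per-example gradient
# clipping followed by Gaussian noise. Toy linear model; constants
# are illustrative and no privacy accountant is included.
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    n = len(X)
    grads = np.zeros_like(w)
    for xi, yi in zip(X, y):
        # Per-example gradient of squared error for a linear model.
        g = 2 * (xi @ w - yi) * xi
        # Clip each example's gradient to L2 norm <= clip_norm.
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)
        grads += g
    # Add Gaussian noise scaled to the clipping bound, then average.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=w.shape)
    return w - lr * (grads + noise) / n

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 5))
true_w = np.arange(5.0)
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(5)
for _ in range(200):
    w = dp_sgd_step(w, X, y)
print(w)  # noisy estimate of true_w
```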
