Top 11 papers published by Ameet Talwalkar from Carnegie Mellon University in 2013

Journal Article

A large-scale evaluation of computational protein function prediction

Predrag Radivojac¹, Wyatt T. Clark¹, Tal Ronnen Oron², Alexandra M. Schnoes³, Tobias Wittkop², Artem Sokolov⁴,⁵, Kiley Graim⁵, Christopher S. Funk⁶, Karin Verspoor⁶, Asa Ben-Hur⁵, Gaurav Pandey⁷,⁸, Jeffrey M. Yunes⁸, Ameet Talwalkar⁸, Susanna Repo⁸,⁹, Michael L Souza⁸, Damiano Piovesan¹⁰, Rita Casadio¹⁰, Zheng Wang¹¹, Jianlin Cheng¹¹, Hai Fang, Julian Gough¹², Patrik Koskinen¹³, Petri Törönen¹³, Jussi Nokso-Koivisto¹³, Liisa Holm¹³, Domenico Cozzetto¹⁴, Daniel W. A. Buchan¹⁴, Kevin Bryson¹⁴, David T. Jones¹⁴, Bhakti Limaye¹⁵, Harshal Inamdar¹⁵, Avik Datta¹⁵, Sunitha K Manjari¹⁵, Rajendra Joshi¹⁵, Meghana Chitale¹⁶, Daisuke Kihara¹⁶, Andreas Martin Lisewski¹⁷, Serkan Erdin¹⁷, Eric Venner¹⁷, Olivier Lichtarge¹⁷, Robert Rentzsch¹⁴, Haixuan Yang¹⁸, Alfonso E. Romero¹⁸, Prajwal Bhat¹⁸, Alberto Paccanaro¹⁸, Tobias Hamp¹⁹, Rebecca Kaßner¹⁹, Stefan Seemayer¹⁹, Esmeralda Vicedo¹⁹, Christian Schaefer¹⁹, Dominik Achten¹⁹, Florian Auer¹⁹, Ariane Boehm¹⁹, Tatjana Braun¹⁹, Maximilian Hecht¹⁹, Mark Heron¹⁹, Peter Hönigschmid¹⁹, Thomas A. Hopf¹⁹, Stefanie Kaufmann¹⁹, Michael Kiening¹⁹, Denis Krompass¹⁹, Cedric Landerer¹⁹, Yannick Mahlich¹⁹, Manfred Roos¹⁹, Jari Björne²⁰, Tapio Salakoski²⁰, Andrew Wong²¹, Hagit Shatkay²¹,²², Fanny Gatzmann²³, Ingolf Sommer²³, Mark N. Wass²⁴, Michael J.E. Sternberg²⁴, Nives Škunca, Fran Supek, Matko Bošnjak, Panče Panov, Sašo Džeroski, Tomislav Šmuc, Yiannis A. I. Kourmpetis²⁵,²⁶, Aalt D. J. van Dijk²⁶, Cajo J. F. ter Braak²⁶, Yuanpeng Zhou²⁷, Qingtian Gong²⁷, Xinran Dong²⁷, Weidong Tian²⁷, Marco Falda²⁸, Paolo Fontana, Enrico Lavezzo²⁸, Barbara Di Camillo²⁸, Stefano Toppo²⁸, Liang Lan²⁹, Nemanja Djuric²⁹, Yuhong Guo²⁹, Slobodan Vucetic²⁹, Amos Marc Bairoch³⁰,³¹, Michal Linial³², Patricia C. Babbitt³, Steven E. Brenner⁸, Christine A. Orengo¹⁴, Burkhard Rost¹⁹, Sean D. Mooney², Iddo Friedberg³³

Indiana University¹, Buck Institute for Research on Aging², University of California, San Francisco³, University of California, Santa Cruz⁴, Colorado State University⁵, University of Colorado Denver⁶, Icahn School of Medicine at Mount Sinai⁷, University of California, Berkeley⁸, European Bioinformatics Institute⁹, University of Bologna¹⁰, University of Missouri¹¹, University of Bristol¹², University of Helsinki¹³, University College London¹⁴, Centre for Development of Advanced Computing¹⁵, Purdue University¹⁶, Baylor College of Medicine¹⁷, Royal Holloway, University of London¹⁸, Technische Universität München¹⁹, University of Turku²⁰, Queen's University²¹, University UCINF²², Max Planck Society²³, Imperial College London²⁴, Nestlé²⁵, Wageningen University and Research Centre²⁶, Fudan University²⁷, University of Padua²⁸, Temple University²⁹, University of Geneva³⁰, Swiss Institute of Bioinformatics³¹, Hebrew University of Jerusalem³², Miami University³³

01 Mar 2013-Nature Methods

TL;DR: Today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets, and there is considerable need for improvement of currently available tools.

Abstract: Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

859 citations

Proceedings Article

MLbase: A Distributed Machine-learning System

Tim Kraska¹, Ameet Talwalkar², John C. Duchi², Rean Griffith³, Michael J. Franklin², Michael I. Jordan²

Brown University¹, University of California, Berkeley², VMware³

01 Jan 2013

TL;DR: This work presents the vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers, which provides a simple declarative way to specify ML tasks and a novel optimizer to select and dynamically adapt the choice of learning algorithm.

Abstract: Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.

359 citations

Posted Content

MLI: An API for Distributed Machine Learning

Evan R. Sparks¹, Ameet Talwalkar¹, Virginia Smith¹, Jey Kottalam¹, Xinghao Pan¹, Joseph E. Gonzalez¹, Michael J. Franklin¹, Michael I. Jordan¹, Tim Kraska²

University of California, Berkeley¹, Brown University²

21 Oct 2013-arXiv: Learning

TL;DR: The initial results show that this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

166 citations

Proceedings Article

MLI: An API for Distributed Machine Learning

Evan R. Sparks¹, Ameet Talwalkar¹, Virginia Smith¹, Jey Kottalam¹, Xinghao Pan¹, Joseph E. Gonzalez¹, Michael J. Franklin¹, Michael I. Jordan¹, Tim Kraska²

University of California, Berkeley¹, Brown University²

01 Dec 2013

TL;DR: MLI is an application programming interface designed to address the challenges of building machine learning algorithms in a distributed setting based on data-centric computing; its primary goal is to simplify the development of high-performance, scalable, distributed algorithms.

Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

113 citations

Journal Article

Large-scale SVD and manifold learning

Ameet Talwalkar¹, Sanjiv Kumar², Mehryar Mohri³, Henry Allan Rowley²

University of California, Berkeley¹, Google², Courant Institute of Mathematical Sciences³

01 Jan 2013-Journal of Machine Learning Research

TL;DR: The authors' comparisons show that the Nyström approximation is superior to the Column sampling method for this task, and approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set.

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set.

67 citations

Posted Content

SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling

Ameet Talwalkar¹, Jesse Liptrap¹, Julie Newcomb¹, Christopher Hartl¹, Jonathan Terhorst¹, Kristal Curtis¹, Ma'ayan Bresler¹, Yun S. Song¹, Michael I. Jordan¹, David A. Patterson¹

Broad Institute¹

31 Oct 2013-arXiv: Genomics

TL;DR: SMaSH is a benchmarking methodology for evaluating human genome variant calling algorithms, covering single nucleotide polymorphism (SNP), indel, and structural variant calling.

Abstract: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

46 citations

Proceedings Article

Distributed Low-Rank Subspace Segmentation

Ameet Talwalkar¹, Lester Mackey², Yadong Mu³, Shih-Fu Chang³, Michael I. Jordan¹

University of California, Berkeley¹, Stanford University², Columbia University³

01 Dec 2013

TL;DR: In this paper, a divide-and-conquer algorithm is proposed for large-scale subspace segmentation that can cope with low-rank representation's non-decomposable constraints and maintains LRR's strong recovery guarantees.

Abstract: Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR), a convex formulation of the subspace segmentation problem, is provably and empirically accurate on small problems but does not scale to the massive sizes of modern vision datasets. Moreover, past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations. We then introduce novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each case, we obtain state-of-the-art results and order-of-magnitude speed ups.

36 citations

Proceedings Article

A general bootstrap performance diagnostic

Ariel Kleiner¹, Ameet Talwalkar¹, Sameer Agarwal¹, Ion Stoica¹, Michael I. Jordan¹

University of California, Berkeley¹

11 Aug 2013

TL;DR: This work presents a general diagnostic procedure that directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether the bootstrap is performing satisfactorily when applied to a given dataset and estimator.

Abstract: As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates, much as we are able to automatically train and evaluate predictive models such as classifiers. However, despite the fundamental importance of estimator quality assessment in data analysis, this task has eluded highly automatic solutions. While the bootstrap provides perhaps the most promising step in this direction, its level of automation is limited by the difficulty of evaluating its finite sample performance and even its asymptotic consistency. Thus, we present here a general diagnostic procedure which directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether or not the bootstrap is performing satisfactorily when applied to a given dataset and estimator. We show that our proposed diagnostic is effective via an extensive empirical evaluation on a variety of estimators and simulated and real datasets, including a real-world query workload from Conviva, Inc. involving 1.7TB of data (i.e., approximately 0.5 billion data points).

27 citations

Posted Content

Distributed Low-rank Subspace Segmentation

Ameet Talwalkar¹, Lester Mackey², Yadong Mu³, Shih-Fu Chang³, Michael I. Jordan¹

University of California, Berkeley¹, Stanford University², Columbia University³

20 Apr 2013-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging and proposes a novel divide-and-conquer algorithm that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees.

Abstract: Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR), a convex formulation of the subspace segmentation problem, is provably and empirically accurate on small problems but does not scale to the massive sizes of modern vision datasets. Moreover, past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations. We then introduce novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each case, we obtain state-of-the-art results and order-of-magnitude speed ups.

4 citations

Posted Content

SMaSH: A Benchmarking Toolkit for Variant Calling

Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma'ayan Bresler, Yun S. Song, Michael I. Jordan, David A. Patterson

31 Oct 2013

TL;DR: In this article, the authors propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms, including single nucleotide polymorphism (SNP), indel, and structural variant calling.

Abstract: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

3 citations

Posted Content

Divide-and-Conquer Subspace Segmentation

Ameet Talwalkar, Lester Mackey, Yadong Mu, Shih-Fu Chang, Michael I. Jordan

20 Apr 2013

TL;DR: This work introduces the DFC-LRR algorithm as a scalable solution to the subspace segmentation problem, and presents results that illustrate the scalability and accuracy of DFC relative to LRR.

Abstract: Several important computer vision tasks have recently been formulated as low-rank problems, with the Low-Rank Representation method (LRR) being one recent and prominent formulation. Although the method is framed as a convex program, available solutions to this program are inherently sequential and costly, thus limiting its scalability. In this work, we explore the effectiveness of a recently introduced divide-and-conquer framework, entitled DFC, in the context of LRR. We introduce the DFC-LRR algorithm as a scalable solution to the subspace segmentation problem, presenting results that illustrate the scalability and accuracy of DFC relative to LRR. We further present a detailed theoretical analysis that shows that the recovery guarantees of DFC-LRR are comparable to those of LRR.
