
Showing papers by "Ameet Talwalkar published in 2013"


Journal ArticleDOI
Predrag Radivojac1, Wyatt T. Clark1, Tal Ronnen Oron2, Alexandra M. Schnoes3, Tobias Wittkop2, Artem Sokolov4, Artem Sokolov5, Kiley Graim5, Christopher S. Funk6, Karin Verspoor6, Asa Ben-Hur5, Gaurav Pandey7, Gaurav Pandey8, Jeffrey M. Yunes8, Ameet Talwalkar8, Susanna Repo8, Susanna Repo9, Michael L Souza8, Damiano Piovesan10, Rita Casadio10, Zheng Wang11, Jianlin Cheng11, Hai Fang, Julian Gough12, Patrik Koskinen13, Petri Törönen13, Jussi Nokso-Koivisto13, Liisa Holm13, Domenico Cozzetto14, Daniel W. A. Buchan14, Kevin Bryson14, David T. Jones14, Bhakti Limaye15, Harshal Inamdar15, Avik Datta15, Sunitha K Manjari15, Rajendra Joshi15, Meghana Chitale16, Daisuke Kihara16, Andreas Martin Lisewski17, Serkan Erdin17, Eric Venner17, Olivier Lichtarge17, Robert Rentzsch14, Haixuan Yang18, Alfonso E. Romero18, Prajwal Bhat18, Alberto Paccanaro18, Tobias Hamp19, Rebecca Kaßner19, Stefan Seemayer19, Esmeralda Vicedo19, Christian Schaefer19, Dominik Achten19, Florian Auer19, Ariane Boehm19, Tatjana Braun19, Maximilian Hecht19, Mark Heron19, Peter Hönigschmid19, Thomas A. Hopf19, Stefanie Kaufmann19, Michael Kiening19, Denis Krompass19, Cedric Landerer19, Yannick Mahlich19, Manfred Roos19, Jari Björne20, Tapio Salakoski20, Andrew Wong21, Hagit Shatkay21, Hagit Shatkay22, Fanny Gatzmann23, Ingolf Sommer23, Mark N. Wass24, Michael J.E. Sternberg24, Nives Škunca, Fran Supek, Matko Bošnjak, Panče Panov, Sašo Džeroski, Tomislav Šmuc, Yiannis A. I. Kourmpetis25, Yiannis A. I. Kourmpetis26, Aalt D. J. van Dijk26, Cajo J. F. ter Braak26, Yuanpeng Zhou27, Qingtian Gong27, Xinran Dong27, Weidong Tian27, Marco Falda28, Paolo Fontana, Enrico Lavezzo28, Barbara Di Camillo28, Stefano Toppo28, Liang Lan29, Nemanja Djuric29, Yuhong Guo29, Slobodan Vucetic29, Amos Marc Bairoch30, Amos Marc Bairoch31, Michal Linial32, Patricia C. Babbitt3, Steven E. Brenner8, Christine A. Orengo14, Burkhard Rost19, Sean D. Mooney2, Iddo Friedberg33 
TL;DR: Today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets, and there is considerable need for improvement of currently available tools.
Abstract: Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

859 citations


Proceedings Article
01 Jan 2013
TL;DR: This work presents the vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers, which provides a simple declarative way to specify ML tasks and a novel optimizer to select and dynamically adapt the choice of learning algorithm.
Abstract: Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.

359 citations


Posted Content
TL;DR: The initial results show that this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

166 citations


Proceedings ArticleDOI
01 Dec 2013
TL;DR: MLI is an application programming interface designed to address the challenges of building machine learning algorithms in a distributed setting based on data-centric computing; its primary goal is to simplify the development of high-performance, scalable, distributed algorithms.
Abstract: MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

113 citations


Journal Article
TL;DR: The authors' comparisons show that the Nyström approximation is superior to the Column sampling method for this task, and approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set.
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set.
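The Nyström approximation described in this abstract can be sketched in a few lines of NumPy: sample landmark columns of the kernel matrix, then reconstruct it from the sampled block and its pseudo-inverse. The RBF kernel, sample size, and data here are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    # Pairwise squared distances, then a Gaussian (RBF) kernel.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def nystrom(X, l, gamma=0.1, seed=None):
    """Rank-l Nystrom approximation of the full kernel matrix K(X, X)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=l, replace=False)  # sample l landmark points
    C = rbf_kernel(X, X[idx], gamma)            # n x l column block of K
    W = C[idx]                                  # l x l landmark-landmark block
    # K is approximated as C W^+ C^T, with W^+ the pseudo-inverse of W.
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = rbf_kernel(X, X)
K_hat = nystrom(X, 80, seed=1)
err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)  # relative Frobenius error
```

Because the approximation reproduces the sampled rows and columns exactly, its error is concentrated on the kernel's unsampled block and shrinks as the number of landmarks grows.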

67 citations


Posted Content
TL;DR: SMaSH is a benchmarking methodology for evaluating human genome variant calling algorithms, including single nucleotide polymorphism (SNP), indel, and structural variant calling.
Abstract: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
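SMaSH's full accuracy metrics are defined in the paper; the core idea of scoring a call set against benchmarking truth data can be sketched as precision and recall over variant records. This is a deliberate simplification: real evaluation must also normalize variant representation (left-alignment, multi-allelic splitting), which is omitted here.

```python
def variant_metrics(truth, calls):
    """Precision/recall of a predicted variant call set against a truth set.

    Variants are represented as (chrom, pos, ref, alt) tuples; equal tuples
    count as true positives. Representation normalization is assumed done.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)                       # correctly called variants
    precision = tp / len(calls) if calls else 0.0  # fraction of calls correct
    recall = tp / len(truth) if truth else 0.0     # fraction of truth found
    return precision, recall

truth = [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 40, "AT", "A")]
calls = [("chr1", 100, "A", "T"), ("chr2", 40, "AT", "A"), ("chr3", 7, "C", "G")]
p, r = variant_metrics(truth, calls)  # one missed variant, one false call
```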

46 citations


Proceedings ArticleDOI
01 Dec 2013
TL;DR: In this paper, a divide-and-conquer algorithm is proposed for large-scale subspace segmentation that can cope with low-rank representation's non-decomposable constraints and maintains LRR's strong recovery guarantees.
Abstract: Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR), a convex formulation of the subspace segmentation problem, is provably and empirically accurate on small problems but does not scale to the massive sizes of modern vision datasets. Moreover, past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations. We then introduce novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each case, we obtain state-of-the-art results and order-of-magnitude speed ups.

36 citations


Proceedings ArticleDOI
11 Aug 2013
TL;DR: This work presents here a general diagnostic procedure which directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether or not thebootstrap is performing satisfactorily when applied to a given dataset and estimator.
Abstract: As datasets become larger, more complex, and more available to diverse groups of analysts, it would be quite useful to be able to automatically and generically assess the quality of estimates, much as we are able to automatically train and evaluate predictive models such as classifiers. However, despite the fundamental importance of estimator quality assessment in data analysis, this task has eluded highly automatic solutions. While the bootstrap provides perhaps the most promising step in this direction, its level of automation is limited by the difficulty of evaluating its finite sample performance and even its asymptotic consistency. Thus, we present here a general diagnostic procedure which directly and automatically evaluates the accuracy of the bootstrap's outputs, determining whether or not the bootstrap is performing satisfactorily when applied to a given dataset and estimator. We show that our proposed diagnostic is effective via an extensive empirical evaluation on a variety of estimators and simulated and real datasets, including a real-world query workload from Conviva, Inc. involving 1.7TB of data (i.e., approximately 0.5 billion data points).
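The paper's diagnostic evaluates the bootstrap itself; the object being diagnosed is the plain bootstrap estimate of an estimator's variability, which can be sketched as a percentile confidence interval. The estimator, sample sizes, and replicate count below are illustrative, not the paper's settings.

```python
import numpy as np

def bootstrap_ci(data, estimator, B=2000, alpha=0.05, seed=None):
    """Percentile-bootstrap confidence interval for a generic estimator."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Recompute the estimator on B resamples drawn with replacement.
    stats = np.array([estimator(data[rng.integers(0, n, n)]) for _ in range(B)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=500)       # sample with true mean 10
lo, hi = bootstrap_ci(data, np.mean, seed=1)  # interval for the sample mean
```

A diagnostic in the spirit of the paper would then ask whether intervals like this one achieve their nominal coverage on the given dataset and estimator, rather than trusting the bootstrap's asymptotic guarantees.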

27 citations


Posted Content
TL;DR: This work introduces novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging and proposes a novel divide-and-conquer algorithm that can cope with LRR's non-decomposable constraints and maintains L RR's strong recovery guarantees.
Abstract: Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data. Low-Rank Representation (LRR), a convex formulation of the subspace segmentation problem, is provably and empirically accurate on small problems but does not scale to the massive sizes of modern vision datasets. Moreover, past work aimed at scaling up low-rank matrix factorization is not applicable to LRR given its non-decomposable constraints. In this work, we propose a novel divide-and-conquer algorithm for large-scale subspace segmentation that can cope with LRR's non-decomposable constraints and maintains LRR's strong recovery guarantees. This has immediate implications for the scalability of subspace segmentation, which we demonstrate on a benchmark face recognition dataset and in simulations. We then introduce novel applications of LRR-based subspace segmentation to large-scale semi-supervised learning for multimedia event detection, concept detection, and image tagging. In each case, we obtain state-of-the-art results and order-of-magnitude speed ups.

4 citations


Posted Content
31 Oct 2013
TL;DR: The authors propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms, including single nucleotide polymorphism (SNP), indel, and structural variant calling.
Abstract: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

3 citations


Posted Content
20 Apr 2013
TL;DR: This work introduces the DFC-LRR algorithm as a scalable solution to the subspace segmentation problem, and presents results that illustrate the scalability and accuracy of DFC relative to LRR.
Abstract: Several important computer vision tasks have recently been formulated as lowrank problems, with the Low-Rank Representation method (LRR) being one recent and prominent formulation. Although the method is framed as a convex program, available solutions to this program are inherently sequential and costly, thus limiting its scalability. In this work, we explore the effectiveness of a recently introduced divide-and-conquer framework, entitled DFC, in the context of LRR. We introduce the DFC-LRR algorithm as a scalable solution to the subspace segmentation problem, presenting results that illustrate the scalability and accuracy of DFC relative to LRR. We further present a detailed theoretical analysis that shows that the recovery guarantees of DFC-LRR are comparable to those of LRR.
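The generic Divide-Factor-Combine recipe behind DFC-LRR can be sketched as: partition the columns, solve each subproblem independently, then recombine by projecting onto a shared column space. In this sketch a plain truncated SVD stands in for the LRR base solver, so it illustrates only the divide-and-conquer skeleton, not LRR itself.

```python
import numpy as np

def low_rank(M, k):
    # Stand-in base solver: rank-k truncated SVD (DFC-LRR runs LRR here).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k]

def dfc(M, k, t, seed=None):
    """Divide-Factor-Combine sketch: split columns into t random groups,
    factor each group independently, then project every factored piece
    onto an orthonormal basis taken from the first piece."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(M.shape[1]), t)
    pieces = {tuple(g): low_rank(M[:, g], k) for g in groups}  # parallelizable
    Q, _ = np.linalg.qr(pieces[tuple(groups[0])])  # shared column basis
    out = np.empty_like(M, dtype=float)
    for g in groups:
        out[:, g] = Q @ (Q.T @ pieces[tuple(g)])   # column-projection step
    return out

rng = np.random.default_rng(0)
M = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 80))  # exact rank 5
M_hat = dfc(M, 5, 4, seed=1)
err = np.linalg.norm(M - M_hat) / np.linalg.norm(M)
```

On this noiseless rank-5 input each column block already spans the full column space, so the recombined estimate recovers the matrix essentially exactly; the factoring step over the t groups is what makes the scheme parallel and scalable.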