
Showing papers by "Robert Tibshirani" published in 2009


Journal ArticleDOI
TL;DR: A penalized matrix decomposition (PMD) is presented as a new framework for computing a rank-K approximation to a matrix, and connections are established between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006).
Abstract: We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of the reconstruction error, subject to penalties on $u_k$ and $v_k$.
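
To make the decomposition concrete, here is a minimal sketch of a single PMD factor in Python. The paper chooses thresholds by search so that L1 constraints hold exactly; this sketch instead uses fixed penalty levels `c_u` and `c_v`, and the function names and parameters are illustrative rather than from the paper.

```python
import numpy as np

def soft_threshold(a, c):
    """Soft-thresholding operator: sign(a) * max(|a| - c, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def rank1_pmd(X, c_u=1.0, c_v=1.0, n_iter=100):
    """One PMD factor: alternate soft-thresholded updates of u and v,
    each renormalized to unit length, to approximate X by d * u v'."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]  # warm start from SVD
    for _ in range(n_iter):
        u = soft_threshold(X @ v, c_u)
        u /= max(np.linalg.norm(u), 1e-12)
        v = soft_threshold(X.T @ u, c_v)
        v /= max(np.linalg.norm(v), 1e-12)
    d = u @ X @ v
    return d, u, v
```

A rank-K fit would follow by deflation: subtract d * u v' from X and repeat for the next factor.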

1,540 citations


Journal ArticleDOI
TL;DR: Sparse canonical correlation analysis (sparse CCA) is extended to sparse supervised CCA and sparse multiple CCA, identifying sparse linear combinations of two or more sets of variables that are correlated with each other and associated with an outcome.
Abstract: In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.
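
A rough illustration of the unsupervised sparse CCA step, assuming standardized columns so that the within-set covariances can be treated as the identity (a diagonal-covariance simplification common in this line of work). The alternating soft-thresholded updates on the cross-product matrix are a simplified stand-in for the paper's constrained updates; `c_u`, `c_v`, and the function name are assumptions for illustration.

```python
import numpy as np

def soft_threshold(a, c):
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def sparse_cca_pair(X, Z, c_u=1.0, c_v=1.0, n_iter=100):
    """First pair of sparse canonical vectors: alternately update u and v
    to (approximately) maximize u' (X'Z) v under L2 and L1 penalties."""
    K = X.T @ Z                      # cross-product of the two data sets
    u = np.random.default_rng(0).standard_normal(K.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = soft_threshold(K.T @ u, c_v)
        v /= max(np.linalg.norm(v), 1e-12)
        u = soft_threshold(K @ v, c_u)
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v                      # sparse weight vectors for X and Z
```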

417 citations


Journal ArticleDOI
TL;DR: It is shown that ridge regression, the lasso and the elastic net are special cases of covariance-regularized regression, and it is demonstrated that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations.
Abstract: In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
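
As a small worked special case, suppose the shrunken inverse covariance is obtained by simply adding a ridge term lambda*I to the sample covariance before inverting; regressing the response through that estimate reproduces ridge regression, one of the special cases the abstract mentions. The sketch below (names illustrative, not the paper's general penalized-likelihood estimator) makes the algebra concrete.

```python
import numpy as np

def covariance_regularized_beta(X, y, lam=1.0):
    """Ridge-type special case of covariance-regularized regression:
    shrink the sample covariance S = X'X/n by adding lam*I, invert,
    then regress y on X through the shrunken inverse covariance."""
    n = X.shape[0]
    S = X.T @ X / n
    theta = np.linalg.inv(S + lam * np.eye(S.shape[1]))  # shrunken inverse covariance
    return theta @ (X.T @ y / n)
```

Note that theta @ (X'y/n) = (X'X + n*lam*I)^{-1} X'y, i.e., exactly the ridge solution with penalty n*lam; other penalties on the inverse covariance yield the lasso- and elastic-net-like cases.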

247 citations


Book ChapterDOI
01 Jan 2009
TL;DR: The generalization performance of a learning method relates to its prediction capability on independent test data; assessing it guides the choice of learning method or model and gives a measure of the quality of the ultimately chosen model.
Abstract: The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model.

220 citations


Book ChapterDOI
01 Jan 2009
TL;DR: Boosting is one of the most powerful learning ideas introduced in the last ten years; originally designed for classification, it can, as will be seen in this chapter, profitably be extended to regression as well.
Abstract: Boosting is one of the most powerful learning ideas introduced in the last ten years. It was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” From this perspective boosting bears a resemblance to bagging and other committee-based approaches (Section 8.8). However we shall see that the connection is at best superficial and that boosting is fundamentally different.
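
The "committee of weak classifiers" idea can be seen in a compact AdaBoost.M1 sketch, the classification algorithm this chapter builds from (regression variants differ). Labels are assumed to be in {-1, +1} and the weak learner is a depth-1 tree; scikit-learn is used for the stumps.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """AdaBoost.M1 with decision stumps; y in {-1, +1}.
    Returns the weak learners and their committee weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # observation weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard the log
        alpha = np.log((1 - err) / err)
        w *= np.exp(alpha * (pred != y))         # upweight the misclassified
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def committee_predict(X, learners, alphas):
    """Weighted majority vote of the weak classifiers."""
    votes = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(votes)
```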

192 citations


Journal Article
TL;DR: An approximate procedure based on the pseudo-likelihood of Besag (1975) is implemented and generalized to a fast exact algorithm; the exact algorithm is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a), while the approximate procedures are faster still and only slightly less accurate.
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.
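
The pseudo-likelihood idea replaces the intractable joint likelihood with a product of each node's conditional likelihood given the rest. A common practical surrogate in the spirit of the approximate methods compared here (and of Wainwright et al.'s neighborhood selection) is one L1-penalized logistic regression per node followed by symmetrization; this sketch assumes a binary 0/1 data matrix whose columns each take both values, and is not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_structure(X, C=0.1):
    """Approximate structure estimation for a binary Markov network:
    regress each node on all others with L1-penalized logistic
    regression, then symmetrize the coefficient matrix by averaging.
    X: n-by-p array of 0/1 values."""
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        lr = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        lr.fit(X[:, others], X[:, j])
        Theta[j, others] = lr.coef_.ravel()
    return (Theta + Theta.T) / 2.0   # nonzero entries suggest edges
```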

189 citations


Journal ArticleDOI
15 Jul 2009-JAMA
TL;DR: A network model of a cooperative genetic landscape in gliomas shows that mutations, which are likely due to nonrandom selection of a distinct genetic landscape during gliomagenesis, are associated with patient prognosis.
Abstract: Context Gliomas, particularly glioblastomas, are among the deadliest of human tumors. Gliomas emerge through the accumulation of recurrent chromosomal alterations, some of which target yet-to-be-discovered cancer genes. A persistent question concerns the biological basis for the coselection of these alterations during gliomagenesis. Objectives To describe a network model of a cooperative genetic landscape in gliomas and to evaluate its clinical relevance. Design, Setting, and Patients Multidimensional genomic profiles and clinical profiles of 501 patients with gliomas (45 tumors in an initial discovery set collected between 2001 and 2004 and 456 tumors in validation sets made public between 2006 and 2008) from multiple academic centers in the United States and The Cancer Genome Atlas Pilot Project (TCGA). Main Outcome Measures Identification of genes with coincident genetic alterations, correlated gene dosage and gene expression, and multiple functional interactions; association between those genes and patient survival. Results Gliomas select for a nonrandom genetic landscape—a consistent pattern of chromosomal alterations—that involves altered regions (“territories”) on chromosomes 1p, 7, 8q, 9p, 10, 12q, 13q, 19q, 20, and 22q (false-discovery rate–corrected P < .05). The concurrent dosage alteration of 7 landscape genes (POLD2, CYCS, MYC, AKR1C3, YME1L1, ANXA7, and PDCD4) is associated with the duration of overall survival in 189 glioblastoma samples from TCGA (global log-rank P = .02 comparing 3 survival curves for patients with 0-2, 3-4, and 5-7 dosage-altered genes). Groups of patients with 0 to 2 (low-risk group) and 5 to 7 (high-risk group) dosage-altered genes experienced 49.24 and 79.56 deaths per 100 person-years (hazard ratio [HR], 1.63; 95% confidence interval [CI], 1.10-2.40; Cox regression model P = .02), respectively. These associations with survival are validated using gene expression data in 3 independent glioma studies, comprising 76 (global log-rank P = .003; 47.89 vs 15.13 deaths per 100 person-years for high risk vs low risk; Cox model HR, 3.04; 95% CI, 1.49-6.20; P = .002) and 70 (global log-rank P = .008; 83.43 vs 16.14 deaths per 100 person-years for high risk vs low risk; HR, 3.86; 95% CI, 1.59-9.35; P = .003) high-grade gliomas and 191 glioblastomas (global log-rank P = .002; 83.23 vs 34.16 deaths per 100 person-years for high risk vs low risk; HR, 2.27; 95% CI, 1.44-3.58; P < .001). Conclusions The alteration of multiple networking genes by recurrent chromosomal aberrations in gliomas deregulates critical signaling pathways through multiple, cooperative mechanisms. These mutations, which are likely due to nonrandom selection of a distinct genetic landscape during gliomagenesis, are associated with patient prognosis.

184 citations


Journal ArticleDOI
TL;DR: A large-scale analysis of disease-associated experiments obtained from NCBI GEO finds evidence for a general pathophysiological concordance between experiments measuring the same disease condition, and shows that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases.
Abstract: Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.

122 citations


Journal ArticleDOI
TL;DR: Although further validation in prospective and larger cohorts is needed, the observations demonstrate that multiplex characterization of autoantibodies and cytokines provides clinical utility for predicting response to the anti-TNF therapy etanercept in RA patients.
Abstract: Anti-TNF therapies have revolutionized the treatment of rheumatoid arthritis (RA), a common systemic autoimmune disease involving destruction of the synovial joints. However, in the practice of rheumatology approximately one-third of patients demonstrate no clinical improvement in response to treatment with anti-TNF therapies, while another third demonstrate a partial response, and one-third an excellent and sustained response. Since no clinical or laboratory tests are available to predict response to anti-TNF therapies, great need exists for predictive biomarkers. Here we present a multi-step proteomics approach using arthritis antigen arrays, a multiplex cytokine assay, and conventional ELISA, with the objective to identify a biomarker signature in three ethnically diverse cohorts of RA patients treated with the anti-TNF therapy etanercept. We identified a 24-biomarker signature that enabled prediction of a positive clinical response to etanercept in all three cohorts (positive predictive values 58 to 72%; negative predictive values 63 to 78%). We identified a multi-parameter protein biomarker that enables pretreatment classification and prediction of etanercept responders, and tested this biomarker using three independent cohorts of RA patients. Although further validation in prospective and larger cohorts is needed, our observations demonstrate that multiplex characterization of autoantibodies and cytokines provides clinical utility for predicting response to the anti-TNF therapy etanercept in RA patients.

119 citations


Journal ArticleDOI
TL;DR: A simple method is proposed for estimating the downward bias of the minimum value of the cross-validation error as an estimate of the test error at that same value of the tuning parameter.
Abstract: Tuning parameters in supervised learning problems are often estimated by cross-validation. The minimum value of the cross-validation error can be biased downward as an estimate of the test error at that same value of the tuning parameter. We propose a simple method for the estimation of this bias that uses information from the cross-validation process. As a result, it requires essentially no additional computation. We apply our bias estimate to a number of popular classifiers in various settings, and examine its performance.
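
A sketch of the kind of bias estimate described, assuming the per-fold validation-error curves from an ordinary K-fold CV run have been stored (so no extra fitting is needed): compare each fold's error at the overall best tuning value with its error at its own best value. The array layout and function name are assumptions for illustration.

```python
import numpy as np

def cv_with_bias_estimate(errs):
    """errs: K-by-T array, errs[k, t] = validation error of fold k at
    tuning value t. Returns the minimizing index, the (downward-biased)
    minimum CV error, and a bias estimate computed from the same folds:
    the mean over k of errs[k, t_hat] - min_t errs[k, t]."""
    cv_curve = errs.mean(axis=0)
    t_hat = int(np.argmin(cv_curve))
    per_fold_best = errs.min(axis=1)       # each fold's own minimum
    bias = np.mean(errs[:, t_hat] - per_fold_best)
    return t_hat, cv_curve[t_hat], bias
```

A corrected test-error estimate at the chosen tuning value is then the minimum CV error plus the estimated bias.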

103 citations


Journal ArticleDOI
TL;DR: Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
Abstract: Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and nonsingular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.
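
A minimal EM-type imputation sketch under a plain multivariate normal model (the non-transposable case), with a ridge term standing in for the paper's additive penalty on the inverse covariance so the estimate stays nonsingular when p is large. Exact EM would also accumulate a covariance-correction term for the imputed entries; that is omitted here for brevity, and each row is assumed to have at least one observed value. Names (`em_impute`, `lam`) are illustrative.

```python
import numpy as np

def em_impute(X, lam=0.1, n_iter=50):
    """EM-type imputation: NaNs mark missing entries. Alternates between
    (E) filling missing entries with their conditional means given the
    observed coordinates, and (M) re-estimating the mean and a
    ridge-regularized covariance."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])   # initial fill
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False) + lam * np.eye(X.shape[1])
        for i in np.unique(np.where(miss)[0]):
            m = miss[i]
            S_oo = S[np.ix_(~m, ~m)]
            S_mo = S[np.ix_(m, ~m)]
            # conditional mean of missing given observed coordinates
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, ~m] - mu[~m])
    return X
```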

Journal ArticleDOI
TL;DR: The procedure is called Cox univariate shrinkage and has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics.
Abstract: We propose a method for prediction in Cox's proportional model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.
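
The "essentially univariate" operation can be illustrated by computing each feature's Cox partial-likelihood score statistic at beta = 0 and ranking features by its absolute value; features would then enter the model in that order. This sketch ignores tied event times and is not the paper's exact estimator.

```python
import numpy as np

def cox_score_stats(X, time, event):
    """Per-feature Cox score statistics at beta = 0. For each death,
    accumulate the difference between the subject's feature values and
    the risk-set mean (score), and the risk-set variance (information).
    X: n-by-p features; time: follow-up times; event: 1 = death."""
    order = np.argsort(time)
    X, time, event = X[order], time[order], event[order]
    n, p = X.shape
    U = np.zeros(p)
    V = np.zeros(p)
    for i in range(n):
        if event[i]:
            risk = X[i:]                    # risk set: still under follow-up
            xbar = risk.mean(axis=0)
            U += X[i] - xbar
            V += ((risk - xbar) ** 2).mean(axis=0)
    return U / np.sqrt(np.maximum(V, 1e-12))   # rank features by |score|
```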

Journal ArticleDOI
26 Nov 2009-Blood
TL;DR: These data demonstrated gene expression profiles (GEPs) that distinguish MDS patients from healthy controls and separate patients with differing clinical outcomes (progression to tMDS vs disease that remained stable), refining prognostic categorization alongside cytogenetic and molecular criteria and highlighting associated biologic processes in MDS.

Book ChapterDOI
01 Jan 2009
TL;DR: This chapter begins the discussion of some specific methods for supervised learning by describing five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.
Abstract: In this chapter we begin our discussion of some specific methods for supervised learning. These techniques each assume a (different) structured form for the unknown regression function, and by doing so they finesse the curse of dimensionality. Of course, they pay the possible price of misspecifying the model, and so in each case there is a tradeoff that has to be made. They take off where Chapters 3–6 left off. We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.

Journal ArticleDOI
TL;DR: A strategy that incorporates a stable isotope (18)O-labeled "universal" reference sample as a comprehensive set of internal standards for analyzing large sample sets quantitatively is described and demonstrated by application to a set of 18 plasma samples from severe burn patients.
Abstract: The quantitative comparison of protein abundances across a large number of biological or patient samples represents an important proteomics challenge that needs to be addressed for proteomics discovery applications. Herein, we describe a strategy that incorporates a stable isotope (18)O-labeled "universal" reference sample as a comprehensive set of internal standards for analyzing large sample sets quantitatively. As a pooled sample, the (18)O-labeled "universal" reference sample is spiked into each individually processed unlabeled biological sample and the peptide/protein abundances are quantified based on (16)O/(18)O isotopic peptide pair abundance ratios that compare each unlabeled sample to the identical reference sample. This approach also allows for the direct application of label-free quantitation across the sample set simultaneously along with the labeling-approach (i.e., dual-quantitation) since each biological sample is unlabeled except for the labeled reference sample that is used as internal standards. The effectiveness of this approach for large-scale quantitative proteomics is demonstrated by its application to a set of 18 plasma samples from severe burn patients. When immunoaffinity depletion and cysteinyl-peptide enrichment-based fractionation with high resolution LC-MS measurements were combined, a total of 312 plasma proteins were confidently identified and quantified with a minimum of two unique peptides per protein. The isotope labeling data was directly compared with the label-free (16)O-MS intensity data extracted from the same data sets. The results showed that the (18)O reference-based labeling approach had significantly better quantitative precision compared to the label-free approach. The relative abundance differences determined by the two approaches also displayed strong correlation, illustrating the complementary nature of the two quantitative methods. The simplicity of including the (18)O-reference for accurate quantitation makes this strategy especially attractive when a large number of biological samples are involved in a study where label-free quantitation may be problematic, for example, due to issues associated with instrument platform robustness. The approach will also be useful for more effectively discovering subtle abundance changes in broad systems biology studies.

Journal ArticleDOI
TL;DR: The isothiocyanate sulforaphane, derived from cruciferous vegetables like broccoli, potently induces surrogate markers of phase 2 enzyme activity in prostate cells in vitro and in vivo; comprehensive transcriptome analysis using cDNA microarrays was carried out to characterize its temporal effects on gene expression.
Abstract: BACKGROUND Prostate cancer is thought to arise as a result of oxidative stresses and induction of antioxidant electrophile defense (phase 2) enzymes has been proposed as a prostate cancer prevention strategy. The isothiocyanate sulforaphane, derived from cruciferous vegetables like broccoli, potently induces surrogate markers of phase 2 enzyme activity in prostate cells in vitro and in vivo. To better understand the temporal effects of sulforaphane and broccoli sprouts on gene expression in prostate cells, we carried out comprehensive transcriptome analysis using cDNA microarrays. METHODS Transcripts significantly modulated by sulforaphane over time were identified using StepMiner analysis. Ingenuity Pathway Analysis (IPA) was used to identify biological pathways, networks, and functions significantly altered by sulforaphane treatment. RESULTS StepMiner and IPA revealed significant changes in many transcripts associated with cell growth and cell cycle, as well as a significant number associated with cellular response to oxidative damage and stress. Comparison to an existing dataset suggested that sulforaphane blocked cell growth by inducing G2/M arrest. Cell growth assays and flow cytometry analysis confirmed that sulforaphane inhibited cell growth and induced cell cycle arrest. CONCLUSIONS Our data suggest that in prostate cells sulforaphane primarily induces cellular defenses and inhibits cell growth by causing G2/M phase arrest. Furthermore, based on the striking similarities in the gene expression patterns induced across experiments in these cells, sulforaphane appears to be the primary bioactive compound present in broccoli sprouts, suggesting that broccoli sprouts can serve as a suitable source for sulforaphane in intervention trials.

Book ChapterDOI
01 Jan 2009
TL;DR: In this article, the authors describe generalizations of linear decision boundaries for classification, including flexible discriminant analysis which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines.
Abstract: In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalize Fisher’s linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines, penalized discriminant analysis for problems such as signal and image classification where the large number of features are highly correlated, and mixture discriminant analysis for irregularly shaped classes.

Book ChapterDOI
01 Jan 2009
TL;DR: For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or by minimizing cross-entropy for classification; both of these minimizations are instances of the maximum likelihood approach to fitting.
Abstract: For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or by minimizing cross-entropy for classification. In fact, both of these minimizations are instances of the maximum likelihood approach to fitting.
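
A tiny numerical check of the claim: under an additive Gaussian error model, the log-likelihood of a slope beta is a constant minus the residual sum of squares over 2*sigma^2, so the maximizer coincides with the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 2.0 * x + rng.standard_normal(200)

# Least-squares slope (minimizes the sum of squares)
beta_ls = (x @ y) / (x @ x)

# Gaussian log-likelihood over a grid of slopes: since
# log N(y | beta*x, sigma^2) = const - (y - beta*x)^2 / (2 sigma^2),
# the maximizer is the least-squares estimate.
grid = np.linspace(1.0, 3.0, 2001)
loglik = [-np.sum((y - b * x) ** 2) for b in grid]
beta_ml = grid[int(np.argmax(loglik))]

assert abs(beta_ls - beta_ml) < 1e-3
```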

Journal ArticleDOI
25 Jun 2009-PLOS ONE
TL;DR: Findings suggest that critical biological features of lethal ccRCC include loss of normal cortical differentiation and activation of programs associated with wound healing.
Abstract: Clear cell renal cell carcinoma (ccRCC) is the most common malignancy of the adult kidney and displays heterogeneity in clinical outcomes. Through comprehensive gene expression profiling, we have identified previously a set of transcripts that predict survival following nephrectomy independent of tumor stage, grade, and performance status. These transcripts, designated as the SPC (supervised principal components) gene set, show no apparent biological or genetic features that provide insight into renal carcinogenesis or tumor progression. We explored the relationship of this gene list to a set of genes expressed in different anatomical segments of the normal kidney including the cortex (cortex gene set) and the glomerulus (glomerulus gene set), and a gene set expressed after serum stimulation of quiescent fibroblasts (the core serum response or CSR gene set). Interestingly, the normal cortex, glomerulus (part of the normal renal cortex), and CSR gene sets captured more than 1/5 of the genes in the highly prognostic SPC gene set. Based on gene expression patterns alone, the SPC gene set could be used to sort samples from normal adult kidneys by the anatomical regions from which they were dissected. Tumors whose gene expression profiles most resembled the normal renal cortex or glomerulus showed better survival than those that did not, and those with expression features more similar to CSR showed poorer survival. While the cortex, glomerulus, and CSR signatures predicted survival independent of traditional clinical parameters, they were not independent of the SPC gene list. Our findings suggest that critical biological features of lethal ccRCC include loss of normal cortical differentiation and activation of programs associated with wound healing.

Book ChapterDOI
01 Jan 2009
TL;DR: Because they are highly unstructured, they typically aren’t useful for understanding the nature of the relationship between the features and class outcome, but as black box prediction engines, they can be very effective, and are often among the best performers in real data problems.
Abstract: In this chapter we discuss some simple and essentially model-free methods for classification and pattern recognition. Because they are highly unstructured, they typically aren’t useful for understanding the nature of the relationship between the features and class outcome. However, as black box prediction engines, they can be very effective, and are often among the best performers in real data problems. The nearest-neighbor technique can also be used in regression; this was touched on in Chapter 2 and works reasonably well for low-dimensional problems. However, with high-dimensional features, the bias-variance tradeoff does not work as favorably for nearest-neighbor regression as it does for classification.
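
The "model-free" flavor is visible in how little code 1-nearest-neighbor classification requires: there is no training step at all, just distances at prediction time (a brute-force sketch, quadratic in memory for large inputs).

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbor classification: each test point receives the
    label of its closest training point in Euclidean distance."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[np.argmin(d2, axis=1)]
```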

Posted Content
TL;DR: This work introduces a new nearest-prototype classifier, the prototype vector machine (PVM), which arises from a combinatorial optimization problem cast as a variant of the set cover problem; two algorithms are proposed for approximating its solution.
Abstract: We introduce a new nearest-prototype classifier, the prototype vector machine (PVM). It arises from a combinatorial optimization problem which we cast as a variant of the set cover problem. We propose two algorithms for approximating its solution. The PVM selects a relatively small number of representative points which can then be used for classification. It contains 1-NN as a special case. The method is compatible with any dissimilarity measure, making it amenable to situations in which the data are not embedded in an underlying feature space or in which using a non-Euclidean metric is desirable. Indeed, we demonstrate on the much studied ZIP code data how the PVM can reap the benefits of a problem-specific metric. In this example, the PVM outperforms the highly successful 1-NN with tangent distance, and does so retaining fewer than half of the data points. This example highlights the strengths of the PVM in yielding a low-error, highly interpretable model. Additionally, we apply the PVM to a protein classification problem in which a kernel-based distance is used.
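
The set-cover connection can be sketched with a simple greedy heuristic: treat every training point as a candidate prototype that "covers" same-class points within a radius, and repeatedly add the candidate covering the most uncovered points; the greedy rule carries the classic logarithmic approximation guarantee for set cover. The real PVM formulation and its two algorithms are more refined; the radius handling and names here are illustrative.

```python
import numpy as np

def greedy_prototypes(X, y, radius):
    """Greedy approximation to a set-cover view of prototype selection.
    covers[i, j] is True when candidate i and point j share a class and
    lie within `radius`; greedily pick the candidate with largest gain."""
    n = len(y)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # O(n^2) memory
    covers = (d <= radius) & (y[:, None] == y[None, :])
    uncovered = np.ones(n, dtype=bool)
    prototypes = []
    while uncovered.any():
        gains = (covers & uncovered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                      # remaining points are uncoverable
        prototypes.append(best)
        uncovered &= ~covers[best]
    return np.array(prototypes)        # classify new points by 1-NN to these
```

Because only pairwise distances enter, any problem-specific dissimilarity matrix could be substituted for `d`, mirroring the paper's point about non-Euclidean metrics.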

Journal ArticleDOI
TL;DR: The local false discovery rate (LFDR) estimates the probability of falsely identifying specific genes with changes in expression and reveals differences and similarities among experiments, which complements functional assessments like gene set enrichment analysis.
Abstract: The local false discovery rate (LFDR) estimates the probability of falsely identifying specific genes with changes in expression. In computer simulations, LFDR < 10% identified genes with changes in expression, while LFDR > 90% identified genes without changes. We used LFDR to compare different microarray experiments quantitatively: (i) Venn diagrams of genes with and without changes in expression, (ii) scatter plots of the genes, (iii) correlation coefficients in the scatter plots and (iv) distributions of gene function. To illustrate, we compared three methods for pre-processing microarray data. Correlations between methods were high (r = 0.84–0.92). However, responses were often different in magnitude, and sometimes discordant, even though the methods used the same raw data. LFDR complements functional assessments like gene set enrichment analysis. To illustrate, we compared responses to ultraviolet radiation (UV), ionizing radiation (IR) and tobacco smoke. Compared to unresponsive genes, genes responsive to both UV and IR were enriched for cell cycle, mitosis, and DNA repair functions. Genes responsive to UV but not IR were depleted for cell adhesion functions. Genes responsive to tobacco smoke were enriched for detoxification functions. Thus, LFDR reveals differences and similarities among experiments.
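
In the standard two-groups formulation, the LFDR at a test statistic z is pi0 * f0(z) / f(z): the null density over the mixture density. The sketch below estimates f with a kernel density and takes f0 as standard normal with a conservative pi0 = 1; Efron's implementation instead fits f by Poisson regression on binned counts and can estimate an empirical null, so this is a simplified stand-in.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def local_fdr(z, pi0=1.0):
    """Two-groups local FDR estimate at each z-value:
    lfdr(z) = pi0 * f0(z) / f(z), with f0 = N(0, 1) and the mixture
    density f estimated by a Gaussian kernel density."""
    f = gaussian_kde(z)(z)
    f0 = norm.pdf(z)
    return np.clip(pi0 * f0 / np.maximum(f, 1e-12), 0.0, 1.0)
```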

Posted Content
TL;DR: Using the nuclear norm as a regularizer, convex relaxation techniques are used to provide a sequence of solutions to the matrix completion problem; the algorithm iteratively replaces the missing elements with those obtained from a thresholded SVD.
Abstract: We use convex relaxation techniques to provide a sequence of solutions to the matrix completion problem. Using the nuclear norm as a regularizer, we provide simple and very efficient algorithms for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm iteratively replaces the missing elements with those obtained from a thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions. 1 Introduction In many applications measured data can be represented in a matrix $X_{m \times n}$, for which only a relatively small number of entries are observed. The problem is to “complete” the matrix based on the observed entries, and has been dubbed the matrix completion problem [CCS08, CR08, RFP07, CT09, KOM09]. The “Netflix” competition is a primary example, where the data is the basis for a recommender system. The rows correspond to viewers and the columns to movies, with the entry $X_{ij}$ being the rating $\in \{1,\dots,5\}$ by viewer $i$ for movie $j$. There are 480K viewers and 18K movies, and hence 8.6 billion ($8.6 \times 10^9$) potential entries.
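
The iteration described in the abstract ("replace the missing elements with those obtained from a thresholded SVD") can be sketched directly. Here `lam` is the nuclear-norm regularization level; a regularization path would be computed by running this over a decreasing grid of `lam`, warm-starting each solve from the previous one. A dense SVD is used for clarity, whereas the paper's efficiency comes from exploiting sparse-plus-low-rank structure.

```python
import numpy as np

def soft_impute(X, lam, n_iter=100):
    """Soft-thresholded-SVD matrix completion. NaNs mark unobserved
    entries. Each pass: SVD the current filled-in matrix, shrink the
    singular values by lam, then restore the observed entries."""
    obs = ~np.isnan(X)
    Z = np.where(obs, X, 0.0)                    # initial fill with zeros
    low_rank = Z
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)             # soft-threshold singular values
        low_rank = (U * s) @ Vt
        Z = np.where(obs, X, low_rank)           # keep observed entries fixed
    return low_rank
```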

Journal ArticleDOI
TL;DR: The goal was to develop a method of selecting genes that are differentially expressed in patients who either improved or experienced organ failure; a test is proposed for the association between longitudinal gene expression and the time to the occurrence of ordered categorical outcomes indicating recovery, stable disease, and organ failure.
Abstract: The NIH project 'Inflammatory and Host Response to Injury' (Glue) is being conducted to study the changes in the body over time in response to trauma and burn. Patients are monitored for changes in their clinical status, such as the onset of and recovery from organ failure. Blood samples are drawn over the first days and weeks after the injury to obtain gene expression levels over time. Our goal was to develop a method of selecting genes that are differentially expressed in patients who either improved or experienced organ failure. For this, we needed a test for the association between longitudinal gene expressions and the time to the occurrence of ordered categorical outcomes indicating recovery, stable disease, and organ failure. We propose a test in which the relationship between the gene expression and the events is modeled using the cumulative proportional odds model, a generalization of the pooling repeated observation method. Given the high dimensionality of the microarray data, it was necessary to control for the multiplicity of the testing. To control for the false discovery rate (FDR), we applied both a permutational approach as well as Efron's empirical estimation method. We explore our method through simulations and provide the analysis of the multi-center, longitudinal study of immune response to inflammation and trauma (http://www.gluegrant.org).

Journal ArticleDOI
20 Nov 2009-Blood
TL;DR: Specific miRNAs differentially expressed across DLBCL cell lines were evaluated as prognostic biomarkers for predicting the outcome of DLBCL patients; all factors except miR-222 were independent predictors of OS, and all factors except miR-18a were predicted to play an important role in oncogenesis.