Showing papers by "Robert Tibshirani" published in 2014


Other
22 Apr 2014
TL;DR: The generalized additive model (GAM) generalizes the generalized linear model by replacing the linear predictor with a sum of smooth functions, which are estimated with a scatterplot smoother in an iterative procedure called the local scoring algorithm.
Abstract: Likelihood-based regression models such as the normal linear regression model and the linear logistic model, assume a linear (or some other parametric) form for the covariates $X_1, X_2, \cdots, X_p$. We introduce the class of generalized additive models which replaces the linear form $\sum \beta_jX_j$ by a sum of smooth functions $\sum s_j(X_j)$. The $s_j(\cdot)$'s are unspecified functions that are estimated using a scatterplot smoother, in an iterative procedure we call the local scoring algorithm. The technique is applicable to any likelihood-based regression model: the class of generalized linear models contains many of these. In this class the linear predictor $\eta = \Sigma \beta_jX_j$ is replaced by the additive predictor $\Sigma s_j(X_j)$; hence, the name generalized additive models. We illustrate the technique with binary response and survival data. In both cases, the method proves to be useful in uncovering nonlinear covariate effects. It has the advantage of being completely automatic, i.e., no "detective work" is needed on the part of the statistician. As a theoretical underpinning, the technique is viewed as an empirical method of maximizing the expected log likelihood, or equivalently, of minimizing the Kullback-Leibler distance to the true model.
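
As a rough illustration of the additive fit described in this abstract, the following Python sketch runs backfitting for a Gaussian response, which is the kind of inner loop that local scoring repeats on a working response. The running-mean smoother, the function names, and the window size `k` are assumptions chosen for brevity; the abstract does not fix a particular scatterplot smoother, and the likelihood-based outer loop is not shown.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth the values r against x with a k-nearest-neighbour running mean."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices sorted by x
    fitted = [0.0] * n
    half = k // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(n, pos + half + 1)
        window = [r[order[j]] for j in range(lo, hi)]
        fitted[i] = sum(window) / len(window)
    return fitted


def backfit(X, y, n_iter=20, k=5):
    """Fit y ~ alpha + sum_j s_j(X_j) by cycling over the covariates.

    X is a list of rows; returns the intercept and one list of fitted
    smooth values per covariate (each centred to sum to zero).
    """
    n, p = len(y), len(X[0])
    alpha = sum(y) / n
    s = [[0.0] * n for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the intercept and all other smooth terms.
            partial = [
                y[i] - alpha - sum(s[m][i] for m in range(p) if m != j)
                for i in range(n)
            ]
            xj = [row[j] for row in X]
            sj = running_mean_smoother(xj, partial, k)
            mean_sj = sum(sj) / n
            s[j] = [v - mean_sj for v in sj]  # centre for identifiability
    return alpha, s
```

Calling `backfit(X, y)` returns the estimated intercept and the fitted values of each smooth term at the observed points; plotting `s[j]` against the j-th covariate is one way to inspect a nonlinear covariate effect of the kind the abstract mentions.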

5,700 citations


Journal Article
TL;DR: Potential solutions for problems related to the research workforce are proposed, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.

1,169 citations


Journal Article
TL;DR: A systems analysis of the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine and a large number of immune system components finds a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.
Abstract: Females have generally more robust immune responses than males for reasons that are not well-understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.

518 citations


Journal Article
TL;DR: In this paper, the covariance test statistic is proposed to test the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path.
Abstract: In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
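
To make the first-step case concrete, here is a minimal Python sketch under two simplifying assumptions that are not part of the abstract: the columns of X are orthonormal, so the lasso knots are just the sorted absolute inner products of each column with y, and the noise level `sigma` is known. In that setting the statistic for the first variable to enter reduces to the largest knot times its gap to the second knot, divided by the noise variance, and the Exp(1) reference distribution gives the p-value. The function name is hypothetical.

```python
import math


def first_step_covariance_test(X, y, sigma):
    """Covariance test for the first variable to enter the lasso path.

    Assumes orthonormal columns in X (list of rows) and known sigma.
    Returns the statistic T1 and its p-value under the Exp(1) null.
    """
    n, p = len(X), len(X[0])
    # With orthonormal columns, variable j enters the path at |x_j^T y|.
    inner = [abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p)]
    knots = sorted(inner, reverse=True)
    lam1, lam2 = knots[0], knots[1]
    t1 = lam1 * (lam1 - lam2) / sigma ** 2
    p_value = math.exp(-t1)  # survival function of Exp(1)
    return t1, p_value
```

A large gap between the first two knots relative to the noise variance gives a large statistic and a small p-value; for a general X the second knot has to come from the lasso path itself rather than from sorted inner products.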

425 citations


Journal Article
TL;DR: A data-driven method termed Citrus is presented that identifies cell subsets associated with an experimental endpoint of interest and is demonstrated through the systematic identification of blood cells that signal in response to experimental stimuli and T-cell subsets whose abundance is predictive of AIDS-free survival risk in patients with HIV.
Abstract: Elucidation and examination of cellular subpopulations that display condition-specific behavior can play a critical contributory role in understanding disease mechanism, as well as provide a focal point for development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advancements in single-cell measurement technologies, the identification of relevant cell subsets through manual efforts remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the limited scalability and the subjectivity inherent in manual analyses slow both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to aid investigators in the performance of unbiased, and potentially more thorough, correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.

419 citations


Journal Article
TL;DR: In this article, a simple method for modeling interactions between the treatment and covariates is proposed, where the idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects.
Abstract: We consider a setting in which we have a treatment and a potentially large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings. It can be useful for practicing personalized medicine: determining from a large set of biomarkers, the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and real trial data. The modified covariates idea can be used for other purposes, for example, large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable. Supplementary materials for this articl...
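
The modified-covariate idea can be sketched for a single covariate as follows. This is an illustrative simplification, not the authors' full procedure: it assumes a treatment coded as +1 or -1, multiplies the covariate by half the treatment indicator, and regresses the outcome on that modified covariate by least squares with no intercept and no main effect. The efficiency augmentation mentioned in the abstract is not included, and the function name is hypothetical.

```python
def modified_covariate_fit(x, t, y):
    """Least-squares slope of y on the modified covariate z = x * t / 2.

    x: covariate values, t: treatment coded +1 / -1, y: outcomes.
    No intercept and no main effect of x are fitted.
    """
    z = [xi * ti / 2.0 for xi, ti in zip(x, t)]
    szy = sum(zi * yi for zi, yi in zip(z, y))
    szz = sum(zi * zi for zi in z)
    return szy / szz


if __name__ == "__main__":
    # Balanced toy design: each covariate value appears once per arm.
    x = [1.0, 1.0, 2.0, 2.0, -1.0, -1.0, -3.0, -3.0]
    t = [1, -1, 1, -1, 1, -1, 1, -1]
    # Outcome with a main effect of x and a treatment interaction of size 2.
    y = [1.0 + 0.5 * xi + 2.0 * xi * ti / 2.0 for xi, ti in zip(x, t)]
    print(modified_covariate_fit(x, t, y))
```

In this balanced toy design the main effect of the covariate is orthogonal to the modified covariate, so the fitted slope recovers the interaction coefficient even though the main effect is never modelled.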

327 citations


Journal Article
TL;DR: Desorption electrospray ionization mass spectrometric imaging (DESI-MSI) and the statistical method of least absolute shrinkage and selection operator (Lasso) are used to classify tissue as cancer or normal based on molecular information obtained from tissue and also to select those mass-spectra features most indicative of disease state.
Abstract: Surgical resection is the main curative option for gastrointestinal cancers. The extent of cancer resection is commonly assessed during surgery by pathologic evaluation of (frozen sections of) the tissue at the resected specimen margin(s) to verify whether cancer is present. We compare this method to an alternative procedure, desorption electrospray ionization mass spectrometric imaging (DESI-MSI), for 62 banked human cancerous and normal gastric-tissue samples. In DESI-MSI, microdroplets strike the tissue sample, the resulting splash enters a mass spectrometer, and a statistical analysis, here, the Lasso method (which stands for least absolute shrinkage and selection operator and which is a multiclass logistic regression with L1 penalty), is applied to classify tissues based on the molecular information obtained directly from DESI-MSI. The methodology developed with 28 frozen training samples of clear histopathologic diagnosis showed an overall accuracy value of 98% for the 12,480 pixels evaluated in cross-validation (CV), and 97% when a completely independent set of samples was tested. By applying an additional spatial smoothing technique, the accuracy for both CV and the independent set of samples was 99% compared with histological diagnoses. To test our method for clinical use, we applied it to a total of 21 tissue-margin samples prospectively obtained from nine gastric-cancer patients. The results obtained suggest that DESI-MSI/Lasso may be valuable for routine intraoperative assessment of the specimen margins during gastric-cancer surgery.

182 citations



Journal Article
TL;DR: The results demonstrate the potential ability of the model to identify those AMD patients at risk of progressing to exudative AMD from an early or intermediate stage.
Abstract: Purpose: We developed a statistical model based on quantitative characteristics of drusen to estimate the likelihood of conversion from early and intermediate age-related macular degeneration (AMD) to its advanced exudative form (AMD progression) in the short term (less than 5 years), a crucial task to enable early intervention and improve outcomes. Methods: Image features of drusen quantifying their number, morphology, and reflectivity properties, as well as the longitudinal evolution in these characteristics, were automatically extracted from 2146 spectral-domain optical coherence tomography (SD-OCT) scans of 330 AMD eyes in 244 patients collected over a period of 5 years, with 36 eyes showing progression during clinical follow-up. We developed and evaluated a statistical model to predict the likelihood of progression at predetermined times using clinical and image features as predictors. Results: Area, volume, height, and reflectivity of drusen were informative features distinguishing between progressing and nonprogressing cases. Discerning progression at follow-up (mean, 6.16 months) resulted in a mean area under the receiver operating characteristic curve (AUC) of 0.74 (95% confidence interval [CI], 0.58, 0.85). The maximum predictive performance was observed at 11 months after a patient's first early AMD diagnosis, with mean AUC 0.92 (95% CI, 0.83, 0.98). Those eyes predicted to progress showed a much higher progression rate than those predicted not to progress at any given time from the initial visit. Conclusions: Our results demonstrate the potential ability of our model to identify those AMD patients at risk of progressing to exudative AMD from an early or intermediate stage.

118 citations


Journal Article
TL;DR: A relationship between the appearance of specific lipid species and the overexpression of MYC in lymphomas is suggested, including many of the lipid species identified as significant for MYC-induced animal lymphoma tissue.
Abstract: Overexpression of the v-myc avian myelocytomatosis viral oncogene homolog (MYC) oncogene is one of the most commonly implicated causes of human tumorigenesis. MYC is known to regulate many aspects of cellular biology including glucose and glutamine metabolism. Little is known about the relationship between MYC and the appearance and disappearance of specific lipid species. We use desorption electrospray ionization mass spectrometry imaging (DESI-MSI), statistical analysis, and conditional transgenic animal models and cell samples to investigate changes in lipid profiles in MYC-induced lymphoma. We have detected a lipid signature distinct from that observed in normal tissue and in rat sarcoma-induced lymphoma cells. We found 104 distinct molecular ions that have an altered abundance in MYC lymphoma compared with normal control tissue by statistical analysis with a false discovery rate of less than 5%. Of these, 86 molecular ions were specifically identified as complex phospholipids. To evaluate whether the lipid signature could also be observed in human tissue, we examined 15 human lymphoma samples with varying expression levels of MYC oncoprotein. Distinct lipid profiles in lymphomas with high and low MYC expression were observed, including many of the lipid species identified as significant for MYC-induced animal lymphoma tissue. Our results suggest a relationship between the appearance of specific lipid species and the overexpression of MYC in lymphomas.

114 citations


Posted Content
TL;DR: In this paper, the authors compare the power of MIC to that of standard Pearson correlation and distance correlation, and find that MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome.
Abstract: The proposal of Reshef et al. (2011) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations. To investigate this, we ran simulations to compare the power of MIC to that of standard Pearson correlation and distance correlation (dcor). We simulated pairs of variables with different relationships (most of which were considered by Reshef et al.), but with varying levels of noise added. To determine proper cutoffs for testing the independence hypothesis, we simulated independent data with the appropriate marginals. As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome.
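
For readers unfamiliar with the two baseline statistics in this comparison, the following Python sketch computes the Pearson correlation and the sample distance correlation (dcor) from their standard definitions. MIC itself is not implemented here, and the power simulation from the abstract is not reproduced; the function names are illustrative only.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def _centered_distances(v):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]  # row means equal column means (symmetry)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two univariate samples."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvarx = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvary = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvarx * dvary)
    if denom == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)
```

On a symmetric parabola such as `x = [-2, -1, 0, 1, 2]` with `y = x**2`, the Pearson correlation is zero while the distance correlation is positive, which is the kind of non-linear dependence these comparisons are designed to probe.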

Journal Article
TL;DR: A multicentre retrospective study to determine prognostic factors and the incidence of central nervous system (CNS) relapses in primary breast diffuse large B‐cell lymphoma found a low stage‐modified International Prognostic Index (IPI) was associated with longer overall survival.
Abstract: Primary breast diffuse large B-cell lymphoma (DLBCL) is a rare subtype of non-Hodgkin lymphoma (NHL) with limited data on pathology and outcome. A multicentre retrospective study was undertaken to determine prognostic factors and the incidence of central nervous system (CNS) relapses. Data was retrospectively collected on patients from 8 US academic centres. Only patients with stage I/II disease (involvement of breast and localized lymph nodes) were included. Histologies apart from primary DLBCL were excluded. Between 1992 and 2012, 76 patients met the eligibility criteria. Most patients (86%) received chemotherapy, and 69% received immunochemotherapy with rituximab; 65% received radiation therapy and 9% received prophylactic CNS chemotherapy. After a median follow-up of 4·5 years (range 0·6-20·6 years), the Kaplan-Meier estimated median progression-free survival was 10·4 years (95% confidence interval [CI] 5·8-14·9 years), and the median overall survival was 14·6 years (95% CI 10·2-19 years). Twelve patients (16%) had CNS relapse. A low stage-modified International Prognostic Index (IPI) was associated with longer overall survival. Rituximab use was not associated with a survival advantage. Primary breast DLBCL has a high rate of CNS relapse. The stage-modified IPI score is associated with survival.

Posted Content
16 Jan 2014
TL;DR: Inference tools for least angle regression and the lasso are derived from the joint distribution of suitably normalized spacings of the LAR algorithm, yielding exact conditional tests at any step of the algorithm as well as "selection intervals" for the appropriate true underlying regression parameter.
Abstract: We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package "selectiveInference", freely available on the CRAN repository, implements the new inference tools described in this paper.
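
The polyhedral scheme in this abstract can be illustrated with a small Python sketch. It assumes the selection event has already been written as a set of linear inequalities A y <= b, that the noise is Gaussian with known standard deviation `sigma` and identity correlation, and that one wants a one-sided test of a linear contrast `eta` of the mean. Conditional on selection, the contrast follows a normal distribution truncated to an interval whose end points depend only on the part of y orthogonal to `eta`. The function names are hypothetical and this is not the interface of the "selectiveInference" package.

```python
import math


def norm_sf(x):
    """Standard normal survival function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def truncation_limits(A, b, y, eta):
    """Interval for eta^T y implied by the event {A y <= b}."""
    eta_norm2 = sum(e * e for e in eta)
    eta_y = sum(e * v for e, v in zip(eta, y))
    c = [e / eta_norm2 for e in eta]
    z = [v - ci * eta_y for v, ci in zip(y, c)]  # part of y orthogonal to eta
    vlo, vup = -math.inf, math.inf
    for row, bj in zip(A, b):
        ac = sum(r * ci for r, ci in zip(row, c))
        az = sum(r * zi for r, zi in zip(row, z))
        if ac < 0:
            vlo = max(vlo, (bj - az) / ac)
        elif ac > 0:
            vup = min(vup, (bj - az) / ac)
        # Rows with ac == 0 do not constrain eta^T y and are skipped.
    return eta_y, vlo, vup


def one_sided_pvalue(A, b, y, eta, sigma):
    """P(eta^T Y >= observed | selection) under H0: eta^T mu = 0."""
    eta_y, vlo, vup = truncation_limits(A, b, y, eta)
    sd = sigma * math.sqrt(sum(e * e for e in eta))
    num = norm_sf(eta_y / sd) - norm_sf(vup / sd)
    den = norm_sf(vlo / sd) - norm_sf(vup / sd)
    return num / den
```

As a toy usage, with `y = [2.0, 0.5]` the event "the first coordinate is the largest in absolute value and positive" is `A = [[-1, 1], [-1, -1]]`, `b = [0, 0]`; testing `eta = [1, 0]` then truncates the first coordinate to values of at least 0.5, so the conditional p-value is larger than the naive one that ignores selection.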

Journal Article
TL;DR: A multicenter, randomized trial comparing a specific vaccine (MyVax), comprising Id chemically coupled to keyhole limpet hemocyanin plus granulocyte macrophage colony-stimulating factor to a control immunotherapy with KLH plus GM-CSF, failed to demonstrate clinical benefit of specific immunotherapy.
Abstract: Purpose: Idiotypes (Ids), the unique portions of tumor immunoglobulins, can serve as targets for passive and active immunotherapies for lymphoma. We performed a multicenter, randomized trial comparing a specific vaccine (MyVax), comprising Id chemically coupled to keyhole limpet hemocyanin (KLH) plus granulocyte macrophage colony-stimulating factor (GM-CSF) to a control immunotherapy with KLH plus GM-CSF. Patients and Methods: Patients with previously untreated advanced-stage follicular lymphoma (FL) received eight cycles of chemotherapy with cyclophosphamide, vincristine, and prednisone. Those achieving sustained partial or complete remission (n = 287 [44%]) were randomly assigned at a ratio of 2:1 to receive one injection per month for 7 months of MyVax or control immunotherapy. Anti-Id antibody responses (humoral immune responses [IRs]) were measured before each immunization. The primary end point was progression-free survival (PFS). Secondary end points included IR and time to subsequent antilymphoma th...

Posted Content
TL;DR: The cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) is applied to the fitting of a conditional logistic regression model with lasso and elastic net penalties and it is found that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection.
Abstract: We apply the cyclic coordinate descent algorithm of Friedman, Hastie and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso ($\ell_1$) and elastic net penalties. The sequential strong rules of Tibshirani et al (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularisation paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.
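
The cyclic update at the heart of this approach can be sketched in Python for the simpler squared-error lasso. This is a deliberate simplification: the conditional logistic likelihood, the elastic net term, warm starts, and the strong rules mentioned in the abstract are all omitted, and the objective used here, (1/2n) * RSS + lam * ||beta||_1, as well as the function names, are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1.

    X is a list of rows; returns the coefficient list after n_sweeps passes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ms[j] == 0:
                continue
            # Inner product of column j with the partial residual,
            # i.e. the residual with the contribution of beta_j added back.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Each coordinate update is a one-dimensional lasso problem with a closed-form solution, which is why the method remains cheap per sweep; for a likelihood such as the conditional logistic one, the same update would be applied to a quadratic approximation of the log-likelihood rather than to the raw residuals.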

Posted Content
TL;DR: In this paper, an exact distribution-based method for hypothesis testing and construction of confidence intervals for signals in a noisy matrix is proposed, which is based on the approach of Taylor, Loftus and Tibshirani for testing the global null.
Abstract: Principal component analysis (PCA) is a well-known tool in multivariate statistics. One significant challenge in using PCA is the choice of the number of components. In order to address this challenge, we propose an exact distribution-based method for hypothesis testing and construction of confidence intervals for signals in a noisy matrix. Assuming Gaussian noise, we use the conditional distribution of the singular values of a Wishart matrix and derive exact hypothesis tests and confidence intervals for the true signals. Our paper is based on the approach of Taylor, Loftus and Tibshirani (2013) for testing the global null: we generalize it to test for any number of principal components, and derive an integrated version with greater power. In simulation studies we find that our proposed methods compare well to existing approaches.

Posted Content
TL;DR: In this paper, the authors propose new inference tools for forward stepwise regression, least angle regression, and the lasso, which can be used to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set.
Abstract: We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package "selectiveInference", freely available on the CRAN repository, implements the new inference tools described in this paper.

Journal Article
TL;DR: In this article, the cyclic coordinate descent algorithm was applied to the fitting of a conditional logistic regression model with lasso and elastic net penalties, and the conditional model performed admirably on datasets drawn from a suitable conditional distribution.
Abstract: We apply the cyclic coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010) to the fitting of a conditional logistic regression model with lasso ($\ell_1$) and elastic net penalties. The sequential strong rules of Tibshirani, Bien, Hastie, Friedman, Taylor, Simon, and Tibshirani (2012) are also used in the algorithm and it is shown that these offer a considerable speed up over the standard coordinate descent algorithm with warm starts. Once implemented, the algorithm is used in simulation studies to compare the variable selection and prediction performance of the conditional logistic regression model against that of its unconditional (standard) counterpart. We find that the conditional model performs admirably on datasets drawn from a suitable conditional distribution, outperforming its unconditional counterpart at variable selection. The conditional model is also fit to a small real world dataset, demonstrating how we obtain regularization paths for the parameters of the model and how we apply cross validation for this method where natural unconditional prediction rules are hard to come by.

Posted Content
16 Jan 2014
TL;DR: In this paper, a general framework for post-selection inference for forward stepwise and least angle regression is presented, which allows to derive conditional hypothesis tests at any step of the regression procedure.
Abstract: In this paper we propose new inference tools for forward stepwise and least angle regression. We first present a general scheme to perform valid inference after any selection event that can be characterized as the observation vector y falling into some polyhedral set. This framework then allows us to derive conditional (post-selection) hypothesis tests at any step of the forward stepwise and least angle regression procedures. We derive an exact null distribution for our proposed test statistics in finite samples, yielding p-values with exact type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. Application of this framework to general likelihood-based regression models (e.g., generalized linear models and the Cox model) is also discussed.

Journal Article
TL;DR: This study finds that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identifies a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage.
Abstract: Background: The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations. Results: To characterize the transcriptional changes of early breast neoplasia, we sequenced 3′- end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirms that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns. Conclusions: This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.

Journal Article
TL;DR: PCNSL is shown to express LMO2, HGAL (also known as GCSAM) and BCL6 proteins in 52%, 65% and 56% of tumours, respectively; BCL6 expression is associated with longer progression-free and overall survival, and LMO2 expression with longer overall survival.
Abstract: Primary central nervous system lymphoma (PCNSL) is an aggressive sub-variant of non-Hodgkin lymphoma (NHL) with morphological similarities to diffuse large B-cell lymphoma (DLBCL). While methotrexate (MTX)-based therapies have improved patient survival, the disease remains incurable in most cases and its pathogenesis is poorly understood. We evaluated 69 cases of PCNSL for the expression of HGAL (also known as GCSAM), LMO2 and BCL6, genes associated with DLBCL prognosis and pathobiology, and analysed their correlation to survival in 49 PCNSL patients receiving MTX-based therapy. We demonstrate that PCNSL expresses LMO2, HGAL (also known as GCSAM) and BCL6 proteins in 52%, 65% and 56% of tumours, respectively. BCL6 protein expression was associated with longer progression-free survival (P = 0·006) and overall survival (OS, P = 0·05), while expression of LMO2 protein was associated with longer OS (P = 0·027). Further research is needed to elucidate the function of BCL6 and LMO2 in PCNSL.

Posted Content
TL;DR: An upper bound to the so‐called “worst case risk” of the estimator is proved and it is shown that it is within a constant multiple of the minimax risk over a rich set of parameter spaces meant to evoke sparsity.
Abstract: We tackle the problem of the estimation of a vector of means from a single vector-valued observation $y$. Whereas previous work reduces the size of the estimates for the largest (absolute) sample elements via shrinkage (like James-Stein) or biases estimated via empirical Bayes methodology, we take a novel approach. We adapt recent developments by Lee et al (2013) in post selection inference for the Lasso to the orthogonal setting, where sample elements have different underlying signal sizes. This is exactly the setup encountered when estimating many means. It is shown that other selection procedures, like selecting the $K$ largest (absolute) sample elements and the Benjamini-Hochberg procedure, can be cast into their framework, allowing us to leverage their results. Point and interval estimates for signal sizes are proposed. These seem to perform quite well against competitors, both recent and more tenured. Furthermore, we prove an upper bound to the worst case risk of our estimator, when combined with the Benjamini-Hochberg procedure, and show that it is within a constant multiple of the minimax risk over a rich set of parameter spaces meant to evoke sparsity.

Journal ArticleDOI
TL;DR: In this article, an order-constrained version of L1-regularized regression is proposed for time-lagged regression, where it is natural to impose an order constraint on the coefficients.
Abstract: We consider regression scenarios where it is natural to impose an order constraint on the coefficients. We propose an order-constrained version of L1-regularized regression for this problem, and show how to solve it efficiently using the well-known Pool Adjacent Violators Algorithm as its proximal operator. The main application of this idea is time-lagged regression, where we predict an outcome at time t from features at the previous K time points. In this setting it is natural to assume that the coefficients decay as we move farther away from t, and hence the order constraint is reasonable. Potential applications include financial time series and prediction of dynamic patient outcomes based on clinical measurements. We illustrate this idea on real and simulated data.
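The Pool Adjacent Violators Algorithm mentioned above can be sketched in a few lines. This is a minimal version that projects a sequence onto the set of non-increasing sequences in the least-squares sense. It is not the authors' full proximal operator, which also has to account for the L1 penalty; the function name `pava_nonincreasing` and the optional weights `w` are assumptions made for this example.

```python
def pava_nonincreasing(y, w=None):
    """Weighted least-squares fit of a non-increasing sequence to y."""
    n = len(y)
    if w is None:
        w = [1.0] * n
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool adjacent blocks while an earlier mean is smaller than a later one,
        # since that violates the non-increasing order constraint.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    # Expand the pooled blocks back to one fitted value per input point.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


fitted = pava_nonincreasing([3, 1, 2, 0])
```

In the time-lagged setting, the input would be a vector of lag coefficients, and the fitted values are the closest coefficients that decay with distance from t.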

Posted Content
TL;DR: In this article, a convex sparse supervised canonical correlation analysis (sparse sCCA) is proposed for sparse mCCA when one of the data sets is a vector.
Abstract: We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem, and illustrate their use on simulated and real data.

Journal ArticleDOI
06 Dec 2014-Blood
TL;DR: In response to in situ vaccination, all patients made tumor-specific immune responses within 2 to 4 weeks post-vaccination with the most informative markers being the activation marker CD278 (ICOS) for CD4 T cell response among the CD45RO+ memory subset, and perforin and granzyme B for CD8 T cell responses.

Journal ArticleDOI
TL;DR: The significance test for the lasso was first proposed by Lockhart, Taylor, and Tibshirani as discussed by the authors, who later extended it to a significance test based on the Lasso significance test.
Abstract: Rejoinder of "A significance test for the lasso" by Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].

01 Jan 2014
TL;DR: In this paper, gene expression patterns within early breast neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage.
Abstract: Background: The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations. Results: To characterize the transcriptional changes of early breast neoplasia, we sequenced 3′-end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirm that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns. Conclusions: This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.

Journal ArticleDOI
TL;DR: In this article, the authors extended the work of the original paper to include adaptive linear models and provided a "spacing" test of the global null hypothesis, $\beta^* = 0$, which takes the form
Abstract: We would like to thank the Editors and referees for their considerable efforts that improved our paper, and all of the discussants for their feedback, and their thoughtful and stimulating comments. Linear models are central in applied statistics, and inference for adaptive linear modeling is an important active area of research. Our paper is clearly not the last word on the subject. Several of the discussants introduce novel proposals for this problem; in fact, many of the discussions are interesting "mini-papers" on their own, and we will not attempt to reply to all of the points that they raise. Our hope is that our paper and the excellent accompanying discussions will serve as a helpful resource for researchers interested in this topic. Since the writing of our original paper, we have (with many of our graduate students) extended the work considerably. Before responding to the discussants, we will first summarize this new work because it will be relevant to our responses. • As mentioned in the last section of the paper, we have derived a "spacing" test of the global null hypothesis, $\beta^* = 0$, which takes the form
Journal ArticleDOI
TL;DR: This paper presents a method based on semidefinite programming for automatically quantifying the bias of missing value imputation via conditional expectation by computing the range of possible equal-likelihood inferred values for convex functions of the covariance matrix.
Abstract: In some multivariate problems with missing data, pairs of variables exist that are never observed together. For example, some modern biological tools can produce data of this form. As a result of this structure, the covariance matrix is only partially identifiable, and point estimation requires that identifying assumptions be made. These assumptions can introduce an unknown and potentially large bias into the inference. This paper presents a method based on semidefinite programming for automatically quantifying this potential bias by computing the range of possible equal-likelihood inferred values for convex functions of the covariance matrix. We focus on the bias of missing value imputation via conditional expectation and show that our method can give an accurate assessment of the true error in cases where estimates based on sampling uncertainty alone are overly optimistic.