Journal ArticleDOI

A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.

TL;DR: The expanded CMap is reported, made possible by a new, low-cost, high-throughput reduced representation expression profiling method that is shown to be highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts.
About: This article was published in Cell on 2017-11-30 and is currently open access. It has received 1,943 citations to date.
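The reduced-representation idea behind L1000 is to measure roughly 1,000 landmark transcripts and computationally infer the expression of most non-measured transcripts. As a rough illustration of that inference step only (not the CMap group's actual pipeline, which the paper describes), the sketch below imputes non-measured genes from landmark genes with a multi-output linear regression; all data, sizes, and variable names are synthetic and hypothetical.

```python
# Minimal sketch (not the actual CMap inference pipeline): infer non-measured
# transcripts from measured "landmark" genes with a multi-output linear model.
# All data shapes and values are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_landmark, n_inferred = 500, 978, 50   # toy sizes; L1000 measures ~978 landmarks

# Hypothetical training set: profiles where both landmark and target genes were measured
landmark_train = rng.normal(size=(n_train, n_landmark))
target_train = (landmark_train[:, :n_inferred] @ rng.normal(size=(n_inferred, n_inferred))
                + 0.1 * rng.normal(size=(n_train, n_inferred)))

# One multi-output regression mapping landmark expression -> non-measured genes
model = LinearRegression().fit(landmark_train, target_train)

# New reduced-representation profiles contain only landmark measurements; infer the rest
landmark_new = rng.normal(size=(10, n_landmark))
inferred = model.predict(landmark_new)           # shape (10, n_inferred)
print(inferred.shape)
```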
Citations
Journal ArticleDOI
TL;DR: This review summarizes what profile data is, the properties of profile data analysis, and current applications of profile data for understanding and utilizing the effects of small compounds, in particular through a recently developed method to decompose the multiple effects of a drug.
Abstract: Profile data is defined as data that describes the properties of an object. Omics data of a specimen is profile data because its comprehensiveness supports the idea that omics data is numeric information reflecting the biological information of the specimen. In general, omics data analysis draws on an existing body of biological knowledge, while some profile data analysis methods are independent of existing knowledge, which makes them suitable for uncovering unidentified aspects of a specimen of interest. A small compound such as a drug has multiple effects, including effects unrecognized even by its developers. To uncover such unrecognized effects, it is useful to employ profile data analysis that is independent of existing knowledge. In this review, we summarize what profile data is, the properties of profile data analysis, and current applications of profile data for understanding and utilizing the effects of small compounds, in particular a recently developed method to decompose the multiple effects of a drug.
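One generic way to decompose a compound's profile into multiple latent "effects" is matrix factorization. The sketch below uses non-negative matrix factorization from scikit-learn on a synthetic non-negative profile matrix; it illustrates the general idea only and is not the specific decomposition method discussed in the review.

```python
# Generic illustration: decompose drug response profiles into latent "effects"
# via non-negative matrix factorization. Not the review's specific method; synthetic data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_drugs, n_genes, n_effects = 30, 200, 4

# Synthetic non-negative profile matrix: each drug mixes a few latent effects
true_weights = rng.gamma(shape=1.0, scale=1.0, size=(n_drugs, n_effects))
true_effects = rng.gamma(shape=1.0, scale=1.0, size=(n_effects, n_genes))
profiles = true_weights @ true_effects + 0.05 * rng.random((n_drugs, n_genes))

model = NMF(n_components=n_effects, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(profiles)   # per-drug loadings on each latent effect
H = model.components_               # per-effect gene signatures
print(W.shape, H.shape)             # (30, 4) (4, 200)
```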

3 citations

Journal ArticleDOI
TL;DR: Zhang et al. revealed the landscape of correlations between behavioral features and gene expression based on 445 mRNA samples and 445 microRNA samples, together with behavioral features (396 PhenoCube behaviors and 111 NeuroCube behaviors), in Htt CAG-knock-in mice.
Abstract: Huntington's disease (HD) is caused by a CAG repeat expansion in the huntingtin (HTT) gene. Knock-in mice carrying a CAG repeat-expanded Htt develop HD phenotypes. Previous studies suggested dysregulated molecular networks in a CAG length- and age-dependent manner in brain tissues from knock-in mice carrying expanded Htt CAG repeats. Furthermore, a large-scale phenome analysis defined a behavioral signature for the HD genotype in knock-in mice carrying expanded Htt CAG repeats. However, an integrated analysis correlating phenotypic features with genotypes (CAG repeat expansions) had not been conducted previously. In this study, we revealed the landscape of correlations between behavioral features and gene expression based on 445 mRNA samples and 445 microRNA samples, together with behavioral features (396 PhenoCube behaviors and 111 NeuroCube behaviors), in Htt CAG-knock-in mice. We identified 37 behavioral features that were significantly associated with CAG repeat length, including the number of steps and hind-limb stand duration. The behavioral features were associated with several gene coexpression groups involved in neuronal dysfunction, which were also supported by single-cell RNA sequencing data from the striatum and spatial gene expression data from the brain. We also identified 15 chemicals with significant responses for genes with enriched behavioral features, most of which are agonists or antagonists of dopamine and serotonin receptors used in neurology/psychiatry. Our study provides further evidence that abnormal neuronal signal transduction in the striatum plays an important role in causing HD-related behavioral phenotypes and provides rich information on possible pharmacotherapeutic interventions for HD.
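One basic building block of such an integrated analysis is correlating a single behavioral feature with per-gene expression across animals. The sketch below does this with Spearman correlation and Benjamini-Hochberg FDR control; the data, sample sizes, and feature names are synthetic placeholders, not the study's data.

```python
# Sketch: correlate one behavioral feature with per-gene expression across animals
# (Spearman correlation + Benjamini-Hochberg FDR). Synthetic data only.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n_mice, n_genes = 445, 1000
expression = rng.normal(size=(n_mice, n_genes))                # hypothetical mRNA matrix
behavior = expression[:, 0] * 0.5 + rng.normal(size=n_mice)    # e.g. "number of steps"

rhos = np.empty(n_genes)
pvals = np.empty(n_genes)
for g in range(n_genes):
    rhos[g], pvals[g] = spearmanr(expression[:, g], behavior)

rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes associated with the behavioral feature at FDR 0.05")
```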

3 citations

Posted ContentDOI
14 Apr 2020 - bioRxiv
TL;DR: This exploratory study built a biologically meaningful and simplified deep neural network, DeepSigSurvNet, for survival prediction and investigated the relevance and difference of the 46 signaling pathways among the 4 types of cancer.
Abstract: Survival analysis and prediction are important in cancer studies. In addition to the Cox proportional hazards model, deep learning models have recently been proposed to integrate multi-omics data for survival prediction. Cancer signaling pathways are important and interpretable concepts that define the signaling cascades regulating cancer development and drug resistance. It is therefore important to investigate the relevance of individual signaling pathways to survival time. In this exploratory study, we investigate the relevance and differences of a small set of core cancer signaling pathways in the survival analysis of cancer patients. Specifically, we built a biologically meaningful and simplified deep neural network, DeepSigSurvNet, for survival prediction. In the model, gene expression and copy number data for 1,648 genes from 46 major signaling pathways are used. We applied the model to 4 types of cancer and investigated the relevance and differences of the 46 signaling pathways among them. Interestingly, the interpretable analysis identified distinct patterns of these signaling pathways, which help clarify their association with cancer survival time. These highly relevant signaling pathways could be novel targets, combined with inhibitors of other essential signaling pathways, for drug and drug-combination prediction to improve the survival time of cancer patients.
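To make the idea of a pathway-structured survival network concrete, the sketch below wires genes only into their pathway node through a fixed mask and trains a risk score with the negative Cox partial likelihood. It is a generic, simplified illustration under assumed shapes and a random gene-to-pathway mask, not the published DeepSigSurvNet architecture.

```python
# Simplified pathway-structured survival net: a masked gene->pathway layer plus a small
# head producing a risk score, trained with the negative Cox partial likelihood.
# Generic illustration on synthetic data; not the published DeepSigSurvNet model.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, n_genes, n_pathways = 128, 300, 10

# Hypothetical gene->pathway membership mask (1 if gene g belongs to pathway p)
mask = (torch.rand(n_pathways, n_genes) < 0.1).float()

class PathwaySurvNet(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.mask = mask
        self.gene_to_pathway = nn.Linear(mask.shape[1], mask.shape[0])
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(mask.shape[0], 1))

    def forward(self, x):
        # Zero out weights of genes outside each pathway before the linear map
        masked_weight = self.gene_to_pathway.weight * self.mask
        pathway_act = nn.functional.linear(x, masked_weight, self.gene_to_pathway.bias)
        return self.head(pathway_act).squeeze(-1)      # per-sample risk score

def neg_cox_partial_likelihood(risk, time, event):
    # Sort by descending time so cumulative logsumexp covers each event's risk set
    order = torch.argsort(time, descending=True)
    risk, event = risk[order], event[order]
    log_cum_hazard = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_cum_hazard) * event).sum() / event.sum()

x = torch.randn(n_samples, n_genes)
time = torch.rand(n_samples) * 100.0
event = (torch.rand(n_samples) < 0.7).float()          # 1 = observed event, 0 = censored

model = PathwaySurvNet(mask)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = neg_cox_partial_likelihood(model(x), time, event)
    loss.backward()
    opt.step()
print(float(loss))
```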

3 citations

Journal ArticleDOI
TL;DR: TOXRIC is a database with comprehensive toxicological data, standardized attribute data, practical benchmarks, informative visualization of molecular representations, and an intuitive function interface, intended to contribute to the early stages of compound/drug discovery.
Abstract: The toxic effects of compounds on the environment, humans, and other organisms have been a major focus of many research areas, including drug discovery and ecological research. Identifying potential toxicity in the early stages of compound/drug discovery is critical. The rapid development of computational methods for evaluating various toxicity categories has increased the need for comprehensive, system-level collections of toxicological data, associated attributes, and benchmarks. To contribute toward this goal, we propose TOXRIC (https://toxric.bioinforai.tech/), a database with comprehensive toxicological data, standardized attribute data, practical benchmarks, informative visualization of molecular representations, and an intuitive function interface. The data stored in TOXRIC cover 113,372 compounds, 13 toxicity categories, 1,474 toxicity endpoints spanning in vivo/in vitro endpoints, and 39 feature types covering structural, target, transcriptome, and metabolic data, as well as other descriptors. All curated endpoint and feature datasets can be retrieved, downloaded, and directly used as output of or input to machine learning (ML)-based prediction models. In addition to serving as a data repository, TOXRIC also provides visualization of benchmarks and molecular representations for all endpoint datasets. Based on these results, researchers can better understand and select optimal feature types, molecular representations, and baseline algorithms for each endpoint prediction task. We believe that the rich information on compound toxicology, ML-ready datasets, benchmarks, and molecular representation distributions can greatly facilitate toxicological investigations, interpretation of toxicological mechanisms, compound/drug discovery, and the development of computational methods.
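Since the abstract emphasizes ML-ready endpoint datasets, a typical downstream use is training a baseline classifier on a downloaded export. The sketch below does this with scikit-learn; the file name, column names, and featurization are hypothetical placeholders, so check the actual TOXRIC download format before adapting it.

```python
# Hedged sketch: baseline toxicity classifier on a dataset exported from a repository
# like TOXRIC. File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical export: one row per compound, precomputed descriptor columns + binary label
df = pd.read_csv("toxric_endpoint_export.csv")          # placeholder path
X = df.drop(columns=["compound_id", "toxic"]).values    # placeholder column names
y = df["toxic"].values

clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("5-fold ROC-AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```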

3 citations

Journal ArticleDOI
TL;DR: This paper documents how this industrial computational platform has had a transformational impact on the company's R&D, making it more competitive as well as time- and cost-effective through model-based, educated selection of therapeutic targets and drug candidates.
Abstract: Introduction: As a mid-size international pharmaceutical company, we launched a dedicated high-throughput computing platform supporting drug discovery 4 years ago. The platform, named 'Patrimony', was built on the initial premise of capitalizing on our proprietary data while leveraging public data sources, in order to foster a Computational Precision Medicine approach with the power of artificial intelligence. Areas covered: Specifically, Patrimony is designed to identify novel therapeutic target candidates. With several successful use cases in immuno-inflammatory diseases, and ongoing extension to applications in oncology and neurology, we document how this industrial computational platform has had a transformational impact on our R&D, making it more competitive as well as time- and cost-effective through model-based, educated selection of therapeutic targets and drug candidates. Expert opinion: We report our achievements, but also our challenges in implementing data access and governance processes, building hardware and user interfaces, and acculturating scientists to using predictive models to inform decisions.

3 citations

References
Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
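The core statistic behind GSEA is a weighted running-sum (Kolmogorov-Smirnov-like) enrichment score over a ranked gene list. The sketch below computes that score only, on synthetic data; the permutation-based significance testing used by the actual GSEA software is omitted.

```python
# Minimal illustration of the GSEA running-sum enrichment score on a ranked gene list.
# No permutation testing; synthetic data only.
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """ranked_genes: genes sorted by correlation with phenotype (descending);
    scores: corresponding correlation values; gene_set: set of member genes."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    hit_weights = np.abs(scores) ** p * in_set
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()
    p_miss = np.cumsum(~in_set) / (~in_set).sum()
    running = p_hit - p_miss
    return running[np.argmax(np.abs(running))]   # signed maximum deviation from zero

rng = np.random.default_rng(3)
genes = [f"gene{i}" for i in range(1000)]
scores = np.sort(rng.normal(size=1000))[::-1]          # ranked correlation values
gene_set = {f"gene{i}" for i in range(0, 1000, 25)}    # toy gene set
print(enrichment_score(genes, scores, gene_set))
```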

34,830 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
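For practical use, t-SNE is available in scikit-learn (a reimplementation of the method described in the paper, not the authors' original code). A minimal usage sketch on synthetic clustered data:

```python
# Quick t-SNE usage sketch via scikit-learn on synthetic high-dimensional clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Three well-separated high-dimensional clusters
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 50)) for c in (-5, 0, 5)])

embedding = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(embedding.shape)   # (300, 2) -- ready for a scatter plot
```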

30,124 citations

Journal ArticleDOI
TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data, and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments.
Abstract: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in-house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather to complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, and were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that defines what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets that make up an experiment. The GEO repository is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.
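GEO records can also be queried programmatically. The sketch below searches the GEO DataSets database ("gds") through NCBI's E-utilities; the search term is an arbitrary example, and NCBI's documentation should be consulted for rate limits and full parameters.

```python
# Sketch: query GEO DataSets metadata through NCBI E-utilities (esearch, db="gds").
# The search term is an arbitrary example.
import json
import urllib.parse
import urllib.request

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "gds",                                        # GEO DataSets database
    "term": "L1000[All Fields] AND gse[Entry Type]",    # example query
    "retmax": "5",
    "retmode": "json",
}
with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as resp:
    result = json.load(resp)

print(result["esearchresult"]["idlist"])   # UIDs of matching GEO records
```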

10,968 citations

Journal ArticleDOI
TL;DR: How BLAT was optimized is described, which is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.
Abstract: Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
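The key idea behind BLAT's speed, indexing non-overlapping K-mers of the genome and looking up overlapping K-mers of the query to find candidate hit regions, can be shown in a few lines. The toy sketch below illustrates only that indexing step, not BLAT itself (no alignment, stitching, or splice-site logic).

```python
# Toy illustration of a non-overlapping K-mer index used to find candidate hit regions.
# Didactic sketch only; not the BLAT algorithm.
from collections import defaultdict

def build_index(genome, k=4):
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):   # non-overlapping K-mers of the genome
        index[genome[pos:pos + k]].append(pos)
    return index

def candidate_hits(query, index, k=4):
    hits = []
    for offset in range(len(query) - k + 1):       # overlapping K-mers of the query
        for pos in index.get(query[offset:offset + k], []):
            hits.append(pos - offset)              # implied match start in the genome
    return sorted(set(hits))

genome = "ACGTACGTTTGACCAGTACGTACGGT"
index = build_index(genome, k=4)
print(candidate_hits("GTACGGT", index, k=4))       # candidate alignment start positions
```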

8,326 citations

Journal ArticleDOI
TL;DR: This paper proposed parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples.
Abstract: Non-biological experimental variation, or "batch effects", is commonly observed across multiple batches of microarray experiments, often making it difficult to combine data from these batches. The ability to combine microarray data sets is advantageous to researchers, increasing statistical power to detect biological phenomena in studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at http://biosun1.harvard.edu/complab/batch/.
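To show what "adjusting for batch effects" means in the simplest case, the sketch below centers and scales each gene within each batch and restores the overall location and scale. This is a naive location/scale adjustment for illustration only, not the parametric or non-parametric empirical Bayes ComBat procedure from the paper; established implementations (e.g., the R "sva" package) should be used in practice.

```python
# Naive location/scale batch adjustment sketch; NOT the empirical Bayes ComBat method.
import numpy as np

def simple_batch_adjust(expr, batches):
    """expr: samples x genes matrix; batches: per-sample batch labels."""
    expr = np.asarray(expr, dtype=float)
    adjusted = np.empty_like(expr)
    grand_mean = expr.mean(axis=0)
    grand_std = expr.std(axis=0, ddof=1)
    for b in np.unique(batches):
        idx = np.asarray(batches) == b
        mu = expr[idx].mean(axis=0)
        sd = expr[idx].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0                      # guard against constant genes in a batch
        adjusted[idx] = (expr[idx] - mu) / sd * grand_std + grand_mean
    return adjusted

rng = np.random.default_rng(5)
expr = rng.normal(size=(20, 100)) + np.repeat([0.0, 2.0], 10)[:, None]  # batch 2 shifted
batches = np.repeat(["batch1", "batch2"], 10)
print(simple_batch_adjust(expr, batches).mean(axis=0)[:3])   # batch shift largely removed
```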

6,319 citations
