Journal ArticleDOI

A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.

TL;DR: The expanded CMap is reported, made possible by a new, low-cost, high-throughput reduced representation expression profiling method that is shown to be highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts.
About: This article was published in Cell on 2017-11-30 and is currently open access. It has received 1,943 citations to date.
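The reduced-representation idea behind L1000 is to measure roughly 1,000 landmark transcripts and computationally infer the expression of most non-measured transcripts. As a rough illustration of that inference step only (not the CMap group's actual pipeline, which the paper describes), the sketch below imputes non-measured genes from landmark genes with a multi-output linear regression; all data, sizes, and variable names are synthetic and hypothetical.

```python
# Minimal sketch (not the actual CMap inference pipeline): infer non-measured
# transcripts from measured "landmark" genes with a multi-output linear model.
# All data shapes and values are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_landmark, n_inferred = 500, 978, 50   # toy sizes; L1000 measures ~978 landmarks

# Hypothetical training set: profiles where both landmark and target genes were measured
landmark_train = rng.normal(size=(n_train, n_landmark))
target_train = (landmark_train[:, :n_inferred] @ rng.normal(size=(n_inferred, n_inferred))
                + 0.1 * rng.normal(size=(n_train, n_inferred)))

# One multi-output regression mapping landmark expression -> non-measured genes
model = LinearRegression().fit(landmark_train, target_train)

# New reduced-representation profiles contain only landmark measurements; infer the rest
landmark_new = rng.normal(size=(10, n_landmark))
inferred = model.predict(landmark_new)           # shape (10, n_inferred)
print(inferred.shape)
```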
Citations
Journal ArticleDOI
TL;DR: This review summarizes what profile data is, the properties of profile data analysis, and current applications of profile data for understanding and utilizing the effects of small compounds, in particular through a recently developed method to decompose the multiple effects of a drug.
Abstract: Profile data is defined as data that describes the properties of an object. Omics data of a specimen is profile data because its comprehensiveness supports the idea that omics data is numeric information reflecting the biological information of the specimen. In general, omics data analysis draws on an existing body of biological knowledge, while some profile data analysis methods are independent of existing knowledge, which makes them suitable for uncovering unidentified aspects of a specimen of interest. A small compound such as a drug has multiple effects, including effects unrecognized even by its developers. To uncover such unrecognized effects, it is useful to employ profile data analysis that is independent of existing knowledge. In this review, we summarize what profile data is, the properties of profile data analysis, and current applications of profile data for understanding and utilizing the effects of small compounds, in particular a recently developed method to decompose the multiple effects of a drug.
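One generic way to decompose a compound's profile into multiple latent "effects" is matrix factorization. The sketch below uses non-negative matrix factorization from scikit-learn on a synthetic non-negative profile matrix; it illustrates the general idea only and is not the specific decomposition method discussed in the review.

```python
# Generic illustration: decompose drug response profiles into latent "effects"
# via non-negative matrix factorization. Not the review's specific method; synthetic data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n_drugs, n_genes, n_effects = 30, 200, 4

# Synthetic non-negative profile matrix: each drug mixes a few latent effects
true_weights = rng.gamma(shape=1.0, scale=1.0, size=(n_drugs, n_effects))
true_effects = rng.gamma(shape=1.0, scale=1.0, size=(n_effects, n_genes))
profiles = true_weights @ true_effects + 0.05 * rng.random((n_drugs, n_genes))

model = NMF(n_components=n_effects, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(profiles)   # per-drug loadings on each latent effect
H = model.components_               # per-effect gene signatures
print(W.shape, H.shape)             # (30, 4) (4, 200)
```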

3 citations

Journal ArticleDOI
TL;DR: Zhang et al. revealed the landscape of correlations between behavioral features and gene expression based on 445 mRNA samples and 445 microRNA samples, together with behavioral features (396 PhenoCube behaviors and 111 NeuroCube behaviors), in Htt CAG-knock-in mice.
Abstract: Huntington's disease (HD) is caused by a CAG repeat expansion in the huntingtin (HTT) gene. Knock-in mice carrying a CAG repeat-expanded Htt develop HD phenotypes. Previous studies suggested dysregulated molecular networks in a CAG length- and age-dependent manner in brain tissues from knock-in mice carrying expanded Htt CAG repeats. Furthermore, a large-scale phenome analysis defined a behavioral signature for the HD genotype in knock-in mice carrying expanded Htt CAG repeats. However, an integrated analysis correlating phenotypic features with genotypes (CAG repeat expansions) had not been conducted previously. In this study, we revealed the landscape of correlations between behavioral features and gene expression based on 445 mRNA samples and 445 microRNA samples, together with behavioral features (396 PhenoCube behaviors and 111 NeuroCube behaviors), in Htt CAG-knock-in mice. We identified 37 behavioral features that were significantly associated with CAG repeat length, including the number of steps and hind-limb stand duration. The behavioral features were associated with several gene coexpression groups involved in neuronal dysfunction, which were also supported by single-cell RNA sequencing data from the striatum and spatial gene expression data from the brain. We also identified 15 chemicals with significant responses for genes with enriched behavioral features, most of which are agonists or antagonists of dopamine and serotonin receptors used in neurology/psychiatry. Our study provides further evidence that abnormal neuronal signal transduction in the striatum plays an important role in causing HD-related behavioral phenotypes and provides rich information on possible pharmacotherapeutic interventions for HD.
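One basic building block of such an integrated analysis is correlating a single behavioral feature with per-gene expression across animals. The sketch below does this with Spearman correlation and Benjamini-Hochberg FDR control; the data, sample sizes, and feature names are synthetic placeholders, not the study's data.

```python
# Sketch: correlate one behavioral feature with per-gene expression across animals
# (Spearman correlation + Benjamini-Hochberg FDR). Synthetic data only.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n_mice, n_genes = 445, 1000
expression = rng.normal(size=(n_mice, n_genes))                # hypothetical mRNA matrix
behavior = expression[:, 0] * 0.5 + rng.normal(size=n_mice)    # e.g. "number of steps"

rhos = np.empty(n_genes)
pvals = np.empty(n_genes)
for g in range(n_genes):
    rhos[g], pvals[g] = spearmanr(expression[:, g], behavior)

rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes associated with the behavioral feature at FDR 0.05")
```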

3 citations

Posted ContentDOI
14 Apr 2020 - bioRxiv
TL;DR: This exploratory study built a biologically meaningful and simplified deep neural network, DeepSigSurvNet, for survival prediction and investigated the relevance and difference of the 46 signaling pathways among the 4 types of cancer.
Abstract: Survival analysis and prediction are important in cancer studies. In addition to the Cox proportional hazards model, deep learning models have recently been proposed to integrate multi-omics data for survival prediction. Cancer signaling pathways are important and interpretable concepts that define the signaling cascades regulating cancer development and drug resistance. It is therefore important to investigate the relevance of individual signaling pathways to survival time. In this exploratory study, we investigate the relevance and differences of a small set of core cancer signaling pathways in the survival analysis of cancer patients. Specifically, we built a biologically meaningful and simplified deep neural network, DeepSigSurvNet, for survival prediction. In the model, gene expression and copy number data for 1,648 genes from 46 major signaling pathways are used. We applied the model to 4 types of cancer and investigated the relevance and differences of the 46 signaling pathways among them. Interestingly, the interpretable analysis identified distinct patterns of these signaling pathways, which help clarify their association with cancer survival time. These highly relevant signaling pathways could be novel targets, combined with inhibitors of other essential signaling pathways, for drug and drug-combination prediction to improve the survival time of cancer patients.
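To make the idea of a pathway-structured survival network concrete, the sketch below wires genes only into their pathway node through a fixed mask and trains a risk score with the negative Cox partial likelihood. It is a generic, simplified illustration under assumed shapes and a random gene-to-pathway mask, not the published DeepSigSurvNet architecture.

```python
# Simplified pathway-structured survival net: a masked gene->pathway layer plus a small
# head producing a risk score, trained with the negative Cox partial likelihood.
# Generic illustration on synthetic data; not the published DeepSigSurvNet model.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, n_genes, n_pathways = 128, 300, 10

# Hypothetical gene->pathway membership mask (1 if gene g belongs to pathway p)
mask = (torch.rand(n_pathways, n_genes) < 0.1).float()

class PathwaySurvNet(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.mask = mask
        self.gene_to_pathway = nn.Linear(mask.shape[1], mask.shape[0])
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(mask.shape[0], 1))

    def forward(self, x):
        # Zero out weights of genes outside each pathway before the linear map
        masked_weight = self.gene_to_pathway.weight * self.mask
        pathway_act = nn.functional.linear(x, masked_weight, self.gene_to_pathway.bias)
        return self.head(pathway_act).squeeze(-1)      # per-sample risk score

def neg_cox_partial_likelihood(risk, time, event):
    # Sort by descending time so cumulative logsumexp covers each event's risk set
    order = torch.argsort(time, descending=True)
    risk, event = risk[order], event[order]
    log_cum_hazard = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_cum_hazard) * event).sum() / event.sum()

x = torch.randn(n_samples, n_genes)
time = torch.rand(n_samples) * 100.0
event = (torch.rand(n_samples) < 0.7).float()          # 1 = observed event, 0 = censored

model = PathwaySurvNet(mask)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = neg_cox_partial_likelihood(model(x), time, event)
    loss.backward()
    opt.step()
print(float(loss))
```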

3 citations

Journal ArticleDOI
TL;DR: TOXRIC is a database with comprehensive toxicological data, standardized attribute data, practical benchmarks, informative visualization of molecular representations, and an intuitive function interface, intended to contribute to the early stages of compound/drug discovery.
Abstract: The toxic effects of compounds on the environment, humans, and other organisms have been a major focus of many research areas, including drug discovery and ecological research. Identifying potential toxicity in the early stages of compound/drug discovery is critical. The rapid development of computational methods for evaluating various toxicity categories has increased the need for comprehensive, system-level collections of toxicological data, associated attributes, and benchmarks. To contribute toward this goal, we propose TOXRIC (https://toxric.bioinforai.tech/), a database with comprehensive toxicological data, standardized attribute data, practical benchmarks, informative visualization of molecular representations, and an intuitive function interface. The data stored in TOXRIC cover 113,372 compounds, 13 toxicity categories, 1,474 toxicity endpoints spanning in vivo/in vitro endpoints, and 39 feature types covering structural, target, transcriptome, and metabolic data, as well as other descriptors. All curated endpoint and feature datasets can be retrieved, downloaded, and directly used as output of or input to machine learning (ML)-based prediction models. In addition to serving as a data repository, TOXRIC also provides visualization of benchmarks and molecular representations for all endpoint datasets. Based on these results, researchers can better understand and select optimal feature types, molecular representations, and baseline algorithms for each endpoint prediction task. We believe that the rich information on compound toxicology, ML-ready datasets, benchmarks, and molecular representation distributions can greatly facilitate toxicological investigations, interpretation of toxicological mechanisms, compound/drug discovery, and the development of computational methods.
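Since the abstract emphasizes ML-ready endpoint datasets, a typical downstream use is training a baseline classifier on a downloaded export. The sketch below does this with scikit-learn; the file name, column names, and featurization are hypothetical placeholders, so check the actual TOXRIC download format before adapting it.

```python
# Hedged sketch: baseline toxicity classifier on a dataset exported from a repository
# like TOXRIC. File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical export: one row per compound, precomputed descriptor columns + binary label
df = pd.read_csv("toxric_endpoint_export.csv")          # placeholder path
X = df.drop(columns=["compound_id", "toxic"]).values    # placeholder column names
y = df["toxic"].values

clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("5-fold ROC-AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```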

3 citations

Journal ArticleDOI
TL;DR: This paper documents how this industrial computational platform has had a transformational impact on the company's R&D, making it more competitive as well as time- and cost-effective through model-based, educated selection of therapeutic targets and drug candidates.
Abstract: Introduction: As a mid-size international pharmaceutical company, we launched a dedicated high-throughput computing platform supporting drug discovery 4 years ago. The platform, named 'Patrimony', was built on the initial premise of capitalizing on our proprietary data while leveraging public data sources, in order to foster a Computational Precision Medicine approach with the power of artificial intelligence. Areas covered: Specifically, Patrimony is designed to identify novel therapeutic target candidates. With several successful use cases in immuno-inflammatory diseases, and ongoing extension to applications in oncology and neurology, we document how this industrial computational platform has had a transformational impact on our R&D, making it more competitive as well as time- and cost-effective through model-based, educated selection of therapeutic targets and drug candidates. Expert opinion: We report our achievements, but also our challenges in implementing data access and governance processes, building hardware and user interfaces, and acculturating scientists to using predictive models to inform decisions.

3 citations

References
Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
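The core statistic behind GSEA is a weighted running-sum (Kolmogorov-Smirnov-like) enrichment score over a ranked gene list. The sketch below computes that score only, on synthetic data; the permutation-based significance testing used by the actual GSEA software is omitted.

```python
# Minimal illustration of the GSEA running-sum enrichment score on a ranked gene list.
# No permutation testing; synthetic data only.
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """ranked_genes: genes sorted by correlation with phenotype (descending);
    scores: corresponding correlation values; gene_set: set of member genes."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    hit_weights = np.abs(scores) ** p * in_set
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()
    p_miss = np.cumsum(~in_set) / (~in_set).sum()
    running = p_hit - p_miss
    return running[np.argmax(np.abs(running))]   # signed maximum deviation from zero

rng = np.random.default_rng(3)
genes = [f"gene{i}" for i in range(1000)]
scores = np.sort(rng.normal(size=1000))[::-1]          # ranked correlation values
gene_set = {f"gene{i}" for i in range(0, 1000, 25)}    # toy gene set
print(enrichment_score(genes, scores, gene_set))
```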

34,830 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
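For practical use, t-SNE is available in scikit-learn (a reimplementation of the method described in the paper, not the authors' original code). A minimal usage sketch on synthetic clustered data:

```python
# Quick t-SNE usage sketch via scikit-learn on synthetic high-dimensional clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Three well-separated high-dimensional clusters
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 50)) for c in (-5, 0, 5)])

embedding = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(embedding.shape)   # (300, 2) -- ready for a scatter plot
```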

30,124 citations

Journal ArticleDOI
TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data, and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments.
Abstract: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in-house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather to complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, and were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that defines what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets that make up an experiment. The GEO repository is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.
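GEO records can also be queried programmatically. The sketch below searches the GEO DataSets database ("gds") through NCBI's E-utilities; the search term is an arbitrary example, and NCBI's documentation should be consulted for rate limits and full parameters.

```python
# Sketch: query GEO DataSets metadata through NCBI E-utilities (esearch, db="gds").
# The search term is an arbitrary example.
import json
import urllib.parse
import urllib.request

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "gds",                                        # GEO DataSets database
    "term": "L1000[All Fields] AND gse[Entry Type]",    # example query
    "retmax": "5",
    "retmode": "json",
}
with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as resp:
    result = json.load(resp)

print(result["esearchresult"]["idlist"])   # UIDs of matching GEO records
```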

10,968 citations

Journal ArticleDOI
TL;DR: How BLAT was optimized is described, which is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.
Abstract: Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
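The key idea behind BLAT's speed, indexing non-overlapping K-mers of the genome and looking up overlapping K-mers of the query to find candidate hit regions, can be shown in a few lines. The toy sketch below illustrates only that indexing step, not BLAT itself (no alignment, stitching, or splice-site logic).

```python
# Toy illustration of a non-overlapping K-mer index used to find candidate hit regions.
# Didactic sketch only; not the BLAT algorithm.
from collections import defaultdict

def build_index(genome, k=4):
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):   # non-overlapping K-mers of the genome
        index[genome[pos:pos + k]].append(pos)
    return index

def candidate_hits(query, index, k=4):
    hits = []
    for offset in range(len(query) - k + 1):       # overlapping K-mers of the query
        for pos in index.get(query[offset:offset + k], []):
            hits.append(pos - offset)              # implied match start in the genome
    return sorted(set(hits))

genome = "ACGTACGTTTGACCAGTACGTACGGT"
index = build_index(genome, k=4)
print(candidate_hits("GTACGGT", index, k=4))       # candidate alignment start positions
```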

8,326 citations

Journal ArticleDOI
TL;DR: This paper proposed parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples.
Abstract: Non-biological experimental variation, or "batch effects", is commonly observed across multiple batches of microarray experiments, often making it difficult to combine data from these batches. The ability to combine microarray data sets is advantageous to researchers, increasing statistical power to detect biological phenomena in studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at http://biosun1.harvard.edu/complab/batch/.
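To show what "adjusting for batch effects" means in the simplest case, the sketch below centers and scales each gene within each batch and restores the overall location and scale. This is a naive location/scale adjustment for illustration only, not the parametric or non-parametric empirical Bayes ComBat procedure from the paper; established implementations (e.g., the R "sva" package) should be used in practice.

```python
# Naive location/scale batch adjustment sketch; NOT the empirical Bayes ComBat method.
import numpy as np

def simple_batch_adjust(expr, batches):
    """expr: samples x genes matrix; batches: per-sample batch labels."""
    expr = np.asarray(expr, dtype=float)
    adjusted = np.empty_like(expr)
    grand_mean = expr.mean(axis=0)
    grand_std = expr.std(axis=0, ddof=1)
    for b in np.unique(batches):
        idx = np.asarray(batches) == b
        mu = expr[idx].mean(axis=0)
        sd = expr[idx].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0                      # guard against constant genes in a batch
        adjusted[idx] = (expr[idx] - mu) / sd * grand_std + grand_mean
    return adjusted

rng = np.random.default_rng(5)
expr = rng.normal(size=(20, 100)) + np.repeat([0.0, 2.0], 10)[:, None]  # batch 2 shifted
batches = np.repeat(["batch1", "batch2"], 10)
print(simple_batch_adjust(expr, batches).mean(axis=0)[:3])   # batch shift largely removed
```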

6,319 citations
