Open access | Journal Article | DOI: 10.1016/J.PATTER.2021.100213

Appyters: Turning Jupyter Notebooks into data-driven web apps

04 Mar 2021, Vol. 2, Iss. 3, p. 100213
Abstract: Jupyter Notebooks have transformed the communication of data analysis pipelines by facilitating a modular structure that brings together code, markdown text, and interactive visualizations. Here, we extended Jupyter Notebooks to broaden their accessibility with Appyters. Appyters turn Jupyter Notebooks into fully functional standalone web-based bioinformatics applications. Appyters present to users an entry form enabling them to upload their data and set various parameters for a multitude of data analysis workflows. Once the form is filled, the Appyter executes the corresponding notebook in the cloud, producing the output without requiring the user to interact directly with the code. Appyters were used to create many bioinformatics web-based reusable workflows, including applications to build customized machine learning pipelines, analyze omics data, and produce publishable figures. These Appyters are served in the Appyters Catalog. In summary, Appyters enable the rapid development of interactive web-based bioinformatics applications.

Journal Article | DOI: 10.1002/CPZ1.90
01 Mar 2021
Abstract: Profiling samples from patients, tissues, and cells with genomics, transcriptomics, epigenomics, proteomics, and metabolomics ultimately produces lists of genes and proteins that need to be further analyzed and integrated in the context of known biology. Enrichr (Chen et al., 2013; Kuleshov et al., 2016) is a gene set search engine that enables the querying of hundreds of thousands of annotated gene sets. Enrichr uniquely integrates knowledge from many high-profile projects to provide synthesized information about mammalian genes and gene sets. The platform provides various methods to compute gene set enrichment, and the results are visualized in several interactive ways. This protocol provides a summary of the key features of Enrichr, which include using Enrichr programmatically and embedding an Enrichr button on any website. © 2021 Wiley Periodicals LLC.
Basic Protocol 1: Analyzing lists of differentially expressed genes from transcriptomics, proteomics and phosphoproteomics, GWAS studies, or other experimental studies
Basic Protocol 2: Searching Enrichr by a single gene or key search term
Basic Protocol 3: Preparing raw or processed RNA-seq data through BioJupies in preparation for Enrichr analysis
Basic Protocol 4: Analyzing gene sets for model organisms using modEnrichr
Basic Protocol 5: Using Enrichr in Geneshot
Basic Protocol 6: Using Enrichr in ARCHS4
Basic Protocol 7: Using the enrichment analysis visualization Appyter to visualize Enrichr results
Basic Protocol 8: Using the Enrichr API
Basic Protocol 9: Adding an Enrichr button to a website
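The gene set enrichment computation mentioned in this abstract can be illustrated with a one-sided hypergeometric overrepresentation test, which is one common way to score the overlap between a query gene list and an annotated gene set. This is a minimal sketch and not Enrichr's own implementation: the function names, the toy library, and the background size are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a background that contains set_size annotated genes."""
    upper = min(list_size, set_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrich(query, library, background_size):
    """Score every gene set in `library` (term -> genes) against `query`.

    Returns (term, overlap size, p-value) tuples, smallest p-value first.
    """
    query = set(query)
    results = []
    for term, genes in library.items():
        genes = set(genes)
        shared = query & genes
        p = hypergeom_pvalue(len(shared), len(query), len(genes), background_size)
        results.append((term, len(shared), p))
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy library and query; the gene labels are placeholders.
toy_library = {
    "term_A": ["g1", "g2", "g3"],
    "term_B": ["g7", "g8", "g9"],
}
toy_ranking = enrich(["g1", "g2", "g3"], toy_library, background_size=20)
```

In practice the p-values would also be corrected for multiple testing across all gene sets in the library, which this sketch leaves out.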

Open access | Journal Article | DOI: 10.1093/NAR/GKAB359
Abstract: Phosphoproteomics and proteomics experiments capture a global snapshot of the cellular signaling network, but these methods do not directly measure kinase state. Kinase Enrichment Analysis 3 (KEA3) is a webserver application that infers overrepresentation of upstream kinases whose putative substrates are in a user-inputted list of proteins. KEA3 can be applied to analyze data from phosphoproteomics and proteomics studies to predict the upstream kinases responsible for observed differential phosphorylations. The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data. To benchmark the performance of KEA3, we examined whether KEA3 can predict the perturbed kinase from single-kinase perturbation followed by gene expression experiments, and phosphoproteomics data collected from kinase-targeting small molecules. We show that integrating KSIs and KPIs across data sources to produce a composite ranking improves the recovery of the expected kinase. The KEA3 webserver is available at
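The idea of integrating results across data sources into a composite ranking can be sketched with a simple mean-rank scheme: each source library ranks the kinases, and the composite score of a kinase is the average of its ranks. This is only an illustrative sketch under stated assumptions. The library names and the toy rankings are made up, and averaging only over the libraries in which a kinase appears is a design choice of this example, not something the abstract specifies.

```python
def mean_rank(rankings):
    """Combine per-library rankings into one composite ranking.

    rankings: dict mapping a library name to a list of kinases ordered
    from most to least enriched. Returns (kinase, mean rank) pairs,
    best (lowest mean rank) first; ties are broken alphabetically.
    """
    positions = {}
    for ordered in rankings.values():
        for rank, kinase in enumerate(ordered, start=1):
            positions.setdefault(kinase, []).append(rank)
    scored = {k: sum(r) / len(r) for k, r in positions.items()}
    return sorted(scored.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical per-library results; the orderings are invented for the example.
toy_rankings = {
    "ksi_library": ["AKT1", "MAPK1", "CDK1"],
    "kpi_library": ["MAPK1", "AKT1", "CDK1"],
    "coexpression_library": ["AKT1", "CDK1", "MAPK1"],
}
composite = mean_rank(toy_rankings)
```

A kinase that ranks consistently well across several independent sources ends up ahead of one that ranks first in a single source only, which is the intuition behind combining sources.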

Open access | Journal Article | DOI: 10.1016/J.PATTER.2021.100323
Farid Nakhle, Antoine Harfouche
10 Sep 2021
Abstract: Summary: High-throughput image-based technologies are now widely used in the rapidly developing field of digital phenomics and are generating ever-increasing amounts and diversity of data. Artificial intelligence (AI) is becoming a game changer in turning the vast seas of data into valuable predictions and insights. However, this requires specialized programming skills and an in-depth understanding of machine learning, deep learning, and ensemble learning algorithms. Here, we attempt to methodically review the usage of different tools, technologies, and services available to the phenomics data community and show how they can be applied to selected problems in explainable AI-based image analysis. This tutorial provides practical and useful resources for novices and experts to harness the potential of the phenomic data in explainable AI-led breeding programs.

Open access | Posted Content
Abstract: To better understand the potential of drug repurposing in COVID-19, we analyzed control strategies over essential host factors for SARS-CoV-2 infection. We constructed comprehensive directed protein-protein interaction networks integrating the top ranked host factors, drug target proteins, and directed protein-protein interaction data. We analyzed the networks to identify drug targets and combinations thereof that offer efficient control over the host factors. We validated our findings against clinical studies data and bioinformatics studies. Our method offers a new insight into the molecular details of the disease and into potentially new therapy targets for it. Our approach for drug repurposing is significant beyond COVID-19 and may be applied also to other diseases.

Open access | Journal Article | DOI: 10.1007/S11060-021-03829-0
Abstract: A large subset of diffusely infiltrative gliomas contains a gain-of-function mutation in isocitrate dehydrogenase 1 or 2 (IDH1/2mut) which produces 2-hydroxyglutarate, an inhibitor of α-ketoglutarate-dependent DNA demethylases, thereby inducing widespread DNA and histone methylation. Because histone deacetylase (HDAC) enzymes are localized to methylated chromatin via methyl-binding domain proteins, IDH1/2mut gliomas may be more dependent on HDAC activity, and therefore may be more sensitive to HDAC inhibitors. Six cultured patient-derived glioma cell lines, IDH1wt (n = 3) and IDH1mut (n = 3), were treated with an FDA-approved HDAC inhibitor, panobinostat. Cellular cytotoxicity and proliferation assays were conducted by flow cytometry. Histone modifications and cell signaling pathways were assessed using immunoblot and/or ELISA. IDH1mut gliomas exhibited marked upregulation of genes associated with HDAC activity. Glioma cell cultures bearing IDH1mut were significantly more sensitive to the cytotoxic and antiproliferative effects of panobinostat, compared to IDH1wt glioma cells. Panobinostat caused a greater increase in acetylation of the histone residues H3K14, H3K18, and H3K27 in IDH1mut glioma cells. Another HDAC inhibitor, valproic acid, was also more effective against IDH1mut glioma cells. These data suggest that IDH1mut gliomas may be preferentially sensitive to HDAC inhibitors. Further, IDH1mut glioma cultures showed enhanced accumulation of acetylated histone residues in response to panobinostat treatment, suggesting a direct epigenetic mechanism for this sensitivity. This provides a rationale for further exploration of HDAC inhibitors against IDH1mut gliomas.

Open access | Journal Article
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from

Open access | Journal Article | DOI: 10.1186/S13059-014-0550-8
05 Dec 2014, Genome Biology
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at .

Open access | Journal Article
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
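The abstract does not spell out the mechanism behind the reduced crowding. In the published method, similarities between points in the low-dimensional map are measured with a heavy-tailed Student-t kernel, so that moderately distant points can sit far apart in the map. The following is a minimal pure-Python sketch of that map-side similarity only (not the full optimization), with function and variable names chosen for this example.

```python
def low_dim_affinities(points):
    """Pairwise map similarities q_ij using a Student-t kernel with one
    degree of freedom: q_ij is proportional to 1 / (1 + ||y_i - y_j||^2),
    normalized over all pairs i != j. Requires at least two points."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = 1.0 / (1.0 + d2)
                total += weights[i][j]
    return [[w / total for w in row] for row in weights]


# Three hypothetical map points: two close together and one far away.
toy_map = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
q = low_dim_affinities(toy_map)
```

Because the kernel decays polynomially rather than exponentially, the distant point still receives a small but non-negligible similarity, which is what lets the layout spread clusters apart.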

Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTP616
01 Jan 2010, Bioinformatics
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (
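The "overdispersed Poisson model" can be made concrete with the negative binomial mean-variance relationship, in which the variance equals the Poisson variance plus a term that grows with the square of the mean. The sketch below assumes this standard parameterization; the function names and the example numbers are illustrative and are not taken from the package.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean.

    A dispersion of 0 recovers the Poisson case (variance == mean);
    a positive dispersion adds variability on top of counting noise.
    """
    return mean + dispersion * mean ** 2


def squared_cv(mean, dispersion):
    """Squared coefficient of variation: 1/mean from counting noise
    plus the dispersion itself from the extra variability."""
    return 1.0 / mean + dispersion


# Hypothetical transcript with a mean count of 100.
poisson_like = nb_variance(100.0, 0.0)
overdispersed = nb_variance(100.0, 0.1)
```

For highly expressed transcripts the 1/mean term vanishes, so the dispersion dominates the relative variability. That is why estimating it reliably from few replicates matters.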

Open access | Journal Article | DOI: 10.1093/NAR/30.1.207
Ron Edgar, Michael Domrachev, Alex E. Lash
Abstract: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data. GEO provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments. GEO is not intended to replace in house gene expression databases that benefit from coherent data sets, and which are constructed to facilitate a particular analytic method, but rather complement these by acting as a tertiary, central data distribution hub. The three central data entities of GEO are platforms, samples and series, and were designed with gene expression and genomic hybridization experiments in mind. A platform is, essentially, a list of probes that define what set of molecules may be detected. A sample describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data. A series organizes samples into the meaningful data sets which make up an experiment. The GEO repository is publicly accessible through the World Wide Web at
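The relationship between the three data entities described in this abstract can be sketched as a small data model: a platform lists probes, a sample references exactly one platform and stores abundance values for its probes, and a series groups samples into an experiment. The class and field names and the identifiers below are assumptions made for illustration and do not reflect GEO's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Platform:
    """A list of probes defining which molecules may be detected."""
    identifier: str
    probes: List[str]


@dataclass
class Sample:
    """Abundance measurements that reference a single platform."""
    identifier: str
    platform: Platform
    abundances: Dict[str, float]

    def __post_init__(self):
        # Every measured probe must be defined by the referenced platform.
        unknown = set(self.abundances) - set(self.platform.probes)
        if unknown:
            raise ValueError(f"probes not on platform: {sorted(unknown)}")


@dataclass
class Series:
    """A set of samples that together make up an experiment."""
    identifier: str
    title: str
    samples: List[Sample] = field(default_factory=list)


# Hypothetical records; identifiers and values are placeholders.
platform = Platform("platform-1", ["probe_a", "probe_b"])
control = Sample("sample-1", platform, {"probe_a": 5.2, "probe_b": 1.1})
treated = Sample("sample-2", platform, {"probe_a": 7.9, "probe_b": 0.8})
experiment = Series("series-1", "toy treatment vs control", [control, treated])
```

Keeping the probe definitions on the platform rather than on each sample means that many samples can share one platform record without duplicating it.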

