Author

Joshua J. Levy

Bio: Joshua J. Levy is an academic researcher from Dartmouth College. The author has contributed to research in the topics of Medicine and Computer Science, has an h-index of 6, and has co-authored 44 publications receiving 150 citations. Previous affiliations of Joshua J. Levy include Dartmouth–Hitchcock Medical Center and the Joint Genome Institute.


Papers
Journal Article
TL;DR: MethylNet, a DNAm deep learning method that constructs embeddings, makes predictions, generates new data, and uncovers unknown heterogeneity with minimal user supervision, is described and shown to distinguish cellular differences, grasp higher-order information about cancer sub-types, and capture factors associated with smoking in concordance with known differences.
Abstract: DNA methylation (DNAm) is an epigenetic regulator of gene expression programs that can be altered by environmental exposures and aging, and in pathogenesis. Traditional analyses that associate DNAm alterations with phenotypes suffer from multiple hypothesis testing and multi-collinearity due to the high-dimensional, continuous, interacting and non-linear nature of the data. Deep learning analyses have shown much promise for studying disease heterogeneity, but DNAm deep learning approaches have not yet been formalized into user-friendly frameworks for executing, training, and interpreting models. Here, we describe MethylNet, a DNAm deep learning method that can construct embeddings, make predictions, generate new data, and uncover unknown heterogeneity with minimal user supervision. The results of our experiments indicate that MethylNet can study cellular differences, grasp higher-order information about cancer sub-types, estimate age, and capture factors associated with smoking in concordance with known differences. The ability of MethylNet to capture non-linear interactions presents an opportunity for further study of unknown disease processes, cellular heterogeneity, and aging.
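
As an illustration of the embedding step such a framework performs, here is a minimal sketch, assuming a samples-by-CpGs matrix of beta values and toy layer sizes. It is not the MethylNet API, just the kind of autoencoder representation learning it builds on.

```python
# Minimal sketch (not the MethylNet API): learn a low-dimensional embedding
# of DNA methylation beta values (samples x CpGs, values in [0, 1]) with a
# small autoencoder. Data, layer sizes, and names are hypothetical.
import torch
import torch.nn as nn

class MethylationAutoencoder(nn.Module):
    def __init__(self, n_cpgs: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_cpgs, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_cpgs), nn.Sigmoid(),  # betas live in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

betas = torch.rand(128, 10_000)  # hypothetical: 128 samples x 10,000 CpGs
model = MethylationAutoencoder(n_cpgs=betas.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):  # a few epochs for illustration only
    recon, z = model(betas)
    loss = nn.functional.binary_cross_entropy(recon, betas)
    opt.zero_grad()
    loss.backward()
    opt.step()
# z now holds embeddings that downstream classifiers or regressors can consume.
```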

70 citations

Journal Article
TL;DR: A pan-genomic approach is used to reveal gradual polyploid genome evolution by analyzing Brachypodium hybridum and its diploid progenitors; the results are consistent with a gradual accumulation of genomic changes after polyploidization and a lack of subgenome expression dominance.
Abstract: Our understanding of polyploid genome evolution is constrained because we cannot know the exact founders of a particular polyploid. To differentiate between founder effects and post-polyploidization evolution, we use a pan-genomic approach to study the allotetraploid Brachypodium hybridum and its diploid progenitors. Comparative analysis suggests that most B. hybridum whole-gene presence/absence variation is part of the standing variation in its diploid progenitors. Analysis of nuclear single nucleotide variants, plastomes and k-mers associated with retrotransposons reveals two independent origins for B. hybridum, ~1.4 and ~0.14 million years ago. Examination of gene expression in the younger B. hybridum lineage reveals no bias in overall subgenome expression. Our results are consistent with a gradual accumulation of genomic changes after polyploidization and a lack of subgenome expression dominance. Significantly, if we did not use a pan-genomic approach, we would grossly overestimate the number of genomic changes attributable to post-polyploidization evolution.
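
The genome comparisons above rest in part on shared k-mer content. Below is a minimal, self-contained sketch of the k-mer counting step, with purely hypothetical sequences and an arbitrary k; it is not the pipeline used in the paper.

```python
# Minimal sketch of k-mer counting, the primitive behind analyses like the
# retrotransposon k-mer comparison above. Sequences and k are hypothetical.
from collections import Counter

def count_kmers(seq: str, k: int = 21) -> Counter:
    """Count all overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

genome_a = "ACGTACGTGGCATTACGT" * 100  # stand-in for one diploid progenitor
genome_b = "ACGTACGAGGCATTACGT" * 100  # stand-in for the other progenitor
shared = count_kmers(genome_a).keys() & count_kmers(genome_b).keys()
print(f"{len(shared)} k-mers shared between the two sequences")
```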

50 citations

Journal Article
TL;DR: Hybrid procedures yield a statistically significant increase in predictive performance, and the terms they acquire show a clearer association with the outcome than does direct interpretation of the random forest output.
Abstract: Machine learning approaches have become increasingly popular modeling techniques, relying on data-driven heuristics to arrive at their solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the advantage the former gain from built-in model-building search algorithms. This has led to the alignment of statistical and machine learning approaches with different types of problems and the under-development of procedures that combine their attributes. In this context, we hoped to understand the domains of applicability for each approach and to identify areas where a marriage between the two is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each. We present three simple examples to illustrate when to use each modeling approach and posit a general framework for combining them into an enhanced logistic regression model-building procedure that aids interpretation. We study 556 benchmark machine learning datasets to uncover when machine learning techniques outperformed rudimentary logistic regression models and so are potentially well-equipped to enhance them. We illustrate a software package, InteractionTransformer, which embeds logistic regression with advanced model-building capacity by using machine learning algorithms to extract candidate interaction features from a random forest model for inclusion in the model. Finally, we apply our enhanced logistic regression analysis to two real-world biomedical examples, one where predictors vary linearly with the outcome and another with extensive second-order interactions. Preliminary statistical analysis demonstrated that across the 556 benchmark datasets, the random forest approach significantly outperformed the logistic regression approach. We found a statistically significant increase in predictive performance when using hybrid procedures, and greater clarity in how acquired terms associate with the outcome compared to directly interpreting the random forest output. When a random forest model is closer to the true model, hybrid statistical-machine learning procedures can substantially enhance the performance of statistical procedures in an automated manner while preserving easy interpretation of the results. Such hybrid methods may help facilitate widespread adoption of machine learning techniques in the biomedical setting.
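
The hybrid idea lends itself to a short sketch. The following is a minimal illustration, not the InteractionTransformer API: a random forest nominates a shortlist of features (here by a simple importance heuristic rather than the package's extraction method), candidate pairwise products are formed, and the augmented logistic regression is compared with the plain one. Dataset and hyperparameters are arbitrary.

```python
# Minimal sketch of the hybrid statistical-ML idea (not the actual
# InteractionTransformer API): let a random forest nominate candidate
# interactions, then add them to an interpretable logistic regression.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1. Fit the random forest and shortlist features by importance (heuristic).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[-4:]

# 2. Form candidate pairwise interaction terms from the shortlist.
pairs = list(combinations(top, 2))
X_aug = np.hstack([X] + [(X[:, i] * X[:, j])[:, None] for i, j in pairs])

# 3. Compare plain vs. interaction-augmented logistic regression.
base = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean()
hybrid = cross_val_score(LogisticRegression(max_iter=5000), X_aug, y, cv=5).mean()
print(f"baseline accuracy: {base:.3f}, with interactions: {hybrid:.3f}")
```

The augmented model keeps logistic regression's coefficient-level interpretability while borrowing the forest's ability to surface second-order structure.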

35 citations

Proceedings Article
03 Aug 2020
TL;DR: In this paper, Graph Neural Networks (GNNs) and Topological Data Analysis (TDA) are combined to identify and quantify key pathogenic information pathways in colon cancer.
Abstract: Whole-slide images (WSI) are digitized representations of thin sections of stained tissue from various patient sources (biopsy, resection, exfoliation, fluid) and often exceed 100,000 pixels in any given spatial dimension. Deep learning approaches to digital pathology typically extract information from sub-images (patches) and treat the sub-images as independent entities, ignoring contributing information from vital large-scale architectural relationships. Modeling approaches that can capture higher-order dependencies between neighborhoods of tissue patches have demonstrated the potential to improve predictive accuracy while capturing the most essential slide-level information for prognosis, diagnosis and integration with other omics modalities. Here, we review two promising methods for capturing the macro- and micro-architecture of histology images: Graph Neural Networks, which contextualize patch-level information from neighboring patches through message passing, and Topological Data Analysis, which distills contextual information into its essential components. We introduce a modeling framework, WSI-GTFE, that integrates these two approaches in order to identify and quantify key pathogenic information pathways. To demonstrate a simple use case, we utilize these topological methods to develop a tumor invasion score to stage colon cancer.
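
To make the message-passing step concrete, here is a minimal sketch, not the WSI-GTFE implementation: patch embeddings become graph nodes, a k-nearest-neighbor graph is built from patch coordinates, and one graph convolution contextualizes each patch from its neighbors. All data, sizes, and the choice of GCNConv are illustrative assumptions.

```python
# Minimal sketch (not WSI-GTFE): treat WSI patch embeddings as graph nodes,
# connect spatial neighbors, and run one round of message passing.
# Coordinates and features are fake stand-ins for real slide data.
import torch
from torch_geometric.nn import GCNConv

n_patches, feat_dim = 200, 128
coords = torch.rand(n_patches, 2) * 100_000   # patch (x, y) positions on the slide
feats = torch.randn(n_patches, feat_dim)      # e.g. CNN patch embeddings

# Build a k-nearest-neighbor adjacency from patch coordinates.
k = 8
dist = torch.cdist(coords, coords)
knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-neighbor
src = torch.arange(n_patches).repeat_interleave(k)
edge_index = torch.stack([src, knn.reshape(-1)])        # shape [2, n_patches * k]

# One message-passing layer: each patch aggregates its neighbors' features.
conv = GCNConv(feat_dim, 64)
contextualized = torch.relu(conv(feats, edge_index))
print(contextualized.shape)  # torch.Size([200, 64])
```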

27 citations

Journal Article
TL;DR: The results demonstrate that virtual trichrome technologies may offer a software solution that can be employed in the clinical setting as a diagnostic decision aid.
Abstract: Non-alcoholic steatohepatitis (NASH) is a fatty liver disease characterized by accumulation of fat in hepatocytes with concurrent inflammation and is associated with morbidity, cirrhosis and liver failure. After extraction of a liver core biopsy, tissue sections are stained with hematoxylin and eosin (H&E) to grade NASH activity, and stained with trichrome to stage fibrosis. Methods to computationally transform one stain into another on digital whole slide images (WSI) can lessen the need for additional physical staining besides H&E, reducing personnel, equipment, and time costs. Generative adversarial networks (GAN) have shown promise for virtual staining of tissue. We conducted a large-scale validation study of the viability of GANs for H&E to trichrome conversion on WSI (n = 574). Pathologists were largely unable to distinguish real images from virtual/synthetic images given a set of twelve Turing Tests. We report high correlation between staging of real and virtual stains (ρ = 0.86; 95% CI: 0.84–0.88). Stages assigned to both virtual and real stains correlated similarly with a number of clinical biomarkers and with progression to end-stage liver disease (hazard ratio (HR) = 2.06, 95% CI: 1.36–3.12, p < 0.001 for real stains; HR = 2.02, 95% CI: 1.40–2.92, p < 0.001 for virtual stains). Our results demonstrate that virtual trichrome technologies may offer a software solution that can be employed in the clinical setting as a diagnostic decision aid.
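
As a sketch of how such stain-to-stain translation is typically trained, here is a toy pix2pix-style conditional GAN step. This is an assumption about the general approach, not the paper's actual architecture: the networks are tiny stand-ins, the paired patches are fake, and the L1 weight is arbitrary.

```python
# Minimal sketch (not the paper's model): one training step of a pix2pix-style
# conditional GAN mapping an H&E patch to a synthetic trichrome patch.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())         # generator
D = nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(32, 1, 3, padding=1))                    # patch discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

he = torch.rand(4, 3, 64, 64) * 2 - 1          # fake batch of H&E patches
trichrome = torch.rand(4, 3, 64, 64) * 2 - 1   # paired "real" trichrome patches

# Discriminator: real (H&E, trichrome) pairs vs. (H&E, generated) pairs.
fake = G(he).detach()
d_loss = bce(D(torch.cat([he, trichrome], 1)), torch.ones(4, 1, 64, 64)) + \
         bce(D(torch.cat([he, fake], 1)), torch.zeros(4, 1, 64, 64))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator: fool the discriminator while staying close to the real stain (L1).
gen = G(he)
g_loss = bce(D(torch.cat([he, gen], 1)), torch.ones(4, 1, 64, 64)) + \
         100 * nn.functional.l1_loss(gen, trichrome)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```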

23 citations


Cited by
01 Jan 2016
Logistic Regression: A Self-Learning Text

999 citations

Posted Content
TL;DR: It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
Abstract: Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
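
The sparse pattern itself is easy to visualize. Below is a minimal sketch of constructing a BigBird-style attention mask (sliding window, random links, and $O(1)$ global tokens); the sizes are arbitrary, and this shows only the mask-building idea, not the BigBird implementation.

```python
# Minimal sketch of a BigBird-style sparse attention mask: each query attends
# to a local window, a few random positions, and O(1) global tokens.
import numpy as np

rng = np.random.default_rng(0)
seq_len, window, n_random, n_global = 64, 3, 2, 2

mask = np.zeros((seq_len, seq_len), dtype=bool)
for i in range(seq_len):
    lo, hi = max(0, i - window), min(seq_len, i + window + 1)
    mask[i, lo:hi] = True                                          # sliding window
    mask[i, rng.choice(seq_len, n_random, replace=False)] = True   # random links
mask[:, :n_global] = True   # everyone attends to global tokens (e.g. CLS)
mask[:n_global, :] = True   # global tokens attend to everyone

density = mask.sum() / mask.size
print(f"attention density: {density:.2%} (full attention would be 100%)")
```

Because the per-row budget (window + random + global) is constant, total attention cost grows linearly in sequence length rather than quadratically.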

939 citations

01 Jan 2013
TL;DR: In this article, the authors propose an improved density-based, hierarchical clustering method that provides a clustering hierarchy from which a simplified tree of significant clusters can be constructed, and demonstrate that their approach outperforms current state-of-the-art density-based clustering methods.
Abstract: We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.
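
The hdbscan Python package implements this kind of density-based hierarchical clustering with stability-based cluster extraction; a minimal usage sketch on synthetic data follows (all parameters arbitrary).

```python
# Minimal usage sketch of the hdbscan package: hierarchical density-based
# clustering with stability-based selection of the final "flat" clusters.
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)   # -1 marks points left as noise
print(f"found {labels.max() + 1} clusters; "
      f"{(labels == -1).sum()} points labeled noise")
```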

556 citations

01 Nov 2017
TL;DR: ChromHMM combines multiple genome-wide epigenomic maps, uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type, and provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretation of each chromatin state.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretation of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within one day.
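
The core emission model is simple enough to sketch. Below is a minimal illustration of the multivariate Bernoulli emission idea, in which each state emits each mark independently with its own probability; all numbers are made up, and this is not the ChromHMM implementation.

```python
# Minimal sketch of ChromHMM's emission idea (not its Java implementation):
# each chromatin state emits a binary mark vector with independent Bernoulli
# probabilities per mark, so the likelihood of an observed mark combination
# is a product over marks.
import numpy as np

n_states, n_marks = 3, 5
rng = np.random.default_rng(0)
p_emit = rng.uniform(0.05, 0.95, size=(n_states, n_marks))  # P(mark on | state)

def emission_likelihood(obs: np.ndarray) -> np.ndarray:
    """P(observed 0/1 mark vector | each state) under independent Bernoullis."""
    return np.prod(np.where(obs, p_emit, 1.0 - p_emit), axis=1)

segment = np.array([1, 0, 1, 1, 0])   # marks present/absent in one genomic bin
lik = emission_likelihood(segment)
print("most likely emitting state (uniform prior):", lik.argmax())
```

In the full HMM, these emission likelihoods combine with state-transition probabilities along the genome to yield the most probable state per segment.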

364 citations