Showing papers by "Tianwei Yu" published in 2018


Journal ArticleDOI
TL;DR: Forest Deep Neural Network (fDNN) is a newly developed classifier that integrates the deep neural network architecture with a supervised forest feature detector; it learns sparse feature representations and feeds them into a neural network to mitigate the overfitting problem.
Abstract: In predictive model development, gene expression data is associated with the unique challenge that the number of samples (n) is much smaller than the number of features (p). This “n ≪ p” property has kept deep learning techniques, which have proved powerful under “n > p” scenarios in other application fields such as image classification, from being applied to the classification of gene expression data. Further, the sparsity of effective features with unknown correlation structures in gene expression profiles brings more challenges for classification tasks. To tackle these problems, we propose a newly developed classifier named Forest Deep Neural Network (fDNN), to integrate the deep neural network architecture with a supervised forest feature detector. Using this built-in feature detector, the method is able to learn sparse feature representations and feed the representations into a neural network to mitigate the overfitting problem. Simulation experiments and real data analyses using two RNA-seq expression datasets are conducted to evaluate fDNN’s capability. The method is shown to be a useful addition to current predictive models, with better classification performance and more meaningful selected features compared to ordinary random forests and deep neural networks.
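The two-stage idea described in this abstract (a supervised forest that produces a compact representation, followed by a neural network trained on that representation) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's RandomForestClassifier and MLPClassifier, simulated data, and the assumption that each tree's predicted class serves as one new feature. The helper name forest_features and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy "n << p" data: 60 samples, 500 features, only the first 5 informative.
n, p = 60, 500
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Stage 1: a supervised forest acting as the feature detector.
forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)


def forest_features(model, data):
    """Stack each tree's predicted class into one column per tree (assumed encoding)."""
    return np.column_stack([tree.predict(data) for tree in model.estimators_])


Z = forest_features(forest, X)  # shape (n, 50): far fewer columns than p

# Stage 2: a small neural network trained on the forest representation.
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
net.fit(Z, y)
```

In a real analysis both stages would be fitted on training data only and evaluated on held-out samples; the sketch fits and transforms the same data purely to show the data flow.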

113 citations


Journal ArticleDOI
TL;DR: The current findings support the use of untargeted high-resolution metabolomics (HRM) in the development of metabolic biomarkers of traffic pollution exposure and response; the study identified and verified biological perturbations associated with primary traffic pollutants in a panel-based setting with repeated measurements.

111 citations


Journal ArticleDOI
TL;DR: In this paper, Graph-Embedded Deep Feedforward Networks (GEDFN) are proposed to integrate external relational information of features into the deep neural network architecture; the resulting sparse connections between layers help prevent overfitting.
Abstract: Motivation Gene expression data represents a unique challenge in predictive model building, because of the small number of samples (n) compared with the huge number of features (p). This 'n≪p' property has hampered application of deep learning techniques for disease outcome classification. Sparse learning by incorporating external gene network information could be a potential solution to this issue. Still, the problem is very challenging because (i) there are tens of thousands of features and only hundreds of training samples, and (ii) the scale-free structure of the gene network is unfriendly to the setup of convolutional neural networks. Results To address these issues and build a robust classification model, we propose the Graph-Embedded Deep Feedforward Networks (GEDFN), to integrate external relational information of features into the deep neural network architecture. The method achieves sparse connections between network layers to prevent overfitting. To validate the method's capability, we conducted both simulation experiments and real data analysis using a breast invasive carcinoma RNA-seq dataset and a kidney renal clear cell carcinoma RNA-seq dataset from The Cancer Genome Atlas. The resulting high classification accuracy and easily interpretable feature selection results suggest the method is a useful addition to the current graph-guided classification models and feature selection procedures. Availability and implementation The method is available at https://github.com/yunchuankong/GEDFN. Supplementary information Supplementary data are available at Bioinformatics online.
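One way to picture "sparse connection between network layers" guided by a feature graph is a layer whose weight matrix is masked elementwise by the graph's adjacency matrix, so that a hidden unit only receives input from a feature and its graph neighbors. The NumPy sketch below shows a single forward pass under that assumption. The chain graph, the self-loops, and the function name graph_embedded_layer are hypothetical choices for illustration and are not taken from the GEDFN repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature graph on 5 genes: a chain 0-1-2-3-4.
p = 5
A = np.eye(p)  # self-loops keep each feature's own signal (assumption)
for i in range(p - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

W = rng.normal(size=(p, p))  # dense trainable weights
b = np.zeros(p)


def graph_embedded_layer(x, W, A, b):
    """ReLU layer whose weights are masked elementwise by the feature graph."""
    return np.maximum(0.0, x @ (W * A) + b)


x = rng.normal(size=(3, p))  # 3 samples
h = graph_embedded_layer(x, W, A, b)

# Perturbing gene 4 cannot change hidden units 0-2, which are not its neighbors.
x_perturbed = x.copy()
x_perturbed[:, 4] += 10.0
h_perturbed = graph_embedded_layer(x_perturbed, W, A, b)
```

Only 13 of the 25 weights survive the mask here (5 self-loops plus 2 per edge), which is the sense in which the graph reduces the number of free parameters.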

73 citations


Posted Content
TL;DR: Graph-Embedded Deep Feedforward Networks (GEDFN) are proposed to integrate external relational information of features into the deep neural network architecture, achieving sparse connections between network layers to prevent overfitting.
Abstract: Gene expression data represents a unique challenge in predictive model building, because of the small number of samples $(n)$ compared to the huge amount of features $(p)$. This "$n<

41 citations


Journal ArticleDOI
19 Sep 2018-PLOS ONE
TL;DR: Acute exposures to traffic-related air pollutants are associated with broad inflammatory response, including several traditional markers of inflammation, as well as a subclinical immune response.
Abstract: Introduction Advances in liquid chromatography-mass spectrometry (LC-MS) have enabled high-resolution metabolomics (HRM) to emerge as a sensitive tool for measuring environmental exposures and corresponding biological response. Using measurements collected as part of a large, panel-based study of car commuters, the current analysis examines in-vehicle air pollution concentrations, targeted inflammatory biomarker levels, and metabolomic profiles to trace potential metabolic perturbations associated with on-road traffic exposures. Methods A 60-person panel of adults participated in a crossover study, where each participant conducted a highway commute and was randomized to either a side-street commute or a clinic exposure session. In addition to in-vehicle exposure characterizations, participants contributed pre- and post-exposure dried blood spots for 2-hr changes in targeted proinflammatory and vascular injury biomarkers and 10-hr changes in the plasma metabolome. Samples were analyzed on a Thermo QExactive MS system in positive and negative electrospray ionization (ESI) mode. Data were processed and analyzed in R using apLCMS, xMSanalyzer, and limma. Features associated with environmental exposures or biological endpoints were identified with a linear mixed effects model and annotated through human metabolic pathway analysis in mummichog. Results HRM detected 10-hr perturbations in 110 features associated with in-vehicle, particulate metal exposures (Al, Pb, and Fe) which reflect changes in arachidonic acid, leukotriene, and tryptophan metabolism. Two-hour changes in proinflammatory biomarkers hs-CRP, IL-6, IL-8, and IL-1β were also associated with 10-hr changes in the plasma metabolome, suggesting diverse amino acid, leukotriene, and antioxidant metabolism effects. A putatively identified metabolite, 20-OH-LTB4, decreased after in-vehicle exposure to particulate metals, suggesting a subclinical immune response.
Conclusions Acute exposures to traffic-related air pollutants are associated with broad inflammatory response, including several traditional markers of inflammation.

37 citations


01 Apr 2018
TL;DR: The Dorm Room Inhalation to Vehicle Emissions (DRIVE2) study was conducted to measure traditional single-pollutant and novel multipollutant traffic indicators along a complete emission-to-exposure pathway, and found that atmospheric processing further enhanced FPMOPtotal-DTT.
Abstract: Introduction The Dorm Room Inhalation to Vehicle Emissions (DRIVE2) study was conducted to measure traditional single-pollutant and novel multipollutant traffic indicators along a complete emission-to-exposure pathway. The overarching goal of the study was to evaluate the suitability of these indicators for use as primary traffic exposure metrics in panel-based and small-cohort epidemiological studies. Methods Intensive field sampling was conducted on the campus of the Georgia Institute of Technology (GIT) between September 2014 and January 2015 at 8 monitoring sites (2 indoors and 6 outdoors) ranging from 5 m to 2.3 km from the busiest and most congested highway artery in Atlanta. In addition, 54 GIT students living in one of two dormitories either near (20 m) or far (1.4 km) from the highway were recruited to conduct personal exposure sampling and weekly biomonitoring. The pollutants measured were selected to provide information about the heterogeneous particulate and gaseous composition of primary traffic emissions, including the traditional traffic-related species (e.g., carbon monoxide [CO], nitrogen dioxide [NO2], nitric oxide [NO], fine particulate matter [PM2.5], and black carbon [BC]), and of secondary species (e.g., ozone [O3] and sulfate as well as organic carbon [OC], which is both primary and secondary) from traffic and other sources. Along with these pollutants, we also measured two multipollutant traffic indicators: integrated mobile source indicators (IMSIs) and fine particulate matter oxidative potential (FPMOP). IMSIs are derived from elemental carbon (EC), CO, and nitrogen oxide (NOx) concentrations, along with the fractions of these species emitted by gasoline and diesel vehicles, to construct integrated estimates of gasoline and diesel vehicle impacts. 
Our FPMOP indicator was based on an acellular assay involving the depletion of dithiothreitol (DTT), considering both water-soluble and insoluble components (referred to as FPMOPtotal-DTT). In addition, a limited assessment of 18 low-cost sensors was added to the study to supplement the four original aims. Results Pollutant levels measured during the study showed a low impact by this highway hotspot source on its surrounding vicinity. These findings are broadly consistent with results from other studies throughout North America showing decreased relative contributions to urban air pollution from primary traffic emissions. We view these reductions as an indication of a changing near-road environment, facilitated by the effectiveness of mobile source emission controls. Many of the primary pollutant species, including NO, CO, and BC, decreased to near background levels by 20 to 30 m from the highway source. Patterns of correlation among the sites also varied by pollutant and time of day. NO2 exhibited spatial trends that differed from those of the other single-pollutant primary traffic indicators. We believe this was caused by kinetic limitations in the photochemical chemistry, associated with primary emission reductions, required to convert the NO-dominant primary NOx, emitted from automobiles, to NO2. This finding provides some indication of limitations in the use of NO2 as a primary traffic exposure indicator in panel-based health effect studies. Roadside monitoring of NO, CO, and BC tended to be more strongly correlated with sites, both near and far from the road, during morning rush hour periods and often weakly to moderately correlated during other time periods of the day. This pattern was likely associated with diurnal changes in mixing and chemistry and their impact on spatial heterogeneity across the campus. 
Among our candidate multipollutant primary traffic indicators, we report several key findings related to the use of oxidative potential (OP)-based indicators. Although earlier studies have reported elevated levels of FPMOP in direct exhaust emissions, we found that atmospheric processing further enhanced FPMOPtotal-DTT, likely associated with the oxidation of primary polycyclic aromatic hydrocarbons (PAHs) to quinones and hydroxyquinones and with the oxidization and water solubility of metals. This has important implications in terms both of the utility of FPMOPtotal-DTT as a marker for exhaust emissions and of the importance of atmospheric processing of particulate matter (PM) being tied to potential health outcomes. The results from the personal exposure monitoring also point to the complexity and diversity of the spatiotemporal variability patterns among the study monitoring sites and the importance of accounting for location and spatial mobility when estimating exposures in panel-based and small-cohort studies. This was most clearly demonstrated with the personal BC measurements, where ambient roadside monitoring was shown to be a poor surrogate for exposures to BC. Alternative surrogates, including ambient and indoor BC at the participants' respective dorms, were more strongly associated with personal BC, and knowledge of the participants' mean proximity to the highway was also shown to explain a substantial level of the variability in corresponding personal exposures to both BC and NO2. In addition, untargeted metabolomic indicators measured in plasma and saliva, which represent emerging methods for measuring exposure, were used to extract approximately 20,000 and 30,000 features from plasma and saliva, respectively. Using hydrophilic interaction liquid chromatography (HILIC) in the positive ion mode, we identified 221 plasma features that differed significantly between the two dorm cohorts. 
The bimodal distribution of these features in the HILIC column was highly idiosyncratic; one peak consisted of features with elevated intensities for participants living in the near dorm; the other consisted of features with elevated intensities for participants in the far dorm. Both peaks were characterized by relatively short retention times, indicative of the hydrophobicity of the identified features. The results from the metabolomics analyses provide a strong basis for continuing this work toward specific chemical validation of putative biomarkers of traffic-related pollution. Finally, the study had a supplemental aim of examining the performance of 18 low-cost CO, NO, NO2, O3, and PM2.5 pollutant sensors. These were colocated alongside the other study monitors and assessed for their ability to capture temporal trends observed by the reference monitoring instrumentation. Generally, we found the performance of the low-cost gas-phase sensors to be promising after extensive calibration; the uncalibrated measurements alone, however, would likely not have led to reliable results. The low-cost PM sensors we evaluated had poor accuracy, although PM sensor technology is evolving quickly and warrants future attention. Conclusions An immediate implication of the changing near-road environment is that future studies aimed at characterizing hotspots related to mobile sources and their impacts on health will need to consider multiple approaches for characterizing spatial gradients and exposures. Specifically and most directly, the mobile source contributions to ambient concentrations of single-pollutant indicators of traffic exposure are not as distinguishable as they have been in the past. Collectively, the study suggests that characterizing exposures to traffic-related pollutants, which is already difficult, will become more difficult because of the reduction in traffic-related emissions.
Additional multi-tiered approaches should be considered along with traditional measurements, including the use of alternative OP measures beyond those based on DTT assays, metabolomics, low-cost sensors, and air quality modeling.

15 citations


Journal ArticleDOI
TL;DR: An adaptive method is designed to directly adjust the dissimilarity matrix between samples. In simulations it recovered true underlying clusters better than the leading batch effect adjustment method ComBat, and in real data it effectively corrected distance matrices and improved the performance of clustering algorithms.
Abstract: Motivation It is well known that batch effects exist in RNA-seq data and other profiling data. Although some methods do a good job adjusting for batch effects by modifying the data matrices, it is still difficult to remove the batch effects entirely. The remaining batch effect can cause artifacts in the detection of patterns in the data. Results In this study, we consider the batch effect issue in the pattern detection among the samples, such as clustering, dimension reduction and construction of networks between subjects. Instead of adjusting the original data matrices, we design an adaptive method to directly adjust the dissimilarity matrix between samples. In simulation studies, the method achieved better results recovering true underlying clusters, compared to the leading batch effect adjustment method ComBat. In real data analysis, the method effectively corrected distance matrices and improved the performance of clustering algorithms. Availability and implementation The R package is available at: https://github.com/tengfei-emory/QuantNorm. Supplementary information Supplementary data are available at Bioinformatics online.

15 citations


Journal ArticleDOI
Tianwei Yu
TL;DR: A new method, Dynamic Correlation Analysis (DCA), directly identifies strong latent dynamic correlation signals from the data matrix, using a new metric to identify pairs of variables that are highly likely to be dynamically correlated without knowing the underlying physiological states that govern the dynamic correlation.
Abstract: Dynamic correlations are pervasive in high-throughput data. Large numbers of gene pairs can change their correlation patterns in response to observed/unobserved changes in physiological states. Finding changes in correlation patterns can reveal important regulatory mechanisms. Currently there is no method that can effectively detect global dynamic correlation patterns in a dataset. Given the challenging nature of the problem, the currently available methods use genes as surrogate measurements of physiological states, which cannot faithfully represent true underlying biological signals. In this study we develop a new method that directly identifies strong latent dynamic correlation signals from the data matrix, named DCA: Dynamic Correlation Analysis. At the center of the method is a new metric for the identification of pairs of variables that are highly likely to be dynamically correlated, without knowing the underlying physiological states that govern the dynamic correlation. We validate the performance of the method with extensive simulations. We applied the method to three real datasets: a single cell RNA-seq dataset, a bulk RNA-seq dataset, and a microarray gene expression dataset. In all three datasets, the method reveals novel latent factors with clear biological meaning, bringing new insights into the data.
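To make the notion of a dynamic correlation concrete, the simulation below builds a pair of variables whose correlation flips sign with a hidden binary state. The overall correlation is close to zero even though the pair is strongly correlated within each state, which is why such pairs are hard to find when the state is unobserved. This is only an illustration of the phenomenon on simulated data; it does not implement the DCA metric, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

z = rng.choice([-1.0, 1.0], size=n)  # hypothetical latent physiological state
x = rng.normal(size=n)
y = z * x + 0.3 * rng.normal(size=n)  # sign of the x-y relation follows z

# Pooled over all samples, the two opposite relations cancel out.
overall = np.corrcoef(x, y)[0, 1]

# Conditioning on the (normally unobserved) state reveals strong correlations.
in_state_pos = np.corrcoef(x[z > 0], y[z > 0])[0, 1]
in_state_neg = np.corrcoef(x[z < 0], y[z < 0])[0, 1]
```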

14 citations


Journal ArticleDOI
TL;DR: An imputation algorithm that incorporates the existing metabolic network, adduct ion relations even for unknown compounds, as well as linear and nonlinear associations between feature intensities to build a feature-level network is proposed.
Abstract: Motivation Metabolomics data generated from liquid chromatography-mass spectrometry platforms often contain missing values. Existing imputation methods do not consider underlying feature relations and the metabolic network information. As a result, the imputation results may not be optimal. Results We proposed an imputation algorithm that incorporates the existing metabolic network, adduct ion relations even for unknown compounds, as well as linear and nonlinear associations between feature intensities to build a feature-level network. The algorithm uses support vector regression for missing value imputation based on features in the neighborhood on the network. We compared our proposed method with widely used methods. As judged by the normalized root mean squared error in real data-based simulations, our proposed method can achieve better accuracy. Availability and implementation The R package is available at http://web1.sph.emory.edu/users/tyu8/MINMA. Contact jiankang@umich.edu or tianwei.yu@emory.edu. Supplementary information Supplementary data are available at Bioinformatics online.
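The imputation step described above (support vector regression on a feature's network neighbors) might look like the following sketch. It is an illustration under assumptions rather than the MINMA package: it uses scikit-learn's SVR on simulated intensities, the feature-level network is a hand-written dictionary rather than one built from metabolic and adduct relations, and the function name impute_with_neighbors and the SVR settings are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy intensity matrix: 40 samples x 4 features; feature 0 depends on 1 and 2.
n = 40
X = rng.normal(size=(n, 4))
X[:, 0] = 0.8 * X[:, 1] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)

# Hypothetical feature-level network: neighbors of feature 0.
neighbors = {0: [1, 2]}

# Knock out some values of feature 0 to mimic missingness.
truth = X[:, 0].copy()
missing = np.zeros(n, dtype=bool)
missing[:8] = True
X[missing, 0] = np.nan


def impute_with_neighbors(X, target, neighbor_idx):
    """Fit SVR where `target` is observed, then predict it where it is missing."""
    X = X.copy()
    miss = np.isnan(X[:, target])
    model = SVR(kernel="rbf", C=10.0, epsilon=0.05)
    model.fit(X[~miss][:, neighbor_idx], X[~miss, target])
    X[miss, target] = model.predict(X[miss][:, neighbor_idx])
    return X


X_imputed = impute_with_neighbors(X, 0, neighbors[0])

# Root mean squared error on the held-out entries of feature 0.
rmse = np.sqrt(np.mean((X_imputed[missing, 0] - truth[missing]) ** 2))
```

The sketch assumes the neighbor features themselves are fully observed; handling missing values among the predictors would need an extra step.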

13 citations


Journal ArticleDOI
01 Dec 2018-Virology
TL;DR: The role of FcγR variants on HIV acquisition, viral control, and disease progression in two longitudinal heterosexual transmission cohorts with HIV subtypes A and C as the major circulating viruses is assessed.

7 citations


Posted Content
TL;DR: A novel prior model for Bayesian network marker selection in the generalized linear model (GLM) framework is proposed: the Thresholded Graph Laplacian Gaussian (TGLG) prior, which adopts the graph Laplacian matrix to characterize the conditional dependence between neighboring markers accounting for the global network structure.
Abstract: Selecting informative nodes over large-scale networks becomes increasingly important in many research areas. Most existing methods focus on the local network structure and incur heavy computational costs for the large-scale problem. In this work, we propose a novel prior model for Bayesian network marker selection in the generalized linear model (GLM) framework: the Thresholded Graph Laplacian Gaussian (TGLG) prior, which adopts the graph Laplacian matrix to characterize the conditional dependence between neighboring markers accounting for the global network structure. Under mild conditions, we show the proposed model enjoys posterior consistency with a diverging number of edges and nodes in the network. We also develop a Metropolis-adjusted Langevin algorithm (MALA) for efficient posterior computation, which is scalable to large-scale networks. We illustrate the superiority of the proposed method over existing alternatives via extensive simulation studies and an analysis of the breast cancer gene expression dataset in The Cancer Genome Atlas (TCGA).


Journal ArticleDOI
Tianwei Yu
TL;DR: A method of variable selection for the sparse generalized additive model is presented that assumes no specific functional form; an approach termed “roughening” is devised to adjust the residuals in the iterations.
Abstract: We present a method of variable selection for the sparse generalized additive model. The method doesn't assume any specific functional form, and can select from a large number of candidates. It takes the form of incremental forward stagewise regression. Because no functional form is assumed, we devised an approach termed "roughening" to adjust the residuals in the iterations. In simulations, we show the new method is competitive against popular machine learning approaches. We also demonstrate its performance using some real datasets. The method is available as a part of the nlnet package on CRAN (https://cran.r-project.org/package=nlnet).
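For readers unfamiliar with incremental forward stagewise regression, the classical linear version is sketched below: at each iteration the predictor most correlated with the current residual has its coefficient nudged by a small step, and the residual is updated. This is the textbook linear procedure only. The method in the abstract assumes no functional form and adds a "roughening" adjustment of the residuals, neither of which is reproduced here; the function name, step size, and simulated data are hypothetical.

```python
import numpy as np


def forward_stagewise(X, y, step=0.01, n_iter=1000):
    """Linear incremental forward stagewise regression on standardized columns."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    residual = y - y.mean()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        corr = X.T @ residual / len(y)  # covariance of each predictor with residual
        j = np.argmax(np.abs(corr))     # predictor most associated with the residual
        delta = step * np.sign(corr[j])
        beta[j] += delta                # nudge only that coefficient
        residual -= delta * X[:, j]     # remove the newly explained part
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.5 * rng.normal(size=200)

beta = forward_stagewise(X, y)
selected = np.argsort(-np.abs(beta))[:2]  # indices of the two largest coefficients
```

Because each step is small, coefficients of irrelevant predictors stay near zero, which is what makes the procedure usable for selecting from a large number of candidates.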