
Showing papers presented at "International Conference on Bioinformatics in 2017"


Proceedings ArticleDOI
20 Aug 2017
TL;DR: A novel unsupervised molecular embedding method that provides a continuous feature vector for each molecule for further tasks, e.g., solubility classification, and is robust and task-insensitive.
Abstract: Many of today's drug discoveries require expert knowledge and extremely expensive biological experiments to identify chemical molecular properties. However, despite the growing interest in using supervised machine learning algorithms to identify those chemical molecular properties automatically, there has been little improvement in performance and accuracy due to the limited amount of training data. In this paper, we propose a novel unsupervised molecular embedding method, providing a continuous feature vector for each molecule to perform further tasks, e.g., solubility classification. In the proposed method, a multi-layered Gated Recurrent Unit (GRU) network is used to map the input molecule into a continuous feature vector of fixed dimensionality, and then another deep GRU network is employed to decode the continuous vector back to the original molecule. As a result, the continuous encoding vector is expected to contain enough information to recover the original molecule and predict its chemical properties. The proposed embedding method can utilize almost unlimited molecule data for the training phase. With sufficient information encoded in the vector, the proposed method is also robust and task-insensitive. The performance and robustness are confirmed and interpreted in our extensive experiments.
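As a rough illustration of the encoder-decoder idea described above, the sketch below maps tokenized SMILES strings to a fixed-size vector with a multi-layer GRU encoder and reconstructs the token sequence with a GRU decoder. The vocabulary size, dimensions, and class name are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a GRU encoder-decoder over tokenized SMILES strings.
# Vocabulary size, embedding/hidden dimensions, and layer counts are
# illustrative placeholders, not the configuration used in the paper.
import torch
import torch.nn as nn

class MolecularAutoencoder(nn.Module):
    def __init__(self, vocab_size=64, emb_dim=128, hidden_dim=256, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder: maps a token sequence to a fixed-size continuous vector.
        self.encoder = nn.GRU(emb_dim, hidden_dim, num_layers, batch_first=True)
        # Decoder: reconstructs the token sequence from that vector.
        self.decoder = nn.GRU(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):                        # tokens: (batch, seq_len)
        _, h = self.encoder(self.embed(tokens))
        return h[-1]                                 # (batch, hidden_dim) embedding

    def forward(self, tokens):
        _, h = self.encoder(self.embed(tokens))      # encoder state seeds the decoder
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)                     # per-position token logits

model = MolecularAutoencoder()
dummy = torch.randint(0, 64, (8, 40))                # 8 fake SMILES of length 40
reconstruction_logits = model(dummy)
embedding = model.encode(dummy)                      # feature vector for downstream tasks
```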

120 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: A gated recurrent unit-based recurrent neural network with hierarchical attention for mortality prediction; the model outperforms baseline models in prediction accuracy and its interpretability is demonstrated in visualizations.
Abstract: The increasing accumulation of healthcare data provides researchers with ample opportunities to build machine learning approaches for clinical decision support and to improve the quality of health care. Several studies have developed conventional machine learning approaches that rely heavily on manual feature engineering and result in task-specific models for health care. In contrast, healthcare researchers have begun to use deep learning, which has emerged as a revolutionary machine learning technique that obviates manual feature engineering but still achieves impressive results in research fields such as image classification. However, few of these studies have addressed the lack of interpretability of deep learning models, although interpretability is essential for the successful adoption of machine learning approaches by healthcare communities. In addition, the unique characteristics of healthcare data, such as high dimensionality and temporal dependencies, pose challenges for building models on healthcare data. To address these challenges, we develop a gated recurrent unit-based recurrent neural network with hierarchical attention for mortality prediction, and then evaluate the model using the diagnostic codes from the Medical Information Mart for Intensive Care. We find that the model outperforms baseline models in prediction accuracy and demonstrate the interpretability of the model in visualizations.
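A minimal sketch of the attention idea follows, assuming a single attention level over visit-level input vectors (the paper's model is hierarchical, attending over both codes and visits); names and sizes are placeholders.

```python
# Minimal sketch of a GRU with attention over time steps for binary
# mortality prediction; dimensions and the single attention level are
# illustrative (the paper uses hierarchical attention).
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)         # scores each time step
        self.clf = nn.Linear(hidden_dim, 1)

    def forward(self, x):                             # x: (batch, visits, input_dim)
        h, _ = self.gru(x)                            # (batch, visits, hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over visits
        context = (weights * h).sum(dim=1)            # weighted summary of the stay
        return torch.sigmoid(self.clf(context)), weights  # risk score + interpretable weights

model = AttentionGRU()
risk, attention = model(torch.randn(4, 10, 128))      # 4 patients, 10 visits each
```

The attention weights returned alongside the risk score are what makes this kind of model inspectable: high-weight visits can be shown to clinicians as the visualization described in the abstract.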

89 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: A multi-view deep learning model is proposed to capture brain abnormality from multi-channel epileptic EEG signals and is shown to be effective in detecting epileptic seizures.
Abstract: With the advances in pervasive sensor technologies, physiological signals can be captured continuously to prevent the serious outcomes caused by epilepsy. Detection of epileptic seizure onset from collected multi-channel electroencephalogram (EEG) signals has attracted considerable attention recently. Deep learning is a promising method for analyzing large-scale unlabeled data. In this paper, we propose a multi-view deep learning model to capture brain abnormality from multi-channel epileptic EEG signals for seizure detection. Specifically, we first generate EEG spectrograms using the short-time Fourier transform (STFT) to represent the time-frequency information after signal segmentation. Second, we adopt stacked sparse denoising autoencoders (SSDA) to learn multiple features in an unsupervised manner by considering both intra- and inter-channel correlations, denoted as intra-channel and cross-channel features, respectively. Third, we add an SSDA-based channel selection procedure using a proposed response rate to reduce the dimension of the intra-channel features. Finally, we concatenate the learned features and apply a fully-connected SSDA model with a softmax classifier to jointly learn the cross-patient seizure detector in a supervised fashion. To evaluate the performance of the proposed model, we carry out experiments on a real-world benchmark EEG dataset and compare it with six baselines. Extensive experimental results demonstrate that the proposed learning model is able to extract latent features with meaningful interpretation, and hence is effective in detecting epileptic seizures.
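A minimal sketch of the first step (spectrogram generation) is shown below, assuming a placeholder sampling rate and window length; it only illustrates how one EEG channel segment becomes a time-frequency input for the autoencoders.

```python
# Minimal sketch of turning one EEG channel segment into a time-frequency
# spectrogram with a short-time Fourier transform. Sampling rate and window
# length are placeholder values, not the paper's settings.
import numpy as np
from scipy.signal import stft

fs = 256                                    # assumed sampling rate in Hz
segment = np.random.randn(fs * 4)           # stand-in for a 4-second EEG segment
freqs, times, Zxx = stft(segment, fs=fs, nperseg=fs)
spectrogram = np.abs(Zxx)                   # magnitude spectrogram fed to the autoencoders
print(spectrogram.shape)                    # (frequency bins, time frames)
```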

88 citations


Proceedings ArticleDOI
06 Jan 2017
TL;DR: The technique of transfer learning not only overcomes the unsatisfactory performance of traditional approaches, but also breaks the obstacle of limited samples for building deep CNNs.
Abstract: Automatic classification of breast mass lesions in mammographic images remains an unsolved problem. This paper explored the technique of transfer learning to tackle this problem. It utilized the convolutional neural networks (CNNs) GoogLeNet and AlexNet, pre-trained on a large-scale visual database. The performance was evaluated on a new dataset in terms of the area under the receiver operating characteristic curve (AUC). Results demonstrate that GoogLeNet (AUC=0.88) outperforms AlexNet (AUC=0.83) and other state-of-the-art traditional approaches in breast cancer diagnosis. The technique of transfer learning not only overcomes the unsatisfactory performance of traditional approaches, but also breaks the obstacle of limited samples for building deep CNNs.
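A hedged sketch of the transfer-learning setup follows, assuming a torchvision GoogLeNet backbone and a two-class head; the fine-tuning protocol and data pipeline are not taken from the paper.

```python
# Minimal sketch of transfer learning: a GoogLeNet pretrained on a large
# natural-image database, with its final layer replaced for the two-class
# (benign vs. malignant) mammogram task.
import torch.nn as nn
from torchvision import models

net = models.googlenet(pretrained=True)           # ImageNet-pretrained backbone
net.fc = nn.Linear(net.fc.in_features, 2)         # new head for benign vs. malignant

# Optionally freeze the pretrained features and train only the new head,
# which helps when labeled mammograms are scarce.
for name, param in net.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```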

69 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: To the best of the authors' knowledge, this is the first work enabling both automated detection and diagnosis of suspicious regions in one step from full mammogram images; the framework's ability to both locate regions of interest and diagnose them is demonstrated.
Abstract: Detection of suspicious regions in mammogram images and the subsequent diagnosis of these regions remain a challenging problem in the medical world. There still exists an alarming rate of misdiagnosis of breast cancer, resulting in both overtreatment through incorrect positive diagnoses of cancer and undertreatment through overlooked cancerous masses. Convolutional neural networks have shown strong applicability to various image datasets, enabling detailed features to be learned from the data and, as a result, the ability to classify these images at extremely low error rates. To overcome the difficulty of diagnosing breast cancer from mammogram images, we propose a framework for automated breast cancer detection and diagnosis, called BC-DROID, which provides automated region-of-interest detection and diagnosis using convolutional neural networks. BC-DROID first pretrains on physician-defined regions of interest in mammogram images. It then trains on the full mammogram images. The resulting network is able to detect and classify regions of interest as cancerous or benign in one step. We demonstrate our framework's ability to both locate regions of interest and diagnose them. Our framework achieves a detection accuracy of up to 90% and a classification accuracy of 93.5% (AUC of 92.315%). To the best of our knowledge, this is the first work enabling both automated detection and diagnosis of these areas in one step from full mammogram images. Using our framework's website, a user can upload a single mammogram image, visualize suspicious regions, and receive automated diagnoses for these regions.

59 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is shown that pretraining and the use of deep residual networks are crucial to seeing large improvements in Alzheimer's Disease diagnosis from brain MRIs.
Abstract: We propose a framework that leverages deep residual CNNs pretrained on large, non-biomedical image data sets. These pretrained networks learn cross-domain features that improve low-level interpretation of images. We evaluate our model on brain imaging data and show that pretraining and the use of deep residual networks are crucial to seeing large improvements in Alzheimer's Disease diagnosis from brain MRIs.

55 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is shown that preprocessing data with SeqyClean improves both de novo genome assembly and genome mapping and that, according to the authors' tests, SeqyClean outperforms other available preprocessing tools.
Abstract: Modern high-throughput sequencing instruments produce massive amounts of data, which often contain noise in the form of sequencing errors, sequencing adaptors, and contaminating reads. This noise complicates genomics studies. Although many preprocessing software tools have been developed to reduce sequence noise, many of them cannot handle data from multiple technologies and few address more than one type of noise. We present SeqyClean, a comprehensive preprocessing software pipeline. SeqyClean effectively removes multiple sources of noise in high-throughput sequence data and, according to our tests, outperforms other available preprocessing tools. We show that first preprocessing data with SeqyClean improves both de novo genome assembly and genome mapping. We have used SeqyClean extensively in the genomics core at the Institute for Bioinformatics and Evolutionary STudies (IBEST) at the University of Idaho, so it has been validated with both test and production data. SeqyClean is available as open source software under the MIT License at http://github.com/ibest/seqyclean

55 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: The Bidirectional LSTM recurrent neural network model achieves state-of-the-art performance compared with other machine learning approaches and shows strong robustness as evaluated by cross-validation.
Abstract: Brain fog, also known as confusion, is one of the main causes of low performance in learning or in any daily task that requires thinking. Detecting confusion in the human mind in real time is a challenging and important task that can be applied to online education, driver fatigue detection, and so on. In this paper, we applied Bidirectional LSTM Recurrent Neural Networks to classify students' confusion. The results show that the Bidirectional LSTM model achieves state-of-the-art performance compared with other machine learning approaches and shows strong robustness as evaluated by cross-validation. We can predict whether or not a student is confused with an accuracy of 73.3%. Furthermore, we find that the most important feature for detecting confusion is the gamma-1 wave of the EEG signal. Our results suggest that machine learning is a potentially powerful tool to model and understand brain activity.
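A minimal sketch of a bidirectional LSTM classifier over EEG feature sequences is given below, with illustrative feature and layer sizes rather than the paper's exact setup.

```python
# Minimal sketch of a bidirectional LSTM classifier over EEG feature
# sequences (e.g., band powers per time step); all sizes are assumptions.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features=11, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.clf = nn.Linear(2 * hidden_dim, 1)      # forward + backward states

    def forward(self, x):                            # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.clf(out[:, -1]))   # confused / not confused

model = BiLSTMClassifier()
prob_confused = model(torch.randn(16, 30, 11))        # 16 clips, 30 time steps each
```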

45 citations


Journal ArticleDOI
23 Nov 2017
TL;DR: The findings indicate that the number of studies published from 2009 to 2017 in this field increased significantly, having reached a peak in 2014, and five dimensions associated with the Web 4.0 paradigm are identified.
Abstract: Web 4.0 is a new evolution of the Web paradigm based on multiple models, technologies, and social relationships. The concept of Web 4.0 is not entirely clear or unanimous in the literature, because it comprises several dimensions. In this sense, this study uses a systematic review approach to clarify the concept of Web 4.0 and explore its various dimensions, analyzing whether they have elements in common. The findings indicate that the number of studies published from 2009 to 2017 in this field increased significantly, having reached a peak in 2014. Furthermore, we identified five dimensions associated with the Web 4.0 paradigm, in which the terms “pervasive computing” and “ubiquitous computing” are the most widely used in the literature. On the other hand, terms such as “Web 4.0”, “symbiotic Web” and “Web social computing” are not often used.

35 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This paper presents a parallel implementation of a DNA analysis pipeline based on the big data Apache Spark framework that is highly scalable and capable of parallelizing computation by utilizing data-level parallelism as well as load balancing techniques.
Abstract: In recent years, the cost of NGS (Next Generation Sequencing) technology has dropped dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. However, many components of the GATK pipeline are not easily parallelizable. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the big data Apache Spark framework. This implementation is highly scalable and capable of parallelizing computation by utilizing data-level parallelism as well as load balancing techniques. In order to reduce the analysis cost, the framework can run on nodes with as little memory as 16 GB. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
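A minimal sketch of the data-level parallelism idea in PySpark follows, assuming pre-split FASTQ chunks on HDFS; `align_chunk` and the paths are hypothetical stand-ins, not part of SparkGA's actual interface.

```python
# Minimal sketch of data-level parallelism with Spark: split the input into
# chunks, process chunks in parallel across the cluster, then write results.
# `align_chunk` is a hypothetical placeholder for a pipeline stage.
from pyspark import SparkContext

def align_chunk(reads):
    # Placeholder: run an aligner on this chunk of reads and return records.
    return [r.upper() for r in reads]

sc = SparkContext(appName="dna-pipeline-sketch")
chunks = sc.wholeTextFiles("hdfs:///data/fastq_chunks/")          # (path, content) pairs
aligned = chunks.flatMap(lambda kv: align_chunk(kv[1].splitlines()))
aligned.saveAsTextFile("hdfs:///data/aligned_chunks/")
sc.stop()
```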

33 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: An experimental analysis shows that the ILP approach is able to explain data that do not fit the perfect phylogeny assumption, thereby allowing multiple losses and gains of mutations, and a number of subpopulations that is smaller than the number of input mutations.
Abstract: Most evolutionary history reconstruction approaches are based on the infinite sites assumption underlying the Perfect Phylogeny model, one of the most widely used models in cancer genomics. Recent results give strong evidence that recurrent and back mutations are present in the evolutionary history of tumors [19], showing that models more general than the Perfect Phylogeny are required. To address this problem we propose a framework based on the notion of Incomplete Perfect Phylogeny. Our framework incorporates losing and gaining mutations, hence including the Dollo and the Camin-Sokal models, and is described with an Integer Linear Programming (ILP) formulation. Our approach generalizes the notion of persistent phylogeny [1] and the ILP approach [14,15] proposed to solve the corresponding phylogeny reconstruction problem on character data. The final goal of our paper is to integrate our approach into an ILP formulation of the problem of reconstructing trees on mixed populations, where the input data consist of the fraction of cells in a set of samples that have a certain mutation. This is a fundamental problem in cancer genomics, where the goal is to study the evolutionary history of a tumor. An experimental analysis shows that our ILP approach is able to explain data that do not fit the perfect phylogeny assumption, thereby allowing (1) multiple losses and gains of mutations, and (2) a number of subpopulations that is smaller than the number of input mutations.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: DeepCCI as discussed by the authors uses a simplified molecular input line entry system (SMILES), which is a string notation representing the chemical structure, instead of learning from crafted features, to discover hidden representations for the SMILES strings, using convolutional neural networks (CNNs).
Abstract: Chemical-chemical interaction (CCI) plays a key role in predicting candidate drugs, toxicity, therapeutic effects, and biological functions. In various types of chemical analyses, computational approaches are often required due to the amount of data that needs to be handled. The recent remarkable growth and outstanding performance of deep learning have attracted considerable research attention. However, even in state-of-the-art drug analysis methods, deep learning continues to be used only as a classifier, although it is capable of not only simple classification but also automated feature extraction. In this paper, we propose the first end-to-end learning method for CCI, named DeepCCI. Hidden features are derived from the simplified molecular input line entry system (SMILES), a string notation representing the chemical structure, instead of being learned from crafted features. To discover hidden representations for the SMILES strings, we use convolutional neural networks (CNNs). To guarantee the commutative property for homogeneous interaction, we apply model sharing and hidden representation merging techniques. The performance of DeepCCI was compared with a plain deep classifier and conventional machine learning methods. The proposed DeepCCI showed the best performance in all seven evaluation metrics used. In addition, the commutative property was experimentally validated. The features extracted automatically through end-to-end SMILES learning alleviate the significant effort required for manual feature engineering. It is expected to improve prediction performance in drug analyses.
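A minimal sketch of the commutativity idea follows: both chemicals pass through one shared encoder and their representations are merged with a symmetric operation, so the prediction is independent of input order. The 1-D CNN encoder and sizes are illustrative, not DeepCCI's exact architecture.

```python
# Minimal sketch of a weight-shared encoder with a symmetric merge, so that
# f(a, b) == f(b, a). Layer choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SymmetricCCI(nn.Module):
    def __init__(self, vocab_size=64, emb_dim=64, channels=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=5, padding=2)
        self.clf = nn.Sequential(nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 1))

    def encode(self, smiles_tokens):                     # (batch, seq_len)
        x = self.embed(smiles_tokens).transpose(1, 2)    # (batch, emb_dim, seq_len)
        return torch.relu(self.conv(x)).max(dim=2).values  # global max pooling

    def forward(self, a, b):
        merged = self.encode(a) + self.encode(b)         # symmetric merge (sum)
        return torch.sigmoid(self.clf(merged))

model = SymmetricCCI()
a = torch.randint(0, 64, (2, 50))
b = torch.randint(0, 64, (2, 50))
assert torch.allclose(model(a, b), model(b, a))           # order does not matter
```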

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The experimental results show the effectiveness of dependency and AMR embeddings in the DDI extraction task, and a novel syntactic embedding approach using AMR, which aims to abstract away from syntactic idiosyncrasies and attempts to capture only the core meaning of a sentence, which could potentially improve D DI extraction from sentences.
Abstract: A drug-drug interaction (DDI) is an unexpected change in a drug's effect on the human body when the drug and a second drug are co-prescribed and taken together. As many DDIs are frequently reported in the biomedical literature, it is important to mine DDI information from the literature to keep DDI knowledge up to date. SemEval challenges in 2011 and 2013 were designed to tackle this task, where the best system achieved an F1 score of 0.80. In this paper, we propose to utilize dependency embeddings and Abstract Meaning Representation (AMR) embeddings as features for extracting DDIs. Our contribution is two-fold. First, we employed dependency embeddings, previously shown to be effective for sentence classification, for DDI extraction. The dependency embeddings incorporate structural syntactic contexts, which are not present in conventional word embeddings. Second, we proposed a novel syntactic embedding approach using AMR. AMR aims to abstract away from syntactic idiosyncrasies and attempts to capture only the core meaning of a sentence, which could potentially improve DDI extraction from sentences. Two classifiers (Support Vector Machine and Random Forest) taking these embedding features as input were evaluated on the DDIExtraction 2013 challenge corpus. The experimental results show the effectiveness of dependency and AMR embeddings in the DDI extraction task. The best performance was obtained by combining word, dependency, and AMR embeddings (F1 score = 0.84).
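A minimal sketch of the classification step follows, assuming the word, dependency, and AMR embeddings for each candidate pair have already been computed; random arrays stand in for the real features.

```python
# Minimal sketch: concatenate word, dependency, and AMR embedding features
# per candidate drug pair and train an SVM. Random arrays are stand-ins for
# the real precomputed embeddings.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

n_pairs = 200
word_emb = np.random.randn(n_pairs, 200)       # conventional word embeddings
dep_emb = np.random.randn(n_pairs, 200)        # dependency-based embeddings
amr_emb = np.random.randn(n_pairs, 200)        # AMR-based embeddings
X = np.hstack([word_emb, dep_emb, amr_emb])    # combined feature vector
y = np.random.randint(0, 2, n_pairs)           # interaction vs. no interaction

clf = SVC(kernel="rbf")
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```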

Journal ArticleDOI
01 Dec 2017
TL;DR: In this article, the authors examined how higher education students across different generations are embracing electronic books in their studies and found that the acceptance of e-books is varied among different generations of students.
Abstract: Technology is becoming increasingly embedded into our lives, and we are seeing a push towards digitization and online access. This can be a challenge for some, because users' level of technical ability varies across generations and other factors. Predicting technological innovations and how they might supplement, integrate with, or entirely replace existing technology is nearly impossible. These changes and innovations include many within the realm of education, including the relatively recent advent and increasing presence of e-Book sources and platforms. This study examines how higher education students across different generations are embracing electronic books in their studies. Students have more distractions than ever before, and using mobile devices wisely while retaining the ability to concentrate when needed can be a challenge. Just because e-Book options are available and increasingly adopted does not necessarily mean they are preferred by students. This study contributes to our understanding of e-Book acceptance across different generations of students.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: This study introduces the first ontology-based information extraction model for finding shifts in established knowledge in the medical domain using research paper abstracts, and finds 102 inconsistencies relevant to the microRNA domain.
Abstract: Searching for a cure for cancer is one of the most vital pursuits in modern medicine, and microRNA research plays a key role in that pursuit. Keeping track of the shifts and changes in established knowledge in the microRNA domain is very important. In this paper, we introduce an Ontology-Based Information Extraction method to detect occurrences of inconsistencies in microRNA research paper abstracts. We propose a method that first uses the Ontology for MIcroRNA Targets (OMIT) to extract triples from the abstracts. Then we introduce a new algorithm to calculate the oppositeness of these candidate relationships. Finally, we present the discovered inconsistencies in an easy-to-read manner for use by medical professionals. To the best of our knowledge, this study is the first ontology-based information extraction model introduced to find shifts in established knowledge in the medical domain using research paper abstracts. We downloaded 36,877 abstracts from the PubMed database and found 102 inconsistencies relevant to the microRNA domain.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The biRNN-CRF may be seen as an improved alternative to an auto-regressive uni-directional RNN where predictions are performed sequentially conditioning on the prediction in the previous time-step, and close to ideally suited for the secondary structure task.
Abstract: Deep learning has become the state-of-the-art method for predicting protein secondary structure from only its amino acid residues and sequence profile. Building upon these results, we propose to combine a bi-directional recurrent neural network (biRNN) with a conditional random field (CRF), which we call the biRNN-CRF. The biRNN-CRF may be seen as an improved alternative to an auto-regressive uni-directional RNN, where predictions are performed sequentially, conditioning on the prediction in the previous time step. The CRF is instead nearest-neighbor-aware and models the joint distribution of the labels for all time steps. We condition the CRF on the output of the biRNN, which learns a distributed representation based on the entire sequence. The biRNN-CRF is therefore close to ideally suited for the secondary structure task, because a high degree of cross-talk between neighboring elements can be expected. We validate the model on several benchmark datasets. For example, on CB513, a model with 1.7 million parameters achieves a Q8 accuracy of 69.4 for a single model and 70.9 for an ensemble, which to our knowledge is state-of-the-art.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: This interactive tutorial introduces the eICU Collaborative Research Database, a large, publicly available database that contains routinely collected data from over 200,000 admissions to intensive care units across the United States, with representation from 10-12% of US ICU beds.
Abstract: Patients in hospital intensive care units (ICUs) are physiologically fragile and unstable, generally have life-threatening conditions, and require close monitoring and rapid therapeutic interventions. Staggering amounts of data are collected in the ICU daily: multi-channel waveforms sampled hundreds of times each second, vital sign time series updated each second or minute, alarms and alerts, lab results, imaging results, records of medication and fluid administration, staff notes and more. Reducing the barriers to data access has the potential to accelerate knowledge generation and ultimately improve patient care. In this interactive tutorial we introduce the eICU Collaborative Research Database: a large, publicly available database created by the MIT Laboratory for Computational Physiology in partnership with the Philips eICU Research Institute. The database contains routinely collected data from over 200,000 admissions to intensive care units across the United States, with representation from 10-12% of US ICU beds. The data facilitates a breadth of research studies, such as investigations into treatment efficacy, discovery of clinical markers in illnesses, and the development of decision support models. Participants in the tutorial gain an overview of the eICU Collaborative Research Database, in particular being introduced to its structure, content, and limitations. Following this overview, participants explore a demo version of the database using a laptop in a hands-on project. This exercise requires minimal technical expertise and gives an insight into the type of study that can be carried out using the database. We also highlight the growing online community centered around secondary analysis of this data. All tutorial materials are open source and publicly available. The eICU Collaborative Research Database offers an unparalleled insight into ICU care. Access to the database is granted to legitimate researchers who request it, following completion of a training course in human subjects research and acceptance of a data use agreement. We anticipate that the research community will use this unique resource to further human knowledge in the field of critical care.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The modular architecture of the software allows us to plug in a module enabling data aggregation for research purposes, which is an important feature in order to develop new mathematical models for drugs, and thus to improve TDM.
Abstract: Therapeutic Drug Monitoring (TDM) is a key concept in precision medicine. The goal of TDM is to avoid therapeutic failure or toxic effects of a drug due to insufficient or excessive circulating concentration exposure related to between-patient variability in the drug's disposition. We present TUCUXI - an intelligent system for TDM. By making use of embedded mathematical models, the software allows maximum likelihood individual predictions of drug concentrations to be computed from population pharmacokinetic data, based on a patient's parameters and previously observed concentrations. TUCUXI was developed to be used in medical practice, to assist clinicians in making dosage adjustment decisions for optimizing drug concentration levels. This software is currently being tested in a University Hospital. In this paper we focus on the process of integrating the software into the clinical workflow. The modular architecture of the software allows us to plug in a module enabling data aggregation for research purposes. This is an important feature for developing new mathematical models for drugs, and thus for improving TDM. Finally, we discuss ethical issues related to the use of an automated decision support system in clinical practice, in particular if it allows data aggregation for research purposes.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: This paper proposes GPU-PCC, a GPU algorithm based on the vector dot product that computes pairwise Pearson's Correlation Coefficients, performing the computation only once for each pair and without the need for post-processing reordering of coefficients.
Abstract: Functional Magnetic Resonance Imaging (fMRI) is a non-invasive brain imaging technique for studying the brain's functional activities. Pearson's Correlation Coefficient is an important measure for capturing dynamic behaviors and functional connectivity between brain components. One bottleneck in computing Correlation Coefficients is the time it takes to process big fMRI data. In this paper, we propose GPU-PCC, a GPU algorithm based on the vector dot product, which is able to compute pairwise Pearson's Correlation Coefficients while performing the computation only once for each pair. Our method computes the Correlation Coefficients in an ordered fashion without the need for post-processing reordering of coefficients. We evaluated GPU-PCC using synthetic and real fMRI data and compared it with a sequential CPU implementation and an existing state-of-the-art GPU method. We show that GPU-PCC runs 94.62x faster than the CPU version and 4.28x faster than the existing GPU-based technique on a real fMRI dataset of 90k voxels. The implemented code is available under the GPL license on our lab's GitHub portal at https://github.com/pcdslab/GPU-PCC.
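A minimal sketch of the dot-product formulation behind this kind of approach: after z-scoring each voxel time series, all pairwise Pearson correlations reduce to one matrix product. Shown with NumPy; on a GPU the same expression could be evaluated with a GPU array library.

```python
# Minimal sketch: pairwise Pearson correlation as a dot product of z-scored
# time series. Illustrates the formulation, not GPU-PCC's implementation.
import numpy as np

def pairwise_pcc(X):
    """X: (n_voxels, n_timepoints) -> (n_voxels, n_voxels) correlation matrix."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return Z @ Z.T / X.shape[1]                 # dot products of z-scored series

X = np.random.randn(100, 300)                    # 100 voxels, 300 time points
C = pairwise_pcc(X)
assert np.allclose(np.diag(C), 1.0)              # each series correlates 1.0 with itself
```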

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A flexible and robust multi-source learning (FRMSL) framework to integrate multiple heterogeneous data sources for drug-disease association predictions and develops a novel multi-view learning algorithm based on symmetric nonnegative matrix factorization (SymNMF).
Abstract: Drug repositioning is a promising strategy in drug discovery. New biomedical insights into drug-target-disease relationships are important in drug repositioning, and such relationships have been intensively studied recently. Most of the studies utilize network-based computational approaches based on drug and disease similarities. However, one common limitation of existing approaches is that both drug similarities and disease similarities are defined based on a single feature of drugs/diseases. In reality, the relationships between drug (or disease) pairs can be characterized based on many different features. Therefore, it is increasingly important to include them in drug repositioning studies. In this study, we propose a flexible and robust multi-source learning (FRMSL) framework to integrate multiple heterogeneous data sources for drug-disease association predictions. We first construct a two-layer heterogeneous network consisting of drug nodes, disease nodes and known drug-disease relationships. The drug repositioning problem can thus be treated as a missing link prediction problem on the heterogeneous graph and can be solved using the Kronecker regularized least squares (KronRLS) method. Multiple data sources describing drugs and diseases are incorporated into the framework using similarity-based kernels. In practice, a great challenge in such data integration projects is the data incompleteness problem due to the nature of data generation and collection. To address this issue, we develop a novel multi-view learning algorithm based on symmetric nonnegative matrix factorization (SymNMF). Extensive experimental studies show that our framework outperforms several recent network-based methods.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The specific problem of predicting patients' length of stay in the hospital is investigated in a predictive diagnosis framework that uses DLFS, an efficient deep-learning-based feature selection scheme applicable to heterogeneous data.
Abstract: Predictive diagnosis benefits both patients and hospitals. Major challenges limiting the effectiveness of machine learning based predictive diagnosis include the lack of efficient feature selection methods and the heterogeneity of measured patient data (e.g., vital signs). In this paper, we propose DLFS, an efficient feature selection scheme based on deep learning that is applicable for heterogeneous data. DLFS is unsupervised in nature and can learn compact representations from patient data automatically for efficient prediction. In this paper, the specific problem of predicting the patients' length of stay in the hospital is investigated in a predictive diagnosis framework which uses DLFS for feature selection. Real patient data from the pneumonia database of the National University Health System (NUHS) in Singapore are collected to verify the effectiveness of DLFS. By running experiments on real-world patient data and comparing with several other commonly used feature selection methods, we demonstrate the advantage of the proposed DLFS scheme.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This work presents SKraken, an efficient approach to accurately classify metagenomic reads against a set of reference genomes, e.g. the NCBI/RefSeq database, based on k-mer statistics combined with the taxonomic tree.
Abstract: The study of microbial communities is an emerging field that is revolutionizing many disciplines, from ecology to medicine. The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species in the sample and their relative abundance. Many tools have been developed in recent years; however, their performance in terms of precision and speed is not always adequate for these very large datasets. In this work we present SKraken, an efficient approach to accurately classify metagenomic reads against a set of reference genomes, e.g. the NCBI/RefSeq database. SKraken is based on k-mer statistics combined with the taxonomic tree. Given a set of target genomes, SKraken is able to detect the most representative k-mers for each species, filtering out uninformative k-mers. The classification performance on several synthetic and real metagenomics datasets shows that SKraken achieves in most cases the best performance in terms of precision and recall w.r.t. Kraken. In particular, at species-level classification, the estimation of the abundance ratios improves by 6% and the precision by 8%. This behavior is also confirmed on a real stool metagenomic sample, where SKraken is able to detect species with high precision. Because of the efficient filtering of uninformative k-mers, SKraken requires less RAM and is faster than Kraken, one of
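A toy sketch of the k-mer filtering idea follows, keeping only k-mers that occur in a single reference species and classifying a read by majority vote; this is illustrative only and not SKraken's actual algorithm or data structures.

```python
# Toy sketch: extract k-mers, keep only those unique to one reference
# species, and classify a read by majority vote over its informative k-mers.
from collections import defaultdict

def kmers(seq, k=7):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

references = {"species_A": "ACGTACGTGGAACCTT", "species_B": "TTGGCCAACGTACGTA"}
kmer_to_species = defaultdict(set)
for species, genome in references.items():
    for km in kmers(genome):
        kmer_to_species[km].add(species)

# Informative k-mers map to a single species; shared k-mers are filtered out.
informative = {km: next(iter(s)) for km, s in kmer_to_species.items() if len(s) == 1}

read = "GGAACCTTAGG"
votes = [informative[km] for km in kmers(read) if km in informative]
print(max(set(votes), key=votes.count) if votes else "unclassified")
```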

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A GPU based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra, which achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds.
Abstract: Modern high-resolution mass spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most sequential noise-reducing algorithms are impractical to use as a pre-processing step due to their high time complexity. In this paper, we present a GPU-based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These novel data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between the CPU and GPU using a minimal amount of data, while the latter enables us to store and process a complex 3-D data structure in a 1-D array structure while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based on the size of the input data. G-MSR achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open source on GitHub at the following link: https://github.com/pcdslab/G-MSR.

Journal ArticleDOI
11 Oct 2017
TL;DR: This paper presents a detailed survey of the various clustering approaches used in WSNs.
Abstract: Wireless sensor networks (WSNs) are evolving as both a vital new domain in the IT environment and a hot research area, spanning system design, networking, distributed algorithms, programming models, data management, security, and social components. Wireless sensor networks are rapidly gaining popularity as they are potentially low-cost solutions. The fundamental idea of a sensor network is to scatter tiny sensing devices over a particular geographic zone for specific purposes such as target tracking, surveillance, and environmental monitoring. These tiny devices are capable of sensing changes in a few parameters and communicating with other units. Since the network is wireless, the nodes communicate with each other wirelessly. The wireless sensor network is interfaced with the outside world by a gateway or controlling server. This paper focuses on a detailed survey of the various clustering approaches used in WSNs.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: This talk will present the latest method that uses Lorentzian function to describe distance restraints between chromosomal regions, which will be used to guide the reconstruction of 3D structures of individual chromosomes and an entire genome.
Abstract: Reconstructing the 3D structure of a genome from chromosome conformation capture data such as Hi-C data has emerged as an important problem in bioinformatics and computational biology in recent years. In this talk, I will present our latest method, which uses a Lorentzian function to describe distance restraints between chromosomal regions; these restraints are used to guide the reconstruction of 3D structures of individual chromosomes and an entire genome. The method is more robust against noisy distance restraints derived from Hi-C data than traditional objective functions such as the squared error function and the Gaussian probabilistic function. The method can handle both intra- and inter-chromosomal contacts effectively to build 3D structures of a big genome, such as the human genome consisting of a number of chromosomes, which is not possible with most existing methods. We have released the Java source code that implements the method (called LorDG) at GitHub (https://github.com/BDM-Lab/LorDG), which is being used by the community to model 3D genome structures. We are currently further improving the method to build very high-resolution (e.g., 1 kb) 3D genome and chromosome models.
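A minimal sketch of a Lorentzian-shaped restraint score follows: it peaks when the model distance matches the Hi-C-derived target distance and decays slowly for outliers, which is what makes such an objective tolerant of noisy restraints. The exact functional form and parameters used by LorDG may differ; this is illustrative only.

```python
# Minimal sketch of a Lorentzian-shaped restraint objective for 3D genome
# reconstruction; functional form and parameters are illustrative.
import numpy as np

def lorentzian_score(model_dist, target_dist, width=1.0):
    """Close to 1 when distances agree; heavy tails tolerate noisy restraints."""
    return 1.0 / (1.0 + ((model_dist - target_dist) / width) ** 2)

def objective(coords, pairs, targets):
    """coords: (n_regions, 3); pairs: index pairs with one target distance each."""
    d = np.linalg.norm(coords[pairs[:, 0]] - coords[pairs[:, 1]], axis=1)
    return lorentzian_score(d, targets).sum()     # maximized during reconstruction

coords = np.random.rand(50, 3)                     # candidate 3D coordinates
pairs = np.array([[0, 1], [2, 7], [10, 40]])       # restrained region pairs
targets = np.array([0.2, 0.5, 1.0])                # Hi-C-derived target distances
print(objective(coords, pairs, targets))
```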

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A novel framework called Microbial Time-series Prior Lasso (MTPLasso) is proposed which integrates sparse linear regression with microbial co-occurrences and associations obtained from scientific literature and cross-sectional metagenomics data and outperforms existing models in terms of precision and recall rates, as well as the accuracy in inferring the interaction types.
Abstract: Due to recent advances in modern metagenomics sequencing methods, it has become possible to directly analyze the microbial communities within the human body. To understand how microbial communities adapt, develop, and interact over time with the human body and the surrounding environment, a critical step is the inference of interactions among different microbes directly from sequencing data. However, metagenomics data is both compositional and high-dimensional in nature. Consequently, new approaches that can accurately and robustly estimate the interactions among various microbial species are needed to analyze such data. To this end, we propose a novel framework called Microbial Time-series Prior Lasso (MTPLasso), which integrates sparse linear regression with microbial co-occurrences and associations obtained from scientific literature and cross-sectional metagenomics data. We show that MTPLasso outperforms existing models in terms of precision and recall rates, as well as the accuracy in inferring interaction types. Finally, the interaction networks we infer from human gut data demonstrate credible results when compared against real data.
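A minimal sketch of the sparse-regression core under simplifying assumptions: each microbe's abundance change is regressed on all abundances with a Lasso penalty, and nonzero coefficients are read as candidate interactions. The literature priors that MTPLasso integrates are only hinted at here by a per-feature scaling; this is not the paper's actual estimator.

```python
# Minimal sketch: per-taxon Lasso regression of abundance changes on
# abundances; nonzero coefficients suggest candidate interactions.
import numpy as np
from sklearn.linear_model import Lasso

n_samples, n_taxa = 60, 10
abundance = np.random.rand(n_samples, n_taxa)            # time-series abundances
delta = np.diff(abundance, axis=0)                       # change between time points
prior = np.ones(n_taxa)                                   # >1 would favor literature-supported links

interactions = np.zeros((n_taxa, n_taxa))
for j in range(n_taxa):
    model = Lasso(alpha=0.01)
    model.fit(abundance[:-1] * prior, delta[:, j])        # scaled columns mimic per-feature penalties
    interactions[j] = model.coef_ / prior                 # undo the scaling
print(np.count_nonzero(interactions), "candidate interactions")
```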

Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work generates mutant protein structures in silico and computes several rigidity metrics for each of them, which serve as features for support vector regression, random forest, and deep neural network methods.
Abstract: Predicting how a point mutation alters a protein's stability can guide drug design initiatives which aim to counter the effects of serious diseases. Mutagenesis studies give insights about the effects of amino acid substitutions, but such wet-lab work is prohibitive due to the time and costs needed to assess the consequences of even a single mutation. Computational methods for predicting the effects of a mutation are available, with promising accuracy rates. In this work we study the utility of several machine learning methods and their ability to predict the effects of mutations. We in silico generate mutant protein structures, and compute several rigidity metrics for each of them. Our approach does not require costly calculations of energy functions that rely on atomic-level statistical mechanics and molecular energetics. Our metrics are features for support vector regression, random forest, and deep neural network methods. We validate the effects of our in silico mutations against experimental Delta Delta G stability data. We attain Pearson Correlations upwards of 0.69.
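A minimal sketch of the regression step follows, assuming the rigidity metrics have already been computed per mutant; random arrays stand in for the real features and experimental stability values.

```python
# Minimal sketch: rigidity metrics per in silico mutant as features, with
# support vector regression and a random forest predicting stability change.
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

n_mutants, n_rigidity_metrics = 300, 20
X = np.random.randn(n_mutants, n_rigidity_metrics)   # rigidity features per mutant
y = np.random.randn(n_mutants)                        # experimental Delta Delta G values

for model in (SVR(kernel="rbf"), RandomForestRegressor(n_estimators=200)):
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, round(r2, 3))
```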

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The proposed DCD approach overcomes the limitations of previous statistical techniques and the issues associated with identifying differential sub-networks by use of community detection methods on the noisy DT graph and demonstrates the potential benefits of DCD for finding network-inferred bio-markers/pathways associated with a trait of interest.
Abstract: Motivation: Biological networks unravel the inherent structure of molecular interactions which can lead to discovery of driver genes and meaningful pathways especially in cancer context. Often due to gene mutations, the gene expression undergoes changes and the corresponding gene regulatory network sustains some amount of localized re-wiring. The ability to identify significant changes in the interaction patterns caused by the progression of the disease can lead to the revelation of novel relevant signatures. Methods: The task of identifying differential sub-networks in paired biological networks (A:control,B:case) can be re-phrased as one of finding dense communities in a single noisy differential topological (DT) graph constructed by taking absolute difference between the topological graphs of A and B. In this paper, we propose a fast three-stage approach, namely Differential Community Detection (DCD), to identify differential sub-networks as differential communities in a de-noised version of the DT graph. In the first stage, we iteratively re-order the nodes of the DT graph to determine approximate block diagonals present in the DT adjacency matrix using neighbourhood information of the nodes and Jaccard similarity. In the second stage, the ordered DT adjacency matrix is traversed along the diagonal to remove all the edges associated with a node, if that node has no immediate edges within a window. Finally, we apply community detection methods on this de-noised DT graph to discover differential sub-networks as communities. Results: Our proposed DCD approach can effectively locate differential sub-networks in several simulated paired random-geometric networks and various paired scale-free graphs with different power-law exponents. The DCD approach easily outperforms community detection methods applied on the original noisy DT graph and recent statistical techniques in simulation studies. We applied DCD method on two real datasets: a) Ovarian cancer dataset to discover differential DNA co-methylation sub-networks in patients and controls; b) Glioma cancer dataset to discover the difference between the regulatory networks of IDH-mutant and IDH-wild-type. We demonstrate the potential benefits of DCD for finding network-inferred bio-markers/pathways associated with a trait of interest. Conclusion: The proposed DCD approach overcomes the limitations of previous statistical techniques and the issues associated with identifying differential sub-networks by use of community detection methods on the noisy DT graph. This is reflected in the superior performance of the DCD method with respect to various metrics like Precision, Accuracy, Kappa and Specificity. The code implementing proposed DCD method is available at https://sites.google.com/site/raghvendramallmlresearcher/codes.
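A minimal sketch of the first DCD stage under simplifying assumptions: build the differential topological graph as the absolute difference of two adjacency matrices, then greedily order nodes by neighborhood Jaccard similarity so that approximate diagonal blocks emerge. The greedy heuristic here is illustrative, not the paper's exact procedure.

```python
# Minimal sketch: DT graph construction and a greedy Jaccard-based node
# ordering that groups nodes with similar neighborhoods next to each other.
import numpy as np

def jaccard(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def random_network(n=30, p=0.2, seed=None):
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((n, n)) < p, 1)
    return (A | A.T).astype(int)                 # symmetric adjacency matrix

A = random_network(seed=0)                        # control network
B = random_network(seed=1)                        # case network
DT = np.abs(A - B)                                # differential topological graph

order = [0]
remaining = set(range(1, DT.shape[0]))
while remaining:
    last = order[-1]
    nxt = max(remaining, key=lambda v: jaccard(DT[last] > 0, DT[v] > 0))
    order.append(nxt)
    remaining.remove(nxt)
reordered = DT[np.ix_(order, order)]              # approximate block-diagonal structure
```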

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A practical application called TRuML is introduced that translates models written in either Kappa or BNGL into the other language and produces a semantically equivalent model in the alternate language of the input model when possible and an approximate model in certain other cases.
Abstract: Rule-based modeling languages, such as the Kappa and BioNetGen languages (BNGL), are powerful frameworks for modeling the dynamics of complex biochemical reaction networks. Each language is distributed with a distinct software suite, and modelers may wish to take advantage of both toolsets. This paper introduces a practical application called TRuML that translates models written in either Kappa or BNGL into the other language. While the two languages are similar in many respects, key differences between them make translation sufficiently complex that automation becomes a useful tool. TRuML accommodates the languages' complexities and produces a semantically equivalent model in the alternate language of the input model when possible, and an approximate model in certain other cases. Here, we discuss a number of these complexities and provide examples of equivalent models in both Kappa and BNGL.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: ProMuteHT is presented, a program for high-throughput in silico generation of user-specified sets of mutant protein structures with single or multiple amino acid substitutions; the mutants are of high quality, as determined via all-atom and mutated-residue RMSD measurements against existing mutant structures in the PDB.
Abstract: Understanding how an amino acid substitution affects a protein's structure is fundamental to advancing drug design and protein docking studies. Mutagenesis experiments on physical proteins provide a precise assessment of the effects of mutations, but they are time and cost prohibitive. Computational approaches for performing in silico amino acid substitutions are available, but they are not suited for generating the large numbers of protein variants needed for high-throughput screening studies. We present ProMuteHT, a program for high-throughput in silico generation of user-specified sets of mutant protein structures with single or multiple amino acid substitutions. We combine our custom mutation algorithm with external side-chain homology modeling libraries to generate energetically feasible mutant structures. Our efficient command-line invocation syntax requires only a few arguments to specify large datasets of mutant structures. We achieve quick run times due to our hybrid approach, in which we limit the use of costly energy calculations when mutating from a large to a small amino acid. We compare our mutant structures with those generated by FoldX and report faster run times. We show that the mutants generated by ProMuteHT are of high quality, as determined via all-atom and mutated-residue RMSD measurements for existing mutant structures in the PDB.