
Showing papers presented at "International Conference on Bioinformatics in 2019"


Proceedings ArticleDOI
04 Sep 2019
TL;DR: The proposed semi-supervised model, SMILES-BERT, built from attention-based Transformer layers, outperforms state-of-the-art methods on all three datasets, demonstrating the effectiveness of unsupervised pre-training and the strong generalization capability of the pre-trained model.
Abstract: With the rapid progress of AI in both academia and industry, Deep Learning has been widely introduced into various areas of drug discovery to accelerate its pace and cut R&D costs. Among the problems in drug discovery, molecular property prediction is one of the most important. Unlike typical Deep Learning applications, molecular property prediction offers only limited labeled data. To address this, Deep Learning methods have started focusing on how to exploit vast amounts of unlabeled data to improve prediction performance on small-scale labeled data. In this paper, we propose a semi-supervised model named SMILES-BERT, which consists of attention-based Transformer layers. A large-scale unlabeled dataset is used to pre-train the model through a Masked SMILES Recovery task. The pre-trained model can then be generalized to different molecular property prediction tasks via fine-tuning. In our experiments, SMILES-BERT outperforms state-of-the-art methods on all three datasets, showing the effectiveness of the unsupervised pre-training and the strong generalization capability of the pre-trained model.
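As a rough illustration of the Masked SMILES Recovery idea, the sketch below builds BERT-style masked inputs from a SMILES string. It is a minimal sketch under assumed details (character-level tokens, a 15% mask rate, a `[MASK]` symbol); the paper's tokenizer and masking scheme may differ.

```python
import random

MASK = "[MASK]"

def tokenize(smiles):
    # Character-level tokenization; the paper's tokenizer may differ.
    return list(smiles)

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Return (inputs, labels): labels hold the original token at masked
    positions and None elsewhere, so only masked positions are scored."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored
    return inputs, labels

inputs, labels = mask_tokens(tokenize("CC(=O)Oc1ccccc1C(=O)O"), seed=0)
print(inputs)
print(labels)
```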

178 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This work uses 12,000 drug features from DrugBank, PharmGKB, and KEGG, integrated using Knowledge Graphs, and finds that the best-performing combination was a ComplEx embedding created with PyTorch-BigGraph, a Convolutional-LSTM network, and classic machine-learning-based prediction models.
Abstract: Interference between pharmacological substances can cause serious medical injuries. Correctly predicting so-called drug-drug interactions (DDI) not only reduces these cases but can also reduce drug development costs. Presently, most drug-related knowledge is the result of clinical evaluations and post-marketing surveillance, resulting in a limited amount of information. Existing data-driven prediction approaches for DDIs typically rely on a single source of information, while using information from multiple sources would help improve predictions. Machine learning (ML) techniques are used, but they are often unable to deal with skewness in the data. Hence, we propose a new ML approach for predicting DDIs based on multiple data sources. For this task, we use 12,000 drug features from DrugBank, PharmGKB, and KEGG drugs, which are integrated using Knowledge Graphs (KGs). To train our prediction model, we first embed the nodes in the graph using various embedding approaches. We found that the best-performing combination was a ComplEx embedding method created using PyTorch-BigGraph (PBG) with a Convolutional-LSTM network and classic machine learning-based prediction models. A model-averaging ensemble of the three best classifiers yields up to 0.94, 0.92, and 0.80 for AUPR, F1-score, and MCC, respectively, in 5-fold cross-validation tests.

66 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: A model called Auto-ASD-Network is proposed to classify subjects with Autism Spectrum Disorder versus healthy subjects using only fMRI data; it improves the performance of SVM by 26%, the stand-alone MLP by 16%, and the state-of-the-art method in ASD classification by 14%.
Abstract: Quantitative analysis of brain disorders such as Autism Spectrum Disorder (ASD) is an ongoing field of research. Machine learning and deep learning techniques have been playing an important role in automating the diagnosis of brain disorders by extracting discriminative features from brain data. In this study, we propose a model called Auto-ASD-Network to classify subjects with autism versus healthy subjects using only fMRI data. Our model consists of a multilayer perceptron (MLP) with two hidden layers. We use the SMOTE algorithm for data augmentation in order to generate artificial data and avoid overfitting, which helps increase the classification accuracy. We further investigate the discriminative power of the features extracted by the MLP by feeding them to an SVM classifier. In order to optimize the hyperparameters of the SVM, we use a technique called Auto Tune Models (ATM), which searches over the hyperparameter space to find the best values for the SVM hyperparameters. Our model achieves more than 70% classification accuracy on 4 fMRI datasets, with the highest accuracy of 80%. It improves the performance of SVM by 26%, the stand-alone MLP by 16%, and the state-of-the-art method in ASD classification by 14%. The implementation will be available under a GPL license on our lab's GitHub portal (https://github.com/PCDS).
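For the data augmentation step, the paper names SMOTE; a minimal sketch with the imbalanced-learn implementation follows. The feature matrix, labels, and the plain cross-validated SVM (standing in for ATM's hyperparameter search) are placeholders, not the paper's data or tuning procedure.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(120, 64)        # placeholder features (e.g., fMRI-derived)
y = np.array([0] * 90 + [1] * 30)  # imbalanced labels

# SMOTE synthesizes minority-class samples by interpolating between neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Plain cross-validation over a fixed SVC; the paper tunes C/kernel with ATM.
scores = cross_val_score(SVC(C=1.0, kernel="rbf"), X_res, y_res, cv=5)
print(scores.mean())
```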

46 citations


Journal ArticleDOI
04 Feb 2019
TL;DR: The proposed predictor classifies glycated and non-glycated lysine residues with consistently promising results across various cross-validation schemes and outperforms other state-of-the-art methods.
Abstract: Glycation is one of the post-translational modifications (PTMs), in which sugar molecules are covalently bonded to residues in protein sequences. It has become one of the clinically important PTMs in recent times, attributed to many chronic and age-related complications. Being a non-enzymatic reaction, its prediction is a great challenge due to the lack of significant bias in the sequence motifs. We developed a classifier, GlyStruct, based on a support vector machine, to predict glycated and non-glycated lysine residues using the structural properties of amino acid residues. The features used were secondary structure, accessible surface area, and local backbone torsion angles. For this work, a benchmark dataset was extracted containing 235 glycated and 303 non-glycated lysine residues. GlyStruct demonstrated an improvement of approximately 10% over the benchmark method Gly-PseAAC. Under 10-fold cross-validation, GlyStruct achieved sensitivity, specificity, accuracy, and Matthews correlation coefficient of 0.7013, 0.7989, 0.7562, and 0.5065, respectively. Glycation has emerged as one of the clinically important PTMs of proteins in recent times; the development of computational tools to predict it therefore becomes necessary and could help medical professionals administer drugs and manage patients more effectively. The proposed predictor classifies glycated and non-glycated lysine residues with consistently promising results across various cross-validation schemes and outperforms other state-of-the-art methods.

35 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This paper adopts the Shapley additive explanation (SHAP) for interpreting a gradient-boosting decision tree model trained on hospital data and proposes two novel techniques: a new metric of feature importance using SHAP, and a technique termed feature packing, which packs multiple similar features into one grouped feature to allow easier understanding of the model without reconstructing the model.
Abstract: When using machine learning techniques in decision-making processes, the interpretability of the models is important. In the present paper, we adopted the Shapley additive explanation (SHAP), which is based on fair profit allocation among many stakeholders depending on their contribution, for interpreting a gradient-boosting decision tree model using hospital data. For better interpretability, we propose two novel techniques as follows: (1) a new metric of feature importance using SHAP and (2) a technique termed feature packing, which packs multiple similar features into one grouped feature to allow an easier understanding of the model without reconstruction of the model.
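A small sketch of both ideas on a generic gradient-boosting model: per-feature importance as the mean absolute SHAP value, and a "packing" step that sums the SHAP values of grouped features. The model, data, and grouping rule here are illustrative assumptions, not the paper's definitions.

```python
import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

# Placeholder data and model standing in for the hospital data / GBDT.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = xgboost.XGBClassifier(n_estimators=50).fit(X, y)

# Per-sample, per-feature SHAP contributions.
shap_values = shap.TreeExplainer(model).shap_values(X)

# One common importance metric: mean |SHAP| over samples.
importance = np.abs(shap_values).mean(axis=0)

# "Pack" similar features: additivity lets grouped SHAP values be summed.
groups = {"group_a": [0, 1], "group_b": [2, 3, 4], "group_c": [5]}
packed = {name: shap_values[:, idx].sum(axis=1) for name, idx in groups.items()}

print(importance)
print({k: v.shape for k, v in packed.items()})
```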

32 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: The deep learning and contact distance prediction methods of the MULTICOM protein structure prediction system, which was ranked among the top three methods in the 13th community-wide Critical Assessment of Techniques for Protein Structure Prediction (CASP13) in 2018, are presented.
Abstract: Ab initio prediction of protein structure from sequence is one of the most challenging and important problems in bioinformatics and computational biology. After a long period of stagnancy, ab initio protein structure prediction is undergoing a revolution driven by inter-residue contact distance prediction empowered by deep learning. In this talk, I will present the deep learning and contact distance prediction methods of our MULTICOM protein structure prediction system, which was ranked among the top three methods in the 13th community-wide Critical Assessment of Techniques for Protein Structure Prediction (CASP13) in 2018 [1]. MULTICOM was able to correctly fold numerous hard protein targets from scratch in CASP13, an unprecedented advance. This success clearly demonstrates that contact distance prediction is the key direction for tackling the protein structure prediction challenge and that deep learning is the key technology for solving it. However, to completely solve the problem, more advanced deep learning methods are needed to accurately predict inter-residue distances when few homologous sequences are available to calculate residue-residue co-evolution scores, to fold proteins from noisy inter-residue distances, and to rank the structural models of hard protein targets.

32 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This paper extends the segmentation network U-Net with a Self-Attention module, yielding SAU-Net, for cell counting, and designs an online version of Batch Normalization to mitigate the generalization gap caused by data augmentation on small datasets.
Abstract: Image-based cell counting is a fundamental yet challenging task with wide applications in biological research. In this paper, we propose a novel Deep Network designed to universally solve this problem for various cell types. Specifically, we first extend the segmentation network U-Net with a Self-Attention module, yielding SAU-Net, for cell counting. Second, we design an online version of Batch Normalization to mitigate the generalization gap caused by data augmentation on small datasets. We evaluate the proposed method on four public cell counting benchmarks - the synthetic fluorescence microscopy (VGG) dataset, the Modified Bone Marrow (MBM) dataset, the human subcutaneous adipose tissue (ADI) dataset, and the Dublin Cell Counting (DCC) dataset. Our method surpasses the current state-of-the-art performance on the three real datasets (MBM, ADI and DCC) and achieves competitive results on the synthetic dataset (VGG). The source code is available at https://github.com/mzlr/sau-net.
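For readers unfamiliar with self-attention over feature maps, here is a minimal PyTorch block in the SAGAN style that could be dropped into a U-Net. SAU-Net's actual module, its placement, and the online Batch Normalization are not reproduced here; this is only a sketch of the mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over 2-D feature maps with a learned residual weight."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, hw)
        attn = F.softmax(q @ k, dim=-1)               # (b, hw, hw) attention map
        v = self.value(x).flatten(2)                  # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                   # residual connection

x = torch.randn(2, 64, 32, 32)
print(SelfAttention2d(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```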

27 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: First experiments show that serverless computing is useful for this particular high-throughput bioinformatics application because it simplifies resource management.
Abstract: Currently, several factors are moving biomedical research towards a (big) data-centred science. This yields new challenges for computer science solutions when dealing with bioinformatics applications. Among others, the efficient storage, preprocessing, integration and analysis of omics and clinical data create a bottleneck in the analysis pipeline. This can be addressed using cloud technology. This paper discusses the challenges and opportunities of deploying bioinformatics applications using the Amazon serverless Lambda services. First experiments show that serverless computing is useful for this particular high-throughput bioinformatics application because it simplifies resource management.

23 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This work demonstrates how serverless computing provides low cost access to hundreds of CPUs, on demand, with little or no setup, and illustrates that the all-against-all pairwise comparison among all unique human proteins can be accomplished in approximately 2 minutes, at a cost of less than $1, using Amazon Web Services Lambda.
Abstract: Cloud computing offers on-demand, scalable computing and storage, and has become an essential resource for the analyses of big biomedical data. The usual approach to cloud computing requires users to reserve and provision virtual servers. An emerging alternative is to have the provider allocate machine resources dynamically. This type of serverless computing has tremendous potential for biomedical research in terms of ease-of-use, instantaneous scalability, and cost effectiveness. In our proof of concept example, we demonstrate how serverless computing provides low cost access to hundreds of CPUs, on demand, with little or no setup. In particular, we illustrate that the all-against-all pairwise comparison among all unique human proteins can be accomplished in approximately 2 minutes, at a cost of less than $1, using Amazon Web Services Lambda. We also demonstrate the feasibility of our approach using Google Functions and show that the same task of pairwise protein sequence comparison can be accomplished in approximately 11.5 minutes. In contrast, running the same task on a typical laptop computer required 8.7 hours.
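A hypothetical sketch of one Lambda worker in such a fan-out: each invocation scores one query sequence against a shard of targets and returns JSON. The event layout and function names are assumptions, and Biopython's pairwise aligner stands in for whatever comparison the authors ran (the dependency would need to be bundled, e.g., as a Lambda layer).

```python
from Bio import Align

# Default global alignment scoring; real settings would use a substitution
# matrix and gap penalties appropriate for proteins.
aligner = Align.PairwiseAligner()

def handler(event, context):
    """Hypothetical event: {"query": "<seq>", "targets": [{"id": ..., "seq": ...}]}.
    Hundreds of such invocations, each with its own shard, run in parallel."""
    query = event["query"]
    results = []
    for target in event["targets"]:
        score = aligner.score(query, target["seq"])
        results.append({"id": target["id"], "score": score})
    # Lambda responses must be JSON-serializable.
    return {"query_len": len(query), "results": results}
```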

22 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: This work takes some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large, practical values of k, and shows that the process is guaranteed never to increase the number of distinct minimizers chosen in a sequence.
Abstract: Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) orderings of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set, the fewer false positives (where two poorly aligned sequences are labeled as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large, practical values of k. We do this using iterative extension of the k-mers in a set and guided contraction of the set itself. We also show that this process is guaranteed never to increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives relative to using the current sets on small k-mers.
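To make the minimizer mechanics concrete, the sketch below selects the minimal k-mer in each window of w consecutive k-mers under an arbitrary ordering function; a universal k-mer set would enter by biasing that ordering. This is a generic illustration, not the paper's construction.

```python
def minimizers(seq, k, w, order=hash):
    """Return (position, kmer) of the minimal k-mer in each window of w
    consecutive k-mers, deduplicating adjacent repeats. `order` ranks k-mers;
    a universal-set-aware order would rank set members lowest. Note that
    Python's built-in hash of strings varies between runs."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        pos = start + min(range(w), key=lambda j: order(window[j]))
        if not chosen or chosen[-1][0] != pos:  # same minimizer spans windows
            chosen.append((pos, kmers[pos]))
    return chosen

print(minimizers("ACGTACGTGACG", k=4, w=3))
```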

20 citations


Proceedings ArticleDOI
04 Sep 2019
TL;DR: The proposed residual deep learning system for mass segmentation and classification in mammography shows superior performance compared to the other DL methods in detecting and segmenting masses, especially for heterogeneously dense and dense MG images, in terms of intersection over union (IOU) and the Dice index coefficient (DI).
Abstract: Automatic extraction of breast masses from mammogram (MG) images is a challenging task due to the varying sizes, shapes, and textures of masses. Moreover, the density of MGs makes mass detection very challenging, since masses can be hidden in dense MGs. In this paper, we propose a residual deep learning (DL) system for mass segmentation and classification in mammography. The overall proposed system consists of two cascaded parts: 1) a residual attention U-Net model (RU-Net) to precisely segment mass lesions in MG images, followed by 2) a ResNet classifier to classify the detected binary segmented lesions as benign or malignant. The proposed semantic CNN model, RU-Net, has the basic architecture of the U-Net model, which extracts contextual information by combining low-level features with high-level ones. We have modified the U-Net structure by adding residual attention modules in order to preserve spatial and contextual information, allow the network to have a deeper architecture, and handle the vanishing-gradient problem. We compared the performance of the proposed RU-Net model with those of two state-of-the-art semantic segmentation models and two object detectors, using public databases. We also examined the effect of breast density on the accuracy of localizing and segmenting breast masses. Our proposed model shows superior performance compared to the other DL methods in detecting and segmenting masses, especially for heterogeneously dense and dense MG images, in terms of intersection over union (IoU) and the Dice index coefficient (DI). Moreover, our results show that the cascaded ResNet model, trained on binary-scale images, classifies the masses as benign or malignant with higher accuracy than the ResNet model trained on gray-scale images.
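The two reported metrics are standard; for reference, a small sketch computing IoU and the Dice index on binary masks (numpy arrays of 0/1 values for ground truth and prediction) follows.

```python
import numpy as np

def iou(gt, pred):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 1.0

def dice(gt, pred):
    """Dice index: 2|A∩B| / (|A|+|B|)."""
    inter = np.logical_and(gt, pred).sum()
    total = gt.sum() + pred.sum()
    return 2 * inter / total if total else 1.0

gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=int); pred[3:7, 3:7] = 1
print(iou(gt, pred), dice(gt, pred))  # 9/23 ≈ 0.391 and 18/32 = 0.5625
```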

Proceedings ArticleDOI
18 Mar 2019
TL;DR: A novel machine learning based framework for the prediction of colorectal cancer outcome from whole digitized haematoxylin & eosin stained histopathology slides is introduced, and a detailed analysis of its different elements corroborates its ability to extract and learn salient, discriminative, and clinically meaningful content.
Abstract: Digital pathology (DP) is a new research area which falls under the broad umbrella of health informatics. Owing to its potential for major public health impact, in recent years DP has been attracting much research attention. Nevertheless, a wide breadth of significant conceptual and technical challenges remain, few of them greater than those encountered in the field of oncology. The automatic analysis of digital pathology slides of cancerous tissues is particularly problematic due to the inherent heterogeneity of the disease and the extremely large images, amongst numerous other factors. In this paper we introduce a novel machine learning based framework for the prediction of colorectal cancer outcome from whole digitized haematoxylin & eosin (H&E) stained histopathology slides. Using a real-world data set, we demonstrate the effectiveness of the method and present a detailed analysis of its different elements, which corroborates its ability to extract and learn salient, discriminative, and clinically meaningful content.

Proceedings ArticleDOI
Yuhan Dong, Rui Wen, Zhide Li, Kai Zhang, Lin Zhang
21 Mar 2019
TL;DR: Numerical results suggest that the proposed Clu-RNN approach utilizes more than one cluster for both type I and type II diabetes and achieves improvements over support vector regression (SVR) and other RNN methods in terms of BG prediction accuracy.
Abstract: Diabetes is a metabolic disease characterized by chronically elevated blood glucose (BG) and may introduce a series of severe complications in the long run. To facilitate health management for diabetic patients, continuous monitoring and prediction of BG concentration are particularly important. Among the popular data-driven solutions to BG prediction, machine learning methods, e.g., SVR and RNN, utilize BG data of multiple patients to train the prediction model. However, training on all the data with shared parameters may fail to capture the characteristics of BG fluctuation effectively. Motivated by the fact that different subgroups of diabetic patients exhibit different BG fluctuation patterns, we propose a new BG prediction approach, referred to as Clu-RNN, based on recurrent neural networks (RNN), which incorporates a clustering pre-process into the classical RNN. Numerical results suggest that the proposed Clu-RNN approach utilizes more than one cluster for both type I and type II diabetes and achieves improvements over support vector regression (SVR) and other RNN methods in terms of BG prediction accuracy.
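A minimal sketch of the clustering pre-process: summarize each patient's BG trace with a few fluctuation statistics, cluster the patients, and then train one recurrent model per cluster. The features, cluster count, and synthetic traces are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def fluctuation_features(bg_series):
    # Simple summary of a patient's BG trace: level, spread, mean step size.
    return [bg_series.mean(), bg_series.std(), np.abs(np.diff(bg_series)).mean()]

# Synthetic stand-in for CGM traces (288 readings = one day at 5-min intervals).
rng = np.random.default_rng(0)
patients = [rng.normal(loc=120 + 30 * (i % 2), scale=10 + 5 * (i % 3), size=288)
            for i in range(20)]

features = np.array([fluctuation_features(p) for p in patients])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# One RNN would then be trained per cluster on that cluster's traces.
print(labels)
```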

Journal ArticleDOI
18 Apr 2019
TL;DR: EvolStruct-Phogly, a new predictor that uses structural and evolutionary information about amino acids to predict phosphoglycerylated lysine residues, showed a noteworthy performance improvement compared with previous predictors.
Abstract: Post-translational modification (PTM) is a biological process that modifies the proteome, leading to changes in normal cell biology and pathogenesis. Many PTMs have been reported in recent times. Among these modifications, phosphoglycerylation has become a particular subject of interest. The experimental procedure for identifying phosphoglycerylated residues remains an expensive, inefficient and time-consuming effort, even with the large number of proteins sequenced in the post-genomic era. Computational methods are therefore needed to effectively predict phosphoglycerylated lysines. Even though predictors are available, the ability to detect phosphoglycerylated lysine residues still remains inadequate. In this paper we introduce a new predictor named EvolStruct-Phogly that uses structural and evolutionary information about amino acids to predict phosphoglycerylated lysine residues. Benchmark data containing experimentally identified phosphoglycerylated and non-phosphoglycerylated lysines are employed. We extracted three types of structural information (the accessible surface area of amino acids, backbone torsion angles, and each amino acid's local structure conformation) together with profile bigrams of position-specific scoring matrices. EvolStruct-Phogly showed a noteworthy performance improvement compared with previous predictors. The performance metrics obtained are as follows: sensitivity 0.7744, specificity 0.8533, precision 0.7368, accuracy 0.8275, and a Matthews correlation coefficient of 0.6242. The software package and data of this work can be obtained from https://github.com/abelavit/EvolStruct-Phogly or www.alok-ai-lab.com

Proceedings ArticleDOI
04 Sep 2019
TL;DR: An image classification pipeline that combines a shallow learner and a deep learner outperformed either individual learner, with a top accuracy of 92%.
Abstract: Breast cancer is a deadly disease that affects millions of women worldwide. The International Conference on Image Analysis and Recognition in 2018 presented the BreAst Cancer Histology (ICIAR2018 BACH) image data challenge, which calls for computer tools to assist pathologists and doctors in the clinical diagnosis of breast cancer subtypes. Using the BACH dataset, we have developed an image classification pipeline that combines a shallow learner (support vector machine) and a deep learner (convolutional neural network). The shallow and deep learners achieved moderate accuracies of 79% and 81% individually. When integrated by fusion algorithms, the system outperformed either individual learner, with a top accuracy of 92%. The fusion presents great potential for improving clinical decision support.
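One simple fusion rule consistent with this setup is decision-level averaging of the two learners' class probabilities; a sketch follows. The paper's actual fusion algorithms may be more elaborate than this weighted average.

```python
import numpy as np

def fuse(prob_svm, prob_cnn, weight=0.5):
    """Weighted average of two (n_samples, n_classes) probability matrices,
    followed by argmax to pick the fused class per sample."""
    fused = weight * prob_svm + (1 - weight) * prob_cnn
    return fused.argmax(axis=1)

# Toy probabilities for two samples and two classes.
prob_svm = np.array([[0.6, 0.4], [0.3, 0.7]])
prob_cnn = np.array([[0.4, 0.6], [0.2, 0.8]])
print(fuse(prob_svm, prob_cnn))  # fused class label per sample
```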

Proceedings ArticleDOI
29 May 2019
TL;DR: This research not only improves the predictability of cervical cancer risk but also inspires the development of pathological models based on MLP and random forest.
Abstract: Cervical cancer, one of the most common malignant tumors among women, is difficult to diagnose and study due to the complexity of its disease factors and the challenge of prediction. In this paper, a powerful machine learning model driven by real data is employed. With this technique, we model the detection methods of cervical cancer and determine the diagnostic accuracy of current mainstream methods for cervical cancer using a multi-layer perceptron. Finally, the importance of cervical cancer risk factors is analyzed with a random forest. The experimental results show that there is a close relationship between the risk factors and cervical cancer, and that, compared with other risk factors, age, number of sexual partners, and hormonal contraceptives have a greater influence on the diagnosis of cervical cancer. Therefore, our research not only improves the predictability of cervical cancer risk but also inspires the development of pathological models based on MLP and random forest.
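A sketch of the random-forest step: fit the forest and rank risk factors by impurity-based feature importance. The synthetic data and column names are placeholders for the study's actual risk factors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data; the study uses real cervical cancer risk-factor records.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
cols = ["age", "num_sexual_partners", "hormonal_contraceptives",
        "smokes", "num_pregnancies"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; higher means the factor splits classes better.
for name, imp in sorted(zip(cols, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```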

Proceedings ArticleDOI
Xianghao Zhan, Xiaoqing Guan, Rumeng Wu, Zhan Wang, You Wang, Guang Li
21 Mar 2019
TL;DR: This work proposes a method of using electronic nose to discriminate herbal medicines from different origins with better combinations of pattern recognition algorithms and feature engineering approaches for optimal classification performances.
Abstract: As pharmacists attach great significance to the geographical origins of herbal medicines, cheap, nondestructive and convenient methods for discriminating herbal medicines originating from diverse regions are much in need. This work proposes a method of using an electronic nose to discriminate herbal medicines from different origins. With 5 categories of herbal medicines and 3 to 4 geographical origins for each category, 8 pattern recognition algorithms prove the feasibility of the classification task, with SVM, LDA and BP neural networks showing better classification accuracy. Additionally, feature engineering approaches are used to facilitate classification: normalization based on each feature and each sensor, together with centralization, proves to be the better normalization approach for the classifiers; a proper degree of noise addition helps classifiers achieve better generalization ability; and feature selection with SNR can lead to more efficient classifiers by selecting the most meaningful features and disregarding unnecessary ones. This work provides insights for future herbal medicine evaluation based on electronic noses, with better combinations of pattern recognition algorithms and feature engineering approaches for optimal classification performance.

Proceedings ArticleDOI
04 Sep 2019
TL;DR: This work proposes a new method to incorporate information from the Electronic Health Records (EHRs) of patients and utilize hyperbolic embeddings of a medical ontology (i.e., ICD-9) in the prediction model, and shows that hyperbolic embeddings of ontological concepts give promising performance.
Abstract: Unplanned intensive care units (ICU) readmissions and in-hospital mortality of patients are two important metrics for evaluating the quality of hospital care. Identifying patients with higher risk of readmission to ICU or of mortality can not only protect those patients from potential dangers, but also reduce the high costs of healthcare. In this work, we propose a new method to incorporate information from the Electronic Health Records (EHRs) of patients and utilize hyperbolic embeddings of a medical ontology (i.e., ICD-9) in the prediction model. The results prove the effectiveness of our method and show that hyperbolic embeddings of ontological concepts give promising performance.
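Hyperbolic embeddings live in the Poincaré ball, where distances grow rapidly toward the boundary, which suits tree-like ontologies such as ICD-9. A minimal sketch of the distance function such embeddings are trained under (the example vectors are illustrative, not learned embeddings):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance in the Poincare ball: arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    uu = np.clip(np.dot(u, u), 0, 1 - eps)  # keep points strictly inside the ball
    vv = np.clip(np.dot(v, v), 0, 1 - eps)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

u = np.array([0.1, 0.2])  # e.g., a general diagnosis code near the origin
v = np.array([0.4, 0.5])  # e.g., a more specific code nearer the boundary
print(poincare_distance(u, v))
```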

Proceedings ArticleDOI
04 Sep 2019
TL;DR: An efficient algorithm, Tempo++, is developed for identifying co-evolving subnetworks within two given temporal networks, where each network may evolve at different rates and the rates of evolution may change over time.
Abstract: Biological networks describe the interactions among molecules. Unlike static biological networks at a single time point, temporal networks capture how the network topology evolves over time in response to external stimuli or internal variations. We say that two temporal networks have co-evolving subnetworks if the topologies of these subnetworks remain similar to each other as the networks evolve over time. Existing methods for identifying co-evolving patterns make the strong and unrealistic assumption that the two network topologies evolve at the same rate. In this paper, we consider the generalized problem, where each network may evolve at a different rate and the rates of evolution may change over time. Moreover, the two networks may have network topologies available for different numbers of time points. Existing methods fail to solve this problem as they rely on the strong prior assumption. We develop an efficient algorithm, Tempo++, for identifying co-evolving subnetworks within two given temporal networks. Unlike existing methods, Tempo++ does not assume that the networks have the same, uniform evolutionary rates. We experimentally demonstrate that Tempo++ scales efficiently and accurately on both synthetic and real datasets. Our results on E. coli time-resolved responses to five different environmental stress conditions demonstrate that Tempo++ identifies genes specific to those conditions that conform to well-known studies in the literature. The statistical significance of alignments found by Tempo++ outperforms that of existing strategies. Moreover, Tempo++ correctly identifies co-evolving networks with similar stress responses compared to networks with different stress responses.

Proceedings ArticleDOI
04 Sep 2019
TL;DR: This is the first study to employ a CNN to automatically detect MW using only EEG data; a channel-wise deep convolutional neural network model is developed to classify features of the focused state and MW extracted from EEG signals.
Abstract: Mind wandering (MW) is a ubiquitous phenomenon which reflects a shift in attention from task-related to task-unrelated thoughts. There is a need for intelligent interfaces that can reorient attention when MW is detected, due to its detrimental effects on performance and productivity. In this paper, we propose a deep learning model for MW detection using Electroencephalogram (EEG) signals. Specifically, we develop a channel-wise deep convolutional neural network (CNN) model to classify the features of the focused state and MW extracted from EEG signals. This is the first study to employ a CNN to automatically detect MW using only EEG data. The experimental results on the collected dataset demonstrate promising performance with 91.78% accuracy, 92.84% sensitivity, and 90.73% specificity. Finally, a data augmentation scheme is applied to increase the size of the training dataset and improve the classification results.

Journal ArticleDOI
01 Jan 2019
TL;DR: In this article, the importance of writing skills in English language learning has been discussed and some useful suggestions are given to both teachers and learners to make the writing skills a grand success in the ELL environment.
Abstract: Language is a medium of communication, and people use a language to convey their views, opinions, thoughts, ideas, reactions, emotions and passions. People carry out their communication in order to fulfill their everyday needs. Language plays a vital role in sharing people's ideas and feelings with others. Human beings are different from animals because of their oral and written communication skills. So, language has become an important tool of communication for human beings to convey their messages to others. Therefore, there is a need for human beings to learn language skills. In learning English too, learners have to acquire all four of its basic skills. Moreover, English language learners (ELLs) have to concentrate on these four skills, viz., listening, speaking, reading and writing. Writing is considered the most difficult of these skills for learners because of the complexity of the English language. In an English language learning (ELL) environment, learners find it difficult to produce good writing when given certain writing tasks. There are several reasons why students lack written communication skills, among them the use of old-fashioned methods by teachers, lack of proper motivation, large crowded classrooms, lack of facilities and learners' attitudes towards learning. Teachers of English have to study the problems of their ELLs and try to change their methods of teaching so that the learners can improve their writing skills. Moreover, teachers have to focus on innovative techniques of teaching writing so that the learners will follow them in order to develop their writing skills in English. Hence, teachers should involve ELLs in pair work or group work to develop their writing skills by embracing the latest techniques. This paper sheds light on the significance of writing skills in the ELL environment. First of all, it explains the importance of language skills, especially the skills involved in learning English. Then, it mainly focuses on writing skills, which are the most difficult for ELLs to acquire. Later, it also emphasizes both the classification and the role of writing. Furthermore, it stresses the influence of the internet on writing skills. It also discusses how ELLs' writing skills are assessed and elaborates on the role of teachers in developing learners' writing skills in the ELL environment. Finally, some useful suggestions are given to both teachers and learners to make writing skills a grand success in the ELL environment.

Proceedings ArticleDOI
04 Sep 2019
TL;DR: It is shown that autoencoders that model nonlinear relationships among variables outperform linear dimensionality reduction, and that autoencoders are likely to gain popularity in the structural biology community and open up further avenues of research.
Abstract: The protein modeling community has long been interested in dimensionality reduction of structure data. Motivated by rapid progress in neural network research, we investigate autoencoders of various architectures for reducing the dimensionality of protein structure data generated by template-free protein structure prediction methods. We show that autoencoders that model nonlinear relationships among variables outperform linear dimensionality reduction. We evaluate various architectures and propose a better-performing one. We further show that the learned, low-dimensional latent representations capture inherent information useful for structure prediction. Given the ease with which open-source neural network libraries, such as Keras, which we employ here, allow constructing, training, and evaluating neural networks, we believe that autoencoders will gain popularity in the structural biology community and open up further avenues of research.
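Since the paper mentions Keras, here is a minimal Keras autoencoder over fixed-length structure feature vectors. The input and latent dimensions, layer sizes, and random data are assumptions; the paper evaluates richer architectures than this.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 300, 10  # assumed sizes of structure features / latent code

encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(latent_dim),          # low-dimensional latent representation
])
decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim),           # reconstruction of the input
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(500, input_dim).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
latent = encoder.predict(X, verbose=0)
print(latent.shape)  # (500, 10)
```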

Proceedings ArticleDOI
04 Sep 2019
TL;DR: A new method, NamedKeys, to automatically identify meaningful and informative keyphrases from biomedical text that integrates named entity recognition, phrase embedding, phrase quality scoring, ranking, and clustering to extract author-assigned keywords from biomedical documents.
Abstract: A vast amount of biomedical literature is generated and digitized every year. As a result, there is a growing need to develop methods for discovering, accessing, and sharing knowledge from medical literature. Keyphrase extraction is the task of summarizing a text by identifying its key concepts. Keyphrases can be single-word or multi-word linguistic units that concisely represent a document. Although a variety of models have been proposed for automated keyphrase extraction, their performance is poor in comparison with other natural language processing tasks. The problem is even more daunting for the biomedical domain, where the text is filled with highly domain-specific terminology. We propose a new method, NamedKeys, to automatically identify meaningful and informative keyphrases from biomedical text. NamedKeys integrates named entity recognition, phrase embedding, phrase quality scoring, ranking, and clustering to extract author-assigned keywords from biomedical documents. Performance evaluation on PubMed abstracts demonstrates that NamedKeys achieves significant improvements over existing state-of-the-art keyphrase extraction models. Furthermore, we propose the first benchmark dataset for keyphrase extraction from biomedical text.

Journal ArticleDOI
18 Apr 2019
TL;DR: The proposed predictor HseSUMO uses half-sphere exposures of amino acids to predict sumoylation sites and has shown promising results on a benchmark dataset when compared with the state-of-the-art method.
Abstract: Post-translational modifications are viewed as an important mechanism for controlling protein function and are believed to be involved in multiple important diseases. However, profiling them using laboratory-based techniques remains challenging. Therefore, the development of accurate computational methods to predict post-translational modifications is particularly important for making progress in this area of research. This work explores the use of four half-sphere exposure-based features for computational prediction of sumoylation sites. Unlike most previously proposed approaches, which focused on patterns of amino acid co-occurrence, we were able to demonstrate that protein structural features can be sufficiently informative to achieve good predictive performance. The evaluation of our method demonstrated high sensitivity (0.9), accuracy (0.89) and Matthews correlation coefficient (0.78–0.79). We compared these results to the recently released pSumo-CD method and were able to demonstrate better performance of our method on the same evaluation dataset. The proposed predictor, HseSUMO, uses half-sphere exposures of amino acids to predict sumoylation sites. It has shown promising results on a benchmark dataset when compared with the state-of-the-art method. The extracted data of this study can be accessed at https://github.com/YosvanyLopez/HseSUMO.
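Half-sphere exposure itself can be computed with Biopython, which is one plausible way to derive such per-residue features; a sketch follows, with a placeholder PDB path. Whether the authors used Biopython is not stated in the abstract.

```python
from Bio.PDB import PDBParser
from Bio.PDB.HSExposure import HSExposureCB

# Placeholder structure file; any protein PDB file would do.
structure = PDBParser(QUIET=True).get_structure("prot", "example.pdb")
model = structure[0]

# Running the calculator annotates each residue in place: the number of
# CA neighbors in the upper and lower half-spheres around the CA-CB vector.
HSExposureCB(model)

for residue in model.get_residues():
    if residue.get_resname() == "LYS" and "EXP_HSE_B_U" in residue.xtra:
        up = residue.xtra["EXP_HSE_B_U"]    # neighbors in the side-chain half
        down = residue.xtra["EXP_HSE_B_D"]  # neighbors in the opposite half
        print(residue.get_id()[1], up, down)
```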

Proceedings ArticleDOI
04 Sep 2019
TL;DR: A beta-binomial distribution is proposed that better models CRISPR pooled screen data; it outperforms the other methods in both alignment and target identification and will also accelerate the discovery of novel biological findings from CRISPR pooled screens.
Abstract: The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas9 gene-editing platform is simple, cost-effective, and robust, and it has made high-throughput pooled screen approaches accessible to any laboratory. As a result, large amounts of next-generation sequencing data are being generated at a fast rate, but utilizing these data is challenging due to three hurdles: designing a statistical model to identify the hit genes, developing an accurate alignment algorithm to extract the readout for each single-guide RNA (sgRNA) from the sequence data, and minimizing the parameters that have to be tuned. Several methods have been proposed to tackle these challenges, but most of them use the negative binomial distribution to model CRISPR pooled screen data, even though the structure of sgRNA screen data is far from the nature of RNA-seq data, which tend to be over-dispersed. In RNA-seq data the length of the transcript varies, while the length of the sgRNAs used for the CRISPR system is designed to be the same for every gene, which often leads to under-dispersion. Here we propose CRISPRBetaBinomial (CB2), which uses the beta-binomial distribution to better model CRISPR pooled screen data (https://CRAN.R-project.org/package=CB2) [Jeong01062019]. We used published screen datasets to benchmark the accuracy of CB2 and compared it to eight published methods. CB2 outperforms the other methods in both alignment and target identification. CB2 will also accelerate the discovery of novel biological findings from CRISPR pooled screens in cooperation with CRISPRcloud (http://crispr.nrihub.org).
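To see why the beta-binomial helps, compare its likelihood with a fixed-p binomial on the same counts: the extra shape parameters absorb between-replicate dispersion. A scipy sketch with made-up counts, not CB2's actual estimator:

```python
import numpy as np
from scipy import stats

n = 1000                       # total reads assigned to a gene's sgRNAs (toy)
k = np.array([180, 210, 195])  # reads for one sgRNA across three replicates

# Binomial with fixed p versus beta-binomial with the same mean proportion:
# a, b give mean a/(a+b) = 0.2 but allow extra (or, with large a+b, little)
# variance around it.
p = 0.2
a, b = 4.0, 16.0

print(stats.binom.logpmf(k, n, p).sum())       # fixed-dispersion fit
print(stats.betabinom.logpmf(k, n, a, b).sum())  # dispersion-flexible fit
```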

Proceedings ArticleDOI
04 Sep 2019
TL;DR: A model for predicting the levels of four major biomarkers related to type 2 diabetes after a one-year period is created using a wide and deep neural network developed based on the long short-term memory (LSTM) structure to process the time-series dataset collected by the wearables.
Abstract: With the increasing availability of wearable devices, continuous monitoring of individuals' physiological and behavioral patterns has become significantly more accessible. Access to these continuous patterns of individuals' statuses offers an unprecedented opportunity for studying complex diseases and health conditions such as type 2 diabetes (T2D). T2D is a widely common chronic disease whose roots and progression patterns are not fully understood. Predicting the progression of T2D can inform timely and more effective interventions to prevent or manage the disease. In this study, we used a dataset of 63 patients with T2D that includes data from two different types of wearable devices worn by the patients: continuous glucose monitoring (CGM) devices and activity trackers (ActiGraphs). Using this dataset, we created a model for predicting the levels of four major biomarkers related to T2D after a one-year period. We developed a wide and deep neural network and used the demographic information, lab tests, and wearable sensor data to create the model. The deep part of our method was developed based on the long short-term memory (LSTM) structure to process the time-series dataset collected by the wearables. In predicting the patterns of the four biomarkers, we obtained a root mean square error of ±1.67% for HbA1c, ±6.22 mg/dl for HDL cholesterol, ±10.46 mg/dl for LDL cholesterol, and ±18.38 mg/dl for triglycerides. Compared to existing models for studying T2D, our model offers a more comprehensive tool for combining a large variety of factors that contribute to the disease.
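A minimal Keras sketch of the wide-and-deep idea: a linear "wide" path over demographics and lab values concatenated with an LSTM "deep" path over the wearable time series, regressing the four biomarkers jointly. All shapes and sizes here are assumptions, not the paper's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

static_in = keras.Input(shape=(20,), name="demographics_labs")
series_in = keras.Input(shape=(288, 3), name="cgm_activity")  # timesteps x channels

wide = static_in                                  # pass-through "wide" features
deep = layers.LSTM(64)(series_in)                 # summary of the time series
merged = layers.concatenate([wide, deep])
out = layers.Dense(4, name="biomarkers")(merged)  # HbA1c, HDL, LDL, triglycerides

model = keras.Model([static_in, series_in], out)
model.compile(optimizer="adam", loss="mse")

# Placeholder batch, just to show the expected shapes.
X_static = np.random.rand(8, 20).astype("float32")
X_series = np.random.rand(8, 288, 3).astype("float32")
print(model.predict([X_static, X_series], verbose=0).shape)  # (8, 4)
```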

Proceedings ArticleDOI
04 Sep 2019
TL;DR: In this article, relationships between multilocus sequence typing (MLST) SNPs and methicillin/oxacillin resistance or susceptibility were studied, using a public database, by means of cross-tabulation tests, multivariable logistic regression (LR), decision trees, rule bases, and random forests (RF).
Abstract: Methicillin-resistant Staphylococcus aureus (MRSA) is currently the most commonly identified antibiotic-resistant pathogen in US hospitals. Resistance to methicillin is carried by SCCmec genetic elements. Multilocus sequence typing (MLST) covers internal fragments of seven housekeeping genes of S. aureus. In conjunction with mec typing, MLST has been used to create an international nomenclature for S. aureus. MLST sequence types differing by even a single nucleotide polymorphism (SNP) are considered distinct. In this work, relationships between MLST SNPs and methicillin/oxacillin resistance or susceptibility were studied, using a public database, by means of cross-tabulation tests, multivariable (phylogenetic) logistic regression (LR), decision trees, rule bases, and random forests (RF). Model performance was assessed through multiple cross-validation. Hierarchical clustering of SNPs was also employed to analyze mutational covariation. The number of instances with a known methicillin (oxacillin) antibiogram result was 1526 (649), of which 63% (54%) were resistant to methicillin (oxacillin). In univariable analysis, several MLST SNPs were found to be strongly associated with antibiotic resistance/susceptibility. An RF model correctly predicted resistance/susceptibility to methicillin and oxacillin in 75% and 63% of cases (cross-validated), respectively. Results were similar for LR. Hierarchical clustering of the aforementioned SNPs yielded a high level of covariation both within the same gene and across different genes; this suggests strong genetic linkage between SNPs of housekeeping genes and antibiotic-resistance-associated genes. This finding provides a basis for rapid identification of antibiotic-resistant S. aureus lineages using a small number of genomic markers. The number of sites could subsequently be increased moderately to improve the sensitivity and specificity of genotypic tests for resistance that do not rely on direct detection of the resistance marker itself.

Proceedings ArticleDOI
21 Mar 2019
TL;DR: This method combines the advantages of both traditional segmentation-based and density-based methods, overcomes limitations such as cell clumping and overlapping, and can also bypass the fine-tuning step that previous density-based methods required when applied to different datasets.
Abstract: A stacked deep convolutional neural network (DCNN) model was generated to predict cell density maps and count cells. We treated cell counting as a regression problem, with a preprocessing step to generate cell density maps. We implemented this approach by integrating two trustworthy, state-of-the-art model architectures (U-Net and VGG19). This method combines the advantages of both traditional segmentation-based and density-based methods. It overcomes limitations such as cell clumping and overlapping, and it can also bypass the fine-tuning step that previous density-based methods required when applied to different datasets. A publicly available, well-labeled dataset was used to train and test the model. An unlabeled real dataset generated in-house was used to evaluate performance.
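The preprocessing step they describe typically places a small Gaussian at each dot annotation, so the target map sums to the cell count; a sketch under that assumption (the annotation format and sigma are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(shape, points, sigma=4.0):
    """Build a density map from dot annotations: one unit of mass per cell,
    spread by a Gaussian. The map's total sum stays ~equal to the count."""
    dm = np.zeros(shape, dtype=np.float32)
    for r, c in points:
        dm[int(r), int(c)] += 1.0
    return gaussian_filter(dm, sigma)

points = [(20, 30), (64, 80), (100, 40)]   # toy dot annotations (row, col)
dm = density_map((128, 128), points)
print(dm.sum())  # ~3.0; a regressor trained on dm can be summed to count cells
```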

Journal ArticleDOI
01 Jan 2019
TL;DR: Since learning all four skills of the English language is essential for learners to learn the language in a systematic way, teachers should put more effort into improving the standards of the learners.
Abstract: In the present competitive world, communication plays a vital role in almost all arenas. It becomes a barrier for people to communicate without learning a language. So, there is a need for people to learn a language in order to convey their thoughts and ideas to other people all over the world. People need a common language to communicate internationally, and English serves the purpose since it is the only language spoken all around the globe. Therefore, learning English has become mandatory in the present scenario, and most non-native speakers of English are trying to learn it using various methods and approaches. First of all, to learn a foreign language, one must devote more time to it and regularly practice all four language skills: listening, speaking, reading and writing. As the classroom is the right place to practice all these skills, teachers of English should understand the needs of the learners and try to implement various techniques, methods and approaches to improve the language skills of foreign or second language learners. Moreover, teachers should inspire learners by following a learner-centred approach in the classroom and adopting interesting and needful material to improve all four skills of English. Since learning all four skills of the English language is essential for learners to learn the language in a systematic way, teachers should put more effort into improving the standards of the learners. This paper sheds light on the importance of the four language skills of the English language. First of all, it discusses not only the importance of English but also the importance of the language skills in English. Later, it elaborates in detail on the importance of teaching language skills to second or foreign language learners of English. Furthermore, it explains the advantages of each skill comprehensively. Finally, some valuable suggestions are given to teachers and learners of English to improve their teaching-learning process and attain good results.

Journal ArticleDOI
01 Jan 2019
TL;DR: In this paper, the importance of collaborative learning and the roles of teachers and learners in collaborative learning are clearly expounded, and some useful suggestions are given to teachers as well as learners for implementing this technique in English classrooms.
Abstract: Learning is a continuous process in one's life. Humans start learning in childhood and keep learning until they die. It may be a language, a subject, a course or skills (life skills and professional skills). Language learning is also such a process, where people learn languages with constant effort and determination. Teachers follow techniques and strategies in language classrooms, and learners follow them in order to learn a language successfully. They follow different techniques such as team work, group work, pair work, collaborative learning and so on. Collaborative learning is one of the techniques that learners follow to learn new things, and the same is followed while learning languages. In learning English too, learners adopt this collaborative learning technique and learn things fast and in a systematic way. The present paper focuses on the advantages of collaborative learning and how it is useful in English language classrooms. Moreover, the importance of collaborative learning and the roles of teachers and learners in collaborative learning are clearly expounded. This paper mainly focuses on collaborative learning in English language classrooms and elaborates on how it is useful for learners in doing the given tasks. Finally, some useful suggestions are given to teachers as well as learners for implementing this technique in English classrooms.