Showing papers in "International Journal of Knowledge Discovery in Bioinformatics in 2012"


Journal ArticleDOI
TL;DR: 10-fold cross-validation is employed on cost-sensitive algorithms with the base classifiers Naive Bayes, Sequential Minimal Optimization (SMO), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and C4.5; the results show that the SMO algorithm yielded very high sensitivity (97.22%) and accuracy (92.09%).
Abstract: One of the main causes of death the world over is the family of cardiovascular diseases, of which coronary artery disease (CAD) is a major type. Angiography is the principal diagnostic modality for the stenosis of heart arteries; however, it is associated with a high rate of complications and high costs. The present study applied data-mining algorithms to the Z-Alizadeh Sani dataset to investigate and compare rule-based and feature-based classifiers, and to examine why a preprocessing algorithm is effective on a dataset. Misclassification of diseased patients has more serious consequences than that of healthy ones. To this end, this paper employs 10-fold cross-validation on cost-sensitive algorithms with the base classifiers Naive Bayes, Sequential Minimal Optimization (SMO), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and C4.5. The results show that the SMO algorithm yielded very high sensitivity (97.22%) and accuracy (92.09%).

40 citations
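
The entry above reports sensitivity and accuracy and treats a missed diseased patient as costlier than a false alarm. A minimal sketch of how those quantities relate is given below. The function name `evaluate`, the cost values, and the labels are illustrative assumptions, not data or settings from the study.

```python
def evaluate(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Return (sensitivity, accuracy, total misclassification cost).

    Labels are 1 (diseased) and 0 (healthy). The default costs are
    illustrative assumptions, not values taken from the study.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Sensitivity is the fraction of diseased patients that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    # A false negative is charged more than a false positive.
    cost = fn * fn_cost + fp * fp_cost
    return sensitivity, accuracy, cost


# Made-up labels for eight patients, used only to exercise the function.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
sensitivity, accuracy, cost = evaluate(y_true, y_pred)
```

With unequal costs, two classifiers of equal accuracy can differ in total cost, which is one reason to report sensitivity alongside accuracy.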


Journal ArticleDOI
TL;DR: This article takes a closer look at methods that use protein-protein interaction for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.
Abstract: Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this article, the authors provide an introduction to the task of automated protein function prediction. They discuss the motivation for automated protein function prediction, the challenges faced in this task, and some approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interaction for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.

24 citations


Journal ArticleDOI
TL;DR: This study provides insight into the underlying factors that influence hospital length of stay, using a multi-tiered data mining approach to form training sets and identifying patients who need aggressive or moderate early interventions to prevent prolonged stays.
Abstract: A model to predict the Length of Stay (LOS) for hospitalized patients can be an effective tool for measuring the consumption of hospital resources. Such a model will enable early interventions to prevent complications and prolonged LOS and also enable more efficient utilization of manpower and facilities in hospitals. In this paper, the authors propose an approach for Predicting Hospital Length of Stay (PHLOS) using a multi-tiered data mining approach. In their approach, the authors form training sets using groups of similar claims identified by k-means clustering and perform classification using ten different classifiers. The authors provide a combined measure of performance to statistically evaluate and rank the classifiers for different levels of clustering. They consistently found that using clustering as a precursor to form the training set gives better prediction results than non-clustering-based training sets. The authors also found the accuracies to be consistently higher than some reported in the current literature for predicting individual patient LOS. Binning the LOS into three groups of short, medium, and long stays, their method identifies patients who need aggressive or moderate early interventions to prevent prolonged stays. The classification techniques used in this study are interpretable, enabling the authors to examine the details of the classification rules learned from the data. As a result, this study provides insight into the underlying factors that influence hospital length of stay. The authors also examine their prediction results for three randomly selected conditions with domain expert insights.

18 citations


Journal ArticleDOI
TL;DR: The results show that classifying the VOCs leads to substantial gain over chance but of varying accuracy, and the Rotation Forest ensemble (AUC 0.825) had the highest accuracy for COPD classification from exhaled VOCs.
Abstract: The diagnosis of Chronic Obstructive Pulmonary Disease (COPD) is based on symptoms, clinical examination, exposure to risk factors (smoking and certain occupational dusts), and confirmation of lung airflow obstruction on spirometry. However, most people with COPD remain undiagnosed and controversies regarding spirometry persist. Developing accurate and reliable automated tests for the early diagnosis of COPD would aid successful management. We evaluated the diagnostic potential of a non-invasive test based on chemical analysis of volatile organic compounds (VOCs) from exhaled breath. We applied 26 individual classifier methods and 30 state-of-the-art classifier ensemble methods to a large VOC data set from 109 patients with COPD and 63 healthy controls of similar age; we evaluated the classification error, the F measure, and the area under the ROC curve (AUC). The results show that classifying the VOCs leads to substantial gain over chance but of varying accuracy. We found that the Rotation Forest ensemble (AUC 0.825) had the highest accuracy for COPD classification from exhaled VOCs.

10 citations


Journal ArticleDOI
TL;DR: Dis2PPI is designed, a workflow that integrates information retrieved from genetic disease databases and interactomes and uses this information to mine protein-protein interaction (PPI) networks, which can be used in systems biology analyses.
Abstract: Experiments in bioinformatics are based on protocols that employ different steps for data mining and data integration, collectively known as computational workflows. Considering the use of databases in the biomedical sciences, software that is able to query multiple databases is desirable. Systems biology, which encompasses the design of interactomic networks to understand complex biological processes, can benefit from computational workflows. Unfortunately, the use of computational workflows in systems biology is still very limited, especially for applications associated with the study of disease. To address this limitation, we designed Dis2PPI, a workflow that integrates information retrieved from genetic disease databases and interactomes. Dis2PPI extracts protein names from a disease report and uses this information to mine protein-protein interaction (PPI) networks. The data gathered from this mining can be used in systems biology analyses. To demonstrate the functionality of Dis2PPI for systems biology analyses, the authors mined information about xeroderma pigmentosum and Cockayne syndrome, two monogenic diseases that lead to neurodegeneration and, when patients are exposed to sunlight, to skin cancer.

10 citations


Journal ArticleDOI
TL;DR: Data sources, bioinformatics tools, and computational methods for prioritizing disease candidate genes and identifying disease pathways, and different similarity measures and prevailing methods for integrating results from different functional aspects are introduced.
Abstract: Genetic factors play a major role in the etiology of many human diseases. Genome-wide experimental methods produce an increasing number of genes associated with such diseases. This article introduces data sources, bioinformatics tools, and computational methods for prioritizing disease candidate genes and identifying disease pathways. The main strategy is to examine the similarity among the candidate genes and known disease genes at the functional level. The authors review different similarity measures and prevailing methods for integrating results from different functional aspects. The authors hope this article will help advocate many useful resources that the researchers can use to investigate diseases of their interest.

9 citations


Journal ArticleDOI
TL;DR: The authors propose an enhancement to the LASSO, a shrinkage and selection technique that induces parameter sparsity by penalizing a model's objective function and devise a coordinate descent algorithm to minimize the corresponding objective function.
Abstract: Personalized medicine is customizing treatments to a patient's genetic profile and has the potential to revolutionize medical practice. An important process used in personalized medicine is gene expression profiling. Analyzing gene expression profiles is difficult, because there are usually few patients and thousands of genes, leading to the curse of dimensionality. To combat this problem, researchers suggest using prior knowledge to enhance feature selection for supervised learning algorithms. The authors propose an enhancement to the LASSO, a shrinkage and selection technique that induces parameter sparsity by penalizing a model's objective function. Their enhancement gives preference to the selection of genes that are involved in similar biological processes. The authors' modified LASSO selects similar genes by penalizing interaction terms between genes. They devise a coordinate descent algorithm to minimize the corresponding objective function. To evaluate their method, the authors created simulation data where they compared their model to the standard LASSO model and an interaction LASSO model. The authors' model outperformed both the standard and interaction LASSO models in terms of detecting important genes and gene interactions for a reasonable number of training samples. They also demonstrated the performance of their method on a real gene expression data set from lung cancer cell lines.

8 citations
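
The entry above builds on the LASSO and a coordinate descent solver. As background, the sketch below shows cyclic coordinate descent for the standard LASSO objective only. It does not implement the authors' modified penalty on gene interaction terms, and the function names and toy data are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X b, and b starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # an all-zero feature keeps a zero coefficient
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(column, residual)) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            residual = [r - v * delta for v, r in zip(column, residual)]
            beta[j] = new_beta
    return beta


# Toy data with orthogonal columns; the weak second feature is shrunk to zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = lasso_coordinate_descent(X, y, lam=0.1)
```

In this toy case the first coefficient settles near 1.8 and the second is exactly zero, which illustrates the sparsity that the L1 penalty induces.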


Journal ArticleDOI
TL;DR: Topological features of proteins encoded by Drosophila melanogaster aging genes versus those encoded by non-aging genes in a protein-protein interaction (PPI) network are analyzed, and aging genes are found to be characterized by several network topological features such as higher in degrees.
Abstract: An important task of aging research is to find genes that regulate lifespan. Wet-lab identification of aging genes is a tedious and labor-intensive activity. Developing an algorithm to predict aging genes will be greatly helpful. In this paper, we systematically analyzed topological features of proteins encoded by Drosophila melanogaster aging genes versus those encoded by non-aging genes in a protein-protein interaction (PPI) network and found that aging genes are characterized by several network topological features such as higher in degrees. Aging genes were also found to be enriched in certain functions. Based on these features, an algorithm was developed to detect aging genes genome wide. Using decision trees, 1014 novel aging genes were predicted with a posterior probability score, describing possible involvement in aging, of no less than 1. Evidence supporting our prediction can be found.

8 citations


Journal ArticleDOI
TL;DR: Experimental work is done where five time domain features are obtained from EEG signals and used by a set of classifiers, namely, Bayesian, K-nearest neighbor, neural network, linear discriminant analysis, and support vector machine classifiers.
Abstract: Different brain states and conditions can be captured by electroencephalogram (EEG) signals. EEG-based epileptic seizure detection techniques often reduce these signals into sets of discriminant features. In this work, an evidence theory-based approach for epileptic seizure detection, using several classifiers, is proposed. Within the framework of evidence theory, each of these classifiers is considered a source of information and given a certain weight based on both its overall classification accuracy and its precision rate for the respective brain state. These sources are fused using Dempster's rule of combination. Experimental work is done where five time domain features are obtained from EEG signals and used by a set of classifiers, namely, Bayesian, K-nearest neighbor, neural network, linear discriminant analysis, and support vector machine classifiers. A higher classification accuracy of 89.5% is achieved, compared to 75.07% and 87.71% for the worst and best individual classifiers, respectively.

6 citations
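
The entry above fuses classifier outputs with Dempster's rule of combination. A minimal sketch of the unweighted rule for two sources follows. The accuracy- and precision-based weighting described in the abstract is not reproduced, and the hypothesis labels and mass values are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass, and the
    masses within each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so the remaining masses sum to 1.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


# Hypothetical masses from two classifiers over two illustrative brain states.
SEIZURE = frozenset({"seizure"})
NORMAL = frozenset({"normal"})
EITHER = SEIZURE | NORMAL

m_first = {SEIZURE: 0.6, NORMAL: 0.1, EITHER: 0.3}
m_second = {SEIZURE: 0.5, NORMAL: 0.2, EITHER: 0.3}
fused = dempster_combine(m_first, m_second)
```

Mass placed on `EITHER` expresses a source's uncertainty, so a less reliable classifier can be modeled by moving more of its mass there before combining.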


Journal ArticleDOI
TL;DR: Experiments demonstrated that the data-driven approach, for the first time, allows identifying selective spatial pattern changes of the human connectome in AD that perfectly matched grey matter changes of the disease.
Abstract: Alzheimer's disease (AD) is the most common cause of age-related dementia, which prominently affects the human connectome. In this paper, the authors focus on the question of how they can identify disrupted spatial patterns of the human connectome in AD based on a data mining framework. Using diffusion tractography, the human connectomes for each individual subject were constructed based on two diffusion-derived attributes, fiber density and fractional anisotropy, to represent the structural brain connectivity patterns. After frequent subgraph mining, the abnormal score was finally defined to identify disrupted subgraph patterns in patients. Experiments demonstrated that their data-driven approach, for the first time, allows identifying selective spatial pattern changes of the human connectome in AD that perfectly matched grey matter changes of the disease. Their findings also bring new insights into how AD propagates and disrupts the regional integrity of large-scale structural brain networks in a fiber connectivity-based way.

5 citations


Journal ArticleDOI
TL;DR: A model to predict the Length of Stay (LOS) for hospitalized patients can be an effective tool for measuring the consumption of hospital resources and will enable early interventions.
Abstract: A model to predict the Length of Stay LOS for hospitalized patients can be an effective tool for measuring the consumption of hospital resources. Such a model will enable early interventions to pre...

Journal ArticleDOI
TL;DR: In this article, the authors review the state of the art in the use of protein-protein interactions (PPIs) within the context of the interpretation of genomic experiments and report the available resources and methodologies used to create a curated compilation of PPIs, introducing a novel approach to filter interactions.
Abstract: Here the authors review the state of the art in the use of protein-protein interactions (PPIs) within the context of the interpretation of genomic experiments. They report the available resources and methodologies used to create a curated compilation of PPIs, introducing a novel approach to filter interactions. Special attention is paid to the complexity of the topology of the networks formed by proteins (nodes) and pairwise interactions (edges). These networks can be studied using graph theory, and a brief introduction to the characterization of biological networks and definitions of the most commonly used network parameters is also given. A report on the available resources to perform different modes of functional profiling using PPI data is also provided, along with a discussion of the approaches that have typically been applied in this context. They also introduce a novel methodology for the evaluation of networks and some examples of its application.

Journal ArticleDOI
TL;DR: A visual knowledge discovery from databases application is created to efficiently and accurately understand a large collection of fixed and temporal patients' data in the Intensive Care Unit in order to prevent the occurrence of nosocomial infections.
Abstract: Improving the confidence in and comprehensibility of medical data, as well as the possibility of using human capacities in medical pattern recognition, is of significant interest for the coming years. In this context, we have created a visual knowledge discovery from databases application. It has been developed to efficiently and accurately understand a large collection of fixed and temporal patients' data in the Intensive Care Unit in order to prevent the occurrence of nosocomial infections. It is based on a data visualization technique, the perspective wall. Its application is a good example of the usefulness of data visualization techniques in the medical domain.

Journal ArticleDOI
TL;DR: The present work demonstrates that image features describing the structural properties of figures are sufficient for the figure retrieval task, and presents a methodology using a novel feature descriptor, namely Fourier Edge Orientation Autocorrelogram (FEOAC), to describe structural properties of figures and build an effective biomedical document retrieval system.
Abstract: The multi-modal and unstructured nature of documents makes their retrieval from healthcare document repositories a challenging task. Text-based retrieval is the conventional approach used for solving this problem. In this paper, the authors explore an alternate avenue of using embedded figures for the retrieval task. Usually, the context of a document is directly reflected in the associated figures; therefore, embedded text within these figures along with image features have been used for similarity-based retrieval of figures. The present work demonstrates that image features describing the structural properties of figures are sufficient for the figure retrieval task. First, the authors analyze the problem of figure retrieval from biomedical literature and identify significant classes of figures. Second, they use edge information as a means to discriminate between structural properties of each figure category. Finally, the authors present a methodology using a novel feature descriptor, namely Fourier Edge Orientation Autocorrelogram (FEOAC), to describe structural properties of figures and build an effective biomedical document retrieval system. The experimental results demonstrate the better retrieval performance and overall improvement of FEOAC for the figure retrieval task, especially when most of the edge information is retained. Apart from invariance to scale, rotation, and non-uniform illumination, the proposed feature descriptor is shown to be relatively robust to noisy edges.

Journal ArticleDOI
Koji Tsuda
TL;DR: This tutorial article reviews basics about frequent pattern mining algorithms, including itemset mining, association rule mining, and graph mining, which can find frequently appearing substructures in discrete data.
Abstract: In this tutorial article, the author reviews basics about frequent pattern mining algorithms, including itemset mining, association rule mining, and graph mining. These algorithms can find frequently appearing substructures in discrete data. They can discover structural motifs, for example, from mutation data, protein structures, and chemical compounds. As they have been primarily used for business data, biological applications are not so common yet, but their potential impact would be large. Recent advances in computers including multicore machines and ever increasing memory capacity support the application of such methods to larger datasets. The author explains technical aspects of the algorithms, but does not go into detail. Current biological applications are summarized and possible future directions are given.
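
To make the itemset mining mentioned in the entry above concrete, here is a minimal Apriori-style sketch. It is one common way to mine frequent itemsets and is not necessarily the algorithm presented in the tutorial. The function name and the toy mutation sets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical mutation sets observed in five samples.
samples = [
    {"mutA", "mutB", "mutC"},
    {"mutA", "mutB"},
    {"mutA", "mutC"},
    {"mutB", "mutC"},
    {"mutA", "mutB", "mutC"},
]
patterns = frequent_itemsets(samples, min_support=3)
```

Here every single mutation and every pair appears in at least three samples, while the full triple appears in only two and is therefore not reported.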

Journal ArticleDOI
TL;DR: The authors' new, alignment-free method facilitates the analysis of the myriad of tandem repeats that occur in the human genome and they believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats.
Abstract: Tandem repeats in DNA sequences are extremely relevant in biological phenomena and diagnostic tools. Computational programs that discover these tandem repeats generate a huge volume of data, which is often difficult to decipher without further organization. In this paper, the authors describe a new method for post-processing tandem repeats through clustering and classification. Their work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of the clusters for the tandem repeats in the human genome shows that the method yields a well-defined grouping in which similarity among repeats is apparent. The authors' new, alignment-free method facilitates the analysis of the myriad of tandem repeats that occur in the human genome and they believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats.
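
The n-gram representation described in the entry above can be sketched as follows. The abstract does not name the distance measures used, so cosine distance is an assumption chosen for illustration, and the sequences are made up.

```python
from collections import Counter
from math import sqrt


def ngram_profile(sequence, n=3):
    """Count every overlapping n-gram in a DNA sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def cosine_distance(p, q):
    """Return 1 - cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


# Made-up repeats: two copies of an ACG motif and one unrelated TTA motif.
short_acg = ngram_profile("ACGACGACG")
long_acg = ngram_profile("ACGACGACGACG")
tta = ngram_profile("TTATTATTA")

near = cosine_distance(short_acg, long_acg)  # small: same motif, different length
far = cosine_distance(short_acg, tta)        # 1.0: no shared 3-grams
```

Because the profiles are built from n-gram counts alone, no sequence alignment is needed, and the resulting pairwise distances can be passed to any standard clustering routine.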