
Showing papers in "Biodata Mining in 2011"


Journal ArticleDOI
TL;DR: Approaches, models and methods from the graph theory universe are demonstrated, and ways in which they can be used to reveal hidden properties and features of a network are discussed, to better understand the biological significance of the system.
Abstract: Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected by thousands of edges. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.
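To make the kind of network profiling described above concrete, here is a minimal, illustrative Python sketch using the networkx library; the random toy graph stands in for a real biological network, and the chosen measures (degree distribution, clustering, centrality, diameter) are a representative sample of graph-theoretic profiles, not the article's specific toolset.

```python
import networkx as nx

# Toy stand-in for a biological network; real networks have thousands of
# nodes and edges.
G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)

# Global topological features commonly used to profile networks.
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("average clustering:", nx.average_clustering(G))
print("degree histogram (first 10 bins):", nx.degree_histogram(G)[:10])

# Node-level centralities highlight potentially important components (hubs).
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
print("top-degree nodes:", sorted(degree, key=degree.get, reverse=True)[:5])
print("top-betweenness node:", max(betweenness, key=betweenness.get))

# Component structure and path lengths expose modular, small-world features.
largest = max(nx.connected_components(G), key=len)
print("diameter of largest component:", nx.diameter(G.subgraph(largest)))
```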

595 citations


Journal ArticleDOI
TL;DR: Comprehensive experimental results on the Online Mendelian Inheritance in Man (OMIM) database show that DADA outperforms existing methods in prioritizing candidate disease genes, and demonstrate the importance of employing accurate statistical models and associated adjustment methods in network-based disease gene prioritization, as well as other network-based functional inference applications.
Abstract: High-throughput molecular interaction data have been used effectively to prioritize candidate genes that are linked to a disease, based on the observation that the products of genes associated with similar diseases are likely to interact with each other heavily in a network of protein-protein interactions (PPIs). An important challenge for these applications, however, is the incomplete and noisy nature of PPI data. Information flow based methods alleviate these problems to a certain extent, by considering indirect interactions and multiplicity of paths. We demonstrate that existing methods are likely to favor highly connected genes, making prioritization sensitive to the skewed degree distribution of PPI networks, as well as ascertainment bias in available interaction and disease association data. Motivated by this observation, we propose several statistical adjustment methods to account for the degree distribution of known disease and candidate genes, using a PPI network with associated confidence scores for interactions. We show that the proposed methods can detect loosely connected disease genes that are missed by existing approaches; however, this improvement might come at the price of more false negatives for highly connected genes. Consequently, we develop a suite called DADA, which includes different uniform prioritization methods that effectively integrate existing approaches with the proposed statistical adjustment strategies. Comprehensive experimental results on the Online Mendelian Inheritance in Man (OMIM) database show that DADA outperforms existing methods in prioritizing candidate disease genes. These results demonstrate the importance of employing accurate statistical models and associated adjustment methods in network-based disease gene prioritization, as well as other network-based functional inference applications. DADA is implemented in Matlab and is freely available at http://compbio.case.edu/dada/ .
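DADA itself is a Matlab suite; as a rough illustration of the underlying idea, the sketch below implements information flow as a random walk with restart from known disease genes, followed by one simple degree adjustment (dividing by a uniform-seed baseline). The toy network, the restart probability and this particular adjustment are assumptions for illustration, not DADA's exact statistical models.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Toy symmetric, weighted PPI adjacency matrix; real networks are large and sparse.
A = rng.random((n, n)) * (rng.random((n, n)) < 0.1)
A = np.triu(A, 1)
A = A + A.T

def rwr(A, seeds, restart=0.3, iters=200):
    """Random walk with restart: information flow from seed (disease) genes."""
    W = A / A.sum(axis=0, keepdims=True).clip(min=1e-12)  # column-normalize
    p0 = np.zeros(len(A))
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(iters):
        p = (1 - restart) * W @ p + restart * p0
    return p

seeds = [0, 3, 7]                        # known disease genes (toy)
raw = rwr(A, seeds)

# Degree adjustment: divide by the score under uniform seeding so hubs are
# not favored merely for being highly connected.
baseline = rwr(A, list(range(n)))
adjusted = raw / baseline.clip(min=1e-12)

candidates = [i for i in range(n) if i not in seeds]
print("top candidates:", sorted(candidates, key=lambda i: adjusted[i], reverse=True)[:5])
```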

160 citations


Journal ArticleDOI
TL;DR: The predicted targets derived from approximately 20% of all human miRNAs constructed biologically meaningful molecular networks, supporting the view that the set of miRNA targets regulated by a single miRNA generally constitutes a biological network of functionally-associated molecules in human cells.
Abstract: MicroRNAs (miRNAs) mediate posttranscriptional regulation of protein-coding genes by binding to the 3' untranslated region of target mRNAs, leading to translational inhibition, mRNA destabilization or degradation, depending on the degree of sequence complementarity. In general, a single miRNA concurrently downregulates hundreds of target mRNAs. Thus, miRNAs play a key role in the fine-tuning of diverse cellular functions, such as development, differentiation, proliferation, apoptosis and metabolism. However, it remains to be fully elucidated whether the set of miRNA target genes regulated by an individual miRNA in the whole human microRNAome generally constitutes a biological network of functionally-associated molecules or simply reflects a random set of functionally-independent genes. The complete set of human miRNAs was downloaded from miRBase Release 16. We explored the target genes of individual miRNAs by using the Diana-microT 3.0 target prediction program, and selected the genes with a miTG score ≥ 20 as the set of highly reliable targets. Then, Entrez Gene IDs of miRNA target genes were uploaded onto KeyMolnet, a tool for analyzing molecular interactions on a comprehensive knowledgebase by the neighboring network-search algorithm. The generated network, compared side by side with human canonical networks of the KeyMolnet library, composed of 430 pathways, 885 diseases, and 208 pathological events, enabled us to identify the canonical network with the most significant relevance to the extracted network. Among the 1,223 human miRNAs examined, Diana-microT 3.0 predicted reliable targets for 273 miRNAs. Among them, KeyMolnet successfully extracted molecular networks for 232 miRNAs. The most relevant pathway was transcriptional regulation by the transcription factors RB/E2F, the most relevant disease was adult T cell lymphoma/leukemia, and the most relevant pathological event was cancer. The predicted targets derived from approximately 20% of all human miRNAs constructed biologically meaningful molecular networks, supporting the view that the set of miRNA targets regulated by a single miRNA generally constitutes a biological network of functionally-associated molecules in human cells.

75 citations


Journal ArticleDOI
TL;DR: By adopting GRID technology, the 3D reconstruction algorithm FT-COMAR is benchmarked on a huge set of non-redundant proteins, taking random noise into consideration; this makes the computation the largest ever performed for the task at hand.
Abstract: The present knowledge of protein structures at the atomic level derives from some 60,000 molecules. Yet the exponentially growing set of hypothetical protein sequences comprises some 10 million chains, and this makes protein structure prediction one of the challenging goals of bioinformatics. In this context, the representation of proteins with contact maps is an intermediate step of fold recognition and constitutes the input of contact map predictors. However, contact map representations require fast and reliable methods to reconstruct the specific folding of the protein backbone. In this paper, by adopting GRID technology, our 3D reconstruction algorithm FT-COMAR is benchmarked on a huge set of non-redundant proteins (1716), taking random noise into consideration; this makes our computation the largest ever performed for the task at hand. We can observe the effects of introducing random noise on 3D reconstruction and derive some considerations useful for future implementations. The size of the protein set also allows statistical analysis after grouping by SCOP structural class. Altogether, our data indicate that the quality of 3D reconstruction is unaffected by deleting up to an average of 75% of the real contacts, while only a few percent of randomly generated contacts in place of non-contacts is sufficient to hamper 3D reconstruction.
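FT-COMAR itself is not reproduced here, but the noise model described above is easy to sketch: delete a fraction of true contacts and turn a small fraction of non-contacts into spurious ones. The toy Cα-like chain, the 8 Å threshold and the noise fractions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_contact_map(cmap, del_frac=0.75, add_frac=0.01):
    """Delete a fraction of true contacts and add a small fraction of
    spurious contacts, mimicking the benchmark's noise model."""
    noisy = cmap.copy()
    iu = np.triu_indices_from(cmap, k=1)
    contacts = np.array([(i, j) for i, j in zip(*iu) if cmap[i, j]])
    noncontacts = np.array([(i, j) for i, j in zip(*iu) if not cmap[i, j]])
    for i, j in rng.permutation(contacts)[: int(del_frac * len(contacts))]:
        noisy[i, j] = noisy[j, i] = 0
    for i, j in rng.permutation(noncontacts)[: int(add_frac * len(noncontacts))]:
        noisy[i, j] = noisy[j, i] = 1
    return noisy

# Toy contact map from a random 3D chain: residues closer than 8 Å are contacts.
coords = np.cumsum(rng.normal(size=(60, 3)), axis=0) * 2.0
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
cmap = (dist < 8.0).astype(int)
noisy = noisy_contact_map(cmap)
print("true contacts kept:", int((noisy & cmap).sum() / 2))
```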

72 citations


Journal ArticleDOI
TL;DR: This paper presents a Scatter Search aimed at finding biclusters in gene expression data; its fitness function is based on the linear correlation among genes, so that shifting and scaling patterns can be detected, and an improvement method is included in order to select only positively correlated genes.
Abstract: The analysis of data generated by microarray technology is very useful to understand how the genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as merit function, but interesting and relevant patterns from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. However, it is important to discover this type of pattern, since genes commonly present similar behavior although their expression levels vary in different ranges or magnitudes. Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm the proposed fitness function is based on the linear correlation among genes, in order to detect shifting and scaling patterns, and an improvement method is included so as to select only positively correlated genes. The proposed algorithm has been tested on three real datasets, the Yeast Cell Cycle, human B-cell lymphoma and Yeast Stress datasets, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and SAMBA using the Gene Ontology database.
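The paper's Scatter Search machinery is not reproduced here, but its two key ingredients, a correlation-based fitness and a positive-correlation filter, can be sketched as follows; the toy expression matrix and the 0.7 threshold are assumptions.

```python
import numpy as np

def bicluster_fitness(X, genes, conds):
    """Mean absolute pairwise Pearson correlation of the selected genes over
    the selected conditions; unlike Mean Squared Residue, correlation also
    rewards shifting and scaling patterns."""
    corr = np.corrcoef(X[np.ix_(genes, conds)])
    iu = np.triu_indices_from(corr, k=1)
    return np.abs(corr[iu]).mean()

def keep_positively_correlated(X, genes, conds, seed_gene, thresh=0.7):
    """Improvement step: keep only genes positively correlated with a seed gene."""
    ref = X[seed_gene, conds]
    return [g for g in genes if np.corrcoef(X[g, conds], ref)[0, 1] > thresh]

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 40))           # toy expression matrix
genes, conds = list(range(10)), list(range(15))
print("fitness:", bicluster_fitness(X, genes, conds))
print("kept:", keep_positively_correlated(X, genes, conds, seed_gene=0))
```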

57 citations


Journal ArticleDOI
TL;DR: Based on large datasets of Western medicine literature and traditional Chinese medicine literature, and by applying a data slicing algorithm in text mining, some simple and meaningful networks are retrieved, supporting the positive answer that a biological basis, in the form of shared networks, exists in both RA and CHD.
Abstract: One important concept in traditional Chinese medicine (TCM) is "treating different diseases with the same therapy". In TCM practice, some patients with Rheumatoid Arthritis (RA) and some patients with Coronary Heart Disease (CHD) can be treated with similar therapies. This suggests that there might be something common to RA and CHD, for example shared biological networks or a shared biological basis. As the amount of biomedical data in leading databases (e.g., PubMed, SinoMed) is growing at an exponential rate, it might be possible to extract something interesting and meaningful with the techniques developed in data mining. Based on large datasets of Western medicine literature (PubMed) and traditional Chinese medicine literature (SinoMed), and by applying a data slicing algorithm in text mining, we retrieved some simple and meaningful networks. The Chinese herbs used in the treatment of both RA and CHD might affect the networks common to RA and CHD. This might support the TCM concept of treating different diseases with the same therapy. First, the data mining results suggest a positive answer: there is a biological basis, in the form of networks, common to both RA and CHD. Second, there are basic Chinese herbs used in the treatment of both RA and CHD. Third, these common networks might be affected by the basic Chinese herbs. Fourth, discrete derivative, the data slicing algorithm, is feasible for mining useful data from the literature of PubMed and SinoMed.

46 citations


Journal ArticleDOI
TL;DR: Investigation of the use of several machine learning techniques to classify breast cancer patients using one such signature, the well-established 70-gene signature, concludes that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.
Abstract: The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one such signature, the well-established 70-gene signature. We show that Genetic Programming performs significantly better than Support Vector Machines, Multilayered Perceptrons and Random Forests in classifying patients from the NKI breast cancer dataset, and comparably to the scoring-based method originally proposed by the authors of the 70-gene signature. Furthermore, Genetic Programming is able to perform an automatic feature selection. Since the performance of Genetic Programming can likely be improved beyond the out-of-the-box approach used here, and given the biological insight potentially provided by the Genetic Programming solutions, we conclude that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.
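As a hedged sketch of the comparison protocol (not the paper's actual experiments), the snippet below cross-validates the three baseline classifiers named above on synthetic data standing in for the 70-gene NKI profiles; a Genetic Programming classifier (for example from the third-party gplearn package) would slot into the same loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 70-gene expression profiles with binary outcomes.
X, y = make_classification(n_samples=300, n_features=70, n_informative=20,
                           random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(max_iter=2000, random_state=0)),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```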

46 citations


Journal ArticleDOI
TL;DR: A package for the R statistical language to implement the Multifactor Dimensionality Reduction (MDR) method for nonparametric variable selection of interactions is introduced, designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research.
Abstract: A wealth of high-dimensional data is now available with unprecedented numbers of genetic markers, and data-mining approaches to variable selection are increasingly being utilized to uncover associations, including potential gene-gene and gene-environment interactions. One of the most commonly used data-mining methods for case-control data is Multifactor Dimensionality Reduction (MDR), which has displayed success in both simulations and real data applications. Additional software applications in alternative programming languages can improve the availability and usefulness of the method for a broader range of users. We introduce a package for the R statistical language to implement the Multifactor Dimensionality Reduction (MDR) method for nonparametric variable selection of interactions. This package is designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research. The 'MDR' package is freely available online at http://www.r-project.org/ . We also provide data examples to illustrate the use and functionality of the package. MDR is a frequently-used data-mining method to identify potential gene-gene interactions, and alternative implementations will further increase this usage. We introduce a flexible software package for R users.
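The package described here is written in R; purely to illustrate the core MDR step it implements, here is a hedged Python sketch: each two-locus genotype cell is labeled high- or low-risk by its case:control ratio, and the resulting one-dimensional classifier is scored. Real MDR adds cross-validation and permutation testing; the toy data below are random.

```python
import numpy as np
from itertools import combinations

def mdr_accuracy(geno, status, pair):
    """Score one SNP pair: pool 3x3 genotype cells into high/low risk by the
    cell's case:control ratio, then compute classification accuracy."""
    i, j = pair
    threshold = status.mean() / (1 - status.mean())  # overall case:control ratio
    correct = 0
    for a in range(3):
        for b in range(3):
            cell = (geno[:, i] == a) & (geno[:, j] == b)
            cases = int(status[cell].sum())
            controls = int(cell.sum()) - cases
            high_risk = cases > threshold * controls
            correct += cases if high_risk else controls
    return correct / len(status)

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(500, 20))   # toy genotypes: 0/1/2 minor alleles
status = rng.integers(0, 2, size=500)       # toy case/control labels
best = max(combinations(range(20), 2), key=lambda p: mdr_accuracy(geno, status, p))
print("best pair:", best, "accuracy:", round(mdr_accuracy(geno, status, best), 3))
```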

27 citations


Journal ArticleDOI
TL;DR: This model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size and will allow the capabilities of novel methods to be tested without pre-specified genetic models.
Abstract: Background: A goal of human genetics is to discover genetic factors that influence individuals’ susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Results: Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and of pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. Conclusions: This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
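A compact, hedged sketch of the idea (not the authors' implementation): penetrance tables over three SNPs are evolved with a (1 + 1) strategy so that the variance in penetrance explained by one- and two-locus marginals shrinks while the full three-locus effect stays large. HWE genotype frequencies at minor allele frequency 0.5 and the penalty weight are assumptions.

```python
import numpy as np
from itertools import combinations

G = np.array([0.25, 0.5, 0.25])     # HWE genotype frequencies (MAF = 0.5)

def var_explained(P, kept):
    """Penetrance variance explained by the genotypes at the `kept` loci."""
    Q = P
    for a in sorted(set(range(3)) - set(kept), reverse=True):
        Q = np.tensordot(Q, G, axes=([a], [0]))   # marginalize the other loci
    w = np.array(1.0)
    for _ in kept:
        w = np.multiply.outer(w, G)               # genotype weights on kept loci
    mean = (w * Q).sum()
    return (w * (Q - mean) ** 2).sum()

def fitness(P):
    """Reward pure three-locus effects, penalize one- and two-locus effects."""
    low = sum(var_explained(P, s) for k in (1, 2)
              for s in combinations(range(3), k))
    return var_explained(P, (0, 1, 2)) - 10.0 * low

rng = np.random.default_rng(0)
P = rng.random((3, 3, 3))                         # penetrance per genotype combo
f = fitness(P)
for _ in range(2000):                             # (1 + 1) evolution strategy
    child = np.clip(P + rng.normal(scale=0.05, size=P.shape), 0.0, 1.0)
    fc = fitness(child)
    if fc > f:
        P, f = child, fc
print("final fitness:", round(f, 4))
```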

25 citations


Journal ArticleDOI
TL;DR: This analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat by combining structural and sequence-based information in classifier ensembles.
Abstract: Maturation inhibitors such as Bevirimat are a new class of antiretroviral drugs that hamper the cleavage of HIV-1 proteins into their functional active forms. They bind to these preproteins and inhibit their cleavage by the HIV-1 protease, resulting in non-functional virus particles. Nevertheless, there exist mutations in this region leading to resistance against Bevirimat. Highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from these new drugs. We tested several methods to improve Bevirimat resistance prediction in HIV-1. It turned out that combining structural and sequence-based information in classifier ensembles led to accurate and reliable predictions. Moreover, we were able to identify the most crucial regions for Bevirimat resistance computationally, which are in line with experimental results from other studies. Our analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat. New maturation inhibitors are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Thus, accurate prediction tools are very useful to enable a personalized therapy.
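As a hedged illustration of combining structural and sequence-based information in an ensemble (the features, labels and models below are toy assumptions, not the study's data or classifiers), one can train one classifier per feature view and average their predicted probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
n = 200
X_seq = rng.integers(0, 2, size=(n, 30)).astype(float)  # toy mutation indicators
X_struct = rng.normal(size=(n, 10))                     # toy structural descriptors
X = np.hstack([X_seq, X_struct])
y = rng.integers(0, 2, size=n)           # toy resistant/susceptible labels

seq_view = FunctionTransformer(lambda Z: Z[:, :30])     # sequence columns
struct_view = FunctionTransformer(lambda Z: Z[:, 30:])  # structure columns

ensemble = VotingClassifier(
    estimators=[
        ("seq", make_pipeline(seq_view, LogisticRegression(max_iter=1000))),
        ("struct", make_pipeline(struct_view, StandardScaler(),
                                 RandomForestClassifier(random_state=0))),
    ],
    voting="soft",                       # average the two views' probabilities
)
# Labels are random here, so scores hover near chance; the point is the wiring.
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```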

24 citations


Journal ArticleDOI
TL;DR: The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and it will in many cases improve their performance in classification and regression.
Abstract: Background: Most machine learning techniques currently applied in the literature need a fixed dimensionality of input data. However, this requirement is frequently violated by real input data, such as DNA and protein sequences, that often differ in length due to insertions and deletions. It is also notable that performance in classification and regression is often improved by numerical encoding of amino acids, compared to the commonly used sparse encoding.
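Interpol is an R package; the sketch below illustrates the general idea in Python under stated assumptions (a Kyte-Doolittle-style hydropathy encoding restricted to a few residues for brevity, and linear interpolation to a fixed length):

```python
import numpy as np

# Hydropathy values for a subset of residues (Kyte-Doolittle scale, abridged).
HYDROPATHY = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
              "G": -0.4, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9}

def encode_fixed_length(seq, target_len=50):
    """Numerically encode a peptide, then interpolate to a fixed dimensionality
    so sequences of different lengths become comparable feature vectors."""
    values = np.array([HYDROPATHY[aa] for aa in seq])
    x_old = np.linspace(0.0, 1.0, num=len(values))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(x_new, x_old, values)

short = encode_fixed_length("ARNDCGILKM")
long_ = encode_fixed_length("ARNDCGILKMARNDCGILKMARND")
print(short.shape, long_.shape)   # both (50,), despite different input lengths
```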

Journal ArticleDOI
TL;DR: This paper demonstrates the discovery of a set of internal control probes that have log ratios theoretically equal to zero according to this DMH protocol and proposes two LOESS (or LOWESS, locally weighted scatter-plot smoothing) normalization methods that are novel and unique for DMH microarray data.
Abstract: DNA methylation plays a very important role in the silencing of tumor suppressor genes in various tumor types. In order to gain a genome-wide understanding of how changes in methylation affect tumor growth, the differential methylation hybridization (DMH) protocol has been developed and large amounts of DMH microarray data have been generated. However, it is still unclear how to preprocess this type of microarray data and how the different background correction and normalization methods used for two-color gene expression arrays perform for methylation microarray data. In this paper, we demonstrate our discovery of a set of internal control probes that have log ratios (M) theoretically equal to zero according to this DMH protocol. With the aid of this set of control probes, we propose two LOESS (or LOWESS, locally weighted scatter-plot smoothing) normalization methods that are novel and unique for DMH microarray data. Combined with two other approaches (global LOESS and no normalization), this gives four normalization methods to compare. In addition, we compare five different background correction methods. We thus study 20 different preprocessing methods, the combinations of five background correction methods and four normalization methods. In order to compare these 20 methods, we evaluate their performance in identifying known methylated and unmethylated housekeeping genes based on two statistics. Comparison details are illustrated using breast cancer cell line and ovarian cancer patient methylation microarray data. Our comparison results show that the different background correction methods perform similarly; however, the four normalization methods perform very differently. In particular, all three LOESS normalization methods perform better than no normalization at all. It is necessary to do within-array normalization, and the two LOESS normalization methods based on specific DMH internal control probes produce more stable and relatively better results than the global LOESS normalization method.
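A minimal sketch of control-probe-based LOESS normalization, assuming toy M (log-ratio) and A (average log-intensity) values and a known set of internal control probes whose M is theoretically zero; the lowess smoother comes from the statsmodels package, and the span (frac) is an arbitrary choice:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
n = 5000
A = rng.uniform(6, 14, size=n)                        # average log intensity
M = 0.2 * np.sin(A) + rng.normal(scale=0.3, size=n)   # log ratio with intensity bias
is_control = rng.random(n) < 0.05                     # internal control probes

# Fit the intensity-dependent bias curve on control probes only, then subtract
# the fitted trend from every probe on the array.
trend = lowess(M[is_control], A[is_control], frac=0.4)  # sorted (A, fit) pairs
M_norm = M - np.interp(A, trend[:, 0], trend[:, 1])
print("mean M before/after:", round(M.mean(), 4), round(M_norm.mean(), 4))
```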

Journal ArticleDOI
TL;DR: Higher levels of LD begin to confound the MDR algorithm and lead to a drop in sensitivity with respect to the identification of a direct association; it does not, however, affect the ability to detect indirect association.
Abstract: In the analysis of large-scale genomic datasets, an important consideration is the power of analytical methods to identify accurate predictive models of disease. When trying to assess sensitivity from such analytical methods, a confounding factor up to this point has been the presence of linkage disequilibrium (LD). In this study, we examined the effect of LD on the sensitivity of the Multifactor Dimensionality Reduction (MDR) software package. Four relative amounts of LD were simulated in multiple one- and two-locus scenarios for which the position of the functional SNP(s) within LD blocks varied. Simulated data was analyzed with MDR to determine the sensitivity of the method in different contexts, where the sensitivity of the method was gauged as the number of times out of 100 that the method identifies the correct one- or two-locus model as the best overall model. As the amount of LD increases, the sensitivity of MDR to detect the correct functional SNP drops but the sensitivity to detect the disease signal and find an indirect association increases. Higher levels of LD begin to confound the MDR algorithm and lead to a drop in sensitivity with respect to the identification of a direct association; it does not, however, affect the ability to detect indirect association. Careful examination of the solution models generated by MDR reveals that MDR can identify loci in the correct LD block; though it is not always the functional SNP. As such, the results of MDR analysis in datasets with LD should be carefully examined to consider the underlying LD structure of the dataset.

Journal ArticleDOI
TL;DR: The current analyses substantiate the utility of rule based classifiers such as RIPPER, RIDOR and PART for the detection of gene-gene/gene-environment interactions in genetic association studies and provide an advantage in being able to handle both categorical and continuous variable types.
Abstract: Several methods have been presented for the analysis of complex interactions between genetic polymorphisms and/or environmental factors. Despite the available methods, there is still a need for alternative methods, because no single method will perform well in all scenarios. The aim of this work was to evaluate the performance of three selected rule-based classifier algorithms, RIPPER, RIDOR and PART, for the analysis of genetic association studies. Overall, 42 datasets were simulated with three different case-control models, a varying number of subjects (300, 600), SNPs (500, 1500, 3000) and noise (5%, 10%, 20%). The algorithms were applied to each of the datasets with a set of algorithm-specific settings. Results were further investigated with respect to a) the Model, b) the Rules, and c) the Attribute level. Data analysis was performed using WEKA, SAS and PERL. The RIPPER algorithm discovered the true case-control model at least once in >33% of the datasets. The RIDOR and PART algorithms performed poorly for model detection. The RIPPER, RIDOR and PART algorithms discovered the true case-control rules in more than 83%, 83% and 44% of the datasets, respectively. All three algorithms were able to detect the attributes utilized in the respective case-control models in most datasets. The current analyses substantiate the utility of rule-based classifiers such as RIPPER, RIDOR and PART for the detection of gene-gene/gene-environment interactions in genetic association studies. These classifiers could provide a valuable new method, complementing existing approaches, in the analysis of genetic association studies. The methods provide an advantage in being able to handle both categorical and continuous variable types. Further, because the outputs of the analyses are easy to interpret, the rule-based classifier approach could quickly generate testable hypotheses for additional evaluation. Since the algorithms are computationally inexpensive, they may serve as valuable tools for preselection of attributes to be used in more complex, computationally intensive approaches. Whether used in isolation or in conjunction with other tools, rule-based classifiers are an important addition to the armamentarium of tools available for analyses of complex genetic association studies.
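RIPPER, RIDOR and PART were run here through WEKA; as a rough Python analogue of rule-based output (an assumption, not the study's tooling), a shallow decision tree fitted to toy SNP data can be printed as human-readable rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(600, 20))             # toy SNP genotypes
# Planted two-locus interaction plus 10% label noise.
status = ((geno[:, 2] > 0) & (geno[:, 7] > 0)).astype(int)
flip = rng.random(600) < 0.10
status[flip] = 1 - status[flip]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(geno, status)
print(export_text(clf, feature_names=[f"SNP{i}" for i in range(20)]))
```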

Journal ArticleDOI
TL;DR: A revival of the notion of disease network is called for, and it is recalled how superimposing layers of clinical data and biological information to such networks may help identify novel disease genes.
Abstract: Over the last ten years, genome-wide association studies (GWAS) have reported over 4000 single nucleotide polymorphisms associated with more than 200 traits. Despite providing us with a slightly better understanding of the genetic architecture of common diseases, generating avalanches of new hypotheses, and fostering timid progress in pharmacogenomics, genetic association studies haven't yet revolutionized clinical practice. Hence, although such studies are still published at a remarkable pace, the notion of 'post-GWAS' functional characterization of risk loci is gradually gaining in popularity. Indeed, deciphering the function of disease-associated genetic variants is likely to get us closer to achieving an understanding of disease architecture that will ultimately be translatable into clinical applications. Despite this gradual change in research priorities, the field of medical genomics remains fairly conservative: the 'single gene single disease' paradigm largely prevails, to the detriment of the avant-garde notion of the 'diseasome' and of the human disease network (HDN) in particular, and attempts to truly integrate clinical information (e.g., age at onset or reduction in life span) and molecular data are scarce. Here we call for a revival of the notion of disease network, and recall how superimposing layers of clinical data and biological information on such networks may help identify novel disease genes.

Journal ArticleDOI
TL;DR: The results of the animation-assisted detection of changes in gene regulatory patterns make predictions about the potential roles of Hsp90 and its co-chaperone p23 in regulating whole sets of genes.
Abstract: To make sense out of gene expression profiles, such analyses must be pushed beyond the mere listing of affected genes. For example, if a group of genes persistently displays similar changes in expression levels under particular experimental conditions, and the proteins encoded by these genes interact and function in the same cellular compartments, this could be taken as a very strong indicator of co-regulated protein complexes. One of the key requirements is having appropriate tools to detect such regulatory patterns. We have analyzed the global adaptations in gene expression patterns in the budding yeast when the Hsp90 molecular chaperone complex is perturbed either pharmacologically or genetically. We integrated these results with publicly accessible expression, protein-protein interaction and intracellular localization data. But most importantly, all experimental conditions were simultaneously and dynamically visualized with an animation. This critically facilitated the detection of patterns of gene expression changes that suggested underlying regulatory networks that a standard analysis by pairwise comparison and clustering could not have revealed. The results of the animation-assisted detection of changes in gene regulatory patterns make predictions about the potential roles of Hsp90 and its co-chaperone p23 in regulating whole sets of genes. The simultaneous dynamic visualization of microarray experiments, represented in networks built by integrating one's own experimental with publicly accessible data, represents a powerful discovery tool that allows the generation of new interpretations and hypotheses.

Journal ArticleDOI
TL;DR: Analysis of three copy number calling programs and quantitative PCR showed Birdsuite to have the greatest agreement with quantitative PCR.
Abstract: Copy number variants are >1 kb genomic amplifications or deletions that can be identified using array platforms. However, arrays produce substantial background noise that contributes to high false discovery rates of variants. We hypothesized that quantitative PCR could definitively determine copy number and assess the validity of calling algorithms. Using data from 29 Affymetrix SNP 6.0 arrays, we determined copy numbers using three programs: Partek Genomics Suite, Affymetrix Genotyping Console 2.0 and Birdsuite. We compared array calls at 25 chromosomal regions to those determined by qPCR and found nearly identical calls in regions of copy number 2. Conversely, agreement differed in regions called variant by at least one method. The highest overall agreement in calls, 91%, was between Birdsuite and quantitative PCR. Partek Genomics Suite calls agreed with quantitative PCR 76% of the time, while the agreement of Affymetrix Genotyping Console 2.0 with quantitative PCR was 79%. In 38 independent samples, 96% of Birdsuite calls agreed with quantitative PCR. Analysis of three copy number calling programs and quantitative PCR showed Birdsuite to have the greatest agreement with quantitative PCR.

Journal ArticleDOI
TL;DR: An evolutionary model based on hill-climbing genetic operators is proposed for protein structure prediction in the hydrophobic - polar (HP) model and the emerging consolidated model is compared to relevant algorithms from the literature for a set of difficult bidimensional instances from lattice protein models.
Abstract: Proteins are complex structures made of amino acids having a fundamental role in the correct functioning of living cells. The structure of a protein is the result of the protein folding process. However, the general principles that govern the folding of natural proteins into a native structure are unknown. The problem of predicting a protein structure with minimum energy starting from the unfolded amino acid sequence is a highly complex and important task in molecular and computational biology. Protein structure prediction has important applications in fields such as drug design and disease prediction. The protein structure prediction problem is NP-hard even in simplified lattice protein models. An evolutionary model based on hill-climbing genetic operators is proposed for protein structure prediction in the hydrophobic-polar (HP) model. Problem-specific search operators are implemented and applied using a steepest-ascent hill-climbing approach. Furthermore, the proposed model enforces an explicit diversification stage during the evolution in order to avoid local optima. The main features of the resulting evolutionary algorithm - the hill-climbing mechanism and the diversification strategy - are evaluated in a set of numerical experiments for the protein structure prediction problem to assess their impact on the efficiency of the search process. Furthermore, the emerging consolidated model is compared to relevant algorithms from the literature for a set of difficult bidimensional instances from lattice protein models. The results obtained by the proposed algorithm are promising and competitive with those of related methods.
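A minimal sketch of the 2D HP lattice model underlying the search: a conformation is a self-avoiding walk on the square lattice, and the energy to minimize is minus the number of non-consecutive H-H contacts. The toy sequence and fold are assumptions; the paper's hill-climbing operators and diversification stage are not reproduced.

```python
def hp_energy(sequence, moves):
    """sequence: string of 'H'/'P'; moves: lattice steps ('U','D','L','R').
    Returns None for non-self-avoiding walks, else minus the H-H contact count."""
    steps = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
    pos = [(0, 0)]
    for m in moves:
        dx, dy = steps[m]
        pos.append((pos[-1][0] + dx, pos[-1][1] + dy))
    if len(set(pos)) != len(pos):
        return None                        # overlapping residues: invalid fold
    index = {p: i for i, p in enumerate(pos)}
    energy = 0
    for i, p in enumerate(pos):
        if sequence[i] != "H":
            continue
        for dx, dy in steps.values():
            j = index.get((p[0] + dx, p[1] + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1                # non-consecutive H-H lattice contact
    return energy

print(hp_energy("HHPHH", "RDLL"))          # -1: one H-H contact in this toy fold
```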

Journal ArticleDOI
TL;DR: The AGEP method is a widely applicable method for the rapid comprehensive interpretation of microarray data, as proven here by the definition of tissue- and disease-specific changes in gene expression as well as during cellular differentiation.
Abstract: Gene expression microarray data have been organized and made available as public databases, but the utilization of such highly heterogeneous reference datasets in the interpretation of data from individual test samples is not as developed as e.g. in the field of nucleotide sequence comparisons. We have created a rapid and powerful approach for the alignment of microarray gene expression profiles (AGEP) from test samples with those contained in a large annotated public reference database and demonstrate here how this can facilitate interpretation of microarray data from individual samples. AGEP is based on the calculation of kernel density distributions for the levels of expression of each gene in each reference tissue type and provides a quantitation of the similarity between the test sample and the reference tissue types as well as the identity of the typical and atypical genes in each comparison. As a reference database, we used 1654 samples from 44 normal tissues (extracted from the Genesapiens database). Using leave-one-out validation, AGEP correctly defined the tissue of origin for 1521 (93.6%) of all the 1654 samples in the original database. Independent validation of 195 external normal tissue samples resulted in 87% accuracy for the exact tissue type and 97% accuracy with related tissue types. AGEP analysis of 10 Duchenne muscular dystrophy (DMD) samples provided quantitative description of the key pathogenetic events, such as the extent of inflammation, in individual samples and pinpointed tissue-specific genes whose expression changed (SAMD4A) in DMD. AGEP analysis of microarray data from adipocytic differentiation of mesenchymal stem cells and from normal myeloid cell types and leukemias provided quantitative characterization of the transcriptomic changes during normal and abnormal cell differentiation. The AGEP method is a widely applicable method for the rapid comprehensive interpretation of microarray data, as proven here by the definition of tissue- and disease-specific changes in gene expression as well as during cellular differentiation. The capability to quantitatively compare data from individual samples against a large-scale annotated reference database represents a widely applicable paradigm for the analysis of all types of high-throughput data. AGEP enables systematic and quantitative comparison of gene expression data from test samples against a comprehensive collection of different cell/tissue types previously studied by the entire research community.
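AGEP's core ingredient, a kernel density estimate per gene and per reference tissue, can be sketched as follows; the toy reference profiles and the mean log-density score are illustrative assumptions, not the published scoring.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n_genes, tissues = 100, ("liver", "brain")
# Toy reference database: 30 samples per tissue, tissue-specific gene means.
reference = {t: rng.normal(loc=rng.normal(size=n_genes), scale=1.0,
                           size=(30, n_genes)) for t in tissues}
# One kernel density estimate per gene and per reference tissue type.
kdes = {t: [gaussian_kde(reference[t][:, g]) for g in range(n_genes)]
        for t in tissues}

def similarity(sample, tissue):
    """Mean log-density of the sample's per-gene expression under the
    tissue-specific density estimates (higher = more similar)."""
    dens = [kdes[tissue][g](sample[g])[0] for g in range(n_genes)]
    return float(np.mean(np.log(np.clip(dens, 1e-12, None))))

test = reference["liver"][0]     # pretend this is a new test sample
for t in tissues:
    print(t, round(similarity(test, t), 3))
```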

Journal ArticleDOI
TL;DR: Computational genomic and proteomic analysis combined with predictive functional analysis represents an alternative to the more traditional antagonism-test methods of drug discovery for the rapid identification of new putative bacteriocins as well as new potential antimicrobial drugs.
Abstract: In order to characterise new bacteriocins produced by Streptococcus mutans, we performed a complete bioinformatic analysis by scanning the genome sequences of strains UA159 and NN2025. By searching the genomic context adjacent to two-component signal transduction systems, we predicted the existence of many putative new bacteriocin maturation pathways, some of which are exclusive to a group of Streptococcus. Computational genomic and proteomic analysis combined with predictive functional analysis represents an alternative to the more traditional antagonism-test methods of drug discovery for the rapid identification of new putative bacteriocins as well as new potential antimicrobial drugs.

Journal ArticleDOI
TL;DR: A novel index, the Normalized Tree Index (NTI), is proposed, which allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.
Abstract: Measurements on the gene level are widely used to gain new insights into complex diseases, e.g. cancer. A promising approach to understanding basic biological mechanisms is to combine gene expression profiles and classical clinical parameters. However, the computation of a correlation coefficient between high-dimensional data and such parameters is not covered by traditional statistical methods. We propose a novel index, the Normalized Tree Index (NTI), to compute a correlation coefficient between the clustering result of high-dimensional microarray data and nominal clinical parameters. The NTI detects correlations between hierarchically clustered microarray data and nominal clinical parameters (labels) and gives a measurement of significance in terms of an empirical p-value for the identified correlations. Therefore, the microarray data is clustered by hierarchical agglomerative clustering using standard settings. In a second step, the computed cluster tree is evaluated. For each label, an NTI is computed measuring the correlation between that label and the clustered microarray data. The NTI successfully identifies correlated clinical parameters at different levels of significance when applied to two real-world microarray breast cancer datasets. Some of the identified highly correlated labels confirm the actual state of knowledge, whereas others help to identify new risk factors and provide a good basis for formulating new hypotheses. The NTI is a valuable tool in the domain of biomedical data analysis. It allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.
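The NTI evaluates the full cluster tree; as a simplified, hedged analogue, the sketch below cuts a hierarchical clustering into flat clusters, measures agreement with a nominal label by majority purity, and derives an empirical p-value by label permutation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)),
               rng.normal(1.5, 1.0, (30, 50))])     # toy expression data
labels = np.array([0] * 30 + [1] * 30)              # nominal clinical parameter

Z = linkage(X, method="average")                    # hierarchical clustering
clusters = fcluster(Z, t=2, criterion="maxclust")   # flat cut (simplification)

def purity(clusters, labels):
    """Fraction of samples matching their cluster's majority label."""
    return sum(np.bincount(labels[clusters == c]).max()
               for c in np.unique(clusters)) / len(labels)

observed = purity(clusters, labels)
# Empirical p-value: how often do permuted labels agree at least as well?
perms = [purity(clusters, rng.permutation(labels)) for _ in range(1000)]
p_value = (np.sum(np.array(perms) >= observed) + 1) / (len(perms) + 1)
print(f"purity = {observed:.2f}, empirical p = {p_value:.4f}")
```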

Journal ArticleDOI
TL;DR: Canalization is introduced here as an evolutionary force in biological systems and as a way to facilitate the selection of data mining methods.
Abstract: A common challenge of identifying meaningful patterns in high-dimensional biological data is the complexity of the relationship between genotype and phenotype. Complexity arises as a result of many environmental, genetic, genomic, metabolic and proteomic factors interacting in a nonlinear manner through time and space to influence variability in biological traits and processes. The assumptions we make about this complexity greatly influence the analytical methods we choose for data mining and, in turn, our results and inferences. For example, linear discriminant analysis assumes a linear additive relationship among the variables or attributes, while support vector machines or neural networks can model nonlinear relationships. Regardless, it is a useful exercise to think about where biological complexity comes from as a way to facilitate the selection of data mining methods. One important theory is that evolution has shaped the complexity of biological systems. More specifically, we introduce here canalization as an evolutionary force in biological systems.

Journal ArticleDOI
TL;DR: An innovative Evolutionary Algorithm method is proposed to search for the best graphical representation of unresolved trees, in order to give a biological meaning to the vertical order of taxa; the evolved trees show an improvement both in fitness and in biological interpretation.
Abstract: Background: In a typical “left-to-right” phylogenetic tree, the vertical order of taxa is meaningless, as only the branch path between them reflects their degree of similarity. To make unresolved trees more informative, here we propose an innovative Evolutionary Algorithm (EA) method to search for the best graphical representation of unresolved trees, in order to give a biological meaning to the vertical order of taxa. Methods: Starting from a West Nile virus phylogenetic tree, in a (1 + 1)-EA we evolved it by randomly rotating the internal nodes and selecting the tree with the better fitness in every generation. The fitness is the sum of genetic distances between each taxon and the r (radius) next taxa. After setting the radius to its best-performing value, we evolved the trees with (λ + μ)-EAs to study the influence of population size on the algorithm. Results: The (1 + 1)-EA consistently outperformed a random search, and better results were obtained setting the radius to 8. The (λ + μ)-EAs performed as well as the (1 + 1)-EA, except for the largest population (1000 + 1000). Conclusions: The trees after the evolution showed an improvement both in fitness (based on a genetic distance matrix, so close taxa are actually genetically close) and in biological interpretation. Samples collected in the same state or year moved closer to each other, making the tree easier to interpret. Biological relationships between samples are also easier to observe.
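A compact sketch of the (1 + 1)-EA described above, under simplifying assumptions: the tree is a small nested tuple, mutation randomly swaps children of internal nodes (which preserves the topology), and the fitness sums genetic distances between each taxon and its next `radius` taxa in the vertical order. The toy tree and distance matrix are assumptions.

```python
import random

tree = (((0, 1), 2), ((3, 4), (5, 6)))   # toy tree; leaves index into D

def leaves(t):
    return [t] if isinstance(t, int) else [x for c in t for x in leaves(c)]

def rotate(t, p=0.3):
    """Randomly swap children of internal nodes; topology is unchanged."""
    if isinstance(t, int):
        return t
    children = [rotate(c, p) for c in t]
    if random.random() < p:
        children.reverse()
    return tuple(children)

def fitness(order, D, radius=2):
    """Sum of distances between each taxon and its `radius` successors; lower
    means genetically close taxa sit vertically close in the drawing."""
    return sum(D[order[i]][order[i + k]] for i in range(len(order))
               for k in range(1, radius + 1) if i + k < len(order))

random.seed(0)
n = 7
D = [[0.0] * n for _ in range(n)]        # toy symmetric distance matrix
for i in range(n):
    for j in range(i + 1, n):
        D[i][j] = D[j][i] = random.random()

best = tree                               # (1 + 1)-EA over branch rotations
for _ in range(500):
    child = rotate(best)
    if fitness(leaves(child), D) < fitness(leaves(best), D):
        best = child
print("order:", leaves(best), "fitness:", round(fitness(leaves(best), D), 3))
```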

Journal ArticleDOI
TL;DR: It is suggested that, amid genetic deserts and genetic islands, there is more to explore than the coding regions of the genome, highlighting the importance and the necessity of designing efficient methods to mine beyond the exome.
Abstract: In the late 18th century, Erasmus Darwin, Charles Darwin's grandfather, advocated evolutionary theory as a means to "unravel the theory of disease". More than 200 years later, although Darwinian medicine is regaining some ground after having been muzzled during the second half of the 20th century, genomics has largely outcompeted evolution and has acquired a dictatorial success as a tool for studying disease etiology. From an evolution-inspired perspective, we have gradually drifted into the habit of focusing primarily on genomic data from sources such as genome-wide association studies (GWAS). As a result, understanding the how and why of human diseases and pathobiology has largely become a matter of crunching DNA sequences. Despite the popularity of GWAS, their reality remains unchanged: most of the susceptibility loci they identify explain only a small fraction of the heritability of complex diseases. A number of reasons for the so-called "missing heritability" have been proposed, and our goal is not to review them all. Here we primarily reiterate that there is more to discover than non-synonymous point mutations, and suggest that amid genetic deserts and genetic islands, there is also more to explore than the coding regions of the genome. We then highlight the importance and the necessity of designing efficient methods to mine beyond the exome.

Journal ArticleDOI
TL;DR: Probabilistic timeboxes are proposed, corresponding to a specific class of Hidden Markov Models, an established method in data mining; they enable users without expert knowledge to specify fairly complex statistical models with ease.
Abstract: Timeboxes are graphical user interface widgets that were proposed to specify queries on time course data. As queries can be defined very easily, an exploratory analysis of time course data is greatly facilitated. While timeboxes are effective, they have no provisions for dealing with noisy data or data with fluctuations along the time axis, which is very common in many applications. In particular, this is true for the analysis of gene expression time courses, which are mostly derived from noisy microarray measurements at few unevenly sampled time points. From a data mining point of view the robust handling of data through a sound statistical model is of great importance. We propose probabilistic timeboxes, which correspond to a specific class of Hidden Markov Models (HMMs), an established method in data mining. Since HMMs are a particular class of probabilistic graphical models, we call our method the Probabilistic Graphical Query Language. Its implementation was realized in the free software package pGQL. We evaluate its effectiveness in exploratory analysis on a yeast sporulation dataset. We introduce a new approach to defining dynamic, statistical queries on time course data. It supports an interactive exploration of reasonably large amounts of data and enables users without expert knowledge to specify fairly complex statistical models with ease. By its statistical nature, our approach is more expressive and more robust to amplitude and frequency fluctuations than the prior, deterministic timeboxes.
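pGQL defines its queries graphically; as a hedged illustration of the underlying machinery, the sketch below builds a left-to-right three-state Gaussian HMM (standing in for a "low, then high, then low" timebox query) with the third-party hmmlearn package and scores toy time courses against it. The state means, transition probabilities and data are all assumptions.

```python
import numpy as np
from hmmlearn import hmm

# Left-to-right HMM: state 0 = low, state 1 = high, state 2 = low again.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag")
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([[0.7, 0.3, 0.0],
                            [0.0, 0.7, 0.3],
                            [0.0, 0.0, 1.0]])
model.means_ = np.array([[-1.0], [1.5], [-1.0]])
model.covars_ = np.array([[0.5], [0.5], [0.5]])

rng = np.random.default_rng(0)
matching = np.concatenate([rng.normal(-1.0, 0.5, 5),
                           rng.normal(1.5, 0.5, 5),
                           rng.normal(-1.0, 0.5, 5)])
flat = rng.normal(0.0, 0.5, 15)
for name, series in (("matching", matching), ("flat", flat)):
    # Higher log-likelihood = better match to the probabilistic timebox.
    print(name, round(model.score(series.reshape(-1, 1)), 2))
```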

Journal ArticleDOI
TL;DR: A number of features characteristic of biological data, including high levels of measurement variability and correlation between variables, represent an additional challenge and call for specific methods.
Abstract: The beginning of the 21st century has witnessed the generation of spectacular amounts of new information, ranging from marketing data to genomic sequences. As traditional statistical methods are gradually being defeated by both the amount of data and the general absence of underlying hypotheses, data mining procedures are becoming increasingly popular and user-friendly. By combining statistical, artificial intelligence and database management tools, these methods are tailored for processing large quantities of information and extracting interesting patterns. Since their first application, data mining procedures have progressively been tweaked to accommodate various types of information, including social science and biological data. However, a number of features characteristic of biological data, including high levels of measurement variability and correlation between variables, represent an additional challenge and call for specific methods. The goal of this editorial is to highlight the spatial dimension of biological data mining.