
Showing papers in "Briefings in Bioinformatics in 2008"


Journal ArticleDOI
TL;DR: The motivation, design principles and priorities that have shaped the development of MEGA are discussed, as well as how MEGA might evolve in the future to assist researchers in their growing need to analyze large data sets using new computational methods.
Abstract: The Molecular Evolutionary Genetics Analysis (MEGA) software is a desktop application designed for comparative analysis of homologous gene sequences either from multigene families or from different species, with a special emphasis on inferring evolutionary relationships and patterns of DNA and protein evolution. In addition to the tools for statistical analysis of data, MEGA provides many convenient facilities for the assembly of sequence data sets from files or web-based repositories, and it includes tools for visual presentation of the results obtained in the form of interactive phylogenetic trees and evolutionary distance matrices. Here we discuss the motivation, design principles and priorities that have shaped the development of MEGA. We also discuss how MEGA might evolve in the future to assist researchers in their growing need to analyze large data sets using new computational methods.

3,290 citations


Journal ArticleDOI
TL;DR: The initial version of the MAFFT program was developed in 2002; in 2007 it was updated with two new techniques, the PartTree algorithm and the Four-way consistency objective function, which improved the scalability of progressive alignment and the accuracy of ncRNA alignment, respectively.
Abstract: The accuracy and scalability of multiple sequence alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive alignment and the latter improved the accuracy of ncRNA alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

3,278 citations
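For orientation, a minimal sketch of driving MAFFT from Python through its command line is shown below. It assumes the mafft executable distributed at the URL above is installed and on the PATH; the file names are illustrative only.

# Minimal sketch: running MAFFT from Python via its command-line interface.
# Assumes the `mafft` executable is installed and on PATH, and that
# `sequences.fasta` exists; file names here are illustrative only.
import subprocess

def run_mafft(in_fasta: str, out_fasta: str) -> None:
    """Align sequences with MAFFT's automatic strategy selection (--auto)."""
    with open(out_fasta, "w") as out:
        subprocess.run(["mafft", "--auto", in_fasta], stdout=out, check=True)

if __name__ == "__main__":
    run_mafft("sequences.fasta", "aligned.fasta")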


Journal ArticleDOI
TL;DR: What the original concepts were, what their present status is and how they may be expected to contribute to future systems biology approaches are described.
Abstract: Since its beginning as a data collection more than 20 years ago, the TRANSFAC project underwent an evolution to become the basis for a complex platform for the description and analysis of gene regulatory events and networks. In the following, I describe what the original concepts were, what their present status is and how they may be expected to contribute to future systems biology approaches.

406 citations


Journal ArticleDOI
TL;DR: Recently developed gene set analysis methods evaluate differential expression patterns of gene groups instead of those of individual genes, an approach that has been quite successful in deriving new information from expression data.
Abstract: Recently developed gene set analysis methods evaluate differential expression patterns of gene groups instead of those of individual genes. This approach especially targets gene groups whose constituents show subtle but coordinated expression changes, which might not be detected by the usual individual gene analysis. The approach has been quite successful in deriving new information from expression data, and a number of methods and tools have been developed intensively in recent years. We review those methods and currently available tools, classify them according to the statistical methods employed, and discuss their pros and cons. We also discuss several interesting extensions to the methods.

331 citations
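To make the idea concrete, below is a minimal sketch of a simple self-contained gene-set test of the kind reviewed above: the set statistic is the mean absolute t-statistic of the member genes, and significance is assessed by permuting sample labels. The data are synthetic and the statistic is illustrative, not any specific published method.

# Minimal sketch of a self-contained gene-set test: the set statistic is the
# mean absolute t-statistic of its member genes, and significance is assessed
# by permuting sample labels. Data and gene indices are synthetic.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 20))            # 200 genes x 20 samples
labels = np.array([0] * 10 + [1] * 10)       # two phenotype groups
gene_set = np.arange(0, 15)                  # indices of genes in the set
expr[gene_set[:, None], labels == 1] += 0.5  # subtle coordinated shift

def set_statistic(expr, labels, gene_set):
    t, _ = ttest_ind(expr[gene_set][:, labels == 0],
                     expr[gene_set][:, labels == 1], axis=1)
    return np.mean(np.abs(t))

observed = set_statistic(expr, labels, gene_set)
perms = [set_statistic(expr, rng.permutation(labels), gene_set)
         for _ in range(1000)]
p_value = (1 + sum(p >= observed for p in perms)) / (1 + len(perms))
print("set statistic:", observed, "permutation p-value:", p_value)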


Journal ArticleDOI
TL;DR: This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison, and overviews the available programs and evaluation guidelines for genomic/proteomic data.
Abstract: ROC (‘receiver operating characteristic’) analysis is a visual as well as numerical method used for assessing the performance of classification algorithms, such as those used for predicting structures and functions from sequence data. This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison. We overview the available programs and provide evaluation guidelines for genomic/proteomic data, with particular regard to applications to large and heterogeneous databases used in bioinformatics.

313 citations
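As a concrete illustration of the concepts summarized above, a minimal sketch of computing an ROC curve and its area under the curve (AUC) from classifier scores follows; the scores and labels are synthetic, and ties between scores are not treated specially.

# Minimal sketch of ROC analysis for a binary classifier: rank predictions by
# score, accumulate true- and false-positive rates, and integrate to get AUC.
import numpy as np

def roc_curve(labels, scores):
    order = np.argsort(-scores)                 # descending score
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                     # true positives at each cut-off
    fps = np.cumsum(1 - labels)                 # false positives at each cut-off
    tpr = tps / labels.sum()
    fpr = fps / (len(labels) - labels.sum())
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

def auc(fpr, tpr):
    # trapezoidal integration of the ROC curve
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
fpr, tpr = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))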


Journal ArticleDOI
TL;DR: This article provides a review of several recently developed penalized feature selection and classification techniques, which belong to the family of embedded feature selection methods, for bioinformatics studies with high-dimensional input.
Abstract: In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifiers and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques—which belong to the family of embedded feature selection methods—for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data.

255 citations
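As one hedged example of an embedded (penalized) method of the kind reviewed above, the sketch below fits an L1-penalized logistic regression on synthetic high-dimensional data and reads the selected features off the non-zero coefficients; the regularization strength and data are illustrative.

# Minimal sketch of embedded feature selection via L1-penalized (lasso-type)
# logistic regression on synthetic high-dimensional data; the non-zero
# coefficients define the selected features. Settings are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))          # 100 samples, 2000 features
beta = np.zeros(2000)
beta[:10] = 1.5                           # only 10 features carry signal
y = (X @ beta + rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
selected = np.flatnonzero(clf.coef_[0])   # features with non-zero weight
print(f"{selected.size} features selected:", selected[:20])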


Journal ArticleDOI
TL;DR: A set of desirable clustering features that are used as evaluation criteria for clustering algorithms is presented, and the algorithms' benefits and drawbacks are outlined as a basis for matching them to biomedical applications.
Abstract: Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.

194 citations
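For reference, a minimal sketch of k-means (Lloyd's algorithm), one of the two methods the review identifies as most popular, is given below on synthetic two-dimensional data; empty clusters and restarts are not handled.

# Minimal sketch of k-means (Lloyd's algorithm) on synthetic data.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centres (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ((0, 0), (3, 3), (0, 3))])
labels, centers = kmeans(X, k=3)
print(centers)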


Journal ArticleDOI
TL;DR: This review introduces and describes the concepts related to neural networks, the advantages and caveats to their use, examples of their applications in mass spectrometry and microarray research (with a particular focus on cancer studies), and illustrations from recent literature showing where neural networks have performed well in comparison to other machine learning methods.
Abstract: Applications of genomic and proteomic technologies have seen a major increase, resulting in an explosion in the amount of highly dimensional and complex data being generated. Subsequently this has increased the effort by the bioinformatics community to develop novel computational approaches that allow for meaningful information to be extracted. This information must be of biological relevance and thus correlate to disease phenotypes of interest. Artificial neural networks are a form of machine learning from the field of artificial intelligence with proven pattern recognition capabilities and have been utilized in many areas of bioinformatics. This is due to their ability to cope with highly dimensional complex datasets such as those developed by protein mass spectrometry and DNA microarray experiments. As such, neural networks have been applied to problems such as disease classification and identification of biomarkers. This review introduces and describes the concepts related to neural networks, the advantages and caveats to their use, examples of their applications in mass spectrometry and microarray research (with a particular focus on cancer studies), and illustrations from recent literature showing where neural networks have performed well in comparison to other machine learning methods. This should form the necessary background knowledge and information enabling researchers with an interest in these methodologies, but not necessarily from a machine learning background, to apply the concepts to their own datasets, thus maximizing the information gain from these complex biological systems.

173 citations
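As a small, hedged illustration of the kind of application discussed above, the sketch below trains a feed-forward neural network on synthetic 'microarray-like' data and reports held-out accuracy; the architecture and parameters are illustrative and not taken from any cited study.

# Minimal sketch of a feed-forward neural network classifier applied to
# synthetic high-dimensional data; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))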


Journal ArticleDOI
TL;DR: Although it is clear that there have been some improvements in methods for predicting interacting sites, several major bottlenecks remain and community standards for testing, training and performance measures are necessary for progress in the field.
Abstract: The identification of protein-protein interaction sites is an essential intermediate step for mutant design and the prediction of protein networks. In recent years a significant number of methods have been developed to predict these interface residues, and here we review the current status of the field. Progress in this area requires a clear view of the methodology applied, the data sets used for training and testing the systems, and the evaluation procedures. We have analysed the impact of a representative set of features and algorithms and highlighted the problems inherent in generating reliable protein data sets and in the subsequent analysis of the results. Although it is clear that there have been some improvements in methods for predicting interacting sites, several major bottlenecks remain. Proteins in complexes are still under-represented in the structural databases and in particular many proteins involved in transient complexes are still to be crystallized. We provide suggestions for effective feature selection, and make it clear that community standards for testing, training and performance measures are necessary for progress in the field.

170 citations


Journal ArticleDOI
TL;DR: Stochastic simulation methods are systematically reviewed in order to guide the researcher and help her find the appropriate method for a specific problem.
Abstract: Computer simulations have become an invaluable tool to study the sometimes counterintuitive temporal dynamics of (bio-)chemical systems. In particular, stochastic simulation methods have attracted increasing interest recently. In contrast to the well-known deterministic approach based on ordinary differential equations, they can capture effects that occur due to the underlying discreteness of the systems and random fluctuations in molecular numbers. Numerous stochastic, approximate stochastic and hybrid simulation methods have been proposed in the literature. In this article, they are systematically reviewed in order to guide the researcher and help her find the appropriate method for a specific problem.

169 citations
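To anchor the discussion, a minimal sketch of Gillespie's direct method, the classic exact stochastic simulation algorithm, is shown below for a simple birth-death (synthesis/degradation) system; the rate constants are illustrative.

# Minimal sketch of Gillespie's direct stochastic simulation algorithm for a
# birth-death process (e.g. mRNA synthesis and degradation).
import numpy as np

def gillespie_birth_death(k_syn=10.0, k_deg=0.1, x0=0, t_end=100.0, seed=0):
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while t < t_end:
        a_syn, a_deg = k_syn, k_deg * x      # reaction propensities
        a_total = a_syn + a_deg
        t += rng.exponential(1.0 / a_total)  # waiting time to next reaction
        if rng.random() < a_syn / a_total:   # choose which reaction fires
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return np.array(times), np.array(counts)

times, counts = gillespie_birth_death()
print("final molecule count:", counts[-1], "(deterministic steady state: 100)")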


Journal ArticleDOI
TL;DR: The unanimous agreement that cellular processes are (largely) governed by interactions between proteins has led to enormous community efforts culminating in overwhelming information relating to these proteins: to the regulation of their interactions, to the way in which they interact and to the function that is determined by these interactions.
Abstract: The unanimous agreement that cellular processes are (largely) governed by interactions between proteins has led to enormous community efforts culminating in overwhelming information relating to these proteins: to the regulation of their interactions, to the way in which they interact and to the function that is determined by these interactions. These data have been organized in databases and servers. However, to make these really useful, it is essential not only to be aware of them, but in particular to have a working knowledge of which tools to use for a given problem; what their advantages and drawbacks are; and, no less important, how to combine them for a particular goal, since usually it is not one tool but some combination of tool modules that is needed. This is the goal of this review.


Journal ArticleDOI
TL;DR: It is shown that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.
Abstract: Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10 000 entries, which leads us to reflect on how far we are from completing this work. Currently Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis a further 28 000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.

Journal ArticleDOI
TL;DR: VisANT is visual mining software that integrates, mines and displays hierarchical information to capture functional hierarchies and adaptation to environmental change, and to discover pathways and processes embedded in known data but not currently recognizable.
Abstract: The essence of a living cell is adaptation to a changing environment, and a central goal of modern cell biology is to understand adaptive change under normal and pathological conditions. Because the number of components is large, and processes and conditions are many, visual tools are useful in providing an overview of relations that would otherwise be far more difficult to assimilate. Historically, representations were static pictures, with genes and proteins represented as nodes, and known or inferred correlations between them (links) represented by various kinds of lines. The modern challenge is to capture functional hierarchies and adaptation to environmental change, and to discover pathways and processes embedded in known data, but not currently recognizable. Among the tools being developed to meet this challenge is VisANT (freely available at http://visant.bu.edu) which integrates, mines and displays hierarchical information. Challenges to integrating modeling (discrete or continuous) and simulation capabilities into such visual mining software are briefly discussed.

Journal ArticleDOI
TL;DR: The features of BioMoby that make it distinct from other Semantic Web Service and interoperability initiatives, and that have been instrumental to its deployment and use by a wide community of bioinformatics service providers are highlighted.
Abstract: The BioMoby project was initiated in 2001 from within the model organism database community. It aimed to standardize methodologies to facilitate information exchange and access to analytical resources, using a consensus driven approach. Six years later, the BioMoby development community is pleased to announce the release of the 1.0 version of the interoperability framework, registry Application Programming Interface and supporting Perl and Java code-bases. Together, these provide interoperable access to over 1400 bioinformatics resources worldwide through the BioMoby platform, and this number continues to grow. Here we highlight and discuss the features of BioMoby that make it distinct from other Semantic Web Service and interoperability initiatives, and that have been instrumental to its deployment and use by a wide community of bioinformatics service providers. The standard, client software, and supporting code libraries are all freely available at http://www.biomoby.org/.

Journal ArticleDOI
TL;DR: Based on a comparison of the performance of operon predictions on Escherichia coli and Bacillus subtilis, it is concluded that there is still room for improvement in operon prediction methods, and it is expected that existing and newly generated genomics and transcriptomics data will further improve the accuracy of these methods.
Abstract: For most organisms, computational operon predictions are the only source of genome-wide operon information. Operon prediction methods described in literature are based on (a combination of) the following five criteria: (i) intergenic distance, (ii) conserved gene clusters, (iii) functional relation, (iv) sequence elements and (v) experimental evidence. The performance estimates of operon predictions reported in literature cannot directly be compared due to differences in methods and data used in these studies. Here, we survey the current status of operon prediction methods. Based on a comparison of the performance of operon predictions on Escherichia coli and Bacillus subtilis we conclude that there is still room for improvement. We expect that existing and newly generated genomics and transcriptomics data will further improve accuracy of operon prediction methods.
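As a toy illustration of criterion (i) above, the sketch below predicts operons from intergenic distance alone: adjacent genes on the same strand that lie closer than a threshold are grouped into one operon. The gene coordinates and the 50 bp threshold are hypothetical.

# Minimal sketch of operon prediction by the intergenic-distance criterion
# alone; real methods combine several criteria and trained thresholds.
genes = [  # (name, strand, start, end), sorted by genome position
    ("geneA", "+", 100, 1000),
    ("geneB", "+", 1020, 2000),   # 20 bp gap to geneA -> same operon
    ("geneC", "+", 2600, 3500),   # 600 bp gap -> new operon
    ("geneD", "-", 3600, 4500),   # strand change -> new operon
]

def predict_operons(genes, max_gap=50):
    operons, current = [], [genes[0]]
    for prev, cur in zip(genes, genes[1:]):
        same_strand = prev[1] == cur[1]
        gap = cur[2] - prev[3]
        if same_strand and gap <= max_gap:
            current.append(cur)
        else:
            operons.append(current)
            current = [cur]
    operons.append(current)
    return operons

for op in predict_operons(genes):
    print([g[0] for g in op])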

Journal ArticleDOI
TL;DR: This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases.
Abstract: A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that a lot of information in public sequence databases is not linked to formal taxonomic names. This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases. The structure of these links can also be exploited using the PageRank algorithm to rank the results of searches on biodiversity databases. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers [such as Digital Object Identifiers (DOIs) and Life Science Identifiers (LSIDs)], and the implementation of services that link those identifiers.
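To illustrate the ranking idea mentioned above, a minimal PageRank sketch over a toy graph of linked records (specimen, sequence, publication, taxon name) follows; the graph and the node meanings are hypothetical.

# Minimal sketch of PageRank by power iteration over a small link graph.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # dangling nodes (no outgoing links) distribute their rank uniformly
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - damping) / n + damping * trans.T @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# nodes: 0 specimen, 1 GenBank record, 2 publication, 3 taxon name
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 0, 0, 0]], dtype=float)
print(pagerank(adj))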

Journal ArticleDOI
TL;DR: It is concluded that text mining techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale up high-quality manual curation.
Abstract: The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time consuming, and does not scale with the ever increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale-up high-quality manual curation.

Journal ArticleDOI
TL;DR: Significance analysis of microarray for gene-set reduction (SAM-GSR) can be used for an analytical reduction of gene sets to their core subsets and the importance of distinguishing between 'self-contained' versus 'competitive' methods is emphasized.
Abstract: Gene-set analysis aims to identify differentially expressed gene sets (pathways) by a phenotype in DNA microarray studies. We review here important methodological aspects of gene-set analysis and illustrate them with varying performance of several methods proposed in the literature. We emphasize the importance of distinguishing between ‘self-contained’ versus ‘competitive’ methods, following Goeman and Buhlmann. We also discuss reducing a gene set to its subset, consisting of ‘core members’ that chiefly contribute to the statistical significance of the differential expression of the initial gene set by phenotype. Significance analysis of microarray for gene-set reduction (SAM-GSR) can be used for an analytical reduction of gene sets to their core subsets. We apply SAM-GSR on a microarray dataset for identifying biological gene sets (pathways) whose gene expressions are associated with p53 mutation in cancer cell lines. Codes to implement SAM-GSR in the statistical package R can be downloaded from http://www.ualberta.ca/~yyasui/homepage.html.

Journal ArticleDOI
TL;DR: The Beta Workbench is introduced, a scalable tool built on top of the newly defined BlenX language to model, simulate and analyse biological systems and a comparison with related approaches is provided.
Abstract: We introduce the Beta Workbench (BWB), a scalable tool built on top of the newly defined BlenX language to model, simulate and analyse biological systems. We show the features and the incremental modelling process supported by the BWB on a running example based on the mitogen-activated kinase pathway. Finally, we provide a comparison with related approaches and some hints for future extensions.

Journal ArticleDOI
TL;DR: Progress in multiscale modeling is described and illustrated with reference to the heart Physiome Project, which aims to understand cardiac arrhythmias in terms of structure-function relations from proteins up to cells, tissues and organs.
Abstract: Multiscale modeling is required for linking physiological processes operating at the organ and tissue levels to signal transduction networks and other subcellular processes. Several XML markup languages, including CellML, have been developed to encode models and to facilitate the building of model repositories and general purpose software tools. Progress in this area is described and illustrated with reference to the heart Physiome Project which aims to understand cardiac arrhythmias in terms of structure-function relations from proteins up to cells, tissues and organs.

Journal ArticleDOI
TL;DR: This review discusses the present understanding of protein domain promiscuity, its evolution and its role in cellular function, and describes the biological mechanisms of protein domain mobility.
Abstract: A substantial fraction of eukaryotic proteins contains multiple domains, some of which show a tendency to occur in diverse domain architectures and can be considered mobile (or ‘promiscuous’). These promiscuous domains are typically involved in protein–protein interactions and play crucial roles in interaction networks, particularly those contributing to signal transduction. They also play a major role in creating diversity of protein domain architecture in the proteome. It is now apparent that promiscuity is a volatile and relatively fast-changing feature in evolution, and that only a few domains retain their promiscuity status throughout evolution. Many such domains attained their promiscuity status independently in different lineages. Only recently, we have begun to understand the diversity of protein domain architectures and the role the promiscuous domains play in evolution of this diversity. However, many of the biological mechanisms of protein domain mobility remain shrouded in mystery. In this review, we discuss our present understanding of protein domain promiscuity, its evolution and its role in cellular function.

Journal ArticleDOI
TL;DR: A Web 2.0-based Scientific Social Community (SSC) model is proposed, which can foster collaboration and harness collective intelligence to create and discover new knowledge and promote a web services-based pipeline featuring web services for computer-to-computer data exchange as users add value.
Abstract: Enabling deft data integration from numerous, voluminous and heterogeneous data sources is a major bioinformatic challenge. Several approaches have been proposed to address this challenge, including data warehousing and federated databasing. Yet despite the rise of these approaches, integration of data from multiple sources remains problematic and toilsome. These two approaches follow a user-to-computer communication model for data exchange, and do not facilitate a broader concept of data sharing or collaboration among users. In this report, we discuss the potential of Web 2.0 technologies to transcend this model and enhance bioinformatics research. We propose a Web 2.0-based Scientific Social Community (SSC) model for the implementation of these technologies. By establishing a social, collective and collaborative platform for data creation, sharing and integration, we promote a web services-based pipeline featuring web services for computer-to-computer data exchange as users add value. This pipeline aims to simplify data integration and creation, to realize automatic analysis, and to facilitate reuse and sharing of data. SSC can foster collaboration and harness collective intelligence to create and discover new knowledge. In addition to its research potential, we also describe its potential role as an e-learning platform in education. We discuss lessons from information technology, predict the next generation of Web (Web 3.0), and describe its potential impact on the future of bioinformatics studies.

Journal ArticleDOI
TL;DR: The theory of codon models and their application to the detection of positive selection are outlined, laying a foundation for further advances in the modeling of coding sequence evolution.
Abstract: Probabilistic models of sequence evolution are in widespread use in phylogenetics and molecular sequence evolution. These models have become increasingly sophisticated and combined with statistical model comparison techniques have helped to shed light on how genes and proteins evolve. Models of codon evolution have been particularly useful, because, in addition to providing a significant improvement in model realism for protein-coding sequences, codon models can also be designed to test hypotheses about the selective pressures that shape the evolution of the sequences. Such models typically assume a phylogeny and can be used to identify sites or lineages that have evolved adaptively. Recently some of the key assumptions that underlie phylogenetic tests of selection have been questioned, such as the assumption that the rate of synonymous changes is constant across sites or that a single phylogenetic tree can be assumed at all sites for recombining sequences. While some of these issues have been addressed through the development of novel methods, others remain as caveats that need to be considered on a case-by-case basis. Here, we outline the theory of codon models and their application to the detection of positive selection. We review some of the more recent developments that have improved their power and utility, laying a foundation for further advances in the modeling of coding sequence evolution.
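For reference, a sketch of the instantaneous rate matrix of a basic codon substitution model of the kind reviewed above (in the style of the Goldman-Yang/Nielsen-Yang parameterizations, up to a normalizing constant) is given below, where kappa is the transition/transversion rate ratio, omega = dN/dS, and pi_j is the equilibrium frequency of the target codon; sites or lineages with omega > 1 are taken as evidence of positive selection. Details differ between published variants.

q_{ij} =
\begin{cases}
0, & \text{codons } i \text{ and } j \text{ differ at more than one position,}\\
\pi_j, & \text{synonymous transversion,}\\
\kappa\,\pi_j, & \text{synonymous transition,}\\
\omega\,\pi_j, & \text{nonsynonymous transversion,}\\
\omega\,\kappa\,\pi_j, & \text{nonsynonymous transition.}
\end{cases}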

Journal ArticleDOI
TL;DR: This review provides a brief overview of approaches to literature mining as they relate to drug discovery, and offers an illustrative case study of a 'lightweight' approach the authors have implemented within an industrial context.
Abstract: The drug discovery enterprise provides strong drivers for data integration. While attention in this arena has tended to focus on integration of primary data from omics and other large platform technologies contributing to drug discovery and development, the scientific literature remains a major source of information valuable to pharmaceutical enterprises, and therefore tools for mining such data and integrating it with other sources are of vital interest and economic impact. This review provides a brief overview of approaches to literature mining as they relate to drug discovery, and offers an illustrative case study of a ‘lightweight’ approach we have implemented within an industrial context.

Journal ArticleDOI
TL;DR: A general approach is introduced that provides the foundations for a structured formal engineering of large-scale models of biochemical networks, using signal transduction as the main example.
Abstract: Quantitative models of biochemical networks (signal transduction cascades, metabolic pathways, gene regulatory circuits) are a central component of modern systems biology. Building and managing these complex models is a major challenge that can benefit from the application of formal methods adopted from theoretical computing science. Here we provide a general introduction to the field of formal modelling, which emphasizes the intuitive biochemical basis of the modelling process, but is also accessible for an audience with a background in computing science and/or model engineering. We show how signal transduction cascades can be modelled in a modular fashion, using both a qualitative approach (qualitative Petri nets) and quantitative approaches (continuous Petri nets and ordinary differential equations, ODEs). We review the major elementary building blocks of a cellular signalling model, discuss which critical design decisions have to be made during model building, and present a number of novel computational tools that can help to explore alternative modular models in an easy and intuitive manner. These tools, which are based on Petri net theory, offer convenient ways of composing hierarchical ODE models, and permit a qualitative analysis of their behaviour. We illustrate the central concepts using signal transduction as our main example. The ultimate aim is to introduce a general approach that provides the foundations for a structured formal engineering of large-scale models of biochemical networks.
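As a minimal example of the quantitative (ODE) building blocks discussed above, the sketch below integrates a single phosphorylation/dephosphorylation cycle with mass-action kinetics; the species names, rate constants and initial conditions are illustrative.

# Minimal sketch of one elementary signalling module as an ODE system:
# a phosphorylation/dephosphorylation cycle with mass-action kinetics.
from scipy.integrate import solve_ivp

def cycle(t, y, k_phos=1.0, k_dephos=0.5, kinase=1.0, phosphatase=1.0):
    protein, protein_p = y
    v_phos = k_phos * kinase * protein            # kinase-catalysed step
    v_dephos = k_dephos * phosphatase * protein_p  # phosphatase-catalysed step
    return [v_dephos - v_phos, v_phos - v_dephos]

sol = solve_ivp(cycle, t_span=(0, 10), y0=[1.0, 0.0])
print("phosphorylated fraction at t=10:", sol.y[1, -1])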

Journal ArticleDOI
TL;DR: Recent examples are shown that highlight the utility of these tools in recognizing remote homologies between pairs of protein structures and in assigning putative biochemical functions to newly determined targets from structural genomics projects.
Abstract: The Protein Data Bank Japan (PDBj) curates, edits and distributes protein structural data as a member of the worldwide Protein Data Bank (wwPDB) and currently processes ~25-30% of all deposited data in the world. Structural information is enhanced by the addition of biological and biochemical functional data as well as experimental details extracted from the literature and other databases. Several applications have been developed at PDBj for structural biology and biomedical studies: (i) a Java-based molecular graphics viewer, jV; (ii) display of electron density maps for the evaluation of structure quality; (iii) an extensive database of molecular surfaces for functional sites, eF-site, as well as a search service for similar molecular surfaces, eF-seek; (iv) identification of sequence and structural neighbors; (v) a graphical user interface to all known protein folds with links to the above applications, Protein Globe. Recent examples are shown that highlight the utility of these tools in recognizing remote homologies between pairs of protein structures and in assigning putative biochemical functions to newly determined targets from structural genomics projects.

Journal ArticleDOI
TL;DR: This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.
Abstract: In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data. Programmatic access to services, for data and processes, means that compositions of services can be made that represent the in silico experiments or processes that bioinformaticians perform. Data integration through workflows depends on being able to know what services exist and where to find those services. The large number of services and the operations they perform, their arbitrary naming and lack of documentation, however, mean that they can be difficult to use. The workflows themselves are composite processes that could be pooled and reused but only if they too can be found and understood. Thus appropriate curation, including semantic mark-up, would enable processes to be found, maintained and consequently used more easily. This broader view on semantic annotation is vital for full data integration that is necessary for the modern scientific analyses in biology. This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.

Journal ArticleDOI
TL;DR: This review provides an introduction to current CI methods and their application to biological problems, and concludes with a commentary about the anticipated impact of these approaches in bioinformatics.
Abstract: Biology, chemistry and medicine are faced by tremendous challenges caused by an overwhelming amount of data and the need for rapid interpretation. Computational intelligence (CI) approaches such as artificial neural networks, fuzzy systems and evolutionary computation are being used with increasing frequency to contend with this problem, in light of noise, non-linearity and temporal dynamics in the data. Such methods can be used to develop robust models of processes either on their own or in combination with standard statistical approaches. This is especially true for database mining, where modeling is a key component of scientific understanding. This review provides an introduction to current CI methods and their application to biological problems, and concludes with a commentary about the anticipated impact of these approaches in bioinformatics.

Journal ArticleDOI
TL;DR: The major concepts behind repeat-detecting software essential for informed tool selection are introduced, and issues such as parameter settings and program bias are discussed using examples from the currently available range of programs, to provide an integrated comparison and practical guide to microsatellite-detecting programs.
Abstract: Short tandem repeats, specifically microsatellites, are widely used genetic markers, associated with human genetic diseases, and play an important role in various regulatory mechanisms and evolution. Despite their importance, much is yet unknown about their mutational dynamics. The increasing availability of genome data has led to several in silico studies of microsatellite evolution which have produced a vast range of algorithms and software for tandem repeat detection. Documentation of these tools is often sparse, or provided in a format that is impenetrable to most biologists without informatics background. This article introduces the major concepts behind repeat detecting software essential for informed tool selection. We reflect on issues such as parameter settings and program bias, as well as redundancy filtering and efficiency using examples from the currently available range of programs, to provide an integrated comparison and practical guide to microsatellite detecting programs.
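To make the detection concepts concrete, a minimal sketch of perfect tandem-repeat detection with a back-referencing regular expression follows; the motif-length and copy-number thresholds are illustrative, and real tools add handling of imperfect repeats, redundancy filtering and scoring.

# Minimal sketch of perfect microsatellite detection: a motif of 1-6 bp
# repeated enough times. Overlapping/nested hits are reported as-is; real
# tools apply redundancy filtering.
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_copies=4):
    pattern = re.compile(
        r"(?=(([ACGT]{%d,%d}?)\2{%d,}))" % (min_unit, max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        repeat, unit = m.group(1), m.group(2)
        hits.append((m.start(), unit, len(repeat) // len(unit)))
    return hits

seq = "GGATTACACACACACACAGGTTTAGAGAGAGATT"
for start, unit, copies in find_microsatellites(seq):
    print(f"pos {start}: ({unit})x{copies}")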