
Showing papers in "Journal of Integrative Bioinformatics in 2008"


Journal ArticleDOI
TL;DR: This paper presents a novel bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local database management system by using a Java-based system architecture and object-relational mapping (ORM) technology.
Abstract: This paper presents a novel bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local database management system. It stands out from other approaches by providing up-to-date integrated knowledge, platform and database independence as well as high usability and customization. This open source software can be used as a general infrastructure for integrative bioinformatics research and development. The advantages of the approach are realized by using a Java-based system architecture and object-relational mapping (ORM) technology. Finally, a practical application of the system is presented within the emerging area of medical bioinformatics to show the usefulness of the approach. The BioDWH data warehouse software is available for the scientific community at http://sourceforge.net/projects/biodwh/.

39 citations


Journal ArticleDOI
TL;DR: The results indicate that parsing the connection subgraph directly is much more effective than parsing individual paths separately, and it is shown that a bidirectional parsing algorithm, in most cases, allows for searching paths twice as long as a unidirectional search strategy does.
Abstract: We describe a method for querying vertex- and edge-labeled graphs using context-free grammars to specify the class of interesting paths. We introduce a novel problem: finding the connection subgraph induced by the set of matching paths between two given vertices or two sets of vertices. Such a subgraph provides a concise summary of the relationship between the vertices. We also present novel algorithms for parsing subgraphs directly, without enumerating all the individual paths. We evaluate the presented parsing algorithms experimentally on a set of real graphs derived from publicly available biomedical databases and on randomly generated graphs. The results indicate that parsing the connection subgraph directly is much more effective than parsing individual paths separately. Furthermore, we show that a bidirectional parsing algorithm, in most cases, allows for searching paths twice as long as a unidirectional search strategy does.
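The underlying idea, matching paths against a context-free grammar without enumerating them, can be illustrated with the classic CFL-reachability fixpoint. This is a generic sketch, not the authors' connection-subgraph algorithm; the Chomsky-normal-form grammar encoding is an assumption made for brevity:

```python
from collections import defaultdict

def cfl_reachability(edges, unary, binary):
    """All facts (A, u, v) such that nonterminal A derives the label string
    of some path from u to v.  Grammar in Chomsky normal form:
      unary:  {edge_label: {A, ...}}   for rules A -> label
      binary: {(B, C): {A, ...}}       for rules A -> B C
    edges is a list of (u, label, v) triples."""
    facts = set()
    work = []
    by_start = defaultdict(set)  # facts indexed by start vertex
    by_end = defaultdict(set)    # facts indexed by end vertex

    def add(fact):
        if fact not in facts:
            facts.add(fact)
            work.append(fact)
            by_start[fact[1]].add(fact)
            by_end[fact[2]].add(fact)

    for u, lab, v in edges:
        for a in unary.get(lab, ()):
            add((a, u, v))
    while work:
        b, u, v = work.pop()
        # join on the right: (B,u,v) + (C,v,w)  =>  (A,u,w)
        for c, _, w in list(by_start[v]):
            for a in binary.get((b, c), ()):
                add((a, u, w))
        # join on the left:  (C,w,u) + (B,u,v)  =>  (A,w,v)
        for c, w, _ in list(by_end[u]):
            for a in binary.get((c, b), ()):
                add((a, w, v))
    return facts
```

With a grammar deriving a^n b^n, a chain labeled a,a,b,b yields a match over the whole chain; the facts for two vertices induce exactly the matching-path subgraph the paper summarizes.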

33 citations


Journal ArticleDOI
TL;DR: ReMatch is a web-based, user-friendly tool that constructs stoichiometric network models for metabolic flux analysis, integrating user-developed models into a database collected from several comprehensive metabolic data resources, including KEGG, MetaCyc and ChEBI.
Abstract: ReMatch is a web-based, user-friendly tool that constructs stoichiometric network models for metabolic flux analysis, integrating user-developed models into a database collected from several comprehensive metabolic data resources, including KEGG, MetaCyc and ChEBI. In particular, ReMatch augments the metabolic reactions of the model with carbon mappings to facilitate 13C metabolic flux analysis. The construction of a network model consisting of biochemical reactions is the first step in most metabolic modelling tasks. This model construction can be a tedious task, as the required information is usually scattered across many separate databases whose interoperability is suboptimal due to the heterogeneous naming conventions for metabolites in different databases. Another, particularly severe data integration problem is faced in 13C metabolic flux analysis, where mappings of carbon atoms from substrates into products are required in the model. ReMatch has been developed to solve the above data integration problems. First, ReMatch matches the imported user-developed model against the internal ReMatch database while consulting a comprehensive metabolite name thesaurus. This, together with wild card support, allows the user to specify the model quickly without having to look the names up manually. Second, ReMatch is able to augment reactions of the model with carbon mappings, obtained either from the internal database or given by the user with an easy-to-use tool. The constructed models can be exported into 13C-FLUX and SBML file formats. Further, a stoichiometric matrix and visualizations of the network model can be generated. The constructed models of metabolic networks can optionally be made available to the other users of ReMatch. Thus, ReMatch provides a common repository of metabolic network models with carbon mappings for the needs of the metabolic flux analysis community.
ReMatch is freely available for academic use at http://www.cs.helsinki.fi/group/sysfys/software/rematch/.
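The thesaurus-plus-wildcard matching step can be sketched as follows. The thesaurus entries and the normalization rule are illustrative assumptions; the real ReMatch database is built from KEGG, MetaCyc and ChEBI:

```python
import fnmatch

# Toy synonym thesaurus: canonical id -> known names (hypothetical entries;
# ReMatch's actual thesaurus is compiled from KEGG, MetaCyc and ChEBI).
THESAURUS = {
    "C00031": ["D-glucose", "glucose", "dextrose"],
    "C00022": ["pyruvate", "pyruvic acid"],
}

def normalize(name):
    """Case- and punctuation-insensitive form of a metabolite name."""
    return name.strip().lower().replace("-", "").replace(" ", "")

def match_metabolite(query):
    """Resolve a user-supplied metabolite name, with '*' wildcards,
    against the synonym thesaurus; returns matching canonical ids."""
    q = normalize(query)
    hits = set()
    for cid, synonyms in THESAURUS.items():
        for s in synonyms:
            if fnmatch.fnmatchcase(normalize(s), q):
                hits.add(cid)
    return sorted(hits)
```

A query like "pyruv*" then resolves without the user having to know the exact database spelling, which is the point of the wildcard support described above.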

15 citations


Journal ArticleDOI
TL;DR: This study has discovered that in the majority of cases, Affymetrix probesets on Human GeneChips do not measure one unique block of transcription, and that in a number of probesets the mismatch probes are an informative diagnostic of expression, rather than providing a measure of background contamination.
Abstract: We have developed a computational pipeline to analyse large surveys of Affymetrix GeneChips, for example NCBI's Gene Expression Omnibus (GEO). GEO holds sample data for many organisms, tissues and phenotypes. Because of this experimental diversity, any observed correlations between probe intensities can be associated either with robust biology, such as common co-expression, or with systematic biases of the GeneChip technology. Our bioinformatics pipeline integrates the mapping of probes to exons, quality control checks on each GeneChip that identify flaws in hybridization quality, and the mining of correlations in intensities between groups of probes. The output from our pipeline has enabled us to identify systematic biases in GeneChip data. We are also able to use the pipeline as a discovery tool for biology. We have discovered that in the majority of cases, Affymetrix probesets on Human GeneChips do not measure one unique block of transcription. Instead we see numerous examples of outlier probes. Our study has also identified that in a number of probesets the mismatch probes are an informative diagnostic of expression, rather than providing a measure of background contamination. We report evidence for systematic biases in GeneChip technology associated with probe-probe interactions. We also see signatures associated with post-transcriptional processing of RNA, such as alternative polyadenylation.

14 citations


Journal ArticleDOI
TL;DR: This paper presents an approach for constructing visualisations of two overlapping networks, based on a restricted three-dimensional representation, which aims both to achieve drawing aesthetics for each individual network and to highlight the part the networks share.
Abstract: Biological data is often structured in the form of complex interconnected networks such as protein interaction and metabolic networks. In this paper, we investigate a new problem of visualising such overlapping biological networks. Two networks overlap if they share some nodes and edges. We present an approach for constructing visualisations of two overlapping networks, based on a restricted three-dimensional representation. More specifically, we use three parallel two-dimensional planes placed in three dimensions to represent overlapping networks: one for each network (the top and the bottom planes) and one for the overlapping part (the middle plane). Our method aims both to achieve drawing aesthetics (or conventions) for each individual network and to highlight the part the networks share. Using three biological datasets, we evaluate our visualisation design with the aim of testing whether overlapping networks can support the visual analysis of heterogeneous and yet interconnected networks.
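The plane-assignment step of this layout is simple set arithmetic. A minimal sketch (the z-coordinates are arbitrary placeholders; the paper's actual layout also arranges nodes within each plane):

```python
def assign_planes(nodes_a, nodes_b, z=(1.0, 0.0, -1.0)):
    """Place nodes of two overlapping networks on three parallel planes:
    nodes only in A on the top plane, shared nodes on the middle plane,
    nodes only in B on the bottom plane. Returns {node: z_coordinate}."""
    a, b = set(nodes_a), set(nodes_b)
    top, mid, bottom = z
    layout = {}
    for v in a - b:
        layout[v] = top       # exclusive to network A
    for v in a & b:
        layout[v] = mid       # the overlapping part
    for v in b - a:
        layout[v] = bottom    # exclusive to network B
    return layout
```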

14 citations


Journal ArticleDOI
TL;DR: The goal is to re-annotate TFBMs by possibly switching their strands and shifting them a few positions in order to maximize the information content of the resulting adjusted PFM, and it is shown that MoRAine significantly improves the corresponding sequence logos.
Abstract: BACKGROUND A precise experimental identification of transcription factor binding motifs (TFBMs), accurate to a single base pair, is time-consuming and difficult. For several databases, TFBM annotations are extracted from the literature and stored 5' --> 3' relative to the target gene. Mixing the two possible orientations of a motif results in poor information content of subsequently computed position frequency matrices (PFMs) and sequence logos. Since these PFMs are used to predict further TFBMs, we address the question of whether the TFBMs underlying a PFM can be re-annotated automatically to improve both the information content of the PFM and subsequent classification performance. RESULTS We present MoRAine, an algorithm that re-annotates transcription factor binding motifs. Each motif with experimental evidence underlying a PFM is compared against each other such motif. The goal is to re-annotate TFBMs by possibly switching their strands and shifting them a few positions in order to maximize the information content of the resulting adjusted PFM. We present two heuristic strategies to perform this optimization and subsequently show that MoRAine significantly improves the corresponding sequence logos. Furthermore, we justify the method by evaluating the specificity, sensitivity, true positive rate, and false positive rate of PFM-based TFBM predictions for E. coli using the original database motifs and the MoRAine-adjusted motifs. The classification performance is considerably increased when MoRAine is used as a preprocessing step. CONCLUSIONS MoRAine is integrated into a publicly available web server and can be used online or downloaded as a stand-alone version from http://moraine.cebitec.uni-bielefeld.de.
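The objective being maximized, PFM information content, and a greedy strand-flip pass can be sketched as below. This is an illustrative heuristic, not one of MoRAine's two published strategies, and it handles strand switching only (the position shifts are omitted):

```python
import math

def column_ic(counts):
    """Information content (bits) of one PFM column vs a uniform background:
    IC = log2(4) + sum_b p_b * log2(p_b)."""
    n = sum(counts.values())
    return 2.0 + sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def pfm_ic(motifs):
    """Total information content of the PFM built from aligned motifs."""
    ic = 0.0
    for i in range(len(motifs[0])):
        counts = {}
        for m in motifs:
            counts[m[i]] = counts.get(m[i], 0) + 1
        ic += column_ic(counts)
    return ic

_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(_COMP)[::-1]

def reannotate(motifs, n_rounds=5):
    """Greedy strand re-annotation: flip any motif whose reverse complement
    raises the PFM information content, until no flip helps."""
    motifs = list(motifs)
    for _ in range(n_rounds):
        changed = False
        for i, m in enumerate(motifs):
            trial = motifs[:i] + [revcomp(m)] + motifs[i + 1:]
            if pfm_ic(trial) > pfm_ic(motifs):
                motifs = trial
                changed = True
        if not changed:
            break
    return motifs
```

On a set like ["AAAA", "TTTT", "AAAA"], flipping the middle motif raises the IC from about 4.3 to the maximum of 8 bits, which is exactly the mixed-orientation effect the abstract describes.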

9 citations


Journal ArticleDOI
TL;DR: The LexA protein is a transcriptional repressor of the bacterial SOS DNA repair system, which comprises a set of DNA repair and cellular survival genes that are induced in response to DNA damage; its varied DNA binding motifs have been characterized and reported in Escherichia coli, Bacillus subtilis, rhizobia family members and a marine magnetotactic bacterium.
Abstract: Summary The LexA protein is a transcriptional repressor of the bacterial SOS DNA repair system, which comprises a set of DNA repair and cellular survival genes induced in response to DNA damage. Its varied DNA binding motifs have been characterized and reported in Escherichia coli, Bacillus subtilis, rhizobia family members, a marine magnetotactic bacterium, Salmonella typhimurium and, recently, Mycobacterium tuberculosis; this motif information was used in our theoretical analysis to detect novel regulated genes in the genome of the radio-resistant Deinococcus radiodurans. This bacterium showed the presence of an SOS-box-like consensus sequence in the upstream sequences of 3166 genes, with >60% motif score similarity percentage (MSSP) on both strands. Attempts to identify LexA-binding sites and the composition of the putative SOS regulon in D. radiodurans have so far been unsuccessful. To address this, we performed a theoretical analysis, with modifications, on a reported data set of genes related to DNA repair (61 genes), stress response (145 genes) and some unusual predicted operons (21 clusters). Expression of the predicted SOS-box-regulated operon members was then examined against previously reported microarray data, which confirmed the expression of only a single predicted operon, comprising DRB0143 (AAA superfamily NTPase related to the 5-methylcytosine-specific restriction enzyme subunit McrB) and DRB0144 (homolog of the McrC subunit of the McrBC restriction modification system). The methodology involved weight matrix construction with the CONSENSUS algorithm, using conserved upstream sequences of eight known genes (dinB, tagC, lexA, recA, uvrB and yneA of B. subtilis, plus lexA and recA of D. radiodurans obtained by phylogenetic footprinting), followed by detection of similar conserved SOS-box-like LexA binding motifs with both the RSAT and PoSSuMsearch programs.
The resultant DNA consensus sequence contained a highly conserved 14 bp SOS-box-like binding site.
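The core scanning step, scoring upstream sequences on both strands against a weight matrix and keeping hits above a score-similarity cutoff, can be sketched like this. The pseudocount, background and normalized score are illustrative stand-ins for the paper's MSSP, not the RSAT or PoSSuMsearch scoring functions:

```python
import math

def make_pwm(sites):
    """Log-odds position weight matrix from aligned binding sites,
    with a small pseudocount and a uniform 0.25 background."""
    length, n = len(sites[0]), len(sites)
    pwm = []
    for i in range(length):
        col = {}
        for s in sites:
            col[s[i]] = col.get(s[i], 0) + 1
        pwm.append({b: math.log2((col.get(b, 0) + 0.25) / (n + 1) / 0.25)
                    for b in "ACGT"})
    return pwm

def score(pwm, window):
    return sum(pwm[i][b] for i, b in enumerate(window))

def scan(pwm, seq, min_frac=0.6):
    """Report hits on both strands whose normalized score exceeds
    min_frac of the maximum attainable PWM score."""
    best = sum(max(col.values()) for col in pwm)
    worst = sum(min(col.values()) for col in pwm)
    length = len(pwm)
    comp = str.maketrans("ACGT", "TGCA")
    rc = seq.translate(comp)[::-1]
    hits = []
    for strand, s in (("+", seq), ("-", rc)):
        for i in range(len(s) - length + 1):
            frac = (score(pwm, s[i:i + length]) - worst) / (best - worst)
            if frac >= min_frac:
                hits.append((strand, i, round(frac, 3)))
    return hits
```

Running such a scan with a 0.6 cutoff over every upstream region on both strands mirrors the ">60% MSSP on both strands" screen described above.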

8 citations


Journal ArticleDOI
TL;DR: The data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database AraCyc), which has been established in the ONDEX data integration system, are presented.
Abstract: Summary The automated annotation of data from high-throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process, both for use by expert curators and for predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database AraCyc), which has been established in the ONDEX data integration system. We also present a comparison between different methods for the integration of GO terms as part of the function assignment pipeline, and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system, which is freely available from http://ondex.sf.net/.

8 citations


Journal ArticleDOI
TL;DR: The annotations are the glue for integration of patterns of gene expression in GEMS as well as in other biomolecular databases, and extend GEMS expression patterns integration to a wide range of bioinformatics resources.
Abstract: Summary The Gene Expression Management System (GEMS) is a database system for patterns of gene expression. These patterns result from systematic whole-mount fluorescent in situ hybridization studies on zebrafish embryos. GEMS is an integrative platform that addresses one of the important challenges of developmental biology: how to integrate genetic data that underpin morphological changes during embryogenesis. Our motivation to build this system was driven by the need to organize and compare multiple patterns of gene expression at the tissue level. Integration with other developmental and biomolecular databases will further support our understanding of development. GEMS operates in concert with a database containing a digital atlas of the zebrafish embryo; this digital atlas of zebrafish development was conceived prior to the expansion of GEMS. The atlas contains 3D volume models of canonical stages of zebrafish development, in which each volume element is annotated with an anatomical term. These terms are extracted from a formal anatomical ontology, the Developmental Anatomy Ontology of Zebrafish (DAOZ). In GEMS, anatomical terms from this ontology, together with terms from the Gene Ontology (GO), are also used to annotate patterns of gene expression, thereby providing mechanisms for integration and retrieval. The annotations are the glue for the integration of patterns of gene expression in GEMS as well as in other biomolecular databases. On the one hand, zebrafish anatomy terminology allows gene expression data within GEMS to be integrated with phenotypical data in the 3D atlas of zebrafish development. On the other hand, GO terms extend the integration of GEMS expression patterns to a wide range of bioinformatics resources.

7 citations


Journal ArticleDOI
TL;DR: The functionality to display multiple microarray datasets simultaneously in Bluejay is developed, in order to provide researchers with a comprehensive view of their datasets linked to a graphical representation of gene function.
Abstract: Summary The need for novel methods of visualizing microarray data is growing. New perspectives are beneficial to finding patterns in expression data. The Bluejay genome browser provides an integrative way of visualizing gene expression datasets in a genomic context. We have now developed the functionality to display multiple microarray datasets simultaneously in Bluejay, in order to provide researchers with a comprehensive view of their datasets linked to a graphical representation of gene function. This will enable biologists to obtain valuable insights on expression patterns, by allowing them to analyze the expression values in relation to the gene locations as well as to compare expression profiles of related genomes or of different experiments for the same genome.

5 citations


Journal ArticleDOI
TL;DR: The versatile Cytoscape plugin DomainGraph is developed that allows for the visual analysis of protein domain interaction networks and their integration with exon expression data.
Abstract: Summary Proteins and their interactions are essential for the functioning of all organisms and for understanding biological processes. Alternative splicing is an important molecular mechanism for increasing protein diversity in eukaryotic cells. Splicing events that alter the protein structure and the domain composition can be responsible for the regulation of protein interactions and the functional diversity of different tissues. Discovering the occurrence of splicing events and studying protein isoforms have become feasible using Affymetrix Exon Arrays. Therefore, we have developed the versatile Cytoscape plugin DomainGraph that allows for the visual analysis of protein domain interaction networks and their integration with exon expression data. Protein domains affected by alternative splicing are highlighted and splicing patterns can be compared.

Journal ArticleDOI
TL;DR: A Bayes-Random Fields framework which is capable of integrating unlimited data sources for discovering the relevant network architecture of large-scale networks; the analytical and experimental results reveal the varied characteristics of different types of data and reflect their discriminative ability in identifying direct gene interactions.
Abstract: We present a Bayes-Random Fields framework which is capable of integrating unlimited data sources for discovering the relevant network architecture of large-scale networks. The random field potential function is designed to impose a cluster constraint, teamed with a full Bayesian approach for incorporating heterogeneous data sets. The probabilistic nature of our framework facilitates robust analysis in order to minimize the influence of noise inherent in the data on the inferred structure in a seamless and coherent manner. This is later demonstrated in its applications to both large-scale synthetic data sets and Saccharomyces cerevisiae data sets. The analytical and experimental results reveal the varied characteristics of different types of data and reflect their discriminative ability in identifying direct gene interactions.

Journal ArticleDOI
TL;DR: This work introduces a unifying notational framework for systems biology models and high-throughput data in order to allow new integrations on the systemic scale like the use of in silico predictions to support the mining of gene expression datasets.
Abstract: Summary The paradigm shift in biology that led first to high-throughput experimental techniques and later to computational systems biology must also be applied to the analysis of the relation between local models and data in order to obtain an effective prediction tool. In this work we introduce a unifying notational framework for systems biology models and high-throughput data, in order to allow new integrations on the systemic scale, such as the use of in silico predictions to support the mining of gene expression datasets. Using the framework, we propose two applications concerning the use of system-level models to support the differential analysis of microarray expression data. We tested the potential of the approach with a specific microarray experiment on the phosphate system in Saccharomyces cerevisiae and a computational model of the PHO pathway that supports the systems biology concepts.

Journal ArticleDOI
TL;DR: The GOblet web service is extended to integrate also pathway annotations, and the data analysis pipeline is extended and upgraded with improved summaries, and added term enrichment and clustering algorithms.
Abstract: Summary The functional annotation of genomic data has become a major task for the ever-growing number of sequencing projects. In order to address this challenge, we recently developed GOblet, a free web service for the annotation of anonymous sequences with Gene Ontology (GO) terms. However, to overcome limitations of the GO terminology, and to aid in understanding not only single components but also systemic interactions between the individual components, we have now extended the GOblet web service to integrate pathway annotations as well. Furthermore, we extended and upgraded the data analysis pipeline with improved summaries, and added term enrichment and clustering algorithms. Finally, we are now making GOblet available as a stand-alone application for high-throughput processing on local machines. The advantages of this frequently requested feature are that a) the user can avoid the restrictions of our web service on uploading and processing large amounts of data, and b) confidential data can be analysed without insecure transfer to a public web server. The stand-alone version of the web service has been implemented using platform-independent Tcl scripts, which can be run with just a single runtime file utilizing the Starkit technology. The GOblet web service and the stand-alone application are freely available at http://goblet.molgen.mpg.de.
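Term enrichment of the kind added to the GOblet pipeline is typically computed with a one-sided hypergeometric test; the abstract does not state which statistic GOblet uses, so the following is a generic sketch of that standard test:

```python
from math import comb

def enrichment_p(study_hits, study_size, pop_hits, pop_size):
    """One-sided hypergeometric p-value for a term: the probability of
    drawing at least study_hits annotated genes when study_size genes are
    sampled (without replacement) from a population of pop_size genes of
    which pop_hits carry the annotation."""
    denom = comb(pop_size, study_size)
    p = 0.0
    for k in range(study_hits, min(study_size, pop_hits) + 1):
        p += comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k) / denom
    return p
```

A small p-value for a GO term means the study set contains more genes with that annotation than expected by chance; in practice the p-values are corrected for testing many terms at once.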

Journal ArticleDOI
TL;DR: A theoretical basis for the evaluation of the efficiency of quarantine measures is developed in an SIR model with time delay, and the procedure can be readily generalized and applied to a more realistic social network to determine the proper closure measure in future epidemics.
Abstract: Summary A theoretical basis for the evaluation of the efficiency of quarantine measures is developed in an SIR model with time delay. In this model, the effectiveness in disease control of the closure of public places such as schools, modeled as high degree nodes in a social network, is evaluated by considering the effect of the time delay in the identification of the infected. In the context of the SIR model, the relation between the number of infectious individuals who are identified with time delay and then quarantined and those who are not identified and continue spreading the virus is investigated numerically. The social network for the simulation is modeled by a scale-free network. Closure measures are applied to those infected nodes with high degrees. The effectiveness of the measure can be controlled by the preset value of the critical degree KC: only those nodes with degree higher than KC will be quarantined. The cost CQ incurred by the closure measure is assumed to be proportional to the total links rendered inactive as a result of the measure, and generally decreases with KC, while the medical cost CM incurred by virus spreading increases with KC. The total social cost (CM + CQ) has a minimum at a critical KC, which depends on the ratio of the medical cost coefficient M to the closure cost coefficient Q. Our simulation results demonstrate a mathematical procedure to evaluate the efficiency of quarantine measures. Although the numerical work is based on a scale-free network, the procedure can be readily generalized and applied to a more realistic social network to determine the proper closure measure in future epidemics.
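The cost trade-off over the critical degree KC can be sketched as below. This is a heavily simplified stand-in: the medical cost is proxied by the largest connected component the infection could still percolate through, not by the paper's delayed-SIR dynamics, and the cost coefficients are arbitrary:

```python
import random
from collections import deque

def scale_free_graph(n, m=2, seed=7):
    """Small Barabasi-Albert-style scale-free graph: each new node attaches
    to m existing nodes chosen preferentially by degree."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    pool = list(range(m))  # degree-weighted pool of node ids
    for v in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
            pool.extend((t, v))
    return adj

def largest_component(adj, removed):
    """Size of the largest connected component after deleting 'removed'."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def costs(adj, kc, q_coeff=0.2, m_coeff=1.0):
    """Quarantine every node with degree > kc. Closure cost C_Q is
    proportional to the links rendered inactive; medical cost C_M is
    proxied by the largest component left for the virus to spread in."""
    removed = {v for v, nb in adj.items() if len(nb) > kc}
    inactive = sum(1 for v in adj for u in adj[v]
                   if v < u and (v in removed or u in removed))
    return q_coeff * inactive, m_coeff * largest_component(adj, removed)

def best_kc(adj, kmax=20):
    """Critical degree minimizing the total social cost C_M + C_Q."""
    return min(range(kmax + 1), key=lambda k: sum(costs(adj, k)))
```

As in the abstract, C_Q falls and the proxy C_M rises as KC grows, so their sum has an interior minimum that shifts with the ratio of the two coefficients.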

Journal ArticleDOI
TL;DR: This work proposes an algorithm by iteratively picking out pairs of gene expression patterns which have the largest dissimilarities which can be used as preprocessing to initialize centers for clustering methods, like K-means.
Abstract: Traditional analysis of gene expression profiles uses clustering to find groups of co-expressed genes with similar expression patterns. However, clustering is time-consuming and can be difficult for very large datasets. We propose the idea of Discovering Distinct Patterns (DDP) in gene expression profiles. Since the patterns shown by gene expression profiles reveal regulatory mechanisms, it is significant to find all the distinct patterns existing in a dataset when there is little prior knowledge; it is also a helpful start before undertaking further analysis. We propose an algorithm for DDP that iteratively picks out the pairs of gene expression patterns with the largest dissimilarities. The method can also be used as a preprocessing step to initialize centers for clustering methods such as K-means. Experiments on both synthetic and real gene expression datasets show that our method is effective in finding distinct patterns of functional significance, and is also efficient.
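The iterative selection of maximally dissimilar patterns can be sketched as a greedy farthest-point procedure. This is a generic sketch consistent with the description above, not the authors' exact algorithm; the Euclidean distance is an assumed default:

```python
def discover_distinct_patterns(profiles, k, dist=None):
    """Greedy DDP-style selection: start from the pair of expression
    profiles with the largest dissimilarity, then repeatedly add the
    profile farthest (in minimum distance) from those already chosen.
    The result can seed K-means centers."""
    if dist is None:
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(profiles)
    # seed with the most dissimilar pair
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist(profiles[p[0]], profiles[p[1]]))
    chosen = [i0, j0]
    while len(chosen) < k:
        cand = max((i for i in range(n) if i not in chosen),
                   key=lambda i: min(dist(profiles[i], profiles[c])
                                     for c in chosen))
        chosen.append(cand)
    return [profiles[i] for i in chosen]
```

Seeding K-means with such mutually distant profiles avoids the common failure mode of random initialization, where two initial centers land in the same expression pattern.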

Journal ArticleDOI
TL;DR: A new look at the transcription start is presented in which the authors see transcription factors binding to both sides of the TSS as an essential requirement, and suggest that mutations close to the TSS on the coding side can be fatal even if they preserve the codon table.
Abstract: Summary A new look at the transcription start is presented in which we see transcription factors binding to both sides of the TSS as an essential requirement. Naturally, the factor binding to the downstream region must be removed so that the transcription process can continue. The presence of a number of distinct transcription factors can also be used to explain the selective activation of various genes. The transcription start site by itself plays only a minor role in the whole process. We also suggest that mutations close to the TSS on the coding side can be fatal even if they preserve the codon table.

Journal ArticleDOI
TL;DR: An automated data conversion method for biomolecular simulations between molecular dynamics and quantum mechanics/molecular mechanics models is presented and developed around an XML data representation called BioSimML (Biomolecular Simulation Markup Language).
Abstract: Biomolecular modelling has provided computational simulation based methods for investigating biological processes from the quantum chemical to the cellular level. Modelling such microscopic processes requires an atomic description of a biological system and proceeds in fine timesteps. Consequently, the simulations are extremely computationally demanding. To tackle this limitation, different biomolecular models have to be integrated in order to achieve high-performance simulations. The integration of diverse biomolecular models requires converting molecular data between the data representations of the different models. This data conversion is often non-trivial, requires extensive human input and is inevitably error prone. In this paper we present an automated data conversion method for biomolecular simulations between molecular dynamics and quantum mechanics/molecular mechanics models. Our approach is developed around an XML data representation called BioSimML (Biomolecular Simulation Markup Language). BioSimML provides a domain-specific data representation for biomolecular modelling which can efficiently support data interoperability between different biomolecular simulation models and data formats.
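The hub-and-spoke pattern behind such an intermediate XML representation can be sketched as a round trip through a neutral format. The element and attribute names below are illustrative guesses, not the actual BioSimML schema:

```python
import xml.etree.ElementTree as ET

def atoms_to_xml(atoms):
    """Serialize a list of atoms into a BioSimML-style XML document
    (element names 'biosim'/'molecule'/'atom' are hypothetical)."""
    root = ET.Element("biosim")
    mol = ET.SubElement(root, "molecule")
    for a in atoms:
        ET.SubElement(mol, "atom", element=a["element"],
                      x=str(a["x"]), y=str(a["y"]), z=str(a["z"]))
    return ET.tostring(root, encoding="unicode")

def xml_to_atoms(xml_text):
    """Parse the intermediate representation back into a neutral list of
    dicts, from which a model-specific writer (e.g. for an MD or QM/MM
    package) could emit its native input format."""
    root = ET.fromstring(xml_text)
    atoms = []
    for a in root.iter("atom"):
        atoms.append({"element": a.get("element"),
                      "x": float(a.get("x")),
                      "y": float(a.get("y")),
                      "z": float(a.get("z"))})
    return atoms
```

With a shared intermediate, converting between N simulation formats needs N readers and N writers instead of N^2 pairwise converters, which is the interoperability argument the abstract makes.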

Journal ArticleDOI
TL;DR: Three different pattern classification methods are compared on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins, and from extensive cross validation and feature selection based studies quantify the average limits and uncertainties with which such predictions may be made.
Abstract: With a large amount of information relating to proteins accumulating in databases widely available online, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can help in achieving a reduction in the space over which experiment designers need to search in order to improve our understanding of the biochemical properties. Previously it has been suggested that an integration of features computable by comparing a pair of proteins can be achieved by an artificial neural network, hence predicting the degree to which they may be evolutionarily related and homologous. We compiled two datasets of pairs of proteins, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features; for the problem of separating remote homologous from analogous pairs, we note that a significant performance gain was obtained by the inclusion of sequence and structure information. We find that the use of a linear classifier was enough to discriminate a protein pair at the family level. However, at the superfamily level, detecting remote homologous pairs was a relatively harder problem, and we find that the use of nonlinear classifiers achieves significantly higher accuracies. In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins, and from extensive cross validation and feature selection based studies quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins.
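The linear-versus-nonlinear distinction drawn above can be illustrated on a toy problem. This generic sketch uses a perceptron and a hand-crafted product feature as stand-ins; it is not one of the paper's three classifiers or its seven protein-pair features:

```python
def perceptron(X, y, epochs=200):
    """Train a perceptron on labels in {+1, -1}; returns weights
    (last entry is the bias)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        clean = True
        for x, t in zip(X, y):
            xb = list(x) + [1.0]  # append bias input
            pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            if pred != t:
                w = [wi + t * xi for wi, xi in zip(w, xb)]
                clean = False
        if clean:
            break
    return w

def accuracy(w, X, y):
    ok = 0
    for x, t in zip(X, y):
        xb = list(x) + [1.0]
        ok += (1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1) == t
    return ok / len(y)

# XOR-like data: no linear classifier can separate it in the raw features
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
acc_linear = accuracy(perceptron(X, y), X, y)

# adding one nonlinear feature (the product x1*x2) makes it separable
X_nl = [(a, b, a * b) for a, b in X]
acc_nonlinear = accuracy(perceptron(X_nl, y), X_nl, y)
```

The same mechanism explains the paper's finding: family-level pairs are separable with a linear boundary in the feature space, while remote (superfamily-level) homology needs a decision surface a linear model cannot express.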

Journal ArticleDOI
TL;DR: An implementation of the program within a parallel framework to investigate population genetic structure with multi-locus genotyping data, using an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations is described.
Abstract: Structure is a widely used software tool to investigate population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this program is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the program. The number of processors to be used can be specified in the background command. A web-based visualization tool, "Visualstruct", written in PHP (with HTML and JavaScript embedded), allows for the graphical display of the population clusters output by Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic subpopulations. The advantage over available programs is the increased number of individuals that can be visualized. The analyses of real datasets indicate a speedup of up to four when comparing execution on clusters of eight processors with execution on one desktop. The software package is freely available to interested users upon request.
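The parallelization strategy, treating every (K, replicate) combination as an independent job, can be sketched as follows. The `run_one` callable is a hypothetical stand-in for invoking the actual Structure binary (e.g. via subprocess); a thread pool suffices here because the real work would happen in external processes:

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

def run_structure_grid(run_one, k_values, n_replicates, max_workers=8):
    """Fan out independent runs: every (K, replicate) pair is a separate
    job, which is why replicate-level parallelism gives a near-linear
    speedup. Returns {(K, replicate): result}."""
    jobs = list(itertools.product(k_values, range(n_replicates)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda job: (job, run_one(*job)), jobs)
        return dict(results)
```

Because the runs share no state, the achievable speedup is bounded only by the number of (K, replicate) jobs and the number of processors, consistent with the roughly fourfold speedup reported on eight processors.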
