
Showing papers in "Bioinformatics in 1995"


Journal ArticleDOI
TL;DR: Improvements to the self-optimized prediction method, obtained by predicting all the sequences of a set of aligned proteins belonging to the same family, are reported; the change raises the success rate of protein secondary structure prediction.
Abstract: Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts 82.2% of residues for 74% of co-predicted amino acids. Predictions are available by email to deleage@ibcp.fr or on a Web page (http://www.ibcp.fr/predict.html).

1,468 citations


Journal ArticleDOI
TL;DR: CAIC is an application for the Apple Macintosh which allows the valid analysis of comparative (multi-species) data sets that include continuous variables; the resulting independent comparisons can be analysed validly in standard statistical packages to test hypotheses about correlated evolution among traits.
Abstract: CAIC is an application for the Apple Macintosh which allows the valid analysis of comparative (multi-species) data sets that include continuous variables. Comparison among species is the most common technique for testing hypotheses of how organisms are adapted to their environments, but standard statistical tests like regression should not be used with species data. Such tests assume independence of data points, but related species often share traits by common descent rather than through independent adaptation. CAIC uses a phylogeny of the species in the data set to partition the variance among species into independent comparisons (technically, linear contrasts), each comparison being made at a different node in the phylogeny. There are two partitioning procedures--one used when all variables are continuous, the other when one variable is discrete. The resulting comparisons can be analysed validly in standard statistical packages to test hypotheses about correlated evolution among traits, to estimate parameters such as allometric exponents, and to compare rates of evolution. Previous versions of the package have already been used widely; this version is simpler to use and works on a wider range of machines. The package and manual are freely available by anonymous ftp or from the authors.
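The core operation, a standardized linear contrast at a node, can be sketched for the minimal two-species case as follows (following Felsenstein's standardization by branch lengths; the function name and example values are illustrative, not CAIC's own code):

```python
import math

def independent_contrast(x1, x2, v1, v2):
    """Standardized contrast between two sister taxa.

    x1, x2 are trait values at the tips; v1, v2 are branch lengths
    (expected variances of change). Dividing the raw difference by its
    standard deviation makes contrasts from different nodes comparable.
    """
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Trait value and extra branch length assigned to the ancestral node,
    # used when computing contrasts deeper in the tree.
    x_anc = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    v_extra = (v1 * v2) / (v1 + v2)
    return contrast, x_anc, v_extra

# Two sister species, trait values 4.0 and 1.0, branch lengths 1.0 and 2.0
c, x_anc, v_extra = independent_contrast(4.0, 1.0, 1.0, 2.0)
```

Applying this recursively at every node partitions the among-species variance into independent comparisons, which is what makes standard regression on the contrasts valid.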

1,177 citations


Journal ArticleDOI
TL;DR: DnaSP, DNA sequence polymorphism, is an interactive computer program for the analysis of DNA polymorphism from nucleotide sequence data that calculates several measures of DNA sequence variation within and between populations, linkage disequilibrium parameters and Tajima's D statistic.
Abstract: DnaSP, DNA sequence polymorphism, is an interactive computer program for the analysis of DNA polymorphism from nucleotide sequence data. The program, addressed to molecular population geneticists, calculates several measures of DNA sequence variation within and between populations, linkage disequilibrium parameters and Tajima's D statistic. The program, which is written in Visual Basic v. 3.0 and runs on an IBM-compatible PC under Windows, can handle a large number of sequences of up to thousands of nucleotides each.
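Tajima's D, one of the statistics DnaSP reports, can be computed directly from aligned sequences; a minimal sketch (not DnaSP's code; assumes aligned sequences with no gaps or missing data):

```python
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D from a list of equal-length aligned sequences."""
    n = len(seqs)
    L = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    S = sum(1 for i in range(L) if len({s[i] for s in seqs}) > 1)
    if S == 0:
        return 0.0
    # pi: mean number of pairwise nucleotide differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    # Tajima's (1989) variance coefficients
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / ((e1 * S + e2 * S * (S - 1)) ** 0.5)
```

The statistic contrasts two estimators of the population mutation parameter (pairwise diversity versus segregating sites), so a monomorphic sample returns 0.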

248 citations


Journal ArticleDOI
TL;DR: The paper describes the Macintosh program MitoProt, which is suitable for studying mitochondrion-related proteins, and supplies a series of parameters that permit theoretical evaluation of mitochondrial targeting sequences, as well as calculation of the most hydrophobic fragment of 17 residues in the sequence, and a new parameter called mesohydrophobicity.
Abstract: The paper describes the Macintosh program MitoProt, which is suitable for studying mitochondrion-related proteins. MitoProt supplies a series of parameters that permit theoretical evaluation of mitochondrial targeting sequences, as well as calculation of the most hydrophobic fragment of 17 residues in the sequence, and a new parameter called mesohydrophobicity. The last two calculations are important for predicting the putative importability of a protein into mitochondria. Taken together, targeting sequence and hydrophobicity characteristics enable one to predict whether a given protein could be mitochondrial when no previous information on the nature of the sequence is available.
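The "most hydrophobic fragment of 17 residues" is a sliding-window maximum over a hydropathy scale; a sketch using the standard Kyte-Doolittle values (the choice of scale is an assumption here, since the abstract does not state which scale MitoProt uses):

```python
# Kyte-Doolittle hydropathy values (standard published scale)
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def most_hydrophobic_window(seq, w=17):
    """Return (start index, mean hydropathy) of the most hydrophobic
    w-residue stretch, found with a rolling window sum."""
    scores = [KD[aa] for aa in seq]
    best_start, best = 0, sum(scores[:w])
    window = best
    for i in range(1, len(seq) - w + 1):
        window += scores[i + w - 1] - scores[i - 1]
        if window > best:
            best, best_start = window, i
    return best_start, best / w
```

A high value over 17 residues (roughly one membrane-spanning helix) argues against import into mitochondria, which is why the abstract pairs it with the targeting-sequence parameters.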

188 citations


Journal ArticleDOI
TL;DR: Miropeats enhances the utility of conventional DNA sequence comparisons when looking at long lengths of sequence similarity by summarizing large-scale sequence similarities on a single page of PostScript graphics.
Abstract: Miropeats displays DNA sequence similarity information graphically. The program discovers regions of similarity amongst any set of DNA sequences and then draws a graphic that summarizes the length, location and relative orientations of any repeated sequences. Sequence similarity searching is a very general tool that forms the basis of many different biological sequence analyses but it is limited by the verbosity of traditional alignment presentation styles. Miropeats enhances the utility of conventional DNA sequence comparisons when looking at long lengths of sequence similarity by summarizing large-scale sequence similarities on a single page of PostScript graphics. Miropeats has been applied extensively to help understand shotgun assembly projects, to check cosmid overlaps and to perform inter-genomic comparisons.

150 citations


Journal ArticleDOI
TL;DR: Performance is proved to improve remarkably when using a tree-based iterative method, which refines the alignment whenever two subalignments are merged during the tree traversal.
Abstract: Multiple sequence alignment is an important problem in the biosciences. To date, most multiple alignment systems have employed a tree-based algorithm, which combines the results of two-way dynamic programming in a tree-like order of sequence similarity. The alignment quality is not, however, high enough when the sequence similarity is low. Once an error occurs in the alignment process, that error can never be corrected. Recently, an effective new class of algorithms has been developed. These algorithms iteratively apply dynamic programming to partially aligned sequences to improve their alignment quality. The iteration corrects any errors that may have occurred in the alignment process. Such an iterative strategy requires heuristic search methods to solve practical alignment problems. Incorporating such methods yields various iterative algorithms. This paper reports our comprehensive comparison of iterative algorithms. We proved that performance improves remarkably when using a tree-based iterative method, which iteratively refines an alignment whenever two subalignments are merged in a tree-based way. We propose a tree-dependent, restricted partitioning technique to efficiently reduce the execution time of iterative algorithms.

131 citations


Journal ArticleDOI
TL;DR: The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed, accompanied by MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database.
Abstract: The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the 'C' language, and the program is available for unix platforms.
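Weight-matrix scanning of the kind MATRIX SEARCH performs can be sketched as follows (a simplified version; the matrix representation and threshold handling here are illustrative, not the IMD format):

```python
def scan_pwm(seq, pwm, threshold):
    """Slide a position weight matrix along a DNA sequence and report
    every window scoring at least `threshold`.

    pwm is a list of dicts, one per matrix position, mapping each base
    to a weight (e.g. a log-odds score from a matrix database).
    """
    w = len(pwm)
    hits = []
    for i in range(len(seq) - w + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(w))
        if score >= threshold:
            hits.append((i, score))
    return hits

# Toy 3-position matrix for the consensus "TAT": 1 for the consensus
# base at each position, 0 otherwise (purely illustrative weights)
pwm = [{'A': 0, 'C': 0, 'G': 0, 'T': 1},
       {'A': 1, 'C': 0, 'G': 0, 'T': 0},
       {'A': 0, 'C': 0, 'G': 0, 'T': 1}]
```

Real matrix scores are usually log-odds against a background model, but the scan itself is exactly this windowed sum.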

122 citations


Journal ArticleDOI
TL;DR: It is argued that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability, and a new result is described for the Longest Common Subsequence problem.
Abstract: Many computational problems in biology involve parameters for which a small range of values cover important applications. We argue that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability. At issue in the theory of parameterized complexity is whether a problem can be solved in time O(n alpha) for each fixed parameter value, where alpha is a constant independent of the parameter. In addition to surveying this complexity framework, we describe a new result for the Longest Common Subsequence problem. In particular, we show that the problem is hard for W[t] for all t when parameterized by the number of strings and the size of the alphabet. Lower bounds on the complexity of this basic combinatorial problem imply lower bounds on more general sequence alignment and consensus discovery problems. We also describe a number of open problems pertaining to the parameterized complexity of problems in computational biology where small parameter values are important.
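For contrast with the hardness results, the classical two-string case of Longest Common Subsequence is solvable by a simple O(n^2) dynamic program; the intractability discussed above concerns the generalization to many strings (this sketch is the textbook algorithm, not code from the paper):

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of two strings.

    dp[i][j] holds the LCS length of the prefixes s[:i] and t[:j];
    each cell extends a match or carries over the best shorter prefix.
    """
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The k-string generalization multiplies one table dimension per sequence, which is exactly where the parameterized hardness (W[t]-hardness in the number of strings and alphabet size) bites.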

120 citations


Journal ArticleDOI
TL;DR: The present article proposes two new graphical representations as examples of such methods: the random walk plot is designed to show the base composition in a compact form, whereas the gap plot visualizes positional correlations.
Abstract: Genomic sequence analysis is usually performed with the help of specialized software packages written for molecular biologists. The scope of such pre-programmed techniques is quite limited. Because DNA sequences contain a large amount of information, analysis of such sequences without underlying assumptions may provide additional insights. The present article proposes two new graphical representations as examples of such methods. The random walk plot is designed to show the base composition in a compact form, whereas the gap plot visualizes positional correlations. The random walk plot represents the DNA sequence as a curve, a random walk, in a plane. The four possible moves, left/right and up/down, are used to encode the four possible bases. Gap plots provide a tool to exhibit various features in a sequence. They visualize the periodic patterns within a sequence, both with regard to a single type of base or between two types of bases.
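The random walk encoding can be sketched in a few lines (the assignment of bases to directions is a free choice; the mapping below is illustrative, not necessarily the one used in the article):

```python
def dna_random_walk(seq):
    """Encode a DNA sequence as a walk in the plane.

    Each base moves the current point one step left, right, up or down;
    the returned path (including the origin) is what gets plotted.
    """
    moves = {'A': (-1, 0), 'T': (1, 0), 'G': (0, 1), 'C': (0, -1)}
    x, y = 0, 0
    path = [(0, 0)]
    for base in seq:
        dx, dy = moves[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

Compositional bias shows up as drift: an AT-balanced, GC-balanced sequence meanders near the origin, while skewed composition pulls the curve steadily in one direction.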

114 citations


Journal ArticleDOI
TL;DR: The software packages NUPARM and NUCGEN are described, which can be used to understand sequence-directed structural variations in nucleic acids by analysis and generation of non-uniform structures.
Abstract: The software packages NUPARM and NUCGEN are described, which can be used to understand sequence-directed structural variations in nucleic acids by analysis and generation of non-uniform structures. A set of local inter-basepair parameters (viz. tilt, roll, twist, shift, slide and rise) has been defined, which uses the geometry and coordinates of two successive basepairs only and can be used to generate polymeric structures with varying geometries for each of the 16 possible dinucleotide steps. Intra-basepair parameters, propeller, buckle, opening and the C6...C8 distance can also be varied, if required, while the sugar-phosphate backbone atoms are fixed in some standard conformation in each of the nucleotides. NUPARM can be used to analyse both DNA and RNA structures, with single- as well as double-stranded helices. The NUCGEN software generates double helical models with the backbone fixed in B-form DNA, but with appropriate modifications in the input data it can also generate A-form DNA and RNA duplex structures.

102 citations


Journal ArticleDOI
TL;DR: WIMOVAC (Windows Intuitive Model of Vegetation response to Atmosphere and Climate Change) is designed to facilitate the modelling of various aspects of plant photosynthesis with particular emphasis on the effects of global climate change.
Abstract: The ability to predict net carbon exchange and production of vegetation in response to predicted atmospheric and climate change is critical to assessing the potential impacts of these changes. Mathematical models provide an important tool in the study of whole plant, canopy and ecosystem responses to global environmental change. Because this requires prediction beyond experience, mechanistic rather than empirical models are needed. The uniformity and strong understanding of the photosynthetic process, which is the primary point of response of plant production to global atmospheric change, provides a basis for such an approach. Existing modelling systems have been developed primarily for expert modellers and have not been easily accessible to experimentalists, managers and students. Here we describe a modular modelling system operating within Windows to provide this access. WIMOVAC (Windows Intuitive Model of Vegetation response to Atmosphere and Climate Change) is designed to facilitate the modelling of various aspects of plant photosynthesis with particular emphasis on the effects of global climate change. WIMOVAC has been designed to run on IBM PC-compatible computers running Microsoft Windows. The package allows the sophisticated control of the simulation processes for photosynthesis through a standardized Windows user interface and provides automatically formatted results as either tabulated data or as a range of customizable graphs. WIMOVAC has been written in Microsoft Visual Basic, to facilitate the rapid development of user-friendly modules within the familiar Windows framework, while allowing a structured development. The highly interactive nature of controls adopted by WIMOVAC makes it suitable for research, management and educational purposes.

Journal ArticleDOI
TL;DR: This paper proposes an iterative multiple sequence alignment method which optimizes a weighted sum-of-pairs score, in which the weights given to individual sequence pairs are adjusted to compensate for the biased contributions.
Abstract: Most multiple sequence alignment programs explicitly or implicitly try to optimize some score associated with the resulting alignment. Although the sum-of-pairs score is currently most widely used, it is inappropriate when the phylogenetic relationships among the sequences to be aligned are not evenly distributed, since the contributions of densely populated groups dominate those of minor members. This paper proposes an iterative multiple sequence alignment method which optimizes a weighted sum-of-pairs score, in which the weights given to individual sequence pairs are adjusted to compensate for the biased contributions. A simple method that rapidly calculates such a set of weights for a given phylogenetic tree is presented. The multiple sequence alignment is refined through partitioning and realignment restricted to the edges of the tree. Under this restriction, profile-based fast and rigorous group-to-group alignment is achieved at each iteration, rendering the overall computational cost virtually identical to that using an unweighted score. Consistency of nearly 90% was attained between structural and sequence alignments of multiple divergent globins, confirming the effectiveness of this strategy in improving the quality of multiple sequence alignment.
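The weighted sum-of-pairs score being optimized can be sketched as a direct, unoptimized evaluation (the linear gap penalty and the pair-weight dictionary format below are simplifying assumptions, not the paper's exact scheme):

```python
def weighted_sp_score(alignment, weights, sub, gap=-4):
    """Weighted sum-of-pairs score of a multiple alignment.

    alignment: equal-length aligned sequences ('-' marks a gap);
    weights[(i, j)]: weight for sequence pair (i, j), compensating for
    phylogenetically biased sampling; sub: substitution score function.
    Columns where both sequences have a gap contribute nothing.
    """
    total = 0.0
    n = len(alignment)
    for i in range(n):
        for j in range(i + 1, n):
            w = weights[(i, j)]
            for a, b in zip(alignment[i], alignment[j]):
                if a == '-' and b == '-':
                    continue
                total += w * (gap if '-' in (a, b) else sub(a, b))
    return total
```

Down-weighting pairs drawn from the same dense subtree is what keeps a heavily sampled family from dominating the objective.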

Journal ArticleDOI
TL;DR: The notion of 'regulatory potential' is introduced for the degree to which any region of a sequence is similar to a real eukaryotic promoter; the method allows much better recognition accuracy than does the approach based on detection of the TATA box.
Abstract: A method for identification of eukaryotic promoters by localization of binding sites for transcription factors has been suggested. The binding sites for a range of transcription factors have been found to be distributed unevenly. Based on these distributions, we have constructed a weight matrix of binding site localization. On the basis of the weight matrix we have, in turn, designed an algorithm for promoter recognition. To increase the accuracy of the method, we have developed a routine that breaks any promoter sample into subsamples. The method to be reported on allows much better recognition accuracy than does the approach based on detection of the TATA box. In particular, the overprediction error is three times lower following our method. The program FunSiteP recognizes promoters from newly uncovered sequences and tentatively identifies the functional class the promoters must belong to. We have introduced the notion of 'regulatory potential' for the degree to which any region of the sequences is similar to the real eukaryotic promoter. By making use of the potential, we have revealed putative transcription start sites and extended regions of transcription regulation.

Journal ArticleDOI
TL;DR: These algorithms are useful for local deformations of linear molecules, exact ring closure in cyclic molecules and molecular embedding for short chains, and possible applications include structure prediction, protein folding, conformation energy analysis and 3D molecular matching and docking.
Abstract: We present algorithms for 3-D manipulation and conformational analysis of molecular chains, when bond lengths, bond angles and related dihedral angles remain fixed. These algorithms are useful for local deformations of linear molecules, exact ring closure in cyclic molecules and molecular embedding for short chains. Other possible applications include structure prediction, protein folding, conformation energy analysis and 3D molecular matching and docking. The algorithms are applicable to all serial molecular chains and make no assumptions about their geometry. We make use of results on direct and inverse kinematics from robotics and mechanics literature and show the correspondence between kinematics and conformational analysis of molecules. In particular, we pose these problems algebraically and compute all the solutions making use of the structure of these equations and matrix computations. The algorithms have been implemented and perform well in practice. In particular, they take tens of milliseconds on current workstations for local deformations and chain closures on molecular chains consisting of six or fewer rotatable dihedral angles.

Journal ArticleDOI
TL;DR: A practical program, called sim2, for building local alignments of two sequences, each of which may be hundreds of kilobases long, which facilitates contig-building by providing a complete view of the related sequences, so differences can be analyzed and inconsistencies resolved.
Abstract: This paper presents a practical program, called sim2, for building local alignments of two sequences, each of which may be hundreds of kilobases long. sim2 first constructs n best non-intersecting chains of 'fragments', such as all occurrences of identical 5-tuples in each of two DNA sequences, for any specified n > or = 1. Each chain is then refined by delivering an optimal alignment in a region delimited by the chain. sim2 requires only space proportional to the size of the input sequences and the output alignments, and the same source code runs on Unix machines, on Macintoshes, on PCs, and on DEC Alpha PCs. We also describe an application of sim2 for aligning long DNA sequences from Escherichia coli. sim2 facilitates contig-building by providing a complete view of the related sequences, so differences can be analyzed and inconsistencies resolved. Examples are shown using the alignment display and editing functions from the software tool ChromoScope.
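The fragment-chaining step can be sketched as a small dynamic program (a simplified O(n^2) version over exact-match fragments; sim2 itself uses more space-efficient machinery to handle hundreds of kilobases):

```python
def best_chain(fragments):
    """Highest-scoring chain of non-intersecting fragments.

    Each fragment is (x, y, length): an exact match starting at x in
    sequence 1 and y in sequence 2. A fragment may follow another only
    if it starts at or after where the other ends in both sequences.
    Score of a chain = total matched length.
    """
    frags = sorted(fragments)
    n = len(frags)
    score = [f[2] for f in frags]   # best chain score ending at frag i
    prev = [-1] * n
    for i, (xi, yi, li) in enumerate(frags):
        for j in range(i):
            xj, yj, lj = frags[j]
            if xj + lj <= xi and yj + lj <= yi and score[j] + li > score[i]:
                score[i], prev[i] = score[j] + li, j
    # trace back the best-scoring chain
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(frags[i])
        i = prev[i]
    return chain[::-1], max(score)
```

Each chain then delimits a band of the alignment matrix, and the expensive optimal alignment is computed only inside that band.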

Journal ArticleDOI
TL;DR: A new technique developed in Computer Vision and Robotics for the efficient recognition of partially occluded articulated objects is adapted, based on an extension and generalization of the Geometric Hashing and Generalized Hough Transform paradigm for rigid object recognition.
Abstract: The generation of binding modes between two molecules, also known as molecular docking, is a key problem in rational drug design and biomolecular recognition. Docking a ligand, e.g. a drug molecule or a protein molecule, to a protein receptor involves recognition of molecular surfaces, as molecules interact at their surface. Recent studies report that the activity of many molecules induces conformational transitions by 'hinge-bending', which involves movements of relatively rigid parts with respect to each other. In ligand-receptor binding, relative rotational movements of molecular substructures about their common hinges have been observed. For automatically predicting flexible molecular interactions, we adapt a new technique developed in Computer Vision and Robotics for the efficient recognition of partially occluded articulated objects. Objects of this type consist of rigid parts which are connected by rotary joints (hinges). Our approach is based on an extension and generalization of the Geometric Hashing and Generalized Hough Transform paradigms for rigid object recognition. Unlike other techniques, which match each part individually, our approach fully and efficiently exploits the fact that the different rigid parts belong to the same flexible molecule. We show experimental results obtained by an implementation of the algorithm for rigid and flexible docking. While the 'correct', crystal-bound complex is obtained with a small RMSD, additional, predictive 'high-scoring' binding modes are generated as well. The diverse applications and implications of this general, powerful tool are discussed.

Journal ArticleDOI
TL;DR: MIST is a very user-friendly, flexible and yet powerful program, with the mathematical details regarding models, simulations and calculations hidden from the user, which makes it suitable for scientists and students with limited computer experience.
Abstract: The Metabolic Interactive Simulation Tool, MIST, is a software package, running under Microsoft Windows 3.1, which can be used for dynamic simulations, stoichiometric calculations and control analysis of metabolic pathways. The pathways can be of any complexity and are defined by the user in a simple, interactive way. The user-defined enzymatic rate equations can be compiled either by an external or an internal compiler. Simulations of pathways compiled by an external compiler run significantly faster, but since these compilers are commercial software, they are not distributed together with MIST. The simulations are performed by numerical integration of a set of ordinary differential equations. The integration can be done by either an explicit fourth-order Runge-Kutta algorithm or a semi-implicit third-order Runge-Kutta algorithm, both with adjustable step size. The second algorithm can be used if the set of differential equations is stiff. Vector-based drawing facilities are included in the program, with which results can be presented in graphs. Results of simulations, including graphics, can be stored in files. MIST is a very user-friendly, flexible and yet powerful program, with the mathematical details regarding models, simulations and calculations hidden from the user. This makes it suitable for scientists and students with limited computer experience.
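The explicit fourth-order Runge-Kutta integrator MIST offers can be sketched as follows (the toy pathway, rate constant and step size are illustrative assumptions, not taken from MIST):

```python
def rk4_step(f, t, y, h):
    """One explicit fourth-order Runge-Kutta step for y' = f(t, y),
    where y is a list of metabolite concentrations."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

# Toy pathway: irreversible mass-action step S -> P with rate 0.5 * [S]
rate = lambda t, y: [-0.5 * y[0], 0.5 * y[0]]
y = [1.0, 0.0]                      # initial concentrations [S, P]
for step in range(10):              # integrate to t = 1 with h = 0.1
    y = rk4_step(rate, step * 0.1, y, 0.1)
```

For this linear system the exact solution is S(t) = exp(-0.5 t), and the fixed-step RK4 result agrees to about seven decimal places; stiff systems are where the semi-implicit alternative mentioned in the abstract becomes necessary.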

Journal ArticleDOI
TL;DR: The Lrp motif is identified, based on 23 gene sequences, which is similar to a previously identified motif based on a smaller data set, and to a consensus sequence of experimentally defined binding sites.
Abstract: We describe a relatively simple method for the identification of common motifs in DNA sequences that are known to share a common function. The input sequences are unaligned and there is no information regarding the position or orientation of the motif. Often such data exists for protein-binding regions, where genetic or molecular information that defines the binding region is available, but the specific recognition site within it is unknown. The method is based on the principle of 'divide and conquer': we first search for dominant submotifs and then build full-length motifs around them. This method has several useful features: (i) it screens all submotifs so that the results are independent of the sequence order in the data; (ii) it allows the submotifs to contain spacers; (iii) it identifies an existing motif even if the data contains 'noise'; (iv) its running time depends linearly on the total length of the input. The method is demonstrated on two groups of protein-binding sequences: a well-studied group of known CRP-binding sequences, and a relatively newly identified group of genes known to be regulated by Lrp. The Lrp motif that we identify, based on 23 gene sequences, is similar to a previously identified motif based on a smaller data set, and to a consensus sequence of experimentally defined binding sites. Individual Lrp sites are evaluated and compared in regard to their regulation mode.

Journal ArticleDOI
TL;DR: Three new programs (ICAtools) to cluster genomic DNA fragments, large EST projects, and entire DNA databases are discussed: ICAass, N2tool, and ICAmatches.
Abstract: DNA sequence clustering is an effective aid to the comprehension, summarization and compression of DNA sequence databases. Previous work created programs suitable for the comparison and clustering of cDNA sequences, but new enhanced programs have been written to cluster genomic DNA fragments, large EST projects, and entire DNA databases. Three new programs (ICAtools) are discussed: ICAass, N2tool, and ICAmatches. ICAass has been used to compress the EMBL database by hiding or removing sequences with various degrees of redundancy. It also has the fastest database querying mode. N2tool provides fast and sensitive clustering of genomic fragment databases on the basis of small areas of local similarity. N2tool has proven utility in the discovery of contaminating vector or other artefactual sequence when the potential contaminant is not otherwise known. ICAmatches is a new cluster analysis program that uses a novel alignment style to present multiple alignment summaries. All the tools are convenient to use because they share a common memory-frugal index format and accept most DNA sequence formats directly.

Journal ArticleDOI
TL;DR: A multivariate analysis method called co-inertia analysis was used to determine the main relationships between two data tables having identical rows, which showed that heavy, aromatic amino-acids tend to be avoided, except when they are needed for structural or functional reasons.
Abstract: A multivariate analysis method called co-inertia analysis was used to determine the main relationships between two data tables having identical rows. This method is available in the ADE multivariate analysis package for Macintosh micro-computers. It was applied to two data sets, one containing the amino-acid composition of 999 E. coli proteins, and the other the values of 402 physico-chemical properties for the 20 natural amino-acids. There were strong relationships between amino-acid physico-chemical properties and the composition of proteins. The first common factor was hydrophobicity; it is linked to the biological environment of proteins, either in the cytoplasm (or outside the cell), or in the nonpolar environment of the phospholipid bilayer of biological membranes. The second factor linked the expressivity of protein genes and the propensity of amino-acids to form alpha helix/beta sheets. The third factor showed that heavy, aromatic amino-acids tend to be avoided, except when they are needed for structural or functional reasons. These results are discussed in terms of selective pressure acting on amino-acid composition of proteins.

Journal ArticleDOI
TL;DR: A computer program named MSEQ, based on graph theory has been implemented to aid the sequencing of peptides from collision-activated decomposition (CAD) spectra, able to differentiate isobaric amino acids such as leucine and isoleucine when the side-chain fragmentation appears in the spectrum.
Abstract: A computer program named MSEQ, based on graph theory, has been implemented to aid the sequencing of peptides from collision-activated decomposition (CAD) spectra. Input data required by this program are: the molecular weight of the peptide, the list of the masses of the daughter ions and the masses of the N- and C-terminal groups. The output comprises a list of the most likely sequences with their respective scores and the assignments of the daughter ions. A set of probabilities for each fragment ion was computed from hundreds of CAD spectra obtained from our mass spectrometer. To date many peptides have been sequenced in our laboratory with the help of this program, and in most of them the real sequence ranks among the five top sequences. The program is able to differentiate isobaric amino acids such as leucine and isoleucine when the side-chain fragmentation appears in the spectrum. A criterion is used to discard those sequences that match the spectrum poorly from the earliest steps. The program is fast and consumes little memory.

Journal ArticleDOI
TL;DR: A codification structure, fully interfaced with the main packages for biomolecule database management, is presented together with a new search algorithm for rapidly retrieving a sequence in a database; the system is applicable to both nucleic acid and protein sequences and is used to find patterns in databanks or large sets of sequences.
Abstract: We present here a codification structure, entirely interfaced with the main packages for biomolecule database management, associated with a new search algorithm to retrieve quickly a sequence in a database. This system is derived from a method previously proposed for homology search in databanks with a preprocessed codification of an entire database in which all the overlapping subsequences of a specific length in a sequence were converted into a code and stored in a hash-coding file. This new algorithm is designed for an improved use of the codification. It is based on the recognition of the rarest strings which characterize the query sequence and the intersection of sorted lists read in the codification structure. The system is applicable to both nucleic acid and protein sequences and is used to find patterns in databanks or large sets of sequences. A few examples of applications are given. In addition, the comparison of our method with existing ones shows that this new approach speeds up the search for query patterns in large data sets.

Journal ArticleDOI
TL;DR: The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame consistent.
Abstract: This paper presents an algorithm for detecting and 'correcting' sequencing errors that occur in DNA coding regions. The types of sequencing error addressed include insertions and deletions (indels) of DNA bases. The goal is to provide a capability which makes single-pass or low-redundancy sequence data more informative, reducing the need for high-redundancy sequencing for gene identification and characterization purposes. The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame consistent. The authors have implemented the algorithm as a front-end subsystem of the GRAIL DNA sequence analysis system to construct a version which is very error tolerant, and also intend to use this as a testbed for further development of sequencing error-correction technology. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the algorithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases. With this subsystem in place, GRAIL correctly predicted 89% of the coding messages with 10% falsely predicted messages on the 'corrected' sequences, compared to 69% correctly predicted coding messages and 11% falsely predicted messages on the 'corrupted' sequences using the standard GRAIL II method. The method uses a dynamic programming algorithm, and runs in time and space linear in the size of the input sequence.

Journal ArticleDOI
TL;DR: The results validate the neural network topologies used for the prediction of protein secondary structures and highlight the relevance of the input information in determining the limit of their performance.
Abstract: In this work we describe a parallel system consisting of feed-forward neural networks supervised by a local genetic algorithm. The system is implemented in a transputer architecture and is used to predict the secondary structures of globular proteins. This method allows a wide search in the parameter space of the neural networks and the determination of their optimal topology for the predictive task. Different neural network topologies are selected by the genetic algorithm on the basis of minimal values of mean square errors on the testing set. When the alpha-helix, beta-strand and random coil motifs of secondary structures are discriminated, the maximal efficiency obtained is 0.62, with correlation coefficients of 0.35, 0.31 and 0.37 respectively. This level of accuracy is similar to that previously attained by means of neural networks without hidden layers and using single protein sequences as input. The results validate the neural network topologies used for the prediction of protein secondary structures and highlight the relevance of the input information in determining the limit of their performance.

Journal ArticleDOI
TL;DR: A method is described for the representation of a bird's-eye view of similarity relationships between large numbers of proteins, which has the advantage of easy detection of the existence of multidomain proteins and diverged families as well as closely related proteins.
Abstract: A method is described for the representation of a bird's-eye view of similarity relationships between large numbers of proteins. With the aid of single-linkage clustering, proteins are clustered into groups on the basis of various types of similarity such as sequence similarity estimated between all the protein pairs. Proteins in a group are directly or indirectly connected to all proteins in the same group by similarities higher than a given threshold and show no similarity higher than the threshold to any proteins outside the group. Thus, all the proteins directly or indirectly related to a protein can be selected out of a large number of proteins by the clustering. Recursion of this clustering of proteins in each group leads to further classification of the proteins. The similarity relationships in each group are visually represented by a similarity matrix. This representation has the advantage of easy detection of the existence of multidomain proteins and diverged families as well as closely related proteins. Such an exhaustive approach to similarity relationships of proteins will be useful for revealing functional/structural/evolutionary units in proteins.
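The single-linkage grouping rule described here maps naturally onto a union-find structure. The following is a hedged sketch (the example protein names and similarity values are invented, and the paper clusters on real pairwise similarity scores, not a lookup table): two proteins end up in the same group exactly when a chain of pairwise similarities above the threshold connects them, even if they share no direct similarity.

```python
def single_linkage(names, sim, threshold):
    """Single-linkage grouping: proteins fall in the same group iff a
    chain of pairwise similarities above the threshold connects them."""
    parent = {n: n for n in names}

    def find(x):                            # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if sim(a, b) > threshold:
                parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Invented similarities: globinC joins only through the chain via globinB.
SIM = {frozenset(p): s for p, s in [
    (("globinA", "globinB"), 0.8),
    (("globinB", "globinC"), 0.6),
    (("kinase1", "kinase2"), 0.7),
]}
members = ["globinA", "globinB", "globinC", "kinase1", "kinase2", "lysozyme"]
clusters = single_linkage(members,
                          lambda a, b: SIM.get(frozenset((a, b)), 0.0), 0.5)
print(clusters)
```

Rerunning the same procedure inside each group with a higher threshold gives the recursive sub-classification the abstract describes.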

Journal ArticleDOI
TL;DR: This system has supported the building of a YAC map of human chromosome 22 at the Sanger Centre, where use of Alu-PCR product markers is a major component in determining clone overlap and where the authors have an on-going effort to accumulate data from various sources.
Abstract: SAM (system for assembling markers) is a system which supports man-machine problem solving for iteratively ordering a set of markers. SAM aids the user in partially ordering a set of markers based on incomplete and uncertain data. As data is added and modified, SAM aids the user in updating the previously assembled maps. The input is a file of clones and for each clone, a list of the markers contained within it. The objective is to order the set of markers such that the markers contained in each clone are consecutive. The user directs the map building by selecting functions to assemble a region of markers, order the clones to fit the order of the markers and position new markers within an ordered set of markers. The user can edit the input data, edit the assembled map and add clones to the map based on their marker content. The results are displayed graphically and can be saved in a solution file. Based on the partial map, the user designs new experiments or edits the existing data to fill gaps and resolve ambiguities. When a previously assembled map is loaded into SAM, it is automatically updated with the new or altered data. SAM treats all markers as points, but has special features for multiple copy and long markers so that they can be used in the map building process. This system has supported the building of a YAC map of human chromosome 22 at the Sanger Centre, where use of Alu-PCR product markers is a major component in determining clone overlap and where we have an on-going effort to accumulate data from various sources. SAM is also being used at various other laboratories.
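SAM's objective, that the markers contained in each clone be consecutive in the assembled order, can be checked mechanically. The sketch below is an assumed helper (the marker and clone names are invented, and SAM itself builds partial orders interactively rather than just validating one):

```python
def markers_consecutive(order, clones):
    """Check SAM's objective on a candidate marker order: the markers
    contained in each clone must occupy consecutive positions."""
    pos = {m: i for i, m in enumerate(order)}
    for markers in clones.values():
        idx = sorted(pos[m] for m in markers)
        if idx[-1] - idx[0] != len(idx) - 1:   # a gap inside the clone's span
            return False
    return True

clones = {"yacA": {"m1", "m2"}, "yacB": {"m2", "m3", "m4"}}
print(markers_consecutive(["m1", "m2", "m3", "m4"], clones))  # True
print(markers_consecutive(["m2", "m1", "m3", "m4"], clones))  # False
```

In the failing order, yacB's markers span positions 0, 2 and 3, so m1 interrupts the clone, which is the kind of ambiguity the user would resolve by editing the data or designing a new experiment.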

Journal ArticleDOI
TL;DR: The variability measure implicit in the core structures is compared with measures of sequence variability, using a procedure for measuring sequence variability that helps correct for the biased sampling in the databanks and finds, somewhat surprisingly, that sequence variation does not appear to correlate with structural variation.
Abstract: As the database of three-dimensional protein structures expands, it becomes possible to classify related structures into families. Some of these families, such as the globins, have enough members to allow statistical analysis of conserved features. Previously, we have shown that a probabilistic representation based on means and variances can be useful for defining structural cores for large families. These cores contain the subset of atoms that are in essentially the same relative positions in all members of the family. In addition to defining a core, our method creates an ordered list of atoms, ranked by their structural variation. In applying our core-finding procedure to the globins, we find that helices A, B, G and H form a structural core with low variance. These helices fold early in the folding pathway, and superimpose well with helices in the helix-turn-helix repressor protein family. The non-core helices (F and the parts of other helices that interact with it) are associated with the functional differences among the globins, and are encoded within a separate exon. We have also compared the variability measure implicit in our core structures with measures of sequence variability, using a procedure for measuring sequence variability that helps correct for the biased sampling in the databanks. We find, somewhat surprisingly, that sequence variation does not appear to correlate with structural variation.
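The variance ranking at the heart of the core-finding procedure can be sketched in a few lines. This is a crude stand-in for the paper's probabilistic representation (the function names and the fixed 50% core fraction are assumptions, and real use would follow an iterative superposition step): given already-superimposed family members with atoms in a common order, compute each atom's positional variance and take the lowest-variance fraction as the core.

```python
def atom_variances(structures):
    """Positional variance of each atom across superimposed family
    members (each structure: a list of (x, y, z) tuples, atoms in a
    common order)."""
    n = len(structures)
    variances = []
    for coords in zip(*structures):
        mean = [sum(c[d] for c in coords) / n for d in range(3)]
        var = sum((c[d] - mean[d]) ** 2
                  for c in coords for d in range(3)) / n
        variances.append(var)
    return variances

def core_atoms(structures, fraction=0.5):
    """Rank atoms by variance; the lowest-variance fraction is a crude
    stand-in for the probabilistically defined structural core."""
    v = atom_variances(structures)
    ranked = sorted(range(len(v)), key=lambda i: v[i])
    return sorted(ranked[:int(len(v) * fraction)])

# Three toy 'structures': atoms 0 and 1 are invariant, 2 and 3 wobble.
family = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 0)],
    [(0, 0, 0), (1, 0, 0), (2, 2, 0), (3, 1, 0)],
]
print(core_atoms(family))  # [0, 1] - the two invariant atoms
```

The full ordered list returned by `atom_variances` corresponds to the paper's ranking of atoms by structural variation, which is what gets compared against sequence variability.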

Journal ArticleDOI
TL;DR: An RNA secondary structure prediction method using a highly parallel computer and based on a parallel combinatorial method which calculates the free energy of a molecule as the sum of the free energies of all the physically possible hydrogen bonds.
Abstract: An RNA secondary structure prediction method using a highly parallel computer is reported. We focus on finding thermodynamically stable structures of a single-stranded RNA molecule. Our approach is based on a parallel combinatorial method which calculates the free energy of a molecule as the sum of the free energies of all the physically possible hydrogen bonds. Our parallel algorithm finds many highly stable structures all at once, while most of the conventional prediction methods find only the most stable structure. The important idea in our algorithm is search tree pruning, with dynamic load balancing across the processor elements in a parallel computer. Software tools for visualization and classification of secondary structures are also presented using the sequence of cadang-cadang coconut viroid as an example. Our software system runs on the CM-5.
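The pruned tree search can be illustrated serially. This sketch makes several simplifying assumptions it is worth flagging: a toy energy of -1 per base pair replaces real thermodynamic parameters, pseudoknots are excluded, and the paper's parallel execution with dynamic load balancing across CM-5 processor elements is collapsed into a single depth-first loop. What it does preserve is the key idea: keeping the k best structures found so far and cutting any branch whose optimistic bound cannot beat them.

```python
from itertools import combinations

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}

def possible_pairs(seq, min_loop=3):
    """All physically possible base pairs separated by a minimal loop."""
    return [(i, j) for i, j in combinations(range(len(seq)), 2)
            if j - i > min_loop and (seq[i], seq[j]) in PAIRS]

def search(pairs, keep=5):
    """Depth-first enumeration of compatible pair sets with
    branch-and-bound pruning, returning the `keep` most stable."""
    best = []  # (energy, structure) tuples, kept sorted, at most `keep`

    def compatible(p, chosen):
        i, j = p
        for a, b in chosen:
            if len({i, j, a, b}) < 4:            # shares a base
                return False
            if a < i < b < j or i < a < j < b:   # forms a pseudoknot
                return False
        return True

    def recurse(idx, chosen, energy):
        # Optimistic bound: pair every remaining candidate at -1 each.
        if len(best) == keep and energy - (len(pairs) - idx) >= best[-1][0]:
            return                               # prune this subtree
        if len(best) < keep or energy < best[-1][0]:
            best.append((energy, tuple(chosen)))
            best.sort()
            del best[keep:]
        for k in range(idx, len(pairs)):
            if compatible(pairs[k], chosen):
                chosen.append(pairs[k])
                recurse(k + 1, chosen, energy - 1)  # toy energy: -1/pair
                chosen.pop()

    recurse(0, [], 0)
    return best

stems = search(possible_pairs("GGGAAACCC"))
print(stems[0])  # most stable: (-3, ((0, 8), (1, 7), (2, 6)))
```

Because the whole top-k list is returned, the method yields many highly stable structures at once rather than only the single optimum, which is the property the abstract emphasizes.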

Journal ArticleDOI
A. Jülich1
TL;DR: The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE.
Abstract: The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared-memory machine Cray Y-MP 8/864 and the distributed-memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799-residue protein query sequence and the protein database PIR were used.
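The data-parallel scheme behind such ports can be sketched as follows. Everything here is a stand-in: a toy word-match count replaces BLAST's actual scoring and statistics, and a thread pool replaces the Cray/iPSC/nCUBE processor sets and workstation clusters. The shape of the computation is the point: the database is partitioned across workers, each scores its slice independently, and the per-slice hit lists are merged and ranked.

```python
from concurrent.futures import ThreadPoolExecutor

def score(query, name, seq, w=3):
    """Toy word-match score standing in for BLAST: count positions in
    seq whose w-mer also occurs somewhere in the query."""
    words = {query[i:i + w] for i in range(len(query) - w + 1)}
    return name, sum(seq[i:i + w] in words
                     for i in range(len(seq) - w + 1))

def parallel_search(query, database, workers=4):
    """Partition the database entries across a pool of workers (threads
    here, standing in for parallel processors), score each slice
    independently, then merge and rank the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = list(pool.map(lambda entry: score(query, *entry), database))
    return sorted(hits, key=lambda h: -h[1])

db = [("seq1", "ACGTACGTAA"), ("seq2", "TTTTTTTT")]
print(parallel_search("ACGTACGT", db))  # [('seq1', 7), ('seq2', 0)]
```

Because the per-sequence scores are independent, this scheme scales well up to a moderate number of processors, after which merging and I/O start to dominate, matching the abstract's observation.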

Journal ArticleDOI
TL;DR: A package for the creation and processing of multiple sequence alignment is described, with no limit on the lengths of the processed nucleotide or amino acid sequences, and the number of sequences in the alignment is also unlimited.
Abstract: A package for the creation and processing of multiple sequence alignments is described. There is no limit on the lengths of the processed nucleotide or amino acid sequences, and the number of sequences in the alignment is also unlimited. The main groups of functions are: a semi-automatic alignment editor; a wide set of functions for technical processing of alignments; nucleotide alignment mapping and translation; and similarity search functions. A user-friendly interface and a set of generally used file actions provide a special operational subsystem for everyday tasks.