Showing papers in "Bioinformatics in 1992"


Journal Article
TL;DR: An efficient means of generating mutation data matrices from large numbers of protein sequences is presented, based on an approximate peptide-based sequence comparison algorithm. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and to generate a matrix from a specific family or class of proteins in minutes.
Abstract: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based sequence comparison algorithm, the set of sequences is clustered at the 85% identity level. The most closely related pairs of sequences are aligned, and the observed amino acid exchanges are tallied in a matrix. The raw mutation frequency matrix is processed in a similar way to that described by Dayhoff et al. (1978), and so the resulting matrices may be easily used in current sequence analysis applications, in place of the standard mutation data matrices, which have not been updated for 13 years. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes. Differences observed between our 250 PAM mutation data matrix and the matrix calculated by Dayhoff et al. are briefly discussed.

6,355 citations
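
As a rough illustration of the tallying step described in the abstract above, the following sketch counts observed amino acid exchanges from already-aligned sequence pairs into a symmetric raw frequency table. The function name `tally_exchanges`, the gap character and the toy input are assumptions made for illustration; the clustering, the alignment itself and the Dayhoff-style processing of the raw matrix are not shown.

```python
from collections import Counter

def tally_exchanges(aligned_pairs, gap="-"):
    """Count residue exchanges over aligned sequence pairs (symmetric tally)."""
    counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(a, b):
            if x == gap or y == gap or x == y:
                continue  # skip gapped columns and identities
            # The direction of change is unknown, so count both ways.
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts

# Hypothetical toy input: two aligned pairs.
pairs = [("ACDE", "ACEE"), ("AC-E", "GCDE")]
matrix = tally_exchanges(pairs)
```

A `Counter` keyed by residue pairs keeps the sketch short; a real implementation would more likely use a fixed 20 x 20 array.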


Journal Article
TL;DR: The CLUSTAL package of multiple sequence alignment programs has been completely rewritten as a single program, CLUSTAL V; the main new features are the ability to store and reuse old alignments and to calculate phylogenetic trees after alignment.
Abstract: The CLUSTAL package of multiple sequence alignment programs has been completely rewritten and many new features added. The new software is a single program called CLUSTAL V, which is written in C and can be used on any machine with a standard C compiler. The main new features are the ability to store and reuse old alignments and the ability to calculate phylogenetic trees after alignment. The program is simple to use, completely menu driven and on-line help is provided.

2,385 citations


Journal Article
TL;DR: An algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space is described, which allows longer sequences to be aligned and allows optimization within wider bands, which can include longer gaps.
Abstract: We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the two sequences and W is the width of the band. The basic algorithm can be used to calculate either local or global alignment scores. Local alignments are produced by finding the beginning and end of a best local alignment in the band, and then applying the global alignment algorithm between those points. This algorithm has been incorporated into the FASTA program package, where it has decreased the amount of memory required to calculate local alignments from O(NW) to O(N) and decreased the time required to calculate optimized scores for every sequence in a protein sequence database by 40%. On computers with limited memory, such as the IBM-PC, this improvement both allows longer sequences to be aligned and allows optimization within wider bands, which can include longer gaps.

176 citations
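
To make the O(NW) time bound in the abstract above concrete, here is a minimal score-only sketch of global alignment restricted to a diagonal band. It is not the FASTA implementation: the scoring values, the function name `banded_global_score` and the band-indexing scheme are assumptions, and the sketch does not reproduce the paper's technique for recovering the alignment itself in O(N) space.

```python
NEG_INF = float("-inf")

def banded_global_score(a, b, w, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells (i, j) with |i - j| <= w."""
    n, m = len(a), len(b)
    if abs(n - m) > w:
        raise ValueError("band too narrow to reach the final cell")
    width = 2 * w + 1
    # Cell (i, j) is stored at band index k = j - i + w.
    prev = [NEG_INF] * width
    for j in range(0, min(m, w) + 1):
        prev[j + w] = j * gap  # row 0: leading gaps
    for i in range(1, n + 1):
        cur = [NEG_INF] * width
        for k in range(width):
            j = i - w + k
            if j < 0 or j > m:
                continue  # outside the matrix
            if j == 0:
                cur[k] = i * gap  # column 0: leading gaps
                continue
            s = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[k] + s  # diagonal move from (i-1, j-1)
            if k + 1 < width:
                best = max(best, prev[k + 1] + gap)  # move from (i-1, j)
            if k - 1 >= 0:
                best = max(best, cur[k - 1] + gap)  # move from (i, j-1)
            cur[k] = best
        prev = cur
    return prev[m - n + w]
```

Each row touches at most 2W + 1 cells, so the work is proportional to N times W, and only two band-sized rows are kept at any time.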


Journal Article
TL;DR: Original algorithms for simultaneous alignment of protein sequences are presented; they provide the most likely optimal alignment together with a comprehensive list of alignment dilemmas, and can be run either fully automatically or interactively.
Abstract: Original algorithms for simultaneous alignment of protein sequences are presented, including sequence clustering and within- or between-groups multiple alignment. The way of matching similar regions is fundamentally new. Complete matches are formed by segments more similar than expected at random, according to a given probability limit. Any classic or user-defined score matrix can be used to express the similarity between the residues. The algorithm seeks complete matches common to all the sequences without performing pairwise alignment and regardless of gap weighting. An automatic screening delineates all the similar regions (boxes) that may be defined for a given maximal shift between the sequences. The shift can be large enough to allow the matching of any region of a sequence with any region of another one. It can also be short and used to refine the alignment around anchor points. The algorithm provides the most likely optimal alignment and a comprehensive list of the alignment dilemmas. Duality between automatism and interactivity is provided. Depending on the problem complexity, a final alignment is obtained fully automatically or requires some interactive handling to discriminate alternative pathways.

114 citations


Journal Article
TL;DR: The description of the computational procedures and use of the computer program with suitable examples from mosquito control programmes are discussed.
Abstract: Probit analysis calculations are highly useful in biology and related sciences. Since the statistical calculations and tests required are quite involved, the use of an automatic computer program is desirable. The description of the computational procedures and use of the computer program with suitable examples from mosquito control programmes are discussed.

111 citations


Journal Article
TL;DR: The ability to search databases of blocks by 'on-the-fly' conversion to scoring matrices provides a new tool for detection and evaluation of distant relationships.
Abstract: A program has been developed that provides molecular biologists with multiple tools for searching databases, yet uses a very simple interface. PATMAT can use protein or (translated) DNA sequences, patterns or blocks of aligned proteins as queries of databases consisting of amino acid or nucleotide sequences, patterns or blocks. The ability to search databases of blocks by 'on-the-fly' conversion to scoring matrices provides a new tool for detection and evaluation of distant relationships. PATMAT uses a pull-down, menu-driven interface to carry out its multiple searching, extraction and viewing functions. Each query or database type is recognized, reported, and the appropriate search carried out, with matches and alignments reported in windows as they occur. Any of the high scoring matches can be exported to a file, viewed and recalled as a query using only a few keystrokes or mouse selections. Searches of multiple database files are carried out by user selection within a window. PATMAT runs under DOS; the searching engine also runs under UNIX.

95 citations


Journal Article
TL;DR: The parallel method provides rapid, high-resolution alignments for users of the software toolkit for pairwise sequence comparison, as illustrated here by a comparison of the chloroplast genomes of tobacco and liverwort.
Abstract: The local similarity problem is to determine the similar regions within two given sequences. We recently developed a dynamic programming algorithm for the local similarity problem that requires only space proportional to the sum of the two sequence lengths, whereas earlier methods use space proportional to the product of the lengths. In this paper, we describe how to parallelize the new algorithm and present results of experimental studies on an Intel hypercube. The parallel method provides rapid, high-resolution alignments for users of our software toolkit for pairwise sequence comparison, as illustrated here by a comparison of the chloroplast genomes of tobacco and liverwort.

87 citations


Journal Article
TL;DR: The computer program PROFILEGRAPH, a graphical interactive tool for the analysis of amino acid sequences, is described, which allows the user to combine any amino acid specific parameter with a selection of several possible types of analysis and to plot the resulting graph in one of several windows on the screen.
Abstract: The computer program PROFILEGRAPH, a graphical interactive tool for the analysis of amino acid sequences, is described. The main task of the program is to integrate a variety of sliding-window methods into a single user-friendly shell. The program allows the user to combine any amino acid specific parameter with a selection of several possible types of analysis and to plot the resulting graph in one of several windows on the screen. It is also possible to calculate the moment of the amino acid specific parameter for a given secondary structure and to display both the absolute moment value and the moment angle relative to a reference residue. Also included are several utilities that facilitate visual analysis of protein primary structures like, for example, helical-wheel diagrams. It is possible to adapt the majority of published sliding-window analysis procedures for use with PROFILEGRAPH.

69 citations


Journal Article
TL;DR: This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences; in one example, the ordination of 226 aligned globins successfully represents the known patterns of relationship between the sequences.
Abstract: Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space where the distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, where one ignores any positions in the alignment where there is a gap in any sequence. If one does not ignore positions with a gap, the distances cannot be guaranteed to be Euclidean but the deleterious effects are trivial. Two examples of using the method are shown. A set of 226 aligned globins was analysed and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the other example, a set of 610 aligned 5S rRNA sequences was analysed. Sequence ordinations complement phylogenetic analyses. They should not be viewed as a complete alternative.

68 citations
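
The distance function described in the abstract above can be sketched as follows: the square root of the percentage of differing positions, computed only over alignment columns in which no sequence has a gap. The function name `ordination_distances`, the gap character and the toy alignment are assumptions; the principal coordinates analysis that would consume this matrix is not shown.

```python
import math

def ordination_distances(alignment, gap="-"):
    """Pairwise sqrt(percentage difference) over gap-free alignment columns."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have equal length")
    # Keep only columns where no sequence in the whole alignment has a gap.
    cols = [c for c in range(length) if all(s[c] != gap for s in alignment)]
    if not cols:
        raise ValueError("no gap-free columns in the alignment")
    n = len(alignment)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = sum(alignment[i][c] != alignment[j][c] for c in cols)
            dist[i][j] = dist[j][i] = math.sqrt(100.0 * diffs / len(cols))
    return dist

# Hypothetical toy alignment: column 1 is dropped because of the gap.
aln = ["ACDEF", "ACDQF", "A-DQW"]
dist = ordination_distances(aln)
```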


Journal Article
TL;DR: A new approach to search for common patterns in many sequences is presented, where one sequence from the set of sequences to be compared is considered as a 'basic' one, all its similarities with other sequences are found, and multiple similarities are reconstructed using these data.
Abstract: A new approach to search for common patterns in many sequences is presented. The idea is that one sequence from the set of sequences to be compared is considered as a 'basic' one and all its similarities with other sequences are found. Multiple similarities are then reconstructed using these data. This approach allows one to search for similar segments which can differ in both substitutions and deletions/insertions. These segments can be situated at different positions in various sequences. No regions of complete or strong similarity within the segments are required. The other parts of the sequences can have no similarity at all. The only requirement is that the similar segments can be found in all the sequences (or in the majority of them, provided the common segments are present in the basic sequence). The running time of the algorithm presented is proportional to n·L² when n sequences of length L are analyzed. The proposed algorithm is implemented as programs for the IBM-PC and IBM/370. Its applications to the analysis of biopolymer primary structures as well as the dependence of the results on the choice of basic sequence are discussed.

60 citations


Journal Article
TL;DR: It was found that the expected frequency and overlapping properties determine most of the variance, and a new solution to the problem of word overlap is proposed.
Abstract: An exact expression for the variance of random frequency that a given word has in text generated by a Markov chain is presented. The result is applied to periodic Markov chains, which describe the protein-coding DNA sequences better than simple Markov chains. A new solution to the problem of word overlap is proposed. It was found that the expected frequency and overlapping properties determine most of the variance. The expectation and variance of counts for triplets are compared with experimental counts in Escherichia coli coding sequences.

Journal Article
TL;DR: This work examines the use of internal controls for estimating the expected initial copy number of the target in a polymerase chain reaction (PCR) using an extended branching-process model and provides an algorithm for conducting the statistical analysis.
Abstract: We examine the use of internal controls for estimating the expected initial copy number of the target in a polymerase chain reaction (PCR). We base our investigation on an extended branching-process model. In terms of that model, we delineate the necessary assumptions for this methodology to yield approximately unbiased answers, and we provide means for testing some of those assumptions. We show how to design a series of PCRs to attain optimal precision of the estimate. We provide an algorithm for conducting the statistical analysis of the data, including a formula for a confidence interval for the unknown expected initial copy number.

Journal Article
TL;DR: A program OBSTRUCT has been developed to obtain the largest possible subset according to specific constraints from a set of protein sequences whose tertiary structures have been determined crystallographically.
Abstract: A program OBSTRUCT has been developed to obtain the largest possible subset according to specific constraints from a set of protein sequences whose tertiary structures have been determined crystallographically. The user can request a range in sequence similarity level and/or structural resolution. The program optionally includes sequences with known three-dimensional folds elicited from NMR data.

Journal Article
TL;DR: An artificial neural network was used to cluster proteins into families and it is suggested that this novel approach may be a useful tool to organize the search for homologies in large macromolecular databases.
Abstract: An artificial neural network was used to cluster proteins into families. The network, composed of 7 x 7 neurons, was trained with the Kohonen unsupervised learning algorithm using, as inputs, matrix patterns derived from the bipeptide composition of 447 proteins, belonging to 13 different families. As a result of the training, and without any a priori indication of the number or composition of the expected families, the network self-organized the activation of its neurons into topologically ordered maps in which almost all the proteins (96.7%) were correctly clustered into the corresponding families. In a second computational experiment, a similar network was trained with one family of the previous learning set (76 cytochrome c sequences). The new neural map clustered these proteins into 25 different neurons (five in the first experiment), wherein phylogenetically related sequences were positioned close to each other. This result shows that the network can adapt the clustering resolution to the complexity of the learning set, a useful feature when working with an unknown number of clusters. Although the learning stage is time consuming, once the topological map is obtained, the classification of new proteins is very fast. Altogether, our results suggest that this novel approach may be a useful tool to organize the search for homologies in large macromolecular databases.
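
The network inputs mentioned in the abstract above are derived from bipeptide composition. The exact encoding used by the authors is not given here, so the sketch below shows only one plausible reading: a 400-element vector of overlapping dipeptide frequencies. The names `AMINO_ACIDS`, `DIPEPTIDES` and `bipeptide_composition` are assumptions, and the Kohonen training itself is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def bipeptide_composition(seq):
    """Return 400 overlapping-dipeptide frequencies for a protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]

# Hypothetical toy sequence: dipeptides are AC, CA, AC, CD.
vec = bipeptide_composition("ACACD")
```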

Journal Article
TL;DR: The cumulative logit model is eminently suitable for the analysis of ordinal response data and it also allows adjustment of confounding and assessment of effect modification based on modest sample size.
Abstract: Incorrect statistical methods are often used for the analysis of ordinal response data. Such data are frequently summarized into mean scores for comparisons, a fallacious practice because ordinal data are inherently not equidistant. The ubiquitous Pearson chi-square test is invalid because it ignores the ranking of ordinal data. Although some of the non-parametric statistical methods take into account the ordering of ordinal data, these methods do not accommodate statistical adjustment of confounding or assessment of effect modification, two overriding analytic goals in virtually all etiologic inference in biology and medicine. The cumulative logit model is eminently suitable for the analysis of ordinal response data. This multivariate method not only considers the ranked order inherent in ordinal response data, but it also allows adjustment of confounding and assessment of effect modification based on modest sample size. A non-technical account of the cumulative logit model is given and its applications are illustrated by two research examples. The SAS programs for the data analysis of the research examples are available from the author.

Journal Article
TL;DR: The design and implementation of the current system is described, a 'client/server' network of Sun, IBM, DEC and Apple servers, gateways and workstations, to provide online computing support to the Human Genome Mapping Project in the UK.
Abstract: This paper presents an overview of computing and networking facilities developed by the Medical Research Council to provide online computing support to the Human Genome Mapping Project (HGMP) in the UK. The facility is connected to a number of other computing facilities in various centres of genetics and molecular biology research excellence, either directly via high-speed links or through national and international wide-area networks. The paper describes the design and implementation of the current system, a 'client/server' network of Sun, IBM, DEC and Apple servers, gateways and workstations. A short outline of online computing services currently delivered by this system to the UK human genetics research community is also provided. More information about the services and their availability could be obtained by a direct approach to the UK HGMP-RC.

Journal Article
TL;DR: An algorithm is presented which approximates the total partition function by a Boltzmann-weighted summation of optimal and suboptimal secondary structures at several temperatures.
Abstract: Dynamic programming algorithms are able to predict optimal and suboptimal secondary structures of RNA. These suboptimal or alternative secondary structures are important for the biological function of RNA. The distribution of secondary structures present in solution is governed by the thermodynamic equilibrium between the different structures. An algorithm is presented which approximates the total partition function by a Boltzmann-weighted summation of optimal and suboptimal secondary structures at several temperatures. A clear representation of the equilibrium distribution of secondary structures is derived from a two-dimensional bonding matrix with base-pairing probability as the third dimension. The temperature dependence of the equilibrium distribution gives the denaturation behavior of the nucleic acid, which may be compared to experimental optical denaturation curves after correction for the hypochromicities of the different base-pairs. Similarly, temperature-induced mobility changes detected in temperature-gradient gel electrophoresis of nucleic acids may be interpreted on the basis of the temperature dependence of the equilibrium distribution. Results are illustrated for natural circular and synthetic linear potato spindle tuber viroid RNA respectively, and are compared to experimental data.
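
The Boltzmann-weighted summation mentioned in the abstract above can be illustrated with a minimal sketch that turns a list of structure free energies into equilibrium probabilities at a given temperature. The energies are hypothetical values, the function name `boltzmann_probabilities` is an assumption, and the sketch holds the energies fixed across temperatures, which a real thermodynamic model would not do.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K)

def boltzmann_probabilities(energies, temp_k):
    """Equilibrium probability of each structure from free energies (kcal/mol)."""
    e_min = min(energies)
    # Subtracting the minimum avoids overflow; it cancels in the ratio.
    weights = [math.exp(-(e - e_min) / (R * temp_k)) for e in energies]
    z = sum(weights)  # approximate partition function over the listed structures
    return [w / z for w in weights]

# Hypothetical energies of one optimal and two suboptimal structures.
energies = [-10.0, -9.0, -7.5]
for temp in (298.15, 310.15, 343.15):
    probs = boltzmann_probabilities(energies, temp)
    print(temp, [round(p, 3) for p in probs])
```

Raising the temperature flattens the distribution, so suboptimal structures gain weight relative to the optimal one.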

Journal Article
TL;DR: A set of programs has been written to quantify the similarities between large numbers of cDNA sequences and this information is used to cluster similar sequences together.
Abstract: A set of programs has been written to quantify the similarities between large numbers of cDNA sequences. This information is used to cluster similar sequences together. The main program can cluster thousands of cDNA sequences per day using a novel, computationally inexpensive algorithm. The clustering information is kept in a small index file so that disk storage requirements are negligible. Using this index file, subsidiary programs create various views and statistical summaries of the entire cDNA sequence collection.

Journal Article
TL;DR: It is shown that the efficiency of the statistical l-tuple filtration upon DNA database search is associated with a potential extension of the original four-letter alphabet and grows exponentially with increasing l.
Abstract: Upon searching local similarities in long sequences, the necessity of a 'rapid' similarity search becomes acute. Quadratic complexity of dynamic programming algorithms forces the employment of filtration methods that allow elimination of the sequences with a low similarity level. The paper is devoted to the theoretical substantiations of the filtration method based on the statistical distance between texts. The notion of the filtration efficiency is introduced and the efficiency of several filters is estimated. It is shown that the efficiency of the statistical l-tuple filtration upon DNA database search is associated with a potential extension of the original four-letter alphabet and grows exponentially with increasing l. The formula that allows one to estimate the filtration parameters is presented.
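
The general idea of l-tuple filtration in the abstract above can be sketched as a cheap pre-screen that discards database sequences whose l-tuple counts differ too much from the query, before any dynamic programming is run. The abstract does not give the distance formula, so the sum of absolute count differences used here is an assumption, as are the function names and the threshold.

```python
from collections import Counter

def ltuple_counts(seq, l):
    """Count all overlapping words of length l in seq."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))

def ltuple_distance(a, b, l):
    """Sum of absolute differences between the l-tuple counts of two texts."""
    ca, cb = ltuple_counts(a, l), ltuple_counts(b, l)
    return sum(abs(ca[t] - cb[t]) for t in set(ca) | set(cb))

def filter_database(query, database, l, threshold):
    """Keep only sequences close enough to the query to merit full alignment."""
    return [s for s in database if ltuple_distance(query, s, l) <= threshold]

# Hypothetical toy database.
hits = filter_database("ACGT", ["ACGA", "TTTT"], l=2, threshold=2)
```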

Journal Article
TL;DR: A computer program is written to construct survival curves by the corrected group prognostic curves approach to Cox's proportional hazards regression model, and it is coded in the Interactive Matrix Language of SAS.
Abstract: Cox's proportional hazards regression model is a useful statistical tool for the analysis of 'survival data' from longitudinal studies. This multivariate method compares the 'survival experience' between two or more exposure groups while allowing for simultaneous adjustment of confounding due to one or more covariates. In addition to the summary regression statistics, further insight on the exposure--response relationship can be gained by visually examining the covariates-adjusted survival curves in the respective comparison groups. Covariates-adjusted survival curves are usually computed by the 'average covariate method'. This method is, however, subject to potential drawbacks. A method that avoids these drawbacks is to estimate adjusted survival curves by the corrected group prognostic curves approach. We have written a computer program to construct survival curves by the latter method. The program is coded in the Interactive Matrix Language of SAS.

Journal Article
TL;DR: It is shown in this paper that the rigor and efficiency of dynamic programming algorithms carry over to the map comparison algorithms, and algorithms for restriction map comparison that deal with two types of map errors are presented.
Abstract: For most sequence comparison problems there is a corresponding map comparison algorithm. While map data may appear to be incompatible with dynamic programming, we show in this paper that the rigor and efficiency of dynamic programming algorithms carry over to the map comparison algorithms. We present algorithms for restriction map comparison that deal with two types of map errors: (i) closely spaced sites for different enzymes can be ordered incorrectly, and (ii) closely spaced sites for the same enzyme can be mapped as a single site. The new algorithms are a natural extension of a previous map comparison model. Dynamic programming algorithms for computing optimal global and local alignments under the new model are described. The new algorithms take about the same order of time as previous map comparison algorithms. Programs implementing some of the new algorithms are used to find similar regions within the Escherichia coli restriction map of Kohara et al.

Journal Article
TL;DR: The applications of concepts derived from fractal geometry to biological problems are described, with methods and algorithms drawn from a wide range of literature, including some non-biological sources.
Abstract: The applications of concepts derived from fractal geometry to biological problems are described. Three major applications are identified: modelling of structures; investigation of theoretical problems; and the measurement of complexity. The review concentrates on methods and algorithms, including potential problems, which can be used with biological problems. These algorithms are drawn from a wide range of literature, including some non-biological sources.

Journal Article
TL;DR: A new protein sequence analysis package, ADSP, is described, of which the SOMAP Screen-Oriented Multiple Alignment Procedure forms an integral part, which incorporates a powerful method for compound feature analysis.
Abstract: A new protein sequence analysis package, ADSP, is described, of which the SOMAP Screen-Oriented Multiple Alignment Procedure forms an integral part. ADSP (Algorithms and Data Structures for Protein sequence analysis) incorporates facilities to generate potent pattern-recognition discriminators and offers four algorithms with which to scan any NBRF format sequence database: the package has been designed, in particular, to interface with the OWL composite sequence database, one of the largest, distributed non-redundant sources of sequence data of its kind. The system incorporates a powerful method for compound feature analysis, which provides the basis for characterizing and predicting the occurrence of complete protein superfamilies and for pinpointing the emergence of related sub-families. Used iteratively, the approach allows diagnostic performance to be rigorously refined and its efficacy to be assessed both qualitatively and quantitatively, and results in the generation of refined structural or functional features suitable for entry into a database: this compilation of characteristic signatures is distinct from, but complementary to, widely used compendia of pattern templates such as PROSITE.


Journal Article
TL;DR: GCWIND is a microcomputer (IBM-PC compatible) program for the identification of protein-coding open reading frames to provide an immediate representation of those regions within the sequence that have coding potential.
Abstract: GCWIND is a microcomputer (IBM-PC compatible) program for the identification of protein-coding open reading frames. The program is similar to the FRAME program, but the latter has only been implemented for a specialized graphics package. The base compositions (%G+C) for each of the three possible reading phases through the DNA sequence are displayed separately, together with the positions of potential translation initiation and termination codons (on the leading and complementary strands), to provide an immediate representation of those regions within the sequence that have coding potential.
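
One reading of the per-phase base composition described in the abstract above is sketched below: within each sliding window, %G+C is computed separately for the bases falling in each of the three phases. This is not the GCWIND code; the function name `gc_by_phase`, the window and step defaults, and the choice to define phase by absolute sequence coordinate are all assumptions.

```python
def gc_by_phase(seq, window=120, step=3):
    """For each window, return %G+C at positions congruent to 0, 1, 2 mod 3."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        row = []
        for phase in range(3):
            # Use absolute coordinates so a phase means the same in every window.
            bases = [chunk[i] for i in range(len(chunk))
                     if (start + i) % 3 == phase]
            gc = sum(b in "GC" for b in bases)
            row.append(100.0 * gc / len(bases) if bases else 0.0)
        results.append((start, tuple(row)))
    return results

# Hypothetical toy sequence: G occurs only in phase 2.
res = gc_by_phase("ATG" * 4, window=6, step=3)
```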

Journal Article
TL;DR: A multiple sequence alignment editor is described which runs on a VAX/VMS system and can exchange data with a number of other programs, including those of the Genetics Computer Group (GCG).
Abstract: A multiple sequence alignment editor is described which runs on a VAX/VMS system and can exchange data with a number of other programs, including those of the Genetics Computer Group (GCG). Up to 199 sequences can be aligned. The quality of the alignment can be easily judged during its development because the display attributes of each character are determined by the way it matches the other sequences. Four methods are available for calculating the highlighting to emphasize different aspects of the relationships of the sequences and up to four styles of highlighting can be used at the same time. Laser printer output is suitable for publication without modification.

Journal Article
TL;DR: This work establishes algorithms for, and computational limits on, reconstructing a sequence from all its subsequences of a given length. It uses graph theory to enumerate all solution strings and introduces a second algorithm that produces a signature for each solution string.
Abstract: The problem tackled here concerns the feasibility of DNA sequencing using hybridization methods. We establish algorithms for and computational limitations to the reconstruction of a sequence from all its subsequences having the same length: in other words, the building of a string that contains all the words of a given set, and only those. Generally there are several possible strings. We refer to graph theory and propose an algorithm to enumerate all the strings that are solutions. We then carried out simulations using real DNA sequences. They provided some necessary conditions and gave some upper bounds on the length of the sequence to recover in relation to the length of the oligonucleotides. To avoid limiting ourselves to problems that admit a unique solution, we introduce another algorithm that produces a signature for each solution string. Each signature can be tested to determine which one belongs to the correct sequence.
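
The enumeration problem in the abstract above can be illustrated with a small backtracking sketch over a de Bruijn-style graph, in which each word is an edge from its (k-1)-prefix to its (k-1)-suffix. This is not the authors' algorithm: it assumes every word occurs exactly once in the target sequence, the function name `reconstruct_all` is hypothetical, and the signature algorithm is not shown.

```python
def reconstruct_all(words):
    """Enumerate strings that use every k-mer in `words` exactly once."""
    words = set(words)
    if not words:
        return []
    k = len(next(iter(words)))
    if k < 2 or any(len(w) != k for w in words):
        raise ValueError("all words must share one length k >= 2")
    # Index words by their (k-1)-prefix so successors are found quickly.
    by_prefix = {}
    for w in words:
        by_prefix.setdefault(w[:-1], []).append(w)
    solutions = set()

    def extend(path, used):
        if len(used) == len(words):
            solutions.add(path)
            return
        for w in by_prefix.get(path[-(k - 1):], []):
            if w not in used:
                used.add(w)
                extend(path + w[-1], used)
                used.remove(w)  # backtrack

    for start in words:
        extend(start, {start})
    return sorted(solutions)

# Hypothetical spectrum in which the 2-mer "AG" repeats, giving two solutions.
spectrum = ["CAG", "AGT", "GTA", "TAG", "AGG", "GGA", "GAG", "AGC"]
print(reconstruct_all(spectrum))
```

Repeated (k-1)-mers are what make the reconstruction ambiguous, which is why longer oligonucleotides allow longer sequences to be recovered uniquely.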

Journal Article
TL;DR: A program for IBM-compatible microcomputers is introduced which combines several complementary analyses of species-station-tables generated in ecological field investigations and the essential reasoning behind the application of community studies is presented briefly.
Abstract: A program for IBM-compatible microcomputers is introduced which combines several complementary analyses of species-station-tables generated in ecological field investigations. The scope of the program encompasses table editing functions, routines for community delimitation by cluster analysis and procedures for the analysis of properties related both to stations (e.g. diversities) and species (e.g. abundance statistics and association indices). The essential reasoning behind the application of community studies is presented briefly, as well as the multi-step analytical approach implemented in the program.

Journal Article
TL;DR: A method is proposed for searching for and predicting RNA pseudoknots that are likely to have functional meaning; a number of previously reported functional pseudoknots can be identified and predicted from their sequences by the method.
Abstract: The RNA pseudoknot has been proposed as a significant structural motif in a wide range of biological processes of RNAs. A pseudoknot involves intramolecular pairing of bases in a hairpin loop with bases outside the stem of the loop to form a second stem and loop region. In this study, we propose a method for searching and predicting pseudoknots that are likely to have functional meaning. In our procedure, the orthodox hairpin structure involved in the pseudoknot is required to be both statistically significant and relatively stable to the others in the sequence. The bases outside the stem of the hairpin loop in the predicted pseudoknot are not entangled with any formation of a highly stable secondary structure in the sequence. Also, the predicted pseudoknot is significantly more stable than those that can be formed from a large set of scrambled sequences under the assumption that the energy contribution from a pseudoknot is proportional to the size of second loop region and planar energy contribution from second stem region. A number of functional pseudoknots that have been reported before can be identified and predicted from their sequences by our method.

Journal Article
TL;DR: This finite-difference computer model is designed to simulate complex diffusion/reaction events in bacterial films, and was originally designed for modelling the events in dental plaque leading to tooth decay, but should find application in other fields.
Abstract: This finite-difference computer model is designed to simulate complex diffusion/reaction events in bacterial films. It is modular, each module mirroring closely a particular physical, chemical or biochemical factor. It is capable of handling > 20 diffusing/reacting species, but can be easily expanded or simplified to match particular systems. It was originally designed for modelling the events in dental plaque leading to tooth decay, but should find application in other fields. It allows for ion-exchange interactions with, for example, fixed charges on bacterial surfaces, which can act as pH and cation buffer sites. pH-dependent utilization of substrate is modelled implicitly, combining Michaelis-Menten kinetics with diffusion in a single iterative procedure. Advantages are given for computing diffusion of all other species explicitly using single-species diffusion coefficients, with charge-coupling by means of the algorithm Q-COUPLE. Activity corrections and enzyme pH-dependence are included. Chemical equilibria and mineral deposition/dissolution are computed iteratively node by node. The program is tested against some problems having analytical solutions, and an example is given of its application to demineralization of teeth as a result of bacterial action in dental plaque.
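
As a much-reduced illustration of the kind of calculation described in the abstract above, the sketch below advances a single species by one explicit finite-difference step of one-dimensional diffusion with Michaelis-Menten uptake. It is a simplification, not the published model: the abstract says substrate utilization is handled implicitly, whereas this sketch is fully explicit, and the function name, boundary conditions and parameter values are assumptions.

```python
def diffuse_react_step(conc, d, dx, dt, vmax, km):
    """One explicit step of dc/dt = D * d2c/dx2 - Vmax * c / (Km + c).

    Assumed boundaries: conc[0] is held fixed (bulk fluid) and the last
    node is a no-flux wall (for example, a tooth surface). Km must be > 0.
    """
    if d * dt / dx ** 2 > 0.5:
        raise ValueError("explicit scheme unstable: need D*dt/dx^2 <= 0.5")
    n = len(conc)
    new = conc[:]
    for i in range(1, n):
        left = conc[i - 1]
        # Mirror the neighbour at the wall to impose zero flux.
        right = conc[i + 1] if i + 1 < n else conc[i - 1]
        lap = (left - 2 * conc[i] + right) / dx ** 2
        uptake = vmax * conc[i] / (km + conc[i])
        new[i] = max(0.0, conc[i] + dt * (d * lap - uptake))
    return new

# Hypothetical usage: uniform substrate consumed inside the film.
profile = [1.0] * 5
for _ in range(10):
    profile = diffuse_react_step(profile, d=1.0, dx=1.0, dt=0.1, vmax=0.5, km=1.0)
```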