Home
/
Authors
/
William R. Pearson

Author

William R. Pearson

Other affiliations: University of Virginia Health System, California Institute of Technology, European Bioinformatics Institute ...read more

Bio: William R. Pearson is an academic researcher from University of Virginia. The author has contributed to research in topics: Sequence alignment & Gene. The author has an hindex of 49, co-authored 113 publications receiving 28796 citations. Previous affiliations of William R. Pearson include University of Virginia Health System & California Institute of Technology.

Topics: Sequence alignment, Gene, Sequence database, Peptide sequence, Sequence analysis ...read more

Papers published on a yearly basis

2023
2022
2019
2018
2017
2016
2015
2014
2013
2012
2010
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
1990
1988
1987
1985
1983
1982
1981
1979
1978
1977
1976
1974
1918
1916

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Improved tools for biological sequence comparison.

[...]

William R. Pearson¹, David J. Lipman•Institutions (1)

University of Virginia¹

01 Apr 1988-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.

...read moreread less

Abstract: We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.

...read moreread less

12,432 citations

Journal Article•DOI•

Rapid and sensitive protein similarity searches

[...]

David J. Lipman¹, William R. Pearson²•Institutions (2)

National Institutes of Health¹, University of Virginia²

22 Mar 1985-Science

TL;DR: An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases and increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution.

...read moreread less

Abstract: An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replacability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).

...read moreread less

3,902 citations

Book Chapter•DOI•

Rapid and Sensitive Sequence Comparison with FASTP and FASTA.

[...]

William R. Pearson

01 Jan 1990-Methods in Enzymology

TL;DR: FASTA and FASTP were designed to identify protein sequences that have descended from a common ancestor and they have proved very useful for this task, but it is not clear that NWS-based programs would be more successful in finding distantly related members of the G-protein-coupled receptor family.

...read moreread less

Abstract: The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer and unambiguously detect proteins that shared a common ancestor billions of years in the past. FASTA is both fast and selective because it initially considers only amino acid identities. Its sensitivity is increased not only by using the PAM250 matrix to score and rescore regions with large numbers of identities but also by joining initial regions. The results of searches with FASTA compare favorably with results using NWS-based programs that are 100 times slower. FASTA is slightly less sensitive but considerably more selective. It is not clear that NWS-based programs would be more successful in finding distantly related members of the G-protein-coupled receptor family. The joining step by FASTA to calculate the initn score is especially useful for sequences that share regions of sequence similarity that are separated by variable-length loops. FASTP and FASTA were designed to identify protein sequences that have descended from a common ancestor, and they have proved very useful for this task. In many cases, a FASTA sequence search will result in a list of high scoring library sequences that are homologous to the query sequence, or the search will result in a list of sequences with similarity scores that cannot be distinguished from the bulk of the library. In either case, the question of whether there are sequences in the library that are clearly related to the query sequence has been answered unambiguously. Unfortunately, the results often will not be so clear-cut, and careful analysis of similarity scores, statistical significance, the actual aligned residues, and the biological context are required. In the course of analyzing the G-protein-coupled receptor family, several proteins were found that, because of a high initn score and a low init1 score that increased almost 2-fold with optimization, appeared to be members of this family which were not previously recognized. RDF2 analysis showed borderline z values, and only a careful examination of the sequence alignments that focused on the conserved residues provided convincing evidence that the high scores were fortuitous. As sequence comparison methods become more powerful by becoming more sensitive, they become more likely to mislead, and even greater care is required.

...read moreread less

2,334 citations

Journal Article•DOI•

Hereditary differences in the expression of the human glutathione transferase active on trans-stilbene oxide are due to a gene deletion.

[...]

Janeric Seidegård¹, William R. Vorachek, Ronald W. Pero, William R. Pearson•Institutions (1)

Lund University¹

01 Oct 1988-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: This article measured human liver mRNA levels by using mouse and human cDNA clones that encode class-mu and class-alpha glutathione transferase (GT; EC 2.5.1.18) mRNA levels.

...read moreread less

Abstract: Glutathione transferase (GT; EC 2.5.1.18) mRNA levels were measured in human liver samples by using mouse and human cDNA clones that encode class-mu and class-alpha GT. Although all the RNA samples examined contained class-alpha GT mRNA, class-mu GT mRNA was found only in individuals whose peripheral leukocytes expressed GT activity on the substrate trans-stilbene oxide. The mouse class-mu cDNA clone was used to identify a human class-mu GT cDNA clone, lambda GTH411. The amino acid sequence of the GT encoded by lambda GTH411 is identical with the 23 residues determined for the human liver GT-mu isoenzyme and shares 76-81% identity with mouse and rat class-mu GT isoenzymes. The mouse and human class-mu GT cDNA inserts hybridize with multiple BamHI and EcoRI restriction fragments in the human genome. One of these hybridizing fragments is missing in the DNA of individuals who lack GT activity on trans-stilbene oxide. Hybridizations with nonoverlapping subfragments of lambda GTH411 suggest that there are at least three class-mu genes in the human genome. One of these genes appears to be deleted in individuals lacking GT activity on trans-stilbene oxide.

...read moreread less

738 citations

Book Chapter•DOI•

Flexible sequence similarity searching with the FASTA3 program package.

[...]

William R. Pearson¹•Institutions (1)

University of Virginia¹

01 Jan 2000-Methods of Molecular Biology

TL;DR: The FasTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments.

...read moreread less

Abstract: The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments. Traditionally, sequence similarity searches have sought to ask one question: "Is my query sequence homologous to anything in the database?" Both FASTA and BLAST can provide reliable answers to this question with their statistical estimates; if the expectation value E is < 0.001-0.01 and you are not doing hundreds of searches a day, the answer is probably yes. In general, the most effective search strategies follow these rules: 1. Whenever possible, compare at the amino acid level, rather than the nucleotide level. Search first with protein sequences (blastp, fasta3, and ssearch3), then with translated DNA sequences (fastx, blastx), and only at the DNA level as a last resort (Table 5). 2. Search the smallest database that is likely to contain the sequence of interest (but it must contain many unrelated sequences for accurate statistical estimates). 3. Use sequence statistics, rather than percent identity or percent similarity, as your primary criterion for sequence homology. 4. Check that the statistics are likely to be accurate by looking for the highest-scoring unrelated sequence, using prss3 to confirm the expectation, and searching with shuffled copies of the query sequence [randseq, searches with shuffled sequences should have E approx 1.0]. 5. Consider searches with different gap penalties and other scoring matrices. Searches with long query sequences against full-length sequence libraries will not change dramatically when BLOSUM62 is used instead of BLOSUM50 (20), or a gap penalty of -14/-2 is used in place of -12/-2. However, shallower or more stringent scoring matrices are more effective at uncovering relationships in partial sequences (3,18), and they can be used to sharpen dramatically the scope of the similarity search. However, as illustrated in the last section, the E value is only the first step in characterizing a sequence relationship. Once one has confidence that the sequences are homologous, one should look at the sequence alignments and percent identities, particularly when searching with lower quality sequences. When sequence alignments are very short, the alignment should become more significant when a shallower scoring matrix is used, e.g., BLOSUM62 rather than BLOSUM50 (remember to change the gap penalties). Homology can be reliably inferred from statistically significant similarity. Whereas homology implies common three-dimensional structure, homology need not imply common function. Orthologous sequences usually have similar functions, but paralogous sequences often acquire very different functional roles. Motif databases, such as PROSITE (21), can provide evidence for the conservation of critical functional residues. However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology.

...read moreread less

649 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Basic Local Alignment Search Tool

[...]

Stephen F. Altschul¹, Warren Gish¹, Webb Miller², Eugene W. Myers³, David J. Lipman¹ - Show less +1 more•Institutions (3)

National Institutes of Health¹, Pennsylvania State University², University of Arizona³

01 Oct 1990-Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

...read moreread less

88,255 citations

Journal Article•DOI•

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

[...]

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman - Show less +3 more•Institutions (2)

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

...read moreread less

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

...read moreread less

70,111 citations

Journal Article•DOI•

Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

[...]

Julie D. Thompson, Desmond G. Higgins, Toby J. Gibson

11 Nov 1994-Nucleic Acids Research

TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.

...read moreread less

Abstract: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.

...read moreread less

63,427 citations

Journal Article•DOI•

Fast and accurate short read alignment with Burrows–Wheeler transform

[...]

Heng Li¹, Richard Durbin¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

...read moreread less

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]

...read moreread less

43,862 citations

Journal Article•DOI•

The Protein Data Bank

[...]

Helen M. Berman¹, John D. Westbrook, Zukang Feng, Gary L. Gilliland, Talapady N. Bhat, Helge Weissig, Ilya N. Shindyalov, Philip E. Bourne - Show less +4 more•Institutions (1)

Rutgers University¹

01 Jan 2000-Nucleic Acids Research

TL;DR: The goals of the PDB are described, the systems in place for data deposition and access, how to obtain further information and plans for the future development of the resource are described.

...read moreread less

Abstract: The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

...read moreread less

34,239 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse