Author

Alejandro A. Schäffer

Other affiliations: Rice University, Bell Labs
Bio: Alejandro A. Schäffer is an academic researcher at the National Institutes of Health. The author has contributed to research on topics including Population and Cancer. The author has an h-index of 74 and has co-authored 249 publications receiving 92,583 citations. Previous affiliations of Alejandro A. Schäffer include Rice University and Bell Labs.


Papers
Journal Article
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal Article
TL;DR: A variety of algorithmic improvements are described, which synthesize biological principles with computer science techniques, to effectively restructure the time-consuming computations in genetic linkage analysis.
Abstract: Linkage analysis using maximum-likelihood estimation is a powerful tool for locating genes. As available data sets have grown, the computation required for analysis has grown exponentially and become a significant impediment. Others have previously shown that parallel computation is applicable to linkage analysis and can yield order-of-magnitude improvements in speed. In this paper, we demonstrate that algorithmic modifications can also yield order-of-magnitude improvements, and sometimes much more. Using the software package LINKAGE, we describe a variety of algorithmic improvements that we have implemented, demonstrating both how these techniques are applied and their power. Experiments show that these improvements speed up the programs by an order of magnitude, on problems of moderate and large size. All improvements were made only in the combinatorial part of the code, without resorting to parallel computers. These improvements synthesize biological principles with computer science techniques, to effectively restructure the time-consuming computations in genetic linkage analysis.

1,380 citations

Journal Article
TL;DR: The modification that accounts for most of the accuracy improvement in PSI-BLAST is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition; composition-based statistics are particularly beneficial for large-scale automated applications of PSI-BLAST.
Abstract: PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the ‘baseline’ version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence’s amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.

1,307 citations
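
The normalized ROC measure mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which ROC variant or cutoff was used, so the ROC_n form below (true positives counted above each of the first n false positives, divided by n times the number of known true positives) is an assumption, and the function name `roc_n` and its parameters are hypothetical.

```python
def roc_n(ranked_labels, n, total_true):
    """Normalized ROC_n score for one ranked hit list (illustrative sketch).

    ranked_labels: booleans, best-scoring hit first; True marks a true positive.
    n:             number of false positives to consider (assumed cutoff).
    total_true:    number of known true positives for the query.
    Returns a value between 0 (worst) and 1 (best).
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            tp_seen += 1
        else:
            fp_seen += 1
            # Each false positive contributes the true positives ranked above it.
            area += tp_seen
            if fp_seen == n:
                break
    # If the list holds fewer than n false positives, treat the missing ones
    # as ranked below every retrieved hit.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_true)


perfect = roc_n([True, True, False, False], n=2, total_true=2)
mixed = roc_n([True, False, True, False], n=2, total_true=2)
```

A ranking that places every true positive ahead of the first false positive scores 1, and one that places none ahead scores 0, matching the range described in the abstract.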

Journal Article
TL;DR: Mutations in genes encoding the IL10R subunit proteins were found in patients with early-onset enterocolitis, involving hyperinflammatory immune responses in the intestine; allogeneic hematopoietic stem-cell transplantation resulted in disease remission in one patient.
Abstract: Background The molecular cause of inflammatory bowel disease is largely unknown. Methods We performed genetic-linkage analysis and candidate-gene sequencing on samples from two unrelated consanguineous families with children who were affected by early-onset inflammatory bowel disease. We screened six additional patients with early-onset colitis for mutations in two candidate genes and carried out functional assays in patients' peripheral-blood mononuclear cells. We performed an allogeneic hematopoietic stem-cell transplantation in one patient. Results In four of nine patients with early-onset colitis, we identified three distinct homozygous mutations in genes IL10RA and IL10RB, encoding the IL10R1 and IL10R2 proteins, respectively, which form a heterotetramer to make up the interleukin-10 receptor. The mutations abrogate interleukin-10–induced signaling, as shown by deficient STAT3 (signal transducer and activator of transcription 3) phosphorylation on stimulation with interleukin-10. Consistent with this...

1,269 citations

Journal Article
TL;DR: Mutations in STAT3 underlie sporadic and dominant forms of the hyper-IgE syndrome, an immunodeficiency syndrome involving increased innate immune response, recurrent infections, and complex somatic features.
Abstract: Background The hyper-IgE syndrome (or Job's syndrome) is a rare disorder of immunity and connective tissue characterized by dermatitis, boils, cyst-forming pneumonias, elevated serum IgE levels, retained primary dentition, and bone abnormalities. Inheritance is autosomal dominant; sporadic cases are also found. Methods We collected longitudinal clinical data on patients with the hyper-IgE syndrome and their families and assayed the levels of cytokines secreted by stimulated leukocytes and the gene expression in resting and stimulated cells. These data implicated the signal transducer and activator of transcription 3 gene (STAT3) as a candidate gene, which we then sequenced. Results We found increased levels of proinflammatory gene transcripts in unstimulated peripheral-blood neutrophils and mononuclear cells from patients with the hyper-IgE syndrome, as compared with levels in control cells. In vitro cultures of mononuclear cells from patients that were stimulated with lipopolysaccharide, with or without ...

1,098 citations


Cited by
Journal Article
TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a new read alignment package, based on backward search with the Burrows–Wheeler Transform (BWT), that efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net

43,862 citations
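
Backward search with the Burrows–Wheeler Transform, named in the abstract above, can be sketched for the exact-match case. This is a toy illustration and not BWA's implementation: it builds the index by naively sorting suffixes, stores full occurrence tables, and does not handle the mismatches and gaps that BWA allows. The function names `bwt_index` and `backward_search` are hypothetical.

```python
def bwt_index(text):
    """Build a naive BWT index of text (a '$' sentinel is appended)."""
    text += "$"
    # Suffix array by direct sorting: fine for a toy string, not for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character for each sorted suffix is the character preceding it.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def backward_search(pattern, sa, first, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Consume the pattern from its last character to its first.
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("GATTACAGATTACA")
hits = backward_search("ATTA", *index)
```

Each pattern character narrows the suffix-array range with two table lookups, so the search cost depends on the read length and not on the reference length. That property is what makes the approach attractive for aligning many short reads against one large reference.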

Journal Article
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations
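
The idea of estimating distance by kmer counting, mentioned in the abstract above, can be illustrated without aligning the sequences. The sketch below is a generic shared-kmer distance and should not be read as MUSCLE's actual formula, which the abstract does not give; the function names, the default k of 3, and the normalization by the shorter sequence are all assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Alignment-free distance in [0, 1] based on shared kmer counts."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must have length >= k")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter intersection keeps the minimum count of each shared kmer.
    shared = sum((counts_a & counts_b).values())
    return 1.0 - shared / denom


same = kmer_distance("ACDEFG", "ACDEFG")
different = kmer_distance("AAAAAA", "CCCCCC")
```

Because it only counts substrings, the estimate runs in time linear in the sequence lengths, which is why this kind of measure suits a fast first pass over thousands of sequences.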

Journal Article
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum and 245 more authors (29 institutions)
15 Feb 2001, Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Journal Article
TL;DR: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences.
Abstract: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.

13,223 citations
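
The abstract above says that long query sequences are broken into chunks for processing. A minimal sketch of that step follows. The abstract does not state the chunk size or whether chunks overlap, so `chunk_size`, `overlap` and the function name `split_query` are hypothetical; the overlap is included here only so that a match spanning a chunk boundary would still fall entirely within one chunk.

```python
def split_query(query, chunk_size, overlap):
    """Split query into (offset, chunk) pairs of at most chunk_size letters.

    Consecutive chunks share `overlap` letters. The offset records where the
    chunk starts in the original query, so hit coordinates found within a
    chunk can be mapped back by adding the offset.
    """
    if overlap < 0 or chunk_size <= overlap:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, query[start:start + chunk_size]))
        # Stop once the current chunk reaches the end of the query.
        if start + chunk_size >= len(query):
            break
        start += step
    return chunks


pieces = split_query("ABCDEFGHIJ", chunk_size=4, overlap=1)
```

Processing the pieces independently keeps the working set for each search small, which is one plausible reason chunking can shorten run times for very long queries.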