Showing papers by "Alejandro A. Schäffer" published in 2006


Journal ArticleDOI
TL;DR: A new implementation of the DUST module uses the same function to assign a complexity score to a sequence but a different rule for masking high-scoring sequences, and it is at least four times faster than the old implementation on the human genome.
Abstract: The DUST module has been used within BLAST for many years to mask low-complexity sequences. In this paper, we present a new implementation of the DUST module that uses the same function to assign a complexity score to a sequence, but uses a different rule by which high-scoring sequences are masked. The new rule masks every nucleotide masked by the old rule and occasionally masks more. The new masking rule corrects two related deficiencies with the old rule. First, the new rule is symmetric with respect to reversing the sequence. Second, the new rule is not context sensitive; the decision to mask a subsequence does not depend on what sequences flank it. The new implementation is at least four times faster than the old on the human genome. We show that both the percentage of additional bases masked and the effect on MegaBLAST outputs are very small.

431 citations
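
The abstract above says the complexity score function is unchanged, but it does not state the function. The following is a minimal sketch of a triplet-count score of the kind commonly attributed to DUST. The exact formula, the normalization by (number of triplets - 1), and the name `triplet_score` are assumptions for illustration, and the masking rule itself is not reproduced here.

```python
from collections import Counter


def triplet_score(seq: str) -> float:
    """Return a triplet-repetition score for a nucleotide window.

    Higher values mean more repeated triplets, i.e. lower complexity.
    Assumed form: sum over distinct triplets t of c_t * (c_t - 1) / 2,
    divided by (number of triplets - 1).
    """
    seq = seq.upper()
    # All overlapping triplets in the window.
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(triplets)
    if n < 2:
        return 0.0  # too short to contain any repeated triplet
    counts = Counter(triplets)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)
```

A score of this form depends only on triplet counts, so reversing the window does not change it. That is a property of the score alone. The symmetry that the abstract describes concerns the masking rule, which this sketch does not model.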


Journal ArticleDOI
TL;DR: It is shown that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy; the test data are available for download and may be useful in other studies of translated search algorithms.
Abstract: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.

409 citations
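
The abstract above describes TBLASTN as aligning proteins to a nucleotide database translated in all six frames. The following is a minimal sketch of six-frame translation using the standard genetic code. It is not the TBLASTN implementation; the function names and the choice to emit `X` for codons containing ambiguous bases are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Map frame labels +1..+3 and -1..-3 to their translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For example, `six_frames("ATGGCCTAA")[1]` is `"MA*"`, where `*` marks a stop codon.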


Journal ArticleDOI
TL;DR: WindowMasker (WM) is a software tool that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself; it is orders of magnitude faster than RepeatMasker/Maskeraid.
Abstract: Motivation: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. Results: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. Availability: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. Contact: richa@helix.nih.gov Supplementary information: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf.

253 citations
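
The abstract above attributes WM's speed to a few linear-time scans of the genome, with no repeat library. The following is a much simplified sketch of that general idea: one pass counts k-mers, and a second pass soft-masks (lowercases) positions covered by frequent k-mers. The function name, the parameters `k` and `min_count`, and the fixed-count threshold are assumptions for illustration and are not WindowMasker's actual scoring or thresholds.

```python
from collections import Counter


def mask_frequent_kmers(genome: str, k: int = 4, min_count: int = 3) -> str:
    """Soft-mask every position covered by a k-mer seen at least min_count times."""
    genome = genome.upper()
    last_start = len(genome) - k + 1
    # Scan 1: count every overlapping k-mer in the genome.
    counts = Counter(genome[i:i + k] for i in range(last_start))
    masked = list(genome)
    # Scan 2: lowercase the bases under each frequent k-mer.
    for i in range(last_start):
        if counts[genome[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = genome[j].lower()
    return "".join(masked)
```

For example, `mask_frequent_kmers("ACACACACGTTGCA")` returns `"acacacacGTTGCA"`, because `ACAC` occurs three times and its occurrences cover the first eight bases.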


Journal ArticleDOI
01 Jul 2006-Blood
TL;DR: The clinical and molecular phenotype of human AP-3 deficiency is extended and further insights into the role of the AP-3 complex for the innate immune system are provided.

119 citations


Journal ArticleDOI
01 Dec 2006-Genomics
TL;DR: In this article, the authors performed linkage mapping with microsatellites in a large multigeneration pedigree of domestic cats and detected tight linkage for dilute on cat chromosome C1 (θ=0.08, LOD=10.81).

95 citations


Journal ArticleDOI
TL;DR: It is concluded that high parity among men and later menopause among women may be markers for increased life span, and understanding the biological and/or social factors mediating these relationships may provide insights into mechanisms underlying successful aging.
Abstract: Background. The relationship between parity and life span is uncertain, with evidence of both positive and negative relationships being reported previously. We evaluated this issue by using genealogical data from an Old Order Amish community in Lancaster, Pennsylvania, a population characterized by large nuclear families, homogeneous lifestyle, and extensive genealogical records. Methods. The analysis was restricted to the set of 2015 individuals who had children, were born between 1749 and 1912, and survived until at least age 50 years. Pedigree structures and birth and death dates were extracted from Amish genealogies, and the relationship between parity and longevity was examined using a variance component framework. Results. Life span of fathers increased in linear fashion with increasing number of children (0.23 years per additional child; p = .01), while life span of mothers increased linearly up to 14 children (0.32 years per additional child; p = .004) but decreased with each additional child beyond 14 (p = .0004). Among women, but not men, a later age at last birth was associated with longer life span (p = .001). Adjusting for age at last birth obliterated the correlation between maternal life span and number of children, except among mothers with ultrahigh (>14 children) parity. Conclusions. We conclude that high parity among men and later menopause among women may be markers for increased life span. Understanding the biological and/or social factors mediating these relationships may provide insights into mechanisms underlying successful aging.

90 citations


Journal ArticleDOI
TL;DR: Healthy individuals homozygous for caspase-10 V410I were found, challenging the earlier suggestion that homozygosity for V410I alone causes ALPS, and an association analysis suggested protection from severe disease by V410I in 63 families with ALPS Ia due to dominant Fas mutations (P<0.05).
Abstract: Autoimmune lymphoproliferative syndrome (ALPS) is characterized by lymphadenopathy, elevated numbers of T cells with αβ-T cell receptors but neither CD4 nor CD8 co-receptors, and impaired lymphocyte apoptosis in vitro. Defects in the Fas receptor are the most common cause of ALPS (ALPS Ia), but in rare cases other apoptosis proteins have been implicated, including caspase-10 (ALPS II). We investigated the role of variants of caspase-10 in ALPS. Of 32 unrelated probands with ALPS who did not have Fas defects, two were heterozygous for the caspase-10 missense mutation I406L. Like the previously reported ALPS II-associated mutation L285F, I406L impaired apoptosis when transfected alone and dominantly inhibited apoptosis mediated by wild type caspase-10 in a co-transfection assay. Other variants in caspase-10, V410I and Y446C, were found in 3.4 and 1.6% of chromosomes in Caucasians, and in 0.5 and <0.5% of African Americans, respectively. In contrast to L285F and I406L, these variants had no dominant negative effect in co-transfection assays into the H9 lymphocytic cell line. We found healthy individuals homozygous for V410I, challenging the earlier suggestion that homozygosity for V410I alone causes ALPS. Moreover, an association analysis suggested protection from severe disease by caspase-10 V410I in 63 families with ALPS Ia due to dominant Fas mutations (P<0.05). Thus, different genetic variations in caspase-10 can produce contrasting phenotypic effects.

61 citations


Journal ArticleDOI
TL;DR: A version of the BLAST protein database search program, modified to employ this new measure of sequence similarity, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
Abstract: Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.

61 citations


Journal ArticleDOI
TL;DR: A collection of 32 families with at least one CVID case and a second case of either CVID or IgAD has a peak multipoint LOD score under heterogeneity of 0.96, with an estimated proportion of linked families (alpha) of 0.32, supporting the existence of a disease-causing gene for autosomal-dominant CVID/IgAD on chromosome 4q.
Abstract: The phenotype of common variable immunodeficiency (CVID) is characterized by recurrent infections owing to hypogammaglobulinemia, with deficiency in immunoglobulin (Ig)G and at least one of IgA or IgM. Family studies have shown a genetic association between CVID and selective IgA deficiency (IgAD), the latter being a milder disorder compatible with normal health. Approximately 20-25% of CVID cases are familial, if one includes families with at least one case of CVID and one of IgAD. Nijenhuis et al described a five-generation family with six cases of CVID, five cases of IgAD, and three cases of dysgammaglobulinemia. We conducted a genome-wide scan on this family seeking genetic linkage. One interval on chromosome 4q gives a peak multipoint LOD score of 2.70 using a strict model that treats only the CVID patients and one obligate carrier with dysgammaglobulinemia as affected. Extending the definition of likely affected to include IgAD boosts the peak multipoint LOD score to 3.38. The linkage interval spans at least from D4S2361 to D4S1572. We extended our study to a collection of 32 families with at least one CVID case and a second case of either CVID or IgAD. We used the same dominant penetrance model and genotyped and analyzed nine markers on 4q. The 32 families have a peak multipoint LOD score under heterogeneity of 0.96 between markers D4S423 and D4S1572 within the suggested linkage interval of the first family, and an estimated proportion of linked families (alpha) of 0.32, supporting the existence of a disease-causing gene for autosomal-dominant CVID/IgAD on chromosome 4q.

52 citations
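
The abstract above reports a LOD score under heterogeneity together with an estimated proportion of linked families (alpha). The following is a minimal sketch of the standard admixture form of that statistic, assuming per-family LOD scores at a fixed map position are already available. The function names, the grid search, and any input values are hypothetical, and the abstract does not say which software or exact procedure the authors used.

```python
import math


def hlod(family_lods, alpha):
    """Heterogeneity LOD: sum over families of log10(alpha * 10**Z_i + 1 - alpha).

    family_lods holds each family's LOD score Z_i at one position;
    alpha is the assumed proportion of linked families, between 0 and 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return sum(math.log10(alpha * 10 ** z + (1 - alpha)) for z in family_lods)


def best_alpha(family_lods, steps=100):
    """Grid-search alpha in [0, 1] and return (alpha, hlod) at the maximum."""
    grid = [i / steps for i in range(steps + 1)]
    return max(((a, hlod(family_lods, a)) for a in grid), key=lambda pair: pair[1])
```

With alpha = 1 the statistic reduces to the ordinary summed LOD score, and with alpha = 0 it is zero, so the maximized value is never negative.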


Journal ArticleDOI
TL;DR: A novel SMA gene candidate, LIX1, is identified in an approximately 140-kb deletion on feline chromosome A1q in a region of conserved synteny to human chromosome 5q15; the predicted secondary structure of LIX1 is compatible with a role in RNA metabolism.
Abstract: The leading genetic cause of infant mortality is spinal muscular atrophy (SMA), a clinically and genetically heterogeneous group of disorders. Previously we described a domestic cat model of autosomal recessive, juvenile-onset SMA similar to human SMA type III. Here we report results of a whole-genome scan for linkage in the feline SMA pedigree using recently developed species-specific and comparative mapping resources. We identified a novel SMA gene candidate, LIX1, in an approximately 140-kb deletion on feline chromosome A1q in a region of conserved synteny to human chromosome 5q15. Though LIX1 function is unknown, the predicted secondary structure is compatible with a role in RNA metabolism. LIX1 expression is largely restricted to the central nervous system, primarily in spinal motor neurons, thus offering an explanation of the tissue restriction of pathology in feline SMA. An exon sequence screen of 25 human SMA cases, not otherwise explicable by mutations at the SMN1 locus, failed to identify comparable LIX1 mutations. Nonetheless, a LIX1-associated etiology in feline SMA implicates a previously undetected mechanism of motor neuron maintenance and mandates consideration of LIX1 as a candidate gene in human SMA when SMN1 mutations are not found.

48 citations


Journal ArticleDOI
TL;DR: Evidence of a CVID locus on chromosome 16q with autosomal dominant inheritance is presented and the peak (model-based) LOD score for the best marker D16S518 is 2.83, and the NPL score using the same markers peaks at the same location with a value of 3.38.
Abstract: Common variable immunodeficiency (CVID) is an antibody deficiency syndrome that often co-occurs in families with selective IgA deficiency (IgAD). Vorechovský et al. (Am J Hum Genet 64:1096-1109, 1999; J Immunol 164:4408-4416, 2000) ascertained and genotyped 101 multiplex IgAD families and used them to identify and fine map the IGAD1 locus on chromosome 6p. We analyzed the original genotype data in a subset of families with at least one case of CVID and present evidence of a CVID locus on chromosome 16q with autosomal dominant inheritance. The peak (model-based) LOD score for the best marker D16S518 is 2.83 at theta=0.07, and a 4-marker LOD score under heterogeneity peaks at 3.00 with alpha=0.68. The (model-free) NPL score using the same markers peaks at the same location with a value of 3.38 (P=0.0001).
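
The abstract above quotes a model-based LOD score at a given recombination fraction theta. The following is a minimal sketch of the textbook two-point LOD score in the simplest setting, assuming phase-known, fully informative meioses. Real pedigree analyses such as the one summarized above use more general likelihood computations; the function name and any counts passed to it are hypothetical.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for k recombinants among n informative meioses.

    L(theta) = theta**k * (1 - theta)**(n - k); theta = 0.5 means no linkage.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    k, n = recombinants, meioses
    return (
        k * math.log10(theta)
        + (n - k) * math.log10(1 - theta)
        - n * math.log10(0.5)
    )
```

For instance, one recombinant in ten meioses evaluated at theta = 0.1 gives a LOD score of about 1.60, and any data evaluated at theta = 0.5 give a LOD score of 0.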

Journal ArticleDOI
TL;DR: The assignment of 140 new markers to the equine radiation hybrid (RH) map, and the anchoring of 24 of these markers to horse chromosomes by FISH, are described; the updated maps have a three-fold increase in the number of mapped markers compared to previous maps of these chromosomes.
Abstract: A comparative approach that utilizes information from more densely mapped or sequenced genomes is a proven and efficient means to increase our knowledge of the structure of the horse genome. Human chromosome 2 (HSA2), the second largest human chromosome, comprising 243 Mb, and containing 1246 known genes, corresponds to all or parts of three equine chromosomes. This report describes the assignment of 140 new markers (78 genes and 62 microsatellites) to the equine radiation hybrid (RH) map, and the anchoring of 24 of these markers to horse chromosomes by FISH. The updated equine RH maps for ECA6p, ECA15, and ECA18 resulting from this work have one, two, and three RH linkage groups, respectively, per chromosome/chromosome-arm. These maps have a three-fold increase in the number of mapped markers compared to previous maps of these chromosomes, and an increase in the average marker density to one marker per 1.3 Mb. Comparative maps of ECA6p, ECA15, and ECA18 with human, chimpanzee, dog, mouse, rat, and chicken genomes reveal blocks of conserved synteny across mammals and vertebrates.

Journal ArticleDOI
TL;DR: The B44 allele may exert a protective role in ALPS, and among the healthier, mutation-bearing individuals, transmission of HLA B44 was significantly overrepresented (nominal P<0.0074) as compared to transmission in patients with severe clinical features of ALPS.